CN101963897B - Apparatus and method for dual data path processing

Info

Publication number: CN101963897B
Application number: CN201010276291.9A
Authority: CN
Inventors: S. Knowles
Original assignee: Nvidia Technology UK Ltd
Current assignee: Icera LLC
Priority date: 2004-03-31
Filing date: 2005-03-22
Publication date: 2014-03-12
Anticipated expiration: 2025-03-22
Also published as: CA2560093A1; CN101963897A; US20050223197A1; TWI362617B; JP2007531135A; CN1989485A; US8484441B2; WO2005096142A3; CN1989485B; JP5382635B2; WO2005096142A2; EP1735699A2; EP1735699B1; TW200540713A; KR20070037568A

Abstract

A computer processor having control and data processing capabilities includes a decode unit for decoding instructions. The data processing facility includes a first data execution path and a second data execution path. The first data execution path includes fixed operators, and the second data execution path includes at least configurable operators having a plurality of predefined configurations, at least some of which are selectable via an opcode portion of a data processing instruction. The decode unit is operable to detect whether a data processing instruction defines a fixed data processing operation or a configurable data processing operation. It causes the computer system to supply data for processing to the first data execution path when a fixed data processing instruction is detected, and to the configurable data execution path when a configurable data processing instruction is detected.

Description

Apparatus and method for dual data path processing

This application is a divisional of the Chinese patent application of the same title, filed on March 22, 2005 under application number 200580010665.X (PCT/GB2005/001073).

Technical field

The invention relates to a computer processor, a method of operating the computer processor, and a computer program product comprising an instruction set for a computer.

Background

To increase the speed of computer processors, prior art architectures have used dual execution paths for executing instructions. Dual execution path processors may operate according to the single instruction multiple data (SIMD) principle, exploiting parallelism of operations to increase processor speed.

However, despite the use of dual execution paths and SIMD processing, there is a continuing need to increase processor speed. A typical dual execution path processor uses two substantially similar paths, so each path handles both control code and datapath code. Although known processors support a combination of 32-bit standard encoding and 16-bit "dense" encoding, this scheme suffers from a number of drawbacks, including the lack of semantic content in the few bits available in a 16-bit format.

Furthermore, conventional general-purpose digital signal processors are not well matched to application-specific algorithms for many purposes, including specialized operations such as convolution, fast Fourier transforms, Trellis/Viterbi encoding, correlation, finite impulse response filtering, and others.

Summary of the invention

In one embodiment according to the invention, a computer processor having control and data processing capabilities is provided. The computer processor comprises: a decode unit for decoding instructions; and a data processing facility comprising a first data execution path and a second data execution path, the first data execution path including fixed operators and the second data execution path including at least configurable operators, the configurable operators having a plurality of predefined configurations, at least some of which are selectable via an opcode portion of a data processing instruction; wherein the decode unit is operable to detect whether a data processing instruction defines a fixed data processing operation or a configurable data processing operation, the decode unit causing the computer system to supply data for processing to the first data execution path when a fixed data processing instruction is detected, and to the configurable data execution path when a configurable data processing instruction is detected.

In a further related embodiment, the decode unit is capable of decoding a stream of instruction packets from memory, each packet comprising a plurality of instructions. The decode unit may also be operable to detect whether an instruction packet contains a data processing instruction. The configurable operators may be configurable at the level of multi-bit values, including multi-bit values comprising four or more bits, or at the level of words. A plurality of fixed operators of the first data execution path may be arranged to perform multiple fixed operations in independent lanes according to the single instruction multiple data principle. Likewise, a plurality of configurable operators of the second data execution path may be arranged to perform multiple operations in different lanes according to the single instruction multiple data principle.

In a further related embodiment, the configurable operators of the second execution path may be arranged to receive configuration information that determines the nature of the operations performed. This information may be received from a field of the instruction defining the configurable data processing operation. The configurable operators of the second execution path may also be arranged to receive configuration information that includes information controlling interconnectivity. The computer processor may further include a control map associated with the configurable operators of the second data execution path, the control map being operable to receive at least one configuration bit from a configurable data processing instruction and, in response, to provide configuration information to the configurable operators. The configuration information may determine the nature of the operations performed by the configurable operators, and may control the interconnectivity between two or more of the configurable operators.

In a further related embodiment, the configurable operators of the second execution path may be arranged to receive configuration information determining the nature of the operations to be performed, or configuration information controlling interconnectivity, from a source other than the configurable data processing instruction. At least one configurable operator of the second data execution path may be capable of executing data processing instructions with an execution depth greater than two computations before returning a result to a result store. The computer processor may comprise switching means for receiving data processing operands from a configurable data processing instruction and switching them as appropriate for supply to one or more of the configurable operators. The computer processor may also comprise switching means for receiving results from one or more of the configurable operators and switching the results as appropriate for supply to one or more of a result store and a feedback loop. The computer processor may also include a plurality of control maps for mapping configuration bits received from configurable data processing instructions to configuration information for supply to the configurable operators of the second data execution path.

Likewise, the computer processor may comprise switching means for receiving configuration information from a control map and switching it as appropriate for supply to the configurable operators of the second data execution path. The computer processor may also include configurable operators selected from one or more of: multiply-accumulate operators; arithmetic operators; state operators; and cross-lane permuters. Likewise, the computer processor may include operators and an instruction set capable of performing one or more operations selected from: fast Fourier transform; inverse fast Fourier transform; Viterbi encoding/decoding; Turbo encoding/decoding; finite impulse response calculation; and any other correlation or convolution.

In another embodiment according to the invention, there is provided a method of operating a computer processor having control and data processing capabilities, the computer processor comprising a first data execution path and a second data execution path, the first data execution path including fixed operators and the second data execution path including configurable operators, the configurable operators having a plurality of predefined configurations, at least some of which are selectable via an opcode portion of a data processing instruction. The method comprises: decoding a plurality of instructions to detect whether at least one data processing instruction of the plurality defines a fixed data processing operation or a configurable data processing operation; causing the computer processor to supply data for processing to the first data execution path when a fixed data processing instruction is detected, and to the configurable data execution path when a configurable data processing instruction is detected; and outputting a result.

In another embodiment according to the invention, there is provided a computer program product comprising program code means for causing a computer processor to carry out the steps set out below, the computer processor comprising a first data execution path and a second data execution path, the first data execution path including fixed operators and the second data execution path including configurable operators, the configurable operators having a plurality of predefined configurations, at least some of which are selectable via an opcode portion of a data processing instruction. The steps are: decoding a plurality of instructions to detect whether at least one data processing instruction of the plurality defines a fixed data processing operation or a configurable data processing operation; causing the computer processor to supply data for processing to the first data execution path when a fixed data processing instruction is detected, and to the configurable data execution path when a configurable data processing instruction is detected; and outputting a result.

In another embodiment according to the invention, there is provided a data processing instruction set comprising a first plurality of instructions and a second plurality of instructions, the first plurality of instructions having a field indicating a fixed type of data processing operation, and the second plurality of instructions having a field indicating a configurable type of data processing operation.

In another embodiment according to the invention, there is provided a computer processor having a data execution path comprising configurable operators, wherein the configurable operators comprise a plurality of predefined groups of operator configurations, each group comprising operators from a separate operator class. The operator classes may include classes selected from one or more of: multiply-accumulate operators; arithmetic operators; state operators; and permuters. Connections between operators selected from within each predefined group of operator configurations may be configurable by an opcode portion of an instruction executed by the computer processor. Likewise, connections between operators selected from more than one predefined group of operator configurations may be configurable by an opcode portion of an instruction executed by the computer processor.

The invention provides a computer processor comprising: a decode unit for decoding instruction packets from memory, each instruction packet comprising a plurality of instructions; and a processing channel comprising a plurality of functional units and operable to perform control processing operations; wherein the decode unit is operable to receive an instruction packet having a length of 64 bits and to use identification bits in the instruction packet to detect whether the instruction packet defines three control instructions, each having a length of 21 bits; and wherein, when the decode unit detects that the instruction packet contains three such control instructions, the control instructions are supplied to the processing channel for execution in the order in which they appear in the instruction packet.

The invention also provides a method of operating a computer processor that comprises a processing channel having a plurality of functional units and capable of performing control processing operations, the method comprising: (a) receiving a sequence of instruction packets from memory, each instruction packet comprising a plurality of instructions defining operations; and (b) decoding each instruction packet in turn by using identification bits in the instruction packet to determine whether the instruction packet defines three control instructions, each having a length of 21 bits, wherein, when the decode unit detects that the instruction packet contains three such control instructions, the control instructions are supplied to the processing channel for execution in the order in which they appear in the instruction packet.

Additional advantages and novel features of the invention will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following description and the accompanying drawings, or may be learned by practice of the invention.

Brief description of the drawings

For a better understanding of the invention, and to show how it may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings, in which:

Figure 1 is a block diagram of an asymmetric dual execution path computer processor according to an embodiment of the invention;

Figure 2 shows exemplary classes of instructions for the processor of Figure 1, according to an embodiment of the invention; and

Figure 3 is a schematic diagram showing components of a configurable deep execution unit according to an embodiment of the invention.

Detailed description

Figure 1 is a block diagram of an asymmetric dual-path computer processor according to an embodiment of the invention. The processor of Figure 1 divides the processing of a single instruction stream 100 between two distinct hardware execution paths: a control execution path 102 for processing control code, and a data execution path 103 for processing data code. The data widths, operators, and other characteristics of the two execution paths 102, 103 differ according to the differing characteristics of control code and datapath code. Typically, control code favors fewer, narrower registers, is difficult to parallelize, is typically (but not exclusively) written in C or another high-level language, and its code density is generally more important than its speed. By contrast, datapath code typically favors a large file of wide registers, is highly parallelizable, is written in assembly language, and its performance is more important than its code density. In the processor of Figure 1, the two execution paths 102 and 103 are dedicated to handling the two different types of code. Each side has its own architectural register file (control register file 104 and data register file 105), and the two files differ in register width and number: the control registers are narrower (32 bits in one example), while the data registers are wider (64 bits in one example). The processor is therefore asymmetric, in that its two execution paths perform different specialized functions and have different bit widths.

In the processor of Figure 1, the instruction stream 100 consists of a sequence of instruction packets. Each instruction packet supplied is decoded by an instruction decode unit 101, which separates control instructions from data instructions, as described further below. The control execution path 102 handles control-flow operations for the instruction stream and manages the machine's state registers, using a branch unit 106, an execution unit 107, and a load/store unit 108, which in this embodiment is shared with the data execution path 103. Only the control side of the processor need be visible to a compiler, such as a compiler for C, C++, Java, or another high-level language. Within the control side, the branch unit 106 and the execution unit 107 operate in accordance with conventional processor design known to those of ordinary skill in the art.

The data execution path 103 employs SIMD (single instruction multiple data) parallelism in both a fixed execution unit 109 and a configurable deep execution unit 110. As described further below, the configurable deep execution unit 110 provides depth of processing, in addition to the width used by conventional SIMD processors, in order to increase the work done per instruction.

If a decoded instruction defines a control instruction, it is applied to the appropriate functional unit on the control execution path of the machine (for example branch unit 106, execution unit 107, or load/store unit 108). If the decoded instruction defines an instruction with a fixed or configurable data processing operation, it is supplied to the data processing execution path. Within the data instruction portion of an instruction packet, designated bits indicate whether the instruction is a fixed or a configurable data processing instruction and, in the case of a configurable instruction, further designated bits define configuration information. Depending on the sub-type of the decoded data processing instruction, data is supplied to either the fixed or the configurable execution sub-path of the machine's data processing path.
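The routing decision described in this paragraph can be pictured with a minimal Python sketch. It is only an illustration: the `Path` enum, the `route` function, and its parameters are hypothetical names, and the text does not say which bit positions carry the fixed/configurable indication or the configuration information.

```python
from enum import Enum


class Path(Enum):
    """Destinations named after the units in Figure 1 (labels are illustrative)."""
    CONTROL = "control execution path 102"
    DATA_FIXED = "fixed execution unit 109"
    DATA_CONFIGURABLE = "configurable deep execution unit 110"


def route(is_control, is_configurable=False, config_bits=0):
    """Return (destination, configuration bits or None) for one decoded instruction.

    The two flags stand in for the designated bits mentioned in the text;
    their real encoding is not given, so they are modeled as booleans.
    """
    if is_control:
        return Path.CONTROL, None
    if is_configurable:
        # Only configurable data instructions carry configuration information.
        return Path.DATA_CONFIGURABLE, config_bits
    return Path.DATA_FIXED, None
```

For instance, `route(False, True, 0b101)` selects the configurable sub-path and forwards the configuration bits, while `route(False)` selects the fixed sub-path.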

Here, "configurable" means the ability to select an operator configuration from a plurality of predefined ("pseudo-static") operator configurations. A pseudo-static configuration of an operator is effective to cause the operator (i) to perform a particular type of operation, (ii) to be interconnected with associated elements in a particular manner, or (iii) a combination of (i) and (ii). In practice, a selected pseudo-static configuration may determine the behavior and interconnectivity of many operator elements at once. It can also control switching configurations associated with the data path. In a preferred embodiment, at least some of the plurality of pseudo-static operator configurations are selectable by an opcode portion of a data processing instruction, as described further below. Also in accordance with embodiments herein, a "configurable instruction" allows customized operations to be performed at the level of multi-bit values, for example multi-bit values of four or more bits, or at the level of words.
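One way to picture the selection of a pseudo-static configuration is a lookup table indexed by configuration bits. The sketch below assumes a two-bit field and three made-up configurations. The names `CONTROL_MAP` and `select_configuration`, the field width, and the table entries are all hypothetical and are not taken from the patent.

```python
# Hypothetical table of predefined ("pseudo-static") configurations.
# Each entry fixes both the operation type and the interconnect, which
# corresponds to case (iii) in the text; the entries are invented.
CONTROL_MAP = {
    0b00: {"operation": "multiply_accumulate", "interconnect": "lane_local"},
    0b01: {"operation": "add", "interconnect": "lane_local"},
    0b10: {"operation": "add", "interconnect": "cross_lane"},
}


def select_configuration(config_bits):
    """Map opcode configuration bits to one predefined operator configuration."""
    try:
        return CONTROL_MAP[config_bits]
    except KeyError:
        # Only predefined configurations can be selected; nothing is
        # synthesized on the fly.
        raise ValueError(f"no predefined configuration for bits {config_bits:#b}") from None
```

The point of the table is that the instruction only selects among configurations fixed in advance, which is what distinguishes this from arbitrary run-time reconfiguration.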

It should be noted that control and data processing instructions, each executed on its own side of the machine, may both define memory accesses (load/store) and basic arithmetic operations. The inputs/operands for control operations may be supplied to and from the control register file 104, while the data/operands for data processing operations are supplied to and from the data register file 105.

According to embodiments of the invention, at least one input of each data processing operation may be a vector. In this respect, the configurable operators and/or switching circuitry of the configurable data path can be considered configurable to perform vector operations by virtue of the nature of the operations performed and/or the interconnectivity between them. For example, a 64-bit vector input to a data processing operation may comprise four 16-bit scalar operands. Here, a "vector" is an assembly of scalar operands. Vector arithmetic may be performed on a plurality of scalar operands, and may include steering, movement, and permutation of scalar elements. Not all operands of a vector operation need be vectors; for example, a vector operation may take a scalar and at least one vector as inputs, and produce either a scalar or a vector as its result.
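The 64-bit vector of four 16-bit scalars mentioned in this paragraph can be modeled in a few lines of Python. This is a sketch under stated assumptions: lane 0 is placed at the least significant end and lane arithmetic wraps modulo 2^16, neither of which is specified in the text, and all function names are illustrative.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def unpack(vec64):
    """Split a 64-bit value into four 16-bit lanes (lane 0 = least significant)."""
    return [(vec64 >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def pack(lanes):
    """Assemble four scalars into one 64-bit vector, truncating each to 16 bits."""
    vec = 0
    for i, value in enumerate(lanes):
        vec |= (value & LANE_MASK) << (LANE_BITS * i)
    return vec


def vector_add(a, b):
    """Lane-wise add of two vectors; each lane wraps independently."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


def vector_add_scalar(a, scalar):
    """Vector operation with one vector input and one scalar input."""
    return pack([x + scalar for x in unpack(a)])
```

`vector_add_scalar` illustrates the remark that not every operand of a vector operation has to be a vector.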

Here, "control instructions" include instructions dedicated to program flow, branching, and address generation, but not data processing. "Data processing instructions" include instructions for logical operations, or for arithmetic operations in which at least one input is a vector. Data processing instructions may operate on multiple data instructions, for example in SIMD processing or in processing wide, short vectors of data elements. The essential functions of control instructions and data processing instructions described above do not overlap; what they have in common is that both types of code have logical and scalar arithmetic capability.

Figure 2 shows three types of instruction packet for the processor of Figure 1. Each type of instruction packet is 64 bits long. Instruction packet 211 is a 3-scalar type for dense control code and comprises three 21-bit control instructions (c21). Instruction packets 212 and 213 are LIW (long instruction word) types for parallel execution of datapath code. In this example, each of instruction packets 212 and 213 comprises two instructions, but a different number may be included if desired. Instruction packet 212 comprises a 34-bit data instruction (d34) and a 28-bit memory instruction (m28), and is used to execute data-side arithmetic (the d34 instruction) in parallel with a data-side load/store operation (the m28 instruction). Memory-class instructions (m28) can read from, or write to, either the control side or the data side of the processor, using an address from the control side. Instruction packet 213 comprises a 34-bit data instruction (d34) and a 21-bit control instruction (c21), and is used to execute data-side arithmetic (the d34 instruction) in parallel with a control-side operation (the c21 instruction), such as control-side arithmetic, a branch, or a load/store operation.
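As a quick check of the bit budget, the following Python sketch adds up the field widths given for each packet type together with the lengths of the identification prefixes that the description associates with Figure 2 ("1", "01", "00"). The dictionary layout and names are illustrative only.

```python
PACKET_BITS = 64

# Field widths are taken from the text; id_bits is the length of the
# identification prefix described for each packet type.
PACKET_TYPES = {
    "211": {"id_bits": 1, "fields": [21, 21, 21]},  # three c21 instructions
    "212": {"id_bits": 2, "fields": [34, 28]},      # d34 + m28
    "213": {"id_bits": 2, "fields": [34, 21]},      # d34 + c21
}


def unaccounted_bits(packet_type):
    """Bits of the 64-bit packet not covered by the identifier or the listed fields."""
    spec = PACKET_TYPES[packet_type]
    return PACKET_BITS - spec["id_bits"] - sum(spec["fields"])
```

Under these figures, types 211 and 212 fill the packet exactly (1 + 63 and 2 + 62 bits), whereas type 213 leaves 7 bits whose use is not described in this part of the text.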

The instruction decode unit 101 of the embodiment of Figure 1 uses the initial identification bits of each instruction packet, or some other designated identification bits at predetermined bit positions, to determine which type of packet is being decoded. For example, as shown in Figure 2, an initial bit "1" indicates that the instruction packet is of the scalar control instruction type, with three control instructions, while initial bits "01" and "00" indicate instruction packets of types 212 and 213, with a data and a memory instruction in packet 212 or a data and a control instruction in packet 213. Having decoded the initial bits of each instruction packet, the decode unit 101 of Figure 1 passes the instructions of each packet to the control execution path 102 or the data execution path 103 as appropriate, according to the type of instruction packet.
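A minimal Python sketch of this packet classification follows. It makes two assumptions that the text leaves open: the "initial" bits are the most significant bits of the 64-bit word, and "01" maps to type 212 while "00" maps to type 213. The function name `classify_packet` is hypothetical.

```python
def classify_packet(packet):
    """Return the packet type ("211", "212" or "213") from its leading bits.

    Assumes the identification bits sit at the most significant end of
    the 64-bit word; the text also allows other predetermined positions.
    """
    if not 0 <= packet < (1 << 64):
        raise ValueError("packet must be a 64-bit unsigned value")
    if packet >> 63 == 1:
        return "211"  # leading "1": three 21-bit control instructions
    if packet >> 62 == 0b01:
        return "212"  # leading "01": data + memory instruction (assumed mapping)
    return "213"      # leading "00": data + control instruction (assumed mapping)
```

Because "1", "01", and "00" form a prefix code, the decoder can tell the three types apart by inspecting at most two bits.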

To execute the instruction packets of Figure 2, the instruction decode unit 101 of the processor of the embodiment of Figure 1 fetches packets sequentially from memory, and the packets are executed sequentially. Within an instruction packet, the instructions of packet 211 are executed sequentially: the 21-bit control instruction at the least significant end of the 64-bit word is executed first, then the next 21-bit control instruction, and then the 21-bit control instruction at the most significant end. Within instruction packets 212 and 213, the instructions may be executed simultaneously (although this need not be the case in embodiments according to the invention). Thus, in program order for the processor of the embodiment of Figure 1, packets are executed sequentially, but the instructions within a packet may be executed either sequentially (for packet type 211) or simultaneously (for packet types 212 and 213). In the following, instruction packets of types 212 and 213 are abbreviated as MD and CD packets respectively (containing, respectively, one memory and one data instruction, and one control and one data instruction).
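The execution order within a type 211 packet can be sketched as follows. The code assumes that the three 21-bit fields are packed contiguously from bit 0 upward, leaving bit 63 for the identification bit. That layout is an assumption consistent with the widths given, not something the text states, and the function name is illustrative.

```python
C21_BITS = 21
C21_MASK = (1 << C21_BITS) - 1


def control_instructions_in_order(packet):
    """Return the three c21 fields of a type 211 packet in execution order.

    The field at the least significant end comes first, matching the
    order described in the text; the exact bit layout is assumed.
    """
    return [(packet >> (C21_BITS * i)) & C21_MASK for i in range(3)]


# Build an example packet: identification bit set, fields 1, 2, 3 from the low end.
example = (1 << 63) | (3 << 42) | (2 << 21) | 1
```

Calling `control_instructions_in_order(example)` yields the fields in the order 1, 2, 3, that is, from the least significant end toward the most significant end.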

By using 21-bit control instructions, the embodiment of Figure 1 overcomes a number of drawbacks found in processors with instructions of other lengths, and in particular in processors that support a combination of 32-bit standard encoding for data instructions and 16-bit "dense" encoding for control code. In such dual 16/32-bit processors, redundancy arises from the use of dual encodings for each instruction, or from the use of two separate decoders with means for switching between encoding schemes by branching, fetch address, or other means. According to embodiments of the invention, this redundancy is removed by using a single 21-bit length for all control instructions. Furthermore, the use of 21-bit control instructions removes drawbacks arising from the insufficient semantic content of 16-bit "dense" encoding schemes. Because of that insufficient semantic content, processors using 16-bit schemes typically require some mix of design compromises, such as: use of two-operand destructive operations, with corresponding code bloat for copies; use of windowed access to a subset of the register file, with code bloat for spill/fill or window pointer manipulation; or frequent reversion to the 32-bit format, because not all operations can be expressed in the few opcode bits available in the 16-bit format. In embodiments of the invention, these drawbacks are alleviated by the use of 21-bit control instructions.

According to embodiments of the invention, a large number of instructions may be used. For example, instruction signatures may be any of the following, where C-format, M-format, and D-format denote control, memory-access, and data formats respectively:

Instruction signature     Parameters                                                                  Used by
instr                     Instruction has no parameters                                               C-format only
instr dst                 Instruction has a single destination parameter                              C-format only
instr src0                Instruction has a single source parameter                                   C- or D-format only
instr dst, src0           Instruction has a single destination and a single source parameter          D- and M-format instructions
instr dst, src0, src1     Instruction has a single destination parameter and two source parameters    C-, D- and M-format instructions

Likewise, according to one embodiment of the invention, C-format instructions all provide SISD (single instruction, single data) operation, while M-format and D-format instructions provide either SISD or SIMD operation. For example, control instructions may provide general arithmetic, comparison, and logical instructions; control-flow instructions; memory load and store instructions; and others. Data instructions may provide general arithmetic, shift, logical, and comparison instructions; shuffle, sort, byte-extend, and permute instructions; linear feedback shift register instructions; and, via the configurable deep execution unit 110 (described below), user-defined instructions. Memory instructions may provide memory loads and stores; copying of selected data registers to control registers; copying of broadcast control registers to data registers; and immediate-to-register instructions.

According to one embodiment of the invention, the processor of FIG. 1 features a first, fixed data execution path and a second, configurable data execution path. The first data path has a fixed SIMD execution unit split into lanes in a manner similar to conventional SIMD processing designs. The second data path has a configurable deep execution unit 110. "Deep execution" refers to the ability of a processor to perform multiple consecutive operations on the data provided by a single issued instruction before returning a result to the register file. One example of deep execution is the conventional MAC operation (multiply and accumulate), which performs two operations (a multiply and an add) on the data from a single instruction and therefore has a depth of order two. Deep execution may also be characterized by the number of operand inputs being equal to the number of result outputs; or, equivalently, valency-in equals valency-out. Thus, for example, a conventional two-operand addition with one result is not a preferred example of deep execution, because the number of operands does not equal the number of results; whereas convolution, fast Fourier transforms, Trellis/Viterbi encoding, correlators, finite impulse response filters, and other signal processing algorithms are examples of deep execution. Dedicated digital signal processing (DSP) algorithms typically perform deep execution at the bit level and in memory-mapped fashion. Conventional register-mapped general-purpose DSPs, however, do not perform deep execution; they execute instructions with a depth of at most order two, as in the MAC operation. By contrast, the processor of FIG. 1 provides a register-mapped general-purpose processor capable of deep execution of dynamically configurable word-level instructions of order greater than two. In the processor of FIG. 1, the nature of a deep execution instruction (the graph of the mathematical functions to be performed) can be adjusted or customized by configuration information in the instruction itself. In the preferred embodiment, the instruction format includes bit positions allocated to configuration information. To provide this capability, the deep execution unit 110 has configurable execution resources, meaning that operator modes, interconnections, and constants can be uploaded to suit each application. Deep execution adds a depth dimension to the parallelism of execution that is orthogonal to the width offered by the earlier concepts of SIMD and LIW processing; it therefore represents an additional dimension for increasing the work-per-instruction of the target processor.
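The notions of depth and valency can be made concrete with a small software model. The sketch below is only an illustration under stated assumptions: the graph representation, the operator names in `OPS`, and the function `run_graph` are hypothetical and do not describe the hardware. It evaluates a chain of operations on the operands of one "instruction" and reports the depth as the longest chain of consecutive operations.

```python
import operator

# Hypothetical operator set for the model; the real unit's operators differ.
OPS = {"mul": operator.mul, "add": operator.add, "sub": operator.sub}


def run_graph(nodes, outputs, operands):
    """Evaluate an operator graph on the operands of a single instruction.

    nodes    : list of (name, op, lhs, rhs) in dependency order; lhs/rhs are
               either names of earlier values or numeric constants
               (standing in for uploaded configuration constants)
    outputs  : names of the values written back as results
    operands : dict mapping operand name -> value
    Returns (results, depth), where depth is the longest operation chain.
    """
    values = dict(operands)
    depth = dict.fromkeys(operands, 0)

    def fetch(ref):
        if isinstance(ref, str):
            return values[ref], depth[ref]
        return ref, 0                      # configured constant, depth 0

    for name, op, lhs, rhs in nodes:
        (a, da), (b, db) = fetch(lhs), fetch(rhs)
        values[name] = OPS[op](a, b)
        depth[name] = 1 + max(da, db)
    return [values[o] for o in outputs], max(depth[o] for o in outputs)


# Conventional MAC: multiply then add, depth of order two.
MAC = [("p", "mul", "x", "y"), ("s", "add", "p", "acc")]

# Butterfly-style graph with a constant offset: two operands in, two results
# out (valency-in equals valency-out), depth three.
BUTTERFLY = [
    ("t", "mul", "b", 3),      # 3 stands in for a configured constant
    ("u", "add", "a", "t"),
    ("v", "sub", "a", "t"),
    ("hi", "add", "u", 1),
    ("lo", "add", "v", 1),
]
```

For example, `run_graph(MAC, ["s"], {"x": 3, "y": 4, "acc": 10})` yields the result list `[22]` with depth 2, while `run_graph(BUTTERFLY, ["hi", "lo"], {"a": 10, "b": 2})` yields `[17, 5]` with depth 3.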

FIG. 3 shows the components of a configurable deep execution unit 310 according to an embodiment of the invention. As shown in FIG. 1, the configurable deep execution unit 110 is part of the data execution path 103 and can therefore be directed by the data-side instructions of the MD and CD instruction packets 212 and 213 of FIG. 2. In FIG. 3, an instruction 314 and operands 315 are supplied to the deep execution unit 310 from the instruction decode unit 101 and the data register file 105 of FIG. 1. A multi-bit configuration code in the decoded instruction 314 is used to access a control map 316, which expands the multi-bit code into a relatively complex set of configuration signals for configuring the operators of the deep execution unit. The control map 316 may, for example, be implemented as a lookup table in which the different possible multi-bit codes of an instruction are mapped to different possible operator configurations of the deep execution unit. Based on the result of the lookup in the control map 316, a crossbar interconnect 317 configures a set of operators 318-321 in whatever arrangement is necessary to carry out the operator configuration indicated by the multi-bit instruction code. The operators may include, for example, a multiply operator 318, an arithmetic logic unit (ALU) operator 319, a state operator 320, or a cross-lane permuter 321. In one embodiment, the deep execution unit contains fifteen operators: one multiply operator 318, eight ALU operators 319, four state operators 320, and two cross-lane permuters 321; other numbers of operators are also possible. The operands 315 supplied to the deep execution unit may be, for example, 16-bit operands; these are supplied to a second crossbar interconnect 322, which can route the operands to the appropriate operators 318-321. The second crossbar interconnect 322 also receives feedback 324 of intermediate results from the operators 318-321, which can in turn be routed by the second crossbar interconnect 322 to the appropriate operators 318-321. A third crossbar interconnect 323 multiplexes the results from the operators 318-321 and outputs a final result 325. Various control schemes may be used to configure the operators. For example, the control map 316 of the FIG. 3 embodiment need not be implemented as a single lookup table, but may be implemented as a series of two or more cascaded lookup tables. An entry in the first lookup table may point from a given multi-bit instruction code to a second lookup table, thereby reducing the amount of storage required in each lookup table for complex operator configurations. For example, the first lookup table may be organized as a library of configuration categories, such that several multi-bit instruction codes are grouped together in the first lookup table, with each group pointing to a subsequent lookup table that provides the specific configuration for each multi-bit code of that group.
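The cascaded lookup described above can be sketched in software as a two-level table. Everything concrete in this sketch is an assumption for illustration: the 4-bit codes, the category names, the route labels, and the function names are hypothetical. Only the operator budget (one multiply, eight ALU, four state, and two cross-lane operators) comes from the embodiment described in the text.

```python
# Operator budget of the embodiment described in the text (15 operators).
BUDGET = {"mul": 1, "alu": 8, "state": 4, "xlane": 2}

# First level: multi-bit code -> (configuration category, index in category).
# ASSUMPTION: codes, categories, and indices are invented for this sketch.
FIRST_LEVEL = {
    0b0000: ("fir", 0),
    0b0001: ("fir", 1),
    0b0010: ("butterfly", 0),
}

# Second level: one table per category, holding the expanded configuration.
SECOND_LEVEL = {
    "fir": [
        {"operators": {"mul": 1, "alu": 1, "state": 1},
         "routes": [("operand0", "mul0"), ("mul0", "alu0"), ("alu0", "result")]},
        {"operators": {"mul": 1, "alu": 3, "state": 2},
         "routes": [("operand0", "mul0"), ("mul0", "alu0"), ("alu2", "result")]},
    ],
    "butterfly": [
        {"operators": {"mul": 1, "alu": 2, "xlane": 1},
         "routes": [("operand1", "mul0"), ("mul0", "alu0"), ("mul0", "alu1")]},
    ],
}


def expand_config(code):
    """Expand a multi-bit configuration code through the cascaded tables."""
    category, index = FIRST_LEVEL[code]      # raises KeyError for unknown codes
    return SECOND_LEVEL[category][index]


def within_budget(config):
    """Check that a configuration uses no more operators than the unit has."""
    used = config["operators"]
    return all(kind in BUDGET and count <= BUDGET[kind]
               for kind, count in used.items())
```

Grouping codes by category keeps the first table small (one short entry per code), while the bulkier configuration records live only in the second-level table of the relevant category.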

In the embodiment of FIG. 3, the operators are preferably pre-configured into various operator classes. In practice, this is achieved by hard-wiring at a strategic level. The advantage of this approach is that fewer predefined configurations need to be stored and the control circuitry can be simpler. For example, operator 318 is pre-configured in the class of multiply operators; operators 319 are pre-configured as ALU operators; operators 320 are pre-configured as state operators; and operators 321 are pre-configured as cross-lane permuters; other pre-configured classes are possible. However, even though the classes of operators are pre-configured, instructions retain run-time flexibility to arrange at least the following, in order to arrive at the final arrangement of the particular configuration that implements a given algorithm: (i) the connectivity of the operators within each class; (ii) the connectivity with operators from other classes; and (iii) the connectivity of any associated switching means.

The skilled person will appreciate that, while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of carrying out the invention, the invention should not be limited to the specific apparatus configurations or method steps disclosed in this description of the preferred embodiments. Those skilled in the art will also recognize that the invention has a broad range of applications and that the embodiments admit of a wide range of different implementations and modifications without departing from the inventive concept. In particular, the exemplary bit widths mentioned herein are not limiting, nor is the arbitrary choice of which bit widths are referred to as half-words, words, longs, and so on.

Claims

1. A computer processor, said processor comprising:

a decoding unit for decoding a stream of instruction packets from a memory, each instruction packet comprising a plurality of instructions;

a control execution path comprising a branch unit, an execution unit, and a load/store unit and operable to execute control processing operations;

wherein the decoding unit is operable to receive an instruction packet having a bit length of 64 bits, and is operable to use an identification bit in the instruction packet to detect whether the instruction packet defines three control instructions each having a bit length of 21 bits, and

wherein, when the decoding unit detects that the instruction packet comprises three such control instructions, the control instructions are supplied to the control execution path for execution in the order in which the three control instructions appear in the instruction packet.

2. A method of operating a computer processor comprising a control execution path that has a branch unit, an execution unit, and a load/store unit and is capable of performing control processing operations, the method comprising:

(a) receiving from memory a sequence of instruction packets, each of the instruction packets including a plurality of instructions defining operations;

(b) sequentially decoding each instruction packet by using identification bits in the instruction packet to determine whether the instruction packet defines three control instructions each having a length of 21 bits, wherein, when the decoding unit detects that the instruction packet comprises three such control instructions, the control instructions are supplied to the control execution path for execution in the order in which the three control instructions appear in the instruction packet.