JPH0850575A - Programmable processor, method for performing digital signal processing using said programmable processor, and improvement thereof

Info

Publication number: JPH0850575A
Application number: JP7109642A
Authority: JP
Inventors: Keith M Bindloss; Kenneth E Garey; George A Watson; John Earle
Original assignee: Rockwell International Corp
Current assignee: Boeing North American Inc
Priority date: 1994-05-05
Filing date: 1995-05-08
Publication date: 1996-02-20
Anticipated expiration: 2022-03-07
Also published as: US5778241A; DE69519449T2; DE69519449D1; JP3889069B2; EP0681236B1; EP0681236A1

Abstract

PURPOSE: To provide a space vector data path that incorporates a SIMD scheme into a general-purpose programmable processor. CONSTITUTION: A programmable processor includes mode means, coupled to instruction means, for specifying for each instruction whether an operand is processed in vector mode or in scalar mode, and a processing unit 110, coupled to the mode means, that receives the operand and processes it in vector or scalar mode in response to the instruction, as specified by the mode means. Vector mode indicates to the processing unit 110 that the operand contains multiple elements; scalar mode indicates to the processing unit 110 that the operand contains a single element.

Description

Detailed Description of the Invention

[0001]

FIELD OF THE INVENTION

The present invention relates to signal processors, and more particularly to digital signal processors with spatial parallel processing capability.

[0002]

BACKGROUND OF THE INVENTION

In recent years, computers with parallel processing schemes such as single instruction, multiple data (SIMD) have steadily gained recognition in computer technology. A SIMD computer can be represented conceptually as in FIG. 1(a), where multiple processing elements (PEs) are supervised by one main sequencer. All PEs receive the same instruction from the main sequencer but operate on different sets of data from separate data streams. As shown in FIG. 1(b), each PE functions as a central processing unit (CPU) with its own local memory. A SIMD computer can therefore achieve spatial parallelism by using multiple synchronized arithmetic logic units, one with each PE's CPU. Once the data resides in each PE, it is relatively easy for an individual PE to handle that data, but distributing data to and communicating among all PEs through the interconnections (not shown) is an extremely complex task. SIMD machines are therefore usually designed with a dedicated application in mind, and the difficulty of programming and vectorization makes them undesirable for general-purpose use.

[0003] On the other hand, current general-purpose computers such as SPARC (registered trademark), PowerPC (registered trademark), and 68000-based machines typically do not make full use of their 32-bit memory space in high-performance graphics processing. For example, although the buses of these machines are 32 bits wide, video and image data is still limited to being processed as 16-bit or 8-bit pixels. These general-purpose machines are nevertheless attractive because of the convenience of programming in a high-level language software environment. It is therefore desirable to strike a balance between the speed advantage of SIMD, as applied to digital signal processing, and the programming convenience of a general-purpose CPU. Even a low-performance SIMD implementation, once incorporated into a general-purpose machine, would then greatly improve overall throughput, as if multiple scalar CPUs were working in parallel. When SIMD is incorporated into a general-purpose machine, however, the increased throughput does not come at the cost of the silicon usage typically associated with the multiple scalar CPUs found in traditional SIMD machines.

[0004] It would therefore be desirable to have a general-purpose processor with SIMD capability for code-intensive applications and speed-intensive computations.

[0005] An object of the present invention is to incorporate a SIMD scheme into a general-purpose CPU architecture to increase throughput.

[0006] It is also an object of the present invention to increase throughput without incurring substantial silicon usage.

[0007] A further object of the invention is to increase throughput in proportion to the number of data elements processed by each instruction, at the same instruction execution rate.

[0008]

SUMMARY OF THE INVENTION

A space vector data path for incorporating a SIMD scheme into a general-purpose programmable processor is disclosed. The programmable processor comprises mode means, coupled to the instruction means, for specifying for each instruction whether the operand is processed in vector mode or in scalar mode, and a processing unit, coupled to the mode means, for receiving the operand and processing it in vector or scalar mode in response to the instruction as specified by the mode means. Vector mode indicates to the processing unit that the operand contains multiple elements; scalar mode indicates to the processing unit that the operand contains one element.

[0009] The present invention also discloses a method of performing digital signal processing through multiple data paths using a general-purpose computer. The general-purpose computer includes a data memory for storing a plurality of operands, each operand having at least one element, and a processing unit having a plurality of sub-processing units. The method includes the following steps.
a) An instruction is provided from a predetermined sequence of instructions to be executed by the processing unit.
b) The instruction specifies either scalar mode or vector mode for the processing unit's handling of the operand. Scalar mode indicates to the processing unit that the operand contains one element; vector mode indicates to the processing unit that the operand contains multiple sub-elements.
c) In scalar mode, each sub-processing unit in the processing unit receives its respective portion of the operand to be processed in response to the instruction and generates a partial intermediate result.
d) Each sub-processing unit passes its intermediate result among the sub-processing units and combines its partial result with those of the other sub-processing units to generate the final result for the operand.
e) A first condition code is generated corresponding to the final result.
f) In vector mode, each sub-processing unit in the processing unit receives and processes its respective sub-element from among the multiple sub-elements in the operand in response to the instruction and generates a partial intermediate result; each intermediate result is disabled, and each partial result represents the final result for its corresponding element.
g) A plurality of second condition codes is generated, each second condition code corresponding to an independent result.

[0010]

DETAILED DESCRIPTION OF THE INVENTION

General Implementation Considerations

When a SIMD scheme is incorporated into a general-purpose machine, several issues should preferably be considered.

[0011] 1) The choice of scalar or vector operation should preferably be made instruction by instruction, rather than by switching into vector mode for a period of time, because some algorithms are not easily vectorized with large vector sizes. In addition, when a vector operation is selected, the dimension of the vector must be specified.

[0012] At present, in accordance with the invention, scalar/vector information is specified by a data type qualifier field in each instruction that has SIMD capability. For example, an instruction may feature a 1-bit "path" qualifier field that can specify a word or a halfword-pair operation. Furthermore, to select larger vector dimensions, such as 4 or 8, this field should preferably be combined with the data type conversion field in the streamer context register. A complete description of the streamer is given in related U.S. Patent Application Serial No. 917,872, entitled "Streamer for RISC Digital Signal Processors" (STREAMER FOR DIGITAL SIGNAL PROCESSOR), filed July 23, 1992, the disclosure of which is incorporated herein by reference.

[0013] 2) The machine must provide for conditional execution based on vector results. It is important to be able to test the result of a SIMD operation just as if it had been produced by multiple scalar operations. For this reason, the condition code flags in the status register are preferably duplicated so that there is one set for each segment of the data path. For example, a vector dimension of 4 would require 4 sets of condition codes.

[0014] Conditional instructions also need to specify which set of condition codes to use. It would be useful to be able to test combinations of conditions, such as "if any carry flag is set" or "if all carry flags are set".
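The "any" and "all" tests over duplicated condition codes can be sketched as follows. This is a minimal illustration, not the patented logic: the function names, the four 8-bit lanes, and the choice to model only the carry flag are assumptions made for the example.

```python
def lane_carries(x, y, lanes=4, width=8):
    """Add two packed operands lane by lane and return each lane's carry-out flag.

    Lane 0 is the least significant segment (an assumption for this sketch).
    """
    mask = (1 << width) - 1
    carries = []
    for i in range(lanes):
        a = (x >> (i * width)) & mask
        b = (y >> (i * width)) & mask
        carries.append((a + b) > mask)  # carry out of this segment only
    return carries


def any_carry_set(carries):
    """Condition 'if any carry flag is set'."""
    return any(carries)


def all_carry_set(carries):
    """Condition 'if all carry flags are set'."""
    return all(carries)
```

With a vector dimension of 4, one flag per segment is produced, and a conditional instruction could then select a single flag or one of the combined tests.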

[0015] 3) The SIMD scheme should be applicable to as many operations as possible. The preferred embodiment of the invention described below shows the machine in its current implementation, with features such as a 16-bit multiplier and 32-bit input data, but those skilled in the art will recognize that other variations can readily be constructed in accordance with the invention.

[0016] The following are examples of possible operations (listed in FIGS. 28 to 36) for which the space vector (SV) technique can enhance performance.

[0017]
ABS, NEG, NOT, PAR, REV, ADD, SUB, SUBR, ASC, MIN, MAX, Tcond
SBIT, CBIT, IBIT, TBZ, TBNZ
ACC, ACCN, MUL, MAC, MACN, UMUL, UMAC
AND, ANDN, OR, XOR, XORC
SHR, SHL, SHRA, SHRC, ROR, ROL
Bcond
LOAD, STORE, MOVE, Mcond
Here cond may be CC, CS, VC, VS, ZC, or ZS.

[0018] 4) The memory data bandwidth should match the performance of the SIMD data path.

[0019] It is desirable to match memory and bus bandwidth to the data requirements of the space vector data path without increasing hardware complexity. The two 32-bit buses with dual-access 32-bit memory in the currently implemented machine are well matched to the arithmetic logic unit (ALU) and the dual 16x16 multiply/accumulate unit (MAC). They would also be well matched to four 8x8 MACs.

[0020] 5) Any additions or modifications that are implemented should be cost-effective, maximizing performance with minimal additional hardware complexity.

[0021] An adder/subtractor can be operated in space vector mode by breaking the carry propagation and duplicating the condition code logic.

[0022] A shifter can be operated in space vector mode by additionally reconfiguring the wraparound logic and duplicating the condition code logic.
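The effect of reconfiguring the wraparound logic can be illustrated with a rotate-left operation. This is only a sketch under stated assumptions: the function names, the fixed 32-bit operand, and the two 16-bit lanes are choices made for the example, and condition codes are omitted.

```python
def rol(value, n, width):
    """Rotate a width-bit value left by n positions."""
    n %= width
    mask = (1 << width) - 1
    return ((value << n) | (value >> (width - n))) & mask


def rotate_left(x, n, vector):
    """Rotate a 32-bit operand as one word, or as two independent 16-bit lanes."""
    if not vector:
        return rol(x, n, 32)  # bits wrap around the full word
    lo = rol(x & 0xFFFF, n, 16)  # bits wrap within the lower halfword
    hi = rol(x >> 16, n, 16)     # bits wrap within the upper halfword
    return (hi << 16) | lo
```

For the operand 0x80000001 rotated by one position, word mode moves the top bit into bit 0, while halfword-pair mode wraps each lane on its own 16-bit boundary.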

[0023] A bit logic unit can be operated in space vector mode simply by duplicating the condition code logic.

[0024] A space vector conditional move operation can be achieved by using a vector of condition code flags to control multiplexers, so that each element of the vector is moved independently.
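A per-element conditional move can be sketched as one two-way selection per lane. The function name, the flag ordering (lane 0 is the least significant element), and the default 16-bit lane width are assumptions for illustration.

```python
def vector_cond_move(dest, src, flags, width=16):
    """Return dest with each lane replaced by the matching lane of src where its flag is set.

    flags[i] plays the role of the multiplexer select for lane i.
    """
    mask = (1 << width) - 1
    out = 0
    for i, take_src in enumerate(flags):
        chosen = src if take_src else dest
        out |= ((chosen >> (i * width)) & mask) << (i * width)
    return out
```

With flags [True, False], only the lower halfword is moved and the upper halfword of the destination is left unchanged.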

[0025] Space vector multiplication requires duplicating the multiplier array and combining partial products. For example, four 16x8 multipliers with appropriate combining logic can be used to perform four 16x8 or two 16x16 vector operations, or one 32x16 scalar operation. A space vector multiply-accumulate operation also requires an accumulating adder that can break carry propagation and duplicate the condition code logic, as well as vectorized accumulator registers.
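The way four 16x8 partial products can be combined into two 16x16 products or one 32x16 product can be checked arithmetically. The sketch below is an assumption-laden illustration: it handles unsigned operands only, and the function names are hypothetical rather than taken from the patent.

```python
def mul16x8(a, b):
    """Model of one 16x8 unsigned multiplier array."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 8)
    return a * b


def mul16x16(a, b):
    """One 16x16 product from two 16x8 partial products."""
    return mul16x8(a, b & 0xFF) + (mul16x8(a, b >> 8) << 8)


def mul32x16(x, y):
    """One 32x16 scalar product from four 16x8 partial products."""
    x_lo, x_hi = x & 0xFFFF, x >> 16
    return mul16x16(x_lo, y) + (mul16x16(x_hi, y) << 16)
```

Two independent calls to `mul16x16` correspond to the halfword-pair vector case; `mul32x16` uses the same four multipliers and differs only in how the partial products are shifted and summed.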

[0026] 6) The programming complexity arising from implementing space vectors in a general-purpose computer should be minimized. Instructions can be devised to combine space vector results into scalar results.

[0027]
ACC Az, Ax, Ay: add accumulators.
SA Ay, Mz: store scaled accumulator pair to memory.

[0028]
MAR Rz, Ax: move scaled accumulator pair to register.

7) When a vector crosses a physical memory boundary, access to the vector should still be possible. Some algorithms, such as convolution, require incrementing through a data array. When the array is treated as vectors of length N, a vector may lie partly in one physical memory location and partly in the adjacent physical memory location. To maintain performance for such space vector operations, it is preferable either to design the memory to accommodate data accesses that cross physical boundaries or to use a streamer as described in the aforementioned U.S. patent application "Streamer for RISC Digital Signal Processors".
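The boundary-crossing case can be made concrete with a small model of a halfword-pair load. This is a hypothetical sketch, not the streamer described in the referenced application: memory is modeled as a list of 32-bit words, and the even halfword is assumed to sit in the low half of each word.

```python
def load_halfword_pair(mem, h):
    """Load a two-element vector whose first element is halfword index h.

    mem is a list of 32-bit words. When h is odd, the vector straddles
    two adjacent words and both must be read.
    """
    def halfword(i):
        word = mem[i // 2]
        return (word >> (16 * (i % 2))) & 0xFFFF

    return halfword(h) | (halfword(h + 1) << 16)
```

Stepping h by one through an array, as a convolution would, alternates between aligned loads (one word) and straddling loads (two words), which is the access pattern the memory or streamer has to sustain.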

[0029]

Overall System

FIG. 2 is a generalized representation of a programmable processor that may incorporate the space vector data path of the present invention. One of the concepts embodied in the invention is that a computer designed to handle scalar operands, or array elements, one at a time can have its performance enhanced by being modified to process more than one operand at the same time.

[0030] FIG. 2 shows a programmable processor, or "computer" in the broad sense, having a program and data storage unit 100 for storing programs and data operands. Instruction collection unit 130 fetches instructions from storage unit 100; they are decoded and interpreted by instruction fetch/decode/sequence unit 140 and executed by processing unit 110. Processing unit 110 thus executes instructions on operands supplied from storage unit 100.

[0031] To enhance performance, each instruction contains bits specifying whether the operands are scalars or vectors and, if they are vectors, how many elements each operand contains. This information is sent to processing unit 110 along with the usual decoded instruction, so that processing unit 110 "knows" whether to process the operands as scalars or as vectors.

[0032] Processing unit 110 may be an ALU, a shifter, or a MAC. Storage unit 100 may in general be any kind of memory: a register file, semiconductor memory, magnetic memory, or any of several other types. Processing unit 110 may perform typical operations such as addition, subtraction, logical AND, logical OR, shifts as in a barrel shifter, multiplication, accumulation, and the multiply-accumulate typically found in digital signal processors. Processing unit 110 takes its operands as one operand used by the instruction, two operands used by the instruction, or more. It then performs the operation on those operands to obtain a result. Starting from scalar or vector operands, the operation is carried through to yield a scalar or vector result, respectively.

[0033] The next step is to look more specifically at how processing unit 110 may be formed and how it functions. Although data and programs are shown combined in storage unit 100, it will be apparent that they can share the same physical memory or be implemented in separate physical memories. Each operand is described as having a typical length of 32 bits, but in general operands could be any of several lengths: the machine could be a 16-bit machine, an 8-bit machine, a 64-bit machine, and so on. Those skilled in the art will recognize that the general approach is that an N-bit operand can be regarded as multiple operands whose widths add up to N bits. Thus a 32-bit word could be, for example, two 16-bit halfwords, or four 8-bit quarterwords or bytes. In the inventors' current implementation, the elements in one operand are all of the same width. However, a 32-bit operand could also have one element of 24 bits and the other of 8 bits. The advantage derived from using multiple data paths and multiple elements in an operand is that all the elements are processed independently and simultaneously, so that processing throughput is increased.
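The view of an N-bit operand as several elements whose widths sum to N can be sketched with a small unpacking helper. The function name and the convention that the first width describes the least significant element are assumptions for illustration.

```python
def split_operand(word, widths):
    """Split an operand into elements, least significant element first.

    widths lists the bit width of each element; the widths are expected
    to sum to the operand width (32 in the examples here).
    """
    elements = []
    shift = 0
    for w in widths:
        elements.append((word >> shift) & ((1 << w) - 1))
        shift += w
    return elements
```

The same helper covers the equal-width cases (two halfwords, four bytes) and the unequal 24-bit plus 8-bit split mentioned above.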

[0034] Instructions can be of any size. Currently 32-bit instructions are used, but those skilled in the art may find 8-bit, 16-bit, 32-bit, and 64-bit instructions particularly useful. More importantly, instructions need not even be of fixed length. The same concept would work in a variable-length instruction machine, such as one with 16-bit instructions extensible to 32 bits, or one whose instructions consist of some number of 8-bit bytes, the number depending on the particular instruction. For those skilled in the art, FIGS. 28 to 36 summarize an exemplary instruction set, showing instructions that may be implemented in accordance with the present invention.

[0035] Processing unit 110 may typically include ALU 121 and/or MAC 122. It may also implement only shifter 123 or logic unit 124.

[0036]

Adder

FIG. 3 is a schematic representation of an adder that may be implemented in the ALU of the processing unit (110 in FIG. 2). FIG. 3(a) shows a conventional 32-bit adder. FIG. 3(b) shows two 16-bit adders connected for halfword-pair mode. FIG. 3(c) shows two 16-bit adders connected for word mode.

[0037] FIGS. 3(a) to 3(c) serve to show how the typical hardware of a conventional 32-bit machine, as in FIG. 3(a), may be modified to achieve the desired halfword-pair mode or word mode in accordance with the present invention. The vector is shown here as having two elements. More specifically, the figures show how a conventional 32-bit operand can be split into two elements of 16 bits each. The same principle could be applied to splitting an operand into several elements of equal or unequal length.

[0038] Referring to FIG. 3(a), conventional adder 200 has an input X for the X operand and an input Y for the Y operand. It also has carry-in 201 and condition codes 205, which are typically found in association with an adder. Condition codes 205 may be V for overflow, C for carry-out, and Z for a zero result, that is, when the result output from the adder is zero. The adder also has a result operand output, S. X, Y, and S are all 32-bit words. Control input s/u 202 indicates signed or unsigned operands: in a signed operand the most significant bit indicates whether the number is positive or negative, while in an unsigned operand the most significant bit contributes to the magnitude of the operand. FIG. 3(b) shows how two adders, each similar to a typical 32-bit adder but only 16 bits wide, can be combined to perform vector operations on halfword pairs, that is, on operands with two halfword elements each. The Y operand is here split into two halfword operands: Y0 to Y15 in the lower half and V0 to V15 in the upper half. Similarly, the X operand is split into two halfword operands: X0 to X15 in the lower half and U0 to U15 in the upper half. The result appears as S0 to S15 from adder 210 and, in the upper half, W0 to W15 from adder 220. In essence, 32-bit adder 200 may be split down the middle to form two 16-bit adders 210 and 220. However, the most significant bits need logic to determine the nature of the operand's sign bit. Splitting 32-bit adder 200 therefore requires additional logic for sign control of the lower 16 bits, which are split off from the 32-bit adder to form adder 210. The two adders 210 and 220 will then be identical, except that the input operands of adder 210 come from the lower half of the 32-bit operands and the input operands of 16-bit adder 220 come from the upper half.

[0039] When operand elements X and U are added separately to Y and V, respectively, they yield results S and W, respectively. Independent condition codes are also generated for each adder: adder 210 generates condition codes 215 and adder 220 generates condition codes 225. These condition codes apply to the particular halfword adder with which they are associated. It can thus be seen how a conventional 32-bit adder is modified only slightly to perform independent halfword-pair operations.

[0040] Referring to FIG. 3(c), the same adder units of FIG. 3(b) may be reconnected to perform the original word operation performed by adder 200 of FIG. 3(a). This is the case where the operands represent 32-bit scalars, Y0 to Y31 and X0 to X31. The lower halves of these operands are processed by adder 230 and the upper halves by adder 240. The mechanism that makes this possible is the connection of the carry-out of adder 230 to carry-in 236 of adder 240. As shown in FIG. 3(c), the two 16-bit adders so combined perform the same function as the single 32-bit adder of FIG. 3(a). Thus, in the implementations shown in FIGS. 3(b) and 3(c), adder 210 may be essentially the same as adder 230, while adder 220 may be the same as adder 240. Although this description shows how these two adders can function in either halfword-pair mode or word mode, those skilled in the art may, by extension, modify a conventional adder into several adders that handle the independent elements of a vector simultaneously, and recombine them to perform scalar operations on scalar operands.
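The two connections of FIGS. 3(b) and 3(c) can be modeled with two 16-bit adder slices and a switchable carry link. This is a behavioral sketch only: the function names are hypothetical, V is computed as signed overflow, and the s/u control input is not modeled.

```python
def add16(a, b, carry_in=0):
    """One 16-bit adder slice: returns (sum, flags) with carry C, signed overflow V, zero Z."""
    total = a + b + carry_in
    s = total & 0xFFFF
    flags = {
        "C": total > 0xFFFF,
        "V": bool((a ^ s) & (b ^ s) & 0x8000),  # operands share a sign that the sum lacks
        "Z": s == 0,
    }
    return s, flags


def dual_mode_add(x, y, vector):
    """Add two 32-bit operands as a halfword pair (vector=True) or as one word."""
    s_lo, f_lo = add16(x & 0xFFFF, y & 0xFFFF)
    # The carry link between the slices is the only data-path difference.
    carry_link = 0 if vector else int(f_lo["C"])
    s_hi, f_hi = add16(x >> 16, y >> 16, carry_link)
    return (s_hi << 16) | s_lo, f_lo, f_hi
```

Adding 0x0001FFFF and 0x00010001 shows the difference: in halfword-pair mode the carry out of the lower slice is reported in its own flags and dropped from the data path, while in word mode it propagates into the upper slice.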

[0041] One point about the adder of FIG. 3 deserves attention. FIG. 3(c) shows two sets of condition codes, 235 and 245, whereas the original conventional adder has only one set of condition codes 205. The condition codes for FIG. 3(c) are really those of 245, except for condition code Z. The condition codes of 235, namely overflow V and carry C, are ignored as far as condition codes 205 are concerned, and condition code Z is effectively the Z condition code of 245 ANDed with the Z condition code of 235. Thus V of 205 corresponds to V of 245, C of 205 corresponds to C of 245, and Z of 205 corresponds to Z of 245 ANDed with Z of 235. Those skilled in the art will be able to combine these in whatever particular manner they find suitable.
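The mapping from the two halfword flag sets to the single word-mode set can be written out directly. The function name and the dictionary representation of a flag set are assumptions for this sketch; `lower` stands for condition codes 235 and `upper` for condition codes 245.

```python
def word_mode_flags(lower, upper):
    """Derive the word-mode condition codes (205) from the two halfword sets (235, 245)."""
    return {
        "V": upper["V"],                 # overflow of the lower slice is ignored
        "C": upper["C"],                 # carry of the lower slice is ignored
        "Z": upper["Z"] and lower["Z"],  # the word is zero only if both halves are zero
    }
```

Only Z needs information from both halves; V and C are taken unchanged from the upper slice.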

[0042]

Logic Unit

FIG. 4 is a schematic diagram of a logic unit that may be implemented in accordance with the present invention. FIG. 4(a) shows a typical 32-bit logic unit that performs bitwise logical operations, bitwise complement, or some of the combinations typically found in current processors. What may be important about these operations is that they work independently on the different bits. In condition codes 305, the overflow bit normally has no significance at all. Carry-out likewise has no significance in logical operations, but zero is still significant in indicating that the result is zero. For halfword-pair operations, the original 32-bit logic unit would be "operationally" split into two 16-bit logic units. The upper 16 bits 320 and the lower 16 bits 310 of the input operands would be split into two halfwords in the same manner as for the adder. Because bits are generally processed independently in logical operations, there is no operational connection at all between the two logic units 310 and 320.

[0043] FIG. 4(c) shows the logic units recombined to form a typical logic unit for scalar processing. Note that no connection is needed between the units except in the condition code area. The zero condition code 305 of the conventional logic unit may be obtained here by ANDing the zero condition code of unit 345 with the zero condition code of unit 335. It should therefore be apparent to those skilled in the art that a dual-mode logic unit can be constructed by extending the concept and implementation of the dual-mode adder described above.
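A dual-mode logic unit needs no data connection between its halves, only the combination of the two zero flags. The sketch below illustrates that point under assumed names (`OPS`, `logic16`, `dual_mode_logic`), none of which appear in the source.

```python
MASK16 = 0xFFFF

# Bitwise operations applied independently to every bit position.
OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "not": lambda a, b: ~a,      # complement of the first operand
}

def logic16(op, a, b):
    """One 16-bit logic slice: returns (result, zero flag)."""
    r = OPS[op](a, b) & MASK16
    return r, int(r == 0)

def dual_mode_logic(op, x, y, word_mode):
    """Apply op to 32-bit operands as one word or as two halfwords."""
    lo, z_lo = logic16(op, x & MASK16, y & MASK16)
    hi, z_hi = logic16(op, x >> 16, y >> 16)
    result = (hi << 16) | lo
    if word_mode:
        # The word-level Z is the AND of the two halfword Z flags.
        return result, z_lo & z_hi
    return result, (z_lo, z_hi)
```

The result bits are identical in both modes; only the reported zero condition differs.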

[0044] Shifters

FIGS. 5 to 8 are schematic representations of barrel shifters that may be implemented in accordance with the present invention. Some processors have a barrel shifter as shown in FIG. 5(b), while others have a 1-bit shifter as shown in FIGS. 5(a), 6 and 7. A barrel shifter is not typically required within a processor unit, but for a high-performance machine the processor unit may implement a shifter unit such as that of FIG. 5(b). The following description shows how shifters may be configured and implemented by those skilled in the art to speed up processing or to minimize the amount of hardware required. FIG. 5(a) shows how a 1-bit shift, either left or right, may be implemented in a typical processor. Under control of the direction input DIR 401, shifter 415 shifts the 32-bit input operand X one bit to the left, one bit to the right, or not at all, to produce the output Z. When a shift occurs and it is a left shift, selection box 416 must supply the bit that enters the least significant bit position.

[0045] When the shift is to the right, a bit from selection box 400 enters the most significant bit position. Selection boxes 400 and 416 each have several inputs from which the bit entering shifter 415 can be selected. Both boxes also have a select input labeled SEL, which comes from the instruction and is typical of conventional machines. SEL determines which of the input bits is selected to enter the shifter. In general, because of these selection boxes, a shift can be a rotate, in which the bit shifted out of the shifter is shifted in at the other end; an arithmetic right shift, in which the sign bit (the most significant bit) is dragged along as the other bits shift right; or an arithmetic left shift, in which a 0 is inserted as the other bits shift left. For a logical shift, a "0" is inserted as the new bit. A "1" may also be inserted as the new bit in a logical shift.
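The role of the two selection boxes can be illustrated with a small behavioural model. This is a sketch under assumptions: the function name `shift1` and the `fill` values `"rotate"`, `"arithmetic"`, `"logical"` and `"ones"` are labels chosen here to stand in for the SEL input, not names from the source.

```python
def shift1(x, direction, fill="logical", width=32):
    """Shift x by one bit; direction is "left", "right" or "none"."""
    mask = (1 << width) - 1
    x &= mask
    msb = x >> (width - 1)
    lsb = x & 1
    if direction == "none":
        return x
    if direction == "left":
        # Bit entering at the LSB end (the job of selection box 416).
        bit_in = {"rotate": msb, "arithmetic": 0, "logical": 0, "ones": 1}[fill]
        return ((x << 1) & mask) | bit_in
    if direction == "right":
        # Bit entering at the MSB end (the job of selection box 400).
        bit_in = {"rotate": lsb, "arithmetic": msb, "logical": 0, "ones": 1}[fill]
        return (x >> 1) | (bit_in << (width - 1))
    raise ValueError(f"unknown direction: {direction}")
```

The shifter core is the same in every case; only the bit chosen to enter the vacated position changes with the kind of shift.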

[0046] By referring to the description of the condition codes for the adder and the logic unit, those skilled in the art can readily assign condition codes to the shifter: an overflow for arithmetic left shift operations, a carry to hold the last bit shifted out, and a zero flag to record when the result of the shift is zero.

[0047] By combining shifters of the kind shown in FIG. 5(a), the shifter of FIG. 5(b) may be formed as a 32-bit left/right barrel shifter. This may be done by cascading 32 of the shifters of FIG. 5(a) one after another, so that the output of the first feeds the input of the second, and so on to the last. The number of bits to be shifted is determined by the pattern of 1s and 0s on the directional inputs DIR of the individual shifters. Note that in FIG. 5(a) the direction for the shifter is three-valued: left, right, or straight ahead, in which no shift takes place. In FIG. 5(b), therefore, the directional input to each 32-bit-wide 1-bit shifter can be left, right, or no shift. If a 32-bit left shift is required, all the directional inputs indicate left.

[0048] If only a 1-bit left shift is required, the first box indicates a 1-bit left shift and the other 31 all indicate no shift. For an N-bit left shift, the first N boxes have a directional input of one bit to the left and the remaining boxes indicate no shift. The same applies to right shifts: the direction then indicates either a right shift or no shift, and shifts of 0 to 32 bits are likewise possible.
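The cascade described above can be sketched as follows. The names `stage`, `direction_pattern` and `barrel_shift` are hypothetical, and for brevity each stage performs a logical shift (a 0 enters the vacated position) rather than modelling the selection boxes.

```python
MASK32 = 0xFFFFFFFF

def stage(x, direction):
    """One 32-bit-wide 1-bit logical shifter stage."""
    if direction == "left":
        return (x << 1) & MASK32
    if direction == "right":
        return x >> 1
    return x  # "none": pass the data straight through

def direction_pattern(n, direction):
    """First n stages shift by one bit, the remaining 32 - n pass through."""
    if not 0 <= n <= 32:
        raise ValueError("shift count must be between 0 and 32")
    return [direction] * n + ["none"] * (32 - n)

def barrel_shift(x, n, direction):
    """Shift x by n bits by driving a cascade of 32 one-bit stages."""
    for d in direction_pattern(n, direction):
        x = stage(x, d)
    return x
```

A hardware barrel shifter evaluates all stages combinationally in one pass; the loop here only mirrors the order in which data flows through the cascade.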

[0049] Referring now to FIG. 6, the typical 1-bit shifter of FIG. 5(a) can be divided into two 16-bit shifters. FIG. 6 shows two 16-bit left/right 1-bit shifters connected for halfword pair mode. Shifter 415 of FIG. 5(a) is operationally divided into two 16-bit 1-bit shifters 450 and 435. Each of these 16-bit shifters has the input selection logic shown in FIG. 5(a), namely 416 and 400, which is duplicated so that box 450 has boxes 460 and 445 and box 435 has boxes 440 and 430. The selection logic is the same, but the inputs to the selection boxes are wired differently. The difference between the shifter of FIG. 6, connected for halfword pair mode, and the shifter of FIG. 7, connected for word mode, therefore lies in how the input selection boxes are wired. In FIG. 6 the input operand elements for the lower shifter 450 are X0 to X15, and those for shifter 435 are Y0 to Y15. X and Y thus denote the two halfwords.

[0050] The resulting Z output operand is shown as two halfwords: the lower 16 bits are Z0 to Z15 and the upper 16 bits are W0 to W15. The input selectors are wired so that, in a rotate, the bit shifted out of a shifter is fed back to the other end of that shifter. When shifter 435 rotates left, the bit rotated in is Y15; when it rotates right, the bit rotated in is Y0. Likewise for shifter 450, the input bit is X15 for a left rotate and X0 for a right rotate. The selection for arithmetic and logical shifts works as in FIG. 5(a).

[0051] FIG. 7 shows how these same two shifters can be connected for word mode. Here the shift acts on the entire 32 bits of the operand rather than on two halfwords as in FIG. 6. For a left rotate, the bit rotated out of the lower shifter 486 (its MSB, X15) is rotated into the upper shifter 475, where it enters as the LSB. This forms a continuous shift between the upper 1-bit shifter and the lower 1-bit shifter. For a rotate around the two shifters, X31 is shifted into X0. If, as shown in FIG. 7, all the inputs of selector 480 are connected to X15 and all the inputs of selector 485 are connected to X16, the combined shifter of FIG. 7 effectively operates as the shifter of FIG. 5(a). Input selector 470 has the same input pattern as input selector 400, and input selector 488 has the same input pattern as selector 416. The combined shifter of FIG. 7 therefore performs the same shift operations on scalar operands as the shifter of FIG. 5(a).
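The difference between the FIG. 6 and FIG. 7 wirings can be seen in a one-bit left rotate. The following sketch assumes a hypothetical function `rotate_left_1` and models only which bit each 16-bit slice receives; it is not a description of the actual selector hardware.

```python
MASK16 = 0xFFFF

def shl16(h, bit_in):
    """16-bit slice shifted left by one, with bit_in entering at the LSB."""
    return ((h << 1) & MASK16) | bit_in

def rotate_left_1(operand, word_mode):
    """Rotate a 32-bit operand left by one bit in word or halfword pair mode."""
    lo, hi = operand & MASK16, operand >> 16
    lo_msb, hi_msb = lo >> 15, hi >> 15
    if word_mode:
        # Word wiring: the low MSB feeds the high slice and the
        # high MSB wraps around to bit 0 of the low slice.
        lo_in, hi_in = hi_msb, lo_msb
    else:
        # Halfword pair wiring: each slice feeds its own MSB back
        # into its own LSB.
        lo_in, hi_in = lo_msb, hi_msb
    return (shl16(hi, hi_in) << 16) | shl16(lo, lo_in)
```

The two slices are identical in both modes. Only the source of the incoming bit changes, which is the sense in which the selection boxes are "wired differently".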

[0052] The 1-bit shifters of FIGS. 6 and 7 can further be extended to the 32-bit barrel shifter shown in FIG. 8 by cascading 32 of them, in a manner similar to FIG. 5(b). If a 1-bit shift is desired, the direction control signal to the first shifter indicates a 1-bit shift and the other cascaded 1-bit shifters indicate no shift. For an N-bit shift, the directional inputs of the first N 1-bit shifters indicate a 1-bit shift, and the remaining 1-bit shifters pass the data through without shifting.

[0053] Likewise, the barrel shifter of FIG. 8 can perform either word or halfword pair mode operations, because each individual 1-bit shifter can perform either. The embodiment of FIGS. 5 to 8 represents one way of implementing a barrel shifter, but the same concept of splitting the barrel shifter in the middle and providing input selection logic can be applied to many other barrel shifter implementations. Those skilled in the art should be able to find an implementation suited to their particular hardware or throughput requirements.

[0054] Multiply-Accumulator

FIGS. 9 and 10 are schematic diagrams of a multiply and accumulate (MAC) unit that may be implemented in accordance with the present invention.

[0055] A typical 32-bit processor does not usually require an expensive 32×32 multiplier array; multiplication is likely to be accomplished in other ways. In typical 16-bit signal processors, however, 16×16 multiplier arrays are quite common. Because the kinds of computation that need fast multiplication typically use 16-bit data, the 16×16 multiplier array has become the more prevalent, even in some 32-bit processors. Therefore, by treating a 32-bit operand as a pair of 16-bit halfwords, two 16×16 multiplier arrays can be implemented to exploit the concept of 32-bit word operands, halfword operands, or a spatial vector of halfword elements within one vectorized operand.

[0056] The following example shows how a 16×16 multiplier array can be duplicated for use as two halfword pair multipliers, which may also be connected together to provide a 32×16 scalar multiplication. The 32×16 scalar multiplication is useful in that two such multiplications can be used together to form a 32×32-bit multiplication. Alternatively, the 32×16 multiplication can be used by itself, in which case an operand of 32-bit precision is multiplied by an operand of only 16-bit precision.

[0057] MAC units are not typically found in all processors, but they are typically implemented in high-performance processors for signal processing applications. FIG. 9 shows a conventional implementation of a MAC unit. A MAC can be any of various sizes; this one is a 16-bit × 16-bit unit whose multiplier forms a 32-bit product. The 32-bit product may be added to a third operand in the accumulator adder, and that operand may be longer than the product because of extra high-order bits called "guard bits".

[0058] As shown in FIG. 9, the input operands are 16 bits wide and are denoted X0 to X15 and Y0 to Y15. They produce a 32-bit product Z, which may be added to a feedback operand F. Here F is shown as F0 to F39, representing a 40-bit feedback word or operand. It is 40 bits wide because 32 bits are needed to hold the product and an additional 8 bits are needed for guard bits. The guard bits are included to handle overflow: when several products are added together an overflow may occur, and the guard bits accumulate the overflow and so preserve it. The number of guard bits would typically be 4 or 8. Eight are shown in this example, but other sizes are possible. The accumulator result is shown as the 40-bit result A0 to A39.
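The purpose of the guard bits can be shown numerically. The sketch below models a signed 16×16 multiply feeding a 40-bit accumulator; the names `to_signed` and `mac` are illustrative assumptions, and the accumulator is treated as a two's-complement value that wraps at 40 bits.

```python
ACC_BITS = 40  # 32 product bits plus 8 guard bits

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(x, y, acc, signed=True):
    """Return the 40-bit accumulator after adding the product x * y."""
    if signed:
        x, y = to_signed(x, 16), to_signed(y, 16)
    else:
        x, y = x & 0xFFFF, y & 0xFFFF
    return to_signed(acc + x * y, ACC_BITS)

# Accumulating four maximal positive products already exceeds the
# signed 32-bit range, but still fits comfortably in 40 bits.
acc = 0
for _ in range(4):
    acc = mac(0x7FFF, 0x7FFF, acc)
```

With 8 guard bits, at least 256 worst-case signed products can be summed before the accumulator itself can overflow.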

[0059] It should be noted that the multiplier array can be used either with or without the accumulator. It should also be noted that a further input S/U, meaning signed or unsigned, indicates whether the input operands are to be treated as signed or unsigned numbers. Those skilled in the art will recognize that the upper bits of the multiplier are handled differently depending on whether the input operands are signed or unsigned.

[0060] FIG. 10 shows how typical 16×16 arrays are arranged to handle halfword pair operands. Here the 32-bit input operand X is split into two halfwords: the lower halfword, X0 to X15, goes to multiplier 520, and the upper halfword, X16 to X31, goes to multiplier 515. The Y input operand is likewise split into two halfword operands: the lower halfword, Y0 to Y15, for multiplier 520 and the upper half, Y16 to Y31, for multiplier 515. FIG. 10 thus shows the connections for multiplying each halfword of the X operand by the corresponding halfword of the Y operand. Note that the least significant halfword of X is multiplied by the least significant halfword of Y in multiplier 520, while independently and simultaneously the upper halfword of X is multiplied by the upper halfword of Y in multiplier 515. These two multiplications yield two products. The 32-bit product from multiplier 520 is denoted Z0 to Z31, and the 32-bit result of multiplier 515 is denoted W0 to W31. Each of the two products is wider than 16 bits in order to preserve its precision. At this point the halfword products are kept as representations of independent operands.

[0061] The product of the lower halfwords, output by multiplier 520, is sent to accumulator 530 and added to the feedback register denoted F0 to F39. This forms the accumulated product A, denoted A0 to A39. Similarly, for the upper halfwords, the product denoted W0 to W31 is added in accumulator 525 to the feedback register denoted G0 to G39 to form the 40-bit result B, denoted B0 to B39. The accumulator results are generally kept in the accumulators as operands with a larger number of bits in order to preserve the precision of the multiplication.
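The halfword pair MAC amounts to two independent multiply-accumulate lanes fed from the two halves of each operand. A minimal sketch, assuming signed halfwords and using the hypothetical names `split_halfwords` and `dual_mac`:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def split_halfwords(operand):
    """Return (upper, lower) signed 16-bit halfwords of a 32-bit operand."""
    return to_signed(operand >> 16, 16), to_signed(operand, 16)

def dual_mac(x, y, acc_hi, acc_lo):
    """One halfword pair MAC step; returns the two 40-bit accumulators."""
    x_hi, x_lo = split_halfwords(x)
    y_hi, y_lo = split_halfwords(y)
    a = to_signed(acc_lo + x_lo * y_lo, 40)  # lower lane: result A
    b = to_signed(acc_hi + x_hi * y_hi, 40)  # upper lane: result B
    return b, a
```

For example, with `x = 0x0002FFFF` (halfwords 2 and -1) and `y = 0x00030004` (halfwords 3 and 4), the upper lane accumulates 6 and the lower lane accumulates -4, with no interaction between them.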

[0062] The feedback bits would normally come either from memory (100 in FIG. 2) or from a special memory able to store a larger number of bits. A typical memory location holds 32 bits, but a special memory, typically called an accumulator file, could store 40 bits for a scalar product or 80 bits for a halfword pair product. In that case two accumulator registers, each able to hold a scalar operand, may be used together to form the storage for a halfword pair operand. In other words, two 40-bit accumulators can be used to store the two 40-bit results of a halfword pair operation.

[0063] MAC Interconnect

FIGS. 11 and 12 show how the two 16-bit multipliers of the array of FIG. 10 can be interconnected to form a 16×32-bit multiplication for scalar operands. In this example the multiplier arrays are implemented as rows of adders. The carry-out 605 of the least significant multiplier array 610 is supplied as a carry input to the adders of the upper multiplier array 600. In addition, the sum bits 606 formed at the least significant end of the upper multiplier array 600 are supplied to the most significant end of the adders of the lower multiplier array 610.

[0064] Other connections occur at accumulators 615 and 605. Accumulator 615, which holds the lower part of the product, is limited to 32 bits; its upper 8 guard bits are not used. Its 32-bit carry-out is supplied to the carry input of the upper 40-bit accumulator 605, and the result is a 72-bit operand, shown here as B0 through B39. This operand is typically stored as two operands, with the lower 32 bits stored in one accumulator 615 and the upper 40 bits stored in the second accumulator 605. In this operation, moreover, signed and unsigned handling differs between the two halves: the lower half of input operand X is treated as an unsigned number in multiplier 610, and the upper 16 bits of input operand X are treated as a signed or unsigned operand in the upper multiplier array 600.

[0065] Further, in the lower accumulator 615 the product is treated as an unsigned operand, while in the upper accumulator 605 the operand is treated as a signed number. It should be added that the 40-bit accumulators are treated as signed numbers in all the examples of FIGS. 11 and 12. This is possible because the guard bits, which extend the accumulator, provide room for even an unsigned number to be regarded as the positive part of a signed number. A signed number in the extended accumulator therefore covers both signed and unsigned operands.
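The reason the lower half of X is treated as unsigned while the upper half carries the sign follows from the arithmetic identity X = X_hi · 2^16 + X_lo, where X_hi is signed and X_lo is unsigned. The sketch below checks that identity numerically. It is an arithmetic model only, with the assumed name `mul32x16`, and does not reproduce the carry and sum-bit interconnection of FIGS. 11 and 12.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mul32x16(x, y):
    """Signed 32-bit x times signed 16-bit y from two 16x16 products."""
    x_lo = x & 0xFFFF                 # low half: always unsigned
    x_hi = to_signed(x >> 16, 16)     # high half: carries the sign of x
    y = to_signed(y, 16)
    low = x_lo * y                    # product formed in the lower array
    high = x_hi * y                   # product formed in the upper array
    # The upper product is weighted by 2**16 before the two are summed.
    return (high << 16) + low
```

Here `x` and `y` are passed as raw bit patterns, so `mul32x16(0xFFFFFFFF, 0x0002)` multiplies -1 by 2.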

[0066] FIG. 12 shows in more detail how the carry and sum bits interact between the adders that make up multiplier arrays 600 and 610. For example, adders 625 and 635 are shown as part of multiplier array 610, and adders 620 and 630 are shown as part of multiplier array 600. It should be noted that multiplier arrays 610 and 600 are typically implemented with some arrangement of adders, and in a particular implementation the adders may be interconnected in various ways. FIG. 10 shows a simple cascade of adders, but the same technique may be applied where the adders are connected in other ways, for example as in a Booth multiplier or a Wallace tree multiplier. As shown in FIG. 12, adder 625 of the lower multiplier array 610 produces a carry-out 621 that is supplied to the carry input of the corresponding adder 620 of the upper multiplier array 600. The lower multiplier array 610 operates as if the input operand at the X input were unsigned. The sign of the input operand is determined and used for sign control of the upper adders 620 and 630 of the upper multiplier array 600.

[0067] Moreover, because the adders are connected so that each is offset by one bit position going from the least significant bit of the multiplier to its most significant bit, an opportunity arises to feed the sum bits back in. More specifically, adders 625 and 620 correspond to a lower-order bit of the multiplier, denoted Yi, and adders 635 and 630 correspond to the next higher-order bit, denoted Y(i+1). The offset can be seen in output S1 of adder 625 being supplied to input B0 of adder 635, and S15 of adder 625 being supplied to input B14 of adder 635. This 1-bit offset frees input B15 of adder 635 to receive S0 from adder 620, so that a sum bit from the most significant multiplier array 600 is supplied as an input bit to the least significant multiplier array 610.

[0068] In addition, sum bit S0 from adder 625 goes directly to the next partial product, shown as 640, and need not pass through any further multiplier or adder stage. Thus the S0 outputs of the successive adder stages 625, 635 and so on give rise to output bits Z0 to Z15 of FIG. 9. The output bits from the final partial product, corresponding to S0 to S15 of adder 635, give rise to output bits Z16 to Z31 of array 610 in FIG. 9.
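The way each adder row retires one finished product bit can be illustrated with a plain shift-and-add model. This is a generic unsigned sketch under the assumed name `array_multiply`; it shows the one-bit offset between rows but not the split into two interconnected arrays.

```python
def array_multiply(x, y, bits=16):
    """Unsigned bits x bits multiply, one adder row per multiplier bit."""
    partial = 0      # running sum passed from one adder row to the next
    low_bits = 0     # product bits already retired, starting at Z0
    for i in range(bits):
        if (y >> i) & 1:
            partial += x              # add the multiplicand for bit Yi
        # The least significant sum bit of this row is already final,
        # so it leaves the array directly as product bit Zi.
        low_bits |= (partial & 1) << i
        partial >>= 1                 # one-bit offset into the next row
    # The last row's remaining sum bits form the upper half of the product.
    return (partial << bits) | low_bits
```

After the final row, `low_bits` holds the lower half of the product and `partial` the upper half, matching the split between the per-row S0 outputs and the final partial product.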

[0069] Those skilled in the art will understand how a final adder stage can be used to provide compensation in the event that the multiplier Y is negative.

[0070] Classification of Operand Data Types

Attention is now turned to the classification of operand data types. One approach to specifying the operand mode as scalar or vector is to include that information in the instruction; an alternative approach is to carry the information in additional bits attached to the operand. For example, if the operand has 32 bits, one additional bit may be used to identify it as either a scalar or a vector. Further additional bits may be used if the number of vector elements is to be indicated explicitly; otherwise the number of vector elements may be assumed to be some number such as two. The operand processing unit is then adapted to process the operand as a scalar or as a vector in response to the information attached to the operand.

[0071] Whether an operand is a scalar or a vector may also be specified by the way the operand is selected. For example, the information may be held in a bit field in a memory location that also specifies the address of the operand.

[0072] When two operands are processed by the processing unit and their mode information differs, those skilled in the art may define rules in the processing unit for handling mixed-mode operations. For example, an ADD operation with a vector operand and a scalar operand may be handled by the processing unit by forming a vector from the scalar, truncating it if necessary, and then performing a vector operation.
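One possible mixed-mode rule can be sketched with a tagged operand. Everything here is an assumption made for illustration: the `Operand` class, its `is_vector` field standing in for the additional mode bit, and in particular the choice to form the vector by copying the truncated scalar into both halfwords, which is only one of the rules a designer might adopt.

```python
from dataclasses import dataclass

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

@dataclass(frozen=True)
class Operand:
    bits: int         # 32-bit payload
    is_vector: bool   # the one additional mode bit

def as_vector(op):
    """Form a halfword pair from a scalar, truncating it to 16 bits."""
    if op.is_vector:
        return op
    h = op.bits & MASK16
    return Operand((h << 16) | h, True)

def add(a, b):
    """ADD that picks scalar or vector processing from the operand tags."""
    if not a.is_vector and not b.is_vector:
        return Operand((a.bits + b.bits) & MASK32, False)
    a, b = as_vector(a), as_vector(b)
    lo = (a.bits + b.bits) & MASK16
    hi = ((a.bits >> 16) + (b.bits >> 16)) & MASK16
    return Operand((hi << 16) | lo, True)
```

With this rule, adding the scalar 5 to the vector `0x00010002` adds 5 to each halfword and yields the vector `0x00060007`.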

[0073] Time Sharing as an Alternative to Spatial Hardware

Those skilled in the art will appreciate that time-sharing an implementation can often substitute for a spatially distributed implementation. For example, one vector adder may be used repeatedly to obtain the effect of the multiple adders of a spatially distributed vector processing unit. Hardware multiplexing and demultiplexing may be used to sequence the input operands and the results. A vector adder with additional support hardware may also be used to process a scalar operand piece by piece, in a manner similar to the way distributed vector adders can be interconnected to process scalar operands. The support hardware is used to handle the intermediate results that pass between vector processing elements.
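The time-shared alternative can be sketched with a single 16-bit adder used on two successive passes. The class `TimeSharedAdder` and the functions `add_pair` and `add_word` are hypothetical names; the saved carry plays the part of the support hardware that holds the intermediate result between passes.

```python
MASK16 = 0xFFFF

class TimeSharedAdder:
    """A single 16-bit adder reused on successive passes."""

    def __init__(self):
        self.carry = 0  # support hardware: carry saved between passes

    def step(self, a, b, use_saved_carry):
        total = a + b + (self.carry if use_saved_carry else 0)
        self.carry = total >> 16
        return total & MASK16

def add_pair(adder, x, y):
    """Halfword pair add: two independent passes through one adder."""
    lo = adder.step(x & MASK16, y & MASK16, False)
    hi = adder.step(x >> 16, y >> 16, False)
    return (hi << 16) | lo

def add_word(adder, x, y):
    """Word add: the second pass consumes the carry saved by the first."""
    lo = adder.step(x & MASK16, y & MASK16, False)
    hi = adder.step(x >> 16, y >> 16, True)
    return (hi << 16) | lo
```

The trade is the usual one: half the adder hardware in exchange for two passes per 32-bit operation.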

[0074] With the above description of the invention in mind, an exemplary RISC-type processor incorporating the spatial vector data path of the invention will now be described. Note that the processor system below is only one example of how those skilled in the art might incorporate the invention. Other examples will find their own advantageous applications based on the invention described.

[0075] Exemplary Processor Incorporating the Invention

Reference is made to FIG. 13, which shows a functional diagram of a processing element incorporating the invention. The following description refers to specific bit widths, but those skilled in the art will understand that these are for illustration and that other widths can readily be configured in accordance with the teachings of the invention.

【００７６】図１３を参照すると、図示されるデータ処
理ユニットを制御するために、２つのソースオペランド
および１つの宛先オペランドを特定することのできる命
令が用いられる。Referring to FIG. 13, an instruction capable of specifying two source operands and one destination operand is used to control the illustrated data processing unit.

【００７７】Operands are typically stored in registers and in data memory (200). Arithmetic, logical, and shift instructions are executed in ALU 240 and MAC 230 using operands from the register space, and the results are returned to the register space. The register space consists of register file 220 and some other internal registers (not shown). Operands stored in the register space are either 32-bit words or halfword pairs. Operands are shuttled between the register space and memory 200 by load and store instructions, or, as already described, between the register space and streamer 210, an automatic memory access unit.

【００７８】Referring to FIG. 14, a functional block diagram of ALU 240 is shown. The ALU consists of an adder 410, 420 and a barrel shifter 470. In general, an ALU instruction takes two operands from the register space and writes its result to the register space. ALU instructions can be executed every clock cycle and require only one instruction clock cycle in the ALU pipe.

【００７９】The adder 410, 420 and the shifter 470 perform operations on word or halfword pair operands. Signed operands are represented in two's complement notation. Currently, signed, unsigned, fractional, and integer operands can be specified by the instructions for ALU operations.

【００８０】Adder
The adder (410, 420) performs addition and logical operations on words and on halfword pairs. For halfword pair operations, the adder 410, 420 functions as two halves. The lower half 420 performs the operation on the lower operands 460 of the halfword pairs, and the upper half 410 performs the same operation on the upper operands 450 of the halfword pairs. In halfword pair mode, the two adders 410, 420 are essentially independent of each other. The 32-bit logic unit 440 is used to pass information from the lower adder 420 to the upper adder 410, and back, when the two adders operate in word mode.

【００８１】Adder operations affect two carry (CU and CL), two overflow (VU and VL), and two zero (ZU and ZL) condition code bits. CU is the carry flag for word operations; CU and CL are the carry flags for halfword pair operations. Similarly, VU indicates overflow in word operations, and VU and VL indicate overflow in halfword pair operations.
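The behaviour of the split adder and its condition codes can be illustrated with a small model. This is a sketch, not the circuit of FIG. 14: the link between the halves is reduced to a single carry bit, the overflow test is the usual two's complement rule, and the zero flags are reported only for halfword pair mode because the text does not say which zero flag applies to word operations.

```python
MASK16 = 0xFFFF

def _add16(a, b, carry_in):
    """One 16-bit adder half: returns (sum, carry_out, signed_overflow)."""
    total = a + b + carry_in
    s = total & MASK16
    # Signed overflow: both inputs share a sign that differs from the result's.
    overflow = ((~(a ^ b)) & (a ^ s) & 0x8000) != 0
    return s, total >> 16, int(overflow)

def add(a, b, halfword_pair):
    """Add two 32-bit operands as one word or as two independent halfwords."""
    lo, cl, vl = _add16(a & MASK16, b & MASK16, 0)
    link = 0 if halfword_pair else cl      # carry crosses the halves only in word mode
    hi, cu, vu = _add16((a >> 16) & MASK16, (b >> 16) & MASK16, link)
    result = (hi << 16) | lo
    if halfword_pair:
        flags = {"CU": cu, "CL": cl, "VU": vu, "VL": vl,
                 "ZU": int(hi == 0), "ZL": int(lo == 0)}
    else:
        flags = {"CU": cu, "VU": vu}
    return result, flags
```

Adding `0x00000001` to `0x0001FFFF` shows the difference: in word mode the carry out of the lower half ripples into the upper half, while in halfword pair mode it is only recorded in CL.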

【００８２】Overflows that affect the overflow flags can result from adder arithmetic instructions and from MAC scalar instructions. The overflow flags are set even if the executed instruction saturates the result. Once set, the condition codes remain unchanged until another instruction that can set the flags is executed.

【００８３】When an adder arithmetic instruction without saturation overflows and the error exception is enabled, an error exception request occurs. Separate signals are sent to the debug logic to indicate overflow with saturation and overflow without saturation.

【００８４】Barrel Shifter
Referring to FIG. 14, in one clock cycle the barrel shifter can shift all the bits of a word operand by up to 32 bit positions, either left or right, while rotating or inserting zero, the sign bit of the operand, or the upper carry flag (CU) of the adder. For halfword pair operations, in one clock cycle the shifter can shift both halfwords by up to 16 bit positions left or right, while rotating or inserting zero, the sign bits, or the carry flags of the adder (CU and CL).

【００８５】For a typical shift/rotate operation, barrel shifter 470 moves each bit in both source operand positions in the direction indicated by the operation. For each position shifted, barrel shifter 470 either rotates the end bit or inserts the sign bit, a carry flag (CU or CL), or zero, depending on the operation selected.

【００８６】For example, for a left rotation, the bits are shifted to the left. In word mode, bit 31 is shifted into bit 0. In halfword pair mode, bit 31 is rotated into bit 16 and bit 15 is rotated into bit 0. For a right shift with zero insertion, the bits are shifted to the right. In word mode, a zero is inserted at bit 31. In halfword pair mode, zeros are inserted at both bit 31 and bit 15. Similarly, for a shift with carry propagation, the carry flag (CU) is inserted at bit 31 in word mode. In halfword pair mode, the carry flags of the two halfwords (CU and CL) are inserted at bit 31 and bit 15, respectively.
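The two modes of the left rotation, and of the case in which zeros are inserted on the right-going move, can be sketched as follows. The function names are hypothetical, and the shift count is passed as a plain argument rather than decoded from an instruction.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def rotl(value, count, width):
    """Rotate the low `width` bits of `value` left by `count` positions."""
    mask = (1 << width) - 1
    count %= width
    value &= mask
    return ((value << count) | (value >> (width - count))) & mask

def rotate_left(operand, count, halfword_pair):
    """Word mode: bit 31 wraps to bit 0. Pair mode: 31 -> 16 and 15 -> 0."""
    if not halfword_pair:
        return rotl(operand, count, 32)
    return (rotl(operand >> 16, count, 16) << 16) | rotl(operand, count, 16)

def shift_right_zero(operand, count, halfword_pair):
    """Move bits right, inserting zeros at bit 31 (and at bit 15 in pair mode)."""
    if not halfword_pair:
        return (operand & MASK32) >> count
    hi = ((operand >> 16) & MASK16) >> count
    lo = (operand & MASK16) >> count
    return (hi << 16) | lo
```

In halfword pair mode no bit crosses the boundary between bit 15 and bit 16, which is what makes the two halves behave as separate vector elements.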

【００８７】Reference is now made to FIG. 15. The dual MAC unit consists of two MAC units, 520, 550, 570, 590 and 510, 540, 560, 580, which are integrally interconnected so that they can produce either two 16×16 products or one 16×32 product. Each MAC consists of a 16×16 multiply array 510, 520, an accumulating adder 560, 570, an accumulator register file 580, 590, and a scaler 591.

【００８８】Some exemplary instructions (multiply, accumulate, multiply and accumulate, universal halfword pair multiply, universal halfword pair multiply and accumulate, double multiply step, and double multiply and accumulate step) can be found in the instruction summary listed in FIGS. 28-36.

【００８９】Word operations can be executed in either MAC unit. Note that a "word" as used in the MAC unit is 16 bits, since the MACs currently perform 16×16 operations. A more convenient approach, however, is to describe operations in terms of a vector length of 1, 2, 4, or 8. A word operation in the MAC can thus be referred to as vector length 1, while a halfword pair operation would be vector length 2. The MAC that contains the destination accumulator is the one currently used to perform the operation.

【００９０】Halfword pair operations use both MAC units. The instruction specifies a particular accumulator as the destination accumulator; this is the addressed accumulator. The MAC containing the addressed destination accumulator performs the operation on the lower halfword pair elements, and the other ("corresponding") MAC performs the same operation on the upper halfword pair elements. The result from the corresponding MAC is stored in the corresponding accumulator; the addressed accumulator and the corresponding accumulator occupy the same relative positions in their respective register files.

【００９１】Double precision operations are performed on a halfword and a word, and are carried out by the two MACs combined as a double MAC. The "upper" MAC performs the most significant part of the computation and the "lower" MAC performs the least significant part.

【００９２】The MAC unit may support integer or fractional operands, and signed or unsigned operands.

【００９３】Accumulator Register File
The two MAC units are referred to as the upper MAC and the lower MAC. Each MAC has an accumulator register file consisting of four 40-bit guarded accumulator registers, for a total of eight accumulators in the ALU. Each guarded accumulator (AGn) consists of a 32-bit accumulator register (An) extended at its most significant end by an 8-bit guard register (Gn). FIG. 16 shows the layout of the accumulator register file.

【００９４】A halfword pair operand is accumulated in two accumulators. The lower element of the halfword pair is accumulated as a 40-bit number in one accumulator of either MAC. The upper element of the halfword pair is accumulated as a 40-bit number in the corresponding accumulator of the other MAC (FIG. 17 shows the corresponding addresses).

【００９５】Two accumulators are also used to store the result of a double precision step operation. The most significant part of the result is stored in the guarded accumulator AG of the upper MAC. The least significant part of the result is stored in the accumulator A of the lower MAC. The guard bits of the lower MAC accumulator are not used.

【００９６】Each accumulator has two addresses in the register space, referred to as the upper and lower accumulator addresses, or the upper and lower redundant addresses. (The assembly language names of these addresses for accumulator n are AnH and AnL, respectively.) The effect of using one address or the other depends on how the register is used in the instruction; these effects are detailed in the subsections below.

【００９７】The instruction formats (and the assembly language) provide several ways of addressing accumulators:

【００９８】・As elements of the register space. Each accumulator has an upper and a lower address in the range 111 to 127, with assembly language symbols ARnH and ARnL.

【００９９】・As accumulator operands. The instruction format takes a number in the range 0-7, and the corresponding assembly language symbol has the form An.

【０１００】・As accumulator operands with separate upper and lower addresses. The instruction field takes a value in the range 0-15, and the assembly language format is AnH or AnL.

【０１０１】Each of the eight guard registers has an address in the extended register space (160-167; the assembly language symbols have the form AGn).

【０１０２】The remaining subsections of this section specify how accumulators and guard registers are treated in instructions. There are a number of special cases, depending on whether the register is a source or a destination and on whether the elements of the operation are words or halfword pairs.

【０１０３】1. Accumulators as Word Source Operands
The upper accumulator address specifies the upper 32 bits of accumulator An as a fractional word operand, and the lower address specifies the lower 32 bits of An as an integer word operand. In the current version of the processor the accumulator is 32 bits long, so both addresses refer to the same 32 bits. The general processor architecture, however, allows for longer accumulators. The guard bits are ignored by instructions that use an accumulator (assembly language An) as a 32-bit source operand. If the instruction specifies a guarded accumulator (assembly language AGn), for example as the accumulating register or as input to the scaler, the guard bits are included in the 40-bit source operand.

【０１０４】The bus structure currently allows one accumulator register from each MAC to be used as an explicit source operand in any given instruction.

【０１０５】When an accumulator is selected as a source operand for a multiply operation, all 32 bits are presented by the accumulator. The instruction then selects, by means of the integer/fraction option, the lower or the upper halfword as input to the multiply array.

【０１０６】2. Accumulators as Halfword Pair Source Operands
Each element of a halfword pair is held in an accumulator as if it were a word operand. The two elements of a halfword pair are stored in corresponding accumulators in separate MACs. When they are used as accumulating registers or as inputs to the scaler within their respective MACs, they are used as 40-bit source operands.

【０１０７】Otherwise, the elements are assembled as the two halfwords of a halfword pair operand. If the halfword pair source operand is an upper accumulator address, the upper halfword of the accumulator is used for each element. If the lower accumulator address is used, the lower halfword is used. The addressed accumulator supplies the lower halfword and the corresponding accumulator supplies the upper halfword. Either MAC can supply either element of the halfword pair.

【０１０８】3. Accumulators as Double Precision Source Operands
Accumulators are used as double precision source operands only in double precision step operations. The addressed accumulator supplies the least significant 32 bits and the corresponding guarded accumulator supplies the most significant 40 bits.

【０１０９】4. Guard Registers as Source Operands
The 8-bit guard registers (Gx) can be accessed directly from the extended register space as sign-extended integers. If a guard register is the source operand of a halfword pair operation, the addressed guard becomes the least significant halfword operand and the corresponding guard becomes the most significant halfword operand. In both cases, the guard register is sign-extended to 16 bits.

【０１１０】5. Accumulators as Word Destination Operands
In word operations using the MAC, the 32-bit result of a multiply operation is stored in the destination accumulator and sign-extended through its guard register. The 40-bit result of an accumulating operation is stored in the destination guarded accumulator.

【０１１１】In other register-to-register instructions, the result is moved into the destination accumulator and sign-extended through its guard register.

【０１１２】6. Accumulators as Word Pair Destination Operands
In a load instruction that targets an accumulator and specifies a word pair data type conversion, the word from the lower memory address is loaded into the addressed accumulator, and the least significant byte of the word from the higher memory address is loaded into the guard register of that accumulator.

【０１１３】7. Accumulators as Halfword Pair Destination Operands
In halfword pair operations using the two MAC units, the result of each MAC is stored in its own accumulator file. The MAC containing the destination accumulator processes the lower halfword pair elements, and its 40-bit result is stored in its guarded accumulator (AG). The corresponding MAC processes the upper halfword pair elements, and its 40-bit result is stored in the corresponding guarded accumulator (AGC).

【０１１４】In other register-to-register instructions, the particular accumulator address selected for the destination accumulator determines how the result is stored. If the upper address is used, the least significant halfword is loaded into the most significant half of the selected accumulator, zero-extended to the right, and sign-extended through its guard register. The most significant halfword is loaded into the most significant half of the corresponding accumulator, zero-extended to the right, and sign-extended through its guard register. If the lower address is used, the least significant halfword is loaded into the least significant half of the selected accumulator and sign-extended through the most significant half of that accumulator and then through its guard register. The most significant halfword is loaded into the least significant half of the corresponding accumulator and sign-extended in the same way.
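The two placement rules for a halfword pair written by a register-to-register instruction can be modelled as below. This is a sketch under assumptions: each guarded accumulator is represented as a 40-bit integer (8 guard bits above 32 accumulator bits), and the function names and the `upper_address` flag are hypothetical stand-ins for the AnH/AnL choice.

```python
def store_halfword(halfword, upper_address):
    """Return the 40-bit guarded accumulator value produced by one element."""
    h = halfword & 0xFFFF
    sign = h >> 15
    if upper_address:
        acc32 = h << 16                              # zero-extended to the right
    else:
        acc32 = h | (0xFFFF0000 if sign else 0)      # sign-extended through the upper half
    guard = 0xFF if sign else 0x00                   # sign-extended through the guard
    return (guard << 32) | acc32

def store_halfword_pair(pair, upper_address):
    """Return (addressed AG, corresponding AG) for a 32-bit halfword pair."""
    return (store_halfword(pair & 0xFFFF, upper_address),
            store_halfword(pair >> 16, upper_address))
```

With the pair `0x80010002`, the lower element `0x0002` goes to the addressed accumulator and the upper element `0x8001` to the corresponding one; only the position inside each accumulator and the extension rule depend on the address chosen.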

【０１１５】8. Accumulators as Double Precision Operands
The least significant 32 bits of the result of a double precision multiply step operation are stored in the destination accumulator, and the most significant 40 bits are stored in the corresponding guarded accumulator. The guard bits of the destination accumulator are all set to zero.

【０１１６】9. Guard Registers as Destination Operands
When a guard register is the destination operand, the eight least significant bits of the result are stored in the addressed guard register. When a guard register is used as the destination operand of a halfword pair operation, the eight least significant bits of the result are stored in the addressed guard register, and the eight least significant bits of the upper halfword are stored in the corresponding guard register.

【０１１７】Multiply Array
Reference is now made to FIG. 15. The multiply array, or multiply unit, of each MAC produces a 32-bit product from two 16-bit inputs. Signed and unsigned inputs, and integer and fractional inputs, may be multiplied in any combination. For an integer input, the least significant halfword of the source operand is used. For a fractional input, the most significant halfword is used. FIG. 18 shows the input scaling and FIG. 19 shows the output scaling.

【０１１８】When two word operands, or one word and one immediate operand, are multiplied, only the MAC containing the destination accumulator is used. When two halfword pair (HP) operands, or one HP and one immediate operand, are multiplied, both MACs are used, and the MAC containing the destination accumulator multiplies the lower HP elements.

【０１１９】The two multiply arrays used together produce a 48-bit product from one 16-bit input and one 32-bit input, scaled according to FIG. 18. This product is scaled according to FIGS. 20 and 21.
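One way to see how two 16×16 arrays can yield a 48-bit product is the arithmetic identity sketched below. It treats all inputs as unsigned integers and ignores the scaling options of FIGS. 18, 20, and 21; the actual interconnection is the subject of FIGS. 11 and 12, so this shows only the underlying decomposition, with a hypothetical function name.

```python
def mul16x32(a16, b32):
    """Unsigned 16x32 multiply built from two 16x16 partial products."""
    a = a16 & 0xFFFF
    lower = a * (b32 & 0xFFFF)            # array working on the low halfword of b
    upper = a * ((b32 >> 16) & 0xFFFF)    # array working on the high halfword of b
    return (upper << 16) + lower          # combined result fits in 48 bits
```

Each partial product is at most 32 bits wide, and shifting the upper one by 16 bit positions before adding gives the full 48-bit result.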

【０１２０】Multiply Saturation
If -1.0 is multiplied by -1.0 (as 16-bit signed fractions) without accumulation, the result (+1.0) is saturated to prevent overflow into the guard bits. The largest positive number is placed in the accumulator (A) and the guard bits are set to zero. If the multiply instruction includes accumulation, the result is not saturated; instead the full result is accumulated and placed in the destination guarded accumulator.
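The saturation case can be reproduced numerically. The sketch assumes the conventional Q15 interpretation of a 16-bit signed fraction and a one-bit left shift to align the product as a 32-bit fraction; the actual output scaling is defined by FIG. 19, which is not reproduced here, and the function names are hypothetical.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def fractional_multiply(a, b):
    """Signed Q15 x Q15 -> Q31 as a signed integer, saturated to 32 bits
    (multiply without accumulation)."""
    product = (to_signed16(a) * to_signed16(b)) << 1
    return min(product, 0x7FFFFFFF)       # only -1.0 * -1.0 exceeds the range

def fractional_mac(acc40, a, b):
    """Multiply and accumulate: the unsaturated product is added into a
    40-bit guarded accumulator value."""
    product = (to_signed16(a) * to_signed16(b)) << 1
    return (acc40 + product) & 0xFF_FFFF_FFFF
```

Here `0x8000` encodes -1.0. Without accumulation the product is clamped to `0x7FFFFFFF`; with accumulation into an empty guarded accumulator the full value `0x00_80000000` is kept.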

【０１２１】Multiply Scaling
FIGS. 18, 19, 20, and 21 show the scaling of source operands and results for multiply operations. The tables show the assumed position of the radix point and the treatment of any sign bits.

【０１２２】FIG. 18 shows the scaling of source operands for multiply operations. FIG. 19 shows the scaling of 32-bit products. FIGS. 20 and 21 show the scaling of 48-bit products. (FIGS. 20(a) and (b) show the scaling of right-justified products in the lower and upper MAC, respectively; similarly, FIGS. 21(a) and (b) show the scaling of left-justified products.)

Accumulating Adder
Referring to FIG. 15, each MAC includes an accumulating adder that can add an input to an accumulator (or subtract an input from an accumulator). Possible inputs are a product from the multiply array, an immediate operand, an accumulator from either MAC, or a register containing a word or a halfword pair.

【０１２３】The accumulation initialization feature is controlled by the IMAC (inhibit MAC accumulation) bit of the status register (ST) (not shown). If an instruction that performs a multiply/accumulate operation is executed while the IMAC bit is true (=1), the destination accumulator is initialized to the input operand and the IMAC bit is reset to false (=0). (In effect, the destination accumulator is set to 0 before the input is accumulated.)

【０１２４】A similar initialization and rounding feature is controlled by the IMAR bit of the status register. If an instruction that performs an accumulating adder operation is executed while the IMAR bit is true, the accumulating register is replaced by the rounding coefficient, the destination accumulator is initialized to the input operand plus a round-up bit, and the IMAR bit is reset to false. The rounding coefficient is all zeros except for a 1 in the most significant bit of the lower halfword.

【０１２５】Some multiply instructions include a rounding option that is executed in the accumulating adder. The rounded result is placed in the upper halfword of the destination accumulator, and zero is placed in the lower halfword. The result should be regarded as having its radix point between the lower and upper halfwords: it is rounded to the nearest integer, and if the lower halfword equals one half (that is, its most significant bit is 1), it is rounded to the nearest even integer.
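The rounding rule can be written out as a short function. The sketch assumes that "the lower halfword equals one half" means the lower halfword is exactly `0x8000`, and it leaves out overflow and saturation handling; the function name is hypothetical.

```python
def round_to_upper_halfword(acc32):
    """Round a 32-bit value to its upper halfword, ties to even,
    and clear the lower halfword."""
    acc32 &= 0xFFFFFFFF
    upper, lower = acc32 >> 16, acc32 & 0xFFFF
    # Round up when above one half, or exactly one half with an odd upper part.
    if lower > 0x8000 or (lower == 0x8000 and upper & 1):
        upper += 1
    return (upper << 16) & 0xFFFFFFFF
```

For values that are not exact ties this matches adding the rounding coefficient `0x8000` and truncating; the tie case is where the even-integer rule changes the outcome.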

【０１２６】Overflows of the accumulating adder do not set the overflow flags. When an overflow occurs in an accumulating instruction that has the saturation option, the guarded accumulator is set to its largest positive number or its most negative number, according to the direction of the overflow. If the instruction does not specify saturation and the error exception is enabled, the overflow causes an error exception request. Separate signals are sent to the debug logic for overflow with saturation and overflow without saturation.

【０１２７】FIG. 22 shows a word or accumulator operand being added to an accumulating register. FIG. 23 shows a halfword pair operand (from a register or an accumulator) being added to a halfword pair in the accumulating registers.

【０１２８】FIG. 24 shows a product being added to an accumulating register. FIG. 25 shows a halfword pair product being added to a halfword pair in the accumulating registers.

【０１２９】FIG. 26 shows a 48-bit product being accumulated using the right-justify option. This option applies to a 16×32 product for which an integer result is desired, or to the first step of a 32×32 product.

【０１３０】FIG. 27 shows a 48-bit product being accumulated using the left-justify option. This option applies to a 16×32 product for which a fractional result is desired, or to the second step of a 32×32 product.

【０１３１】FIGS. 28-36 are a summary of the instructions for operations that may be implemented in accordance with the space vector data path of the present invention.

【０１３２】Scaler
Referring to FIG. 15, the scaler unit can perform a right barrel shift of 0 to 8 bit positions on the full length of a guarded accumulator. The most significant guard bit is propagated into the vacated bits.

【０１３３】An overflow occurs during a scaler instruction if the guard bits and the most significant bit of the result do not all agree. (If these bits agree, the sign bit of the accumulator has propagated through the entire guard register, and no accumulator overflow into the guard bits has occurred.)

Scaler instructions support an option to saturate the result when an overflow occurs. In this case the result is set to the largest positive number, or to the most negative number plus one least significant bit, depending on the direction of the overflow. (The most significant guard bit indicates whether the original number was positive or negative.)

When an overflow occurs and saturation is not specified, an error exception is raised if the error exception is enabled. Overflow without saturation and overflow with saturation are reported to the debug logic by separate signals.
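The scaler's shift, overflow test, and saturation can be sketched as follows. Several details are assumptions: the result is taken to be 32 bits wide, the negative saturation value is read as the most negative 32-bit number plus one least significant bit (`0x80000001`), and the function names and the `saturate` argument are hypothetical.

```python
def to_signed40(x):
    """Interpret the low 40 bits of x as a two's complement integer."""
    x &= 0xFF_FFFF_FFFF
    return x - (1 << 40) if x >> 39 else x

def scale(ag40, shift, saturate=True):
    """Arithmetic right shift of a 40-bit guarded accumulator by 0..8 bits.
    Returns (32-bit result, overflow flag)."""
    assert 0 <= shift <= 8
    value = to_signed40(ag40) >> shift        # top guard bit fills the vacated bits
    # Guard bits and result MSB all agree exactly when the value fits in 32 bits.
    overflow = not (-(1 << 31) <= value < (1 << 31))
    if overflow and saturate:
        value = 0x7FFFFFFF if value > 0 else -0x7FFFFFFF
    return value & 0xFFFFFFFF, overflow
```

A value such as `0x01_00000000` overflows when it is not shifted, but fits once it has been scaled down by 8 bit positions, which is the situation the normalization sequence in the specification exploits.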

【０１３４】The move scaled accumulator to register (MAR) instruction may be used to normalize an accumulator. To normalize accumulator An:

    MAR Rx, AnH, #8          ; scale AGn by 8 bits
    MEXP Rc, Rx              ; measure the exponent
    SUBRU.W.SAT Rc, Rc, #8   ; calculate the number of shifts needed to normalize
    MAR Rx, AnH, Rc          ; normalize the contents of the accumulator

After this sequence, Rc contains the number of shifts needed to normalize the guarded accumulator, and Rx contains the normalized result.

【０１３５】Although the present invention has been described with reference to FIGS. 1-36, it will be understood that its teachings may be applied to a variety of processing schemes as determined by those skilled in the art.

[Brief description of drawings]

【図１】（ａ）は、従来の単一命令、多重データ（ＳＩ
ＭＤ）コンピュータの概念的な図である。（ｂ）はＳＩ
ＭＤコンピュータに用いられる処理素子の単純な図であ
る。FIG. 1(a) is a conceptual diagram of a conventional single-instruction, multiple-data (SIMD) computer. FIG. 1(b) is a simple diagram of a processing element used in a SIMD computer.

【図２】この発明を組込むであろうプログラマブルプロ
セッサの一般化された図である。FIG. 2 is a generalized diagram of a programmable processor that will incorporate the present invention.

【図３】（ａ）は、処理ユニットのためのＡＬＵに組込
まれるであろう従来の加算器の模式図である。（ｂ）お
よび（ｃ）は、この発明を実現するであろう加算器の模
式図である。FIG. 3(a) is a schematic diagram of a conventional adder that may be incorporated into an ALU for a processing unit. FIGS. 3(b) and 3(c) are schematic diagrams of adders that may implement the present invention.

【図４】（ａ）は、処理ユニットのためのＡＬＵに組込
まれるであろう従来の論理ユニットの模式図である。
（ｂ）および（ｃ）は、この発明を実現するであろう論
理ユニットの模式図である。FIG. 4(a) is a schematic diagram of a conventional logic unit that may be incorporated into an ALU for a processing unit. FIGS. 4(b) and 4(c) are schematic diagrams of logic units that may implement the present invention.

【図５】（ａ）および（ｂ）は、この発明を実連するで
あろう従来のシフタの模式図である。FIGS. 5(a) and 5(b) are schematic diagrams of a conventional shifter that may implement the present invention.

【図６】この発明を組込むであろうシフタの図である。FIG. 6 is a diagram of a shifter that may incorporate the present invention.

【図７】この発明を組込むであろうシフタの図である。FIG. 7 is a diagram of a shifter that may incorporate the present invention.

【図８】この発明を組込むであろうシフタの図である。FIG. 8 is a diagram of a shifter that may incorporate the present invention.

【図９】従来の乗算累算器（ＭＡＣ）の単純な図であ
る。FIG. 9 is a simple diagram of a conventional Multiply Accumulator (MAC).

【図１０】ＭＡＣがこの発明をどのように組込み得るか
を示す図である。FIG. 10 is a diagram showing how a MAC may incorporate the present invention.

【図１１】ＭＡＣが３２×１６モードでこの発明をどの
ようにして組込み得るかを示す図である。FIG. 11 is a diagram showing how a MAC may incorporate the present invention in a 32 × 16 mode.

【図１２】３２×１６モードのためのＭＡＣ内の相互接
続を示す図である。FIG. 12 shows the intra-MAC interconnection for 32 × 16 mode.

【図１３】この発明を組込む処理素子の単純な機能図で
ある。FIG. 13 is a simple functional diagram of a processing element incorporating the present invention.

【図１４】この発明を組込むＡＬＵおよびシフタの単純
な図である。FIG. 14 is a simple diagram of an ALU and shifter incorporating the present invention.

【図１５】デュアルＭＡＣ構成を示す図である。FIG. 15 is a diagram showing a dual MAC configuration.

【図１６】累算器レジスタファイルのレイアウトを示す
図である。FIG. 16 is a diagram showing a layout of an accumulator register file.

【図１７】（ａ）および（ｂ）は、対応する累算器アド
レスを示す図である。FIGS. 17(a) and 17(b) are diagrams showing the corresponding accumulator addresses.

【図１８】乗算演算のためのソースオペランドおよび結
果のスケーリングを示す図である。FIG. 18 is a diagram showing source operands for multiplication operations and scaling of results.

【図１９】乗算演算のためのソースオペランドおよび結
果のスケーリングを示す図である。FIG. 19 is a diagram showing source operands for multiplication operations and scaling of results.

【図２０】（ａ）および（ｂ）は、乗算演算のためのソ
ースオペランドおよび結果のスケーリングを示す図であ
る。FIGS. 20(a) and 20(b) are diagrams showing source operand and result scaling for multiplication operations.

【図２１】（ａ）および（ｂ）は、乗算演算のためのソ
ースオペランドおよび結果のスケーリングを示す図であ
る。FIGS. 21(a) and 21(b) are diagrams showing source operand and result scaling for multiplication operations.

【図２２】累算器レジスタに加算されるワードまたは累
算器オペランドを示す図である。FIG. 22 illustrates a word or accumulator operand added to an accumulator register.

【図２３】累算レジスタでハーフワード対に加算される
ハーフワード対オペランドを示す図である。FIG. 23 illustrates a halfword pair operand that is added to a halfword pair in an accumulation register.

【図２４】累算レジスタに加算される積を示す図であ
る。FIG. 24 is a diagram showing products added to an accumulation register.

【図２５】累算レジスタでハーフワード対に加算される
ハーフワード対の積を示す図である。FIG. 25 is a diagram showing a product of a halfword pair added to a halfword pair in an accumulation register.

【図２６】右寄せオプションを用いて累算される４８ビ
ットの積を示す図である。FIG. 26 is a diagram illustrating a 48-bit product accumulated using the right justification option.

【図２７】左寄せオプションを用いて累算される４８ビ
ットの積を示す図である。FIG. 27 shows a 48-bit product accumulated using the left justification option.

【図２８】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 28 is a diagram showing a summary of instructions that may be implemented in accordance with the present invention.

【図２９】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 29 shows a summary of instructions that may be implemented in accordance with the present invention.

【図３０】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 30 shows a summary of instructions that may be implemented in accordance with the present invention.

【図３１】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 31 is a diagram showing a summary of instructions that may be implemented in accordance with the present invention.

【図３２】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 32 is a diagram showing a summary of instructions that may be implemented in accordance with the present invention.

【図３３】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 33 is a diagram showing a summary of instructions that may be implemented in accordance with the present invention.

【図３４】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 34 shows a summary of instructions that may be implemented in accordance with the present invention.

【図３５】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 35 is a diagram showing a summary of instructions that may be implemented in accordance with the present invention.

【図３６】この発明に従って実現されるであろう命令の
まとめを示す図である。FIG. 36 illustrates a summary of instructions that may be implemented in accordance with the present invention.

[Explanation of symbols]

１００プログラムおよびデータ記憶ユニット１１０処理ユニット１２１ＡＬＵ１２２ＭＡＣ１２３シフタ１２４論理ユニット１３０命令収集ユニット１４０命令フェッチ／デコーダ／シーケンスユニット
100: program and data storage unit; 110: processing unit; 121: ALU; 122: MAC; 123: shifter; 124: logic unit; 130: instruction collection unit; 140: instruction fetch/decoder/sequence unit

フロントページの続き (72)発明者ケニス・イー・ギャレイアメリカ合衆国、92714 カリフォルニア州、アーバイン、フレンズ・コート、 17531 (72)発明者ジョージ・エイ・ワトソンアメリカ合衆国、92635 カリフォルニア州、フラートン、ツリービュー・プレイス、2952 (72)発明者ジョン・アールアメリカ合衆国、92680 カリフォルニア州、タスティン、ウィリアムズ・ストリート、15512−ピー
Continuation of front page (72) Inventor Kenneth E. Garey, 17531 Friends Court, Irvine, California 92714, United States (72) Inventor George A. Watson, 2952 Treeview Place, Fullerton, California 92635, United States (72) Inventor John Earle, 15512-P Williams Street, Tustin, California 92680, United States

Claims

[Claims]

1. A programmable processor for multiple data path processing of at least one operand, each operand comprising at least one element, the processor executing instructions in a predetermined sequence as determined by instruction fetch/decode/sequence means, the programmable processor comprising: a) mode means, coupled to the instruction means, for identifying for each instruction whether the at least one operand is processed in one of a vector mode and a scalar mode; and b) a processing unit coupled to the mode means, the processing unit receiving the at least one operand and, in response to an instruction as identified by the mode means, processing the at least one operand in one of the vector and scalar modes, wherein the vector mode indicates to the processing unit that there are multiple elements in the operand, and the scalar mode indicates to the processing unit that there is one element in the operand.

2. The programmable processor, wherein the processing unit comprises: a) first vector means, responsive to an instruction from the mode means, for processing each respective element in the at least one operand concurrently and in parallel, to provide an independent operation for each respective element in the vector mode; b) second vector means, responsive to the instruction from the mode means, for processing a first element in the at least one operand in selective combination with at least one second element in the operand in the vector mode; and c) scalar means, responsive to the instruction from the mode means, for processing each respective part of the operand to produce a respective partial result, and for deriving a scalar result in the scalar mode by combining the respective partial results.

3. The programmable processor of claim 2, wherein the first vector means and the scalar means comprise at least one of a) a plurality of multiply-accumulators, b) a plurality of shifters, c) a plurality of arithmetic units, and d) a logic unit, each processing one of a respective element in the vector operand and a respective portion of the scalar operand.

4. The programmable processor of claim 3, wherein the scalar means performs conditional moves, and the second vector means and the scalar means perform conditional branching based on said selective combination of said first and second elements in said operand in said vector mode.

5. The programmable processor according to claim 1, wherein the processing unit comprises: a) a plurality of adders operating in one of the vector and scalar modes, each adder of the plurality of adders receiving an element from a vector operand specified by the mode means and processing it individually, and the plurality of adders receiving a scalar operand specified by the mode means and processing it together; and b) adder control means, coupled to the adders, for sending a carry status between the adders in the scalar mode so that the plurality of adders process the scalar operand as one adder.

6. The programmable processor according to claim 5, wherein the adder control means further sends an overflow status between the adders in the scalar mode so that the plurality of adders process the scalar operand as one adder.

7. The programmable processor according to claim 1, wherein the processing unit comprises: a) a plurality of multiply-accumulators (MACs) operating in one of the vector and scalar modes, each MAC receiving an element of a vector operand as specified by the mode means and processing it separately, concurrently and in parallel, and the plurality of MACs receiving a scalar operand specified by the mode means and processing it together; and b) MAC control means, coupled to the MACs and responsive to the mode means, for operating the plurality of MACs independently of each other in the vector mode and together in the scalar mode.

8. The programmable processor according to claim 1, wherein the processing unit comprises: a) a plurality of logic units operating in one of the vector and scalar modes, each logic unit of the plurality of logic units receiving an element of a vector operand identified by the mode means and processing it separately, concurrently and in parallel, and the plurality of logic units receiving a scalar operand specified by the mode means and processing it together; and b) logic control means, coupled to the plurality of logic units, for sending a zero status between the logic units in the scalar mode so that the plurality of logic units process the scalar operand as one logic unit.

9. The programmable processor of claim 1, wherein the processing unit comprises: a) a plurality of shifters that selectively operate as one integrated shifter in the scalar mode and as a plurality of shifters in the vector mode, each of the shifters, responsive to the mode means in a first mode of operation, receiving an element from the identified vector operand and processing it separately in parallel, and the plurality of shifters, responsive to the mode means in a second mode of operation, receiving a scalar operand and processing it together; and b) shifter control means, coupled to the shifters, for sending shifted operand bits between the shifters in the scalar mode so that the plurality of shifters process the scalar operand, the shifter control means disabling the sending of shifted operand bits from each of the plurality of shifters in the vector mode.

10. The programmable processor of claim 1, further comprising comparison means for evaluating the condition of the operand in one of scalar and vector modes.

11. A programmable processor according to claim 10, wherein said comparing means evaluates the condition of each operand to modify the sequence of instruction execution.

12. The programmable processor of claim 10, wherein a first operand is conditionally moved from a first storage location to a second storage location based on said comparison means, said comparison means including a plurality of sub-comparators, each comparing corresponding elements in second and third operands to determine whether the corresponding element in the first operand is moved.

13. The programmable processor of claim 1, wherein the mode means is included as a field within each instruction so that each instruction specifies one of a vector and a scalar mode on an instruction-by-instruction basis.

14. The programmable processor of claim 13, wherein the mode means is included as a bit field within each instruction.

15. An arrangement for performing multiple-data digital signal processing in a general-purpose computer having a data memory for storing operands, each operand having at least one element, an instruction memory for storing instructions for execution, instruction means, and a plurality of arithmetic logic units (ALUs), the arrangement comprising: a) mode means, coupled to the instruction memory and the instruction means, for specifying in each instruction whether an operand is processed by the processing unit in one of a vector mode and a scalar mode; b) ALU control means, responsive to the mode means, for selectively operating the ALUs together as one unit in a first mode, and independently as separate operation units in a second mode in the case of a vector operand; and c) carry condition means, coupled to the ALU control means and the ALUs, for selectively sending a carry condition between the ALUs for a scalar operand and ignoring the carry condition of each of the ALUs for a vector operand.

16. An arrangement for performing multiple-data digital signal processing in a general-purpose computer having a data memory for storing operands, each having at least one element therein, an instruction memory for storing instructions for execution, instruction means, and a first multiply-accumulator (MAC), the arrangement comprising: a) mode means, coupled to the instruction memory and the instruction means, for specifying in each instruction whether the operands are processed by the processing unit in one of a vector mode and a scalar mode; b) a plurality of MACs; and c) means, coupled to each of the first and the plurality of MACs and responsive to said mode means, for selectively operating the first and the plurality of MACs independently of each other in the vector mode and together in the scalar mode.

17. The arrangement of claim 15, wherein processing of operands by an ALU is modified based on one of: a) a plurality of sets of condition codes, each set of condition codes being associated with a respective independent element in the operand; b) multiple sets of condition codes in selective combination; and c) the one set of condition codes for the scalar operand.

18. The arrangement of claim 15, wherein the sequence of execution of the instructions is modified from a first instruction to a second instruction based on one of: a) each set of condition codes associated with each independent element in the operand; and b) multiple sets of condition codes in selective combination.

19. The signal processor of claim 18, wherein the operands are selectively moved from a first storage location to a second storage location based on one of: a) each set of condition codes associated with each independent element in the operand; b) multiple sets of condition codes in selective combination; and c) the set of condition codes for the scalar operand.

20. The arrangement of claim 15, further comprising: a) a plurality of shifters that selectively operate as one integrated shifter in the scalar mode and as a plurality of shifters in the vector mode, each of the plurality of shifters, responsive to said mode means in a first mode of operation, receiving an element from a vector operand as specified and processing it independently, and the plurality of shifters, responsive to the mode means in a second mode of operation, receiving a scalar operand and processing it together; and b) shifter control means, coupled to the shifters, for sending shifted operand bits between the shifters in the scalar mode so that the plurality of shifters process the scalar operand, and for disabling the sending of shifted operand bits from each of the plurality of shifters in the vector mode.

21. A programmable processor for computing multiple data paths using a general-purpose computer, the general-purpose computer comprising a data memory for storing operands, a memory access bus for transferring operands from the data memory, an instruction memory for storing instructions for execution, and instruction means, coupled to the instruction memory, for fetching, decoding and sequencing instructions, the programmable processor comprising: a) mode means, coupled to the instruction means, for specifying in each instruction whether an operand from the data memory is processed in one of a single data path mode and a multiple data path mode; b) data paths, each data path including an arithmetic unit and a multiply-accumulator (MAC); c) arithmetic control means, responsive to the mode means, for selectively operating the arithmetic units of the data paths together as a single unit in one mode in the case of a scalar operand, and independently as separate arithmetic units in a different mode in the case of a vector operand; d) carry condition means, coupled to the arithmetic control means and the arithmetic unit on each path, for selectively sending a carry condition between the arithmetic units in the case of a scalar operand and disabling the carry condition of each arithmetic unit in the case of a vector operand; and e) MAC control means, coupled to each MAC and responsive to the mode means, for selectively operating the MACs independently of each other in the vector mode and together in the scalar mode.

22. The programmable processor of claim 21, further comprising: a) a plurality of shifters that selectively operate as one integrated shifter in the scalar mode and as a plurality of shifters in the vector mode, each of the plurality of shifters, responsive to said mode means in a first mode of operation, receiving an element from a specified vector operand and processing it independently, and the plurality of shifters, responsive to the mode means in a second mode of operation, receiving a scalar operand and processing it together; and b) shifter control means, coupled to the shifters, for sending shifted operand bits between the shifters in the scalar mode so that the plurality of shifters process the scalar operand, the shifter control means disabling the sending of shifted operand bits from each of the plurality of shifters in the vector mode.

23. A method of digital signal processing using a programmable processor via multiple data paths, wherein the programmable processor operates on at least one operand, each having at least one element, and has a plurality of sub-processing units, the method comprising the steps of: a) supplying an instruction from a predetermined sequence of instructions to be executed by the programmable processor; b) specifying, by the instruction, one of a scalar mode and a vector mode of processing by the programmable processor on the at least one operand, the scalar mode indicating to the programmable processor that there is one element in the at least one operand, and the vector mode indicating to the programmable processor that there are multiple sub-elements in the at least one operand; c) in the scalar mode, each sub-processing unit of the programmable processor, responsive to the instruction, receiving and processing a respective portion of the operand to produce partial and intermediate results; d) each sub-processing unit sending its intermediate result among the plurality of sub-processing units and combining its partial result with those of the other sub-processing units to generate a final result for the operand; e) generating a first condition code corresponding to the final result; f) in the vector mode, each sub-processing unit of the programmable processor, responsive to the instruction, receiving and processing a respective sub-element of the plurality of sub-elements in the operand to produce partial and intermediate results, each intermediate result being disabled and each partial result representing the final result for its corresponding element; and g) generating a plurality of second condition codes corresponding to the respective independent results.

24. A programmable processor for computing multiple data paths via a general-purpose computer, the general-purpose computer including a data memory for storing operands, an instruction memory for storing program instructions, and instruction means, the programmable processor comprising: mode means, coupled to the instruction means, for identifying whether an operand from the data memory is processed in one of a vector mode and a scalar mode, the vector mode identifying a plurality of elements in each operand and the scalar mode identifying a single element in the operand; a plurality of processing units coupled to the mode means and the data memory, each processing unit receiving a respective element of an operand and processing it to obtain a partial result and propagation information; vector means, operating in the vector mode and coupled to the processing units, for sending each partial result as the final result of the processing of each element and ignoring the propagation information; and scalar means, operating in the scalar mode and coupled to the processing units, for combining the partial results and propagation information to obtain the final result of processing each operand.

25. The processor according to claim 24, wherein each processing unit comprises a set of condition codes for storing a processing condition, the sets of condition codes modifying the processing of the programmable processor by one of: a) each set individually in a first vector mode; b) each set in selective combination with another set in a second vector mode; and c) all sets combined for a scalar operand in said scalar mode.

26. The processor of claim 25, wherein each processing unit comprises at least one of: a) an arithmetic unit; b) a multiply-accumulator; c) a logical operator; and d) a barrel shifter.

27. The programmable processor of claim 1, wherein the mode means is specified by a bit field in the at least one operand.

28. The programmable processor according to claim 1, wherein the mode means is specified by the way the at least one operand is selected.

29. The programmable processor according to claim 8, wherein the mode means comprises a bit field located in a memory location that further specifies the address of the at least one operand.

30. The programmable processor according to claim 2, further comprising third vector means, responsive to the mode means, for processing each respective element of the first operand in vector mode with the second operand in scalar mode.