JP2004342106A - Modular binary multiplier for signed and unsigned operands of variable width

Info

Publication number: JP2004342106A
Application number: JP2004141968A
Authority: JP
Inventors: Fadi Y Busaba; Steven R Carlough; David S Hutton; Christopher A Krygowski; John G Rell Jr; Sheryll H Veneracion
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2003-05-12
Filing date: 2004-05-12
Publication date: 2004-12-02
Anticipated expiration: 2024-05-12
Also published as: US20040230631A1; JP3891997B2; US20070233773A1; US7266580B2; US7853635B2; US20070214205A1; US7490121B2

Abstract

PROBLEM TO BE SOLVED: To achieve a shorter CPI (cycles per instruction) for multiply instructions in the design of a binary multiplier intended for use with the functions found in a general processor environment, in the field of arithmetic and logic techniques in computer and processor architectures.

SOLUTION: The concept may be split into two parts. The first is the multiplication hardware itself: a compact, less-than-full-sized multiplier that applies Booth or another type of recoding to the multiplier operand to reduce the number of partial products per scan. It is implemented so that a multiplication with large operands can be broken into subgroups of operations that fit into this mid-sized multiplier, whose results, here called modular products, can be knitted back together to form a correct final product. The second part is the supporting hardware used to separate the operands into subgroups and to feed data and control signals to the multiplier, together with the algorithm and apparatus used to align and combine the modular products properly to obtain the final product.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to the field of arithmetic and logic techniques in computer and processor architectures and, in particular, to the design of binary multipliers for use with the functions found in common processor environments.

Binary multiplication is the subset of multiplication that deals only with integers, both signed and unsigned, so that its operands and results can be represented exactly in binary. The simplest method of binary multiplication imitates the way a person multiplies by hand: the multiplicand is processed against the multiplier one multiplier digit at a time to produce partial products, and those partial products are summed to produce the final product. An example of the manual method applied to binary numbers is shown below, with the same multiplication in decimal in the right-hand column.

Unsigned multiplication (4 bits x 4 bits -> 8 bits for complete result representation)

    1110   Multiplicand   14
    0111   Multiplier     07
    ----                  --
    1110   pp1            98
   1110    pp2            00
  1110     pp3            --
 0000      pp4
--------
01100010   Product        98
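By way of illustration only, the shift-and-add procedure of the unsigned example can be sketched in Python as follows. The function name `shift_and_add` and its `width` parameter are assumptions made for this sketch and do not appear in the disclosure.

```python
def shift_and_add(multiplicand: int, multiplier: int, width: int = 4):
    """Unsigned multiply that forms one partial product per multiplier bit."""
    partial_products = []
    for bit in range(width):
        # Each multiplier bit selects either 0 or the multiplicand,
        # weighted by the bit position (a left shift).
        selected = multiplicand if (multiplier >> bit) & 1 else 0
        partial_products.append(selected << bit)
    return partial_products, sum(partial_products)


pps, product = shift_and_add(0b1110, 0b0111)
for index, pp in enumerate(pps, start=1):
    print(f"pp{index} = {pp:08b}")
print(f"product = {product:08b} ({product})")  # product = 01100010 (98)
```

One partial product is produced per multiplier bit, which is why a one-bit-per-cycle implementation needs as many cycles as the multiplier has bits.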

This method also applies to signed multiplication. With two's complement representation of negative values and sign extension, however, no additional processing is needed to handle the sign of the result.

Signed multiplication (operands sign-extended to 8 bits; 4 bits x 4 bits -> 8 bits)

1111 1110   Multiplicand   -2
0000 0111   Multiplier     +7
---------                  --
1111 1110   pp1           -14
1111 110    pp2
1111 10     pp3
0000 0      pp4
0000        pp5
000         pp6
00          pp7
0           pp8
---------
1111 0010   Product       -14
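The signed example can be checked with a short Python sketch, given here only as an illustration. The helper names `sign_extend`, `signed_multiply`, and `to_signed` are assumptions introduced for the sketch. Both operands are sign-extended to the result width, and every partial sum is truncated to that width, as in the 8-bit layout above.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Return the to_bits-wide two's complement pattern of a from_bits value."""
    if (value >> (from_bits - 1)) & 1:
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value & ((1 << to_bits) - 1)


def signed_multiply(a: int, b: int, in_bits: int = 4, out_bits: int = 8) -> int:
    """Multiply two in_bits-wide two's complement patterns, keeping out_bits."""
    mask = (1 << out_bits) - 1
    a_ext = sign_extend(a, in_bits, out_bits)
    b_ext = sign_extend(b, in_bits, out_bits)
    total = 0
    for bit in range(out_bits):
        if (b_ext >> bit) & 1:
            # Bits shifted beyond the result width are simply dropped.
            total = (total + (a_ext << bit)) & mask
    return total


def to_signed(pattern: int, bits: int) -> int:
    """Interpret a bits-wide pattern as a two's complement integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


result = signed_multiply(0b1110, 0b0111)       # -2 x +7
print(f"{result:08b}", to_signed(result, 8))   # 11110010 -14
```

No step in the loop inspects the sign of either operand; the correct sign falls out of the truncated two's complement arithmetic.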

This multiplication method is very simple and relatively easy to implement in hardware using a shifter, an adder, and a product accumulator. However, if one cycle is needed to process each multiplier bit and generate its partial product, an operation with an n-bit multiplier takes as many as n cycles to complete. In today's high-speed computing, such a long cycles-per-instruction (CPI) time is considered prohibitive. One way to achieve a shorter CPI for multiply instructions is to use additional hardware to compute the partial products in groups at once and to build the adders needed to process them simultaneously. This brute-force approach of throwing hardware at the problem shortens the CPI but also increases the chip area dedicated to the multiply function. The adders in particular are difficult to handle given the area and timing constraints that usually accompany a functional specification. Many methods have therefore been devised to shrink the adder by processing several multiplier bits at a time, which reduces the number of partial products. One of the more common methods is Booth's recoding algorithm.

Booth's recoding algorithm reduces the number of partial products generated from a given n-bit multiplier by scanning multiple bits. It is based on the observation that a string of binary 1s, in which the least significant '1' bit has the weight 2^n and the string is z bits long, can also be expressed as 2^(n+z) - 2^n. For example, the string 0b0111 can be expressed as 2^3 - 2^0 = 7, and the string 0b1110 can be expressed as 2^4 - 2^1 = 14.
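The identity can be confirmed numerically. The following Python sketch is illustrative only, and the helper name `string_of_ones` is an assumption of the sketch. It builds a string of z ones whose lowest 1 sits at bit position n and compares it with 2^(n+z) - 2^n.

```python
def string_of_ones(n: int, z: int) -> int:
    """Bit pattern with z consecutive 1s whose lowest 1 is at position n."""
    return ((1 << z) - 1) << n


for n, z in [(0, 3), (1, 3)]:
    pattern = string_of_ones(n, z)
    assert pattern == 2 ** (n + z) - 2 ** n
    print(f"0b{pattern:04b} = 2^{n + z} - 2^{n} = {pattern}")
```

A run of z additions is thereby replaced by one addition and one subtraction, which is the saving that the recoding exploits.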

In the example above, each bit has the weight 2^n, where n is the position of that bit. Strings of 1s are detected by overlapping the groups of scanned multiplier bits by one bit. Applying this counting method to multiplication, in which the scanned value is the multiplier and each scan covers one bit plus an overlap bit, is straightforward. The bit at the end of a string (the least significant bit of the string), detected as a '1' bit whose overlap bit to the right is '0', is assigned the value -(2^n) * (multiplicand). The bit at the start of a string (the most significant bit of the z-bit string), detected as a '0' at a position whose overlap bit is '1', is assigned the value +(2^n) * (multiplicand). Bits in the middle of a string of 0s or 1s are assigned the value zero. This is summarized in the table below. In this table, the leftmost bit is the bit at position n of the string, and the rightmost bit is the overlap bit needed to detect the string. The "adjusted multiplicand value" column gives the multiplicand multiple; the weight of this value is implied by the position of the corresponding scan bit.
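The one-bit scan with an overlap bit can be sketched in Python as follows, purely as an illustration. The lookup `BOOTH_RADIX2` is the conventional radix-2 Booth assignment, which matches the description above; it is not copied from the table, which is not reproduced in this text. The function names are assumptions of the sketch, and the multiplier is read as a two's complement value.

```python
# (bit at position n, overlap bit to its right) -> multiplicand multiple
BOOTH_RADIX2 = {
    (0, 0): 0,    # middle of a string of 0s
    (0, 1): +1,   # first 0 past the top of a string of 1s
    (1, 0): -1,   # lowest 1 of a string of 1s
    (1, 1): 0,    # middle of a string of 1s
}


def booth_radix2_digits(multiplier: int, bits: int):
    """Recode a bits-wide two's complement multiplier, lowest digit first."""
    digits = []
    overlap = 0   # the overlap bit to the right of bit 0 is taken as 0
    for n in range(bits):
        bit = (multiplier >> n) & 1
        digits.append(BOOTH_RADIX2[(bit, overlap)])
        overlap = bit
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Sum digit * 2^n * multiplicand over all recoded digits."""
    digits = booth_radix2_digits(multiplier, bits)
    return sum(digit * (multiplicand << n) for n, digit in enumerate(digits))


print(booth_radix2_digits(0b0111, 4))   # [-1, 0, 0, 1], i.e. 2^3 - 2^0
print(booth_multiply(14, 0b0111, 4))    # 98
```

For the multiplier 0b0111, only two of the four digits are nonzero, so only two multiplicand multiples have to be added.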

The key to implementing Booth's recoding method to advantage is to increase the number of bits scanned as a group, which reduces the total number of scans the multiplier requires and, with it, the number of partial products and the hardware needed to combine them. A common scan group size is 3 bits, consisting of 2 scan bits and an overlap bit in the least significant position. It is popular because the multiplicand multiples needed for the recoding are simply 0x, ±1x, and ±2x, all of which are relatively easy to form using shifters, inverters, and two's complement techniques, whereas larger scan group sizes need an adder to form higher multiples such as ±3x.

A method of binary multiplication disclosed herein in one embodiment comprises: obtaining a multiplicand; obtaining a multiplier; if the multiplier exceeds a selected length, partitioning the multiplier into a plurality of multiplier subgroups; and, if the multiplicand exceeds a selected length, partitioning the multiplicand into a plurality of multiplicand subgroups and performing at least one of zeroing the unused bits of a multiplicand subgroup and sign-extending a smaller portion of a multiplicand subgroup. The method also includes establishing a plurality of multiplicand multiples based on at least one of the multiplicand and a selected multiplicand subgroup of the plurality of multiplicand subgroups; selecting one or more of the multiplicand multiples based on each multiplier subgroup of the plurality of multiplier subgroups; and generating a modular product based on the selecting.

A binary multiplier for a superscalar processor, disclosed herein in another embodiment, includes: a first register input configured to receive a first operand containing multiplicand data of at most m bits, where m comprises at least one of the full width of the registers of the processor and the maximum width of the operand data of a selected binary multiply instruction; and a second register input configured to receive a second operand comprising one of a multiplier value and a multiplier subgroup containing an n-bit segment of the multiplier value, where n is less than or equal to the maximum width of the operand data of the selected binary multiply instruction. The multiplier also includes: a control signal input configured to receive a first control signal indicating the size of the multiplicand data, which enables the multiplier to select the relevant bits from the multiplicand data input to form a valid multiplicand and to zero the unused bits of the multiplicand data input; and a second control signal input configured to receive a second control signal indicating whether the multiply operation is signed or unsigned, which enables the multiplier to sign-extend the valid multiplicand.

A system for binary multiplication in a superscalar processor, further disclosed herein in yet another embodiment, includes: a binary multiplier; a data-width shifter in operable communication with the binary multiplier, which shifts the multiplier data and the multiplicand data so that the selected subgroups are delivered to the binary multiplier and so that a modular product is properly aligned for combination with at least one of another modular product and an accumulated modular product; registers in operable communication with the shifter, which hold the multiplier data and the multiplicand data for input to the shifter; an adder in operable communication with the registers, which accumulates a modular product with at least one of another modular product and an accumulated modular product; and a plurality of registers in operable communication with the adder, which hold the modular product and the accumulated modular product for input to the adder and for output of the accumulated modular product.

A system for binary multiplication in a superscalar processor, further disclosed herein in yet another embodiment, includes a first pipeline comprising: a first register; a second register; a third register; an execution unit including a bit logic unit and a binary adder in operable communication with the first register, the second register, and a first multiplexer, the first multiplexer also being in operable communication with the third register; a first rotator in operable communication with the first register and the execution unit; and a leading zero detect register in operable communication with the execution unit and the second register. The system also includes a second pipeline comprising: a fourth register; a fifth register; a sixth register; a second execution unit including another bit logic unit and another binary adder in operable communication with the fourth register, the fifth register, and a second multiplexer, the second multiplexer also being in operable communication with the sixth register; a rotator in operable communication with the fourth register and the execution unit; and a leading zero detect register in operable communication with the second execution unit and the fifth register. The system further includes: a third pipeline comprising a seventh register, an eighth register, a ninth register, and a multiplier in operable communication with the seventh register and the eighth register; a general register for storing and retrieving data; an operand buffer for obtaining a first operand and a second operand; and a communication bus for communication among at least two of the first pipeline, the second pipeline, the third pipeline, the general register, and the operand buffer.

These and other improvements are set forth in the detailed description below. For a better understanding of the invention with its advantages and features, refer to the description and to the drawings.

The invention is described below by way of example with reference to the accompanying drawings. In the several figures, like elements are numbered alike.

Embodiments of the invention, together with their advantages and features, are described by way of example with reference to the drawings.

Disclosed herein are a method and a system architecture for multiplying binary numbers of various lengths, as signed or unsigned operands and with lossy or lossless results, using a specialized super Booth multiplier whose modularity makes it suitable for use with a variety of supporting arithmetic logic unit (ALU) hardware. The modularity of the architecture is driven by the need for a smaller multiplier that can handle larger operands efficiently. The architecture is therefore designed to work with the arithmetic and logic functions and the resources found in most processor environments. It takes an input smaller than the full data width as the multiplier operand and an input of the full data width or less as the multiplicand. It also includes an overlap bit signal, which provides the means of tying subgroups of the multiplication together so that they can ultimately be combined to produce the correct final product. The architecture further uses several additional signals to control and modify the incoming multiplicand data to reflect the type of operation, for example signed or unsigned, and to prepare the output, also called the modular product, so that it can be aligned and combined with other possible modular products.
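The idea of breaking a wide multiplication into subgroups and knitting the modular products back together can be sketched in Python, as an illustration only. The sketch handles the unsigned case without Booth recoding, assumes a 16-bit subgroup width, and uses Python's built-in `*` as a stand-in for the mid-sized multiplier. The names `GROUP_BITS`, `split_multiplier`, and `modular_multiply` are assumptions of the sketch.

```python
GROUP_BITS = 16   # assumed subgroup width for this sketch


def split_multiplier(multiplier: int, total_bits: int):
    """Split an unsigned multiplier into GROUP_BITS-wide subgroups, low group first."""
    mask = (1 << GROUP_BITS) - 1
    return [(multiplier >> shift) & mask for shift in range(0, total_bits, GROUP_BITS)]


def modular_multiply(multiplicand: int, multiplier: int, total_bits: int = 32) -> int:
    """Form one modular product per subgroup, then align and accumulate them."""
    product = 0
    for index, group in enumerate(split_multiplier(multiplier, total_bits)):
        modular_product = multiplicand * group                # narrow multiply
        product += modular_product << (GROUP_BITS * index)    # align, then accumulate
    return product


print(hex(modular_multiply(0x12345678, 0x9ABCDEF0)))
print(hex(0x12345678 * 0x9ABCDEF0))   # same value, computed directly
```

Each pass needs only a multiplier as wide as one subgroup; the shifts and the accumulation are the work that falls to the supporting hardware.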

Also shown herein is such supporting hardware for use with the multiplier embodiment, along with the accompanying algorithms created to implement multiplication in such an environment.

In one embodiment, a hardware architecture for the multiplier and its supporting hardware is disclosed. In another embodiment, a multiplication algorithm for processing with that hardware architecture is disclosed. The architecture of the modular binary multiplier is shown in FIG. 1.

Modular Binary Multiplier

The modular binary multiplier 10 includes, but is not limited to, six primary inputs. The first operand, denoted 1, is 64-bit data used as the multiplicand. The second operand, denoted 2, is 16-bit data used as the multiplier or as a multiplier subgroup. The overlap bit between multiplier subgroups, denoted 3 and also called Z_BIT, provides continuity from one multiplier subgroup to the next and is used for the string detection that Booth's recoding algorithm requires. The operation control signal denoted 4, also called MCAND_64, tells the binary multiplier 10 whether the operation uses a 32-bit or a 64-bit multiplicand 1. The "unsigned" control signal 5, also written UNSIGNED, prepares the binary multiplier 10 for signed or unsigned operation. The MTERM_SHIFT signal 6 shifts the modular product output of the binary multiplier 10 so that no additional cycle is needed to align it properly with other modular products. Table 2 outlines the decode logic for the two control signals MCAND_64 4 and UNSIGNED 5.

In Table 2:

[0,0] indicates a signed operation using a 32-bit multiplicand. Bits 32 through 63 of the operand data, the low word, are left unchanged, while bits 0 through 31, the high word, are sign-extended from the 32-bit multiplicand data.

[0,1] indicates an unsigned 32-bit operation. The low word of the multiplicand data is retained while the high word is zeroed, in case garbage data was supplied along with the relevant low word.

[1,0] indicates a signed 64-bit operation. This hardware architecture of the binary multiplier 10 treats it as an invalid configuration, because the multiplicand 1 input and the adder data width are both 64 bits, so no sign extension is possible and the modular product could be represented incorrectly.

[1,1] indicates an unsigned 64-bit operand, so the high word of the operand data is used as the high word of the multiplicand.
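The four [MCAND_64, UNSIGNED] cases can be sketched in Python as follows, for illustration only. The function name `effective_multiplicand` is an assumption of the sketch, and raising an exception for the invalid [1,0] case is a choice made for the sketch, not a behavior described in the text. Bit 0 is the most significant bit in the numbering used above, so the high word is the upper 32 bits of the Python integer.

```python
def effective_multiplicand(operand: int, mcand_64: int, unsigned: int) -> int:
    """Apply the [MCAND_64, UNSIGNED] decode to a 64-bit operand pattern."""
    low_word = operand & 0xFFFF_FFFF             # bits 32..63 in the text's numbering
    high_word = (operand >> 32) & 0xFFFF_FFFF    # bits 0..31 in the text's numbering
    if (mcand_64, unsigned) == (0, 0):
        # Signed 32-bit: replicate the low word's sign bit through the high word.
        high_word = 0xFFFF_FFFF if low_word >> 31 else 0
    elif (mcand_64, unsigned) == (0, 1):
        # Unsigned 32-bit: discard whatever arrived in the high word.
        high_word = 0
    elif (mcand_64, unsigned) == (1, 0):
        raise ValueError("signed 64-bit multiplicand is not a valid configuration")
    # (1, 1): unsigned 64-bit, the high word is used as supplied.
    return (high_word << 32) | low_word


operand = 0xDEADBEEF_80000001
print(hex(effective_multiplicand(operand, 0, 0)))   # 0xffffffff80000001
print(hex(effective_multiplicand(operand, 0, 1)))   # 0x80000001
print(hex(effective_multiplicand(operand, 1, 1)))   # 0xdeadbeef80000001
```

In every valid case the low word passes through untouched; only the high word is ever rewritten.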

The decode logic for MCAND_64 and UNSIGNED, and the required modification of the high word of the multiplicand 1, are shown as the multiplicand high-word conversion module 11. The result of the multiplicand high-word conversion module is merged at merge 12 with the unmodified low word of the operand data to produce the operational multiplicand data, denoted 1A, that the binary multiplier 10 uses. One embodiment uses a radix-4 form of Booth's algorithm with a 3-bit scan consisting of 2 scan bits and an overlap bit between groups. This form of Booth's algorithm is used because the required multiplicand multiples 0x, ±1x, and ±2x are relatively easy to form. Note that when the required multiple is +1x, the multiplicand data needs no modification to form the product. The operational multiplicand data 1A is therefore shown feeding the selection logic 17 directly as the +1x multiple.

Continuing with FIG. 1, the left shifter 13 takes the upper 63 bits of the multiplicand data as input and appends a '0'b on the right, producing the data shifted left by one bit, commonly known as the +2x multiple of the multiplicand data, which is also sent to the selection logic 17. The output of the shifter 13 is also fed to the one's complement module 15, which flips the bits to form the one's complement of the +2x multiple of the operational multiplicand data 1A; all that remains is to add a "hot 1" in the selection logic 17 to complete the -2x multiple of the operational multiplicand 1A. Adding this "hot 1" is handled in the selection logic 17, along with selecting which multiple of the operational multiplicand 1A to use based on the recoding of the bits of the active multiplier group. Similarly, a bitwise inversion of all 64 bits of the operational multiplicand data 1A forms its one's complement, and only a binary 1 needs to be added to form the two's complement, or -1x multiple, of the operational multiplicand data 1A. Here too, the "hot 1" is added in the selection logic 17.
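The formation of the five multiples, with negation split into a bit flip and a deferred "hot 1", can be sketched in Python as follows. This is an illustration only: the names `MASK64`, `multiple_and_hot_one`, and `resolve` are assumptions of the sketch, and the rows are modeled as 64-bit patterns.

```python
MASK64 = (1 << 64) - 1


def multiple_and_hot_one(multiplicand: int, multiple: int):
    """Return (64-bit row pattern, hot-1 flag) for a multiple in {0, +1, -1, +2, -2}."""
    magnitude = abs(multiple)
    # 1x is the multiplicand itself; 2x is a one-bit left shift with a 0 appended.
    row = 0 if magnitude == 0 else (multiplicand << (magnitude - 1)) & MASK64
    if multiple < 0:
        # One's complement now; the +1 that completes the negation is deferred.
        return (~row) & MASK64, 1
    return row, 0


def resolve(row: int, hot_one: int) -> int:
    """Add the deferred hot 1 to obtain the true two's complement multiple."""
    return (row + hot_one) & MASK64


x = 0x0000_0000_1234_5678
for m in (0, 1, -1, 2, -2):
    row, hot = multiple_and_hot_one(x, m)
    assert resolve(row, hot) == (m * x) & MASK64
    print(f"{m:+d}x -> row {row:016x}, hot 1 = {hot}")
```

Deferring the +1 means that no multiple needs an adder of its own: shifting and inverting are enough, and the hot 1 is absorbed when the partial products are summed.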

Continuing with FIG. 1, the radix-4 Booth recoding logic function 16 takes 16 bits of the multiplier at a time and performs eight simultaneous, overlapping 3-bit scans to generate eight partial products. Table 3 shows an exploded view of a 32-bit multiplier, split into two groups and stacked, to illustrate the arrangement of the 16 scans needed to process it completely. In that table, Z denotes the overlap bit 3 described above (also called Z_BIT), which completes the 17-bit sequence this scanning algorithm requires and makes it possible to handle multipliers larger than 16 bits. With Z_BIT 3 set appropriately, the hardware can treat any group of 16 bits the same way, whether it contains all or only part of the multiplier 2.
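How the overlap bit ties 16-bit groups together can be sketched in Python, for illustration only. The digit formula used is the conventional radix-4 Booth rule; the function names are assumptions of the sketch, as is the choice of Z = 0 for the lowest group. The sketch reads the multiplier as a two's complement value.

```python
def booth_group_value(group: int, z_bit: int) -> int:
    """Sum of the eight radix-4 Booth digits of a 16-bit group and its overlap bit."""
    sequence = (group << 1) | z_bit   # the 17-bit sequence: 16 multiplier bits plus Z
    value = 0
    for scan in range(8):
        b0 = (sequence >> (2 * scan)) & 1       # overlap bit of this scan
        b1 = (sequence >> (2 * scan + 1)) & 1
        b2 = (sequence >> (2 * scan + 2)) & 1
        digit = b0 + b1 - 2 * b2                # one of 0, +1, -1, +2, -2
        value += digit * (1 << (2 * scan))
    return value


def recombine(multiplier: int, groups: int = 2) -> int:
    """Process the multiplier 16 bits at a time, chaining groups through Z."""
    total, z_bit = 0, 0
    for k in range(groups):
        group = (multiplier >> (16 * k)) & 0xFFFF
        total += booth_group_value(group, z_bit) * (1 << (16 * k))
        z_bit = group >> 15   # top bit of this group is Z for the next group
    return total


print(recombine(0x00018000))   # 98304, which equals 0x00018000
print(recombine(0xFFFFFFFE))   # -2, the two's complement reading
```

In the first call, the low group alone contributes -32768; the Z bit passed to the high group adds the compensating 2^16, so the total comes out right even though each group was scanned in isolation.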

Each 3-bit scan produces three bits of internal signals. Two of them, SX and S2X, are determined by the Booth decode and indicate the magnitude of the multiple. When SX is active, the 1x multiple is used and no special action is needed. When S2X is active, the operational multiplicand 1A is shifted left by one position and a zero is inserted to fill the LSB. Note that SX and S2X cannot be active at the same time; when both are inactive, the selection logic 17 uses the 0x multiple. The third internal control signal, SINV, equals the sign of the multiple that the Booth decode determines for each scan. When a positive multiple is generated, SINV is set to zero and no special action is needed. For a negative multiple, SINV is set to '1' and the one's complement forms of the 1x and 2x multiples of the multiplicand 1A are selected. To complete the two's complement, the active SINV serves as the "hot 1" described above, which is placed in the next partial product (that is, the next row of the partial product array) at the same column or bit position as the LSB of the current partial product. Table 4 shows the truth table for the Booth decode. Note that this implementation does not implement a negative zero: a 3-bit scan of '111'b produces the same true-zero decode as '000'b. The effect of a true zero is that the partial product is a sequence of zeros and, naturally, the "hot 1" for the next product is cleared.
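The decode of one 3-bit scan into SX, S2X, and SINV can be sketched in Python as follows. This is an illustration only: the mapping is the conventional radix-4 Booth assignment with '111'b forced to a true zero as described above, and it is not copied from Table 4, which is not reproduced in this text. The function names are assumptions of the sketch.

```python
def booth_decode(scan: int):
    """Map a 3-bit scan (two scan bits, then the overlap bit) to (SX, S2X, SINV)."""
    b2, b1, b0 = (scan >> 2) & 1, (scan >> 1) & 1, scan & 1
    sx = b1 ^ b0                                         # magnitude 1x
    s2x = int((b2, b1, b0) in {(0, 1, 1), (1, 0, 0)})    # magnitude 2x
    sinv = b2 & (1 - (b1 & b0))                          # negative, but never for '111'
    return sx, s2x, sinv


def decoded_multiple(scan: int) -> int:
    """Signed multiplicand multiple selected by one scan."""
    sx, s2x, sinv = booth_decode(scan)
    return (sx + 2 * s2x) * (-1 if sinv else 1)


for scan in range(8):
    print(f"{scan:03b} -> SX,S2X,SINV = {booth_decode(scan)}, multiple {decoded_multiple(scan):+d}")
```

Because '111'b decodes with SINV inactive, a scan inside a long string of 1s contributes neither an inverted row nor a hot 1.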

Table 5 shows a dot representation of the rightmost portion of the partial product array generated in the selection logic function 17 from the inputs supplied by the multiplicand multiple generation logic (for example 12, 13, 14, and 15) and the Booth recoding logic 16. The table shows that no column needs a compression structure with more than 15 inputs (partial product bits plus carry-ins). Note also that, with only eight scans, there are basically at most eight partial products, the exception being column 49 of row 9 (partial product 9, or pp9), which consists of the "hot 1" from the eighth scan. Conveniently, the column to its right has only seven product terms, which means that one fewer carry propagates from column 50 into column 49. The adder 18, which uses an 8:2 compression structure with seven carry-ins, can therefore readily be turned into a 9:2 compression structure with six carry-ins by using the pp9 hot 1 as a ninth input rather than as a carry-in to the adder 18. In other words, both structures take exactly 15 inputs, made up of partial product bits and carry-ins, and produce exactly 9 outputs, made up of sum bits and carry-outs.
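The principle behind such a compression structure, reducing many rows to a sum row and a carry row without propagating carries, can be sketched in Python. Note that this sketch chains simple 3:2 carry-save stages; it is not the 8:2 or 9:2 structure of the adder 18, and the function names and the sample row values are assumptions made purely for illustration.

```python
def carry_save_add(a: int, b: int, c: int):
    """3:2 compressor applied bitwise: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move one column left
    return sum_row, carry_row


def compress_to_two(rows):
    """Reduce any number of non-negative rows to two without propagating carries."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(a, b, c))
    return rows


# Nine arbitrary rows, standing in for pp1 through pp9.
rows = [0x1F3, 0x0A7, 0x3C1, 0x055, 0x2EE, 0x199, 0x07B, 0x304, 0x001]
sum_row, carry_row = compress_to_two(rows)
assert sum_row + carry_row == sum(rows)   # one carry-propagate add finishes the job
```

Only the final addition of the two surviving rows needs a full carry chain, which is why the result is left in sum and carry form until the last step.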

In one embodiment, the resulting sum and carry results are optionally latched in registers 19 and 20 and then combined in the carry-propagate adder 21 to produce a single final modular product. This modular product and a copy of it shifted left by 16 bits are sent to the multiplexer 22, which is controlled by the input signal MTERM_SHIFT to determine whether the modular product must be shifted left for alignment, according to the algorithm of the current operation. The final product output of the binary multiplier 10 is denoted herein as MP_OUT.
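The last stage can be sketched in Python as follows, for illustration only. The function name `mp_out` and the modeling of the output as a 64-bit pattern are assumptions of the sketch.

```python
MASK64 = (1 << 64) - 1   # assumed output width for this sketch


def mp_out(sum_bits: int, carry_bits: int, mterm_shift: int) -> int:
    """Combine the sum and carry rows, then optionally align by 16 bits."""
    modular_product = (sum_bits + carry_bits) & MASK64   # carry-propagate adder 21
    shifted = (modular_product << 16) & MASK64           # 16-bit left-shifted copy
    return shifted if mterm_shift else modular_product   # multiplexer 22


print(hex(mp_out(0x1200, 0x0034, 0)))   # 0x1234
print(hex(mp_out(0x1200, 0x0034, 1)))   # 0x12340000
```

Selecting the pre-shifted copy in the multiplexer aligns the modular product for accumulation in the same pass that produces it.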

Supporting Hardware

FIG. 2 shows a simplified block diagram and data flow of a three-pipeline superscalar fixed-point processor architecture 200 according to one embodiment. The three pipelines 50 are also called the X pipe 50a, the Y pipe 50b, and the Z pipe 50c. Each of the three pipes 50a, 50b, and 50c interfaces with the buses and includes at least two 64-bit operand registers. The operand registers are denoted A1 and B1 (51 and 52) for the X pipe 50a, A2 and B2 (53 and 54) for the Y pipe 50b, and A3 and B3 (55 and 56) for the Z pipe 50c. The X pipe 50a and the Y pipe 50b each have an output C register, denoted C1 57 and C2 58 respectively, fed through multiplexers 60 and 62. The two output registers C1 57 and C2 58 are each used to write data to the general-purpose register file 90. The general-purpose register file 90 supplies data to the operand registers over the REGISTER_DATA bus. The operand registers can also be supplied with data from memory through the operand buffer 92 over the STORAGE_DATA bus. In a single cycle, two values can be written to the register file and four values can be read from it. Additional logic, not shown, may be included to perform data manipulation such as detecting leading zeros and checking for valid data. Multiplexers 60 and 62 are fed by the binary adder units, denoted Bin1 64 and Bin2 66, and by the bit logic units, denoted Blu1 72 and Blu2 74, of the X pipe 50a and the Y pipe 50b respectively. The bit logic units Blu1 72 and Blu2 74 are fed by rotators denoted rot1 76 and rot2 78 respectively. Used together with the bit logic units Blu1 72 and Blu2 74 and with effective use of masking, the rotators rot1 76 and rot2 78 also serve for shift operations. For simplicity and brevity, references herein to a bit logic unit, or to Blu1 72 or Blu2 74, may be taken to include the rotators rot1 76 and rot2 78 as well as the bit logic units Blu1 72 and Blu2 74. The contents of registers B1 52 and B2 54 also interface with the leading zero detect logic 82 and 84 of the X pipe 50a and the Y pipe 50b respectively, which some instructions implemented in the hardware architecture 200 of this embodiment use for early termination.

Z pipe 50c includes a third working register, denoted E 59, which some instructions use to hold information for later use. In one embodiment, binary multiplier 10, shown in detail in FIG. 1, resides in Z pipe 50c of fixed-point superscalar architecture 200. Binary multiplier 10 takes its inputs from the full 64-bit register denoted A3 55 and from the least significant 16 bits of B3 register 56. The multiplier output, denoted MP_OUT, is tri-stated in buffer 86 with the output of binary adder Bin1 64 of X pipe 50a, which eliminates the need for an additional input to the multiplexer feeding the C1 57 register.

The A and B registers 51-56 obtain their inputs from a bus network that is fed in part by register file 90 via the REGISTER_DATA bus, by operand buffer 92 via the STORAGE_DATA bus, and by data contained in the instruction itself via the IMMEDIATE_DATA bus, which carries data extracted from the immediate field of the instruction text. It is important to understand that STORAGE_DATA and IMMEDIATE_DATA are pre-aligned, so that once operands arrive in working registers A and B (51-56) they are all treated identically regardless of the source of the data.

It should be understood that this specification describes one implementation of processor architecture 200 operating with multiplier 10. Other configurations are possible as long as they provide certain key functions. One such function is the ability to rotate multiplier 2 and multiplicand 1, though not necessarily simultaneously, so that the appropriate subgroups are placed in the appropriate positions for subgroup processing. Another is the ability to accumulate the modular products into a single final term.

Multiplication Method
It will be appreciated that, because of the modular nature of the multiplication described above, many algorithms can be devised to implement selected operations in a particular architecture and environment. Described herein are methods for operations implemented in the environment of the fixed-point superscalar processor described above. These methods clearly illustrate particular advantages of using multiplier 10 and the functions available in the supporting hardware shown in FIG. 2.

Referring to FIG. 3, a high-level overview of a multiplication process 200 according to one embodiment is shown. The process receives as inputs instruction information indicating which particular multiplication is to be performed, and the data of two operands to be used as the multiplicand and multiplier of the operation. The instruction information is sent to a group of logic, referred to here as control logic. Based on the algorithms used to implement the different types of multiply instructions, the control logic controls the dataflow hardware, routing data to where it is needed to perform the required function in each cycle and to obtain the final product. In this implementation, the control signals MCAND_64, UNSIGNED, and MTERM_SHIFT are generated. The instruction information controls process 210, which rotates or shifts the operand data appropriately to obtain the proper multiplicand 1 subgroup, multiplier 2 subgroup, or both, per cycle or per instruction. At the same time it obtains the proper Z_BIT, which is used with the multiplier to provide the continuity needed for detecting strings of ones. This particular implementation also includes leading-zero and leading-one detection for use with multiplier data 2, to provide early-termination information to the control logic for some multiply instructions.

Multiplier 10 of FIG. 3 receives multiplicand data 1 and multiplier data 2 as inputs. Multiplier data 2, together with the input overlap bit denoted Z_BIT 3, is partitioned into scan groups at process block 214. In this embodiment, the 16 bits of multiplier data together with Z_BIT are partitioned into eight overlapping scan groups of three bits each, but other multiplier and scan-group widths are possible given the appropriate support, i.e., the set of multiplicand multiples, the recoding, and the selection logic for the multiples. At process block 218, multiplier data 2 is recoded according to the radix-4 Booth algorithm. Turning to multiplicand 1, at process block 212 control signals MCAND_64 4 and UNSIGNED 5 are decoded to determine whether the binary multiplication is signed or unsigned and, in this embodiment, whether multiplicand 1 is 32 bits or 64 bits wide. At process block 214, the selected multiples of multiplicand 1 are generated, for example 0x, ±1x, and ±2x.
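As an illustration of the scan-group partitioning and radix-4 Booth recoding just described, the following Python sketch recodes a 16-bit multiplier subgroup plus Z_BIT into eight digits drawn from {0, ±1, ±2}. The function names are hypothetical, and the digit table is the standard textbook radix-4 Booth table rather than anything taken from the figures.

```python
# Standard radix-4 Booth table: 3-bit scan group -> multiple of the multiplicand.
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_digits(multiplier16, z_bit=0):
    """Recode 16 multiplier bits plus the overlap bit into 8 Booth digits."""
    # Append Z_BIT on the right to form the 17-bit string.
    bits = ((multiplier16 & 0xFFFF) << 1) | (z_bit & 1)
    digits = []
    for i in range(8):
        group = (bits >> (2 * i)) & 0b111  # 2 new bits + 1 overlap bit
        digits.append(BOOTH_TABLE[group])
    return digits  # least significant digit first


def booth_value(digits):
    """Value represented by the digits; digit i carries weight 4**i."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

With `z_bit` equal to 0 the digits represent the subgroup interpreted as a signed 16-bit value; with `z_bit` equal to 1 they represent that value plus one, which is what lets a string of ones continue across a subgroup boundary.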

Continuing with FIG. 3, at process block 220 the desired multiples of multiplicand 1 are selected from process block 218, based on the result of the Booth recoding at process block 218 of multiplier 2 and Z_BIT 3, which as described above yields eight groups of SX, S2X, and SINV signals. Zeros are appended on the right as appropriate to form the partial products. At process block 222, these partial products are summed to produce a single modular product. Finally, at process block 224, the modular product output is shifted where necessary and possible, based on control signal MTERM_SHIFT 6, so that the modular product does not need to be aligned separately before it is combined with other modular products at process block 250. That block uses adder hardware, shift/rotate hardware, hold paths, feedback paths, and working registers, under the control of control logic block 210, to combine the modular products properly and produce the final correct product. Different configurations of this block may also be used to process the modular products.

Several of the multiply instructions implemented in this embodiment of the invention are described in detail below. FIGS. 4, 5, and 6 are diagrams illustrating multiplication operations and the associated execution algorithms using the hardware of FIG. 2 described above, execution process 200, and multiplier hardware 10. The multiplication algorithms are configured to handle various types of multiplication involving operands of different sizes.

Operations with Multiplicands of Different Sizes
Referring to FIG. 4, an example is shown illustrating one implementation using multiplicands 1 of various sizes. In one embodiment, multiplier 10 is configured to perform a lossy 16-bit × 64-bit binary multiplication that produces the lower 64 bits of the possible 80 bits. This allows hardware architecture 200 to process a 16-bit × 32-bit operation, in which the lower 32 bits of the product are produced, and a 16-bit × 64-bit operation, in which the lower 64 bits of the product are produced, in the same number of cycles. By setting MCAND_64 to zero for 32-bit operations and to one for 64-bit operations, both types of multiplication can readily be performed with the same algorithm. The only additional change required concerns how the result is stored: control is set to a 32-bit word for the former operation, e.g., a 16-bit × 32-bit multiplication, and to full length for the latter, e.g., a 16-bit × 64-bit multiplication.
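The effect of MCAND_64 can be illustrated with a small behavioral model. This is a sketch under stated assumptions: the helper names are hypothetical, only the signed case is modeled (the UNSIGNED signal is ignored), and a 32-bit multiplicand is assumed to be sign-extended from the low word of the 64-bit register.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def effective_multiplicand(a3, mcand_64):
    """Form the multiplicand seen by the multiplier (signed case only)."""
    a3 &= MASK64
    if mcand_64:
        return a3                      # use the full 64-bit register
    low = a3 & MASK32                  # 32-bit operation: low word only
    if low >> 31:
        low |= MASK64 ^ MASK32         # assumed sign extension to 64 bits
    return low


def halfword_multiply(a3, b3, mcand_64):
    """Lossy 16 x 32 or 16 x 64 product, selected by MCAND_64."""
    mplier = b3 & 0xFFFF               # least significant 16 bits of B3
    if mplier & 0x8000:
        mplier -= 0x10000              # interpret the halfword as signed
    product = (effective_multiplicand(a3, mcand_64) * mplier) & MASK64
    # Only how much of the result is stored differs between the two cases.
    return product if mcand_64 else product & MASK32
```

The same code path serves both widths; the flag changes only how the multiplicand is formed and how many low-order bits of the result are kept.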

For example, in FIG. 4, a group of instructions for one processor, e.g., MH, MHI, and MGHI, which multiply a multiplicand by a halfword, take full advantage of the features described above. Instructions MH and MHI both produce a 32-bit product, and each obtains its multiplicand from general purpose register (GPR) file 90. However, MHI obtains its multiplier data from an immediate field that is part of the instruction text, whereas MH obtains its multiplier from GPR 90. Similarly, instruction MGHI receives its 64-bit multiplicand data from GPR 90 and its multiplier data from the immediate field of the instruction text. The multiply halfword immediate family is a set of lossy instructions that take a 16-bit field of the instruction text, holding immediate data, as the multiplier input. The figure shows that the multiplier fits in one subgroup and that, because the result is lossy, the multiplicand also fits in one subgroup, so that a single pass through the multiplication hardware suffices. Table 6 below shows an example of the one-cycle process for multiplier data 2 and multiplicand data 1 and the interaction among X pipe 50a, Y pipe 50b, Z pipe 50c, and multiplier 10. In cycle 1, multiplicand data 1, denoted M1 in the table, and multiplier data 2, denoted m1 in the table, are loaded into the A3 55 and B3 56 registers of Z pipe 50c, respectively. After one cycle of latency, the modular product denoted mp1, which in this example is also the final product, appears at multiplier output MP_OUT in the second cycle, which is considered the final cycle of the operation, and is stored in the following cycle. Note the syntax used in the table: when a register is said to receive an "mcand" or "mplier" value, designating the multiplicand 1 and multiplier 2 operand values respectively, this means that the value arrives from the data buses. In this implementation as well, the multiplier input is taken from the least significant 16 bits of register B3 56. To obtain the appropriate multiplier subgroup, it is only necessary to rotate or shift the multiplier data so that the required subgroup occupies the least significant halfword position.

In cycle e1, the operand data latched in registers A3 55 and B3 56 (mcand and mplier, respectively) is passed to binary multiplier 10, where the data from A3 register 55 is used to form multiplicand data 1 and the data from B3 register 56 is used to form multiplier data 2. The data from A3 register 55, denoted M1, is modified according to control signals MCAND_64 and UNSIGNED 5 to form the effective multiplicand. As described above for FIG. 1, this effective multiplicand is then shifted and complemented to generate the ±1x and ±2x multiplicand multiples. Meanwhile, in this embodiment, the first subgroup, consisting of the multiplier data from the rightmost 16 bits of B3 register 56 with the overlap bit Z_BIT 3 appended on the right, is also processed. The overlap bit is zero in this case because this is the least significant (and only) subgroup. This 17-bit string is broken down by the Booth recoding logic of FIG. 1 into eight 3-bit groups (two bits plus one overlap bit). Eight 2-bit signals (the SX and S2X signals described above) are generated, indicating for each 3-bit scan group whether the 1x or the 2x multiple of the multiplicand is to be selected. A further group of eight 1-bit SINV signals is also generated. Each SINV signal indicates whether a negative multiple is to be selected for that group's partial product and, if so, supplies a "hot 1" to the partial product of the next scan group. The hot 1 completes the two's complement operation (complementing the value and adding "1") that is used to generate a negative multiple of the multiplicand. In the same cycle, the eight partial products and the one "hot 1" input from the eighth group resulting from the Booth recoding of the eight 3-bit scan groups are combined in adder 18 (FIG. 1), compressed into two redundant 64-bit sum and carry terms, and latched into the sum and carry registers, structures 20 and 19 of FIG. 1.
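The role of the SX, S2X, and SINV signals and of the "hot 1" can be sketched as follows. The signal encodings below are assumptions chosen to match the standard radix-4 Booth table (the actual gate-level encoding is not given in this passage), and for simplicity the hot 1 is added directly to the partial product it belongs to rather than being injected into the next group's partial product as in the hardware.

```python
MASK64 = (1 << 64) - 1


def booth_select(group):
    """Return assumed (SX, S2X, SINV) for one 3-bit scan group."""
    hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
    sx = mid ^ lo                                        # select 1x multiple
    s2x = int((hi, mid, lo) in ((0, 1, 1), (1, 0, 0)))   # select 2x multiple
    sinv = hi & (sx | s2x)                               # negate nonzero multiple
    return sx, s2x, sinv


def partial_product(mcand, group):
    """Form one 64-bit partial product from the select signals."""
    sx, s2x, sinv = booth_select(group)
    multiple = (mcand if sx else 0) | ((mcand << 1) if s2x else 0)
    multiple &= MASK64
    if sinv:
        multiple = ~multiple & MASK64    # one's complement of the multiple
    return (multiple + sinv) & MASK64    # hot 1 completes the two's complement
```

Inverting the selected multiple and adding the hot 1 yields the negative multiple modulo 2^64, which is why no separate negation step is needed before the partial products are summed.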

In cycle 2, the data from sum register 20 and carry register 19 is combined by adder 21. This output and its 16-bit left-shifted form are fed to multiplexer 22 of FIG. 1, and one of the two is selected using input control signal MTERM_SHIFT 6. The multiplier output MP_OUT, which is the first modular product mp1 in this cycle, is selected at the output of binary adder 64 of FIG. 2 as the output of tri-state buffer 86 and is latched into C1 register 57 through multiplexer 60. Because the final result of the operation has been reached, this is considered the final execution cycle of the instruction.

By implementing the multiplication hardware to handle multiplicands of multiple widths, multiplications with different multiplicand widths are processed using substantially the same algorithm with only minor control changes, and without any need to preformat the incoming operand data. The format of multiplicand data 1 is handled appropriately inside binary multiplier 10 and is controlled by just two signals. One is a signal indicating the length of the multiplicand, e.g., MCAND_64, and the other is a signal indicating whether the operands are signed or unsigned, e.g., UNSIGNED 5; this particular embodiment illustrates the former signal.

Reducing Cycles per Instruction
Another example of the flexibility of the embodiments of this disclosure is based on the implementation of the MSG and MSGR algorithms. Both are 64 × 64 operations that produce a lossy 64-bit product. The difference is that MSGR obtains its multiplier data 2 from a general purpose register, e.g., register file 90, whereas MSG obtains it, in this implementation, from memory through operand buffer 92. To obtain the lossy 64-bit result, control signal MCAND_64 4 is set to one. The cycle-by-cycle operation of this algorithm is shown in Table 7 below, and the processing order of the instruction subgroups is shown in FIG. 5. The figure shows the multiplier divided into four multiplier subgroups; because this is a lossy operation, the complete first operand is kept as 64-bit multiplicand data.

Referring again to FIGS. 1 and 2, in cycle 1 (denoted e1), because MSG/MSGR is a 64-bit × 64-bit instruction, the complete multiplicand data 1 forms the effective multiplicand subgroup M1, which in this case is the only such group. The multiplier data input 2 to binary multiplier 10 is shorter than multiplicand 1; here too the least significant 16 bits of multiplier data are taken from B3 register 56 to form the first multiplier subgroup, referred to here as m1. The multiplicand "subgroup" M1 and multiplier subgroup m1 are processed by binary multiplier 10 as in the previous example, and at the end of the cycle the redundant sum and carry terms are latched into sum register 20 and carry register 19, respectively. At the same time, in X pipe 50a, the complete multiplier, previously loaded into the A1 register, is passed into bit logic unit 72 and rotated so that the second multiplier subgroup (hereinafter m2), e.g., bits 32 through 47 of the original multiplier, is placed in the rightmost position of the data field (bits 48 through 63). The output of BLU1 72 is then latched into registers A2 53 and B3 54 at the end of this cycle for use in the next cycle.

In cycle 2, sum register 20 and carry register 19, which hold the result of the multiplication of M1 by m1 from the previous cycle, are combined to produce modular product mp1 as the first output of binary multiplier 10. This modular product mp1 is also latched into B2 register 54 at the end of the cycle for use in the next cycle. Meanwhile, the multiplier is processing subgroup M1*m2, and again places the result in sum register 20 and carry register 19. In Y pipe 50b, the m2 multiplier held in A2 register 53 is rotated again so that the third multiplier subgroup m3 is placed in the least significant position of the data path. This new multiplier subgroup m3 is likewise loaded into B3 register 56 for use in the next cycle.

In cycle 3, sum register 20 and carry register 19, which hold the result of the multiplication of M1 by m2 from the previous cycle, are combined to produce the next modular product (hereinafter mp2), which is shifted 16 bits to the left to align it for combination with modular product mp1. The shifted modular product mp2 is latched into register A2 for use in the next cycle. Meanwhile, binary multiplier 10 processes subgroup M1*m3 and places the resulting product in sum register 20 and carry register 19. In Y pipe 50b, the m2 multiplier loaded from B3 register 56 in the previous cycle is rotated so that the fourth multiplier subgroup, denoted m4 (e.g., bits 0 through 15 of the original multiplier value), occupies the least significant 16 bit positions of the data. The result is loaded into B3 register 56 for use in the next cycle.

In cycle 4, sum register 20 and carry register 19, which hold the result of the multiplication of M1 by m3 from the previous cycle, are combined to produce modular product mp3 as the third output of binary multiplier 10. It passes through multiplexer 22 without being shifted and is latched into B2 register 54 for use in the next cycle. Meanwhile, the contents of A2 register 53 and B2 register 54, namely the shifted modular product mp2 and the modular product mp1, are added in Bin2 66 to produce the accumulated modular product mp1:2, which is loaded into E register 59 and held for later use.

In cycle 5, sum register 20 and carry register 19, which hold the result of the multiplication of M1 by m4 from the previous cycle, are combined to produce another modular product, denoted mp4, which is the fourth and final modular product for these instructions. It is shifted left and output from multiplexer 22 so that it is aligned for combination with modular product mp3. The shifted modular product mp4 is loaded into A2 register 53 for use in the next cycle.

In cycle 6, the contents of A2 register 53 and B2 register 54, namely the left-shifted mp4 and mp3, are added in binary adder Bin2 66 of Y pipe 50b to produce the accumulated mp3:4 term. This is then loaded into A2 register 53 for use in the next cycle. Meanwhile, the mp1:2 term held in E register 59 is loaded into B2 register 54 for later use.

In cycle 7, the contents of the A2 register, i.e., the accumulated modular product mp3:4, are shifted 32 bits to the left to align them for addition with the mp1:2 term, as shown in FIG. 4. This alignment is performed in bit logic unit BLU2 74: mp3:4 is passed through BLU2 74 and shifted, and the shifted result (mp3:4) is fed back to A2 register 53 for use in the next cycle.

Finally, in cycle 8, the contents of A2 register 53 and B2 register 54, namely the accumulated modular product (mp3:4) shifted by 32 bits and mp1:2, are combined in binary adder Bin2 66 to produce the final product mp1:4 in this last execution cycle. The final product is loaded into the C2 58 register to be stored in the next cycle.
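The arithmetic performed over cycles 1 through 8 can be checked with a behavioral sketch. It models only the values, not the pipeline timing or register routing. The helper names are hypothetical, and the sketch assumes that the Z_BIT for each multiplier subgroup is the most significant bit of the subgroup below it, which is the usual way overlapping Booth scanning is continued across a boundary.

```python
MASK64 = (1 << 64) - 1


def signed16(x):
    """Interpret the low 16 bits of x as a two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def modular_product(mcand, mplier_sub, z_bit, mterm_shift):
    """One pass through the 16 x 64 lossy multiplier (behavioral model)."""
    value = signed16(mplier_sub) + z_bit   # value the Booth digits represent
    mp = (mcand * value) & MASK64
    return (mp << 16) & MASK64 if mterm_shift else mp


def msg_multiply(mcand, mplier):
    """Lossy 64 x 64 product built from four modular products."""
    subs = [(mplier >> (16 * k)) & 0xFFFF for k in range(4)]  # m1..m4
    z_bits = [0] + [s >> 15 for s in subs[:3]]                # assumed overlap bits
    mp1 = modular_product(mcand, subs[0], z_bits[0], False)
    mp2 = modular_product(mcand, subs[1], z_bits[1], True)    # shifted in mux 22
    mp3 = modular_product(mcand, subs[2], z_bits[2], False)
    mp4 = modular_product(mcand, subs[3], z_bits[3], True)    # shifted in mux 22
    mp12 = (mp1 + mp2) & MASK64                # accumulated term mp1:2
    mp34 = (mp3 + mp4) & MASK64                # accumulated term mp3:4
    return (mp12 + ((mp34 << 32) & MASK64)) & MASK64   # final product mp1:4
```

Because only the low 64 bits are kept, the result is the same whether the operands are regarded as signed or unsigned, for example `msg_multiply(7, MASK64)` gives the 64-bit two's complement encoding of -7.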

For the pairs of modular products shown specifically as (mp1, mp2) and (mp3, mp4), it can readily be seen from Table 7 that shifting the modular product 16 bits to the left using multiplexer 22 and shifter 6 at the end of multiplier 10 of FIG. 1 is advantageous because it saves processing cycles. Otherwise, the unshifted modular product (mp2 or mp4) would have to be returned to X pipe 50a or Y pipe 50b, shifted, and then combined with the other modular product (mp1 or mp3, respectively), which would take two processing cycles rather than the one processing cycle needed for the shift.

Using the existing functions of this particular implementation of the supporting hardware shown in FIG. 2 together with multiplier hardware 10 and process 200 saves further processing cycles; in particular, the existing leading-zero detection functions, e.g., 82 and 84 (FIG. 2), can be put to use. To fully support early-termination logic for multiplications that use large operands but whose actual multiplier values are small in magnitude, a slight modification can be made to leading-zero detection functions 82 and 84 so that they also support leading-one detection.

Returning to the Multiply Single (64-bit) instructions as an example, it will readily be seen that this instruction pair is well suited to early termination based on multiplier size. Because the multiplier data is processed from right to left, the algorithm can be terminated at an advantageous point corresponding to the size of the multiplier data. For example, if the multiplier data lies in the range (0x000 to 0x7fff) or (0xFFFFffffFFFF8000) to (0xFFFFffffFFFFffff), it can essentially be represented by m1 alone. Execution can then end after cycle e3, at which point modular product mp1, held in B2 register 54, can be routed to C2 register 58 as the final output, so that the multiplication completes in 3 effective cycles instead of the full 8 cycles shown. Such shortcuts can similarly be implemented for multiplier data of other effective lengths: for multipliers with effective lengths of 15, 31, 32, 47, and 48 bits (i.e., whose magnitude can be fully represented in that number of bits), execution cycle counts of 3, 4, 5, 6, and 7 cycles, respectively, become possible. The execution tables for the first two early-termination cases are shown in Tables 8 and 9, respectively, which give the register contents for each multiplier processing cycle.
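The leading-zero and leading-one detection that drives early termination can be sketched as a function reporting how many 16-bit multiplier subgroups a signed 64-bit multiplier actually needs. The function name is hypothetical, and the sketch covers only the signed case; it does not model the cycle counts or the extra cycle needed for the unsigned correction.

```python
MASK64 = (1 << 64) - 1


def multiplier_subgroups_needed(mplier):
    """Count the 16-bit subgroups needed for a signed 64-bit multiplier."""
    mplier &= MASK64
    sign = mplier >> 63
    for n in (1, 2, 3):
        width = 16 * n
        # Bits from the top bit of subgroup n up to bit 63 of the multiplier.
        upper = mplier >> (width - 1)
        all_ones = (1 << (65 - width)) - 1
        # Leading-zero (or leading-one) detection: those bits must all equal
        # the sign for the higher subgroups to contribute nothing.
        if upper == (all_ones if sign else 0):
            return n
    return 4
```

For the ranges quoted above, 0x7fff and 0xFFFFffffFFFF8000 both need a single subgroup, whereas 0x8000 needs two because its top bit would otherwise be read as a sign.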

When Booth's algorithm is used for unsigned multiplication, a correction term, referred to herein as the modular product correction (mpc) term, may need to be added, because the method of reducing the partial products is based on string detection. Reaching the end of the string (the most significant bit) does not mean that the computation is finished. In unsigned multiplication there are always one or more implicit bits of value "0"b beyond the MSB. If the MSB of the multiplier is 0, these implicit bits to the left of the MSB indicate a continuing string of zeros and no action is taken. However, an MSB of "1"b with zeros appended to its left indicates that the end of a string has been reached. For a 3-bit scan these bits are "001"b, which represents a +1x multiplicand multiple that must be added to the result, aligned so that the LSB of the correction lies one bit to the left of the MSB of the multiplier (see FIG. 6). This is why an operation with a multiplier that can be fully represented in 32 bits takes one cycle more than an operation with a multiplier that can be fully represented in 31 bits: the correction must be incorporated into the final product. Depending on the implementation of the instruction and the available resources, the addition of this mpc term may or may not be built into the algorithm.
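The need for the mpc term can be seen in a short numerical sketch. It assumes a 16-bit multiplier subgroup whose Booth recoding represents the subgroup as a signed value, so that an unsigned subgroup with its MSB set comes out 2^16 too small; the function name is hypothetical.

```python
def unsigned_product_with_mpc(mcand, mplier16):
    """Unsigned product from a signed Booth pass plus the mpc term."""
    mplier16 &= 0xFFFF
    # Value that radix-4 Booth recoding of the 16 bits represents (Z_BIT = 0).
    booth_value = mplier16 - 0x10000 if mplier16 & 0x8000 else mplier16
    modular_product = mcand * booth_value
    # Scan group "001"b formed by the MSB and two implicit zeros to its left:
    # a +1x multiple whose LSB sits one bit left of the multiplier's MSB.
    mpc = (mcand << 16) if mplier16 & 0x8000 else 0
    return modular_product + mpc
```

When the MSB is 0 the correction is zero and the Booth pass alone is already correct; when it is 1, adding the multiplicand shifted left by 16 restores the unsigned product.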

Another example of the benefits and flexibility of the embodiment is based on the use of a logical 64-bit x 64-bit lossless (e.g., where carries are accounted for and handled) multiply instruction that produces a 128-bit result. Referring to FIG. 6, for the product to be lossless, the control signal MCAND_64 4 must be set to zero so that every modular product generated is lossless as well. FIG. 6 shows the multiplier divided into four multiplier subgroups while the 64-bit multiplicand data is divided into two subgroups, creating 16-bit x 32-bit multiplications and avoiding lossy results. Table 10 shows the use of the hardware in an implementation of MLG/MLGR in one embodiment of the invention, listing the register contents for each multiplier processing cycle.

In cycle e1, the binary multiplier 10 processes subgroup M1*m1 while, in the X-pipe 50A, bit logic unit BLU1 72 rotates the multiplier data 2 to generate the second multiplier subgroup m2. This multiplier subgroup m2 is latched into the A2 register 53 and the B3 register 56 for use in the next cycle.

In cycle 2, the binary multiplier 10 processes subgroup M1*m2 and finishes compressing the result of M1*m1 into the modular product term mp1, which is input to the B2 register 54 for use in the next cycle. In the Y-pipe 50B, the contents m2 of the A2 register 53 (from the previous cycle) are rotated in BLU2 74 to generate multiplier subgroup m4, which is latched into the E register 59 for later use. Register A2 53 inputs the multiplicand data mcand from the bus for use in the next cycle.

In cycle 3, the binary multiplier 10 finishes compressing the result of M1*m2 and shifts the result left by 16 bits, aligning the modular product term mp2 for later combination with mp1, and inputs it to the A2 register 53 for use in the next cycle. Register B3 56 holds its m2 data for later use. Meanwhile, in the Y-pipe 50B, bit logic unit BLU2 74 rotates the multiplicand data to generate the second multiplicand subgroup M2, which is input to register A3 55 for use in the next cycle.

In cycle 4, the binary multiplier 10 processes subgroup M2*m2. The contents m4 of the E register 59 are supplied to the B3 register 56 for use in the next cycle, and the A3 register 55 obtains the multiplicand data from the bus, generating subgroup M1 for use by the multiplexer in the next cycle.

In cycle 5, the binary multiplier 10 processes subgroup M1*m4 and finishes compressing the result of M2*m2, shifting it left by 16 bits to generate modular product mp3, which is input to the B2 register 54 for later use. In the Y-pipe 50B, the contents of register A2 53 and register B2 54 are added to generate the combined modular product mp1:2, which is input to the E register 59 of the Z-pipe 50C for later use.

In cycle 6, the binary multiplier 10 finishes compressing the result of M1*m4 and shifts the result left by 16 bits to generate modular product mp4, which is input to register A2 53 for use in the next cycle. Register A1 51 of the X-pipe 50A inputs the multiplicand value from the bus for use in the next cycle.

In cycle 7, BIN2 66 adds the mp4 data previously input to register A2 53 and the mp3 data previously held in B2 54 to generate the combined modular product mp3:4, which is input to A2 53 for later use. Meanwhile, in the X-pipe 50A, BLU1 72 obtains the mcand data from register A1 51 and rotates it to generate the second multiplicand subgroup M2, which is input to A3 55 for use in the next cycle. Register A1 51 inputs the multiplier from the bus for use in the next cycle. Register B3 56 also latches the multiplier value from the bus so that the first subgroup m1 can be used in the next cycle.

In cycle 8, the multiplier processes subgroup M2*m1 and latches the multiplicand into the A3 register 55 so that the first multiplicand subgroup M1 can be used in the next cycle. In the X-pipe 50A, the multiplier held in the A1 register 51 is rotated to generate the third multiplier subgroup m3, which is latched into register B3 56 for use in the next cycle.

In cycle 9, the multiplier processes subgroup M1*m3 and finishes processing the result of M2*m1 to generate modular product mp5, which is latched into the B2 register 54 for use in the next cycle. Register B3 56 holds its m3 value for later use.

In cycle 10, the multiplier finishes processing M1*m3 to generate modular product mp6, which is latched into the B2 register 54 for use in the next cycle. Meanwhile, in the Y-pipe 50B, binary adder BIN2 66 adds the contents of the A2 register 53 and the B2 register 54, mp3:4 and mp5 respectively, to generate the combined modular product mp3:5. In the X-pipe 50A, the A1 register 51 latches the multiplicand from the bus for use in the next cycle.

In cycle 11, in the Y-pipe 50B, BIN2 66 adds the contents of register A2 53 and register B2 54 to generate the combined modular product mp3:6. In the X-pipe 50A, BLU1 72 processes the multiplicand data in register A1 51 to generate the second multiplicand subgroup M2, which is input to register A3 55 for later use.

In cycle 12, in the Z-pipe 50C, the values M2 and m3 are held in the A3 register 55 and the B3 register 56 for use in the next cycle. Meanwhile, in the Y-pipe 50B, the mp3:6 data in register A2 53, input from BIN2 66 in the previous cycle, is rotated in BLU2 74 so that the low 32-bit word is swapped with the high 32-bit word. This produces the rotated modular product rmp3:6, which is needed to align and combine with the other modular products to produce the 128-bit final result. The B2 register 54 latches the mp1:2 data from the E register 59 for use in the next cycle. In the X-pipe 50A, the multiplicand is input to the A1 register 51 in case it is needed in the next cycle.

In cycle 13, the multiplier processes subgroup M2*m3. In the Y-pipe 50B, the rotated modular product rmp3:6, with its low word zeroed, is combined with mp1:2 to produce the least significant doubleword of the resulting quadword. This is latched into register C2 58 to be kept for the next cycle. Register A2 53 inputs the multiplier data from the bus for use in the next cycle. Meanwhile, in the X-pipe 50A, conditionally on the most significant bit of the multiplier data being a "1" bit, the multiplicand data is added to the low word of the rotated modular product rmp3:6 to produce the corrected modular product cmp3:6. This result is input to the B2 register 54 for later use.

In cycle 14, the binary multiplier 10 finishes compressing the result of M2*m3 to generate modular product mp7, which is input to register B2 54 for use in the next cycle. The A3 register 55 of the Z-pipe 50C holds its M2 value for use in the next cycle. The multiplier data in register A2 53 is passed through BLU2 74 to generate the fourth multiplier subgroup m4, which is latched into register B3 56 for use in the next cycle.

In cycle 15, the binary multiplier 10 processes subgroup M2*m4. In the Y-pipe 50B, BIN2 66 combines the modular product mp7 in the A2 register 53 with the corrected modular product cmp3:6 to generate mp3:7, which is latched into the A2 register 53 for use in the next cycle.

In cycle 16, the binary multiplier 10 finishes compressing the result of M2*m4 and shifts the result left by 16 bits to generate modular product mp8, which is input to register A1 51 for use in the next cycle. The mp3:7 data in register A2 53 is passed through BIN2 66 and input to register B2 54 for use in the next cycle.

In cycle 17, the final cycle of execution, the data in register A1 51 and the data in register B1 52 are combined to produce the most significant doubleword of the final result, with the correction already applied. This is latched into register C1 57 to be kept for the next cycle.

The above example shows how the hardware can be used for lossless multiplication.
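The arithmetic behind the walkthrough above can be summarized in a short Python sketch. It is a model under stated assumptions rather than a description of the pipeline: the names `split` and `modular_multiply_64x64` are hypothetical, index 0 stands for the first (least significant) subgroup (M1 or m1), and each modular product is aligned with a single shift instead of the rotate, shift, and add steps spread over the cycles of Table 10.

```python
def split(value: int, width: int, count: int) -> list[int]:
    """Split value into `count` subgroups of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in range(count)]


def modular_multiply_64x64(mcand: int, multiplier: int, order=None) -> int:
    """Lossless unsigned 64 x 64 -> 128-bit product from 32 x 16-bit modular products."""
    M = split(mcand, 32, 2)        # two multiplicand subgroups
    m = split(multiplier, 16, 4)   # four multiplier subgroups
    pairs = order if order is not None else [(i, j) for i in range(2) for j in range(4)]
    product = 0
    for i, j in pairs:
        modular_product = M[i] * m[j]                  # fits the mid-sized multiplier
        product += modular_product << (32 * i + 16 * j)  # align, then accumulate
    return product
```

Because addition is commutative, passing the eight `(i, j)` pairs in any `order` yields the same 128-bit result, which is why the subgroups can be scheduled to suit the available hardware.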

Advantageously, the above example using the lossless instructions shows that the multiplier 10 of the embodiment can easily be configured to process data in which the multiplier 2 and the multiplicand 1 are much wider than the actual hardware data path. FIG. 6 also shows that the modular products can be computed in different combinations. It will be appreciated that the particular order shown is only one of many ways to perform these operations. For example, if the hardware supports immediate shift-out and the ability to store portions of the final product as soon as they are computed, the order of mp5:mp6 could be exchanged with mp3:mp4 to do so. Alternatively, as in the embodiment described earlier for instructions such as MSG/MSGR, the order can be arranged so that the multiplier 2 is processed from right to left in order to support early termination of particular operations. Such "out-of-order" processing of the multiplication subgroups ultimately yields, for example, flexible, high-performance, mid-sized multiplication hardware.

The disclosed invention may be embodied in the form of computer-, controller-, or processor-100-implemented processes and apparatuses for practicing those processes. The invention may also be embodied in the form of computer program code containing instructions embodied in tangible media 102, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, controller, or processor 100, the computer, controller, or processor 100 becomes an apparatus for practicing the invention. The invention may also be embodied in the form of computer program code as a data signal 103, whether stored in a storage medium, loaded into or executed by a computer, controller, or processor 100, or transmitted over some transmission medium such as electrical wiring or cabling, fiber optics, or electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose processor 100, the computer program code segments configure the processor to create specific logic circuits.

It should be understood that the use of first, second, or other similar designations to denote similar elements does not specify or imply any particular order unless otherwise stated.

While the invention has been described with reference to exemplary embodiments, those skilled in the art will understand that various changes may be made, and equivalents may be substituted for elements thereof, without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its essential scope. Accordingly, the invention is not limited to the particular embodiments disclosed as the best mode contemplated for carrying it out, but includes all embodiments falling within the scope of the appended claims.

FIG. 1 is a diagram showing an example of the data flow of a binary multiplier architecture.
FIG. 2 is a simplified block diagram and data flow of a three-pipe superscalar fixed-point processor according to an embodiment.
FIG. 3 is a flowchart of a multiplication method according to an embodiment.
FIG. 4 is a diagram showing a lossy binary multiply instruction.
FIG. 5 is a diagram showing two instructions in the Multiply Single family of binary multiply instructions.
FIG. 6 is a diagram showing a pair of instructions in the Multiply Logical family of binary multiply instructions.

Explanation of reference numerals

1 Multiplicand
2 Multiplier (operand)
3 Duplicate bit
4 MCAND_64 control signal
5 UNSIGNED control signal
6 MTERM_SHIFT control signal
10 Multiplier (unit)
11 Multiplicand high-word conversion
12 Multiplicand input merge (multiplicand multiple generation logic)
13 Left shifter (multiplicand multiple generation logic)
14 One's complement module (multiplicand multiple generation logic)
15 One's complement module (multiplicand multiple generation logic)
16 Booth recoding logic
17 Selection logic
18 Adder
19 Carry register
20 Sum register
21 Carry-propagate adder
22 Multiplexer
50a Pipeline
50b Pipeline
50c Pipeline
51 Operand register
52 Operand register
53 Operand register
54 Operand register
55 Operand register
56 Operand register
57 Output register
58 Output register
59 Working register
60 Multiplexer
62 Multiplexer
64 Binary adder unit
66 Binary adder unit
72 Bit logic unit
74 Bit logic unit
76 Rotator
78 Rotator
82 Leading zero detection logic
84 Leading zero detection logic
86 Buffer
90 General purpose register file
92 Operand buffer
100 Processor
102 Medium
103 Data signal
200 Three-pipeline superscalar fixed-point processor

Claims

A binary multiplier for a superscalar processor, comprising:
a first register input configured to receive a first operand comprising multiplicand data of up to m bits in length, where m bits comprises at least one of the full width of a register of the processor and the maximum width of operand data of a selected binary multiply instruction;
a second register input configured to receive a second operand comprising one of a multiplier value and a multiplier value subgroup, the subgroup comprising a section of the multiplier value n bits in length, where n is less than or equal to the maximum width of the operand data of the selected binary multiply instruction;
a first control signal input configured to receive a first control signal indicating the size of the multiplicand data, enabling the multiplier to select the corresponding bits from the multiplicand data input to form an effective multiplicand and to set unused bits of the multiplicand data input to zero; and
a second control signal input configured to receive a second control signal indicating whether the multiply operation is signed or unsigned, enabling the multiplier to sign-extend the effective multiplicand.

The multiplier of claim 1, further comprising a third control signal input configured to enable a modular product to be shifted and aligned with another modular product for combination.

The multiplier of claim 1, wherein the first control signal facilitates interpretation of a length of the multiplicand data.

The multiplier of claim 1, wherein the second control signal controls whether the multiplication is recognized as a signed or an unsigned operation and, using two's complement format for negative representation, controls whether the multiplicand data is treated as signed or unsigned.

The multiplier of claim 1, wherein the first control signal cooperates with the second control signal to characterize and modify the multiplicand data so that the multiplier produces a lossy result with the appropriate sign, without additional processing to avoid lossy results.

The multiplier of claim 1, further comprising an input configured to receive continuity bits between two multiplier value subgroups, thereby allowing multiplication by multipliers longer than n bits.

The multiplier of claim 1, further comprising recoding logic applied to the one of the multiplier value and the multiplier value subgroup to generate a reduced number of partial products.

The multiplier of claim 7, wherein the recoding logic includes, but is not limited to, a radix-4 Booth algorithm.

The multiplier of claim 7, wherein the recoding logic generates a multi-bit signal for each scan group, with at least one bit indicating which multiple of the multiplicand data is to be selected and another bit indicating whether the multiple is positive or negative.

The multiplier of claim 7, wherein, for each scan group, the recoding logic generates a 3-bit signal, with two bits indicating whether to select at least one of the ±1x and ±2x multiples of the multiplicand data and the third bit indicating a positive or negative multiple.

The multiplier of claim 10, wherein the third bit is used as a "hot one" that is input to a subsequent partial product at the same column or bit position as the least significant bit of the current partial product, completing the two's complement needed to implement a negative multiple.

The multiplier of claim 1, wherein a plurality of partial products are processed using an adder, and the plurality of partial products are compressed into two product terms.

The multiplier of claim 1, further comprising a combination of a shifter and a multiplexer operably connected to the output of the adder, the combination shifting the modular product left by n bits, based on the size of the one of the multiplier value and the multiplier value subgroup, to facilitate aligning the modular product for combination with other modular products.

A system for binary multiplication in a superscalar processor, comprising:
a binary multiplier;
a data width shifter in operable communication with the binary multiplier, the shifter shifting multiplier data and multiplicand data to ensure that selected subgroups are sent to the binary multiplier, and shifting a modular product to ensure that it is aligned for proper combination with at least one of another modular product and an accumulated modular product;
a register in operable communication with the shifter, holding the multiplier data and the multiplicand data for input to the shifter;
an adder in operable communication with the register, accumulating a modular product with at least one of another modular product and an accumulated modular product; and
a plurality of registers in operable communication with the adder, holding the modular product and the accumulated modular product for input to the adder and for output of the accumulated modular product.

The system of claim 14, further comprising leading zero detection logic that detects small operand magnitudes in unsigned multiplier data in order to control possible early termination logic for longer operations that use relatively wide multiplier data.

The system of claim 14, further comprising leading one detection logic that detects small operand magnitudes in negative signed multiplier data in order to control possible early termination logic for longer operations that use relatively wide multiplier data.

The system of claim 14, further comprising another data width shifter for shifting the multiplier data and the multiplicand data.

The system of claim 14, further comprising another adder that combines redundant terms into one modular product.

The system of claim 14, further comprising another register when the multiplier output is in redundant form, wherein two registers hold the redundant terms that are input to the adder and combined into one modular product, and the other register holds the accumulated modular product.

The system of claim 14, wherein the multiplier is configured to switch between multiplicand data of different lengths, using a control signal, for multiplications that use equal-size multipliers and different-size multiplicands.

The system of claim 14, wherein the multiplier subgroups and the multiplicand subgroups are configured to be processed in any order, taking into account constraints of the available supporting hardware, in order to accommodate large operands and large results or to accommodate early termination logic.

A system for binary multiplication in a superscalar processor, comprising:
a first pipeline including a first register, a second register, a third register, an execution unit including a bit logic unit and a binary adder in operable communication with the first register, the second register, and a first multiplexer, the first multiplexer being in operable communication with the third register, a first rotator in operable communication with the first register and the execution unit, and a leading zero detect register in operable communication with the execution unit and the second register;
a second pipeline including a fourth register, a fifth register, a sixth register, a second execution unit including another bit logic unit and another binary adder in operable communication with the fourth register, the fifth register, and a second multiplexer, the second multiplexer being in operable communication with the sixth register, a rotator in operable communication with the fourth register and the second execution unit, and a leading zero detect register in operable communication with the second execution unit and the fifth register;
a third pipeline including a seventh register, an eighth register, a ninth register, and a multiplier in operable communication with the seventh register and the second register;
general purpose registers for storing and retrieving data;
an operand buffer for obtaining a first operand and a second operand; and
a communication bus for communication between at least two of the first pipeline, the second pipeline, the third pipeline, the general purpose registers, and the operand buffer.

A method of binary multiplication, comprising:
obtaining a multiplicand;
obtaining a multiplier;
if the multiplier exceeds a selected length, partitioning the multiplier into a plurality of multiplier subgroups;
if the multiplicand exceeds a selected length, performing at least one of partitioning the multiplicand into a plurality of multiplicand subgroups, zeroing out unsigned bits in a multiplicand subgroup, and sign-extending a smaller portion of the multiplicand subgroup;
establishing a plurality of multiplicand multiples based on at least one of a selected multiplicand subgroup of the plurality of multiplicand subgroups and the multiplicand;
selecting one or more multiplicand multiples of the plurality of multiplicand multiples based on each multiplier subgroup of the plurality of multiplier subgroups; and
generating a modular product based on the selecting.

The method of claim 23, further comprising:
generating another modular product based on the selecting; and
aligning and combining the modular product with the other modular product to produce a resulting product.

The method of claim 23, further comprising:
recoding the plurality of multiplier subgroups using a recoding algorithm; and
selecting one or more multiplicand multiples of the plurality of multiplicand multiples based on the recoding,
wherein the multiplicand multiples are also based on the recoding.

The method of claim 25, wherein the recoding algorithm is a radix-4 Booth algorithm.

The recoding step includes generating a multi-bit signal for each scan group, with at least one bit indicating which multiple of the multiplicand is to be selected and another bit indicating whether the multiple of the multiplicand is positive or negative.

The method of claim 23, wherein the multi-bit signal is a 3-bit signal, two bits being used to select one of the 0x, ±1x, and ±2x multiples of the multiplicand and the third bit being used to indicate a positive or negative multiple of the multiplicand.

The method of claim 28, further comprising using the other one bit as a "hot one" input to a subsequent product to complete the two's complement required to implement a negative multiple of the multiplicand data.

The method of claim 23, wherein partitioning the multiplicand segments the multiplicand data so as to avoid lossy results.