JP6920277B2 - Mixed width SIMD operation with even and odd element operations using a pair of registers for a wide data element

Info

Publication number: JP6920277B2
Application number: JP2018502231A
Authority: JP
Inventors: エリック・ウェイン・マハーアン; アジャイ・アナント・イングル
Original assignee: クアルコム，インコーポレイテッド
Priority date: 2015-07-21
Filing date: 2016-06-21
Publication date: 2021-08-18
Anticipated expiration: 2036-06-21
Also published as: WO2017014892A1; US10489155B2; EP3326060A1; KR20180030986A; US20170024209A1; CN107851010B; JP2018525731A; HUE049260T2; CN107851010A; BR112018001208B1; ES2795832T3; KR102121866B1; EP3326060B1; BR112018001208A2

Description

Aspects of the present disclosure relate to operations involving two or more vectors in which the data elements of at least one vector have a different bit width than the data elements of at least one other vector. Such operations are referred to as mixed-width operations. More specifically, some aspects relate to mixed-width single instruction multiple data (SIMD) operations involving at least a first vector operand and a second vector operand, where at least one of the first vector operand or the second vector operand has data elements that may be stored in even or odd register pairs.

Single instruction multiple data (SIMD) instructions may be used in processing systems that exploit data parallelism. Data parallelism exists, for example, when the same or a common task needs to be performed on two or more data elements of a data vector. Rather than using multiple instructions, the common task can be performed on the two or more data elements in parallel by using a single SIMD instruction that defines the same instruction to be executed on multiple data elements in corresponding multiple SIMD lanes.

A SIMD instruction can include one or more vector operands, such as source vector operands and destination vector operands. Each vector operand includes two or more data elements. For SIMD instructions, all data elements belonging to the same vector operand generally have the same bit width. However, some SIMD instructions may specify mixed-width operands, where the data elements of a first vector operand have a first bit width and the data elements of a second vector operand have a second bit width that differs from the first. Executing SIMD instructions with mixed-width operands can present several challenges.

FIGS. 1A-1C illustrate examples of the challenges involved in conventional implementations for executing SIMD instructions with mixed-width operands. Referring to FIG. 1A, a first conventional implementation for executing SIMD instruction 100 is shown. Assume that SIMD instruction 100 may be executed by a conventional processor (not shown) that supports a 64-bit instruction set architecture (ISA). This means that instructions such as SIMD instruction 100 can specify operands with bit widths of up to 64 bits. A 64-bit operand can be specified in terms of a 64-bit register or a pair of 32-bit registers.

The purpose of SIMD instruction 100 is to perform the same instruction on each data element of source operand 102. Source operand 102 is a 64-bit vector comprising eight 8-bit data elements labeled 0-7, and may be stored in a single 64-bit register or in a pair of 32-bit registers. The same instruction or common operation performed on each of the eight data elements 0-7 may be, for example, a multiplication, a square function, a left-shift function, an increment function, or an addition (e.g., with a constant value or immediate field in the instruction, or with a value provided by another vector operand). For each of the eight resulting data elements, the result may consume more than 8 bits and up to 16 bits of storage. This means that the result of SIMD instruction 100 may consume twice the storage space that source operand 102 consumes, i.e., two 64-bit registers or two pairs of 32-bit registers.
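The storage growth described above can be checked with a small model. The following Python sketch is an illustration only and is not taken from the patent: it assumes that element 0 occupies the least significant bits of the register, picks squaring as the common operation, and uses hypothetical names (`unpack`, `source`).

```python
def unpack(reg, width, count):
    """Split an integer 'register' into `count` elements of `width` bits each.

    Assumption: element 0 sits in the least significant bits.
    """
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]


# A 64-bit source vector holding eight 8-bit elements (values chosen arbitrarily).
source = 0xFF_10_0F_08_07_03_02_01
elements = unpack(source, 8, 8)

# Squaring is one of the example operations; a result may need up to 16 bits.
squares = [e * e for e in elements]
bits_needed = max(s.bit_length() for s in squares)

# Reserving 16 bits per result doubles the storage relative to the 64-bit source.
total_result_bits = 16 * len(squares)
print(bits_needed, total_result_bits)
```

With eight results of 16 bits each, the destination needs 128 bits, which is why a single 64-bit destination operand is not enough.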

Because the conventional processor configured to implement SIMD instruction 100 does not include instructions that specify operands wider than 64 bits, SIMD instruction 100 may be split into two component SIMD instructions, 100X and 100Y. SIMD instruction 100X specifies the common operation to be performed on the data elements of source operand 102 labeled with the even numbers 0, 2, 4, and 6 (the "even-numbered data elements"). SIMD instruction 100X specifies destination operand 104x, which is 64 bits wide and includes 16-bit data elements labeled A, C, E, and G, each composed of an upper (H) 8 bits and a lower (L) 8 bits. The results of the common operation on the even-numbered 8-bit data elements 0, 2, 4, and 6 of source operand 102 are written to the corresponding 16-bit data elements A, C, E, and G of destination operand 104x. SIMD instruction 100Y is similar to SIMD instruction 100X, except that it specifies the common operation on the data elements of source operand 102 labeled with the odd numbers 1, 3, 5, and 7 (the "odd-numbered data elements"), with the results written to the 16-bit data elements B, D, F, and H of destination operand 104y, which, like destination operand 104x of SIMD instruction 100X, is a 64-bit operand. In this way, each of SIMD instructions 100X and 100Y can specify one 64-bit destination operand, and together they perform the common operation on each of data elements 0-7 of source operand 102. However, the two separate instructions needed to implement SIMD instruction 100 increase code space.
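A minimal sketch of this first conventional scheme follows, assuming element 0 is in the least significant bits and squaring is the common operation. The function names `simd_100x` and `simd_100y` are hypothetical stand-ins for the two component instructions; the point is that two separate calls are needed.

```python
def unpack(reg, width, count):
    """Split an integer 'register' into `count` elements of `width` bits."""
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]


def pack(elements, width):
    """Pack elements into an integer 'register', element 0 in the low bits."""
    reg = 0
    for i, e in enumerate(elements):
        reg |= (e & ((1 << width) - 1)) << (i * width)
    return reg


def simd_100x(src, op):
    """Hypothetical model of 100X: operate on even-numbered elements only."""
    evens = unpack(src, 8, 8)[0::2]
    return pack([op(e) for e in evens], 16)


def simd_100y(src, op):
    """Hypothetical model of 100Y: operate on odd-numbered elements only."""
    odds = unpack(src, 8, 8)[1::2]
    return pack([op(e) for e in odds], 16)


def square(x):
    return x * x


src = pack([10, 20, 30, 40, 50, 60, 70, 80], 8)
dest_104x = simd_100x(src, square)  # first instruction
dest_104y = simd_100y(src, square)  # second instruction
print(unpack(dest_104x, 16, 4), unpack(dest_104y, 16, 4))
```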

FIG. 1B illustrates a second conventional implementation of SIMD instruction 100, using a different set of component SIMD instructions 120X and 120Y. SIMD instructions 120X and 120Y each specify the common operation on every one of the 8-bit data elements 0-7 of source operand 102. SIMD instruction 120X specifies destination operand 124x, and the lower (L) 8 bits of each result are written to the corresponding 8-bit result data elements A-H of destination operand 124x, while the upper (H) 8 bits of each result are discarded. Similarly, instruction 120Y specifies destination operand 124y, and the upper (H) 8 bits of each result are written to the corresponding 8-bit data elements A-H of destination operand 124y, while the lower (L) 8 bits are discarded. This second conventional implementation of SIMD instruction 100 also suffers from the increased code space of the two component SIMD instructions 120X and 120Y. Moreover, as can be appreciated, the second conventional implementation wastes power by computing and then discarding either the upper (H) 8 bits (e.g., when executing instruction 120X) or the lower (L) 8 bits (e.g., when executing instruction 120Y) of the result for each of data elements 0-7 of source operand 102.
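The wasted work in this second scheme can be made visible by counting how often the common operation is evaluated. The sketch below is an assumption-laden model (element 0 in the low bits, squaring as the operation, hypothetical names `simd_120x` and `simd_120y`), not the patent's implementation.

```python
def unpack(reg, width, count):
    """Split an integer 'register' into `count` elements of `width` bits."""
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]


def pack(elements, width):
    """Pack elements into an integer 'register', element 0 in the low bits."""
    reg = 0
    for i, e in enumerate(elements):
        reg |= (e & ((1 << width) - 1)) << (i * width)
    return reg


call_log = []  # records every evaluation of the common operation


def square(x):
    call_log.append(x)
    return x * x


def simd_120x(src, op):
    """Hypothetical model of 120X: keep the low byte, discard the high byte."""
    return pack([op(e) & 0xFF for e in unpack(src, 8, 8)], 8)


def simd_120y(src, op):
    """Hypothetical model of 120Y: keep the high byte, discard the low byte."""
    return pack([(op(e) >> 8) & 0xFF for e in unpack(src, 8, 8)], 8)


values = [10, 20, 30, 40, 50, 60, 70, 80]
src = pack(values, 8)
low = unpack(simd_120x(src, square), 8, 8)
high = unpack(simd_120y(src, square), 8, 8)

# Recombining the two destination operands recovers the full 16-bit results.
recombined = [(h << 8) | l for h, l in zip(high, low)]
print(recombined, len(call_log))
```

In this model, each of the eight full-width results is computed twice, and half of each computation is thrown away.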

FIG. 1C illustrates a third conventional implementation of SIMD instruction 100, using yet another set of component SIMD instructions 140X and 140Y, which are similar in some respects to SIMD instructions 100X and 100Y of FIG. 1A. The difference lies in which of the data elements of source operand 102 are operated on by each SIMD instruction. More specifically, rather than the even-numbered 8-bit data elements, SIMD instruction 140X specifies the common operation to be performed on the lower four data elements 0-3 of source operand 102. The results are written to the 16-bit data elements A, B, C, and D of destination operand 144x. However, executing SIMD instruction 140X involves spreading out the results of the operation on the lower four 8-bit data elements (which span 32 bits) across all 64 bits of destination operand 144x. SIMD instruction 140Y is similar, and specifies spreading out the results of the operation on the upper four 8-bit data elements 4-7 of source operand 102 across the 16-bit data elements E, F, G, and H of 64-bit destination operand 144y. Apart from the increased code size, as in the first and second conventional implementations, spreading out the data movements as in the third conventional implementation may require additional hardware such as a crossbar.
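The spreading-out can be quantified by comparing where each source element starts with where its result must land. The following sketch assumes the 64-bit datapath is viewed as four 16-bit lanes and that element 0 is in the least significant bits; `lane_moves` is a hypothetical helper, not something defined in the patent.

```python
SRC_WIDTH = 8   # bits per source data element
DST_WIDTH = 16  # bits per destination data element (also the assumed lane width)


def lane_moves(first_element, count=4):
    """List (source element, lane it starts in, lane its result must reach).

    Models one component instruction that handles `count` consecutive
    source elements starting at `first_element`.
    """
    moves = []
    for k in range(count):
        i = first_element + k
        src_lane = (i * SRC_WIDTH) // DST_WIDTH  # 16-bit lane holding element i
        dst_lane = k                             # k-th 16-bit result slot
        moves.append((i, src_lane, dst_lane))
    return moves


moves_140x = lane_moves(0)  # lower four source elements 0-3
moves_140y = lane_moves(4)  # upper four source elements 4-7

crossing_140x = [i for i, s, d in moves_140x if s != d]
crossing_140y = [i for i, s, d in moves_140y if s != d]
print(crossing_140x, crossing_140y)
```

Under these assumptions, most elements have to leave the lane they start in, which is the kind of movement a crossbar would have to support.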

Accordingly, there is a need for improved implementations of mixed-width SIMD instructions that avoid the aforementioned drawbacks of the conventional implementations.

Exemplary aspects include systems and methods relating to a mixed-width single instruction multiple data (SIMD) instruction that has at least one source vector operand comprising data elements of a first bit width and a destination vector operand comprising data elements of a second bit width, where the second bit width is either half or twice the first bit width. Correspondingly, one of the source vector operand or the destination vector operand is expressed as a pair of registers, namely a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to the even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to the odd-numbered data elements of the other vector operand expressed as a single register.

For example, an exemplary aspect relates to a method of performing a mixed-width single instruction multiple data (SIMD) operation. The method comprises receiving, by a processor, a SIMD instruction comprising at least a first source vector operand comprising a first set of source data elements of a first bit width, and at least a destination vector operand comprising destination data elements of a second bit width, where the second bit width is twice the first bit width. The destination vector operand comprises a pair of registers, including a first register comprising a first subset of the destination data elements and a second register comprising a second subset of the destination data elements. Based on a sequential order of the first set of source data elements, the method includes executing the SIMD instruction in the processor, which comprises generating the first subset of destination data elements in the first register from the even-numbered source data elements of the first set, and generating the second subset of destination data elements in the second register from the odd-numbered source data elements of the first set.
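The widening method just described can be sketched as a single operation that returns a register pair. This is a minimal model under stated assumptions: 64-bit registers held as Python integers, element 0 in the least significant bits, a left shift by 4 as the common operation, and a hypothetical function name `widen`.

```python
def unpack(reg, width, count):
    """Split an integer 'register' into `count` elements of `width` bits."""
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]


def pack(elements, width):
    """Pack elements into an integer 'register', element 0 in the low bits."""
    reg = 0
    for i, e in enumerate(elements):
        reg |= (e & ((1 << width) - 1)) << (i * width)
    return reg


def widen(src, op):
    """Model one mixed-width instruction: 8-bit sources, 16-bit destinations.

    Returns (first_register, second_register): results for even-numbered
    source elements go to the first register, odd-numbered to the second.
    """
    elems = unpack(src, 8, 8)
    first = pack([op(e) for e in elems[0::2]], 16)
    second = pack([op(e) for e in elems[1::2]], 16)
    return first, second


src = pack([0xFF, 1, 2, 3, 4, 5, 6, 0x80], 8)
first_reg, second_reg = widen(src, lambda x: x << 4)
print([hex(v) for v in unpack(first_reg, 16, 4)])
print([hex(v) for v in unpack(second_reg, 16, 4)])
```

Unlike the two-instruction conventional schemes, one call produces both halves of the wide result.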

Another exemplary aspect relates to a method of performing a mixed-width single instruction multiple data (SIMD) operation. The method comprises receiving, by a processor, a SIMD instruction comprising at least a source vector operand comprising source data elements of a first bit width, and at least a destination vector operand comprising destination data elements of a second bit width, where the second bit width is half of the first bit width. The source vector operand comprises a pair of registers, including a first register comprising a first subset of the source data elements and a second register comprising a second subset of the source data elements. Based on a sequential order of the destination data elements, the method includes executing the SIMD instruction in the processor, which comprises generating the even-numbered destination data elements from the corresponding first subset of source data elements in the first register, and generating the odd-numbered destination data elements from the corresponding second subset of source data elements in the second register.
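The narrowing direction can be sketched the same way. The model below assumes 64-bit registers held as Python integers with element 0 in the least significant bits, and uses a right shift by 8 as the common operation; `narrow` is a hypothetical name.

```python
def unpack(reg, width, count):
    """Split an integer 'register' into `count` elements of `width` bits."""
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]


def pack(elements, width):
    """Pack elements into an integer 'register', element 0 in the low bits."""
    reg = 0
    for i, e in enumerate(elements):
        reg |= (e & ((1 << width) - 1)) << (i * width)
    return reg


def narrow(first, second, op):
    """Model one mixed-width instruction: 16-bit sources, 8-bit destinations.

    The first register feeds the even-numbered destination elements and the
    second register feeds the odd-numbered ones.
    """
    dest = []
    for a, b in zip(unpack(first, 16, 4), unpack(second, 16, 4)):
        dest.append(op(a))  # even-numbered destination element
        dest.append(op(b))  # odd-numbered destination element
    return pack(dest, 8)


first_reg = pack([0x0100, 0x0300, 0x0500, 0x0700], 16)
second_reg = pack([0x0200, 0x0400, 0x0600, 0x0800], 16)
dest_reg = narrow(first_reg, second_reg, lambda x: x >> 8)
print(unpack(dest_reg, 8, 8))
```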

Another exemplary aspect relates to a non-transitory computer-readable storage medium comprising instructions executable by a processor which, when executed by the processor, cause the processor to perform a mixed-width single instruction multiple data (SIMD) operation. The non-transitory computer-readable storage medium comprises a SIMD instruction, which comprises at least a first source vector operand comprising a first set of source data elements of a first bit width, and at least a destination vector operand comprising destination data elements of a second bit width, where the second bit width is twice the first bit width. The destination vector operand comprises a pair of registers, including a first register comprising a first subset of the destination data elements and a second register comprising a second subset of the destination data elements. Based on a sequential order of the first set of source data elements, the non-transitory computer-readable storage medium includes code for generating the first subset of destination data elements in the first register from the even-numbered source data elements of the first set, and code for generating the second subset of destination data elements in the second register from the odd-numbered source data elements of the first set.

Another exemplary aspect relates to a non-transitory computer-readable storage medium comprising instructions executable by a processor which, when executed by the processor, cause the processor to perform a mixed-width single instruction multiple data (SIMD) operation, the non-transitory computer-readable storage medium comprising a SIMD instruction. The SIMD instruction comprises at least a source vector operand comprising source data elements of a first bit width, and at least a destination vector operand comprising destination data elements of a second bit width, where the second bit width is half of the first bit width. The source vector operand comprises a pair of registers, including a first register comprising a first subset of the source data elements and a second register comprising a second subset of the source data elements. Based on a sequential order of the destination data elements, the non-transitory computer-readable storage medium includes code for generating the even-numbered destination data elements from the corresponding first subset of source data elements in the first register, and code for generating the odd-numbered destination data elements from the corresponding second subset of source data elements in the second register.

The accompanying drawings are presented to aid in the description of aspects of the invention and are provided solely for illustration of the aspects and not limitation thereof.

A diagram illustrating a conventional implementation of a mixed-width SIMD instruction.
A diagram illustrating a conventional implementation of a mixed-width SIMD instruction.
A diagram illustrating a conventional implementation of a mixed-width SIMD instruction.
A diagram illustrating an exemplary implementation of a mixed-width SIMD instruction according to aspects of this disclosure.
A diagram illustrating an exemplary implementation of a mixed-width SIMD instruction according to aspects of this disclosure.
A diagram illustrating an exemplary implementation of a mixed-width SIMD instruction according to aspects of this disclosure.
A diagram illustrating a method of performing a mixed-width single instruction multiple data (SIMD) operation.
A diagram illustrating a method of performing a mixed-width single instruction multiple data (SIMD) operation.
A diagram illustrating an exemplary wireless device 400 in which aspects of this disclosure may be advantageously employed.

Aspects of the invention are disclosed in the following description and related drawings directed to specific aspects of the invention. Alternative aspects may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail, or will be omitted, so as not to obscure the relevant details of the invention.

The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects. Likewise, the term "aspects of the invention" does not require that all aspects of the invention include the discussed feature, advantage, or mode of operation.

The terminology used herein is for the purpose of describing particular aspects only and is not intended to limit aspects of the invention. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Further, many aspects are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that the various actions described herein can be performed by specific circuits (e.g., application-specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions can be considered to be embodied entirely within any form of computer-readable storage medium that stores a corresponding set of computer instructions which, upon execution, cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which are contemplated to be within the scope of the claimed subject matter. In addition, for each of the aspects described herein, the corresponding form of any such aspect may be described herein as, for example, "logic configured to" perform the described action.

Exemplary aspects of this disclosure relate to implementations of mixed-width SIMD operations that avoid data movement across SIMD lanes and reduce code size. For example, rather than decomposing a SIMD operation into two or more component SIMD instructions (e.g., the conventional execution of SIMD instruction 100 in FIGS. 1A-1C), exemplary aspects include a single SIMD instruction that specifies one or more vector operands as a pair of operands, which may be expressed in terms of a pair of registers. By specifying at least one vector operand (either a source operand or a destination operand) as a pair of registers, or register pair, a single exemplary SIMD instruction can be used in place of two or more conventional component SIMD instructions. Code size for mixed-width SIMD operations is thereby reduced.

It is noted that in this disclosure, reference is made to expressing operands in terms of registers, in order to follow the customary instruction formats in which an instruction specifies an operation to be performed on one or more registers. Thus, a SIMD instruction can be of a format in which a common operation is specified for one or more operands expressed in terms of registers. Accordingly, an exemplary mixed-width SIMD instruction according to this disclosure includes at least one vector operand expressed in terms of a single register and at least one other vector operand expressed in terms of a pair of registers. These references to registers may pertain to logical or architectural registers used by a program comprising the exemplary SIMD instructions. They may also pertain, without restriction, to physical registers of a physical register file. In general, references to registers are meant to convey storage elements of a certain size.

Accordingly, an exemplary method of performing a mixed-width single instruction multiple data (SIMD) operation in a processor coupled to a register file may include specifying a SIMD instruction with at least a first vector operand comprising data elements of a first bit width and at least a second vector operand comprising data elements of a second bit width. The first vector operand can be a source vector operand and the second vector operand can be a destination vector operand. Correspondingly, the data elements of the source vector operand may be referred to as source data elements, and the data elements of the destination vector operand may be referred to as destination data elements.

In an exemplary mixed-width SIMD instruction, a one-to-one correspondence exists between the source data elements and the destination data elements. In general, when the operation specified in the mixed-width SIMD instruction is performed on a source data element, a specific corresponding destination data element is generated. For example, consider a mixed-width SIMD operation that left-shifts a source vector operand to form a destination vector operand. In this example, each source data element generates a specific destination data element when the left shift is performed on it.

In one exemplary aspect of this disclosure, the second bit width of the destination data elements may be less than, and specifically half, the first bit width of the source data elements. In this aspect, the source vector operand may be expressed as a pair of registers and the destination vector operand as a single register.

In another exemplary aspect of this disclosure, the second bit width of the destination data elements may be greater than, and specifically twice, the first bit width of the source data elements. In this aspect, the source vector operand may be expressed as a single register and the destination vector operand as a pair of registers.

To illustrate the specific mapping between the source data elements and the destination data elements of the source vector operand and the destination vector operand, respectively, a sequential order is assigned to the data elements of the vector operand whose data elements have the smaller bit width. For example, the order is assigned to the data elements of the vector operand expressed as a single register. Based on this order, even-numbered data elements (e.g., those numbered 0, 2, 4, 6, and so on) and odd-numbered data elements (e.g., those numbered 1, 3, 5, 7, and so on) are identified for the vector operand expressed as a single register. The pair of registers of the other vector operand is referred to as a first register and a second register, which comprise a first subset and a second subset of data elements, respectively. Accordingly, the even-numbered data elements of the vector operand expressed as a single register are assigned a correspondence with the data elements of the first subset or first register, and the odd-numbered data elements are assigned a correspondence with the data elements of the second subset or second register. In this manner, large data movements across SIMD lanes are avoided for the source data elements during execution of the specified SIMD operation to generate the corresponding destination data elements.
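The even/odd correspondence can be written as a small index mapping. The sketch below is illustrative only: it assumes eight 8-bit elements in the single register, 16-bit elements in each register of the pair, element 0 in the least significant bits, and a datapath viewed as 16-bit lanes. The helper `paired_location` is hypothetical.

```python
NARROW_WIDTH = 8   # bits per element in the single-register operand
WIDE_WIDTH = 16    # bits per element in each register of the pair


def paired_location(i):
    """Map narrow element i to (register index, element index) in the pair.

    Register index 0 is the first register (even-numbered elements),
    register index 1 is the second register (odd-numbered elements).
    """
    return (i % 2, i // 2)


locations = [paired_location(i) for i in range(8)]

# Narrow element i starts at bit i * 8, i.e. inside 16-bit lane (i * 8) // 16.
# Its counterpart sits in wide element i // 2, which is the same lane.
stays_in_lane = all(
    (i * NARROW_WIDTH) // WIDE_WIDTH == wide_index
    for i, (_, wide_index) in enumerate(locations)
)
print(locations, stays_in_lane)
```

Under these assumptions, every element pairs with a counterpart in the lane it already occupies, which is consistent with the statement that large movements across SIMD lanes are avoided.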

Exemplary aspects can also relate to SIMD operations that specify more than two vector operands, including, for example, a third operand of a third bit width, and beyond. One example is disclosed in which two source vector operands, each expressed as a single register, are specified for a mixed-width SIMD instruction to generate a destination vector operand expressed as a pair of registers. Numerous other such instruction formats are possible within the scope of this disclosure. For simplicity, exemplary aspects for implementing mixed-width SIMD operations are described with reference to some example SIMD instructions and operand bit widths; note that these are for explanation only. As such, the features discussed herein can be extended to any number of operands and data-element bit widths for mixed-width vector operations.

FIGS. 2A-2C illustrate exemplary aspects pertaining to SIMD instructions 200, 220, and 240. Each of these SIMD instructions 200, 220, and 240 can be executed by a processor configured to execute SIMD instructions (e.g., processor 402 shown in FIG. 4). More specifically, each of these SIMD instructions 200, 220, and 240 can specify one or more source vector operands and one or more destination vector operands, where the source and destination vector operands can be expressed in terms of registers (e.g., 64-bit registers). The source and destination vector operands of SIMD instructions 200, 220, and 240 include corresponding source and destination data elements, each of which falls within one or more SIMD lanes. The number of SIMD lanes in the execution of a SIMD instruction corresponds to the number of parallel operations performed in that execution. Accordingly, a processor or execution logic configured to implement the example SIMD instructions 200, 220, and 240 can include the hardware needed to carry out the parallel operations they specify (e.g., an arithmetic logic unit (ALU) comprising a number of left/right shifters, adders, multipliers, etc.).

Accordingly, with reference to FIG. 2A, a first exemplary aspect is illustrated for execution of SIMD instruction 200. In one example, the processor is assumed to support a 64-bit instruction set architecture (ISA). SIMD instruction 200 can specify the same operation, or common instruction, to be performed on the source data elements of a source vector operand expressed in terms of a single 64-bit register.

The same operation or common instruction specified in SIMD instruction 200 can be, for example, a square function, a left-shift function, an increment function, or addition of a constant value on 8-bit source data elements (which can be implemented with logic elements such as eight 8-bit left shifters, eight 8-bit adders, etc.), producing eight corresponding destination data elements, each of which can consume up to 16 bits of storage. As shown, SIMD instruction 200 can specify source vector operand 202 comprising eight 8-bit data elements. These eight 8-bit data elements of source vector operand 202 can be assigned a numerical order, shown by reference numerals 0-7. The result of SIMD instruction 200 can be expressed using eight 16-bit destination data elements, or 128 bits altogether, which cannot be stored in a single 64-bit register. Rather than decomposing SIMD instruction 200 into two or more instructions to handle this problem (e.g., as in the conventional implementation of SIMD instruction 100 shown in FIGS. 1A-1C), the destination vector operand is specified as a pair of component vector operands. The pair of component destination vector operands can be expressed as a corresponding pair of registers 204x, 204y. Note that the pair of registers need not be stored in consecutive physical locations in the register file, and can even have consecutive logical register numbers. Thus, SIMD instruction 200 specifies a destination vector operand expressed in terms of a pair of component vector operands or registers 204x, 204y (e.g., a pair of 64-bit registers) and source vector operand 202 expressed as a single register 202.

Further, the first component destination vector operand, expressed as first register 204x of the pair, includes a first subset of the results of SIMD instruction 200 performed on the even-numbered source data elements 0, 2, 4, and 6 of source vector operand 202. These results are shown as destination data elements A, C, E, and G, which have a one-to-one correspondence with even-numbered source data elements 0, 2, 4, and 6, meaning that in this exemplary arrangement of destination data elements A, C, E, and G, large movement of the results across SIMD lanes is avoided. Similarly, the second component destination vector operand, expressed as second register 204y of the pair, includes a second subset of the results of SIMD instruction 200 performed on the odd-numbered source data elements 1, 3, 5, and 7 of source vector operand 202. These results are shown as destination data elements B, D, F, and H, which have a one-to-one correspondence with odd-numbered source data elements 1, 3, 5, and 7, again meaning that large movement of the results across SIMD lanes is avoided in this exemplary arrangement. Accordingly, in this case, even-numbered source data elements 0, 2, 4, and 6 of source vector operand 202 correspond to, or generate, destination data elements A, C, E, and G of first register 204x, and odd-numbered source data elements 1, 3, 5, and 7 of source vector operand 202 correspond to, or generate, destination data elements B, D, F, and H of second register 204y.
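As a non-limiting sketch of the arrangement of FIG. 2A, the following models a 64-bit source register holding eight 8-bit elements and a square function that writes its results to a register pair. The helper names `unpack`, `pack`, and `widen_square`, the choice of unsigned arithmetic, and the placement of element 0 in the lowest-order bits of a register are all assumptions made for illustration; the disclosure does not specify a bit ordering.

```python
def unpack(reg, width, count):
    """Split an integer register image into `count` elements of `width` bits.
    Assumption: element 0 occupies the lowest-order bits."""
    mask = (1 << width) - 1
    return [(reg >> (width * k)) & mask for k in range(count)]


def pack(elems, width):
    """Inverse of unpack: element 0 goes into the lowest-order bits."""
    mask = (1 << width) - 1
    reg = 0
    for k, e in enumerate(elems):
        reg |= (e & mask) << (width * k)
    return reg


def widen_square(src202):
    """Square eight 8-bit elements of one 64-bit register and return the
    results as two 64-bit register images (reg204x, reg204y)."""
    src = unpack(src202, 8, 8)
    results = [s * s for s in src]        # each result fits in 16 bits
    reg204x = pack(results[0::2], 16)     # A, C, E, G from elements 0, 2, 4, 6
    reg204y = pack(results[1::2], 16)     # B, D, F, H from elements 1, 3, 5, 7
    return reg204x, reg204y


src = pack([1, 2, 3, 4, 5, 6, 7, 255], 8)
x, y = widen_square(src)
```

Both returned values fit in 64 bits, so a single instruction of this form fills exactly one register pair with the eight 16-bit results.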

Considering eight 8-bit SIMD lanes, referred to for example as SIMD lanes 0-7, with each lane comprising a respective one of source data elements 0-7, it is seen that the amount of movement needed to generate the corresponding destination data elements A-H is contained within the same SIMD lane or an adjacent SIMD lane. In other words, the source data elements of the first set (e.g., source data elements 0-7) are in respective SIMD lanes, and from each of the source data elements a destination data element (e.g., the corresponding one of destination data elements A-H) is generated in the respective SIMD lane or in a SIMD lane adjacent to it. For example, even-numbered source data elements 0, 2, 4, and 6 in SIMD lanes 0, 2, 4, and 6 generate destination data elements A, C, E, and G, respectively, which are contained within SIMD lanes 0-1, 2-3, 4-5, and 6-7, respectively. Similarly, odd-numbered source data elements 1, 3, 5, and 7 in SIMD lanes 1, 3, 5, and 7 generate destination data elements B, D, F, and H, respectively, which are likewise contained within SIMD lanes 0-1, 2-3, 4-5, and 6-7, respectively.
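The lane-movement observation above can be checked with a small calculation. In the sketch below, lanes are counted in 8-bit units and the destination registers are assumed to be aligned lane-for-lane with the source register. The function `dest_lanes_halves` models a hypothetical contrast layout in which the lower half of the results fills one register and the upper half the other; that layout is an assumption introduced only for comparison and is not taken from the disclosure.

```python
def dest_lanes_even_odd(i):
    """8-bit lanes covered by the 16-bit result of source element i when
    even and odd results go to separate registers (the FIG. 2A layout)."""
    slot = i // 2                      # position inside register 204x or 204y
    return (2 * slot, 2 * slot + 1)


def dest_lanes_halves(i, n=8):
    """Hypothetical contrast layout (an assumption, not from the text):
    results of elements 0..n/2-1 fill one register, the rest the other."""
    slot = i % (n // 2)
    return (2 * slot, 2 * slot + 1)


def max_shift(layout, n=8):
    """Largest distance, in lanes, between a source lane and the farthest
    lane occupied by its result."""
    return max(max(abs(i - lane) for lane in layout(i)) for i in range(n))


even_odd_shift = max_shift(dest_lanes_even_odd)
halves_shift = max_shift(dest_lanes_halves)
```

With eight elements, the even/odd layout never moves data more than one lane, whereas the hypothetical half-split layout moves some results by four lanes.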

Accordingly, in the first exemplary aspect of FIG. 2A, mixed-width SIMD instruction 200 makes efficient use of instruction space or code space (since only one SIMD instruction is used, rather than two or more component SIMD instructions), and its implementation or execution avoids large data movement across SIMD lanes.

With reference now to FIG. 2B, another exemplary aspect is illustrated with respect to mixed-width SIMD instruction 220. SIMD instruction 220 involves two source vector operands, a first source vector operand expressed as single register 222 and a second source vector operand expressed as single register 223, which have a first set and a second set of four 16-bit source data elements, respectively. SIMD instruction 220 can specify the same or a common operation, such as multiplication (e.g., with rounding), on the two source vector operands, in which the four 16-bit source data elements of the first set (in register 222) are multiplied by the corresponding four 16-bit source data elements of the second set (in register 223) to produce four 32-bit results (an implementation of SIMD instruction 220 can include logic elements such as four 16×16 multipliers). Since 128 bits are needed to store these four 32-bit results, the destination vector operand is specified in terms of a pair of component vector operands, namely a first component destination vector operand and a second component destination vector operand (which can correspondingly be expressed as first 64-bit register 224x and second 64-bit register 224y). Note that SIMD instruction 220 is also applicable to addition of the source data elements of the first set with the corresponding source data elements of the second set, where the corresponding results can consume more than 16 bits (even if not all 32 bits) per destination data element.

In FIG. 2B, the first and second sets of source data elements are assigned an order, representatively shown as 0, 1, 2, 3 and 0', 1', 2', 3', respectively. The first component destination vector operand in first register 224x holds a first subset of the results of SIMD instruction 220 (shown as 32-bit destination data elements A and C), corresponding to the even-numbered source data elements of source operands 222 and 223; similarly, the second component destination vector operand in second register 224y holds a second subset of the results (shown as 32-bit destination data elements B and D), corresponding to the odd-numbered source data elements of source operands 222 and 223. In this case, even-numbered source data elements (0, 0') and (2, 2') of first source vector operand 222 and second source vector operand 223 generate data elements A and C, respectively, of first destination vector operand 224x, and odd-numbered source data elements (1, 1') and (3, 3') generate data elements B and D, respectively, of second destination vector operand 224y.
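For illustration, the two-source arrangement of FIG. 2B might be modeled as follows. Registers are represented as Python lists of element values, the function name `widening_multiply` is hypothetical, the arithmetic is assumed to be unsigned, and the optional rounding mentioned in the text is omitted.

```python
def widening_multiply(src222, src223):
    """Multiply corresponding 16-bit elements of two four-element sources and
    return (reg224x, reg224y), each holding two 32-bit products."""
    assert len(src222) == len(src223) == 4
    products = [(a * b) & 0xFFFFFFFF for a, b in zip(src222, src223)]
    reg224x = products[0::2]    # A, C from pairs (0, 0') and (2, 2')
    reg224y = products[1::2]    # B, D from pairs (1, 1') and (3, 3')
    return reg224x, reg224y


x, y = widening_multiply([1000, 2000, 3000, 65535], [10, 20, 30, 65535])
```

The largest unsigned 16-bit product, 65535 × 65535, still fits in 32 bits, which is why two 64-bit registers suffice for the four results.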

Again, in the second exemplary aspect of FIG. 2B, it is seen that mixed-width SIMD instruction 220 achieves code space efficiency by using a single mixed-width SIMD instruction rather than two or more component SIMD instructions. It is also seen that movement across SIMD lanes is minimized in this aspect as well. In general, the source data elements of the first set and of the second set are in respective SIMD lanes, and from each source data element of the first set and the corresponding source data element of the second set, a destination data element is generated in the respective SIMD lane or in a SIMD lane adjacent to it. For example, considering four 16-bit SIMD lanes 0-3 comprising source data elements 0-3 of the first set (or source data elements 0'-3' of the second set), the data movement for the first and second source data elements to generate the corresponding destination data elements A-D is contained within the same SIMD lane and at most an adjacent SIMD lane (e.g., even-numbered source data elements (0, 0') and (2, 2') in SIMD lanes 0 and 2 generate destination data elements A and C in SIMD lanes 0-1 and 2-3, respectively; similarly, odd-numbered source data elements (1, 1') and (3, 3') in SIMD lanes 1 and 3 generate destination data elements B and D in SIMD lanes 0-1 and 2-3, respectively).

FIG. 2C represents a third exemplary aspect pertaining to mixed-width SIMD instruction 240. Unlike mixed-width SIMD instructions 200 and 220, the source vector operand of mixed-width SIMD instruction 240 is specified as a pair of component vector operands, or expressed as a pair of registers. Note that mixed-width SIMD instruction 240 differs from mixed-width SIMD instruction 220: instruction 220 includes two separate source vector operands, with the data elements of one source vector operand specified to interact with (e.g., be multiplied by) the data elements of the other. In mixed-width SIMD instruction 240, by contrast, a pair of component source vector operands is specified because the operation would otherwise consume two separate instructions. For example, SIMD instruction 240 can involve a common operation of a 16-bit to 8-bit right-shift function to be performed on eight 16-bit source data elements to obtain a result of eight 8-bit destination data elements (an implementation of SIMD instruction 240 can include logic elements such as eight 8-bit right shifters). However, since eight 16-bit source data elements consume 128 bits, a conventional implementation would split this operation so that it is performed using two component SIMD instructions. In the exemplary aspect of FIG. 2C, on the other hand, a pair of source vector operands comprising a first component source vector operand in first register 242x and a second component source vector operand in second register 242y is specified by SIMD instruction 240. Accordingly, code space is used efficiently.

The destination vector operand is expressed in this case as a single 64-bit register 244 and comprises eight 8-bit destination data elements, which are the results of SIMD instruction 240. Accordingly, an order is assigned to the destination data elements of the destination vector operand in register 244, shown by reference numerals 0-7. The source data elements of the pair of component source vector operands (expressed as the pair of registers 242x, 242y) are arranged such that first register 242x, comprising a first subset of source data elements A, C, E, and G, generates the results corresponding to even-numbered destination data elements 0, 2, 4, and 6, respectively, of the destination vector operand in register 244, and second register 242y, comprising a second subset of source data elements B, D, F, and H, generates the results corresponding to odd-numbered destination data elements 1, 3, 5, and 7, respectively, of the destination vector operand in register 244.
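A minimal sketch of the narrowing arrangement of FIG. 2C is given below, again modeling registers as Python lists of unsigned element values. The function name `narrowing_shift_right` and the default shift amount of 8 are assumptions for illustration; the disclosure only states that the operation is a 16-bit to 8-bit right shift.

```python
def narrowing_shift_right(reg242x, reg242y, shift=8):
    """Shift each 16-bit source element right and keep the low 8 bits.
    reg242x holds A, C, E, G and reg242y holds B, D, F, H."""
    dest244 = [0] * 8
    dest244[0::2] = [(s >> shift) & 0xFF for s in reg242x]   # elements 0, 2, 4, 6
    dest244[1::2] = [(s >> shift) & 0xFF for s in reg242y]   # elements 1, 3, 5, 7
    return dest244


dest = narrowing_shift_right([0x0100, 0x0300, 0x0500, 0x0700],
                             [0x0200, 0x0400, 0x0600, 0xFF00])
```

The slice assignments interleave the two source registers into the single destination register, so each result lands next to the lanes its source element occupied.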

Accordingly, even where the source vector operand is wider than the destination vector operand, code space can be used effectively by specifying a pair of component source vector operands, or by expressing the source vector operand as a pair of registers. Movement across SIMD lanes during execution of SIMD instruction 240 is also minimized. In general, it is seen that the destination data elements are in respective SIMD lanes, and each of the destination data elements is generated from a source data element in the respective SIMD lane or in a SIMD lane adjacent to it. For example, considering eight 8-bit SIMD lanes corresponding to the eight destination data elements 0-7, it is seen that source data elements A, C, E, and G move from SIMD lanes 0-1, 2-3, 4-5, and 6-7, respectively, to generate the results corresponding to the even-numbered destination data elements in SIMD lanes 0, 2, 4, and 6, and source data elements B, D, F, and H move from SIMD lanes 0-1, 2-3, 4-5, and 6-7, respectively, to generate the results corresponding to the odd-numbered destination data elements in SIMD lanes 1, 3, 5, and 7. In either case, the movement is contained within two SIMD lanes.

Accordingly, it will be appreciated that aspects include various methods for performing the processes, functions, and/or algorithms disclosed herein. For example, as illustrated in FIG. 3A, an aspect can include method 300 of performing a mixed-width single instruction multiple data (SIMD) operation, e.g., according to FIGS. 2A-2B.

In block 302, method 300 includes receiving, by a processor (e.g., processor 402 of FIG. 4, described below), and with reference for example to FIG. 2A, a SIMD instruction (e.g., SIMD instruction 200) comprising at least a first source vector operand (e.g., in register 202) comprising a first set of source data elements (e.g., source data elements 0-7) of a first bit width (e.g., 8 bits), and at least a destination vector operand (e.g., in register pair 204x, 204y) comprising destination data elements (e.g., destination data elements A-H) of a second bit width (e.g., 16 bits), where the second bit width is twice the first bit width, and where the destination vector operand comprises a pair of registers including a first register (e.g., 204x) comprising a first subset of the destination data elements (e.g., destination data elements A, C, E, G) and a second register (e.g., 204y) comprising a second subset of the destination data elements (e.g., destination data elements B, D, F, H).

In block 303 (shown as including blocks 304 and 306), method 300 further includes executing the mixed-width SIMD instruction in the processor. Specifically, given the order (e.g., 0-7) assigned to the source data elements in block 304, block 306 includes executing the SIMD instruction in the processor. In further detail, block 306 is made up of component blocks 306a and 306b, which can be performed in parallel.

Block 306a includes generating the first subset of destination data elements (e.g., destination data elements A, C, E, G) in the first register (e.g., first register 204x) from the even-numbered source data elements of the first set (e.g., source data elements 0, 2, 4, 6).

Block 306b includes generating the second subset of destination data elements (e.g., destination data elements B, D, F, H) in the second register (e.g., second register 204y) from the odd-numbered source data elements of the first set (e.g., source data elements 1, 3, 5, 7).

In general, the SIMD instruction of method 300 can be one of a square function, a left-shift function, an increment, or addition of a constant value, applied to the source data elements of the first set. Code space efficiency is achieved by using a single SIMD instruction in method 300. Movement across SIMD lanes is also minimized in method 300: the source data elements of the first set are in respective SIMD lanes, and method 300 includes generating, from each of the source data elements (e.g., source data element 0 in SIMD lane 0), a destination data element (e.g., destination data element A) in the respective SIMD lane (e.g., SIMD lane 0) or in a SIMD lane adjacent to it (e.g., SIMD lane 1).

Although not shown separately, note that method 300 can also include a method for implementing SIMD instruction 220 of FIG. 2B, which further comprises, for example in block 302, receiving a second source vector operand comprising a second set of source data elements of the first bit width (e.g., the first and second source vector operands in registers 222 and 223), where the order of the first set of source data elements corresponds to the order of the second set of source data elements. In this case, based on the order assigned in block 304, block 306 includes executing the SIMD instruction in the processor, comprising block 306a for generating the first subset of destination data elements in the first register from the even-numbered source data elements of the first set and the even-numbered source data elements of the second set, and block 306b for generating the second subset of destination data elements in the second register from the odd-numbered source data elements of the first set and the odd-numbered source data elements of the second set. In this case, the SIMD instruction can be a multiplication or addition of the source data elements of the first set with the corresponding source data elements of the second set, where the source data elements of the first set and of the second set are in respective SIMD lanes, and from each source data element of the first set and the corresponding source data element of the second set, a destination data element is generated in the respective SIMD lane or in a SIMD lane adjacent to it.
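The blocks of method 300 can be sketched in a form that covers both the single-source case and the two-source case. The function name `method_300`, the `dst_bits` parameter, and the list-based register model are assumptions introduced for this sketch. Blocks 306a and 306b are evaluated one after the other here, although the text notes that they can be performed in parallel.

```python
from operator import add, mul


def method_300(op, *sources, dst_bits=16):
    """Sketch of blocks 304 to 306: order the source elements, then build
    the register pair from even-numbered and odd-numbered elements."""
    n = len(sources[0])
    assert all(len(s) == n for s in sources)    # source orders correspond
    mask = (1 << dst_bits) - 1
    order = range(n)                             # block 304: assigned order
    first = [op(*(s[i] for s in sources)) & mask
             for i in order if i % 2 == 0]       # block 306a: first register
    second = [op(*(s[i] for s in sources)) & mask
              for i in order if i % 2 == 1]      # block 306b: second register
    return first, second


# Single source operand (FIG. 2A style): increment each 8-bit element.
inc = method_300(lambda v: v + 1, [0, 1, 2, 3, 4, 5, 6, 255])
# Two source operands (FIG. 2B style): multiply, then add, 16-bit elements.
prod = method_300(mul, [1, 2, 3, 4], [5, 6, 7, 8], dst_bits=32)
total = method_300(add, [65535, 1, 2, 3], [65535, 1, 2, 3], dst_bits=32)
```

In the increment example, the element 255 produces 256, which no longer fits in 8 bits but is held without loss in the 16-bit destination element; the addition example shows the same effect for 16-bit sources.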

With reference to FIG. 3B, another method for performing the processes, functions, and/or algorithms disclosed herein is illustrated. For example, as shown in FIG. 3B, method 350 is another method of performing a mixed-width single instruction multiple data (SIMD) operation, e.g., according to FIG. 2C.

In block 352, method 350 includes receiving, by a processor (e.g., processor 402), a SIMD instruction (e.g., SIMD instruction 240) comprising at least a source vector operand (e.g., in registers 242x, 242y) comprising source data elements (e.g., source data elements A-H) of a first bit width (e.g., 16 bits), and at least a destination vector operand (e.g., in register 244) comprising destination data elements (e.g., destination data elements 0-7) of a second bit width (e.g., 8 bits), where the second bit width is half the first bit width, and where the source vector operand comprises a pair of registers including a first register (e.g., first register 242x) comprising a first subset of the source data elements (e.g., source data elements A, C, E, G, which correspond to destination data elements 0, 2, 4, 6) and a second register (e.g., second register 242y) comprising a second subset of the source data elements (e.g., source data elements B, D, F, H, which correspond to destination data elements 1, 3, 5, 7).

In block 354, an order is assigned to the destination data elements, and in block 356, the SIMD instruction is executed. Block 356 includes sub-blocks 356a and 356b, which can also be performed in parallel.

Block 356a includes generating the even-numbered destination data elements (e.g., destination data elements 0, 2, 4, 6) from the corresponding first subset of source data elements in the first register (e.g., source data elements A, C, E, G).

Block 356b includes generating the odd-numbered destination data elements (e.g., destination data elements 1, 3, 5, 7) from the corresponding second subset of source data elements in the second register (e.g., source data elements B, D, F, H).

In exemplary aspects, the SIMD instruction of method 350 may be a right-shift function of the source data elements. The destination data elements are in respective SIMD lanes (e.g., SIMD lanes 0-7), and each of the destination data elements (e.g., destination data element 0) is generated from a source data element (e.g., source data element A) in the respective SIMD lane (e.g., SIMD lane 0) or in a SIMD lane adjacent to the respective SIMD lane (e.g., SIMD lane 1).
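The data flow of blocks 352 to 356 can be illustrated with a small software model. The following is a minimal sketch only, not the claimed hardware: the function name `narrowing_shift_right`, its parameters, and the choice to keep the low 8 bits of each shifted value (truncation rather than saturation or rounding) are assumptions made for illustration, since the description does not specify them. Each Python list stands in for one vector register.

```python
def narrowing_shift_right(reg_x, reg_y, shift, src_bits=16):
    """Model a mixed-width right shift from a register pair into one register.

    reg_x holds the first subset of wide source elements (A, C, E, G) and
    reg_y holds the second subset (B, D, F, H). The result models a single
    destination register whose elements are half as wide as the sources.
    Keeping only the low half-width bits is an illustrative assumption.
    """
    if len(reg_x) != len(reg_y):
        raise ValueError("both source registers must hold the same number of elements")
    dst_bits = src_bits // 2
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << dst_bits) - 1
    dest = [0] * (2 * len(reg_x))
    for i, (even_src, odd_src) in enumerate(zip(reg_x, reg_y)):
        # Sub-block 356a: even-numbered destination elements come from the first register.
        dest[2 * i] = ((even_src & src_mask) >> shift) & dst_mask
        # Sub-block 356b: odd-numbered destination elements come from the second register.
        dest[2 * i + 1] = ((odd_src & src_mask) >> shift) & dst_mask
    return dest


reg_242x = [0x1200, 0x3400, 0x5600, 0x7800]  # source data elements A, C, E, G
reg_242y = [0x9A00, 0xBC00, 0xDE00, 0xF000]  # source data elements B, D, F, H
reg_244 = narrowing_shift_right(reg_242x, reg_242y, shift=8)
print([hex(v) for v in reg_244])
```

In this model the two assignments inside the loop are independent of each other, which mirrors the statement that sub-blocks 356a and 356b may be performed in parallel.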

Referring to FIG. 4, a block diagram of a particular illustrative aspect of a wireless device 400 according to exemplary aspects is depicted. Wireless device 400 includes a processor 402 that may be configured (e.g., may include execution logic) to support and implement the execution of exemplary mixed-width SIMD instructions, for example according to method 300 of FIG. 3A and method 350 of FIG. 3B. As shown in FIG. 4, processor 402 may be in communication with memory 432. Processor 402 may include a register file (not shown) that holds physical registers corresponding to the registers (e.g., logical registers) in terms of which the operands of the exemplary SIMD instructions are expressed. In some aspects, the register file may be supplied with data from memory 432. Although not shown, one or more caches or other memory structures may also be included in wireless device 400.

FIG. 4 also shows a display controller 426 that is coupled to processor 402 and to a display 428. A coder/decoder (codec) 434 (e.g., an audio and/or voice codec) can be coupled to processor 402. Other components, such as a wireless controller 440 (which may include a modem), are also illustrated. A speaker 436 and a microphone 438 can be coupled to codec 434. FIG. 4 also indicates that wireless controller 440 can be coupled to a wireless antenna 442. In a particular aspect, processor 402, display controller 426, memory 432, codec 434, and wireless controller 440 are included in a system-in-package or system-on-chip device 422.

In a particular aspect, an input device 430 and a power supply 444 are coupled to the system-on-chip device 422. Moreover, in a particular aspect, as illustrated in FIG. 4, display 428, input device 430, speaker 436, microphone 438, wireless antenna 442, and power supply 444 are external to the system-on-chip device 422. However, each of display 428, input device 430, speaker 436, microphone 438, wireless antenna 442, and power supply 444 can be coupled to a component of the system-on-chip device 422, such as an interface or a controller.

It should be noted that although FIG. 4 depicts a wireless communications device, processor 402 and memory 432 may also be integrated into a set-top box, a music player, a video player, an entertainment unit, a navigation device, a personal digital assistant (PDA), a fixed-location data unit, a communications device, or a computer. Further, at least one or more exemplary aspects of wireless device 400 may be integrated in at least one semiconductor die.

Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The methods, sequences, and/or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.

Accordingly, an aspect of the invention can include a computer-readable medium (e.g., a non-transitory computer-readable storage medium) embodying a method for implementing mixed-width SIMD instructions (e.g., for implementing the SIMD instructions of FIGS. 2A-2C according to methods 300 and 350 described above). Accordingly, the invention is not limited to the illustrated examples, and any means for performing the functionality described herein are included in aspects of the invention.
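As a rough illustration of what a software embodiment might model, the sketch below emulates the widening counterpart to method 350, in which a single source register feeds a pair of destination registers whose elements are twice as wide. It is a sketch under stated assumptions rather than the patented implementation: the function name `widening_square`, the use of unsigned arithmetic, and the choice of the square function as the operation are illustrative, and Python lists stand in for vector registers.

```python
def widening_square(src, src_bits=8):
    """Model a mixed-width square from one register into a register pair.

    src models a single source register of narrow elements numbered 0, 1, 2, ...
    The first returned list models the destination register that receives the
    results for even-numbered source elements, and the second models the
    register that receives the results for odd-numbered source elements.
    Unsigned elements are an illustrative assumption.
    """
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << (2 * src_bits)) - 1
    # The square of an unsigned n-bit value always fits in 2n bits.
    squares = [((x & src_mask) ** 2) & dst_mask for x in src]
    first_dest = squares[0::2]   # from even-numbered source elements
    second_dest = squares[1::2]  # from odd-numbered source elements
    return first_dest, second_dest


source_register = [1, 2, 3, 4, 5, 6, 7, 255]  # eight 8-bit source elements
dest_first, dest_second = widening_square(source_register)
print(dest_first, dest_second)
```

Splitting the results by even and odd element number keeps each wide result in a register position aligned with, or next to, the position of the narrow element it came from, which is the property the lane mapping is meant to preserve.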

While the foregoing disclosure shows illustrative aspects of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps, and/or actions of the method claims in accordance with the aspects of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

100 SIMD instruction
100X SIMD instruction
100Y SIMD instruction
102 source operand
104x destination operand
104y destination operand
120X SIMD instruction
120Y SIMD instruction
200 SIMD instruction
202 source vector operand
202 single register
204x first register
204y second register
220 SIMD instruction
220 mixed-width SIMD instruction
222 single register
222 source operand
222 first source vector operand
223 single register
223 source operand
223 second source vector operand
224x first register
224x first destination vector operand
224y second register
224y second destination vector operand
240 SIMD instruction
240 mixed-width SIMD instruction
242x first register
242y second register
244 register
300 method
350 method
400 wireless device
402 processor
422 system-on-chip device
426 display controller
428 display
430 input device
432 memory
434 coder/decoder (codec)
436 speaker
438 microphone
440 wireless controller
442 wireless antenna
444 power supply

Claims

A method of performing a mixed-width single instruction multiple data (SIMD) operation, the method comprising:
receiving, by a processor, a SIMD instruction comprising:
a first source vector operand comprising a first source register, the first source register comprising a first set of source data elements of a first bit-width; and
a destination vector operand comprising destination data elements of a second bit-width,
wherein the second bit-width is twice the first bit-width,
wherein the destination vector operand comprises a pair of destination registers including a first destination register comprising a first subset of the destination data elements and a second destination register comprising a second subset of the destination data elements, and
wherein the first source register is a single register corresponding to the pair of destination registers; and
executing the SIMD instruction in the processor based on an order of the source data elements of the first set, wherein the order of the source data elements of the first set is assigned to provide a mapping between the source data elements of the first set and the destination data elements, the executing comprising:
generating the first subset of the destination data elements in the first destination register from even-numbered source data elements of the first set; and
generating the second subset of the destination data elements in the second destination register from odd-numbered source data elements of the first set,
wherein the source data elements of the first set are in respective SIMD lanes, and
wherein, from each of the source data elements, a respective destination data element is generated in the respective SIMD lane or in a SIMD lane adjacent to the respective SIMD lane, according to the mapping.

The method of claim 1, wherein the SIMD instruction is one of a square function, a left-shift function, an increment, or an addition of a constant value to the source data elements of the first set.

A method of performing a mixed-width single instruction multiple data (SIMD) operation, the method comprising:
receiving, by a processor, a SIMD instruction comprising:
a source vector operand comprising source data elements of a first bit-width; and
a destination vector operand comprising a destination register, the destination register comprising destination data elements of a second bit-width,
wherein the second bit-width is half of the first bit-width,
wherein the source vector operand comprises a pair of source registers including a first source register comprising a first subset of the source data elements and a second source register comprising a second subset of the source data elements, and
wherein the destination register is a single register corresponding to the pair of source registers; and
executing the SIMD instruction in the processor based on an order of the destination data elements, wherein the order of the destination data elements is assigned to provide a mapping between the source data elements and the destination data elements, the executing comprising:
generating even-numbered destination data elements from the first subset of the source data elements; and
generating odd-numbered destination data elements from the second subset of the source data elements,
wherein the destination data elements are in respective SIMD lanes, and
wherein each of the destination data elements is generated from a source data element in the respective SIMD lane or in a SIMD lane adjacent to the respective SIMD lane, according to the mapping.

The method of claim 3, wherein the SIMD instruction is a right-shift function of the source data elements.

A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform a mixed-width single instruction multiple data (SIMD) operation, the non-transitory computer-readable storage medium comprising:
a SIMD instruction comprising:
a first source vector operand comprising a first source register, the first source register comprising a first set of source data elements of a first bit-width; and
a destination vector operand comprising destination data elements of a second bit-width,
wherein the second bit-width is twice the first bit-width,
wherein the destination vector operand comprises a pair of destination registers including a first destination register comprising a first subset of the destination data elements and a second destination register comprising a second subset of the destination data elements, and
wherein the first source register is a single register corresponding to the pair of destination registers; and
based on an order of the source data elements of the first set:
code for generating the first subset of the destination data elements in the first destination register from even-numbered source data elements of the first set; and
code for generating the second subset of the destination data elements in the second destination register from odd-numbered source data elements of the first set, wherein the order of the source data elements of the first set is assigned to provide a mapping between the source data elements of the first set and the destination data elements,
wherein the source data elements of the first set are in respective SIMD lanes, and
code for generating, from each of the source data elements, a respective destination data element in the respective SIMD lane or in a SIMD lane adjacent to the respective SIMD lane, according to the mapping.

The non-transitory computer-readable storage medium of claim 5, wherein the SIMD instruction is one of a square function, a left-shift function, an increment, or an addition of a constant value to the source data elements of the first set.

A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform a mixed-width single instruction multiple data (SIMD) operation, the non-transitory computer-readable storage medium comprising:
a SIMD instruction comprising:
a source vector operand comprising source data elements of a first bit-width; and
a destination vector operand comprising a destination register, the destination register comprising destination data elements of a second bit-width,
wherein the second bit-width is half of the first bit-width,
wherein the source vector operand comprises a pair of source registers including a first source register comprising a first subset of the source data elements and a second source register comprising a second subset of the source data elements, and
wherein the destination register is a single register corresponding to the pair of source registers; and
based on an order of the destination data elements:
code for generating even-numbered destination data elements from the first subset of the source data elements; and
code for generating odd-numbered destination data elements from the second subset of the source data elements, wherein the order of the destination data elements is assigned to provide a mapping between the source data elements and the destination data elements,
wherein the destination data elements are in respective SIMD lanes, and
code for generating each of the destination data elements from a source data element in the respective SIMD lane or in a SIMD lane adjacent to the respective SIMD lane, according to the mapping.

The non-transitory computer-readable storage medium of claim 7, wherein the SIMD instruction is a right-shift function of the source data elements.