CN104011670B

CN104011670B - Instruction to store one of two scalar constants in a general purpose register based on the contents of a vector write mask

Info

Publication number: CN104011670B
Application number: CN201180075835.8A
Authority: CN
Inventors: J·考博尔; M·J·克莱格德; B·L·托尔; A·T·福塞斯
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2011-12-22
Filing date: 2011-12-22
Publication date: 2016-12-28
Anticipated expiration: 2031-12-22
Also published as: US10157061B2; CN104011670A; TWI496079B; US20140297991A1; TW201337740A; WO2013095553A1

Abstract

According to one embodiment, an occurrence of an instruction is fetched. The format of the instruction specifies its only source operand from a single vector write mask register and specifies a single general purpose register as its destination. The format of the instruction also includes a first field and a second field: the content of the first field selects the single vector write mask register, and the content of the second field selects the single general purpose register. The source operand is a write mask that includes a plurality of one-bit vector write mask elements, which correspond to different multi-bit data element positions within architectural vector registers. The method also includes, in response to executing the single occurrence of the instruction, storing data in the single general purpose register such that its content represents either a first or a second scalar constant, depending on whether the plurality of one-bit vector write mask elements in the source operand are all zero.

Description

Instruction to store one of two scalar constants in a general purpose register based on the contents of a vector write mask

Technical field

Embodiments of the invention relate to the field of processors; more specifically, to an instruction for setting a scalar value in a general purpose register based on the contents of a write mask.

Background

An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). It should be noted that the term instruction herein generally refers to macro-instructions, that is, instructions provided to the processor for execution (or to an instruction converter that translates (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morphs, emulates, or otherwise converts an instruction into one or more other instructions to be processed by the processor), as opposed to micro-instructions or micro-operations (micro-ops), which are the result of a processor's decoder decoding macro-instructions.

The instruction set architecture is distinguished from the microarchitecture, which is the internal design of the processor implementing the ISA. Processors with different microarchitectures can share a common instruction set. For example, Intel Pentium 4 processors, Intel Core processors, and processors from Advanced Micro Devices, Inc. of Sunnyvale, CA implement nearly identical versions of the x86 instruction set (with some extensions added in newer versions), but have different internal designs. For example, the same register architecture of the ISA may be implemented in different ways in different microarchitectures using well-known techniques, including dedicated physical registers, or one or more dynamically allocated physical registers using a register renaming mechanism (e.g., a register alias table (RAT), a reorder buffer (ROB), and a retirement register file; or multiple maps and a pool of registers). Unless otherwise specified, the phrases "register architecture" and "register file", and the term register, refer to that which is visible to the software/programmer and to the manner in which instructions specify registers. Where specificity is desired, the adjectives "logical", "architectural", or "software visible" will be used to indicate registers/files in the register architecture, while different adjectives will be used to designate registers in a given microarchitecture (e.g., physical register, reorder buffer, retirement register, register pool).

An instruction set includes one or more instruction formats. A given instruction format defines various fields (number of bits, location of bits) to specify, among other things, the operation to be performed (opcode) and the operand(s) on which that operation is to be performed. Thus, each instruction of an ISA is expressed using a given instruction format and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source 1/destination and source 2); an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands.

Scientific, financial, auto-vectorized general purpose, RMS (recognition, mining, and synthesis), and visual and multimedia applications (e.g., 2D/3D graphics, image processing, video compression/decompression, speech recognition algorithms, and audio manipulation) often require the same operation to be performed on a large number of data items (referred to as "data parallelism"). Single Instruction Multiple Data (SIMD) refers to a type of instruction that causes a processor to perform the same operation on multiple data items. SIMD technology is especially suited to processors that can logically divide the bits in a register into a number of fixed-size data elements, each of which represents a separate value. For example, the bits in a 256-bit register may be specified as a source operand to be operated on as four separate 64-bit packed data elements (quadword (Q) size data elements), eight separate 32-bit packed data elements (doubleword (D) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). This type of data is referred to as the packed data type or vector data type, and operands of this data type are referred to as packed data operands or vector operands. In other words, a packed data item or vector refers to a sequence of packed data elements, and a packed data operand or vector operand is a source or destination operand of a SIMD instruction (also known as a packed data instruction or a vector instruction).
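The partitioning described above can be illustrated with a minimal sketch. It models a register as a Python integer and assumes that data element position 0 occupies the least significant bits; the function name `unpack_elements` and the sample value are illustrative only and do not come from the patent.

```python
def unpack_elements(register_value, register_bits, element_bits):
    """Split a register value into fixed-size elements, position 0 first.

    Assumption for illustration: element position 0 is the least
    significant element of the register.
    """
    if register_bits % element_bits != 0:
        raise ValueError("element size must evenly divide the register size")
    element_mask = (1 << element_bits) - 1
    count = register_bits // element_bits
    return [(register_value >> (i * element_bits)) & element_mask
            for i in range(count)]


# The same 256 bits viewed at the four element sizes named in the text.
value = int.from_bytes(bytes(range(32)), "little")
for name, bits in (("Q", 64), ("D", 32), ("W", 16), ("B", 8)):
    print(name, len(unpack_elements(value, 256, bits)), "elements")
```

The bits in the register are the same in every case; only the interpretation, and therefore the number of elements, changes.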

By way of example, one type of SIMD instruction specifies a single vector operation to be performed on two source vector operands in a vertical fashion to generate a destination vector operand (also referred to as a result vector operand) of the same size, with the same number of data elements, and in the same data element order. The data elements in the source vector operands are referred to as source data elements, while the data elements in the destination vector operand are referred to as destination or result data elements. These source vector operands are of the same size and contain data elements of the same width, and thus they contain the same number of data elements. The source data elements in the same bit positions in the two source vector operands form pairs of data elements (also referred to as corresponding data elements; that is, the data elements in data element position 0 of each source operand correspond, the data elements in data element position 1 of each source operand correspond, and so on). The operation specified by the SIMD instruction is performed separately on each of these pairs of source data elements to generate a matching number of result data elements, so that each pair of source data elements has a corresponding result data element. Since the operation is vertical, and since the result vector operand is the same size, has the same number of data elements, and stores the result data elements in the same data element order as the source vector operands, the result data elements are in the same bit positions of the result vector operand as their corresponding pairs of source data elements in the source vector operands. In addition to this exemplary type of SIMD instruction, there are a variety of other types of SIMD instructions (e.g., those that have only one or more than two source vector operands; that operate in a horizontal fashion; that generate a result vector operand of a different size, with data elements of a different size, and/or with a different data element order). It should be understood that the term "destination vector operand" (or destination operand) is defined as the direct result of performing the operation specified by an instruction, including the storage of that destination operand at a location (be it a register or a memory address specified by that instruction) so that it may be accessed as a source operand by another instruction (by specification of that same location by the other instruction).
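As a minimal sketch of the vertical operation described above, the following applies one operation to each pair of corresponding source data elements. Vectors are modeled as Python lists indexed by data element position; the name `vertical_op` and the wrap-around on overflow are assumptions made for illustration, not details taken from the patent.

```python
import operator


def vertical_op(src1, src2, op, element_bits):
    """Apply `op` to each pair of corresponding source data elements.

    Assumption for illustration: results wrap to the element width.
    """
    if len(src1) != len(src2):
        raise ValueError("source vectors must have the same number of elements")
    wrap = (1 << element_bits) - 1
    # Result element i comes only from source elements at position i.
    return [op(a, b) & wrap for a, b in zip(src1, src2)]


# Four 32-bit elements per source; the last pair overflows and wraps.
result = vertical_op([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1], operator.add, 32)
print(result)
```

Each result element lands in the same position as the pair it was computed from, which is what makes the operation vertical rather than horizontal.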

SIMD technology, such as that employed by Core^TM processors having an instruction set including x86, MMX^TM, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions, has enabled a significant improvement in application performance (Core^TM and MMX^TM are registered trademarks or trademarks of Intel Corporation of Santa Clara, California). An additional set of SIMD extensions, referred to as Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the VEX coding scheme, has been released and/or published (e.g., see Intel 64 and IA-32 Architectures Software Developers Manual, October 2011; and see Intel Advanced Vector Extensions Programming Reference, June 2011).

Brief description of the drawings

The invention may best be understood by referring to the following description and accompanying drawings, which are used to illustrate embodiments of the invention. In the drawings:

Figure 1 is a block diagram illustrating the operation of an exemplary instruction for storing one of two scalar constants in a general purpose register based on a vector write mask, according to certain embodiments of the invention;

Figure 2A is a block diagram illustrating an example for a specific "set GPR if mask is 0" instruction, according to one embodiment of the invention;

Figure 2B is a block diagram illustrating an example for a specific "set GPR if mask is not 0" instruction, according to one embodiment of the invention;

Figure 3 is a flow diagram for processing each occurrence of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector write mask, according to certain embodiments of the invention;

Figure 4 is a flow diagram for executing an occurrence of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector write mask, according to certain embodiments of the invention;

Figure 5 is a block diagram of a specific machine for processing an occurrence of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector write mask, according to certain embodiments of the invention;

Figure 6A is a table illustrating the dependence of the one-bit vector write mask elements on the vector size and the data element size, according to one embodiment of the invention;

Figure 6B is a diagram illustrating a vector write mask register 640 and the bit positions that are used as a write mask depending on the vector size and the data element size, according to one embodiment of the invention;

Figure 7A is a block diagram illustrating an exemplary operation 760 that merges using a write mask from the 64-bit vector write mask register K1, where the vector size is 512 bits and the data element size is 32 bits, according to certain embodiments of the invention;

Figure 7B is a block diagram illustrating an exemplary operation 766 that zeroes using a write mask from the 64-bit write mask register K1, where the vector size is 512 bits and the data element size is 32 bits, according to certain embodiments of the invention;

Figure 8A illustrates pseudocode for a "set GPR if mask is 0" instruction (KSETZB GPR_Y, K_X) that uses an 8-bit sized source operand 134, according to certain embodiments of the invention;

Figure 8B illustrates pseudocode for a "set GPR if mask is 0" instruction (KSETZW GPR_Y, K_X) that uses a 16-bit sized source operand 134, according to certain embodiments of the invention;

Figure 8C illustrates pseudocode for a "set GPR if mask is 0" instruction (KSETZD GPR_Y, K_X) that uses a 32-bit sized source operand 134, according to certain embodiments of the invention;

Figure 8D illustrates pseudocode for a "set GPR if mask is 0" instruction (KSETZQ GPR_Y, K_X) that uses a 64-bit sized source operand 134, according to certain embodiments of the invention;

Figure 9A illustrates pseudocode for a "set GPR if mask is not 0" instruction (KSETNZB GPR_Y, K_X) that uses an 8-bit sized source operand 134, according to certain embodiments of the invention;

Figure 9B illustrates pseudocode for a "set GPR if mask is not 0" instruction (KSETNZW GPR_Y, K_X) that uses a 16-bit sized source operand 134, according to certain embodiments of the invention;

Figure 9C illustrates pseudocode for a "set GPR if mask is not 0" instruction (KSETNZD GPR_Y, K_X) that uses a 32-bit sized source operand 134, according to certain embodiments of the invention;

Figure 9D illustrates pseudocode for a "set GPR if mask is not 0" instruction (KSETNZQ GPR_Y, K_X) that uses a 64-bit sized source operand 134, according to certain embodiments of the invention;

Figure 10A illustrates an exemplary code sequence, written with AVX1/AVX2 instructions, that passes a parameter to a function;

Figure 10B illustrates an exemplary code sequence, written with the KSETZW instruction, that passes a parameter to a function, according to one embodiment of the invention;

Figure 11A illustrates an exemplary code sequence, written with AVX1/AVX2 instructions, that uses pointers and an indirect function call;

Figure 11B illustrates an exemplary code sequence, written with the KSETZW instruction, that uses pointers and an indirect function call, according to one embodiment of the invention;

Figure 12A provides a representation of the VEX C4 encoding;

Figure 12B illustrates which fields from Figure 12A make up the full opcode field 1274 and the base operation field 1242;

Figure 12C illustrates which fields from Figure 12A make up the register index field 1244;

Figure 13A is a block diagram illustrating a generic vector friendly instruction format and its class A instruction templates, according to embodiments of the invention;

Figure 13B is a block diagram illustrating the generic vector friendly instruction format and its class B instruction templates, according to embodiments of the invention;

Figure 14A is a block diagram illustrating an exemplary specific vector friendly instruction format, according to embodiments of the invention;

Figure 14B is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the full opcode field 1374, according to an embodiment of the invention;

Figure 14C is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the register index field 1344, according to one embodiment of the invention;

Figure 14D is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the augmentation operation field 1350, according to one embodiment of the invention;

Figure 15 is a block diagram of a register architecture 1500, according to one embodiment of the invention;

Figure 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline, according to embodiments of the invention;

Figure 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor, according to embodiments of the invention;

Figure 17A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1702 and its local subset 1704 of the level 2 (L2) cache, according to embodiments of the invention;

Figure 17B is an expanded view of part of the processor core in Figure 17A, according to embodiments of the invention;

Figure 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to embodiments of the invention;

Figure 19 is a block diagram of a system 1900, according to an embodiment of the invention;

Figure 20 is a block diagram of a first more specific exemplary system 2000, according to an embodiment of the invention;

Figure 21 is a block diagram of a second more specific exemplary system 2100, according to an embodiment of the invention;

Figure 22 is a block diagram of a SoC 2200, according to an embodiment of the invention; and

Figure 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention.

Detailed description

In the following description, numerous specific details such as logic implementations, opcodes, ways of specifying operands, resource partitioning/sharing/duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices are set forth in order to provide a more thorough understanding of the invention. It will be appreciated by one skilled in the art, however, that the invention may be practiced without these specific details. In other instances, control structures, gate-level circuits, and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.

It should also be understood that references to, for example, "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature may be included in the practice of embodiments of the invention, but that every embodiment need not necessarily include that feature. Similarly, it should be understood that various features are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid in the understanding of the various inventive aspects. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is considered to be within the knowledge of one skilled in the art to implement such a feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described. This method of disclosure, however, is not to be interpreted as reflecting an intention that more features are required than are expressly recited in the claims. Rather, as the following claims reflect, inventive aspects may lie in fewer than all features of a single disclosed embodiment. Thus, the claims following the description are hereby expressly incorporated into this description, with each claim standing on its own as a separate embodiment of the invention.

In the following description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. "Coupled" is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other. "Connected" is used to indicate the establishment of communication between two or more elements that are coupled with each other.

The operations of the flow diagrams will be described with reference to the exemplary embodiments of the block diagrams. However, it should be understood that the operations of the flow diagrams can be performed by embodiments of the invention other than those discussed with reference to the block diagrams, and that the embodiments discussed with reference to the block diagrams can perform operations different from those discussed with reference to the flow diagrams.

To ease understanding, dashed lines have been used in the figures to signify the optional nature of certain items (e.g., features not supported by a given implementation of the invention; features supported by a given implementation, but used in some situations and not in others).

Overview

Figure 1 is a block diagram illustrating the operation of an exemplary instruction for storing one of two scalar constants in a general purpose register based on a vector write mask, according to certain embodiments of the invention. Figure 1 shows architectural vector write mask registers 110, architectural vector registers 120, and architectural general purpose registers 130.

Vector registers 120, individually designated VR_Z, where Z may range from 0 to U, are used to store vector operands. The instruction set architecture includes at least some SIMD instructions that specify vector operations and that have fields to select source registers and/or destination registers from these vector registers 120 (an exemplary SIMD instruction may specify a vector operation to be performed on the contents of one or more of the vector registers 120, with the result of that vector operation stored in one of the vector registers 120). Different embodiments of the invention may have vector registers of different sizes and may support more, fewer, or different sizes of data elements. The size of the multi-bit data elements specified by a SIMD instruction (e.g., byte, word, doubleword, quadword) determines the bit locations of the "data element positions" within a vector register, and the size of the vector operand determines the number of data elements. In other words, depending on the size of the data elements in the destination operand and the size of the destination operand (the total number of bits in the destination operand), or put another way, depending on the size of the destination operand and the number of data elements within it, the bit locations of the multi-bit data element positions within the resulting vector operand change (e.g., if the destination for the resulting vector operand is a vector register, then the bit locations of the multi-bit data element positions within the destination vector register change). For example, the bit locations of the multi-bit data elements differ between a vector operation that operates on 32-bit data elements (data element position 0 occupies bit locations 31:0, data element position 1 occupies bit locations 63:32, and so on) and a vector operation that operates on 64-bit data elements (data element position 0 occupies bit locations 63:0, data element position 1 occupies bit locations 127:64, and so on). This is described in more detail later herein.
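The relationship between data element position, data element size, and bit location in the example above can be checked with a short sketch. The function name `element_bit_range` is hypothetical and introduced only for illustration.

```python
def element_bit_range(position, element_bits):
    """Return (high_bit, low_bit) occupied by a data element position."""
    low = position * element_bits
    return (low + element_bits - 1, low)


# Reproduce the bit locations given in the text for 32-bit and 64-bit elements.
for element_bits in (32, 64):
    for position in (0, 1):
        high, low = element_bit_range(position, element_bits)
        print(f"{element_bits}-bit elements, position {position}: bits {high}:{low}")
```

The same data element position therefore maps to different bit locations as soon as the data element size changes, which is why the mapping of write mask elements to bit locations changes as well.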

Vector write mask registers 110, individually designated K_X, where X may range from 0 to T, are used to store write masks, where a write mask includes a plurality of one-bit vector write mask elements that correspond to different multi-bit data element positions within a destination vector operand. At least some of the SIMD instructions described above include a field to select a write mask from the vector write mask registers 110, and the one-bit write mask elements of the selected write mask control which data element positions in the destination vector operand reflect the result of the vector operation. Because, as described above, the bit locations of the multi-bit data elements of the destination operand change in embodiments that support vector operations on data elements of different sizes (e.g., the bit locations of the multi-bit data elements differ between a vector operation on 32-bit data elements and a vector operation on 64-bit data elements), these embodiments may support different relationships (also referred to as correspondences or mappings) of the one-bit write mask elements to the bit locations within the destination operand; thus, as the bit locations change with the destination operand, so does the mapping of the one-bit vector write mask elements.
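A minimal sketch of how one-bit write mask elements can gate a destination vector follows. It assumes that mask bit i corresponds to data element position i and that a set bit lets the result through; these conventions, the function name `apply_writemask`, and the `zeroing` flag (modeled on the merging and zeroing operations named in the descriptions of Figures 7A and 7B) are illustrative assumptions rather than the patent's definition.

```python
def apply_writemask(mask, result, old_dest, zeroing=False):
    """Combine a computed result with a destination under a write mask.

    Assumptions for illustration: bit i of `mask` governs element
    position i, and a 1 bit means the position takes the new result.
    Masked-off positions keep the old value (merging) or become 0 (zeroing).
    """
    if len(result) != len(old_dest):
        raise ValueError("result and destination must have the same length")
    out = []
    for i, (new, old) in enumerate(zip(result, old_dest)):
        if (mask >> i) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out


merged = apply_writemask(0b0101, [10, 20, 30, 40], [1, 2, 3, 4])
zeroed = apply_writemask(0b0101, [10, 20, 30, 40], [1, 2, 3, 4], zeroing=True)
print(merged, zeroed)
```

Only as many mask bits as there are data elements are consulted here, so a wider mask register with a shorter vector would simply leave its upper bits unused in this model.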

General purpose registers 130, individually designated GPR_Y, where Y may range from 0 to V, are used to store operands for logical operations, arithmetic operations, address calculations, and memory pointers. The instruction set architecture includes scalar instructions that specify scalar operations to be performed on the contents of registers within the general purpose registers 130. The differences between the general purpose registers 130 and the vector registers 120 may vary between embodiments of the invention (e.g., a different total number of vector registers than general purpose registers; vector registers of a different size than the general purpose registers; vector registers that may store data in both integer and floating-point formats, while the general purpose registers store data only in integer format); examples of this are described in more detail later herein.

Although the numbers of registers in the vector writemask registers 110, the vector registers 120, and the general purpose registers 130 are designated T, U, and V, respectively, one or more of these register sets may have the same number of registers. In one embodiment, the value of a given vector writemask register can be set in various ways, including as the direct result of a vector compare instruction, by a transfer from a GPR, or as the direct result of a logical operation between two masks.

In this figure, circled letters indicate the order in which to read the items shown, to ease understanding, and in some cases indicate relationships between those items. At circled A is an instruction 100 whose format includes a first field 102 and a second field 104; the content of the first field 102 selects the source register K_X, and the content of the second field 104 selects the destination register GPR_Y. The instruction 100 belongs to the instruction set architecture, and each "occurrence" of the instruction 100 within an instruction stream includes values in the first field 102 and the second field 104 that select specific registers of the architectural vector writemask registers 110 and the architectural general purpose registers 130, respectively.

Circled B represents the selection of K_X from the vector writemask registers 110 as the source vector writemask register. FIG. 1 separately shows a specific example of the contents 132 of the selected source vector writemask register, in which the vector writemask register is a 64-bit register and exemplary values are shown for bit positions 0:15, 31, and 63.

Similarly, circled C shows the selection of GPR_Y within the general purpose registers 130 as the destination general purpose register.

At dashed circle D, the source operand 134 is selected from the contents of the source vector writemask register K_X. The dash-dot lines with hash marks at different bit positions of the source vector writemask register (specifically, the hash marks are between bit positions 7:8, 15:16, and 31:32) illustrate that the size of the source operand 134 is selectable in some embodiments of the invention. Embodiments that allow source operands 134 of different sizes to be selected may control this selection in various ways (e.g., using different opcodes, or using a value from one of the general purpose registers). Of course, alternative embodiments of the invention may support vector writemask registers of different sizes and/or different, more, or fewer selectable sizes for the source operand 134. Furthermore, some embodiments of the invention may support only a source operand that includes the entire contents of the selected vector writemask register (that is, the source operand 134 is always the same size as the selected vector writemask register).

At circled E, the source operand 134 is operated on to produce a scalar constant (referred to as the destination operand or result). Circled E includes a first block 140, in which it is determined whether all of the one-bit vector writemask elements of the source operand are 0. If so, a first scalar constant (e.g., 1) is output (block 142); otherwise, a second scalar constant (e.g., 0) is output (block 142).

At circled F, the result (which is either the first scalar constant or the second scalar constant) is written back to the general purpose register GPR_Y. Because this is a scalar result that is written back to a general purpose register, in some embodiments the entire contents of the multi-bit general purpose register represent either the first scalar constant (e.g., 00…1) or the second scalar constant (e.g., 00…0), depending on whether the plurality of one-bit writemask elements in the source operand are all 0.

As described in more detail below, the instruction 100 may be implemented such that the two scalar constants are the Boolean values 1 and 0. A Boolean value is thus stored in the selected GPR based on whether all bits of the vector writemask are 0 (in other words, the bits of the selected GPR collectively represent 1 or 0). In this Boolean case, the instruction 100 is referred to as a "set GPR if mask is P" instruction, where P may be 0 or not 0. A Boolean value is thereby generated from the control flow information inherent in the vector writemasks of the ISA. By placing this Boolean value in a GPR rather than in a control flow register (e.g., the carry flag), decisions based on the Boolean value can be data flow based rather than control flow based. Specifically, control flow based decisions rely on instructions that change the flow of execution (e.g., jump, branch), whereas data flow based decisions select between data based on the Boolean value. For example, "set GPR if mask is P" instructions are useful for parameters passed to functions (see FIGS. 10A-B), and for efficient pointer generation and indirect function calls for fast conditional execution of different code segments (see FIGS. 11A-B).

Although FIG. 1 shows a single instruction 100 that stores one of two scalar constants in a general purpose register based on the contents of a vector writemask, it should be understood that the instruction set architecture may include multiple such instructions that specify similar operations with different criteria (e.g., selecting a different source operand size, selecting between different scalar constants, or storing the opposite scalar constant when all of the one-bit vector writemask elements are 0), as described later herein. In some embodiments of the invention, the illustrated instruction 100 specifies a source operand from a single one of the vector writemask registers 110 as its only source operand and a single one of the general purpose registers 130 as its destination. Other embodiments of the invention may, however, include additional sources (e.g., data for memory access calculations), different types of destinations (e.g., a memory location rather than a register), and/or additional source operands and destinations (e.g., an instruction that also stores the result in a condition code flag, or an instruction that causes a separate operation to be performed on an additional source operand with its result stored in the additional destination).

Exemplary "Set GPR If Mask Is 0" and "Set GPR If Mask Is Not 0" Instructions

FIG. 2A is a block diagram illustrating an example of a specific "set GPR if mask is 0" instruction according to one embodiment of the invention, and FIG. 2B is a block diagram illustrating an example of a specific "set GPR if mask is not 0" instruction according to one embodiment of the invention. Both FIG. 2A and FIG. 2B show the contents 132 of the source vector writemask register K_X. Both figures also show a source operand 134A that consists of bits 15:0 of the bits 63:0 in the source vector writemask register K_X, rather than all of the bits in that register. In addition, both figures include determining whether all of the one-bit vector writemask elements of the writemask in the source operand 134A are 0. In FIG. 2A, if all bits in the source operand 134A are 0, control passes to block 212, where GPR_Y is set equal to the scalar constant 1; otherwise, control passes to block 214, where GPR_Y is set equal to the scalar constant 0. Turning to block 210 of FIG. 2B, if all bits of the source operand 134A are 0, control passes to block 216, where GPR_Y is set equal to the scalar constant 0; otherwise, control passes to block 218, where GPR_Y is set equal to the scalar constant 1.

In some embodiments of the invention, the "set GPR if mask is 0" instruction type is referred to as KSETZ{B,W,D,Q} GPR_Y, K_X (where the braces indicate the selectable size of the source operand 134), and the "set GPR if mask is not 0" instruction type is referred to as KSETNZ{B,W,D,Q} GPR_Y, K_X.
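The behavior of the two instruction types can be sketched in a few lines of Python. This is a minimal illustrative model, not the patent's implementation: the function names, the `SIZE_BITS` table, and the assumption that each size suffix selects the low-order 8, 16, 32, or 64 bits of a 64-bit mask register are introduced here for illustration only.

```python
# Assumed mapping from size suffix to the number of low-order mask bits read.
SIZE_BITS = {"B": 8, "W": 16, "D": 32, "Q": 64}


def source_operand(k_value: int, suffix: str) -> int:
    """Return the low-order bits of the mask register selected by the suffix."""
    return k_value & ((1 << SIZE_BITS[suffix]) - 1)


def ksetz(k_value: int, suffix: str = "Q") -> int:
    """Model of 'set GPR if mask is 0': 1 when every selected mask bit is 0."""
    return 1 if source_operand(k_value, suffix) == 0 else 0


def ksetnz(k_value: int, suffix: str = "Q") -> int:
    """Model of 'set GPR if mask is not 0': 1 when any selected mask bit is 1."""
    return 0 if source_operand(k_value, suffix) == 0 else 1


k_x = 0x0000_0000_0001_0000  # only bit 16 of the mask register is set
print(ksetz(k_x, "W"), ksetz(k_x, "D"))    # bit 16 is outside W, inside D
print(ksetnz(k_x, "W"), ksetnz(k_x, "Q"))
```

With bit 16 set, the 16-bit form sees an all-zero source operand while the 32-bit and 64-bit forms do not, which shows how the selected size changes the result for the same register contents.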

Exemplary Flows and Processor Core

FIG. 3 is a flow diagram for processing each occurrence of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector writemask, according to certain embodiments of the invention. At block 301, a representation of such an instruction is fetched. The format of the instruction specifies a source operand from a single vector writemask register as its only source operand and a single general purpose register as its destination. The format of the instruction includes a first field whose content selects the single vector writemask register from a plurality of architectural vector writemask registers, and a second field whose content selects the single general purpose register from a plurality of architectural general purpose registers. The source operand is a writemask that includes a plurality of one-bit vector writemask elements corresponding to different multi-bit data element positions within architectural vector registers. From block 301, control passes to block 302.

At block 302, in response to executing a single occurrence of the single instruction from block 301, data is stored in the single general purpose register such that its contents represent the first or the second scalar constant, based on whether the plurality of one-bit vector writemask elements are all 0. See FIGS. 2A and 2B for examples of the scalar constants that may be selected and of which scalar constant is selected depending on whether the plurality of one-bit vector writemask elements are all 0.
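The two blocks of FIG. 3 can be illustrated with a small register-file model. The sketch below is hypothetical: the `Instruction` class, the register counts (8 mask registers, 16 general purpose registers), and the default constants 1 and 0 are assumptions chosen for the example, not values taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    k_field: int    # first field: selects the source writemask register K_X
    gpr_field: int  # second field: selects the destination register GPR_Y


def execute(insn, k_regs, gprs, first=1, second=0):
    """Store `first` in the destination if the selected mask is all zeros,
    otherwise store `second` (the whole register holds the constant)."""
    mask = k_regs[insn.k_field]
    gprs[insn.gpr_field] = first if mask == 0 else second


k_regs = [0] * 8        # assumed number of architectural mask registers
gprs = [0xDEAD] * 16    # assumed number of general purpose registers
k_regs[3] = 0b1010      # K_3 has some mask bits set; K_0 stays all zeros

execute(Instruction(k_field=3, gpr_field=5), k_regs, gprs)
execute(Instruction(k_field=0, gpr_field=6), k_regs, gprs)
print(gprs[5], gprs[6])
```

Each occurrence carries its own field values, so the same instruction format reads a different mask register and writes a different destination in the two calls.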

FIG. 4 is a flow diagram for executing an occurrence of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector writemask, according to certain embodiments of the invention. As shown in block 401, a logical OR operation is performed on the plurality of one-bit vector writemask elements of the source operand of an occurrence of one such instruction. To support different source operand sizes, a set of OR trees connected by multiplexers may be used: the least significant bits, which make up the smallest size of source operand, are the inputs to a first OR tree; the output of the first OR tree is one input to an OR gate; the other input to the OR gate is the output of a multiplexer; the inputs to the multiplexer are 0 and the output of a second OR tree; the inputs to the second OR tree are the next most significant bits, which make up the next size of source operand; and the multiplexer is controlled by a signal indicating whether the smallest size or a larger size of source operand is in use. This arrangement can be scaled to include additional sizes of source operand. From block 401, control passes to block 402. At block 402, the first or the second scalar constant is multiplexed based on a control signal formed from the result of the logical OR operation and an indication of which of a plurality of types the instruction is. For example, one such type is the "set GPR if vector writemask is 0" type of FIG. 2A, and another is the "set GPR if vector writemask is not 0" type of FIG. 2B. As a further example, if the "set GPR if vector writemask is 0" and "set GPR if vector writemask is not 0" types are represented by a "type signal" of logical 1 and 0, respectively, the type signal can be exclusive-ORed (XORed) with the result of the logical OR of the source operand to form the control signal, which is provided to a multiplexer that selects between the two scalar constants. In this embodiment, the two scalar constants are hardwired to 1 and 0; a control signal of logical 1 selects the hardwired scalar constant 1, and a control signal of logical 0 selects the hardwired scalar constant 0.
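The logic described for FIG. 4 can be checked with a small bit-level sketch. The function names and the choice of two operand sizes (8 and 16 bits) are assumptions for illustration; the structure follows the description: two OR trees, a multiplexer that gates the second tree, an XOR with the type signal, and a final multiplexer over the hardwired constants.

```python
from functools import reduce
from operator import or_


def or_tree(bits):
    """Reduce a list of single bits with logical OR."""
    return reduce(or_, bits, 0)


def bit_slice(value, low, high):
    """Return bits low..high-1 of value as a list of 0/1 integers."""
    return [(value >> i) & 1 for i in range(low, high)]


def or_of_source(k_value, wide):
    """OR all mask elements of an 8-bit (wide=False) or 16-bit (wide=True) source operand."""
    first_tree = or_tree(bit_slice(k_value, 0, 8))
    second_tree = or_tree(bit_slice(k_value, 8, 16))
    muxed = second_tree if wide else 0  # multiplexer inputs: 0 and the second OR tree
    return first_tree | muxed           # OR gate combining the first tree and the mux


def result(k_value, wide, type_signal):
    """type_signal 1 models 'set if 0'; type_signal 0 models 'set if not 0'."""
    control = or_of_source(k_value, wide) ^ type_signal
    hardwired = (0, 1)                  # control 0 selects constant 0, control 1 selects 1
    return hardwired[control]


print(result(0x0100, wide=False, type_signal=1))  # low 8 bits are all zero
print(result(0x0100, wide=True, type_signal=1))   # bit 8 is visible in the 16-bit operand
```

Flipping the type-signal encoding and swapping the multiplexer inputs would give the same externally visible behavior, which is the kind of variation the following paragraph allows.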

Although specific discrete logic is described with reference to FIG. 4, it should be understood that different embodiments may use different logic (e.g., the logical values assigned to the different instruction types may be flipped along with the multiplexer inputs).

FIG. 5 is a block diagram of a specific machine for processing occurrences of an instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector writemask, according to certain embodiments of the invention. FIG. 5 repeats the instruction 100, the vector writemask registers 110, the vector registers 120, the general purpose registers 130, and circled B-C from FIG. 1. FIG. 5 also shows a processing core 500 that includes an optional instruction fetch unit 510, a hardware decode unit 515, and an execution engine unit 520, as well as the vector writemask registers 110, the vector registers 120, and the general purpose registers 130.

At circled A in FIG. 5, a representation of the instruction 100 (or of a different type of instruction that stores one of two scalar constants in a general purpose register based on the contents of a vector writemask) is provided to the hardware decode unit 515 (optionally as a result of the instruction fetch unit 510 fetching the instruction 100). A variety of well-known decode units can be used for the decode unit 515. For example, the decode unit may decode each macroinstruction into a single wide microinstruction. As another example, the decode unit may decode some macroinstructions into a single wide microinstruction and others into multiple wide microinstructions. As another example particularly suited to out-of-order processor pipelines, the decode unit may decode each macroinstruction into one or more micro-operations (micro-ops), each of which may be issued and executed out of order. The decode unit may be implemented with one or more decoders, and each decoder may be implemented as a programmable logic array (PLA), as is well known in the art. As an example, a given decode unit may have: 1) steering logic to direct different macroinstructions to different decoders; 2) a first decoder that can decode a subset of the instruction set (but more of it than the second, third, and fourth decoders) and generate two micro-ops at a time; 3) second, third, and fourth decoders that can each decode only a subset of the complete instruction set and generate only one micro-op at a time; 4) a micro-sequencer ROM that can decode only a subset of the complete instruction set and generate four micro-ops at a time; and 5) multiplexing logic, fed by the decoders and the micro-sequencer ROM, that determines whose output is provided to a micro-op queue. Other embodiments of the decode unit may have more or fewer decoders that decode more or fewer instructions and instruction subsets. For example, one embodiment may have second, third, and fourth decoders that each generate two micro-ops at a time, and may include a micro-sequencer ROM that generates eight micro-ops at a time.

At circled D, the architectural vector writemask register that provides the source operand is accessed (this may be through a dedicated physical register, a renamed physical register, a bypass path if the contents were just generated, and so on), and the source operand is provided to the execution engine unit 520, which at circled E executes the occurrence of the instruction 100 in the instruction stream. Specifically, in response to each occurrence, the execution engine unit 520 determines whether the plurality of one-bit vector writemask elements of that occurrence's source operand are all 0, and causes data to be stored in the single general purpose register selected by that occurrence such that its contents represent the first or the second scalar constant based on the determination. The execution engine unit 520 may be implemented in various ways, including with the logic described above with reference to FIG. 4.

At circled F, the result (which is either the first scalar constant or the second scalar constant) is written back to the architectural general purpose register GPR_Y (which may be written to a dedicated physical register, a renamed physical register, and so on). Because this is a scalar result that is written back to a general purpose register, the contents of that general purpose register represent either the first scalar constant or the second scalar constant, depending on whether the plurality of one-bit writemask elements in the source operand are all 0.

Exemplary Correspondences and Vector Writemask Operations

FIG. 6A is a table illustrating that the number of one-bit vector writemask elements depends on the vector size and the data element size, according to one embodiment of the invention. Vector sizes of 128 bits, 256 bits, and 512 bits are shown, but other widths are also possible. Data element sizes of 8-bit bytes (B), 16-bit words (W), 32-bit doublewords (D) or single-precision floating point, and 64-bit quadwords (Q) or double-precision floating point are considered, but other widths are also possible. As shown, when the vector size is 128 bits, 16 bits may be used for masking when the data element size is 8 bits, 8 bits when the data element size is 16 bits, 4 bits when the data element size is 32 bits, and 2 bits when the data element size is 64 bits. When the vector size is 256 bits, 32 bits may be used for masking when the data element size is 8 bits, 16 bits when the data element size is 16 bits, 8 bits when the data element size is 32 bits, and 4 bits when the data element size is 64 bits. When the vector size is 512 bits, 64 bits may be used for masking when the data element size is 8 bits, 32 bits when the data element size is 16 bits, 16 bits when the data element size is 32 bits, and 8 bits when the data element size is 64 bits.

FIG. 6B is a diagram illustrating a vector writemask register 640 and the bit positions that are used as the writemask depending on the vector size and the data element size, according to one embodiment of the invention. In FIG. 6B, the vector writemask register is 64 bits wide, but this is not required. Depending on the combination of vector size and data element size, either all 64 bits or only a subset of the 64 bits may be used as the writemask. In general, when a single per-element masking control bit is used, the number of bits in the vector writemask register used for masking equals the vector size in bits divided by the vector data element size in bits.
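The division rule stated above reproduces the FIG. 6A values directly. The following sketch uses a hypothetical helper name, `mask_bits_used`, and assumes one mask control bit per data element.

```python
def mask_bits_used(vector_bits: int, element_bits: int) -> int:
    """Number of writemask bits used: vector size divided by data element size."""
    if vector_bits % element_bits != 0:
        raise ValueError("vector size must be a multiple of the element size")
    return vector_bits // element_bits


# Rows are vector sizes; columns are element sizes of 8, 16, 32, and 64 bits.
for vector_bits in (128, 256, 512):
    row = [mask_bits_used(vector_bits, e) for e in (8, 16, 32, 64)]
    print(vector_bits, row)
```

For a 512-bit vector the helper yields 64, 32, 16, and 8 mask bits, so only the 8-bit element case uses the full 64-bit register.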

Several illustrative examples are shown for 512-bit vectors. When the vector size is 512 bits and the data element size is 64 bits, only the lowest 8 bits of the register are used as the writemask. When the vector size is 512 bits and the data element size is 32 bits, only the lowest 16 bits of the register are used as the writemask. When the vector size is 512 bits and the data element size is 16 bits, only the lowest 32 bits of the register are used as the writemask. When the vector size is 512 bits and the data element size is 8 bits, all 64 bits of the register are used as the writemask. Although in the illustrated embodiment the lowest-order subset or portion of the register is used for masking, alternative embodiments may use some other set of bits (e.g., the highest-order subset). Moreover, although FIG. 6B contemplates only a 512-bit vector size, the same principle applies to other vector sizes, such as 256 bits and 128 bits.

FIG. 7A is a block diagram illustrating an exemplary operation 760 that uses a writemask from the 64-bit vector writemask register K1 for merging, where the vector size is 512 bits and the data element size is 32 bits, according to certain embodiments of the invention. FIG. 7A shows a source A operand 705; a source B operand 710; the contents 715 of the vector writemask register K1 (in which the lower 16 bits contain a mix of 1s and 0s); and a destination operand 720. FIG. 7A also shows an exemplary correspondence 700. As described above, the bit positions of the destination vector operand's multi-bit data elements change in embodiments that support vector operations on data elements of different sizes (e.g., the bit positions of the multi-bit data elements differ between a vector operation on 32-bit data elements and a vector operation on 64-bit data elements). Such embodiments can therefore support different relationships (also called correspondences or mappings) between the one-bit writemask elements and the bit positions within the destination operand; as the bit positions change with the destination operand, the mapping of the one-bit vector writemask elements changes as well. Thus, in the correspondence 700 only the lower 16 bit positions of the vector writemask register K1 (and hence only the lower 16 vector writemask element positions) correspond to data elements (K1[0] corresponds to data element position 0, which occupies bits 31:0; K1[1] corresponds to data element position 1, which occupies bits 63:32; and so on). If the size of the data elements changes, the correspondence changes (e.g., if the data elements are 16 bits, K1[0] corresponds to data element position 0, which occupies bits 15:0; K1[1] corresponds to data element position 1, which occupies bits 31:16; and so on).
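The correspondence between a mask bit and the vector bits it controls is a simple function of the element size. The helper below is a hypothetical illustration (the name `element_bit_range` is not from the specification) that assumes elements are packed contiguously from bit 0 and that mask bit i controls element i.

```python
def element_bit_range(mask_index: int, element_bits: int) -> tuple:
    """Return (high_bit, low_bit) of the data element controlled by K1[mask_index]."""
    low = mask_index * element_bits
    return (low + element_bits - 1, low)


for element_bits in (32, 16):
    ranges = [element_bit_range(i, element_bits) for i in range(3)]
    print(element_bits, ranges)
```

With 32-bit elements, mask bit 1 maps to bits 63:32; with 16-bit elements, the same mask bit maps to bits 31:16.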

Each data element position in the destination vector operand 720 contains either the contents of that data element position in the source operand 710 or the result of the operation (illustrated as an addition), depending on whether the corresponding bit position in the vector writemask register K1 is 0 or 1, respectively.

FIG. 7B is a block diagram illustrating an exemplary operation 766 that uses a writemask from the 64-bit writemask register K1 for zeroing, where the vector size is 512 bits and the data element size is 32 bits, according to certain embodiments of the invention. FIG. 7B includes the same items as FIG. 7A, except that the destination operand 720 is replaced by a destination operand 764. Each data element position in the destination vector operand 764 contains either 0 or the result of the operation (illustrated as an addition), depending on whether the corresponding bit position in the vector writemask register K1 is 0 or 1, respectively.
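Merging and zeroing differ only in what is written where the mask bit is 0. The sketch below models both on short lists of 32-bit elements; the function name `masked_add` and the use of Python lists in place of 512-bit registers are assumptions for illustration.

```python
def masked_add(src_a, src_b, mask, zeroing=False):
    """Element-wise add under a writemask.

    Mask bit 1: the element receives the sum (wrapped to 32 bits).
    Mask bit 0: the element receives src_b's element when merging, or 0 when zeroing.
    """
    dest = []
    for i, (a, b) in enumerate(zip(src_a, src_b)):
        if (mask >> i) & 1:
            dest.append((a + b) & 0xFFFFFFFF)
        else:
            dest.append(0 if zeroing else b)
    return dest


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
k1 = 0b0101  # elements 0 and 2 are enabled

print(masked_add(a, b, k1))                # merging
print(masked_add(a, b, k1, zeroing=True))  # zeroing
```

The enabled positions are identical in both outputs; only the masked-off positions differ, holding the source B values in the merging case and 0 in the zeroing case.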

Exemplary "Set GPR If Mask Is P" Instructions

FIGS. 8A-D and FIGS. 9A-D show pseudocode for "set GPR if mask is 0" and "set GPR if mask is not 0" type instructions, respectively, for different sizes of the source operand 134, according to certain embodiments of the invention. FIG. 8A shows pseudocode for a "set GPR if mask is 0" instruction that uses an 8-bit source operand 134 (KSETZB GPR_Y, K_X). FIG. 8B shows pseudocode for a "set GPR if mask is 0" instruction that uses a 16-bit source operand 134 (KSETZW GPR_Y, K_X). FIG. 8C shows pseudocode for a "set GPR if mask is 0" instruction that uses a 32-bit source operand 134 (KSETZD GPR_Y, K_X). FIG. 8D shows pseudocode for a "set GPR if mask is 0" instruction that uses a 64-bit source operand 134 (KSETZQ GPR_Y, K_X). FIG. 9A shows pseudocode for a "set GPR if mask is not 0" instruction that uses an 8-bit source operand 134 (KSETNZB GPR_Y, K_X). FIG. 9B shows pseudocode for a "set GPR if mask is not 0" instruction that uses a 16-bit source operand 134 (KSETNZW GPR_Y, K_X). FIG. 9C shows pseudocode for a "set GPR if mask is not 0" instruction that uses a 32-bit source operand 134 (KSETNZD GPR_Y, K_X). FIG. 9D shows pseudocode for a "set GPR if mask is not 0" instruction that uses a 64-bit source operand 134 (KSETNZQ GPR_Y, K_X).

Exemplary Code Sequences Using "Set GPR If Mask Is P" Instructions

As previously described, by placing the Boolean value in a GPR rather than in a control flow register (e.g., the carry flag), decisions based on the Boolean value can be data flow based rather than control flow based. Specifically, control flow based decisions rely on instructions that change the flow of execution (e.g., jump, branch), whereas data flow based decisions select between data based on the Boolean value. For example, "set GPR if mask is P" instructions are useful for parameters passed to functions (see FIGS. 10A-B), and for efficient pointer generation and indirect function calls for fast conditional execution of different code segments (see FIGS. 11A-B). Specifically, FIGS. 10A and 11A show pseudo-assembly code sequences written with AVX1/AVX2 instructions (see the Intel 64 and IA-32 Architectures Software Developer's Manual, October 2011; and see the Intel Advanced Vector Extensions Programming Reference, June 2011).
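The difference between the two decision styles can be illustrated at a high level. The sketch below is an analogy in Python, not the assembly of the figures: `foo`, `call_with_branch`, `call_with_dataflow`, and `dispatch` are hypothetical names, and `int(mask != 0)` stands in for the result a "set GPR if mask is not 0" instruction would leave in a GPR.

```python
def foo(a, b, flag):
    """Placeholder callee; it simply returns its arguments."""
    return (a, b, flag)


def call_with_branch(a, b, mask):
    """Control flow style: a branch chooses between two call sites."""
    if mask != 0:
        return foo(a, b, 1)
    return foo(a, b, 0)


def call_with_dataflow(a, b, mask):
    """Data flow style: the Boolean is computed as data and passed as a parameter."""
    flag = int(mask != 0)  # stands in for the value placed in a GPR
    return foo(a, b, flag)


def dispatch(mask, handlers):
    """Data flow style: the Boolean indexes a table of functions (an indirect call)."""
    return handlers[int(mask != 0)]()


handlers = (lambda: "all-zero path", lambda: "nonzero path")
print(call_with_branch("A", "B", 0b0100), call_with_dataflow("A", "B", 0b0100))
print(dispatch(0, handlers), "/", dispatch(0b0100, handlers))
```

Both call styles produce the same arguments, but the data flow versions contain a single call site, so there is no conditional branch around the call for a processor to mispredict.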

Figure 10A shows an exemplary code sequence, written with AVX1/AVX2 instructions, that passes parameters to a function. The sequence contains two calls to a function named "foo". The first call to foo passes A, B, and 1 as parameters, while the second call passes A, B, and 0. The sequence uses control flow instructions to select between the two calls. Specifically, the code sequence begins with VMOVAPS, which moves aligned packed single-precision floating-point data elements from A (which may be a ymm register or a 256-bit memory location) to ymm1. Next, VCMPPS compares the packed single-precision floating-point values in B (which may be a ymm register or a 256-bit memory location) with ymm1, using bits 4:0 of imm8 as the comparison predicate (bits 4:0 define the type of comparison and bits 7:5 are reserved; Figure 10A indicates less than (LT)). Next, VPTEST sets the zero flag (ZF) and the carry flag (CF), which are condition code flags in the EFLAGS register, according to the bitwise logical AND and the bitwise logical ANDN (AND NOT) of the sources: ZF is set if all bits of the bitwise AND result are 0, and CF is set if all bits of the bitwise ANDN result are 0. Next, if ZF is equal to 0, JZ jumps to the destination "Output". This conditional branch (the JZ instruction) can lead to mispredicted branches and thereby hurt performance. If the jump is not taken, the function call foo(A,B,1) and jmp end are processed. If the jump is taken, the function call foo(A,B,0) and End: are processed.
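The flag rule described for VPTEST can be sketched in Python by treating each 256-bit operand as an integer. This is an illustrative model under stated assumptions: the function name `vptest_flags` is hypothetical, and the choice of which operand is inverted for the ANDN term (the first one here) is an assumption of the sketch rather than something the text above specifies.

```python
YMM_BITS = 256
YMM_MASK = (1 << YMM_BITS) - 1


def vptest_flags(src1, src2):
    """Return (ZF, CF) for two 256-bit operands held as Python ints.

    ZF is 1 when the bitwise AND of the operands is all zeros.
    CF is 1 when the bitwise ANDN (NOT src1, AND src2) is all zeros.
    """
    and_result = src1 & src2 & YMM_MASK
    andn_result = (~src1) & src2 & YMM_MASK
    zf = int(and_result == 0)
    cf = int(andn_result == 0)
    return zf, cf
```

Testing a comparison result against itself makes the AND term equal to the result, so ZF alone reports whether every element of the comparison was false.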

Figure 10B shows an exemplary code sequence, written with the KSETZW instruction, that passes parameters to a function according to one embodiment of the invention. This sequence contains only one call to foo and no JZ instruction, yet achieves the same result as the sequence of Figure 10A. Specifically, the code sequence begins with the same VMOVAPS instruction. Next, a new type of VCMPPS instruction is used. Under a vector write mask from vector write mask register K1, this new type of VCMPPS instruction compares the packed single-precision floating-point data elements in B (which may be a ymm register or a float32 vector memory location) with ymm1, using imm8 as the comparison predicate (which indicates LT, as in Figure 10A), and places the result (a quadword write mask of all 1s (comparison true) or all 0s (comparison false)) back in K1. Next, KSETZW sets rax (a GPR) to the scalar constant 1 if the least significant word of K1 is all zeros; otherwise, it zeroes rax. Next, a single call to foo is made, passing A, B, and rax as parameters. Setting rax to the scalar constant 1 or the scalar constant 0 thus allows rax to be used to pass 1 or 0 to foo through a single function call. The sequence in Figure 10B therefore achieves the same result as Figure 10A, but does so with a data flow decision rather than a control flow decision; Figure 10B thereby avoids the conditional branch (the JZ instruction) and reduces code size.
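The difference between the two styles can be sketched in Python. This is a hedged illustration, not a transcription of Figures 10A-B: `foo` is a hypothetical callee that simply returns the flag it receives, the mask is modeled with one bit per element, and the operand order of the less-than comparison is an assumption.

```python
def compare_lt_mask(a, b):
    """One mask bit per element; bit i is 1 when a[i] < b[i] (assumed order)."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x < y:
            mask |= 1 << i
    return mask


def ksetzw(mask):
    """Return 1 if the least significant 16 bits of the mask are all zero, else 0."""
    return int((mask & 0xFFFF) == 0)


def foo(a, b, flag):
    """Hypothetical callee; it just reports the flag it was given."""
    return flag


def call_control_flow(a, b):
    # Control flow style: a branch selects one of two call sites.
    if compare_lt_mask(a, b) == 0:
        return foo(a, b, 1)
    return foo(a, b, 0)


def call_data_flow(a, b):
    # Data flow style: the flag is computed as data, then one call is made.
    rax = ksetzw(compare_lt_mask(a, b))
    return foo(a, b, rax)
```

Both functions return the same value for any input; the second has a single call site and no branch that depends on the comparison result.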

Figure 11A shows an exemplary code sequence, written with AVX1/AVX2 instructions, that uses pointers and indirect function calls. The sequence in Figure 11A is the same as in Figure 10A, except that foo(A,B,1) is replaced by LEA rbx,foo and (*rbx)(A[],B[],C[]), while foo(A,B,0) is replaced by LEA rbx,foo+8 and (*rbx)(A[],B[],C[]+size), where a capital letter followed by "[]" (e.g., A[]) denotes a pointer to an array. The LEA instruction loads an effective address into rbx (a GPR); the first and second occurrences of LEA load the effective addresses of foo and foo+8, respectively, into rbx. As a result, the instruction following each occurrence of LEA makes an indirect function call through the pointer (see (*rbx)) to foo or to foo+8; the two function calls also differ in passing either C[] or C[]+size as a parameter, where "size" is a value in memory or a constant (via #define). As before, the conditional branch (the JZ instruction) can lead to mispredicted branches and thereby hurt performance.

Figure 11B shows an exemplary code sequence, written with the KSETZW instruction, that uses pointers and indirect function calls according to one embodiment of the invention. Like Figure 10B, the sequence in Figure 11B contains only one function call and no JZ instruction, yet achieves the same result as the sequence of Figure 11A. The sequence in Figure 11B is the same as in Figure 10B, except that foo(A,B,rax) has been replaced by LEA rbx,foo+rax*8; IMUL rax,size; and (*rbx)(A[],B[],C[]+rax). The LEA instruction loads the effective address of foo+rax*8 into rbx, where rax is the scalar constant 0 or 1 depending on the result of the KSETZW instruction; in other words, rbx is loaded with the effective address of foo (computed as foo+0*8) or of foo+8 (computed as foo+1*8). The IMUL instruction multiplies two signed integers; here it multiplies the content of rax by size and stores the result in rax, so that rax = rax*size; in other words, rax becomes 0*size or 1*size. As a result, (*rbx)(A[],B[],C[]+rax) makes an indirect function call to foo or to foo+8 and passes C[] or C[]+size, respectively, as a parameter. Setting rax to the scalar constant 0 or the scalar constant 1 thus allows rax and rbx to produce different effective addresses and different parameter values through a single function call. The sequence in Figure 11B therefore achieves the same result as Figure 11A, but through a data flow decision rather than a control flow decision; Figure 11B thereby avoids the conditional branch (the JZ instruction) and reduces code size.
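The pointer generation step can be modeled in Python with a small table standing in for code memory. This is a sketch under stated assumptions: the two entry functions, the `ENTRY_POINTS` table, and the use of offsets relative to foo in place of real addresses are all hypothetical.

```python
ENTRY_STRIDE = 8  # byte distance between the two entry points ("foo+rax*8")


def entry_foo(a, b, c_offset):
    """Stands in for the code located at foo."""
    return ("foo", c_offset)


def entry_foo_plus_8(a, b, c_offset):
    """Stands in for the code located at foo+8."""
    return ("foo+8", c_offset)


# Offset from foo -> callable; plays the role of code memory in this model.
ENTRY_POINTS = {0: entry_foo, ENTRY_STRIDE: entry_foo_plus_8}


def dispatch(a, b, rax, size):
    """rax is the 0 or 1 produced by KSETZW."""
    rbx = rax * ENTRY_STRIDE             # LEA rbx, foo+rax*8 (offset from foo)
    rax = rax * size                     # IMUL rax, size
    return ENTRY_POINTS[rbx](a, b, rax)  # (*rbx)(A[], B[], C[]+rax)
```

With rax equal to 0 the call reaches foo with an offset of 0; with rax equal to 1 it reaches foo+8 with an offset of size. Neither outcome needs a conditional branch.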

Exemplary Instruction Encodings

Embodiments of the instructions described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instructions may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.

VEX Encoding

By way of example, the VEX C4 encoding will be described, along with one example of how the various "set GPR if mask is P" instructions may be encoded in it.

Figure 12A provides a representation of the VEX C4 encoding. This encoding includes the following fields (see Intel Advanced Vector Extensions Programming Reference, June 2011):

VEX prefix (bytes 0-2) 1202 - is encoded in a three-byte form.

Format field 1240 (VEX byte 0, bits [7:0]) - the first byte (VEX byte 0) is the format field 1240, and it contains an explicit C4 byte value (the unique value used to distinguish the C4 instruction format).

The second and third bytes (VEX bytes 1-2) include a number of bit fields that provide specific capabilities. Specifically:

REX field 1205 (VEX byte 1, bits [7-5]) - consists of a VEX.R bit field (VEX byte 1, bit [7] - R), a VEX.X bit field (VEX byte 1, bit [6] - X), and a VEX.B bit field (VEX byte 1, bit [5] - B). Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding VEX.R, VEX.X, and VEX.B.

Opcode map field 1215 (VEX byte 1, bits [4:0] - mmmmm) - its content encodes an implied leading opcode byte.

VEX.W (VEX byte 2, bit [7] - W) - is represented by the notation VEX.W, and provides different functions depending on the instruction.

VEX.vvvv 1220 (VEX byte 2, bits [6:3] - vvvv) - the role of VEX.vvvv may include the following: 1) VEX.vvvv encodes the first source register operand, specified in inverted (1's complement) form, and is valid for instructions with two or more source operands; 2) VEX.vvvv encodes the destination register operand, specified in 1's complement form, for certain vector shifts; or 3) VEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b.

VEX.L 1268 size field (VEX byte 2, bit [2] - L) - if VEX.L = 0, it indicates a 128-bit vector; if VEX.L = 1, it indicates a 256-bit vector.

Prefix encoding field 1225 (VEX byte 2, bits [1:0] - pp) - provides additional bits for the base operation field.
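The bit positions listed for the three-byte VEX prefix can be illustrated with a small Python decoder. This is only a sketch: the function name `decode_vex_c4` and the dictionary keys are chosen for illustration, and the decoder returns the raw bits exactly as stored, without applying any inversion or further interpretation.

```python
def decode_vex_c4(prefix):
    """Split a three-byte VEX prefix into the bit fields listed above."""
    if len(prefix) != 3 or prefix[0] != 0xC4:
        raise ValueError("expected three bytes starting with 0xC4")
    b1, b2 = prefix[1], prefix[2]
    return {
        "R": (b1 >> 7) & 1,       # VEX byte 1, bit [7]
        "X": (b1 >> 6) & 1,       # VEX byte 1, bit [6]
        "B": (b1 >> 5) & 1,       # VEX byte 1, bit [5]
        "mmmmm": b1 & 0x1F,       # VEX byte 1, bits [4:0]
        "W": (b2 >> 7) & 1,       # VEX byte 2, bit [7]
        "vvvv": (b2 >> 3) & 0xF,  # VEX byte 2, bits [6:3]
        "L": (b2 >> 2) & 1,       # VEX byte 2, bit [2]
        "pp": b2 & 0x3,           # VEX byte 2, bits [1:0]
    }
```

For example, `decode_vex_c4(bytes([0xC4, 0xE1, 0x78]))` yields a vvvv value of 1111b, the value the text says the field should contain when it encodes no operand.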

Real opcode field 1230 (byte 3)

This is also known as the opcode byte. Part of the opcode is specified in this field.

MOD R/M field 1240 (byte 4)

Modifier field 1246 (MODR/M.MOD, bits [7-6] - MOD field 1242).

MODR/M.reg field 1244, bits [5-3] - the role of the ModR/M.reg field can be summarized as two cases: ModR/M.reg encodes either the destination register operand or a source register operand (the rrr of Rrrr); or ModR/M.reg is treated as an opcode extension and is not used to encode any instruction operand.

MODR/M.r/m field 1246, bits [2-0] - the role of the ModR/M.r/m field may include the following: ModR/M.r/m encodes the instruction operand that references a memory address; or ModR/M.r/m encodes either the destination register operand or a source register operand.

Scale, Index, Base (SIB) byte (byte 5)

Scale field 1260 (SIB.SS, bits [7-6]) - the content of the scale field 1260 is used for memory address generation.

SIB.xxx 1254 (bits [5-3]) and SIB.bbb 1256 (bits [2-0]) - the contents of these fields were referred to previously with regard to the register indexes Xxxx and Bbbb.
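The ModR/M and SIB bytes share the same 2-3-3 bit layout, which the following Python sketch makes explicit. The helper names are hypothetical, and `extend_index` simply places the extension bit above the three low bits as given; whether that bit is stored inverted is not modeled here.

```python
def split_2_3_3(byte):
    """Split a byte into its bits [7:6], [5:3], and [2:0]."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def decode_modrm(byte):
    mod, reg, rm = split_2_3_3(byte)
    return {"mod": mod, "reg": reg, "rm": rm}


def decode_sib(byte):
    ss, xxx, bbb = split_2_3_3(byte)
    return {"ss": ss, "xxx": xxx, "bbb": bbb}


def extend_index(high_bit, low3):
    """Form a 4-bit register index such as Rrrr from an extension bit and rrr."""
    return (high_bit << 3) | low3
```

A 4-bit index can address sixteen registers, which is why the single extension bits in the prefix are paired with the 3-bit reg, xxx, and bbb fields.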

Displacement byte(s) (byte 6 or bytes 6-9)

Immediate 1272 (IMM8) (starts at byte 7 or 10)

Figure 12B shows which fields from Figure 12A make up a full opcode field 1274 and a base operation field 1242. Figure 12C shows which fields from Figure 12A make up a register index field.

Exemplary Encoding

KSETZ{B,W,D,Q} GPR_Y, K_X and KSETNZ{B,W,D,Q} GPR_Y, K_X

Format field 1240 = C4

VEX.R and MODR/M.reg field 1244 (Rrrr) - identify GPR_Y

VEX.X and VEX.B - ignored

Opcode map field 1215 = 0F

VEX.W = x (ignored; or selects the size of GPR_Y: 0 for EAX and 1 for RAX)

VEX.vvvv 1220 - ignored

VEX.L = 0

Prefix encoding field 1225 = 00

Real opcode field 1230 - indicates set on 0 or set on 1; indicates B, W, D, or Q

MODR/M.r/m field 1246 - indicates K_X. In other words, bits that have historically been used by instructions to access a different register are used here to access the architectural vector write mask registers.

SIB - ignored (if present)

Displacement field 1262 - ignored (if present)

Immediate (IMM8) - ignored (if present)

Generic Vector Friendly Instruction Format

A vector friendly instruction format is an instruction format that is suited for vector instructions (e.g., there are certain fields specific to vector operations). While embodiments are described in which both vector and scalar operations are supported through the vector friendly instruction format, alternative embodiments use only vector operations through the vector friendly instruction format.

Figures 13A-13B are block diagrams illustrating a generic vector friendly instruction format and instruction templates thereof according to embodiments of the invention. Figure 13A is a block diagram illustrating the generic vector friendly instruction format and class A instruction templates thereof; Figure 13B is a block diagram illustrating the generic vector friendly instruction format and class B instruction templates thereof. Specifically, class A and class B instruction templates are defined for a generic vector friendly instruction format 1300, both of which include no memory access 1305 instruction templates and memory access 1320 instruction templates. The term "generic" in the context of the vector friendly instruction format refers to the instruction format not being tied to any specific instruction set.

Embodiments of the invention will be described in which the vector friendly instruction format supports the following: a 64-byte vector operand length (or size) with 32-bit (4-byte) or 64-bit (8-byte) data element widths (or sizes) (and thus a 64-byte vector consists of either 16 doubleword-size elements or, alternatively, 8 quadword-size elements); a 64-byte vector operand length (or size) with 16-bit (2-byte) or 8-bit (1-byte) data element widths (or sizes); a 32-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte), or 8-bit (1-byte) data element widths (or sizes); and a 16-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte), or 8-bit (1-byte) data element widths (or sizes). Alternative embodiments may nevertheless support larger, smaller, and/or different vector operand sizes (e.g., 256-byte vector operands) with larger, smaller, or different data element widths (e.g., 128-bit (16-byte) data element widths).
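The relationship between vector operand length and data element width is simple division, as the short Python sketch below shows. The function name `element_count` is chosen for illustration.

```python
def element_count(vector_bytes, element_bits):
    """Number of data elements that fit in a vector operand of the given length."""
    total_bits = vector_bytes * 8
    if total_bits % element_bits != 0:
        raise ValueError("element width does not divide the vector length")
    return total_bits // element_bits
```

A 64-byte vector therefore holds 16 elements at a 32-bit width and 8 elements at a 64-bit width.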

The class A instruction templates in Figure 13A include: 1) within the no memory access 1305 instruction templates, there is shown a no memory access, full round control type operation 1310 instruction template and a no memory access, data transform type operation 1315 instruction template; and 2) within the memory access 1320 instruction templates, there is shown a memory access, temporal 1325 instruction template and a memory access, non-temporal 1330 instruction template. The class B instruction templates in Figure 13B include: 1) within the no memory access 1305 instruction templates, there is shown a no memory access, write mask control, partial round control type operation 1312 instruction template and a no memory access, write mask control, vsize type operation 1317 instruction template; and 2) within the memory access 1320 instruction templates, there is shown a memory access, write mask control 1327 instruction template.

The generic vector friendly instruction format 1300 includes the following fields, listed below in the order illustrated in Figures 13A-13B.

Format field 1340 - a specific value (an instruction format identifier value) in this field uniquely identifies the vector friendly instruction format, and thus identifies occurrences of instructions in the vector friendly instruction format in instruction streams. This field is optional in the sense that it is not needed for an instruction set that has only the generic vector friendly instruction format.

Base operation field 1342 - its content distinguishes different base operations.

Register index field 1344 - its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory. These fields include a sufficient number of bits to select N registers from a PxQ (e.g., 32x512, 16x128, 32x1024, 64x1024) register file. While in one embodiment N may be up to three source registers and one destination register, alternative embodiments may support more or fewer source and destination registers (e.g., may support up to two sources where one of these sources also acts as the destination, may support up to three sources where one of these sources also acts as the destination, or may support up to two sources and one destination).

Modifier field 1346 - its content distinguishes occurrences of instructions in the generic vector instruction format that specify memory access from those that do not; that is, between no memory access 1305 instruction templates and memory access 1320 instruction templates. Memory access operations read from and/or write to the memory hierarchy (in some cases specifying the source and/or destination addresses using values in registers), while non-memory access operations do not (e.g., the source and destinations are registers). While in one embodiment this field also selects between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations.

Augmentation operation field 1350 - its content distinguishes which one of a variety of different operations is to be performed in addition to the base operation. This field is context specific. In one embodiment of the invention, this field is divided into a class field 1368, an alpha field 1352, and a beta field 1354. The augmentation operation field 1350 allows common groups of operations to be performed in a single instruction rather than in 2, 3, or 4 instructions.

Scale field 1360 - its content allows for the scaling of the index field's content for memory address generation (e.g., for address generation that uses 2^scale * index + base).

Displacement field 1362A - its content is used as part of memory address generation (e.g., for address generation that uses 2^scale * index + base + displacement).

Displacement factor field 1362B (note that the juxtaposition of displacement field 1362A directly over displacement factor field 1362B indicates that one or the other is used) - its content is used as part of address generation; it specifies a displacement factor that is to be scaled by the size of a memory access (N), where N is the number of bytes in the memory access (e.g., for address generation that uses 2^scale * index + base + scaled displacement). Redundant low-order bits are ignored, and hence the displacement factor field's content is multiplied by the total size of the memory operand (N) to generate the final displacement used in calculating an effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 1374 (described later herein) and the data manipulation field 1354C. The displacement field 1362A and the displacement factor field 1362B are optional in the sense that they are not used for the no memory access 1305 instruction templates and/or different embodiments may implement only one or neither of the two.
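The two address formulas can be checked with a short Python sketch. The function names are hypothetical, and the model ignores address-size truncation.

```python
def effective_address(base, index, scale, displacement=0):
    """Compute 2^scale * index + base + displacement."""
    return (index << scale) + base + displacement


def scaled_displacement(displacement_factor, access_bytes):
    """Scale a displacement factor by N, the number of bytes in the memory access."""
    return displacement_factor * access_bytes
```

With a 64-byte memory access, a displacement factor of 2 stands for a byte displacement of 128, so a compact factor can cover displacements that are multiples of the access size.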

Data element width field 1364 - its content distinguishes which one of a number of data element widths is to be used (in some embodiments for all instructions; in other embodiments for only some of the instructions). This field is optional in the sense that it is not needed if only one data element width is supported and/or data element widths are supported using some aspect of the opcodes.

Write mask field 1370 - its content controls, on a per data element position basis, whether that data element position in the destination vector operand reflects the result of the base operation and augmentation operation. Class A instruction templates support merging write masking, while class B instruction templates support both merging and zeroing write masking. When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in another embodiment, the old value of each element of the destination whose corresponding mask bit is 0 is preserved. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, it is not necessary that the elements that are modified be consecutive. Thus, the write mask field 1370 allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While embodiments of the invention are described in which the write mask field 1370's content selects one of a number of write mask registers that contains the write mask to be used (and thus the write mask field 1370's content indirectly identifies the masking to be performed), alternative embodiments instead or additionally allow the write mask field 1370's content to directly specify the masking to be performed.
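The difference between merging and zeroing write masking can be sketched in Python on plain lists. The function name `apply_writemask` and the list-based vector model are assumptions made for illustration; bit i of the mask governs element i.

```python
def apply_writemask(result, old_dest, mask, zeroing=False):
    """Combine an operation's result with the old destination under a write mask.

    Mask bit 1: the element takes the new result.
    Mask bit 0: the element keeps its old value (merging) or becomes 0 (zeroing).
    """
    out = []
    for i, (new, old) in enumerate(zip(result, old_dest)):
        if (mask >> i) & 1:
            out.append(new)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out
```

For a result of [10, 20, 30, 40], an old destination of [1, 2, 3, 4], and mask 0b0101, merging gives [10, 2, 30, 4] while zeroing gives [10, 0, 30, 0]. The mask bits need not be contiguous.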

Immediate field 1372 - its content allows for the specification of an immediate. This field is optional in the sense that it is not present in an implementation of the generic vector friendly format that does not support immediates, and it is not present in instructions that do not use an immediate.

Class field 1368 - its content distinguishes between different classes of instructions. With reference to Figures 13A-B, the content of this field selects between class A and class B instructions. In Figures 13A-B, rounded corner squares are used to indicate that a specific value is present in a field (e.g., class A 1368A and class B 1368B for the class field 1368, respectively, in Figures 13A-B).

Class A Instruction Templates

In the case of the class A no memory access 1305 instruction templates, the alpha field 1352 is interpreted as an RS field 1352A, whose content distinguishes which one of the different augmentation operation types is to be performed (e.g., round 1352A.1 and data transform 1352A.2 are respectively specified for the no memory access, round type operation 1310 and the no memory access, data transform type operation 1315 instruction templates), while the beta field 1354 distinguishes which of the operations of the specified type is to be performed. In the no memory access 1305 instruction templates, the scale field 1360, the displacement field 1362A, and the displacement factor field 1362B are not present.

No Memory Access Instruction Templates - Full Round Control Type Operation

In the no memory access, full round control type operation 1310 instruction template, the beta field 1354 is interpreted as a round control field 1354A, whose content provides static rounding. While in the described embodiments of the invention the round control field 1354A includes a suppress all floating point exceptions (SAE) field 1356 and a round operation control field 1358, alternative embodiments may encode both of these concepts into the same field, or may have only one or the other of these concepts/fields (e.g., may have only the round operation control field 1358).

SAE field 1356 - its content distinguishes whether or not to disable exception event reporting; when the SAE field 1356's content indicates that suppression is enabled, a given instruction does not report any kind of floating-point exception flag and does not raise any floating-point exception handler.

Round operation control field 1358 - its content distinguishes which one of a group of rounding operations to perform (e.g., round-up, round-down, round-towards-zero, and round-to-nearest). Thus, the round operation control field 1358 allows the rounding mode to be changed on a per instruction basis. In one embodiment of the invention in which the processor includes a control register for specifying rounding modes, the content of the round operation control field 1358 overrides that register value.
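The four rounding operations and the per-instruction override can be illustrated in Python. This is a simplified sketch: it rounds to an integer rather than to a floating-point precision, the mode names and both function names are hypothetical, and Python's built-in `round` resolves ties to the nearest even integer.

```python
import math

# Mode name -> rounding function (rounding to an integer, as a simplification).
ROUNDING = {
    "up": math.ceil,
    "down": math.floor,
    "toward_zero": math.trunc,
    "nearest": round,  # ties go to the nearest even integer
}


def round_with_mode(value, mode):
    return ROUNDING[mode](value)


def effective_mode(per_instruction_mode, control_register_mode):
    """A mode carried by the instruction overrides the control register's mode."""
    if per_instruction_mode is not None:
        return per_instruction_mode
    return control_register_mode
```

Applied to -2.5, the "up" mode gives -2 and the "down" mode gives -3, so a per-instruction override can change a result without touching the control register.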

无存储器访问的指令模板－数据变换型操作 Instruction templates without memory access - data transformation type operations

在无存储器访问的数据变换型操作1315的指令模板中，β字段1354被解释为数据变换字段1354B，其内容区分要执行大量数据变换中的哪一个(例如，无数据变换、混合、广播)的。 In the instruction template for a data transformation type operation 1315 with no memory access, the β field 1354 is interpreted as a data transformation field 1354B whose content distinguishes which of a number of data transformations to perform (e.g., no data transformation, mixing, broadcasting) .

在A类存储器访问1320的指令模板的情况下，α字段1352被解释为驱逐提示字段1352B，其内容区分要使用驱逐提示中的哪一个(在图13A中，为存储器访问时效性1325指令模板和存储器访问非时效性1330的指令模板分别指定时效性1352B.1和非时效性1352B.2)，而β字段1354被解释为数据操纵字段1354C，其内容区分要执行大量数据操纵操作(也称为基元(primitive))中的哪一个(例如，无操纵、广播、源的向上转换、以及目的地的向下转换)。存储器访问1320的指令模板包括比例字段1360、以及任选的位移字段1362A或位移比例字段1362B。 In the case of an instruction template for a class A memory access 1320, the alpha field 1352 is interpreted as an eviction hint field 1352B whose content distinguishes which of the eviction hints to use (in FIG. The instruction template of the memory access non-timeliness 1330 specifies timeliness 1352B.1 and non-timeliness 1352B.2), respectively, and the β field 1354 is interpreted as a data manipulation field 1354C, whose content distinguishes that a large number of data manipulation operations (also called which of the primitives (eg, no manipulation, broadcast, source upcast, and destination downcast). The instruction template for memory access 1320 includes scale field 1360, and optionally displacement field 1362A or displacement scale field 1362B.

向量存储器指令使用转换支持来执行来自存储器的向量负载并将向量存储到存储器。如同有规律的向量指令，向量存储器指令以数据元素式的方式与存储器来回传输数据，其中实际传输的元素由选为写掩码的向量掩码的内容阐述。 Vector memory instructions use conversion support to perform vector loads from memory and store vectors to memory. Like regular vector instructions, vector memory instructions transfer data to and from memory in a data-element fashion, where the actual elements transferred are specified by the contents of the vector mask selected as the write mask.

Memory access instruction templates - temporal

Temporal data is data likely to be reused soon enough to benefit from caching. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

Memory access instruction templates - non-temporal

Non-temporal data is data unlikely to be reused soon enough to benefit from caching in the first-level cache and should be given priority for eviction. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

Class B instruction templates

In the case of the class B instruction templates, the α field 1352 is interpreted as a write mask control (Z) field 1352C, whose content distinguishes whether the write masking controlled by the write mask field 1370 should be merging or zeroing.
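
As a rough illustration of the distinction drawn by the Z field, the following Python sketch models a masked element-wise write. It assumes the usual meaning of the two modes (merging keeps the old destination element where the mask bit is 0; zeroing writes 0 there). The function name `apply_writemask` and the list-based vector model are hypothetical and are not part of the instruction format.

```python
def apply_writemask(dest, result, mask, zeroing):
    """Model a masked vector write (illustrative only).

    dest    -- list of old destination elements
    result  -- list of newly computed elements (same length as dest)
    mask    -- integer; bit i controls element position i
    zeroing -- True for zeroing-masking, False for merging-masking
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)   # mask bit set: the element is updated
        elif zeroing:
            out.append(0)     # zeroing: a masked-off element is cleared
        else:
            out.append(old)   # merging: a masked-off element is preserved
    return out
```

With a mask of 0b0101 over four elements, only positions 0 and 2 receive new values; the other two positions either keep their old contents (merging) or become 0 (zeroing).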

In the case of the class B no memory access 1305 instruction templates, part of the β field 1354 is interpreted as an RL field 1357A, whose content distinguishes which one of the different extended operation types is to be performed (e.g., round 1357A.1 and vector length (VSIZE) 1357A.2 are specified for the no memory access, write mask control, partial round control type operation 1312 instruction template and the no memory access, write mask control, VSIZE type operation 1317 instruction template, respectively), while the rest of the β field 1354 distinguishes which of the operations of the specified type is to be performed. In the no memory access 1305 instruction templates, the scale field 1360, the displacement field 1362A, and the displacement scale field 1362B are not present.

In the no memory access, write mask control, partial round control type operation 1312 instruction template, the rest of the β field 1354 is interpreted as a rounding operation field 1359A, and exception event reporting is disabled (a given instruction does not report any kind of floating-point exception flag and does not raise any floating-point exception handler).

Rounding operation control field 1359A - just like the rounding operation control field 1358, its content distinguishes which one of a group of rounding operations to perform (e.g., round up, round down, round toward zero, and round to nearest). Thus, the rounding operation control field 1359A allows the rounding mode to be changed on a per-instruction basis. In one embodiment of the invention where the processor includes a control register for specifying the rounding mode, the content of the rounding operation control field 1359A takes precedence over that register value.

In the no memory access, write mask control, VSIZE type operation 1317 instruction template, the rest of the β field 1354 is interpreted as a vector length field 1359B, whose content distinguishes which one of a number of data vector lengths is to be operated on (e.g., 128, 256, or 512 bytes).

在B类存储器访问1320的指令模板的情况下，β字段1354的一部分被解释为广播字段1357B，其内容区分是否要执行广播型数据操纵操作，而β字段1354的其余部分被解释为向量长度字段1359B。存储器访问1320的指令模板包括比例字段1360、以及任选的位移字段1362A或位移比例字段1362B。 In the case of the instruction template for class B memory access 1320, a portion of the β field 1354 is interpreted as a broadcast field 1357B whose content distinguishes whether a broadcast-type data manipulation operation is to be performed, while the rest of the β field 1354 is interpreted as a vector length field 1359B. The instruction template for memory access 1320 includes scale field 1360, and optionally displacement field 1362A or displacement scale field 1362B.

For the generic vector friendly instruction format 1300, a full opcode field 1374 is shown, including the format field 1340, the base operation field 1342, and the data element width field 1364. Although one embodiment is shown in which the full opcode field 1374 includes all of these fields, in embodiments that do not support all of them the full opcode field 1374 includes fewer than all of these fields. The full opcode field 1374 provides the operation code (opcode).

扩充操作字段1350、数据元素宽度字段1364以及写掩码字段1370允许这些特征在每一指令的基础上以通用向量友好指令格式指定。 Extended operation field 1350, data element width field 1364, and writemask field 1370 allow these features to be specified on a per-instruction basis in a generic vector friendly instruction format.

写掩码字段和数据元素宽度字段的组合创建各种类型的指令，其中这些指令允许基于不同的数据元素宽度应用该掩码。 The combination of the write mask field and the data element width field creates various types of instructions that allow the mask to be applied based on different data element widths.

The various instruction templates found within class A and class B are beneficial in different situations. In some embodiments of the invention, different processors or different cores within a processor may support only class A, only class B, or both classes. For instance, a high-performance general-purpose out-of-order core intended for general-purpose computing may support only class B, a core intended primarily for graphics and/or scientific (throughput) computing may support only class A, and a core intended for both may support both (of course, a core that has some mix of templates and instructions from both classes, but not all templates and instructions from both classes, is within the purview of the invention). Likewise, a single processor may include multiple cores, all of which support the same class or in which different cores support different classes. For instance, in a processor with separate graphics and general-purpose cores, one of the graphics cores intended primarily for graphics and/or scientific computing may support only class A, while one or more of the general-purpose cores may be high-performance general-purpose cores with out-of-order execution and register renaming, intended for general-purpose computing, that support class B. Another processor that does not have a separate graphics core may include one or more general-purpose in-order or out-of-order cores that support both class A and class B. Of course, features from one class may also be implemented in the other class in different embodiments of the invention. Programs written in a high-level language may be put (e.g., just-in-time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the class or classes supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all classes, together with control flow code that selects which routines to execute based on the instructions supported by the processor currently executing the code.
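
The second executable form (alternative routines plus control flow code that selects among them) might be pictured with the following Python sketch. It is only a model: the routine names, the `supported_classes` set, and the preference for class B over class A are all assumptions made for illustration, and the stand-in routines do not actually use class-specific instructions.

```python
def sum_generic(values):
    # Fallback routine that relies on no class-specific instructions.
    return sum(values)

def sum_class_a(values):
    # Stand-in for a routine written with class A instructions.
    return sum(values)

def sum_class_b(values):
    # Stand-in for a routine written with class B instructions.
    return sum(values)

def select_routine(supported_classes):
    """Control flow code: pick a routine based on the classes that the
    processor currently executing the code reports as supported."""
    if "B" in supported_classes:
        return sum_class_b
    if "A" in supported_classes:
        return sum_class_a
    return sum_generic
```

The selection would typically run once, after which the chosen routine is called directly.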

示例性专用向量友好指令格式 Exemplary Specific Vector Friendly Instruction Format

图14是示出根据本发明的实施例的示例性专用向量友好指令格式的框图。图14示出在其指定位置、尺寸、解释和字段的次序、以及那些字段中的一些字段的值的意义上是专用的专用向量友好指令格式1400。专用向量友好指令格式1400可用于扩展x86指令集，并且由此一些字段类似于在现有x86指令集及其扩展(例如，AVX)中使用的那些字段或与之相同。该格式保持与具有扩展的现有x86指令集的前缀编码字段、实操作码字节字段、MOD R/M字段、SIB字段、位移字段、以及立即数字段一致。示出来自图13的字段映射到的来自图14的字段。 Figure 14 is a block diagram illustrating an exemplary specific vector friendly instruction format according to an embodiment of the present invention. Figure 14 shows a specific vector friendly instruction format 1400 that is specific in the sense that it specifies the location, size, interpretation, and order of the fields, and the values of some of those fields. The specific vector friendly instruction format 1400 can be used to extend the x86 instruction set, and thus some fields are similar or identical to those used in the existing x86 instruction set and its extensions (eg, AVX). The format remains consistent with the prefix encoding field, real opcode byte field, MOD R/M field, SIB field, displacement field, and immediate field of the existing x86 instruction set with extensions. The fields from FIG. 14 to which the fields from FIG. 13 map are shown.

It should be understood that, although embodiments of the invention are described with reference to the specific vector friendly instruction format 1400 in the context of the generic vector friendly instruction format 1300 for illustrative purposes, the invention is not limited to the specific vector friendly instruction format 1400 except where stated. For example, the generic vector friendly instruction format 1300 contemplates a variety of possible sizes for the various fields, while the specific vector friendly instruction format 1400 is shown as having fields of specific sizes. As a specific example, while the data element width field 1364 is shown as a one-bit field in the specific vector friendly instruction format 1400, the invention is not so limited (that is, the generic vector friendly instruction format 1300 contemplates other sizes of the data element width field 1364).

The generic vector friendly instruction format 1300 includes the following fields, listed in the order shown in Figure 14A.

EVEX前缀(字节0-3)1402－以四字节形式进行编码。 EVEX prefix (bytes 0-3) 1402 - Encoded in four bytes.

Format field 1340 (EVEX byte 0, bits [7:0]) - the first byte (EVEX byte 0) is the format field 1340, and it contains 0x62 (the unique value used to distinguish the vector friendly instruction format in one embodiment of the invention).

第二-第四字节(EVEX字节1-3)包括提供专用能力的大量位字段。 The second-fourth bytes (EVEX bytes 1-3) include a number of bit fields providing specific capabilities.

REX field 1405 (EVEX byte 1, bits [7-5]) - consists of an EVEX.R bit field (EVEX byte 1, bit [7] - R), an EVEX.X bit field (EVEX byte 1, bit [6] - X), and an EVEX.B bit field (EVEX byte 1, bit [5] - B). The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields and are encoded in 1's complement form, i.e., ZMM0 is encoded as 1111B and ZMM15 is encoded as 0000B. Other fields of the instructions encode the lower three bits of the register indexes (rrr, xxx, and bbb), as is known in the art, so that Rrrr, Xxxx, and Bbbb may be formed by adding EVEX.R, EVEX.X, and EVEX.B.
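
The inverted encoding can be made concrete with a short Python sketch. The helper names are hypothetical; the only facts taken from the text are that the extension bit is stored in inverted form and supplies the fourth (high) bit above the three low bits rrr, xxx, or bbb.

```python
def register_index(inverted_ext_bit, low3):
    """Form a 4-bit register index (e.g. Rrrr) from an EVEX extension bit
    stored in inverted form and the 3 low bits taken from another field."""
    ext = (inverted_ext_bit ^ 1) & 1          # undo the inversion
    return (ext << 3) | (low3 & 0b111)

def encode_inverted4(index):
    """Inverted (1's complement) 4-bit encoding of a register index 0..15."""
    return (~index) & 0b1111
```

Under this model `encode_inverted4(0)` yields 0b1111 and `encode_inverted4(15)` yields 0b0000, which matches the ZMM0 and ZMM15 encodings given above.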

REX' field 1410 - this is the first part of the REX' field 1410 and is the EVEX.R' bit field (EVEX byte 1, bit [4] - R') that is used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. In one embodiment of the invention, this bit, along with others as indicated below, is stored in bit-inverted format to distinguish it (in the well-known x86 32-bit mode) from the BOUND instruction, whose real opcode byte is 62 but which does not accept the value 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this bit and the other bits indicated below in inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R, and the other rrr from other fields.

Opcode map field 1415 (EVEX byte 1, bits [3:0] - mmmm) - its content encodes an implied leading opcode byte (0F, 0F38, or 0F3).

Data element width field 1364 (EVEX byte 2, bit [7] - W) - represented by the notation EVEX.W. EVEX.W is used to define the granularity (size) of the data type (either 32-bit data elements or 64-bit data elements).

EVEX.vvvv 1420 (EVEX byte 2, bits [6:3] - vvvv) - the role of EVEX.vvvv may include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1's complement) form, and is valid for instructions with two or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1's complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. Thus, the EVEX.vvvv field 1420 encodes the 4 low-order bits of the first source register specifier, stored in inverted (1's complement) form. Depending on the instruction, an additional, different EVEX bit field is used to extend the specifier size to 32 registers.

EVEX.U class field 1368 (EVEX byte 2, bit [2] - U) - if EVEX.U = 0, it indicates class A or EVEX.U0; if EVEX.U = 1, it indicates class B or EVEX.U1.

Prefix encoding field 1425 (EVEX byte 2, bits [1:0] - pp) - provides additional bits for the base operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the benefit of compacting the SIMD prefix (rather than requiring a byte to express the SIMD prefix, the EVEX prefix requires only 2 bits). In one embodiment, to support legacy SSE instructions that use a SIMD prefix (66H, F2H, F3H) in both the legacy format and the EVEX prefix format, these legacy SIMD prefixes are encoded into the SIMD prefix encoding field and, at run time, are expanded into the legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can execute both the legacy and EVEX formats of these legacy instructions without modification). Although newer instructions could use the content of the EVEX prefix encoding field directly as an opcode extension, certain embodiments expand it in a similar fashion for consistency but allow different meanings to be specified by these legacy SIMD prefixes. An alternative embodiment may redesign the PLA to support the 2-bit SIMD prefix encodings and thus not require the expansion.
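
The run-time expansion step can be sketched as a small lookup in Python. The text above names the three legacy prefixes but does not list which 2-bit value stands for which; the table below follows the published VEX/EVEX encoding (00 for no prefix, 01 for 66H, 10 for F3H, 11 for F2H) and should be read as an assumption relative to this document. The names `PP_TO_LEGACY_PREFIX` and `expand_simd_prefix` are hypothetical.

```python
# pp value -> legacy SIMD prefix byte (None means "no SIMD prefix").
# Assumed mapping, taken from the published VEX/EVEX encoding.
PP_TO_LEGACY_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}

def expand_simd_prefix(pp):
    """Expand the 2-bit prefix encoding field into the legacy SIMD prefix
    byte that would be handed to an unmodified decoder."""
    return PP_TO_LEGACY_PREFIX[pp & 0b11]
```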

α field 1352 (EVEX byte 3, bit [7] - EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.write mask control, and EVEX.N; also shown as α) - as previously described, this field is context specific.

β field 1354 (EVEX byte 3, bits [6:4] - SSS; also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, and EVEX.LLB; also shown as βββ) - as previously described, this field is context specific.

REX' field 1410 - this is the remainder of the REX' field 1410 and is the EVEX.V' bit field (EVEX byte 3, bit [3] - V') that may be used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. This bit is stored in bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.

Write mask field 1370 (EVEX byte 3, bits [2:0] - kkk) - its content specifies the index of a register in the write mask registers, as previously described. In one embodiment of the invention, the specific value EVEX.kkk=000 has a special behavior implying that no write mask is used for the particular instruction (this may be implemented in a variety of ways, including the use of a write mask hardwired to all ones or hardware that bypasses the masking hardware).
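
Taken together, the byte and bit positions listed above for EVEX bytes 0-3 can be collected into one decoding routine. The Python sketch below is an illustration only: the function name, the dictionary keys, and the derived `first_source` entry are hypothetical, and the inversion of R, X, B, R', vvvv, and V' reflects the embodiment that stores those bits in inverted form.

```python
def parse_evex_prefix(prefix):
    """Split a 4-byte EVEX prefix into its fields.

    Fields stored in inverted form (R, X, B, R', vvvv, V') are returned
    after the inversion has been undone.
    """
    b0, b1, b2, b3 = prefix
    if b0 != 0x62:
        raise ValueError("format field does not contain 0x62")

    def inv(bit):
        return bit ^ 1

    fields = {
        "R": inv((b1 >> 7) & 1),          # byte 1, bit [7]
        "X": inv((b1 >> 6) & 1),          # byte 1, bit [6]
        "B": inv((b1 >> 5) & 1),          # byte 1, bit [5]
        "R_prime": inv((b1 >> 4) & 1),    # byte 1, bit [4]
        "mmmm": b1 & 0xF,                 # byte 1, bits [3:0]
        "W": (b2 >> 7) & 1,               # byte 2, bit [7]
        "vvvv": ((b2 >> 3) & 0xF) ^ 0xF,  # byte 2, bits [6:3]
        "U": (b2 >> 2) & 1,               # byte 2, bit [2]
        "pp": b2 & 0b11,                  # byte 2, bits [1:0]
        "alpha": (b3 >> 7) & 1,           # byte 3, bit [7]
        "beta": (b3 >> 4) & 0b111,        # byte 3, bits [6:4]
        "V_prime": inv((b3 >> 3) & 1),    # byte 3, bit [3]
        "kkk": b3 & 0b111,                # byte 3, bits [2:0]
    }
    # V'VVVV: 5-bit specifier that can address 32 registers.
    fields["first_source"] = (fields["V_prime"] << 4) | fields["vvvv"]
    return fields
```

For example, the prefix bytes 62 F1 7C 48 decode to U = 1 (class B), mmmm = 1, beta = 4, kkk = 0, and a first source specifier of 0.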

实操作码字段1430(字节4)还被称为操作码字节。操作码的一部分在该字段中指定。 The real opcode field 1430 (byte 4) is also referred to as the opcode byte. Part of the opcode is specified in this field.

MOD R/M field 1440 (byte 5) includes the MOD field 1442, the Reg field 1444, and the R/M field 1446. As previously described, the content of the MOD field 1442 distinguishes between memory access and non-memory access operations. The role of the Reg field 1444 can be summarized as two situations: encoding either the destination register operand or a source register operand, or being treated as an opcode extension and not used to encode any instruction operand. The role of the R/M field 1446 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
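
A minimal Python sketch of splitting the MOD R/M byte follows. The bit positions used (MOD in bits [7:6], Reg in bits [5:3], R/M in bits [2:0]) are the standard x86 layout and are not spelled out in the text above, so they are an assumption here; the function names are hypothetical. The rule that MOD = 11 signifies a no memory access operation is taken from this description.

```python
def split_modrm(modrm):
    """Return (mod, reg, rm) from a MOD R/M byte, assuming the standard
    x86 layout: mod = bits [7:6], reg = bits [5:3], rm = bits [2:0]."""
    mod = (modrm >> 6) & 0b11
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    return mod, reg, rm

def is_memory_access(modrm):
    """MOD = 11 signifies a no memory access operation; 00, 01, and 10
    signify memory access operations."""
    mod, _, _ = split_modrm(modrm)
    return mod != 0b11
```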

Scale, index, base (SIB) byte (byte 6) - as previously described, the content of the scale field 1360 is used for memory address generation. SIB.xxx 1454 and SIB.bbb 1456 - the contents of these fields have been previously referred to with regard to the register indexes Xxxx and Bbbb.

Displacement field 1362A (bytes 7-10) - when the MOD field 1442 contains 10, bytes 7-10 are the displacement field 1362A, which works the same as the legacy 32-bit displacement (disp32) and works at byte granularity.

Displacement factor field 1362B (byte 7) - when the MOD field 1442 contains 01, byte 7 is the displacement factor field 1362B. The location of this field is the same as that of the legacy x86 instruction set 8-bit displacement (disp8), which works at byte granularity. Since disp8 is sign extended, it can only address offsets between -128 and 127 bytes; in terms of 64-byte cache lines, disp8 uses 8 bits that can be set to only four really useful values: -128, -64, 0, and 64. Since a greater range is often needed, disp32 is used; however, disp32 requires 4 bytes. In contrast to disp8 and disp32, the displacement factor field 1362B is a reinterpretation of disp8: when the displacement factor field 1362B is used, the actual displacement is determined by multiplying the content of the displacement factor field by the size of the memory operand access (N). This type of displacement is referred to as disp8*N. It reduces the average instruction length (a single byte is used for the displacement, but with a much greater range). Such a compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 1362B substitutes for the legacy x86 instruction set 8-bit displacement. Thus, the displacement factor field 1362B is encoded the same way as an x86 instruction set 8-bit displacement (so there are no changes in the ModRM/SIB encoding rules), the only exception being that disp8 is overloaded to disp8*N. In other words, there are no changes in the encoding rules or encoding lengths, only in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size of the memory operand to obtain a byte-wise address offset).
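
The disp8*N arithmetic described above can be worked through in a few lines of Python. This is a sketch under the stated rule only (sign-extend the byte, then multiply by the memory operand size N); the function names are hypothetical, and the fallback to disp32 is represented simply by returning `None`.

```python
def sign_extend8(byte):
    """Interpret an 8-bit value as a signed integer in -128..127."""
    return byte - 0x100 if byte & 0x80 else byte

def disp8n_offset(disp_byte, n):
    """Byte-wise address offset for a disp8*N displacement: the stored
    factor is sign-extended and scaled by the memory operand size N."""
    return sign_extend8(disp_byte) * n

def encode_disp8n(offset, n):
    """Return the one-byte displacement factor if the offset fits the
    disp8*N form, or None if a disp32 would be needed instead."""
    if offset % n != 0:
        return None               # not a multiple of the access size
    factor = offset // n
    if not -128 <= factor <= 127:
        return None               # factor does not fit in a signed byte
    return factor & 0xFF
```

With N = 64, a single displacement byte reaches offsets from -8192 to 8128 in steps of 64, whereas a plain disp8 reaches only -128 to 127.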

立即数字段1372如先前所述地操作。 The immediate field 1372 operates as previously described.

Full opcode field

Figure 14B is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the full opcode field 1374 according to an embodiment of the invention. Specifically, the full opcode field 1374 includes the format field 1340, the base operation field 1342, and the data element width (W) field 1364. The base operation field 1342 includes the prefix encoding field 1425, the opcode map field 1415, and the real opcode field 1430.

Register index field

Figure 14C is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the register index field 1344 according to one embodiment of the invention. Specifically, the register index field 1344 includes the REX field 1405, the REX' field 1410, the MODR/M.reg field 1444, the MODR/M.r/m field 1446, the VVVV field 1420, the xxx field 1454, and the bbb field 1456.

Extended operation field

Figure 14D is a block diagram illustrating the fields of the specific vector friendly instruction format 1400 that make up the extended operation field 1350 according to one embodiment of the invention. When the class (U) field 1368 contains 0, it signifies EVEX.U0 (class A 1368A); when it contains 1, it signifies EVEX.U1 (class B 1368B). When U=0 and the MOD field 1442 contains 11 (signifying a no memory access operation), the α field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the rs field 1352A. When the rs field 1352A contains 1 (round 1352A.1), the β field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the round control field 1354A. The round control field 1354A includes a one-bit SAE field 1356 and a two-bit rounding operation field 1358. When the rs field 1352A contains 0 (data transformation 1352A.2), the β field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data transformation field 1354B. When U=0 and the MOD field 1442 contains 00, 01, or 10 (signifying a memory access operation), the α field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the eviction hint (EH) field 1352B and the β field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data manipulation field 1354C.

When U=1, the α field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the write mask control (Z) field 1352C. When U=1 and the MOD field 1442 contains 11 (signifying a no memory access operation), part of the β field 1354 (EVEX byte 3, bit [4] - S0) is interpreted as the RL field 1357A; when it contains 1 (round 1357A.1), the rest of the β field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the rounding operation field 1359A, while when the RL field 1357A contains 0 (VSIZE 1357A.2), the rest of the β field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0). When U=1 and the MOD field 1442 contains 00, 01, or 10 (signifying a memory access operation), the β field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0) and the broadcast field 1357B (EVEX byte 3, bit [4] - B).
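
The context-dependent interpretation of the α and β fields described for Figure 14D amounts to a small decision tree on U and MOD, which the following Python sketch attempts to capture. The function name and dictionary keys are hypothetical. The sketch returns the class A round control field as a whole three-bit value, because the bit positions of the SAE field and the rounding operation field within it are not given in this passage.

```python
def interpret_extended_operation(u, mod, alpha, beta):
    """Interpret alpha (1 bit) and beta (3 bits) from EVEX byte 3
    according to the class bit U and the MOD field."""
    memory_access = mod != 0b11
    if u == 0:                                   # class A
        if memory_access:
            return {"eviction_hint": alpha, "data_manipulation": beta}
        if alpha == 1:                           # rs = 1: round
            return {"round_control": beta}
        return {"data_transform": beta}          # rs = 0: data transformation
    # class B: alpha is the write mask control (Z) field
    rest = (beta >> 1) & 0b11                    # byte 3, bits [6-5]
    low = beta & 1                               # byte 3, bit [4]
    if memory_access:
        return {"z": alpha, "vector_length": rest, "broadcast": low}
    if low == 1:                                 # RL = 1: round
        return {"z": alpha, "round_operation": rest}
    return {"z": alpha, "vector_length": rest}   # RL = 0: VSIZE
```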

示例性寄存器架构 Exemplary Register Architecture

Figure 15 is a block diagram of a register architecture 1500 according to one embodiment of the invention. In the embodiment shown, there are 32 vector registers 1510 that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower-order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower-order 128 bits of the lower 16 zmm registers (the lower-order 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector friendly instruction format 1400 operates on these overlaid register files as shown in the table below.

In other words, the vector length field 1359B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding one, and instruction templates without the vector length field 1359B operate on the maximum vector length. Further, in one embodiment, the class B instruction templates of the specific vector friendly instruction format 1400 operate on packed or scalar single/double-precision floating-point data and packed or scalar integer data. Scalar operations are operations performed on the lowest-order data element position in a zmm/ymm/xmm register; depending on the embodiment, the higher-order data element positions are either left the same as they were prior to the instruction or zeroed.
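
The halving rule for selectable vector lengths and the overlay of the narrower registers on the low-order bits of a wider register can both be modeled with plain integers. The Python sketch below is illustrative only; the function names and the use of an integer to stand for register contents are assumptions.

```python
def selectable_lengths(max_bits, shorter_count):
    """Maximum vector length followed by `shorter_count` shorter lengths,
    each half the length of the preceding one."""
    lengths = [max_bits]
    for _ in range(shorter_count):
        lengths.append(lengths[-1] // 2)
    return lengths

def low_order_view(register_value, width_bits):
    """Low-order `width_bits` of a wider register, modeling how a ymm or
    xmm register overlays the low-order bits of a zmm register."""
    return register_value & ((1 << width_bits) - 1)
```

Starting from 512 bits with two shorter lengths gives 512, 256, and 128 bits, matching the zmm, ymm, and xmm widths described above.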

Write mask registers 1515 - in the embodiment shown, there are 8 write mask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, the write mask registers 1515 are 16 bits in size. As previously described, in one embodiment of the invention the vector mask register k0 cannot be used as a write mask; when the encoding that would normally indicate k0 is used for a write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.
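
The special handling of the k0 encoding might be modeled as follows. This Python sketch assumes the embodiment with a hardwired 0xFFFF mask; the names `HARDWIRED_ALL_ONES` and `effective_writemask`, and the list used to stand for the eight mask registers, are hypothetical.

```python
HARDWIRED_ALL_ONES = 0xFFFF   # hardwired write mask selected by kkk = 000

def effective_writemask(kkk, mask_registers):
    """Return the write mask an instruction actually applies.

    kkk            -- 3-bit write mask field from the EVEX prefix
    mask_registers -- sequence of 8 integers standing for k0..k7
    """
    if kkk == 0:
        # The encoding that would normally indicate k0 disables masking.
        return HARDWIRED_ALL_ONES
    return mask_registers[kkk]
```

Note that the content of k0 itself is never consulted in this model, whatever value it holds.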

通用寄存器1525——在所示出的实施例中，有十六个64位通用寄存器，这些寄存器与现有的x86寻址模式来寻址存储器操作数一起使用。这些寄存器通过名称RAX、RBX、RCX、RDX、RBP、RSI、RDI、RSP，以及R8到R15来引用。 General Purpose Registers 1525 - In the illustrated embodiment, there are sixteen 64-bit general purpose registers that are used with existing x86 addressing modes to address memory operands. These registers are referred to by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

Scalar floating-point stack register file (x87 stack) 1545, on which is aliased the MMX packed integer flat register file 1550 - in the embodiment shown, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.

本发明的替换实施例可以使用较宽的或较窄的寄存器。另外，本发明的替换实施例可以使用多一些，少一些或不同的寄存器组和寄存器。 Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer or different register banks and registers.

示例性核架构、处理器和计算机架构 Exemplary core architecture, processor and computer architecture

Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; and 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

Exemplary Core Architectures

In-Order and Out-of-Order Core Block Diagram

FIG. 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. FIG. 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid lined boxes in FIGS. 16A-16B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In FIG. 16A, a processor pipeline 1600 includes a fetch stage 1602, a length decode stage 1604, a decode stage 1606, an allocation stage 1608, a renaming stage 1610, a scheduling (also known as dispatch or issue) stage 1612, a register read/memory read stage 1614, an execute stage 1616, a write back/memory write stage 1618, an exception handling stage 1622, and a commit stage 1624.

FIG. 16B shows a processor core 1690 including a front end unit 1630 coupled to an execution engine unit 1650, with both the execution engine unit and the front end unit coupled to a memory unit 1670. The core 1690 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1690 may be a special-purpose core, such as, for example, a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, a graphics core, or the like.

The front end unit 1630 includes a branch prediction unit 1632 coupled to an instruction cache unit 1634, which is coupled to an instruction translation lookaside buffer (TLB) 1636, which is coupled to an instruction fetch unit 1638, which is coupled to a decode unit 1640. The decode unit 1640 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1640 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and the like. In one embodiment, the core 1690 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in the decode unit 1640 or otherwise within the front end unit 1630). The decode unit 1640 is coupled to a rename/allocator unit 1652 in the execution engine unit 1650.

The execution engine unit 1650 includes the rename/allocator unit 1652 coupled to a retirement unit 1654 and a set of one or more scheduler units 1656. The scheduler unit 1656 represents any number of different schedulers, including reservation stations, a central instruction window, and the like. The scheduler unit 1656 is coupled to the physical register file unit 1658. Each physical register file unit 1658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, and status (e.g., an instruction pointer that is the address of the next instruction to be executed). In one embodiment, the physical register file unit 1658 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file unit 1658 is overlapped by the retirement unit 1654 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using register maps and a pool of registers; and so on). The retirement unit 1654 and the physical register file unit 1658 are coupled to the execution cluster 1660. The execution cluster 1660 includes a set of one or more execution units 1662 and a set of one or more memory access units 1664. The execution units 1662 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 1656, physical register file unit 1658, and execution cluster 1660 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has the memory access units 1664). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 1664 is coupled to the memory unit 1670, which includes a data TLB unit 1672 coupled to a data cache unit 1674, which in turn is coupled to a level 2 (L2) cache unit 1676. In one exemplary embodiment, the memory access units 1664 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1672 in the memory unit 1670. The instruction cache unit 1634 is further coupled to the level 2 (L2) cache unit 1676 in the memory unit 1670. The L2 cache unit 1676 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1600 as follows: 1) the instruction fetch unit 1638 performs the fetch and length decode stages 1602 and 1604; 2) the decode unit 1640 performs the decode stage 1606; 3) the rename/allocator unit 1652 performs the allocation stage 1608 and the renaming stage 1610; 4) the scheduler unit 1656 performs the scheduling stage 1612; 5) the physical register file unit 1658 and the memory unit 1670 perform the register read/memory read stage 1614, and the execution cluster 1660 performs the execute stage 1616; 6) the memory unit 1670 and the physical register file unit 1658 perform the write back/memory write stage 1618; 7) various units may be involved in the exception handling stage 1622; and 8) the retirement unit 1654 and the physical register file unit 1658 perform the commit stage 1624.
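The stage-to-unit assignment just listed can be restated as a small lookup table. The following Python sketch is illustrative only: the stage names and reference numerals come from the description of pipeline 1600, while the table layout and the helper names `units_for` and `stages_for` are assumptions made for this example.

```python
# Stage names and reference numerals follow the description of pipeline 1600.
# The table layout and helper functions are illustrative assumptions.
PIPELINE_1600 = [
    ("fetch", 1602, ["instruction fetch unit 1638"]),
    ("length decode", 1604, ["instruction fetch unit 1638"]),
    ("decode", 1606, ["decode unit 1640"]),
    ("allocation", 1608, ["rename/allocator unit 1652"]),
    ("renaming", 1610, ["rename/allocator unit 1652"]),
    ("scheduling", 1612, ["scheduler unit 1656"]),
    ("register read/memory read", 1614,
     ["physical register file unit 1658", "memory unit 1670"]),
    ("execute", 1616, ["execution cluster 1660"]),
    ("write back/memory write", 1618,
     ["memory unit 1670", "physical register file unit 1658"]),
    ("exception handling", 1622, ["various units"]),
    ("commit", 1624,
     ["retirement unit 1654", "physical register file unit 1658"]),
]


def units_for(stage_name):
    """Return the units that perform the named pipeline stage."""
    for name, _numeral, units in PIPELINE_1600:
        if name == stage_name:
            return units
    raise KeyError(stage_name)


def stages_for(unit):
    """Return, in pipeline order, the stages in which a unit takes part."""
    return [name for name, _numeral, units in PIPELINE_1600 if unit in units]
```

For instance, `stages_for("physical register file unit 1658")` lists the register read/memory read, write back/memory write, and commit stages, reflecting that this unit participates at three points in the pipeline.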

The core 1690 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instructions described herein. In one embodiment, the core 1690 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2, and/or some form of the generic vector friendly instruction format (U=0 and/or U=1) previously described), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, such as in Hyper-Threading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1634/1674 and a shared L2 cache unit 1676, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

Specific Exemplary In-Order Core Architecture

FIGS. 17A-B illustrate a block diagram of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. Depending on the application, the logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic.

FIG. 17A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1702 and its local subset of the level 2 (L2) cache 1704, according to embodiments of the invention. In one embodiment, an instruction decoder 1700 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1706 allows low-latency accesses to cache memory by the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 1708 and a vector unit 1710 use separate register sets (scalar registers 1712 and vector registers 1714, respectively), and data transferred between them is written to memory and then read back in from the level 1 (L1) cache 1706, alternative embodiments of the invention may use a different approach (e.g., use a single register set, or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 1704 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1704. Data read by a processor core is stored in its L2 cache subset 1704 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1704 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bidirectional to allow agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.
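The read and write behavior of the per-core L2 subsets can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the described hardware: the class name `SlicedL2`, the use of dictionaries as cache storage, and the write-through update of backing memory are all hypothetical simplifications, and the ring network is not modeled.

```python
class SlicedL2:
    """Toy model of a global L2 cache split into one local subset per core.

    Assumptions for illustration: each subset is a plain dict keyed by
    address, capacity is unlimited, and writes go straight through to memory.
    """

    def __init__(self, num_cores):
        self.subsets = [dict() for _ in range(num_cores)]

    def read(self, core, addr, memory):
        # Data read by a core is stored in that core's own local subset.
        subset = self.subsets[core]
        if addr not in subset:
            subset[addr] = memory[addr]
        return subset[addr]

    def write(self, core, addr, value, memory):
        # Data written by a core is flushed from the other subsets ...
        for other, subset in enumerate(self.subsets):
            if other != core:
                subset.pop(addr, None)
        # ... and stored in the writer's own subset.
        self.subsets[core][addr] = value
        memory[addr] = value  # write-through is an assumption of this sketch
```

After core 0 writes an address that core 1 had previously read, core 1's subset no longer holds the stale copy, so its next read fetches the updated value.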

FIG. 17B is an expanded view of part of the processor core in FIG. 17A according to embodiments of the invention. FIG. 17B includes an L1 data cache 1706A, part of the L1 cache 1704, as well as more detail regarding the vector unit 1710 and the vector registers 1714. Specifically, the vector unit 1710 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1728), which executes one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports swizzling of the register inputs with swizzle unit 1720, numeric conversion with numeric convert units 1722A-B, and replication of the memory input with replication unit 1724. Write mask registers 1726 allow predicating the resulting vector writes.
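To make the role of the write mask concrete, the following Python sketch models a predicated 16-lane vector add. It is an illustration under assumptions: the function name `masked_add` is hypothetical, the mask is modeled as an integer with one bit per lane, and lanes whose mask bit is 0 are assumed to keep the previous destination value (a merging behavior that the text above does not specify).

```python
VPU_WIDTH = 16  # matches the 16-wide VPU described above


def masked_add(dest, a, b, write_mask):
    """Add a and b lane by lane, writing only lanes enabled in write_mask.

    write_mask is an integer whose bit i controls lane i. Lanes with a 0 bit
    keep the old value from dest (assumed merging behavior).
    """
    if not (len(dest) == len(a) == len(b) == VPU_WIDTH):
        raise ValueError("all operands must have exactly VPU_WIDTH lanes")
    result = list(dest)
    for lane in range(VPU_WIDTH):
        if (write_mask >> lane) & 1:
            result[lane] = a[lane] + b[lane]
    return result
```

With a mask of `0b101`, only lanes 0 and 2 of the destination are updated; every other lane is left unchanged.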

Processor with Integrated Memory Controller and Graphics

FIG. 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the invention. The solid lined boxes in FIG. 18 illustrate a processor 1800 with a single core 1802A, a system agent 1810, and a set of one or more bus controller units 1816, while the optional addition of the dashed lined boxes illustrates an alternative processor 1800 with multiple cores 1802A-N, a set of one or more integrated memory controller units 1814 in the system agent unit 1810, and special-purpose logic 1808.

Thus, different implementations of the processor 1800 may include: 1) a CPU in which the special-purpose logic 1808 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1802A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor in which the cores 1802A-N are a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor in which the cores 1802A-N are a large number of general-purpose in-order cores. Thus, the processor 1800 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1800 may be a part of one or more substrates, and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1806, and external memory (not shown) coupled to the set of integrated memory controller units 1814. The set of shared cache units 1806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1812 interconnects the integrated graphics logic 1808, the set of shared cache units 1806, and the system agent unit 1810/integrated memory controller units 1814, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between the one or more cache units 1806 and the cores 1802A-N.

In some embodiments, one or more of the cores 1802A-N are capable of multithreading. The system agent 1810 includes those components that coordinate and operate the cores 1802A-N. The system agent unit 1810 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the logic and components needed for regulating the power state of the cores 1802A-N and the integrated graphics logic 1808. The display unit is for driving one or more externally connected displays.

The cores 1802A-N may be homogeneous or heterogeneous in terms of architectural instruction set; that is, two or more of the cores 1802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
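One practical consequence of heterogeneous cores is that software must determine which cores can run a given piece of code. The Python sketch below illustrates the idea. The core names, the feature labels `"base"` and `"vector"`, and the function `eligible_cores` are all hypothetical and do not appear in the description above.

```python
# Hypothetical cores and instruction set features, for illustration only.
CORES = {
    "core_a": frozenset({"base", "vector"}),
    "core_b": frozenset({"base", "vector"}),  # same instruction set as core_a
    "core_c": frozenset({"base"}),            # supports only a subset
}


def eligible_cores(required, cores=CORES):
    """Return the names of cores whose instruction set covers `required`."""
    required = frozenset(required)
    return sorted(name for name, feats in cores.items() if required <= feats)
```

Code that needs the hypothetical `"vector"` feature can run on `core_a` or `core_b` but not on `core_c`, whereas code that uses only `"base"` can run on any of the three.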

Exemplary Computer Architectures

FIGS. 19-22 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems and electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to FIG. 19, shown is a block diagram of a system 1900 in accordance with one embodiment of the present invention. The system 1900 may include one or more processors 1910, 1915, which are coupled to a controller hub 1920. In one embodiment, the controller hub 1920 includes a graphics memory controller hub (GMCH) 1990 and an input/output hub (IOH) 1950 (which may be on separate chips); the GMCH 1990 includes memory and graphics controllers to which the memory 1940 and a coprocessor 1945 are coupled; and the IOH 1950 couples input/output (I/O) devices 1960 to the GMCH 1990. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1940 and the coprocessor 1945 are coupled directly to the processor 1910, and the controller hub 1920 is in a single chip with the IOH 1950.

The optional nature of the additional processors 1915 is denoted in FIG. 19 with broken lines. Each processor 1910, 1915 may include one or more of the processing cores described herein and may be some version of the processor 1800.

The memory 1940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1920 communicates with the processors 1910, 1915 via a multi-drop bus, such as a frontside bus (FSB), a point-to-point interface, such as QuickPath Interconnect (QPI), or a similar connection 1995.

In one embodiment, the coprocessor 1945 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1920 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1910, 1915 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.

In one embodiment, the processor 1910 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1945. Accordingly, the processor 1910 issues these coprocessor instructions (or control signals representing coprocessor instructions) to the coprocessor 1945 on a coprocessor bus or other interconnect. The coprocessor 1945 accepts and executes the received coprocessor instructions.
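The recognize-and-forward behavior described above can be sketched as a simple dispatch loop. The Python below is an illustration only: the opcode names and the set `COPROCESSOR_OPCODES` are hypothetical, and the coprocessor bus is reduced to a label in the returned trace.

```python
# Hypothetical opcodes that the host treats as coprocessor instructions.
COPROCESSOR_OPCODES = {"vec_mul", "vec_add"}


def dispatch(program):
    """Return (executor, opcode) pairs for an instruction stream.

    Instructions recognized as coprocessor instructions are issued to the
    coprocessor; all others are executed by the host processor itself.
    """
    trace = []
    for opcode in program:
        if opcode in COPROCESSOR_OPCODES:
            trace.append(("coprocessor 1945", opcode))
        else:
            trace.append(("processor 1910", opcode))
    return trace
```

For the stream `["load", "vec_add", "store"]`, the trace shows the middle instruction issued to the coprocessor while the load and the store stay on the host processor.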

Referring now to FIG. 20, shown is a block diagram of a first more specific exemplary system 2000 in accordance with an embodiment of the present invention. As shown in FIG. 20, the multiprocessor system 2000 is a point-to-point interconnect system and includes a first processor 2070 and a second processor 2080 coupled via a point-to-point interconnect 2050. Each of the processors 2070 and 2080 may be some version of the processor 1800. In one embodiment of the invention, the processors 2070 and 2080 are respectively the processors 1910 and 1915, while the coprocessor 2038 is the coprocessor 1945. In another embodiment, the processors 2070 and 2080 are respectively the processor 1910 and the coprocessor 1945.

The processors 2070 and 2080 are shown including integrated memory controller (IMC) units 2072 and 2082, respectively. The processor 2070 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 2076 and 2078; similarly, the second processor 2080 includes P-P interfaces 2086 and 2088. The processors 2070, 2080 may exchange information via a point-to-point (P-P) interface 2050 using P-P interface circuits 2078, 2088. As shown in FIG. 20, the IMCs 2072 and 2082 couple the processors to respective memories, namely a memory 2032 and a memory 2034, which may be portions of main memory locally attached to the respective processors.

The processors 2070, 2080 may each exchange information with a chipset 2090 via individual P-P interfaces 2052, 2054 using point-to-point interface circuits 2076, 2094, 2086, 2098. The chipset 2090 may optionally exchange information with the coprocessor 2038 via a high-performance interface 2039. In one embodiment, the coprocessor 2038 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low power mode.

The chipset 2090 may be coupled to a first bus 2016 via an interface 2096. In one embodiment, the first bus 2016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 20, various I/O devices 2014 may be coupled to the first bus 2016, along with a bus bridge 2018 that couples the first bus 2016 to a second bus 2020. In one embodiment, one or more additional processors 2015, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, for example, graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 2016. In one embodiment, the second bus 2020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 2020 including, in one embodiment, a keyboard/mouse 2022, communication devices 2027, and a storage unit 2028, such as a disk drive or other mass storage device, which may include instructions/code and data 2030. Further, an audio I/O 2024 may be coupled to the second bus 2020. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 20, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 21, shown is a block diagram of a second more specific exemplary system 2100 in accordance with an embodiment of the present invention. Like elements in FIGS. 20 and 21 bear like reference numerals, and certain aspects of FIG. 20 have been omitted from FIG. 21 in order to avoid obscuring other aspects of FIG. 21.

FIG. 21 illustrates that the processors 2070, 2080 may include integrated memory and I/O control logic ("CL") 2072 and 2082, respectively. Thus, the CL 2072, 2082 include integrated memory controller units and include I/O control logic. FIG. 21 illustrates that not only are the memories 2032, 2034 coupled to the CL 2072, 2082, but also that I/O devices 2114 are coupled to the control logic 2072, 2082. Legacy I/O devices 2115 are coupled to the chipset 2090.

Referring now to FIG. 22, shown is a block diagram of a SoC 2200 in accordance with an embodiment of the present invention. Similar elements in FIG. 18 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 22, an interconnect unit 2202 is coupled to: an application processor 2210, which includes a set of one or more cores 202A-N and shared cache units 1806; a system agent unit 1810; bus controller units 1816; integrated memory controller units 1814; a set of one or more coprocessors 2220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 2230; a direct memory access (DMA) unit 2232; and a display unit 2240 for coupling to one or more external displays. In one embodiment, the coprocessors 2220 include a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as the code 2030 illustrated in FIG. 20, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor, such as a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (Including Binary Translation, Code Morphing, Etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.

FIG. 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 23 shows that a program in a high-level language 2302 may be compiled using an x86 compiler 2304 to generate x86 binary code 2306 that may be natively executed by a processor 2316 with at least one x86 instruction set core. The processor 2316 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core, or (2) object code versions of applications or other programs targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 2304 represents a compiler that generates x86 binary code 2306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 2316 with at least one x86 instruction set core. Similarly, FIG. 23 shows that the program in the high-level language 2302 may be compiled using an alternative instruction set compiler 2308 to generate alternative instruction set binary code 2310 that may be natively executed by a processor 2314 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California, and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 2312 is used to convert the x86 binary code 2306 into code that may be natively executed by the processor 2314 without an x86 instruction set core. This converted code is unlikely to be the same as the alternative instruction set binary code 2310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 2306.
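The conversion of one source instruction into one or more target instructions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy instruction sets, the mnemonics `SRC_MOV`, `SRC_ADD3`, `TGT_MOV`, and `TGT_ADD`, the reserved scratch register `tmp`, and the function names do not come from the description and do not correspond to x86, MIPS, or ARM encodings.

```python
from typing import List, Tuple

# An instruction is modeled as (mnemonic, operand, ...). Hypothetical toy ISAs.
Instr = Tuple[str, ...]


def convert(instr: Instr) -> List[Instr]:
    """Convert one source-ISA instruction into one or more target-ISA instructions."""
    op = instr[0]
    if op == "SRC_MOV":
        # One-to-one mapping: the target ISA has an equivalent instruction.
        _, dst, src = instr
        return [("TGT_MOV", dst, src)]
    if op == "SRC_ADD3":
        # One-to-many mapping: the assumed target ISA only adds two operands,
        # so the three-operand add is split, using a reserved scratch register.
        _, dst, a, b, c = instr
        return [("TGT_ADD", "tmp", a, b), ("TGT_ADD", dst, "tmp", c)]
    raise ValueError(f"no conversion rule for {op!r}")


def convert_program(program: List[Instr]) -> List[Instr]:
    """Statically convert a whole source program, instruction by instruction."""
    converted: List[Instr] = []
    for instr in program:
        converted.extend(convert(instr))
    return converted
```

A static converter of this kind runs ahead of execution; a dynamic converter would apply the same kind of rule table to instructions as they are encountered at run time.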

Alternative Embodiments

While the invention has been described in terms of several embodiments, those skilled in the relevant art will recognize that the invention is not limited to the embodiments described and can be practiced with modification within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting. For example, while the flowcharts in the figures show a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).

Claims

1. A computer-implemented method comprising:

fetching an occurrence of an instruction, wherein the format of the instruction specifies a source operand from a single vector writemask register as its only source operand and specifies a single general purpose register as its destination, wherein the format of the instruction includes a first field whose contents select the single vector writemask register from a plurality of architectural vector writemask registers, and wherein the format of the instruction includes a second field whose contents select the single general purpose register from a plurality of architectural general purpose registers, and wherein the source operand is a writemask comprising a plurality of one-bit vector writemask elements that correspond to different multi-bit data element positions within architectural vector registers; and

responsive to executing the single occurrence of the instruction, storing data in the single general purpose register such that the contents of the single general purpose register represent a first or a second scalar constant based on whether the plurality of one-bit vector writemask elements in the source operand are all zero.

2. The method of claim 1, wherein the first and second scalar constants are 1 and 0, respectively.

3. The method of claim 2, wherein the storing comprises storing data in the single general purpose register such that, when the plurality of one-bit vector writemask elements are all zero, the contents of the single general purpose register represent 1.

4. The method of claim 2, wherein the storing comprises storing data in the single general purpose register such that, when the plurality of one-bit vector writemask elements are all zero, the contents of the single general purpose register represent 0.

5. The method of any one of claims 1-4, wherein an opcode of the instruction specifies the size of the source operand.

6. The method of claim 5, wherein the size of the source operand is smaller than the size of the single vector writemask register.

7. The method of claim 6, wherein the source operand consists of contiguous bits of the single vector writemask register, starting with the least significant bit.

8. The method of claim 1, wherein the instruction is part of an instruction set architecture (ISA), wherein other instructions of the ISA specify vector operations, select destinations, and select writemasks from the plurality of architectural vector writemask registers, and wherein, for each of the other instructions, the plurality of one-bit vector writemask elements of the selected writemask control which data element positions in the selected destination reflect the result of that instruction's vector operation.

9. The method of claim 1, wherein the architectural general purpose registers are configured to store operands for logical operations, arithmetic operations, address calculations, and memory pointers, and wherein the architectural vector registers are configured to store vectors.

10. The method of claim 1, wherein there are at least 16 architectural general purpose registers of size at least 64 bits, wherein there are at least 8 architectural vector writemask registers of size at least 32 bits for storing writemasks, and wherein there are at least 16 architectural vector registers of size at least 256 bits for storing vectors.

11. The method of claim 1, wherein there are at least 16 architectural general purpose registers of size at least 64 bits, wherein there are at least 8 architectural vector writemask registers of size at least 64 bits for storing writemasks, and wherein there are at least 32 architectural vector registers of size at least 512 bits for storing vectors.

12. The method of claim 1, wherein said executing comprises:

performing a logical OR operation on the plurality of one-bit vector writemask elements; and

generating the first or second scalar constant based on a result of the logical OR operation.

13. The method of claim 12, wherein said generating comprises:

negating the result of the logical OR operation; and

converting the negated value to a 64-bit unsigned integer value to form the first or second scalar constant.

14. The method of claim 12, wherein said generating comprises:

multiplexing the first or second scalar constant based on a control signal formed from the result of the logical OR operation and an indication of which of a plurality of types the instruction is.

15. A processor core, comprising:

a hardware decode unit configured to decode occurrences of a set of one or more instructions, wherein each of the occurrences specifies a source operand from a selected one of a plurality of architectural vector writemask registers as its only source operand and specifies a selected one of a plurality of architectural general purpose registers as its destination, wherein the set of instructions has a format with a first field whose contents select the one of the plurality of architectural vector writemask registers, and wherein the format has a second field whose contents select the one of the plurality of architectural general purpose registers, and wherein each of the source operands is a writemask that includes a plurality of one-bit vector writemask elements corresponding to different multi-bit data element positions within architectural vector registers; and

an execution engine unit coupled to the hardware decoding unit and configured to, in response to each of the occurrences:

determine whether the plurality of one-bit vector writemask elements of the source operand of that occurrence are all zero; and

cause data to be stored in the selected single general purpose register of that occurrence such that the contents of the single general purpose register represent a first or a second scalar constant based on the determination.

16. The processor core of claim 15, wherein the first and second scalar constants are 1 and 0, respectively.

17. The processor core of claim 16, wherein a first instruction in the set of instructions causes the contents of the selected single general purpose register to represent 1 when the plurality of one-bit vector writemask elements are all zero.

18. The processor core of claim 17, wherein a second instruction in the set of instructions causes the contents of the selected single general purpose register to represent 0 when the plurality of one-bit vector writemask elements are all zero.

19. The processor core of any one of claims 15-18, wherein different instructions in the set of instructions specify different sizes of source operand, and wherein at least one of the sizes is smaller than the size of the vector writemask registers.

20. The processor core of claim 19, wherein each source operand consists of contiguous bits of the selected vector writemask register, starting with the least significant bit.

21. The processor core of claim 15, wherein the hardware decode unit is further configured to decode occurrences of other instructions that specify vector operations, select destinations, and select writemasks from the plurality of architectural vector writemask registers, and wherein, for each of the other instructions, the plurality of one-bit vector writemask elements of the selected writemask control which data element positions in the selected destination reflect the result of that instruction's vector operation.

22. The processor core of claim 15, wherein the plurality of architectural general purpose registers are configured to store operands for logical operations, arithmetic operations, address calculations, and memory pointers, and wherein the architectural vector registers are configured to store vectors.

23. The processor core of claim 15, wherein there are at least 16 architectural general purpose registers of size at least 64 bits, wherein there are at least 8 architectural vector writemask registers of size at least 32 bits for storing writemasks, and wherein there are at least 16 architectural vector registers of size at least 256 bits for storing vectors.

24. The processor core of claim 15, wherein there are at least 16 architectural general purpose registers of size at least 64 bits, wherein there are at least 8 architectural vector writemask registers of size at least 64 bits for storing writemasks, and wherein there are at least 32 architectural vector registers of size at least 512 bits for storing vectors.

25. The processor core according to claim 15, wherein the execution engine unit comprises:

a logical OR logic unit configured to logically OR the plurality of one-bit vector writemask elements; and

a multiplexer coupled to the logical OR logic unit and configured to select the first or second scalar constant based on a control signal formed from the result of the logical OR operation and an indication of which instruction in the set of instructions is being executed.
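As an illustrative aid only, the behavior recited in claims 7, 12 to 14, and 25 can be modeled in software. The sketch below is a minimal model under stated assumptions: the function names, the 64-bit register widths, and the `one_when_zero` flag standing in for the "indication of which instruction" are all hypothetical and are not taken from the claims or from any real instruction set.

```python
MASK_REG_BITS = 64  # assumed width of a vector writemask register
GPR_BITS = 64       # assumed width of a general purpose register


def or_reduce(mask_reg: int, operand_bits: int) -> int:
    """Logical OR of the low `operand_bits` one-bit writemask elements.

    The source operand is taken as contiguous bits starting at the least
    significant bit, so higher bits of the mask register are ignored.
    """
    if not 0 < operand_bits <= MASK_REG_BITS:
        raise ValueError("operand size must fit in the mask register")
    low_bits = mask_reg & ((1 << operand_bits) - 1)
    return 1 if low_bits != 0 else 0


def set_if_all_zero(mask_reg: int, operand_bits: int) -> int:
    """Negate the OR result and widen it to a 64-bit unsigned value."""
    negated = or_reduce(mask_reg, operand_bits) ^ 1
    return negated & ((1 << GPR_BITS) - 1)


def mux_scalar(mask_reg: int, operand_bits: int, one_when_zero: bool) -> int:
    """Select constant 1 or 0 from the OR result and the instruction variant.

    `one_when_zero` is a hypothetical flag: True models the variant that
    writes 1 for an all-zero mask, False models the variant that writes 0.
    """
    all_zero = or_reduce(mask_reg, operand_bits) == 0
    select_one = all_zero == one_when_zero  # control signal for the mux
    return 1 if select_one else 0
```

For example, `set_if_all_zero(0x10000, 16)` treats only bits 0 to 15 as the source operand, so a set bit 16 does not affect the result.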