CN105612494A - Data processing apparatus and method for controlling performance of speculative vector operations - Google Patents
- Publication number: CN105612494A
- Application number: CN201480054729.5A
- Authority: CN (China)
- Prior art keywords: speculative, vector, speculation, width, indication
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F9/30038: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations, using a mask
- G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
- G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30043: LOAD or STORE instructions; Clear instruction
- G06F9/30065: Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
- G06F9/3842: Speculative instruction execution
- G06F9/3887: Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F9/38873: Iterative single instructions for multiple data lanes [SIMD]
- G06F9/45533: Hypervisors; Virtual machine monitors
Description
Technical field
The present invention relates to a data processing apparatus and method for controlling performance of speculative vector operations.
Background
One known technique for improving the performance of a data processing apparatus is to provide circuitry to support the execution of vector operations. A vector operation is performed on at least one vector operand, where each vector operand comprises a plurality of vector elements. Performing a vector operation therefore involves applying an operation repeatedly across the vector elements within the vector operand or operands.
In a typical data processing system that supports the execution of vector operations, a vector register bank is provided for storing the vector operands. For example, each vector register within the bank may store a vector operand comprising a plurality of vector elements.
In high-performance implementations, it is also known to provide vector processing circuitry (often referred to as Single Instruction Multiple Data (SIMD) processing circuitry) that can perform the required operation in parallel on the various vector elements within the vector operands. In an alternative embodiment, scalar processing circuitry can still be used to implement the vector operation, but in that case the vector operation is implemented by iteratively executing an operation through the scalar processing circuitry, with each iteration operating on a different vector element of the vector operands.
Through the use of vector operations, significant performance benefits can be realized compared with performing an equivalent series of scalar operations.
When seeking the performance benefits of vector processing, it is known to vectorize a series of scalar operations, replacing them with an equivalent series of vector operations. For example, a loop containing a series of scalar instructions can be vectorized by replacing that series with an equivalent series of vector instructions, whose vector operands contain, as vector elements, the elements relating to different iterations of the original scalar loop.
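To make the idea of loop vectorization concrete, the following is a minimal sketch in Python. The vector width of 4 and the function names are assumptions chosen for illustration; they do not come from the patent text. Each pass of the second loop stands in for one vector instruction whose operands hold the elements of several iterations of the original scalar loop.

```python
VECTOR_WIDTH = 4  # hypothetical number of elements per vector operand


def scalar_loop(a, b):
    """Original scalar loop: one element is processed per iteration."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out


def vectorized_loop(a, b):
    """Equivalent loop in which each pass handles up to VECTOR_WIDTH elements."""
    out = []
    for base in range(0, len(a), VECTOR_WIDTH):
        # Each slice models a vector operand holding the elements that
        # belong to several iterations of the original scalar loop.
        va = a[base:base + VECTOR_WIDTH]
        vb = b[base:base + VECTOR_WIDTH]
        # One modeled "vector add" applied across all elements of the operands.
        out.extend(x + y for x, y in zip(va, vb))
    return out
```

Here the trip count `len(a)` is known before the loop starts, which is the easy case; the following paragraphs deal with loops where it is not.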
However, while this approach works well when the number of iterations required through the original scalar loop is predetermined, it is more difficult to vectorize such a loop when the number of iterations is not predetermined. In particular, because the number of iterations is not known in advance, it is not known in advance how many vector elements will be required within each vector operand.
In some situations of this type, it is possible to perform speculative vector processing, in which the required number of vector elements is speculated and corrective action is taken later, once the exact number of vector elements required has been determined.
The PhD thesis of K. Asanovic entitled "Vector Microprocessors" (University of California, Berkeley, 1998, pages 116-121) discusses performing speculation across the full width of the vector operands, while additionally tracking architectural events (such as page faults) that occur during speculation. Such an architectural event would normally trigger an exception, causing the operating system to run an exception handling routine in order to resolve it. The proposed approach records each vector element position within the vector width at which such an architectural event was detected. Then, on reaching a commit point at which the set of required vector element positions is known, each required vector element position is compared with this record of architectural events. Any such deferred exception is triggered at the commit point, since an architectural event associated with a required vector element position would prevent the vector processing circuitry from performing the vector operation correctly. If none of the required vector element positions is associated with an architectural event, the vector length and mask are updated and the record of architectural events is cleared.
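The commit-point check described above can be sketched as follows. This is only an illustrative model: the function name, the tuple it returns, and the representation of the event record as a set of element positions are assumptions, not details taken from the thesis or from the patent.

```python
def check_at_commit(event_positions, required_positions):
    """Compare the recorded architectural events with the required positions.

    event_positions: element positions at which an architectural event
        (for example a page fault) was recorded during speculation.
    required_positions: element positions known to be required once the
        commit point is reached.
    """
    pending = sorted(set(event_positions) & set(required_positions))
    if pending:
        # An event on a required element must now be raised as a deferred exception.
        return ("raise_deferred_exception", pending)
    # No required element is affected: the results can be committed and
    # the event record cleared.
    return ("commit", [])
```

For instance, events recorded only at positions 5 and 6 are harmless if just positions 0 to 3 turn out to be required, whereas an event at position 2 has to be raised.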
The above process allows speculative vector processing to be performed while ensuring correct operation by masking architectural events at the commit point.
However, although the above approach ensures correct operation while performing speculative vector processing, other factors can affect the benefit of speculating. As noted above, the required number of iterations is unknown when the speculation is performed, so there is a possibility of performing operations that adversely affect a performance characteristic of the apparatus (such as throughput or energy consumption), only to find out later that those operations were not needed. It would therefore be desirable to provide a mechanism for performing speculative vector operations while managing the impact of that speculative vector processing on the performance characteristics of the apparatus.
Summary of the invention
Viewed from a first aspect, the present invention provides a data processing apparatus comprising: processing circuitry configured to perform a sequence of speculative vector operations on vector operands, each vector operand comprising a plurality of vector elements; and speculation control circuitry configured to maintain a speculation width indication indicating the number of vector elements of each vector operand to be subjected to the speculative vector operations, the speculation width indication being initialized to an initial value prior to performance of the sequence of speculative vector operations. The processing circuitry is configured to generate progress indications during performance of the sequence of speculative vector operations. The speculation control circuitry is further configured to detect, with reference to the progress indications and speculation reduction criteria, the presence of a speculation reduction condition, this being a condition indicating that a reduction in the speculation width indication is expected to improve at least one performance characteristic of the data processing apparatus relative to continued operation without such a reduction. The speculation control circuitry is further responsive to detection of the speculation reduction condition to reduce the speculation width indication.
Configuring the processing circuitry to generate progress indications during performance of the sequence of speculative vector operations allows the speculation control circuitry to evaluate those progress indications against the speculation reduction criteria (which may be fixed, or may be stored for access by the speculation control circuitry and be reconfigurable), in order to detect situations where the current speculation width appears to be having a significant adverse effect on a chosen performance characteristic of the data processing apparatus (for example throughput or energy consumption). On detecting such a situation, the speculation control circuitry reduces the speculation width indication, thereby reducing the number of vector elements of each vector operand that will be subjected to the speculative vector operations.
Typically, the progress indications are arranged to allow the speculation control circuitry to identify any vector element position at which performance of a speculative vector operation is adversely affecting the performance characteristic. For example, each progress indication may specifically identify the vector element position to which it relates.
Purely by way of example, suppose a progress indication identifies that a particular vector element to be loaded from memory at element position x during performance of a speculative vector load operation has caused a cache miss at some level of the system's cache hierarchy. The speculation reduction criteria may then identify the expected latency associated with such a cache miss event, or may directly identify that the occurrence of such an event indicates the speculation reduction condition. Although the speculative vector processing could continue to execute correctly at the currently specified speculation width, doing so could significantly impact performance because of the latency introduced by the cache miss; reducing the speculation width at this point to exclude element position x and all higher positions avoids that latency. If required, the sequence of vector operations can subsequently be repeated in order to perform the vector operations on the omitted vector elements, and by the time the sequence is repeated the same latency problem may no longer arise (for example, the required data element may by then be present in the cache, so that no cache miss occurs).
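The cache-miss example can be modeled with a small sketch. Everything named here is hypothetical: the event names, the latency figures, the threshold, and the rule that a miss at element position 0 is simply waited for (so that at least one element always makes progress) are assumptions made for illustration and are not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class ProgressIndication:
    event: str             # hypothetical event name, e.g. "l2_cache_miss"
    element_position: int  # vector element position the event relates to


# Hypothetical speculation reduction criteria: an expected latency (in cycles)
# per event kind, and the latency above which reducing the width is assumed
# to pay off.
EXPECTED_LATENCY = {"l1_cache_miss": 10, "l2_cache_miss": 200}
LATENCY_THRESHOLD = 50


def update_speculation_width(spec_width, indication):
    """Return the (possibly reduced) speculation width after one indication."""
    latency = EXPECTED_LATENCY.get(indication.event, 0)
    x = indication.element_position
    # Assumption: position 0 is never excluded, so some work is always done.
    if latency > LATENCY_THRESHOLD and 0 < x < spec_width:
        return x  # exclude element position x and all higher positions
    return spec_width
```

With these assumed numbers, a second-level miss at position 5 of an 8-element speculation shrinks the width to 5, while a cheap first-level miss leaves it unchanged.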
Hence, through use of the present invention, at least one performance characteristic of the data processing apparatus can be improved while performing speculative vector operations, by seeking to avoid situations where performing the speculative vector operations at a particular speculation width would adversely affect that chosen performance characteristic.
The progress indications can take a variety of forms. In one embodiment, the progress indications indicate throughput-affecting events occurring within the processing circuitry during performance of the speculative vector operations. Alternatively, or in addition, the progress indications may indicate energy-consumption-affecting events occurring within the processing circuitry during performance of the speculative vector operations. While in one embodiment the events indicated by the progress indications are events that have a negative effect on the performance characteristic, in one embodiment they may also identify events that have a positive effect, such as a particular operation executing faster, or consuming less energy, than expected.
The events that trigger generation of the progress indications can take a variety of forms. In one embodiment, at least some of the progress indications are issued in response to micro-architectural events occurring within the processing circuitry during performance of the speculative vector operations.
A feature, component or behavior of a processor may be considered "micro-architectural" if it affects only the quality of the implementation (for example, causing the processor to take more or less time or energy to execute a program) rather than its correctness (that is, whether the instruction set architecture is implemented correctly). For example, modern processors use branch predictors to speed up branches, caches to speed up memory reads, write buffers to speed up memory writes, translation lookaside buffers to speed up page table lookups, and pipelines to speed up the execution of instruction sequences. All of these can be regarded as micro-architectural features, because they speed up execution without affecting the final result produced by the program.
This is in contrast to features, components or behaviors of a processor that affect the correctness of the implementation, which are considered "architectural". For example, modern processors use page tables and page fault exceptions to implement virtual memory, interrupts to support context switching, and arithmetic exceptions to handle arithmetic error conditions such as division by zero. Architectural features, components or behaviors generate architectural events that typically affect program execution, for example by causing additional instructions to be executed. In contrast, micro-architectural features, components or behaviors generate micro-architectural events that affect the behavior of the micro-architecture but have no effect at the architectural level. They may, for example, cause a program to run slightly slower than it otherwise would, but do not otherwise affect program execution.
The speculation width indication can take a variety of forms. For example, it may be specified by a mask, or by the contents of one or more registers identifying particular element positions, such as a start element position and/or an end element position. In one embodiment, the speculation width indication not only indicates the number of vector elements of each vector operand to be subjected to the speculative vector operations, but also identifies the first vector element of each vector operand to be subjected to those operations.
Although the vector elements to be subjected to the speculative vector operations need not occupy a series of adjacent vector element positions, in one embodiment the speculation width indication does identify them as a specified number of sequential vector elements starting from that first vector element.
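The two representations mentioned above, a mask and a first-element position plus a count, are interchangeable when the elements are contiguous. The following sketch shows the conversion; the function name and the use of a list of booleans as the mask are illustrative assumptions.

```python
def speculation_mask(vector_length, first_element, count):
    """Expand a (first element, count) width indication into a per-element mask.

    True marks an element position that is subjected to the speculative
    vector operations.
    """
    last = first_element + count  # one past the final active position
    return [first_element <= i < last for i in range(vector_length)]
```

For an 8-element operand, a first element of 2 and a count of 3 activates positions 2, 3 and 4 only.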
In one embodiment, the processing circuitry is configured to execute a vector loop of instructions, the vector loop comprising instructions defining the sequence of vector operations and at least one evaluation instruction executed at a commit point within the vector loop, after performance of the sequence of vector operations. Execution of the at least one evaluation instruction causes a required vector width to be determined. The speculation control circuitry is responsive to determination of the required vector width to determine, with reference to the current value of the speculation width indication, whether performance of the sequence of vector operations resulted in over-speculation or under-speculation and, in the event of under-speculation, to set a repeat flag so that a further iteration of the vector loop is performed. The speculation width can thus be varied as required during performance of the speculative vector operations, with further iterations of the sequence being performed as needed to ensure that the vector operations are ultimately performed on all required vector elements.
In one embodiment, when a further iteration of the vector loop is performed, the speculation control circuitry is configured to initialize the speculation width indication to a revised initial value that takes into account the number of vector elements processed during the preceding iteration of the vector loop.
In one embodiment, the vector loop includes one or more non-speculative instructions to be executed after the commit point, and the speculation control circuitry is further configured to set a mask value identifying the number of vector elements of each vector operand to be subjected to the non-speculative operations defined by those instructions. In one embodiment, in the event of under-speculation the mask is set to identify the value of the speculation width present at the commit point, whereas in the event of over-speculation the mask is set to identify the required vector width determined at the commit point.
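The decisions taken at the commit point in the preceding paragraphs (the repeat flag and the mask for the non-speculative instructions) can be summarized in a short sketch. The function name and its return shape are assumptions made for illustration; widths are modeled as plain element counts.

```python
def commit_point(spec_width, required_width):
    """Model the commit-point decisions.

    Returns (repeat_flag, mask_width):
      repeat_flag is True on under-speculation, meaning a further iteration
        of the vector loop is needed for the remaining elements.
      mask_width is the number of elements the non-speculative instructions
        after the commit point operate on.
    """
    under_speculated = spec_width < required_width
    repeat_flag = under_speculated
    # Under-speculation: only the elements actually speculated can be committed.
    # Over-speculation (or an exact match): only the required elements are committed.
    mask_width = spec_width if under_speculated else required_width
    return repeat_flag, mask_width
```

In both cases the mask width equals the smaller of the two values; the repeat flag is what distinguishes under-speculation from over-speculation.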
In one embodiment, the data processing apparatus includes a vector register bank configured to store the vector operands for access by the processing circuitry, and the processing circuitry includes data access circuitry configured to perform vector access operations in order to move vector operands between the vector register bank and a memory system comprising at least one level of cache storage. In such an embodiment, the data access circuitry may be configured to issue, as progress indications, information about cache misses occurring during performance of the vector access operations. Such cache misses can incur significant latency, and so provide useful information when deciding whether to reduce the speculation width.
In one embodiment, the data processing apparatus further includes a translation lookaside buffer (TLB) that the data access circuitry references during performance of the vector access operations, and the data access circuitry is further configured to issue, as progress indications, information about TLB misses occurring during performance of those operations. A TLB miss can also incur significant latency, because a page table walk may then have to be performed to retrieve the required page table information from memory for storage in the TLB; an indication of a TLB miss therefore also provides useful information when deciding whether to reduce the speculation width.
Although the above two examples of progress indications relate to the activity of the data access circuitry, it will be appreciated that other components of the processing circuitry may also be arranged to provide progress indications to the speculation control circuitry. Indeed, in addition to progress indications issued by the processing circuitry that performs the speculative vector operations, progress indications may also be issued by other components within the system, for example scalar circuitry configured to perform scalar operations, since the speculative vector operations being performed may have performance implications for those other components. For example, if the speculative loop contains a large number of operations (whether vector or scalar), those operations will have to be repeated if the speculation width is reduced too far. Tracking the number of scalar and vector operations performed allows this potential cost of repetition to influence the decision to reduce the speculation width.
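One way to let the cost of repetition influence the decision is a simple comparison between the stall that a reduction would avoid and the work that would have to be redone. The sketch below is a hypothetical policy, not one given in the text: the function name, the one-cycle-per-operation default, and the strict comparison are all assumptions.

```python
def worth_reducing(expected_stall_cycles, ops_executed_so_far, cycles_per_op=1):
    """Hypothetical policy weighing a stall against the cost of repeating the loop.

    expected_stall_cycles: latency expected if speculation continues at the
        current width (for example from a cache or TLB miss).
    ops_executed_so_far: tracked count of scalar and vector operations in the
        loop body that would be re-executed in a further iteration.
    """
    repeat_cost = ops_executed_so_far * cycles_per_op
    return expected_stall_cycles > repeat_cost
```

Under these assumptions a 200-cycle stall justifies repeating a 12-operation loop body, whereas a 20-cycle stall does not justify repeating 40 operations.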
In one embodiment, the processing circuitry is responsive to a reduction in the speculation width indication to alter the number of vector elements of each vector operand subjected to selected vector operations. The selected vector operations are those occurring in the sequence from the vector operation whose progress indication led to detection of the speculation reduction condition that caused the speculation width indication to be reduced. In one particular embodiment, the selected vector operations comprise that vector operation and all vector operations following it in the sequence.
In addition, in one embodiment this approach can be extended to also alter the number of vector elements of each vector operand subjected to uncompleted vector operations that occur in the sequence before the vector operation whose progress indication led to detection of the speculation reduction condition. This allows a retrospective trimming of the number of vector elements subjected to previously initiated vector operations that are still in progress at the time the speculation width indication is reduced, and can thereby reduce the energy that would otherwise be consumed in completing those operations.
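The two variants just described, trimming only from the triggering operation onward or also trimming earlier operations that are still in flight, can be illustrated as follows. The list-of-widths model, the function name and the `trim_earlier` parameter are assumptions made for the sketch; for simplicity it treats every earlier operation as still uncompleted.

```python
def apply_reduction(op_widths, trigger_index, new_width, trim_earlier=False):
    """Apply a reduced speculation width to a sequence of vector operations.

    op_widths: element count currently assigned to each operation, in
        sequence order.
    trigger_index: index of the operation whose progress indication caused
        the reduction.
    trim_earlier: if True, earlier operations (assumed here to be still in
        progress) are trimmed retrospectively as well.
    """
    result = []
    for i, width in enumerate(op_widths):
        if i >= trigger_index or trim_earlier:
            result.append(min(width, new_width))
        else:
            result.append(width)
    return result
```

With four operations at width 8 and a reduction to 5 triggered by the third, the default gives widths 8, 8, 5, 5, and retrospective trimming gives 5 for all four.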
In one embodiment, the data processing apparatus further includes control circuitry configured to reduce power consumption within one or more components of the processing circuitry in response to a reduction in the speculation width indication. This can be achieved in a variety of ways, but in one embodiment the control circuitry employs at least one of clock gating and power gating to reduce power consumption within the one or more components. With clock gating, certain components within the processing circuitry have their clock signal removed, to prevent them from consuming power performing speculative vector operations on element positions that now lie outside the revised speculation width. In some cases, power gating may instead be used to remove the power supply from the relevant components. In one embodiment, the control circuitry may be configured to receive an indication of the current speculation width continuously, or to receive an indication only when the speculation width has changed.
There are a number of ways in which the initial value of the speculation width indication and the speculation reduction criteria can be specified. For example, in one embodiment, at least one of the initial value to which the speculation width indication is initialized and the speculation reduction criteria is specified by an instruction to be executed by the data processing apparatus. That instruction may be one of the instructions executed by the processing circuitry, or may instead be a specific instruction executed by the speculation control circuitry, for example an instruction used to start speculation. In an alternative embodiment, at least one of the initial value to which the speculation width indication is initialized and the speculation reduction criteria is predetermined.
In yet another embodiment, the data processing apparatus may further comprise prediction circuitry configured to maintain history data about the speculation width values that applied at the commit point for sequences of speculative vector operations previously executed by the processing circuitry. For a current sequence of speculative vector operations to be executed by the processing circuitry, the prediction circuitry may then be configured to reference the history data in order to determine the initial value to which the speculation width indication is set before the current sequence is executed. Alternatively, or in addition, the prediction circuitry may maintain history data about the speculation reduction criteria used for previous sequences of speculative vector operations, and then reference that history data to determine the speculation reduction criteria to be used for the current sequence.
The sequence of speculative vector operations can be identified in a variety of ways. For example, the instruction set may include both non-speculative and speculative versions of certain vector instructions, so that particular instructions can be explicitly identified as speculative instructions. Alternatively, the data processing apparatus may be configured to enter and exit a speculative mode of operation, with all instructions encountered within the speculative mode of operation being executed speculatively. In one embodiment supporting such a speculative mode of operation, the speculation control circuitry is responsive to execution of a start speculation instruction to trigger the speculative mode of operation, and the processing circuitry is configured to perform the speculative vector operations in response to instructions executed during the speculative mode of operation.
In one such embodiment, the speculation control circuitry is further responsive to execution of a commit instruction to terminate the speculative mode of operation.
In one embodiment, the speculation control circuitry is responsive to the speculation reduction condition to modify the speculation width indication so that it still indicates that at least one vector element in each vector operand is to be subjected to the speculative vector operations. By placing a lower limit on the speculation width indication such that at least one vector element in each vector operand is processed, forward progress is always ensured. When the speculation width indication identifies a single vector element, this is effectively the situation in which no speculation is being performed.
The speculation reduction criteria may directly specify one or more criteria which, if met, indicate the presence of the speculation reduction condition. Alternatively, or in addition, the speculation reduction criteria may comprise performance tolerance information maintained by the speculation control circuitry, with the speculation control circuitry being configured to adjust the performance tolerance information having regard to the progress indications produced during execution of the sequence of speculative vector operations. In one such embodiment, the performance tolerance information can be viewed as a slack indication that identifies how much slack is available in how the performance characteristic is affected during processing of the speculative vector operations, and that slack indication is then adjusted in dependence on the progress indications received. In one embodiment, the progress indications may indicate only events that have a negative impact on the chosen performance characteristic, and accordingly the slack indication will only be adjusted in one direction. In an alternative embodiment, however, the progress indications may indicate events having both negative and positive impacts, and in such an embodiment the slack indication may be adjusted in both directions. However the performance tolerance information (slack indication) is adjusted, the speculation control circuitry can be configured to detect the speculation reduction condition if the performance tolerance information reaches a trigger point.
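The slack mechanism described above can be illustrated with a minimal Python sketch. This is only a behavioural model under stated assumptions, not the circuitry of any embodiment: the class name `SlackTracker`, the integer cost units, and the default trigger point of zero are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlackTracker:
    """Hypothetical model of performance tolerance (slack) information."""

    slack: int              # remaining tolerance, in arbitrary cost units (assumed)
    trigger_point: int = 0  # slack level at which the reduction condition is detected

    def report(self, impact: int) -> bool:
        """Apply one progress indication.

        A negative impact consumes slack and a positive impact restores it
        (the bidirectional variant). Returns True when the slack has reached
        the trigger point, i.e. a speculation reduction condition is detected.
        """
        self.slack += impact
        return self.slack <= self.trigger_point
```

For example, starting with `SlackTracker(slack=10)`, reporting impacts of -4, +2 and -8 in turn leaves the slack at 6, 8 and 0, and only the last report signals the reduction condition. Restricting `impact` to negative values gives the one-directional variant.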
In one embodiment, the speculation control circuitry references the performance tolerance information when determining the amount by which the speculation width indication is to be reduced. This enables the amount of the speculation width reduction to be managed dynamically, having regard to the extent to which the performance tolerance has been exceeded.
Alternatively, the amount by which the speculation width indication is to be reduced on detecting the speculation reduction condition may be predetermined. That amount may be predetermined by identifying a preset number of element positions by which the speculation width is to be reduced, or by identifying a particular rule to be used when determining how to adjust the speculation width.
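The options for choosing the reduction amount can be sketched as three small Python functions. These are illustrative assumptions only: the function names, the step of four elements, the halving rule, and the mapping of one element per unit of overshoot are hypothetical and are not taken from the specification. Each function keeps at least one element so that forward progress is preserved.

```python
def fixed_step(width: int, step: int = 4) -> int:
    """Predetermined amount: drop a preset number of element positions."""
    return max(1, width - step)


def halve(width: int) -> int:
    """Predetermined rule: halve the speculation width on each reduction."""
    return max(1, width // 2)


def by_overshoot(width: int, slack: int, trigger_point: int = 0) -> int:
    """Dynamic amount: reduce further the more the tolerance is exceeded.

    Assumes one element is dropped per unit of overshoot below the trigger
    point, with a minimum reduction of one element.
    """
    overshoot = max(0, trigger_point - slack)
    return max(1, width - max(1, overshoot))
```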
Viewed from a second aspect, the present invention provides a method of controlling performance of speculative vector operations, comprising: performing a sequence of speculative vector operations on vector operands, each vector operand comprising a plurality of vector elements; maintaining a speculation width indication indicating the number of vector elements in each vector operand to be subjected to the speculative vector operations, the speculation width indication being initialized to an initial value prior to performance of the sequence of speculative vector operations; generating progress indications during performance of the sequence of speculative vector operations; detecting, with reference to the progress indications and speculation reduction criteria, the presence of a speculation reduction condition, the speculation reduction condition being a condition indicating that a reduction in the speculation width indication is expected to improve at least one performance characteristic of the data processing apparatus relative to continued operation without a reduction in the speculation width indication; and, on detecting the speculation reduction condition, reducing the speculation width indication.
Viewed from a third aspect, the present invention provides a computer program product storing in non-transitory form a computer program for controlling a computer to provide a virtual machine execution environment for program instructions corresponding to a data processing apparatus in accordance with the first aspect of the present invention.
Viewed from a fourth aspect, the present invention provides a data processing apparatus comprising: processing means for performing a sequence of speculative vector operations on vector operands, each vector operand comprising a plurality of vector elements; speculation control means for maintaining a speculation width indication indicating the number of vector elements in each vector operand to be subjected to the speculative vector operations, the speculation width indication being initialized to an initial value prior to performance of the sequence of speculative vector operations; the processing means for generating progress indications during performance of the sequence of speculative vector operations; the speculation control means for detecting, with reference to the progress indications and speculation reduction criteria, the presence of a speculation reduction condition, the speculation reduction condition being a condition indicating that a reduction in the speculation width indication is expected to improve at least one performance characteristic of the data processing apparatus relative to continued operation without a reduction in the speculation width indication; and the speculation control means for responding to detection of the speculation reduction condition by reducing the speculation width indication.
Description of the Drawings
The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:
FIG. 1 is a block diagram of a data processing apparatus in accordance with one embodiment;
FIGS. 2A to 2C schematically illustrate various control registers provided within the speculation control circuitry of FIG. 1 in accordance with one embodiment;
FIG. 3 schematically illustrates a loop of scalar instructions that can be vectorized using the techniques of the embodiments described herein;
FIGS. 4 and 5 schematically illustrate sequences of vector instructions used to vectorize the scalar loop of FIG. 3 in accordance with one embodiment;
FIG. 6 is a flow diagram schematically illustrating how the speculation width is reduced during execution of a sequence of speculative vector operations in accordance with one embodiment;
FIG. 7 is a flow diagram illustrating the steps performed when speculation is started, in accordance with one embodiment;
FIG. 8 is a flow diagram illustrating the steps taken during the commit process, in accordance with one embodiment;
FIG. 9 is a flow diagram illustrating the steps performed in a conditional exit phase during execution of a sequence of vector operations, in accordance with one embodiment;
FIGS. 10A and 10B schematically illustrate how the handling of pending speculative vector operations can be altered following a reduction in speculation width, in accordance with certain embodiments;
FIGS. 11A and 11B schematically illustrate how the handling of pending speculative vector operations can be altered following a reduction in speculation width, in accordance with a further embodiment;
FIGS. 12A and 12B schematically illustrate the difference in performance that can be achieved when using the techniques of the embodiments described herein (FIG. 12B), compared with the performance of speculative vector operations without those techniques (FIG. 12A);
FIG. 13 is a flow diagram illustrating how the speculation width can be adjusted in an embodiment where the reduction criteria comprise performance tolerance information in the form of a slack indication;
FIGS. 14A and 14B schematically illustrate how the slack indication may vary when the method of FIG. 13 is employed, in accordance with two embodiments; and
FIG. 15 schematically illustrates a virtual machine implementation of the data processing apparatus in accordance with one embodiment.
Detailed Description
FIG. 1 illustrates a portion of a data processing apparatus 5 in accordance with an embodiment. The figure shows only the vector processing portion; scalar processing circuitry, a scalar load/store unit and a scalar register bank may also be present, so that both vector instructions and scalar instructions can be decoded and executed.
An instruction queue 10 is provided to hold instructions to be executed. Those instructions are routed to decode circuitry 20, which is arranged to decode the instructions and to send control signals to the appropriate circuits within the apparatus of FIG. 1. In particular, for general vector processing instructions, the decode circuitry 20 will issue control signals to the relevant vector processing unit 35 within the vector processing circuitry 30, which will then perform the required vector processing operations with reference to one or more vector source operands stored in the vector register bank 40. Typically, the results of those operations are also stored back to the vector register bank 40 as one or more vector destination operands. As shown schematically in FIG. 1, the vector processing units can take a variety of forms, for example one or more arithmetic logic units (ALUs), floating point units (FPUs), and so on.
For any vector data access instructions, decoding of those instructions will cause control signals to be issued to the vector load/store unit 50 within the vector processing circuitry 30, which is configured to move one or more data operands in either direction between the vector register bank 40 and cache/memory (the cache/memory being referred to herein as the memory system). As shown for the purposes of illustration, the memory system may include a hierarchical cache structure consisting of a level 1 cache 72, a level 2 cache 74, and possibly further levels of cache, located between the vector register bank and main memory 76.
If the vector data access instruction is a vector load instruction, the load/store unit 50 will load at least one vector operand from the memory system into the vector register bank 40. Similarly, if the vector data access instruction is a vector store instruction, the load/store unit 50 will store at least one vector operand from the vector register bank 40 out to the memory system.
In accordance with the embodiments described herein, the processing circuitry 30 can be arranged to perform a sequence of speculative vector operations, and speculation control circuitry 60 is provided to maintain a speculation width indication indicating the number of vector elements in each vector operand that are to be subjected to the speculative vector operations. In particular, the speculation control circuitry has a number of control registers 65 that control the information delivered to the vector processing circuitry 30 over path 82. In one embodiment, the state of those control registers identifies when speculative operations are being performed and also identifies the current speculation width.
In the embodiment illustrated in FIG. 1, the sequence of instructions stored in the instruction queue 10 will include a start speculation instruction which, when decoded by the decode circuitry 20, causes a control signal to be issued to the speculation control circuitry 60 to initiate a speculative mode of operation. In particular, in one embodiment, the speculation control circuitry 60 will respond to such a control signal by setting a speculation flag within the control registers 65 to identify that the speculative mode is currently active. This information is then routed to the vector processing circuitry 30 over path 82, and any subsequent operations performed by the vector processing circuitry 30 in response to control signals received from the decode circuitry 20 are then performed speculatively until the speculative mode of operation is exited. In one embodiment, the speculative mode is exited by way of a commit instruction which, when decoded by the decode circuitry 20, causes a control signal to be sent to the speculation control circuitry 60, causing it to clear the speculation flag.
Although a specific speculative mode of operation is provided in the described embodiment, such a mode is not essential. Instead, in an alternative embodiment, speculative and non-speculative versions of at least a subset of the instructions in the instruction set may be provided, so that individual instructions within the sequence can be executed speculatively.
For any speculative operations performed by the vector processing circuitry 30, a speculation width indication is delivered over path 82 from the control registers 65 within the speculation control circuitry 60, to identify the number of vector elements in each vector operand that are to be subjected to those speculative vector operations. When speculation is started, the speculation width indication is initialized to an initial value, which in one embodiment may for example be the full vector width of the operands. Hence, purely by way of example, if the vector operands contain 16 vector elements, the initial value of the speculation width indication may be set to 16.
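The start speculation, commit and width initialization behaviour described above can be modelled in a few lines of Python. This is a simplified behavioural sketch, not a description of the hardware: the class and method names are hypothetical, and returning the final width from `commit` is an assumption made for convenience.

```python
from typing import Optional


class SpeculationControl:
    """Hypothetical software model of the speculation flag and width registers."""

    def __init__(self, vector_width: int = 16) -> None:
        self.vector_width = vector_width
        self.speculating = False   # models the speculation flag
        self.width = vector_width  # models the speculation width indication

    def start_speculation(self, initial_width: Optional[int] = None) -> None:
        """Model of a start speculation instruction: set the flag, initialize the width."""
        width = self.vector_width if initial_width is None else initial_width
        if not 1 <= width <= self.vector_width:
            raise ValueError("initial width must lie between 1 and the vector width")
        self.speculating = True
        self.width = width

    def commit(self) -> int:
        """Model of a commit instruction: clear the flag and report the final width."""
        self.speculating = False
        return self.width
```

With the default 16-element vectors, calling `start_speculation()` with no argument sets the width to the full vector width, matching the example in the text.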
During execution of the speculative vector operations, the processing circuitry 30 is arranged to issue progress indications to the speculation control circuitry 60 over path 80. The progress indications provide information about the progress of the various operations being performed by the vector processing circuitry. Although the progress indications can be issued in a variety of ways, in one embodiment a number of events within the vector processing circuitry cause a progress indication to be output. In one embodiment, these progress indications indicate throughput-affecting events and/or energy-consumption-affecting events. In one embodiment, these events are of a type that has a negative impact on throughput or energy consumption. In alternative embodiments, however (such as the embodiment described later with reference to FIG. 13), these events may also include events that have a positive impact on throughput or energy consumption.
The events that give rise to such progress indications can take a variety of forms, but in one embodiment they are micro-architectural events occurring within the processing circuitry 30 during performance of the speculative vector operations. As discussed earlier, such micro-architectural events affect only the quality of an implementation (for example causing the processing circuitry to use more or less time or energy in executing the program containing the instructions being executed), and not its correctness. Hence the presence of a micro-architectural event does not in itself require any change in the operation of the circuitry. However, as will be discussed in more detail below, in accordance with the described embodiments the speculation control circuitry uses the progress indications to determine situations in which a reduction in the speculation width is likely to improve at least one performance characteristic of the data processing apparatus, relative to continuing operation without such a reduction.
In particular, the speculation control circuitry 60 has access to reduction criteria 70, which may be fixed, or may be configurable, for example by execution of an instruction within the instruction stream. There are a number of ways in which the reduction criteria can be specified. For example, the speculation reduction criteria may identify the expected latency and/or energy consumption to be associated with certain events reported via the progress indications, with the speculation control circuitry 60 keeping a running tally of the latency and/or energy consumption associated with the events received over time, and detecting the speculation reduction condition when a certain trigger point is reached. Alternatively, the reduction criteria 70 may directly identify certain events whose occurrence indicates the presence of the speculation reduction condition.
On detecting the speculation reduction condition, through analysis of the progress indications received over path 80 against the reduction criteria 70, the speculation control circuitry 60 is configured to reduce the speculation width. At that point the contents of the control registers 65 are updated to identify the reduced speculation width, and the new speculation width is forwarded to the vector processing circuitry 30 over path 82. This causes the vector processing circuitry 30 to reduce the number of vector elements subjected to the vector processing operations being performed. In particular, it will typically be the case that at least the vector operations performed after the operation that produced the progress indication leading to the speculation width reduction are performed on the reduced number of vector elements. In one embodiment, assuming the operation that produced that progress indication is still in progress, the reduced speculation width may also be applied to that operation. Furthermore, in one embodiment discussed later with reference to FIG. 10B, the speculation width reduction can also affect any vector operations still in flight at the time the reduction takes place, even if those operations were issued before the operation that produced the progress indication causing the speculation width to be reduced.
By using the progress indications to detect situations where continued operation with the current speculation width is likely to have an adverse effect on a performance characteristic such as throughput or energy consumption, the speculation control circuitry can then reduce the speculation width in an attempt to avoid that effect. If it is subsequently determined that the required vector element width is greater than the reduced speculation width, any vector elements excluded from the speculative operations as a result of the reduction can become the subject of a subsequent iteration of the speculative vector operations.
In order to reduce the energy consumption of the vector processing circuitry 30 when the speculation width is reduced, clock/power gating circuitry 92 may optionally be provided, arranged to receive an indication of the current speculation width over path 90. Based on the current speculation width information, the clock/power gating circuitry can alter the clock or power supplied to various components within the vector processing circuitry 30. For example, if it is determined that, due to the reduction in speculation width, certain components within the processing circuitry will not need to perform any operation for one or more clock cycles, the clock signal can be removed from those components to reduce the power they consume. If those components will not be needed for a longer period, and the apparatus supports dynamic voltage scaling techniques, it may be appropriate to place them in a low power mode of operation by removing or reducing the voltage supply to those components.
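One way to picture which parts of the datapath become candidates for gating is a per-lane enable mask derived from the current speculation width. The Python sketch below is purely illustrative and assumes a datapath with one lane per element position, with the first active element at position 0; neither assumption comes from the specification, and the function names are hypothetical.

```python
def lane_enable_mask(spec_width: int, vector_width: int = 16) -> int:
    """Return a bit mask with bit i set when lane i lies within the speculation width."""
    if not 1 <= spec_width <= vector_width:
        raise ValueError("speculation width must lie between 1 and the vector width")
    return (1 << spec_width) - 1


def gated_lanes(spec_width: int, vector_width: int = 16) -> list:
    """List the lanes outside the speculation width, which could be clock or power gated."""
    return [lane for lane in range(vector_width) if lane >= spec_width]
```

For instance, reducing the width from 16 to 14 leaves lanes 14 and 15 as candidates for gating.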
Any component within the vector processing circuitry can be arranged to generate progress indications over path 80. However, when considering events that may have an adverse effect on throughput, such events are often associated with cache misses that occur when load or store operations are performed using the vector load/store unit (LSU) 50. In particular, if when performing a vector load or store operation the LSU detects a cache miss in the level 1 cache 72 or the level 2 cache 74 in connection with a particular vector element position, the LSU can send a progress indication over path 80 identifying the cache level at which the miss was detected and the element position for which the cache miss occurred. The speculation control circuitry can then use that information, in association with the reduction criteria 70, to determine whether to reduce the speculation width. The reduction criteria may, for example, identify that any level 2 cache miss (which will typically give rise to significant latency) causes the speculation width to be reduced so as to exclude the vector element position that gave rise to the level 2 cache miss. Alternatively, the reduction criteria 70 may effectively provide different criteria for different element positions. For example, a higher element position within the vector represents a higher degree of speculation, and hence a cache miss associated with a higher element position may cause the speculation width to be reduced immediately, whereas a cache miss associated with a lower element position need not cause the speculation width to be reduced.
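A position-dependent reduction criterion of the kind just described might be sketched as follows in Python. The policy shown is a hypothetical example: element positions are assumed to be zero-indexed, a level 2 miss always trims the width to exclude the missing element, and a level 1 miss trims only at or above an assumed threshold position (`l1_threshold`, a name and value introduced here for illustration). At least one element is always retained.

```python
def width_after_miss(width: int, cache_level: int, element_pos: int,
                     l1_threshold: int = 8) -> int:
    """Return the speculation width after a cache miss progress indication.

    width        -- current speculation width (number of active elements)
    cache_level  -- 1 or 2, the cache level at which the miss was detected
    element_pos  -- zero-indexed element position that missed
    l1_threshold -- assumed position at or above which a level 1 miss also trims
    """
    if element_pos >= width:
        return width  # the element is already outside the speculation width
    trims = cache_level == 2 or (cache_level == 1 and element_pos >= l1_threshold)
    if trims:
        # Keep elements 0 .. element_pos - 1, but never fewer than one element.
        return max(1, element_pos)
    return width
```

Under these assumptions, a level 2 miss at element 10 of a 16-element operation reduces the width to 10, while a level 1 miss at element 3 leaves the width unchanged.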
Data processing systems often use virtual addresses when accessing memory, and TLB storage 55 is used to translate those virtual addresses into physical addresses for accessing the memory system. As will be readily understood, the TLB storage 55 has a number of entries identifying particular virtual addresses, the information used to translate each virtual address into a physical address, and certain permission attribute information, such as whether those addresses relate to cacheable or bufferable regions of memory. The LSU will hence issue a virtual address to the TLB storage, and if a hit is detected for that virtual address within the TLB storage, the required physical address and permission attribute information can be returned to the LSU immediately. In the event of a miss, however, the TLB storage will typically need to perform a page table walk process in order to retrieve from the memory system the page table containing the necessary information, enabling the virtual address to be translated into a physical address and the associated permission attributes to be provided. During that process, the TLB storage 55 will access the level 1 cache 72 and lower levels of the memory hierarchy as required in order to retrieve the required page table information. If a cache miss occurs during that process, this can again be reported to the speculation control circuitry as a progress indication over path 80, and used by the speculation control circuitry to determine whether to reduce the speculation width.
In one embodiment, the initial value of the speculation width indication and the speculation reduction criteria may be predetermined. However, either or both may alternatively be specified by at least one of the instructions executed by the data processing apparatus. For example, a start-speculation instruction may identify the initial speculation width to be set within the control registers 65. That instruction may also identify certain speculation reduction criteria to be used by the speculation control circuitry 60 when evaluating whether to reduce the speculation width during execution of the speculative vector operations. As a further option, speculation prediction circuitry 85 may be provided, which maintains history data about the speculation width values and/or speculation reduction criteria used for previous sequences of speculative vector operations. For speculation width values, the history data may identify the final speculation width value at the commit point during execution of a previous sequence of speculative vector operations. The prediction circuitry can hence be arranged to receive the program counter value associated with the instruction that starts speculation, and to reference the history data based on that program counter value, taking account of any previous sequences of speculative vector operations started from that program counter value, in order to determine the initial value to use for the speculation width indication and, if required, the speculation reduction criteria to use. The history data is stored in a history storage 87 accessible by the speculation prediction circuitry 85.
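As a rough illustration of the prediction option, the following sketch indexes a history table by program counter value and returns the final width recorded at the last commit for that value. The class `SpeculationPredictor`, its method names and the default of 16 elements are hypothetical; the text does not prescribe how the history storage is organized.

```python
from typing import Dict

DEFAULT_WIDTH = 16  # assumed full vector width, as in the 16-element example


class SpeculationPredictor:
    """Toy history storage keyed by the program counter of the start instruction."""

    def __init__(self, default_width: int = DEFAULT_WIDTH) -> None:
        self.default_width = default_width
        self.history: Dict[int, int] = {}  # pc -> final width at the last commit

    def initial_width(self, pc: int) -> int:
        # Fall back to the default when no sequence was started from this pc.
        return self.history.get(pc, self.default_width)

    def record_commit(self, pc: int, final_width: int) -> None:
        # Remember the width that survived until the commit point.
        self.history[pc] = final_width


predictor = SpeculationPredictor()
before = predictor.initial_width(0x40)
predictor.record_commit(0x40, 6)
after = predictor.initial_width(0x40)
```

Starting the next sequence at the previously surviving width is only one possible policy; a real predictor could equally blend several past values.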
Although in Figure 1 the progress indications are shown as being issued only by the vector processing circuitry 30, in alternative embodiments the speculation control circuitry 60 may also receive progress indications from elsewhere within the system, for example from the scalar processing circuitry.
The control registers 65 can take a variety of forms, and Figure 2A illustrates a number of parameters that may be stored within the control registers. First, a speculation width value 100 is maintained, and in one embodiment this can take a value between 1 and 16, indicating that the number of vector elements in each vector operand that will be subjected to the speculative vector operations can vary between 1 and 16 vector elements (in one embodiment, 16 vector elements represent the full width of a vector operand).
In one embodiment, the control registers 65 also include a speculation flag 105, which is set to indicate whether speculation is on or off. When speculation is off, vector operations are executed non-speculatively. When speculation is on, vector operations are executed speculatively.
The speculation width indication 100 can be specified in a variety of ways. However, in one embodiment, the control registers 65 include both a first element position register 110 and a speculation width register 115. The first element position register 110 identifies the first vector element position to be subjected to the vector processing operations, whilst the speculation width register 115 identifies the final vector element position, the difference between the contents of these two registers then indicating the speculation width within the vector operand 120.
During a first iteration of the sequence of vector operations, the first element position register may point to the first vector element within the vector 120, and the speculation width register 115 may, for example, point to the last vector element, thereby specifying the full vector width. During execution of the vector operations, the contents of the speculation width register may be reduced to identify a reduced speculation width. If, by the time the commit point is reached, the speculation width has been reduced below the required number of vector elements determined at the commit point, a subsequent iteration of the vector operations can be performed, at which point the first element position register 110 will be set to identify the first required vector element that has not yet been processed by the previous iteration of the speculative vector operations. The speculation width register 115 will then be set to identify the speculation width required for that subsequent iteration.
Although in the example of Figure 2B two separate registers are maintained, in an alternative embodiment a mask register 130 may be provided to identify the speculation width indication. In particular, the mask may contain a bit for each element position within the vector operand 120, those bits being set to zero or one to identify the speculation width. In one embodiment, the speculation width is specified by a series of logic 1 values contained within the mask, and the contents of the mask are updated during execution of the operations when the speculation width is reduced, by changing some of those logic 1 values to logic 0 values to identify the reduced speculation width. It will be appreciated that in an alternative embodiment the meaning of the logic 1 and logic 0 values within the mask can be reversed.
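The mask form of the speculation width indication can be illustrated with plain integer bit operations. This is a minimal sketch under stated assumptions: bit 0 is taken to correspond to the lowest element position, a logic 1 marks an element inside the speculation width, and the helper names are hypothetical.

```python
def width_to_mask(width: int, total: int = 16) -> int:
    """Return a mask whose lowest `width` bits are 1 (assumed bit ordering)."""
    if not 0 <= width <= total:
        raise ValueError("width must lie between 0 and the full vector width")
    return (1 << width) - 1


def reduce_mask(mask: int, new_width: int) -> int:
    """Clear every bit at or above `new_width`, leaving lower bits unchanged."""
    return mask & ((1 << new_width) - 1)


def mask_width(mask: int) -> int:
    """Count the logic 1 values, i.e. the number of elements still included."""
    return bin(mask).count("1")


full = width_to_mask(16)        # all 16 element positions included
reduced = reduce_mask(full, 9)  # only the first 9 positions remain set
```

Because reducing the width only ever clears bits, applying `reduce_mask` repeatedly can never re-enable an element that an earlier reduction excluded.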
Figure 3 schematically illustrates a scalar loop that can be vectorized using the embodiments described above. This loop of scalar instructions comprises a series of instructions, some of which occur before a condition test that is performed to determine whether to exit the loop, and some of which occur after that condition test. In the example sequence illustrated, the loop goes through three full iterations 200, 205, 210, and then the condition test is evaluated as indicating that the loop should end at point 220, part way through the fourth iteration 215. It is often the case that the required number of iterations is not known ahead of the condition test, so although in this example the loop terminates part way through the fourth iteration, in other examples the loop may not terminate until many more iterations have been performed, or may even terminate earlier.
When performing speculative vector operations in order to vectorize such a scalar loop, each scalar instruction is replaced by a vector instruction whose specified vector operands comprise a plurality of vector elements, each vector element relating to a different iteration. Because it is not known at which iteration the scalar loop will exit, the loop cannot be vectorized by specifying vector operands with a particular number of vector elements. Instead, as illustrated in Figure 4, for the vector instructions equivalent to the scalar instructions that occur before the condition test, a speculation width is used to speculate on the number of vector elements required. As discussed earlier, in one embodiment this speculation width will initially be set to a chosen value such as 16, so that in this example execution of each of these vector instructions will initially replicate the execution of the equivalent scalar instruction 16 times (that is, once for each of 16 separate iterations). If the current speculation width gives rise to progress indications that cause the speculation reduction condition to be detected, then, as discussed earlier, the speculation width will be reduced, whilst ensuring that at least one vector element in each vector operand continues to be processed.
When the condition test is subsequently evaluated, it can then be determined how many vector elements are required. For example, it may be evaluated that the equivalent scalar loop would have ended at the third iteration, so that the required speculation width is three. Assuming the speculation width is still greater than three, all of the required vector elements will have been processed. However, if the current speculation width is less than the number of iterations indicated by the condition test, at least one further iteration of the sequence of vector operations will subsequently need to be performed to process the remaining required vector elements.
After the commit point, the remaining vector instructions are executed non-speculatively. However, having regard to the width identified during the analysis of the condition test, a mask can be set to ensure that only the required number of vector elements is processed (or, if the current speculation width is less than the required width, a number equal to the current speculation width, in which case a further iteration will be needed). The process then conditionally exits. In particular, if the condition test indicates that all of the required data has been written, the process exits; otherwise the process is repeated at least once more.
Figure 5 illustrates the vector loop for the case where specific speculate and commit instructions are used. The speculate instruction is used to turn speculation on, and hence to set the speculation flag 105. A series of instructions is then executed speculatively, with the speculation width 100 identifying the number of vector elements in each vector operand. As discussed earlier, the speculation width can be reduced during execution of the vector operations, dependent on detection of the speculation reduction condition. Thereafter, one or more instructions are executed to determine the appropriate width to commit, and a separate commit instruction is then executed to turn speculation off. Following this, a series of non-speculative instructions is executed, and, as discussed earlier, a mask or length value can be used in association with those instructions to set the vector element width appropriately, having regard to the determination made before the commit point. A branch instruction can then be used to determine whether the loop should be repeated or exited.
Figure 6 is a flow diagram illustrating the speculation width reduction process performed in accordance with one embodiment. At step 300, the speculation control circuitry awaits receipt of a progress indication. If it is determined at step 300 that no progress indication has been received, and it is determined at step 305 that speculation has not terminated, for example because the speculation flag is determined to still be set within the control registers 65, the process waits at step 300 until a progress indication is received. If speculation has terminated, the process proceeds from step 305 to step 310, where the process ends.
On receipt of a progress indication, at step 315 the speculation control circuitry 60 analyzes the progress indication with reference to the speculation reduction criteria 70, and then determines at step 320 whether the speculation reduction condition has been detected. If not, the process returns to step 300. However, if the speculation reduction condition has been detected, at step 325 the speculation control circuitry determines the degree of speculation reduction required. This may be a predetermined amount, so that the speculation control circuitry reduces the speculation width by that predetermined amount, or it may instead be predetermined in terms of a rule to be applied, for example setting the new speculation width so that the most significant element position is the position to the left of the identified element position that gave rise to the progress indication, thereby excluding that identified element position from the revised speculation width.
In an alternative embodiment, the degree of speculation reduction can be determined having regard to the speculation reduction criteria themselves. An example of such an approach is described below with reference to Figure 13, where the reduction criteria maintain a slack value indicating the slack available in terms of how the performance characteristic is affected during processing of the speculative vector operations, and the amount by which the speculation width is reduced is managed dynamically having regard to the extent to which the slack is exceeded.
Once the degree of speculation reduction has been determined at step 325, the speculation width is reduced at step 330, and the process then returns to step 300. As discussed earlier, once the speculation width has been reduced, this information is passed over path 82 to the vector processing circuitry 30, and causes a reduction in the number of vector elements processed during execution of any outstanding speculative operations.
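The two ways of determining the degree of reduction at step 325 can be sketched as follows. Both helpers are hypothetical illustrations: element indices are assumed to be zero-based (so vector element 9 sits at element position 10), and the lower bound of one element reflects the requirement that at least one vector element continues to be processed.

```python
def reduce_by_amount(width: int, amount: int) -> int:
    """Reduce the width by a predetermined amount, never going below one element."""
    return max(1, width - amount)


def reduce_to_exclude(width: int, element_index: int) -> int:
    """Apply the rule that excludes the element which raised the indication.

    `element_index` is the zero-based index of that element, so the new width
    keeps only the elements below it. The width is never increased.
    """
    return max(1, min(width, element_index))


fixed_step = reduce_by_amount(16, 4)      # predetermined amount of 4 elements
rule_based = reduce_to_exclude(16, 9)     # event on vector element 9
already_small = reduce_to_exclude(4, 9)   # element already outside the width
lowest = reduce_to_exclude(16, 0)         # event on the very first element
```

The final case shows why the floor matters: even when the first element itself raises the indication, one element is still processed so that the loop makes forward progress.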
Figure 7 is a flow diagram illustrating the process performed on starting speculation. Although speculation can be started in a variety of ways, in one embodiment an explicit start-speculation instruction is provided which, when decoded by the decode circuitry 20, causes appropriate control signals to be issued to the speculation control circuitry 60. At step 340, the initial speculation width is determined and set by setting the appropriate control register 65. As discussed earlier, the initial speculation width may be predetermined, may be identified within the control information forwarded from the decode circuitry, or may even be provided by the speculation prediction circuitry 85. Following step 340, at step 345 the speculation mode is optionally turned on, depending on whether the apparatus is using an explicit speculation mode.
Figure 8 is a flow diagram illustrating a commit process that is performed after the series of speculative operations within the vector loop of Figure 5 has been executed, and before execution of any of the non-speculative operations. One or more instructions may be executed in order to perform the steps illustrated in Figure 8. At step 350, the condition test is performed to determine the required vector width to be processed, that is, the required number of vector elements. Then, at step 355, the current speculation width is read, and at step 360 the current speculation width is compared with the required vector width.
At step 365, it is then determined whether the speculative vector operations have under-speculated. This will be the case if, by the time the commit point is reached, the speculation width has been reduced to a point where it is less than the required vector width. Otherwise, if the current speculation width is greater than the required vector width, the process will have over-speculated.
In the event of over-speculation, the process proceeds to step 385, where a repeat flag is cleared to indicate that no further iterations of the vector loop are needed. Then, at step 390, a mask is set for the non-speculative instructions to identify the required vector width. This enables the vector loop to vectorize the equivalent scalar loop without the need for another iteration.
In the event of under-speculation, however, at step 370 the repeat flag is set in order to invoke a subsequent iteration of the vector loop. Then, at step 375, a mask is set for the non-speculative instructions to identify the current speculation width. As a result, once the current pass through the vector loop has completed, both the speculative vector operations and the non-speculative operations will have processed the same number of vector elements. The vector loop can then be repeated one or more times to deal with the remaining vector elements within the required vector width.
In embodiments that use a specific speculation mode, following step 375 or step 390 the speculation mode is turned off at step 380. This ensures that subsequent instructions within the vector loop are executed non-speculatively.
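The decision made by the commit process of Figure 8 can be condensed into a small function. This is a sketch rather than the claimed mechanism: the names `CommitResult` and `commit` are hypothetical, and treating equal widths as needing no repeat is an assumption, since the text only discusses the greater-than and less-than cases.

```python
from typing import NamedTuple


class CommitResult(NamedTuple):
    repeat: bool     # repeat flag: another pass through the vector loop is needed
    mask_width: int  # elements processed by the non-speculative instructions


def commit(required_width: int, speculation_width: int) -> CommitResult:
    """Compare the required width with the current speculation width (step 360)."""
    if speculation_width < required_width:
        # Under-speculation: set the repeat flag and mask to the current
        # speculation width (steps 370 and 375).
        return CommitResult(True, speculation_width)
    # Otherwise clear the repeat flag and mask to the required width
    # (steps 385 and 390). Equality is handled here by assumption.
    return CommitResult(False, required_width)


over = commit(required_width=3, speculation_width=16)
under = commit(required_width=10, speculation_width=6)
```

In the under-speculation case the mask follows the speculation width, so the speculative and non-speculative instructions of the same pass cover the same elements.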
Figure 9 illustrates a conditional exit process, which will typically be performed by a branch instruction such as that illustrated in Figure 5, following execution of the series of speculative and non-speculative instructions. At step 400, it is determined whether the repeat flag is set. If it is, the process returns at step 405 to the start-speculation point at the beginning of the vector loop; otherwise the process ends at step 410.
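Putting the commit and conditional exit steps together, repeated passes through the vector loop can be modelled as below. This is a deliberately simplified sketch: it assumes the total number of required elements is known to the model up front (in the described embodiment it is only discovered by the condition test), that each supplied width is at least 1, and the function name `run_vector_loop` is hypothetical.

```python
from typing import Iterable, List, Tuple


def run_vector_loop(required: int, widths: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_element, count) for each pass through the vector loop.

    `required` is the element count the condition test asks for, and `widths`
    yields the speculation width observed at each commit point.
    """
    passes = []
    first = 0  # plays the role of the first element position register
    for width in widths:
        count = min(width, required - first)
        passes.append((first, count))
        first += count
        if first >= required:
            break  # repeat flag clear: the branch exits the loop
    return passes


two_passes = run_vector_loop(10, [6, 16])
one_pass = run_vector_loop(3, [16])
```

With a first pass cut to 6 elements and 10 elements required, the second pass starts at element 6 and handles the remaining 4.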
Figure 10A schematically illustrates how the processing of the speculative vector operations changes when the speculation width is reduced. First, a speculate instruction is executed to turn on the speculation mode and to set the speculation width to a specified amount, in this example 16 vector elements. In the illustrated example, it is assumed that the components within the vector processing circuitry can process two vector elements from each source operand during a single cycle, so that the component performing the vector operation VOP1 420 will take 8 cycles to complete all of the required operations.
In the illustrated example, it is assumed that there is a data dependency between VOP1 and VOP2 (in this example VOP2 is a vector load), so that execution of VOP2 only begins one cycle after execution of VOP1 has begun.
As illustrated in this example, while the LSU 50 is processing the vector load, it detects a micro-architectural event associated with vector element 9 (that is, the vector element at element position 10) and issues a progress indication identifying that event and the element position, as indicated by reference numeral 430 in Figure 10A. Such a micro-architectural event may, for example, be a cache miss occurring when seeking to load vector element 9, and in this example it is assumed that, on evaluating the progress indication against the reduction criteria 70, the speculation control circuitry 60 decides to reduce the speculation width to a width of 9 elements, thereby excluding vector element 9 (which, as mentioned earlier, is at element position 10). The vector operation VOP3 illustrates an example of a vector operation that had begun processing before the progress indication was issued and the associated speculation width reduction took place, but which is still in progress at the time of the speculation width reduction. As illustrated by the crosses within VOP3 435, in response to the reduction in the speculation width, the component performing VOP3 can be arranged to drop any of the vector elements at element position 10 or higher. As discussed earlier with reference to Figure 1, clock gating and/or power gating techniques can be used to reduce the energy consumption of the component performing VOP3 based on the reduced speculation width.
The vector operation VOP4 illustrates a vector operation that is only started after the speculation width reduction has taken place, and in this example, as indicated by reference numeral 440, it will be known from the outset that the vector operation needs to be performed on the first 9 vector elements (subject to no further speculation width reduction).
In the example illustrated in Figure 10A, it is assumed that the vector operation VOP4 440 is another vector load operation, and that this vector load cannot begin executing until the preceding vector load VOP2 has completed.
Figure 10B illustrates an alternative embodiment in which the adjustment of the speculation width affects not only later vector operations but also earlier ones. In particular, as indicated by reference numeral 450, when the speculation width is reduced, this can cause the component performing VOP2 (that is, the vector operation that gave rise to the speculation reduction) also to terminate processing of one or more subsequent iterations, in this case the two iterations relating to the last four vector elements.
Furthermore, in one embodiment, the process can be used to retrospectively trim the operation of vector operations that were issued even before VOP2. In the example shown, as indicated by reference numeral 445, the final iteration of VOP1 is also terminated based on the reduced speculation width. Again, clock and power gating techniques can be used to reduce the energy consumption of the components performing these operations. Another advantage that can be realized is increased performance, since components become available sooner to process other operations. For example, compared with the example of Figure 10A, in the example of Figure 10B the LSU 50 terminates execution of VOP2 two cycles earlier, so that VOP4 can begin two cycles earlier.
Figures 11A and 11B illustrate further examples of how a speculation reduction can trigger modifications in the execution of subsequent operations. In this example, it is assumed that the components performing VOP1 and VOP2 can process 8 vector elements from each source vector operand in each iteration, so that the originally specified 16 vector elements can be processed in two cycles. If a speculation reduction trigger occurs at point 465 while VOP1 460 is executing, this trigger can be used to clock gate (or even power gate) certain of the components handling VOP2, as indicated by reference numeral 475, in particular those components that process element positions at or above the element position 465 that gave rise to the speculation reduction. However, it is not necessary to clock gate all of the element positions at or above the element position that gave rise to the speculation reduction, and, as shown in Figure 11B, in an alternative embodiment only a subset of the higher element positions may be clock gated, as indicated by reference numeral 480. As discussed below with reference to Figure 14, this may arise, for example, because the speculation control circuitry 60 determines with reference to the reduction criteria that the speculation width should in fact only be trimmed by 2, rather than trimmed fully to exclude vector element position 465 and higher positions.
Figure 12A illustrates how a micro-architectural event such as a level 2 cache miss can introduce significant latency when performing speculative vector operations. In this example, a speculate instruction is executed at point 500, which sets the initial speculation width to 16. A vector load instruction 510 is then executed, but in this case a level 2 cache miss occurs when seeking to load vector element 6 (that is, the data element at vector element position 7). In this example, this introduces 200 cycles of latency before the vector load operation is able to load vector element 6 and the remaining vector elements. Due to the data dependencies between the various vector operations, this has a knock-on effect on the following two vector operations, VOP1 515 and VOP2 520. A further cycle is then consumed at point 525 when performing the commit process, at which point the speculation width is still 16. As can be seen, this results in a total of 212 cycles being needed to perform the required vector operations, an average of 13.25 cycles per vector element.
By contrast, Figure 12B illustrates the situation when using the adaptive speculation width approach of the described embodiments. Although at point 500 the speculation width is again initially set to 16, when the level 2 cache miss is detected in respect of data element 6 (element position 7) of the vector load operation 530, this causes the speculation control circuitry 60 to dynamically reduce the speculation width to 6 after a short delay (the delay being due to the time taken to evaluate the progress indication against the reduction criteria and to determine the reduction to be made). Due to the reduced speculation width, VOP1 535 and VOP2 540 process only the first 6 element positions, so that once the first 6 elements of VOP2 have been processed, the commit process can be performed at point 525 based on the current speculation width of 6. As a result, the process takes 7 cycles to process 6 elements, an average of 1.167 cycles per vector element.
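The per-element figures quoted for Figures 12A and 12B follow directly from dividing the total cycle count by the number of elements processed. The short check below simply reproduces that arithmetic; the helper name is hypothetical and the cycle totals are the ones stated in the text.

```python
def cycles_per_element(total_cycles: int, elements: int) -> float:
    """Average number of cycles spent per processed vector element."""
    return total_cycles / elements


fixed = cycles_per_element(212, 16)  # Figure 12A: width stays at 16
adaptive = cycles_per_element(7, 6)  # Figure 12B: width reduced to 6

print(f"fixed width:    {fixed:.2f} cycles per element")
print(f"adaptive width: {adaptive:.3f} cycles per element")
```

Note that the adaptive case covers only 6 elements, so any further elements still have to be handled by a later pass through the loop.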
Assuming it is determined at the commit point that the required speculation width is 6 or less, no further iterations of the vector loop will be needed and the process will be complete. If instead the required vector width is determined to be greater than 6, one or more further iterations of the vector loop will need to be performed to process the remaining elements. However, this can still give rise to a significant performance improvement, since it is quite possible that the level 2 cache miss in respect of vector element 6 will no longer occur by the time the subsequent iteration is performed, so that the subsequent iteration proceeds much more efficiently.
Whilst the speculation reduction criteria may directly specify one or more criteria which, if met, indicate the presence of the speculation reduction condition, alternatively or additionally the speculation reduction criteria may comprise performance tolerance information that is maintained by the speculation control circuitry and adjusted having regard to the progress indications produced during execution of the sequence of speculative vector operations. Figure 13 is a flow diagram illustrating performance of the speculation reduction process in accordance with one such embodiment, where the performance tolerance information takes the form of a slack indication, which provides an indication of the slack available in terms of how the performance characteristic is affected during processing of the speculative vector operations.
At step 600, the speculation width is set to an initial value at the start of speculation, and at step 605 a parameter "slack" is set to some specified budget value. This value may be provided by the instruction that starts speculation, may be a default value, or may be provided in some other way, for example by the speculation prediction circuitry 85 in embodiments that include such circuitry.
At step 610, it is determined whether a progress indication has been received, and if not, it is determined at step 615 whether speculation has terminated. If it has, the process ends at step 620; otherwise the process returns to step 610.
When a progress indication is received, it is determined whether that progress indication indicates a negative or a positive effect on the performance characteristic. In particular, in one optional embodiment, the processing circuitry 30 may provide progress indications not only in respect of events that have a negative effect on the performance characteristic, but also in respect of certain events that have a positive effect, for example an indication that certain operations have executed more quickly than expected. In the event of a positive progress indication, at step 630 the slack value is increased by an amount indicated by the progress indication, and the process then returns to step 610.
Conversely, if a negative progress indication is detected at step 625, then at step 635 a parameter "cost" is evaluated, indicating the performance cost associated with that negative indication. The slack value is then adjusted by subtracting the cost value from the current slack value. Then, at step 655, it is determined whether the slack value is currently negative, and if not, the process returns to step 610.
If it is determined at step 655 that the slack value is negative, an internal parameter (SW) is set by subtracting from the current speculation width a value determined as a function of the current slack value. The function can be set in a variety of ways, but in one example it will result in a larger amount being subtracted from the speculation width the larger the negative slack value.
In step 665, it is detected whether the parameter SW is less than 1. If so, SW is set equal to 1 in step 670 before the process proceeds to step 675; otherwise, the process proceeds directly from step 665 to step 675. The purpose of steps 665 and 670 is to ensure that the internal parameter SW never falls below 1, so that the current iteration of the vector loop always makes some forward progress. In particular, in step 685 the speculation width is set equal to SW, so that the speculation width is set to a value of 1 or more before the process returns to step 610.
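The computation and clamping of SW can be summarised in a single expression. This is only a notational sketch: the symbols $W$ (current speculation width), $s$ (current slack value) and $f$ (the reduction function) are names chosen here for illustration and do not appear in the source.

```latex
\[
  SW \;=\; \max\bigl(1,\; W - f(s)\bigr), \qquad s < 0 ,
\]
\[
  s_1 < s_2 < 0 \;\Longrightarrow\; f(s_1) \,\ge\, f(s_2)
  \quad \text{(for the example function, where a more negative slack gives a larger reduction).}
\]
```

Because of the outer maximum, the new speculation width is at least one element whatever value $f$ returns.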
Steps 675 and 680 are optional. In step 675, an internal parameter "reduction" is determined by subtracting the internal parameter SW from the current speculation width (that is, the speculation width before its adjustment in step 685). Then, in step 680, the slack value is increased by the determined "reduction" value, after which the speculation width is reduced in step 685. When the optional steps 675 and 680 are used, it can therefore be seen that each time the speculation width is reduced, the slack value is adjusted positively by an amount dependent on the reduction in the speculation width.
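The flow of steps 625 to 685 may be easier to follow as a small software model. The sketch below is illustrative only and is not the patented hardware: the class name, method names and the particular reduction function (one element per unit of negative slack) are assumptions made here, since the text only requires that a more negative slack gives a larger reduction.

```python
from dataclasses import dataclass


@dataclass
class SpeculationController:
    """Toy model of the slack-based speculation width control."""

    width: int               # current speculation width, in vector elements
    slack: int               # remaining performance budget
    compensate: bool = True  # whether the optional steps 675/680 are performed

    def _reduction_for(self, slack: int) -> int:
        # Hypothetical slack function: shrink by one element per unit of
        # negative slack. Any function that grows with the deficit would do.
        return -slack

    def on_positive(self, amount: int) -> None:
        # Step 630: a positive indication adds its amount to the slack.
        self.slack += amount

    def on_negative(self, cost: int) -> None:
        # Step 635 and the adjustment that follows: charge the cost.
        self.slack -= cost
        if self.slack >= 0:
            return  # step 655: slack not negative, nothing more to do
        # Compute SW from the current width and the (negative) slack.
        sw = self.width - self._reduction_for(self.slack)
        sw = max(sw, 1)  # steps 665/670: never below one element
        if self.compensate:
            reduction = self.width - sw  # step 675
            self.slack += reduction      # step 680
        self.width = sw                  # step 685
```

With `width=8` and `slack=3`, a negative indication of cost 2 leaves the width unchanged, while a further indication of cost 4 drives the slack to -3 and, under the assumed function, reduces the width to 5; with compensation enabled the slack is then restored to 0.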
FIGS. 14A and 14B schematically illustrate how the slack value may change when the process of FIG. 13 is performed. In the example of FIG. 14A, it is assumed that the optional step 630 is performed. As shown by slope 687, the slack value initially increases following receipt of some positive indications via path 80. A negative indication causes a first drop in the slack value, and a second negative indication then causes the slack value to drop to a negative value at point 688. At this point the speculation width is reduced and, where the optional adjustment is used, the slack value is adjusted positively by the amount of the reduction. As indicated by slope 689, successive positive progress indications increase the slack value further, but after two negative progress indications the slack value again becomes negative at point 690. This once more causes the speculation width to be reduced, and optionally causes the slack value to be increased based on the reduction in the speculation width. Subsequent positive progress indications then cause the slack value to follow line 691.
In one embodiment, the amount by which the slack value is adjusted positively on receipt of a positive indication depends on the speculation width. In particular, the higher the speculation width, the greater the overall effect of a positive indication, so that slope 687 is steeper than slope 689, which in turn is steeper than slope 691.
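As a rough illustration of this width-dependent adjustment, the sketch below scales the gain from a positive indication linearly with the speculation width. The linear form, the function name and the example widths are assumptions made here; the text only states that a higher speculation width gives a larger effect.

```python
def positive_adjustment(base_gain: int, width: int) -> int:
    """Slack gained from one positive indication at a given speculation width.

    Linear scaling is an assumption; the description only says the effect
    grows with the speculation width.
    """
    return base_gain * width


# The same indication at three successively smaller widths, mirroring the
# decreasing steepness of slopes 687, 689 and 691 (widths are hypothetical).
gains = [positive_adjustment(1, w) for w in (8, 4, 2)]
```

Each reduction in width lowers the gain per positive indication, which is why the later slopes in FIG. 14A are shallower.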
In FIG. 14A, it is assumed that the optional steps 675 and 680 are not performed, so that the slack value is not adjusted in the positive direction to compensate for the reduction in the speculation width. In an alternative embodiment, however, steps 675 and 680 are performed, and hence at points 688 and 690 the slack value makes a positive jump before following slopes 689 and 691. The positive jump may or may not make the slack value positive again, since this depends on the slack function applied in step 660, and hence on the extent of the speculation width reduction that occurs.
FIG. 14B illustrates an example in which no positive progress indications are received, so that the slack value starts at an initial budget value and then falls in steps each time a negative indication is received. Once the slack value has become negative, the speculation width is reduced each time the slack value drops.
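The behaviour of FIG. 14B can be reproduced with a short trace. This is a sketch under stated assumptions: the function name, the budget, the costs and the reduction rule (shrink by the size of the deficit, without the optional compensation of steps 675 and 680) are all hypothetical values chosen for illustration.

```python
def trace_negative_only(budget, width, costs):
    """Return (slack, width) after each negative indication.

    No positive indications are modelled, and the slack is not compensated
    when the width shrinks.
    """
    slack = budget
    history = []
    for cost in costs:
        slack -= cost
        if slack < 0:
            # Hypothetical reduction function: shrink by the current deficit,
            # but never below one element.
            width = max(1, width - (-slack))
        history.append((slack, width))
    return history


steps = trace_negative_only(budget=5, width=8, costs=[2, 2, 2, 2])
```

In this run the first two indications only consume the budget; from the third onwards the slack is negative, so every further drop also reduces the speculation width.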
FIG. 15 illustrates a virtual machine implementation that may be used. Whilst the embodiments described above implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide so-called virtual machine implementations of hardware devices. These virtual machine implementations run on a host processor 730, which typically runs a host operating system 720 supporting a virtual machine program 710. Typically, a powerful processor is required to provide virtual machine implementations that execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a need to run code native to another processor for compatibility or reuse reasons. The virtual machine program 710 is capable of executing an application program (or operating system) 700 to give the same results as would be given by execution of the program by a real hardware device. Thus, program instructions, including the speculative vector instructions described above, may be executed from within the application program 700 using the virtual machine program 710.
By using the techniques described above, at least one performance characteristic of the data processing apparatus (such as throughput or energy consumption) can be improved while performing speculative vector operations, by seeking to avoid situations in which performing the speculative vector operations at a particular speculation width would have an unduly adverse effect on the chosen performance characteristic. With this approach, the speculation width can be adjusted dynamically so as to avoid performing work that is excessively time-consuming or energy-intensive and that may turn out to be unnecessary, thereby saving time and energy. To guarantee forward progress, in one embodiment the speculation width is prevented from being reduced below one element.
Although particular embodiments have been described herein, it will be appreciated that the invention is not limited to them, and that many modifications and additions may be made within the scope of the invention. For example, various combinations of the features of the following dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.
Claims (28)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1317876.9 | 2013-10-09 | | |
| GB1317876.9A GB2519108A (en) | 2013-10-09 | 2013-10-09 | A data processing apparatus and method for controlling performance of speculative vector operations |
| PCT/GB2014/052508 WO2015052485A1 (en) | 2013-10-09 | 2014-08-14 | A data processing apparatus and method for controlling performance of speculative vector operations |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN105612494A (en) | 2016-05-25 |
| CN105612494B (en) | 2019-07-12 |
Family
ID=49630436
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201480054729.5A (Active), published as CN105612494B (en) | 2013-10-09 | 2014-08-14 | Data processing apparatus and method for controlling performance of speculative vector operations |
Country Status (9)
| Country | Link |
|---|---|
| US (1) | US10261789B2 (en) |
| EP (1) | EP3039532B1 (en) |
| JP (1) | JP6546584B2 (en) |
| KR (1) | KR102271992B1 (en) |
| CN (1) | CN105612494B (en) |
| GB (1) | GB2519108A (en) |
| IL (1) | IL244408B (en) |
| TW (1) | TWI649693B (en) |
| WO (1) | WO2015052485A1 (en) |
2013
- 2013-10-09: GB application GB1317876.9A (GB2519108A), not active, Withdrawn
2014
- 2014-08-14: KR application KR1020167011068A (KR102271992B1), Active
- 2014-08-14: CN application CN201480054729.5A (CN105612494B), Active
- 2014-08-14: WO application PCT/GB2014/052508 (WO2015052485A1), not active, Ceased
- 2014-08-14: JP application JP2016519799A (JP6546584B2), Active
- 2014-08-14: EP application EP14753139.6A (EP3039532B1), Active
- 2014-08-18: US application US14/461,664 (US10261789B2), Active
- 2014-09-03: TW application TW103130434A (TWI649693B), active
2016
- 2016-03-03: IL application IL244408A (IL244408B), IP Right Grant
Also Published As
| Publication number | Publication date |
|---|---|
| EP3039532B1 (en) | 2020-11-11 |
| KR102271992B1 (en) | 2021-07-05 |
| TWI649693B (en) | 2019-02-01 |
| KR20160065145A (en) | 2016-06-08 |
| CN105612494B (en) | 2019-07-12 |
| TW201514850A (en) | 2015-04-16 |
| JP2016536665A (en) | 2016-11-24 |
| US10261789B2 (en) | 2019-04-16 |
| US20150100755A1 (en) | 2015-04-09 |
| GB201317876D0 (en) | 2013-11-20 |
| JP6546584B2 (en) | 2019-07-17 |
| GB2519108A (en) | 2015-04-15 |
| WO2015052485A1 (en) | 2015-04-16 |
| EP3039532A1 (en) | 2016-07-06 |
| IL244408B (en) | 2020-03-31 |
| IL244408A0 (en) | 2016-04-21 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |