CN106951217A

CN106951217A - Instruction prefetcher dynamically controlled by readily available prefetcher accuracy

Info

Publication number: CN106951217A
Application number: CN201610973966.2A
Authority: CN
Inventors: 保罗·E·基钦
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2016-01-07
Filing date: 2016-11-04
Publication date: 2017-07-14
Anticipated expiration: 2036-11-04
Also published as: CN106951217B; TWI723072B; KR102615010B1; US20170199739A1; US10296463B2; KR20170082965A; TW201725508A

Abstract

An instruction prefetcher dynamically controlled by readily available prefetcher accuracy. According to one general aspect, an apparatus may include a branch prediction unit, a fetch unit, and a prefetcher circuit or unit. The branch prediction unit may be configured to output predicted instructions. The fetch unit may be configured to fetch the next instruction from the cache memory. The prefetcher circuit may be configured to prefetch previously predicted instructions into the cache memory based on a relationship between the predicted instruction and the next instruction.

Description

Instruction prefetcher dynamically controlled by readily available prefetcher accuracy

This application claims priority to U.S. Provisional Patent Application No. 62/276,067, filed on January 7, 2016 and entitled "Instruction Prefetcher Dynamically Controlled by Readily Available Prefetcher Accuracy," and to U.S. Non-Provisional Patent Application No. 15/132,230, filed on April 18, 2016. The subject matter of these earlier filed applications is hereby incorporated by reference.

Technical Field

This description relates to prefetching data, and more specifically to the control of instruction prefetching.

Background

In computer system architecture, instruction prefetching is a technique used to speed up the execution of a program by reducing wait states. Prefetching generally occurs when a processor or a subunit of a processor (e.g., a prefetch unit) requests an instruction or data block from main memory before the instruction or data block is actually needed. Once the instruction/data block is returned from main memory or system memory, it is typically placed in a cache memory. When a request is later made to access the instruction/data block, it can be accessed from the cache memory much more quickly than if the request had to be made to main memory or system memory. Prefetching therefore hides memory access latency.

Since programs are generally executed sequentially, performance is likely to be best when instructions are prefetched in program order. Alternatively, prefetching may be part of a complex branch prediction algorithm, in which the processor tries to anticipate the result of a computation and prefetches the correct instructions in advance.

In computer system architecture, a branch predictor or branch prediction unit is a digital circuit that attempts to predict which path a branch (e.g., an if-then-else structure or a jump instruction) will take before the result is actually computed and known. The purpose of a branch predictor is usually to improve the flow in the instruction pipeline. In many modern pipelined microprocessor architectures, the branch predictor plays a critical role in achieving high effective performance.

Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code, which immediately follows the conditional jump, or it can be "taken" and jump to a different location in program memory where the second branch of code is stored. It is usually not known with certainty whether a conditional jump will be taken or not taken until the condition has been computed and the conditional jump has reached the execute stage of the instruction pipeline.

Without branch prediction, the processor would typically have to wait until the conditional jump instruction has reached the execute stage before the next instruction can enter the fetch stage of the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If the branch predictor detects that the guessed branch is wrong, the speculatively executed or partially executed instructions are usually discarded and the pipeline starts over with the correct branch, incurring a delay.

Summary

According to one general aspect, an apparatus may include a branch prediction unit, a fetch unit, and a prefetcher circuit or unit. The branch prediction unit may be configured to output a predicted instruction. The fetch unit may be configured to fetch a next instruction from a cache memory. The prefetcher circuit may be configured to prefetch a previously predicted instruction into the cache memory based on a relationship between the predicted instruction and the next instruction.

According to another general aspect, a method may include predicting, by a prediction circuit, a predicted instruction to be executed by a processor. The method may include fetching, by a fetch circuit, a next instruction from a cache memory. The method may also include determining whether a relationship between the predicted instruction and the next instruction satisfies a set of one or more predetermined criteria. The method may include prefetching the predicted instruction into the cache memory if the set of one or more predetermined criteria is satisfied.

According to another general aspect, an apparatus may include a processor, a cache memory, a branch prediction unit, a fetch unit, and a prefetcher circuit or unit. The processor may be configured to execute instructions. The cache memory may be configured to temporarily store instructions. The branch prediction unit may be configured to output a predicted instruction, wherein the predicted instruction is speculatively predicted to be executed by the processor, and wherein the branch prediction unit is decoupled from the fetch unit. The fetch unit may be configured to fetch a next instruction from the cache memory. The prefetcher circuit may be configured to prefetch a previously predicted instruction into the cache memory in response to a relationship between the predicted instruction and the next instruction satisfying one or more predetermined criteria.

The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.

A system and/or method for prefetching data, and more particularly for controlling the prefetching of instructions, is substantially shown in and/or described in connection with at least one of the figures, and is set forth more completely in the specification.

Brief Description of the Drawings

FIG. 1 is a block diagram of an example embodiment of a system in accordance with the disclosed subject matter.

FIG. 2 is a block diagram of an example embodiment of an apparatus in accordance with the disclosed subject matter.

FIG. 3 is a block diagram of an example embodiment of an apparatus in accordance with the disclosed subject matter.

FIG. 4 is a schematic block diagram of an information processing system that may include devices formed according to principles of the disclosed subject matter.

Like reference numerals in the various drawings indicate like elements.

Detailed Description

Various example embodiments will be described more fully hereinafter with reference to the accompanying drawings, in which some example embodiments are shown. The disclosed subject matter may, however, be embodied in many different forms and should not be construed as limited to the example embodiments set forth herein. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosed subject matter to those skilled in the art. In the drawings, the sizes and relative sizes of layers and regions may be exaggerated for clarity.

It will be understood that when an element or layer is referred to as being "on," "connected to," or "coupled to" another element or layer, it can be directly on, directly connected to, or directly coupled to the other element or layer, or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element or layer, there are no intervening elements or layers present. Like reference numerals refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood that, although the terms "first," "second," "third," and so on may be used herein to describe various elements, components, regions, layers, and/or sections, these elements, components, regions, layers, and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer, or section from another element, component, region, layer, or section. Thus, a first element, component, region, layer, or section discussed below could be termed a second element, component, region, layer, or section without departing from the teachings of the disclosed subject matter.

For ease of description, spatially relative terms (such as "beneath," "below," "lower," "above," "upper," and the like) may be used herein to describe the relationship of one element or feature to another element or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the exemplary term "below" can encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations), and the spatially relative descriptors used herein are to be interpreted accordingly.

The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to limit the disclosed subject matter. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Example embodiments are described herein with reference to cross-sectional illustrations that are schematic illustrations of idealized example embodiments (and intermediate structures). Accordingly, variations from the shapes of the illustrations as a result, for example, of manufacturing techniques and/or tolerances are to be expected. Thus, example embodiments should not be construed as limited to the particular shapes of regions illustrated herein but are to include deviations in shapes that result, for example, from manufacturing. For example, an implanted region illustrated as a rectangle will typically have rounded or curved features and/or a gradient of implant concentration at its edges rather than an abrupt change from implanted to non-implanted region. Similarly, a buried region formed by implantation may result in some implantation in the region between the buried region and the surface through which the implantation takes place. Thus, the regions illustrated in the figures are schematic in nature, and their shapes are not intended to illustrate the actual shape of a region of a device and are not intended to limit the scope of the disclosed subject matter.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosed subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Hereinafter, example embodiments will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an example embodiment of a system 100 in accordance with the disclosed subject matter. In various embodiments, the system 100 may include a computer, a number of discrete integrated circuits, or a system on a chip (SoC). As described below, the system 100 may include a number of other components that are not shown in the figure so as not to obscure the disclosed subject matter.

In the illustrated embodiment, the system 100 includes a system memory or main memory 104. In various embodiments, the system memory 104 may be made up of dynamic random access memory (DRAM), although it should be understood that this is merely one illustrative example to which the disclosed subject matter is not limited. In such an embodiment, the system memory 104 may include memory on modules (e.g., dual in-line memory modules (DIMMs)), may be integrated chips that are soldered or otherwise fixedly integrated with the system 100, or may even be incorporated as part of an integrated chip that includes the system 100 (e.g., an SoC). It should be understood that the above are merely a few illustrative examples to which the disclosed subject matter is not limited.

In the illustrated embodiment, the system memory 104 may be configured to store pieces of data or information. These pieces of data may include instructions that cause the processor 102 to perform various operations. The system memory 104 may generally be part of a larger memory hierarchy that includes a number of cache memories. In various embodiments, the operations described herein may be performed by another layer or level of that memory hierarchy (e.g., a level 2 (L2) cache). Those skilled in the art will understand that, although the operations are described with reference to the system memory 104, the disclosed subject matter is not limited to this illustrative example.

In the illustrated embodiment, the system 100 also includes the processor 102. The processor 102 may be configured to perform a plurality of operations as directed by various instructions. These instructions may be executed by various execution units (mostly not shown), such as an arithmetic logic unit (ALU), a floating point unit (FPU), a load/store unit (LSU), and an instruction fetch unit (IFU) 116. It should be understood that a unit is simply a collection of circuits grouped together to perform a portion of the functions of the processor 102. Typically, a unit performs one or more operations in the pipelined architecture of the processor 102.

In the illustrated embodiment, the processor 102 may include a branch prediction unit (BPU) or circuit 112. As described above, while the processor 102 is executing a stream of instructions, one (or more) of the instructions may be a branch instruction. A branch instruction is an instruction that causes the instruction stream to branch or fork between two or more paths. A typical example of a branch instruction is an if-then structure, in which a first set of instructions is executed if a certain condition is met (e.g., the user clicks an "OK" button), and a second set of instructions is executed if the condition is not met (e.g., the user clicks a "Cancel" button). As described above, this is a problem in pipelined processor architectures, because new instructions must enter the pipeline of the processor 102 before the result of the branch, jump, or if-then structure is known (since the pipeline stage that resolves the branch instruction is deep in the pipeline). Therefore, either new instructions must be prevented from entering the pipeline until the branch instruction is resolved (thus negating the main advantage of a pipelined architecture), or the processor 102 must guess which path the instruction stream will take and speculatively place those instructions in the pipeline. The BPU 112 may be configured to predict how the instruction stream will branch. In the illustrated embodiment, the BPU 112 may be configured to output a predicted instruction 172 or, more precisely, the memory address at which the predicted instruction 172 is stored.

In the illustrated embodiment, the processor 102 includes a branch prediction address queue (BPAQ) 114. The BPAQ 114 may include a memory structure configured to store a plurality of addresses of predicted instructions 172 that have been predicted by the BPU 112. The BPAQ 114 may store the addresses of these predicted instructions 172 in first-in-first-out (FIFO) order, so that instruction addresses are output from the BPAQ 114 in the same order in which the BPU 112 predicted them.
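The FIFO behavior described for the BPAQ can be illustrated with a small software model. This is only a sketch under stated assumptions, not the disclosed circuit: the class name, method names, and the `capacity` parameter are hypothetical, and the text does not say what happens when the queue is full (here `push` simply reports failure).

```python
from collections import deque


class BranchPredictionAddressQueue:
    """Toy software model of a FIFO of predicted instruction addresses.

    Hypothetical illustration only; names and capacity are assumptions.
    """

    def __init__(self, capacity=6):
        self.capacity = capacity  # assumed depth, not specified by the text
        self._entries = deque()   # oldest address sits at the left end

    def push(self, address):
        """Model the BPU enqueuing a predicted address; False if the queue is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append(address)
        return True

    def pop(self):
        """Model the IFU dequeuing the oldest address; None if the queue is empty."""
        return self._entries.popleft() if self._entries else None

    def __len__(self):
        return len(self._entries)
```

Addresses come out in the order in which they were pushed, which is the only property the description relies on.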

In the illustrated embodiment, the processor 102 includes an instruction fetch unit (IFU) 116 configured to fetch instructions from the memory hierarchy and place them in the pipeline of the processor 102. In such an embodiment, the IFU 116 may be configured to obtain from the BPAQ 114 the memory address associated with at least the newest or oldest instruction (the next instruction 174), and to request the actual instruction 174 from the memory hierarchy. Ideally, the instruction 174 will be provided quickly from the memory hierarchy and placed in the pipeline of the processor 102.

In the illustrated embodiment, the BPAQ 114 is configured to decouple the BPU 112 from the IFU 116. In such an embodiment, the BPU 112 predicts instructions 172 at a rate different from (e.g., faster than) the rate at which the IFU 116 consumes them. In a coupled or unbuffered architecture or embodiment, the IFU 116 would be required to fetch predicted instructions as quickly as the BPU 112 predicts them. In the illustrated embodiment, the IFU 116 may experience delays (e.g., cache misses, pipeline stalls, etc.) without affecting the BPU 112, because any additional addresses of predicted instructions 172 produced by the BPU 112 can simply be stored in the BPAQ 114. When the IFU 116 is able to resume consuming the addresses of new instructions 174, the addresses will be waiting in the BPAQ 114.

Returning to the fetch mechanism of the IFU 116, the instruction 174 would ideally be fetched (via a memory access 178) from a level 1 (L1) instruction cache 118. In such an embodiment, as the top or a higher level of the memory hierarchy, the L1 instruction cache 118 may be relatively fast and incur little or no delay in the pipeline. Occasionally, however, the L1 instruction cache 118 may not contain the desired instruction 174. This results in a cache miss, and the instruction has to be fetched or loaded from a lower, slower level of the memory hierarchy (e.g., the system memory 104). Such a cache miss may cause a delay in the pipeline of the processor 102, because instructions will not be fed into the pipeline at a rate of one per cycle (or whatever the maximum rate is in the processor's architecture).

In the illustrated embodiment, the processor 102 includes an instruction prefetch unit (IPFU) 120. The IPFU 120 is configured to prefetch instructions into the L1 instruction cache 118 before the IFU 116 performs the actual fetch operation. The IPFU 120 thereby reduces the occurrence of cache misses experienced by the IFU 116. The IPFU 120 may do this by requesting the predicted instruction 172 from the L1 instruction cache 118 before the IFU 116 does. In such an embodiment, if a cache miss then occurs, the L1 instruction cache 118 will begin the process of requesting the missing instruction from the system memory 104. In such an embodiment, the instruction may have been received and stored in the L1 instruction cache 118 by the time the IFU 116 requests it.
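The effect of an early request on a later fetch can be sketched with a very small cache model. This is an illustration under simplifying assumptions, not the disclosed hardware: the class and method names are hypothetical, a line fill is modeled as completing immediately (real fills take many cycles), and capacity and eviction are ignored.

```python
class L1InstructionCache:
    """Toy model: tracks which addresses are resident, nothing more.

    Hypothetical illustration; fills complete instantly and nothing is evicted.
    """

    def __init__(self):
        self.lines = set()          # addresses currently resident
        self.fills_from_memory = 0  # number of requests sent to lower levels

    def _fill(self, address):
        # Stand-in for requesting the line from system memory.
        self.lines.add(address)
        self.fills_from_memory += 1

    def prefetch(self, address):
        """Prefetcher-side probe: returns hit/miss status only, starts a fill on a miss."""
        hit = address in self.lines
        if not hit:
            self._fill(address)
        return hit

    def fetch(self, address):
        """Fetch-side access: returns True on a hit, fills on a miss."""
        hit = address in self.lines
        if not hit:
            self._fill(address)
        return hit
```

In this model, a `prefetch` that misses causes the fill, so a later `fetch` of the same address hits and only one request reaches the lower levels of the hierarchy.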

In various embodiments, because one purpose of the IPFU 120 is to make sure that the L1 instruction cache 118 is preloaded with any desired instructions, and not to actually make use of the instructions themselves, the IPFU 120 may discard any returned instruction 172 or may not include the structures needed to actually receive the instruction 172. For example, the signals 177 between the IPFU 120 and the L1 instruction cache 118 may include the control signals needed to request memory addresses and to receive any status updates regarding those requests (e.g., whether there was a cache miss, whether the request was fulfilled), but may not include the data lines or signals that would carry the actual data requested by the memory access (e.g., the values that make up the instruction 172 itself). It should be understood that the above is merely one illustrative example to which the disclosed subject matter is not limited.

In the illustrated embodiment, the IPFU 120 may be configured to prefetch the predicted instruction 172 only if a certain relationship exists between the predicted instruction 172 (issued by the BPU 112) and the next instruction 174 (fetched by the IFU 116). In various embodiments, the relationship between the two instructions 172 and 174 may be a measure of how speculative the prediction of the BPU 112 is. For example, if the BPU 112 is unsure of the correctness of its prediction, the IPFU 120 may not wish to prefetch the predicted instruction 172.

In various embodiments, prefetching an incorrectly predicted instruction 172 may have undesirable effects. For example, if an incorrectly predicted instruction 172 is fetched and placed in the pipeline, it will later have to be flushed and the results of any computation will have to be undone, which is a costly operation for the processor 102. In addition, any instruction that is loaded into the L1 instruction cache 118 but is never fetched or used may cause cache pollution. Cache pollution generally occurs when unused or unwanted data fills the limited space of a cache and evicts data that may be desired or useful. It should be understood that the above are merely a few illustrative examples to which the disclosed subject matter is not limited.

In some embodiments, the BPU 112 may indicate to the IPFU 120 how speculative the predicted instruction 172 is. In the illustrated embodiment, however, the degree of speculation may be inferred from readily available information found within the BPAQ 114 (shown by signal 176). In such an embodiment, the relationship between the predicted instruction 172 and the next instruction 174 may be their distance from each other in terms of their respective addresses in the BPAQ 114, or the current depth of the BPAQ 114. In such an embodiment, the further behind the IFU 116 is, the more speculative the predicted instruction 172 becomes. In various embodiments, if the distance exceeds a predetermined threshold (discussed with reference to FIG. 2), the IPFU 120 may throttle or suppress any prefetching. If the depth is below the threshold, however, the IPFU 120 may prefetch the predicted instruction 172. It should be understood that the above is merely one illustrative example to which the disclosed subject matter is not limited.
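The throttling decision described above reduces to a comparison between the queue depth and a threshold. The following is a minimal sketch, assuming a hypothetical function name and parameters. The text says prefetching is suppressed when the distance exceeds the threshold and allowed when the depth is below it; the case where the two are equal is not specified, so this sketch arbitrarily throttles at equality.

```python
def should_prefetch(queue_depth, threshold):
    """Return True if the predicted instruction should be prefetched.

    queue_depth: number of predicted addresses waiting between the fetch
                 point and the newest prediction (hypothetical parameter).
    threshold:   predetermined limit on how speculative a prefetch may be.

    Assumption: depth equal to the threshold is treated as throttled,
    because the description leaves that case open.
    """
    return queue_depth < threshold
```

A deeper queue means the fetch unit is further behind the predictor, so the newest prediction is more speculative and is less likely to be worth the cache space.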

In one embodiment, if the IPFU 120 has throttled prefetching and then unthrottles or resumes prefetching, the IPFU 120 may be configured to resume at whichever instruction had been predicted when the throttling began, so that all instructions are prefetched (as long as they are still in the BPAQ 114). In another embodiment, the IPFU 120 may simply not prefetch any predicted instruction 172 issued by the BPU 112 during the throttling period, and may start or resume prefetching when the BPU 112 outputs a new address of a predicted instruction 172. In such an embodiment, the IFU 116 is responsible for fetching the missed instructions even though they were not prefetched. It should be understood that the above are merely a few illustrative examples to which the disclosed subject matter is not limited.
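The difference between the two resume behaviors can be shown by replaying a sequence of predictions. This is a sketch only: the function name and the policy labels `"catch_up"` and `"skip_missed"` are hypothetical, and the model assumes that every address skipped during throttling is still queued when prefetching resumes.

```python
def addresses_prefetched(events, policy):
    """Replay predictions and return the addresses the prefetcher requests, in order.

    events: list of (address, throttled) pairs in prediction order.
    policy: "catch_up"    -> on resume, prefetch everything skipped while throttled
            "skip_missed" -> never prefetch addresses predicted while throttled
    Both labels are hypothetical names for the two embodiments described above.
    """
    prefetched = []
    pending = []  # addresses predicted during the throttled period
    for address, throttled in events:
        if throttled:
            if policy == "catch_up":
                pending.append(address)
            continue
        # Not throttled: drain anything saved up, then prefetch the new address.
        prefetched.extend(pending)
        pending.clear()
        prefetched.append(address)
    return prefetched
```

Under `"skip_missed"`, the skipped addresses are left for the fetch unit to request on demand, which matches the second embodiment.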

FIG. 2 is a block diagram of an example embodiment of an apparatus 200 in accordance with the disclosed subject matter. In various embodiments, the apparatus 200 may be part of a processor, as described above. In some embodiments, the apparatus 200 may be part of a larger or unified instruction fetch and prefetch unit that is configured to predict, prefetch, and ultimately fetch instructions for execution by the processor. In another embodiment, the elements described herein may be discrete. It should be understood that the above are merely a few illustrative examples to which the disclosed subject matter is not limited.

In the illustrated embodiment, the apparatus 200 includes a branch prediction unit (BPU) 212 and an instruction fetch unit (IFU) 216. The apparatus 200 also includes an instruction prefetch unit (IPFU) 220 and a branch prediction address queue (BPAQ) 214.

In the illustrated embodiment, the internals of BPAQ 214 are emphasized and shown in detail. Although six queue entries or fields are shown, it should be understood that this is merely one illustrative example, to which the disclosed subject matter is not limited. As BPU 212 outputs predicted instructions or, more specifically, the memory addresses of predicted instructions, they may be enqueued within BPAQ 214 in a first-in, first-out (FIFO) manner. In the illustrated embodiment, four instruction addresses are queued within BPAQ 214: the address of predicted instruction 254b, the address of predicted instruction 254a, the address of predicted instruction 254, and the address of next instruction 252. The oldest predicted instruction is referred to as next instruction 252 because it is the next instruction to be processed by IFU 216 (assuming no unusual event, such as a pipeline flush, occurs).

In the illustrated embodiment, BPAQ 214 maintains a valid instruction count (VIC) 256. In such an embodiment, VIC 256 may simply be a count of the number of predicted-instruction addresses currently stored by BPAQ 214 (e.g., four in the illustrated example). In some embodiments, it may be the index of the most recently enqueued instruction address (that of predicted instruction 254b). In another embodiment, VIC 256 may be the difference between the index of the address of next instruction 252 and the index of the address of the most recently predicted instruction 254b. It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.
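As a non-limiting illustration, the alternative formulations of the valid instruction count may be sketched in Python on a small circular buffer. The class `Bpaq`, its method names, and the use of zero-based slot indices are assumptions made for this sketch only; the disclosure does not specify how the queue is indexed.

```python
class Bpaq:
    """Toy circular FIFO of predicted-instruction addresses."""

    def __init__(self, capacity=6):
        self.slots = [None] * capacity
        self.head = 0    # slot holding the next instruction (oldest entry)
        self.count = 0   # number of valid entries

    def push(self, address):
        """Enqueue a newly predicted address at the tail."""
        if self.count == len(self.slots):
            raise OverflowError("queue is full")
        self.slots[(self.head + self.count) % len(self.slots)] = address
        self.count += 1

    def pop(self):
        """Dequeue the next instruction address for the fetch unit."""
        if self.count == 0:
            raise IndexError("queue is empty")
        address = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return address

    def newest_index(self):
        """Slot index of the most recently enqueued address."""
        if self.count == 0:
            raise IndexError("queue is empty")
        return (self.head + self.count - 1) % len(self.slots)

    def vic_as_count(self):
        """VIC as the number of addresses currently stored."""
        return self.count

    def vic_as_index_difference(self):
        """VIC as the wrap-around distance from the next to the newest entry."""
        if self.count == 0:
            return 0
        return (self.newest_index() - self.head) % len(self.slots)
```

With four addresses enqueued into an empty queue, the count formulation yields 4, while the newest-entry index and the index difference both yield 3 under zero-based indexing, so a threshold would need to be chosen with the formulation in mind.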

In such an embodiment, VIC 256 is provided to IPFU 220. IPFU 220 receives VIC 256 as an indicator of the degree of speculation, or level of confidence, associated with the most recently predicted instruction 254b. In the illustrated embodiment, IPFU 220 may compare VIC 256 against a threshold 266 (stored by IPFU 220). If VIC 256 is below threshold 266, IPFU 220 may prefetch the most recently predicted instruction 254b. If VIC 256 is greater than or equal to threshold 266, IPFU 220 may throttle or otherwise refrain from prefetching the most recently predicted instruction 254b. In such an embodiment, IPFU 220 may be configured to halt any prefetching if the degree of speculation associated with the most recently predicted instruction 254b is too high or the level of confidence is too low. It should be understood that the above is merely one illustrative example, to which the disclosed subject matter is not limited; in other embodiments, throttling may be based on whether VIC 256 is less than a threshold, or on any other comparison mechanism (such as a sliding comparison).

In various embodiments, threshold 266 may be dynamically adjusted. In the illustrated embodiment, however, threshold 266 may be a static, predetermined value. In one embodiment, threshold 266 may have a value of 2, although it should be understood that this is merely one illustrative example, to which the disclosed subject matter is not limited.
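As a non-limiting illustration, the comparison of the valid instruction count against a static threshold may be sketched in Python. The function name `should_prefetch_by_vic` is hypothetical; the default threshold of 2 is the example value given above, and the "below the threshold" rule is the one described for the illustrated embodiment.

```python
def should_prefetch_by_vic(valid_instruction_count, threshold=2):
    """Return True if the newest predicted instruction may be prefetched.

    A count below the threshold means the newest prediction is close to the
    next instruction, so prefetching proceeds; a count at or above the
    threshold means the prediction is too far ahead, so prefetching is
    throttled.
    """
    return valid_instruction_count < threshold
```

In the four-entry example of FIG. 2, the count of four is not below the threshold of 2, so under this rule the address of predicted instruction 254b would not be prefetched.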

FIG. 3 is a block diagram of an example embodiment of a device 300 in accordance with the disclosed subject matter. In various embodiments, device 300 may be part of a processor, as described above. In some embodiments, device 300 may be part of a larger or unified instruction fetch unit or prefetch unit configured to predict, prefetch, and ultimately fetch instructions for execution by the processor. In another embodiment, the elements described herein may be discrete. It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.

In the illustrated embodiment, device 300 includes a branch prediction unit (BPU) 312, an instruction fetch unit (IFU) 216, an instruction prefetch unit (IPFU) 320, and a branch prediction address queue (BPAQ) 314.

Although BPAQ 314 is shown with six queue entries or fields, it should be understood that this is merely one illustrative example, to which the disclosed subject matter is not limited. As BPU 312 outputs predicted instructions or, more specifically, the memory addresses of predicted instructions, they may be enqueued within BPAQ 314 in a FIFO manner. In the illustrated embodiment, four instruction addresses are currently queued within BPAQ 314: the address of predicted instruction 254b, the address of predicted instruction 254a, the address of predicted instruction 254, and the address of next instruction 252.

In the illustrated embodiment, BPU 312 outputs confidence levels 354b, 354a, 354, and 352, which are associated with the addresses of predicted instruction 254b, predicted instruction 254a, predicted instruction 254, and next instruction 252, respectively. Confidence levels 354b, 354a, 354, and 352 may indicate the level of confidence, or degree of speculation, that BPU 312 has that the predicted instruction will actually be executed by the processor.

In some embodiments, confidence levels 354b, 354a, 354, and 352 may be binary and indicate whether any speculation at all is involved in the prediction. For example, if no unresolved branch instruction has been encountered in the instruction stream, all of the instructions in the stream will necessarily be taken (barring an unforeseen error, such as a divide-by-zero error). Therefore, any of predicted instruction 254b, predicted instruction 254a, predicted instruction 254, and next instruction 252 may be predicted with absolute confidence, and it may be assumed that they will always need to be prefetched. Conversely, if an unresolved branch instruction has been encountered, predicted instructions 254b, 254a, and 254 and next instruction 252 may all be speculative in nature, and IPFU 320 may act accordingly, as described below.

In another embodiment, confidence levels 354b, 354a, 354, and 352 indicate the number of branch instructions in the instruction stream that are still unresolved. For example, if no unresolved branch instruction has been encountered, the predicted instruction 254 that is issued may be associated with a confidence level 354 of 0 (where a lower value indicates higher confidence). Once a first unresolved branch instruction is encountered, confidence level 354 may increase to 1. Predicted instructions 254 predicted between the first unresolved branch instruction and a second unresolved branch instruction may also be associated with a confidence level of 1, because BPU 312 is no more right or wrong about those predicted instructions than it is about the first branch instruction that started this fork or branch of the instruction stream. When the second unresolved branch instruction is encountered, however, confidence level 354 may increase to 2, and all subsequent instructions may be tagged with a value of 2 (until a third unresolved branch instruction is encountered). The value of 2 indicates that those instructions are doubly speculative and are based on two levels of guessing.
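As a non-limiting illustration, tagging each predicted address with the number of unresolved branches that precede it may be sketched in Python. The function name `tag_speculation_depth` and the input format are hypothetical. The sketch assumes that an unresolved branch is itself tagged with the increased value, and it does not model the count decreasing when a branch resolves; neither point is specified above.

```python
def tag_speculation_depth(stream):
    """Tag each address with the count of unresolved branches seen so far.

    stream: iterable of (address, is_unresolved_branch) pairs in predicted
    program order. Returns a list of (address, depth) pairs, where a lower
    depth means higher confidence.
    """
    depth = 0
    tagged = []
    for address, is_unresolved_branch in stream:
        if is_unresolved_branch:
            depth += 1  # one more level of guessing from this point on
        tagged.append((address, depth))
    return tagged
```

For a stream in which the second and fourth entries are unresolved branches, the five entries would be tagged 0, 1, 1, 2, 2 under these assumptions.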

In another embodiment, the confidence level (e.g., confidence level 354) indicates the internal state of a state machine or counter of BPU 312. In various embodiments, BPU 312 may base its predictions on an internal state machine (not shown). In some embodiments, a four-state state machine is employed having states such as, for example, strongly not-taken, weakly not-taken, weakly taken, and strongly taken. In such an embodiment, information about the state (e.g., the "strong" or "weak" aspect) may be included in the confidence level (e.g., confidence level 354). It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.
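As a non-limiting illustration, a four-state predictor of this kind may be sketched in Python as a conventional two-bit saturating counter. The disclosure names the four states but does not specify the transitions; the transition rule below is the common textbook scheme and is an assumption of this sketch, as are the class and method names.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


class TwoBitCounter:
    """Conventional 2-bit saturating counter for a single branch."""

    def __init__(self, state=WEAKLY_NOT_TAKEN):
        self.state = state

    def predict_taken(self):
        """Predict 'taken' in either of the two taken states."""
        return self.state >= WEAKLY_TAKEN

    def is_strong(self):
        """The 'strong' or 'weak' aspect that could feed the confidence level."""
        return self.state in (STRONGLY_NOT_TAKEN, STRONGLY_TAKEN)

    def update(self, taken):
        """Move one step toward the observed outcome, saturating at the ends."""
        if taken:
            self.state = min(STRONGLY_TAKEN, self.state + 1)
        else:
            self.state = max(STRONGLY_NOT_TAKEN, self.state - 1)
```

A prefetcher could, for example, treat a prediction made from a "weak" state as more speculative than one made from a "strong" state.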

In the illustrated embodiment, confidence levels 354b, 354a, 354, and 352 may be stored within BPAQ 314 alongside their respective predicted instructions 254b, 254a, and 254 and next instruction 252. In the illustrated embodiment, next instruction 252 is associated with confidence level 352; predicted instruction 254 with confidence level 354; predicted instruction 254a with confidence level 354a; and predicted instruction 254b with confidence level 354b. In other embodiments, however, BPAQ 314 may not store confidence levels 354b, 354a, 354, and 352, such that IPFU 320 uses only the confidence level associated with the most recently predicted instruction (e.g., confidence level 354b associated with predicted instruction 254b).

In such an embodiment, IPFU 320 receives a confidence level together with the predicted instruction address (e.g., confidence level 354 together with the address of predicted instruction 254). In the illustrated embodiment, IPFU 320 may be configured to take confidence level 354b into account when determining whether to prefetch predicted instruction 254b or to throttle any prefetching. In the illustrated embodiment, IPFU 320 may include or store a confidence level threshold 366. In such an embodiment, if confidence level 354b (in the case of predicted instruction 254b) is greater than confidence level threshold 366, prefetching is halted or throttled. However, a predicted instruction 254b having a confidence level 354b below confidence level threshold 366 may be prefetched.

In some embodiments, IPFU 320 may base the prefetch decision on the difference between confidence level 352 of next instruction 252 and confidence level 354b of the most recently predicted instruction 254b. In such an embodiment, threshold 366 may be compared against the relative confidence level rather than an absolute value. That is, the relationship or difference in confidence level between next instruction 252 and predicted instruction 254b is the determining factor in the prefetch decision. It should be understood that the above is merely one illustrative example, to which the disclosed subject matter is not limited.
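As a non-limiting illustration, a prefetch decision based on relative rather than absolute confidence may be sketched in Python. The sketch assumes the encoding in which a lower value indicates higher confidence (for example, a count of unresolved branches), and it assumes that a difference exactly equal to the threshold still permits prefetching; the function and parameter names are hypothetical.

```python
def should_prefetch_relative(next_confidence, newest_confidence, threshold):
    """Decide using the confidence gap between the next and newest entries.

    next_confidence: confidence value of the next instruction to be fetched.
    newest_confidence: confidence value of the most recently predicted one.
    Lower values mean higher confidence, so the gap measures how much extra
    speculation separates the newest prediction from the next instruction.
    """
    speculation_gap = newest_confidence - next_confidence
    return speculation_gap <= threshold
```

With this formulation, a newest prediction tagged 3 would be prefetched when the next instruction is tagged 1 and the threshold is 2, but not when the next instruction is tagged 0.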

In another embodiment, IPFU 320 may also employ VIC 256 as described above. For example, if BPU 312 generates a large number of high-confidence predicted instructions 254 but IFU 216 fetches only a small number of next instructions 252, the depth of BPAQ 314 becomes relatively large. As a result, an embodiment of IPFU 320 that bases its prefetch decisions solely on confidence level 354 may prefetch so many predicted instructions 254 that the cache memory becomes full or polluted. By the time IFU 216 finally gets around to fetching next instruction 252, the cache memory may already have evicted it in favor of a later predicted instruction (e.g., predicted instruction 254b), negating the intended benefit of prefetching.

In such an embodiment, a multi-criteria IPFU 320 may be employed. IPFU 320 may be configured to consider both confidence level 354 and VIC 256. If either the value of confidence level 354 or the value of VIC 256 exceeds its respective threshold 366 or 266, IPFU 320 may throttle the prefetch operation. In such an embodiment, IPFU 320 may be configured to prefetch predicted instructions 254 that are not overly speculative, while also not filling the cache memory with instructions 254 that have not yet been requested by IFU 216. It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.
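As a non-limiting illustration, the multi-criteria decision may be sketched in Python. The function and parameter names are hypothetical. The sketch follows the wording above, throttling only when a value strictly exceeds its threshold, and assumes the encoding in which a lower confidence value indicates higher confidence.

```python
def should_prefetch_multi(confidence_level, valid_instruction_count,
                          confidence_threshold, vic_threshold):
    """Prefetch only if the prediction is neither too speculative nor too far ahead."""
    too_speculative = confidence_level > confidence_threshold
    queue_too_deep = valid_instruction_count > vic_threshold
    # Exceeding either threshold is enough to throttle prefetching.
    return not (too_speculative or queue_too_deep)
```

Combining the two checks in this way addresses the cache pollution case: a stream of highly confident predictions is still throttled once the queue depth shows that the fetch unit has fallen far behind.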

It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited. In other embodiments, various other schemes may be employed by an embodiment of the IPFU to determine whether a predicted instruction should be prefetched. The disclosed subject matter is not limited to the embodiments discussed with respect to FIG. 2 or FIG. 3.

FIG. 4 is a schematic block diagram of an information processing system 400 that may include semiconductor devices formed according to the principles of the disclosed subject matter.

Referring to FIG. 4, information processing system 400 may include one or more devices constructed according to the principles of the disclosed subject matter. In another embodiment, information processing system 400 may employ or execute one or more techniques according to the principles of the disclosed subject matter.

In various embodiments, information processing system 400 may include a computing device (such as, for example, a laptop computer, desktop computer, workstation, server, blade server, personal digital assistant, smartphone, tablet, or other appropriate computer), or a virtual machine or virtual computing device thereof. In various embodiments, information processing system 400 may be used by a user (not shown).

Information processing system 400 according to the disclosed subject matter may further include a central processing unit (CPU), logic, or processor 410. In some embodiments, processor 410 may include one or more functional unit blocks (FUBs) or combinational logic blocks (CLBs) 415. In such an embodiment, a combinational logic block may include various Boolean logic operations (e.g., NAND, NOR, NOT, XOR, etc.), stabilizing logic devices (e.g., flip-flops, latches, etc.), other logic devices, or a combination thereof. These combinational logic operations may be configured in a simple or complex fashion to process input signals to achieve a desired result. It should be understood that while a few illustrative examples of synchronous combinational logic operations are described, the disclosed subject matter is not so limited and may include asynchronous operations, or a mixture thereof. In one embodiment, the combinational logic operations may comprise a plurality of complementary metal oxide semiconductor (CMOS) transistors. In various embodiments, these CMOS transistors may be arranged into gates that perform the logical operations, although it should be understood that other technologies may be used and are within the scope of the disclosed subject matter.

Information processing system 400 according to the disclosed subject matter may further include a volatile memory 420 (e.g., a random access memory (RAM), etc.). Information processing system 400 according to the disclosed subject matter may further include a non-volatile memory 430 (e.g., a hard drive, an optical memory, a NAND or flash memory, etc.). In some embodiments, volatile memory 420, non-volatile memory 430, or a combination or portions thereof may be referred to as a "storage medium." In various embodiments, volatile memory 420 and/or non-volatile memory 430 may be configured to store data in a semi-permanent or substantially permanent form.

In various embodiments, information processing system 400 may include one or more network interfaces 440 configured to allow information processing system 400 to be part of, and communicate via, a communications network. Examples of a Wi-Fi protocol may include, but are not limited to, Institute of Electrical and Electronics Engineers (IEEE) 802.11g and IEEE 802.11n. Examples of a cellular protocol may include, but are not limited to, IEEE 802.16m (a.k.a. Wireless-MAN (Metropolitan Area Network) Advanced), Long Term Evolution (LTE) Advanced, Enhanced Data rates for GSM (Global System for Mobile Communications) Evolution (EDGE), and Evolved High-Speed Packet Access (HSPA+). Examples of a wired protocol may include, but are not limited to, IEEE 802.3 (a.k.a. Ethernet), Fibre Channel, and power line communication (e.g., HomePlug, IEEE 1901, etc.). It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.

Information processing system 400 according to the disclosed subject matter may further include a user interface unit 450 (e.g., a display adapter, a haptic interface, a human interface device, etc.). In various embodiments, user interface unit 450 may be configured to receive input from a user and/or provide output to a user. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.

In various embodiments, information processing system 400 may include one or more other devices or hardware components 460 (e.g., a display or monitor, a keyboard, a mouse, a camera, a fingerprint reader, a video processor, etc.). It should be understood that the above are merely a few illustrative examples, to which the disclosed subject matter is not limited.

Information processing system 400 according to the disclosed subject matter may further include one or more system buses 405. In such an embodiment, system bus 405 may be configured to communicatively couple processor 410, volatile memory 420, non-volatile memory 430, network interface 440, user interface unit 450, and the one or more hardware components 460. Data processed by processor 410, or data input from outside non-volatile memory 430, may be stored in either non-volatile memory 430 or volatile memory 420.

In various embodiments, information processing system 400 may include or execute one or more software components 470. In some embodiments, software components 470 may include an operating system (OS) and/or an application. In some embodiments, the OS may be configured to provide one or more services to an application and to manage, or act as an intermediary between, the applications and the various hardware components (e.g., processor 410, network interface 440, etc.) of information processing system 400. In such an embodiment, information processing system 400 may include one or more native applications, which may be installed locally (e.g., within non-volatile memory 430) and configured to be executed directly by processor 410 and to interact directly with the OS. In such an embodiment, the native applications may include pre-compiled machine-executable code. In some embodiments, the native applications may include a script interpreter (e.g., C shell (csh), AppleScript, AutoHotkey, etc.) or a virtual execution machine (VM) (e.g., the Java Virtual Machine, the Microsoft Common Language Runtime, etc.) configured to translate source or object code into executable code that is then executed by processor 410.

The semiconductor devices described above may be encapsulated using various packaging techniques. For example, semiconductor devices constructed according to the principles of the disclosed subject matter may be encapsulated using any one of the following: a package on package (POP) technique, a ball grid array (BGA) technique, a chip scale package (CSP) technique, a plastic leaded chip carrier (PLCC) technique, a plastic dual in-line package (PDIP) technique, a die in waffle pack technique, a die in wafer form technique, a chip on board (COB) technique, a ceramic dual in-line package (CERDIP) technique, a plastic metric quad flat package (PMQFP) technique, a plastic quad flat package (PQFP) technique, a small outline integrated circuit (SOIC) technique, a shrink small outline package (SSOP) technique, a thin small outline package (TSOP) technique, a thin quad flat package (TQFP) technique, a system in package (SIP) technique, a multi-chip package (MCP) technique, a wafer-level fabricated package (WFP) technique, a wafer-level processed stack package (WSP) technique, or other techniques as will be known to those skilled in the art.

Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps may also be performed by, and an apparatus may be implemented as, special purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

In various embodiments, a computer-readable medium may include instructions that, when executed, cause a device to perform at least a portion of the method steps. In some embodiments, the computer-readable medium may be included in a magnetic medium, an optical medium, another medium, or a combination thereof (e.g., a CD-ROM, a hard drive, a read-only memory, a flash drive, etc.). In such an embodiment, the computer-readable medium may be a tangibly and non-transitorily embodied article of manufacture.

While the principles of the disclosed subject matter have been described with reference to example embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made to the example embodiments without departing from the spirit and scope of these disclosed concepts. Therefore, it should be understood that the above embodiments are not limiting, but are illustrative only. Thus, the scope of the disclosed concepts is to be determined by the broadest permissible interpretation of the claims and their equivalents, and should not be restricted or limited by the foregoing description. It is, therefore, to be understood that the claims are intended to cover all such modifications and changes as fall within the scope of the embodiments.

Claims

1. An apparatus for prefetching data, comprising:

a branch prediction unit configured to output a predicted instruction;

a fetch unit configured to fetch a next instruction from a cache memory; and

a prefetcher circuit configured to prefetch a previously predicted instruction into the cache memory based on a relationship between the predicted instruction and the next instruction.

2. The apparatus of claim 1, wherein the prefetcher circuit is configured to limit prefetching if a relationship between the predicted instruction and the next instruction does not satisfy a predetermined criterion.

3. The apparatus of claim 1, further comprising a branch prediction queue configured to store one or more predicted instructions of the plurality of predicted instructions and separate the branch prediction unit from the fetch unit.

4. The apparatus of claim 3, wherein the branch prediction queue outputs a valid instruction count to the prefetcher circuit, wherein the valid instruction count indicates a relationship between the predicted instruction and the next instruction.

5. The apparatus of claim 1, wherein the prefetcher circuit is configured to prefetch the previously predicted instruction if the number of predicted instructions between the predicted instruction and the next instruction does not exceed a threshold.

6. The apparatus of claim 1, wherein the relationship between the predicted instruction and the next instruction is indicative of a degree of speculation employed by the branch prediction unit.

7. The apparatus of claim 1, wherein the branch prediction unit provides the memory address of the predicted instruction to the prefetcher circuit.

8. The apparatus of claim 1, wherein the branch prediction unit indicates, together with the predicted instruction, a level of confidence that the branch prediction unit has that the predicted instruction will be executed by the apparatus;

wherein the prefetcher circuit is configured to prefetch the predicted instruction if the level of confidence is at or above a threshold, and to refrain from prefetching the predicted instruction if the level of confidence is below the threshold.

9. A method for prefetching data comprising:

predicting, by a prediction circuit, a predicted instruction to be executed by a processor;

fetching, by a fetch circuit, a next instruction from a cache memory;

determining whether a relationship between the predicted instruction and the next instruction satisfies a set of one or more predetermined criteria; and

prefetching the predicted instruction into the cache memory if the set of one or more predetermined criteria is satisfied.

10. The method of claim 9, further comprising: limiting the prefetching of the predicted instruction into the cache memory if the relationship between the predicted instruction and the next instruction does not satisfy the set of one or more predetermined criteria.

11. The method of claim 9, further comprising enqueuing the predicted instruction in a queue, wherein the queue separates the prediction circuit from the fetch circuit.

12. The method of claim 11, wherein determining the relationship comprises:

receiving, by a prefetcher circuit, a valid instruction count from the queue, wherein the valid instruction count indicates the relationship between the predicted instruction and the next instruction.

13. The method of claim 11, wherein determining a relationship comprises determining whether a number of predicted instructions between a predicted instruction and a next instruction exceeds a threshold.

14. The method of claim 9, wherein determining the relationship comprises determining a degree of speculation involved in predicting the predicted instruction.

15. The method of claim 9, wherein prefetching the predicted instruction comprises receiving a memory address of the predicted instruction from a prediction circuit.

16. An apparatus for prefetching data comprising:

a processor configured to execute instructions;

a cache memory configured to temporarily store instructions;

a branch prediction unit configured to output a predicted instruction, wherein the predicted instruction is speculatively predicted to be executed by the processor, and wherein the branch prediction unit is separate from a fetch unit;

the fetch unit, configured to fetch a next instruction from the cache memory; and

a prefetcher circuit configured to prefetch a previously predicted instruction into the cache memory in response to a relationship between the predicted instruction and the next instruction satisfying one or more predetermined criteria.

17. The apparatus of claim 16, further comprising a branch prediction queue configured to store one or more predicted instructions of the plurality of predicted instructions and separate the branch prediction unit from the fetch unit.

18. The apparatus of claim 17, wherein the branch prediction queue outputs a valid instruction count to the prefetcher circuit;

wherein the prefetcher circuit is configured to prefetch the previously predicted instruction if the valid instruction count does not exceed a threshold, and to refrain from prefetching the previously predicted instruction if the valid instruction count exceeds the threshold.

19. The apparatus of claim 16, wherein the relationship between the predicted instruction and the next instruction is indicative of a degree of speculation employed by the branch prediction unit.

20. The apparatus of claim 16, wherein the branch prediction unit provides the memory address of the predicted instruction to the prefetcher circuit.