CN114253607A - Method, system, and apparatus for out-of-order access to shared microcode sequencers by a clustered decode pipeline - Google Patents

Info

Publication number
CN114253607A
Authority
CN
China
Prior art keywords
cluster
decode
instruction
instruction block
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110982471.7A
Other languages
Chinese (zh)
Inventor
T·马达利尔
J·库姆斯
V·阿加瓦尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Publication of CN114253607A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/223 Execution means for microinstructions irrespective of the microinstruction function, e.g. decoding of microinstructions and nanoinstructions; timing of microinstructions; programmable logic arrays; delays and fan-out problems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818 Decoding for concurrent execution
    • G06F9/382 Pipelined decoding, e.g. using predecoding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818 Decoding for concurrent execution
    • G06F9/3822 Parallel decoding, e.g. parallel decode units
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842 Speculative instruction execution
    • G06F9/3844 Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Systems, methods, and apparatuses are described relating to circuitry for enabling out-of-order access to a shared microcode sequencer by a clustered decode pipeline. In one embodiment, a hardware processor core comprises: a first decode cluster comprising a plurality of decoder circuits; a second decode cluster comprising a plurality of decoder circuits; fetch circuitry to fetch a first instruction block and send the first instruction block to the first decode cluster, and to fetch a second instruction block younger in program order than the first instruction block and send the second instruction block to the second decode cluster; a microcode sequencer including a memory storing a plurality of micro-operations; and arbitration circuitry to arbitrate access by the first decode cluster and the second decode cluster to a shared read port of the memory, wherein the arbitration circuitry is to allow the second decode cluster, but not the first decode cluster, to access the shared read port of the memory when the number of micro-operations in the microcode sequencer that correspond to an instruction of the second instruction block is below an arbitration threshold.

Description

Method, system, and apparatus for out-of-order access to shared microcode sequencers by a clustered decode pipeline
Technical Field
The present disclosure relates generally to electronics, and more particularly, embodiments of the present disclosure relate to circuitry for enabling out-of-order access to a shared microcode sequencer by a clustered decode pipeline.
Background
A processor, or set of processors, executes instructions from an instruction set, e.g., an instruction set architecture (ISA). The instruction set is the part of the computer architecture related to programming, and generally includes the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). It should be noted that the term instruction herein may refer to a macro-instruction, e.g., an instruction that is provided to the processor for execution, or to a micro-instruction, e.g., an instruction that results from a decoder of the processor decoding a macro-instruction.
Drawings
The present disclosure is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 illustrates a processor core having multiple decode clusters and a shared microcode sequencer according to an embodiment of the disclosure.
FIG. 2 illustrates an example clustered decoding program flow in accordance with embodiments of the present disclosure.
FIG. 3 illustrates an arbitration circuit for arbitrating access to a microcode sequencer memory for multiple decode clusters according to an embodiment of the present disclosure.
FIG. 4 illustrates a flow diagram for arbitrating in-order access of microcode sequencer memory by multiple decode clusters according to an embodiment of the present disclosure.
FIG. 5 illustrates a flow diagram for arbitrating out-of-order access of multiple decode clusters to a microcode sequencer memory according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating operations for arbitrating out-of-order access by multiple decode clusters to a microcode sequencer memory according to embodiments of the present disclosure.
FIG. 7A is a block diagram illustrating an example in-order pipeline and an example register renaming out-of-order issue/execution pipeline, according to embodiments of the disclosure.
FIG. 7B is a block diagram illustrating an example embodiment of an in-order architecture core and an example register renaming out-of-order issue/execution architecture core to be included in a processor according to an embodiment of the disclosure.
FIG. 8A is a block diagram of a single processor core and its connection to an on-die interconnect network and its local subset of a level two (L2) cache, according to an embodiment of the present disclosure.
FIG. 8B is an expanded view of a portion of the processor core in FIG. 8A according to an embodiment of the present disclosure.
FIG. 9 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have an integrated graphics device, according to an embodiment of the disclosure.
FIG. 10 is a block diagram of a system according to one embodiment of the present disclosure.
FIG. 11 is a block diagram of a more specific example system in accordance with an embodiment of the present disclosure.
FIG. 12 is a block diagram of a second more specific example system according to an embodiment of the present disclosure.
FIG. 13 is a block diagram of a system on chip (SoC) according to an embodiment of the present disclosure.
FIG. 14 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the disclosure.
Detailed Description
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
References in the specification to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
A (e.g., hardware) processor (e.g., having one or more cores) may execute (e.g., user-level) instructions (e.g., a thread of instructions) to operate on data, for example, to perform arithmetic, logic, or other functions. For example, software may include a plurality of instructions (e.g., macro-instructions) that are provided to a processor (e.g., one or more cores thereof), which then executes (e.g., decodes and executes) the plurality of instructions to perform the corresponding operations. In some embodiments, a processor includes circuitry (e.g., one or more decoder circuits) to convert (e.g., decode) an instruction into one or more micro-operations (μops or micro-ops), e.g., where the micro-operations are executed directly by hardware (e.g., by execution circuitry). The one or more micro-operations corresponding to an instruction (e.g., a macro-instruction) may be referred to as the microcode flow for that instruction. A micro-operation may be referred to as a micro-instruction, e.g., a micro-instruction that results from a processor decoding a macro-instruction. In one embodiment, the instructions are 64-bit and/or 32-bit instructions of an instruction set architecture (ISA). In one embodiment, the instructions are instructions of the [Figure BDA0003229653540000031] instruction set architecture (ISA) (e.g., 64-bit and/or 32-bit). In some embodiments, converting an instruction into one or more micro-operations is associated with the instruction fetch and/or decode portion of the pipeline of the processor.
In some processors, microcode (e.g., comprising micro-operations) is stored in a memory (e.g., read-only memory (ROM)) of the processor, where that ROM is commonly referred to as the microcode ROM or μROM. In some embodiments, reading microcode out of the read-only memory (e.g., reading one or more micro-operations) is performed by a microcode sequencer (e.g., microcode sequencer circuitry) of the processor. In one embodiment, the data (e.g., micro-operations) in the read-only memory is stored there during manufacturing, e.g., the data is not modifiable afterwards (e.g., once the processor is at the customer). Thus, in some embodiments, the non-modifiable nature of the read-only memory that stores the microcode prevents updates to that microcode. Some processors include a patch memory for patching one or more micro-operations of the read-only memory. For example, for an instruction that is to be executed, the processor sources the set of micro-operations for the instruction from the patch memory rather than sourcing the (e.g., outdated) set of micro-operations for the instruction stored in the read-only memory. In some embodiments, the data stored in the patch memory is modifiable (e.g., once the processor is at the customer).
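The patch-memory behaviour described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary-based stores, the entry-point values, the micro-operation names, and the function `fetch_micro_ops` do not appear in the disclosure, and real hardware would not use a software lookup.

```python
# Hypothetical model: each store maps an entry point to a list of micro-op names.
MICROCODE_ROM = {
    0x10: ["uop_load", "uop_add", "uop_store"],
    0x20: ["uop_old_a", "uop_old_b"],  # outdated flow fixed at manufacture
}
# Hypothetical patch memory, modifiable after manufacture.
PATCH_MEMORY = {
    0x20: ["uop_new_a", "uop_new_b", "uop_new_c"],
}


def fetch_micro_ops(entry_point, rom=MICROCODE_ROM, patch=PATCH_MEMORY):
    """Return the micro-op flow for an entry point, preferring a patch if one exists."""
    if entry_point in patch:
        return patch[entry_point]
    return rom[entry_point]
```

An entry point without a patch is served from the ROM model, while a patched entry point is served from the patch memory model.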
Some processors (e.g., some cores) may implement multiple decoding clusters (e.g., where each cluster has multiple decoder circuits), for example, as a way to efficiently increase decoding bandwidth. In some embodiments, the decoder circuitry is to decode (e.g., macro) instructions into a set of one or more micro-operations to be executed (e.g., as primitives) by the execution circuitry(s). In one embodiment, the decoder circuitry is to decode certain (e.g., macro) instructions into corresponding sets of one or more micro-operations without utilizing a microcode sequencer (e.g., a microcode sequencer separate from any decoding cluster and/or decoder circuitry), and to decode other (e.g., macro) instructions into corresponding sets of one or more micro-operations by utilizing a microcode sequencer (e.g., a microcode sequencer separate from any decoding cluster and/or decoder circuitry). In one embodiment, the decoder circuit is configured to output a certain number of micro-operations per cycle (e.g., one micro-operation per cycle and/or between one and four micro-operations per cycle).
FIG. 1 illustrates a processor core 100 having multiple decode clusters 108A-108B and a shared microcode sequencer 128, according to an embodiment of the disclosure. Processor core 100 may be, for example, one of multiple cores of a processor of a system. The depicted processor core 100 includes a branch predictor 102 (e.g., for predicting one or more branches of code (e.g., instructions) to be executed by the processor core 100).
In some embodiments, branch operations (e.g., instructions) are either unconditional (e.g., the branch is taken every time the instruction is executed) or conditional (e.g., the direction taken for the branch depends on a condition), e.g., where the instructions to be executed following a conditional branch (e.g., a conditional jump) are not known with certainty until the condition on which the branch depends is resolved. Here, rather than waiting until the condition is resolved, the branch predictor 102 (e.g., branch predictor circuitry) of the processor may perform (e.g., speculatively) branch prediction to predict whether the branch will be taken or not taken, and/or (e.g., if predicted taken) predict the target instruction (e.g., target address) of the branch. In one embodiment, if a branch is predicted taken, processor core 100 fetches the instruction(s) of the taken direction (e.g., path) of the branch (e.g., the instruction found at the predicted branch target address) and speculatively executes them. In some embodiments, the instructions executed following a branch prediction are speculative until the processor has determined whether the prediction is correct. In some embodiments, processor core 100 resolves branch instructions at the back end of the pipeline circuitry (e.g., in execution circuitry(s) 140 and/or retirement (writeback) circuitry 138). In one embodiment, if a branch that was predicted taken is determined by the processor (e.g., by the back end) not to be taken, all instructions (and, for example, their data) that are currently in the pipeline circuitry behind that branch instruction are flushed (e.g., discarded). In some embodiments, the branch predictor 102 (e.g., branch predictor circuitry) learns from the past behavior of branches to predict the next (e.g., incoming) branch.
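The disclosure does not say how branch predictor 102 learns from past branch behavior. As one illustration only, the sketch below uses a two-bit saturating counter per branch address, a common textbook scheme that is swapped in here as an assumption; the class name `TwoBitPredictor` and its methods are hypothetical.

```python
class TwoBitPredictor:
    """Textbook two-bit saturating counter per branch address (illustrative assumption)."""

    def __init__(self):
        # Counter values: 0-1 predict not taken, 2-3 predict taken.
        self.counters = {}

    def predict(self, address):
        # Unseen branches start at 1 (weakly not taken) in this sketch.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        counter = self.counters.get(address, 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.counters[address] = counter
```

With this scheme a single contrary outcome does not immediately flip a strongly established prediction, which is why it is often used to illustrate learning from branch history.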
In some embodiments, the branch predictor 102 predicts a proper subset of the instructions (e.g., contiguous in the original program order) as a block of code (e.g., ending with a branch instruction). As one example, processor core 100 may receive code to be executed and, in response, may divide the code into blocks.
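A minimal sketch of dividing code into blocks that each end with a branch instruction follows. The function name `split_into_blocks`, the list-of-mnemonics representation, and the `is_branch` predicate are assumptions for illustration; the disclosure only states that blocks are contiguous in program order and may end with a branch.

```python
def split_into_blocks(instructions, is_branch):
    """Split a program-ordered instruction list into blocks that end at each branch."""
    blocks = []
    current = []
    for instruction in instructions:
        current.append(instruction)
        if is_branch(instruction):
            # A branch closes the current block.
            blocks.append(current)
            current = []
    if current:
        # Trailing instructions with no closing branch form a final block.
        blocks.append(current)
    return blocks
```

For example, with a predicate that treats any mnemonic starting with "j" as a branch, the stream add, jz, mov, mul, jmp, sub splits into three blocks.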
FIG. 2 illustrates an example clustered decoding program flow 200 in accordance with an embodiment of the disclosure, e.g., where cluster 0 is decode cluster 108A in FIG. 1 and cluster 1 is decode cluster 108B in FIG. 1. Program flow 200 illustrates (e.g., program) code (e.g., instructions) that is divided into code blocks A-F (e.g., where A is the "oldest" code block in program order and F is the "youngest" code block in program order), and each code block is assigned to either decode cluster 0 or decode cluster 1 for decoding.
Referring again to FIG. 1, processor core 100 (e.g., via fetch circuitry 104 and/or branch predictor 102) may send instruction blocks (e.g., blocks A-F in FIG. 2) to the decode clusters, e.g., where a first instruction block A is sent to decode cluster 0 108A, a second instruction block B (next in program order, e.g., younger) is sent to decode cluster N 108B, and so on. In one two-cluster example, the third instruction block C (next in program order, e.g., younger) may be sent to the next available decode cluster (e.g., after that decode cluster has completed decoding its current instruction block). In another two-cluster example, the third instruction block C may be sent to the next decode cluster in turn (e.g., to decode cluster 108A in this example). Although two decode clusters 108A-108B are shown, it should be understood that three or more decode clusters may be utilized (e.g., where "N" is a positive integer greater than one), for example, where all three or more decode clusters use a single microcode sequencer (e.g., all three or more decode clusters arbitrate for a single (e.g., unique) read port of a microcode sequencer (e.g., the unique microcode sequencer of the core and/or processor)).
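The alternating assignment of blocks to decode clusters (block A to cluster 0, block B to cluster 1, block C back to cluster 0) can be sketched as below. The function name `assign_blocks_in_turn` and the use of integer cluster indices are assumptions for illustration; the "next available cluster" variant is not modeled here.

```python
def assign_blocks_in_turn(blocks, num_clusters=2):
    """Map each block, in program order, to a decode cluster index in rotating order."""
    return {block: index % num_clusters for index, block in enumerate(blocks)}
```

Applied to blocks A through F with two clusters, blocks A, C, and E land on cluster 0 and blocks B, D, and F land on cluster 1.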
In certain embodiments, each decode cluster includes two or more (e.g., superscalar x86) instruction decoders that are capable of decoding different basic code blocks out of order with respect to one another, e.g., where decode cluster 108A includes a first decoder circuit 120A (e.g., a decoder) and a second decoder circuit 122A (e.g., a decoder), and decode cluster 108B includes a first decoder circuit 120B (e.g., a decoder) and a second decoder circuit 122B (e.g., a decoder).
In some embodiments, branch predictor 102 of processor core 100 divides code into blocks (e.g., from a set of consecutive instructions of a program), for example, by indicating a start instruction and/or an end instruction for each block. In some embodiments, fetch circuitry 104 of processor core 100 divides code into blocks (e.g., from a set of consecutive instructions of a program).
The individual code blocks may then be sent to their respective decode clusters for decoding, e.g., with the instructions to be decoded of each code block being stored in respective instruction data queues (e.g., instruction data queue 110A as an input queue for decode cluster 108A and instruction data queue 110B as an input queue for decode cluster 108B).
Optionally, processor core 100 includes (e.g., a first level) instruction cache 106 to, for example, cache one or more instructions without having to load them from memory. In some embodiments, the fetch circuitry 104 sends the code blocks to their respective decode clusters via the instruction cache 106. Instruction cache 106 may include an instruction cache tag and/or an instruction Translation Lookaside Buffer (TLB).
In some embodiments, once the code blocks are sent to their corresponding decode clusters 108A-108B (e.g., into the instruction data queue 110A of decode cluster 108A and the instruction data queue 110B of decode cluster 108B), the decode clusters begin decoding the code blocks in parallel (e.g., via the parallel decoder circuits therein). In some embodiments, the decode clusters operate independently of each other, so the code blocks may be decoded out of order (e.g., out of program order). In some embodiments, the distribution circuitry 138 is responsible for distributing the operations (e.g., micro-operations) to the execution circuitry 140 (e.g., execution units) in the proper program order.
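Because the clusters may finish decoding blocks out of program order, something downstream has to put the micro-operations back in order. A minimal sketch of that reordering follows; the sequence-number tagging, the function name `restore_program_order`, and the list-based buffering are assumptions for illustration and are not described in the disclosure.

```python
def restore_program_order(completed_blocks):
    """Take (sequence_number, micro_ops) pairs in completion order and
    return all micro-ops concatenated in program order."""
    pending = {}   # decoded blocks that are waiting for an older block
    next_seq = 0   # sequence number of the oldest block not yet released
    ordered = []
    for seq, micro_ops in completed_blocks:
        pending[seq] = micro_ops
        # Release every block that is now contiguous with what was already released.
        while next_seq in pending:
            ordered.extend(pending.pop(next_seq))
            next_seq += 1
    return ordered
```

If block 1 finishes decoding before block 0, its micro-operations are held until block 0 arrives, after which both are released in order.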
The depicted processor core 100 includes a first decode cluster 108A having a plurality of decoder circuits 120A-122A in a first set 112A and a second decode cluster 108B having a plurality of decoder circuits 120B-122B in a second set 112B. In some embodiments, each of the decoder circuits (120A, 122A, 120B, 122B) is to decode a (e.g., macro) instruction into a set of one or more micro-operations to be executed (e.g., as primitives) by the execution circuit(s) 140. In some embodiments, a decoder circuit (120A, 122A, 120B, 122B) is to decode certain (e.g., macro) instructions into a corresponding set of one or more micro-operations without utilizing the microcode sequencer 128 (e.g., a microcode sequencer separate from any decode cluster and/or decoder circuit), and/or to decode other (e.g., macro) instructions (e.g., complex instruction set computer (CISC) instructions) into a corresponding set of one or more micro-operations by utilizing the microcode sequencer 128. In one embodiment, a decoder circuit (120A, 122A, 120B, 122B) is to output a certain number of micro-operations per cycle (e.g., one micro-operation per cycle and/or between one and four micro-operations per cycle). In some embodiments, a "microcoded" instruction generally refers to an instruction for which a decode cluster (e.g., a set of decoders) requests the microcode sequencer 128 to load the corresponding set of one or more (e.g., a plurality of) micro-operations (μops) from the microcode sequencer memory 132 (e.g., read-only memory (ROM)) into the decode pipeline (e.g., into the corresponding instruction decode queue), rather than having the decoder circuit generate that set of micro-operations directly.
For example, to implement some (e.g., complex) (e.g., x86) instructions, the microcode sequencer 128 is used to break the instruction down into a sequence of smaller (e.g., micro) operations (also referred to as micro-ops or μops).
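The two decode paths described above (direct generation by a decoder circuit versus a request to the microcode sequencer) can be sketched as a simple dispatch. The instruction names, micro-operation names, and table contents are all hypothetical, and `decode` is an illustrative function, not part of the disclosure.

```python
# Hypothetical instructions a decoder circuit translates on its own.
DIRECT_DECODE = {
    "simple_add": ["uop_add"],
    "simple_mov": ["uop_mov"],
}
# Hypothetical microcoded instructions, whose flows live in the sequencer memory.
SEQUENCER_MEMORY = {
    "complex_op": ["uop_setup", "uop_step", "uop_step", "uop_finish"],
}


def decode(instruction):
    """Return (source, micro_ops), where source names the path that produced the flow."""
    if instruction in DIRECT_DECODE:
        return "decoder", DIRECT_DECODE[instruction]
    # Anything not handled directly is treated as microcoded in this sketch.
    return "microcode_sequencer", SEQUENCER_MEMORY[instruction]
```

Only the second path contends for the shared read port, which is why the arbitration discussed in the following paragraphs applies to microcoded instructions.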
In some embodiments, a microcode sequencer is utilized for many purposes, e.g., owing to the nature of x86 and/or the need to build sequences of many micro-operations, which results in a structure containing numerous (e.g., tens of thousands of) micro-operations. In certain embodiments, because these micro-operation sequences require a large amount of storage (e.g., greater than 100 Kilobits (KB)), the microcode sequencer 128 is physically built as a single (e.g., read) port memory 132 (e.g., ROM) array, and the decode clusters then share the single (e.g., unique) read port 131 of the microcode sequencer. Sharing the microcode sequencer 128 instead of replicating it yields a significant die area savings, for example, because the area of the microcode sequencer 128 is larger than the area of one of the decode clusters.
Since some (e.g., x86) instructions may map to numerous (e.g., tens, hundreds, etc.) corresponding micro-operations (e.g., and some of these sequences require an architecturally serializing action for the instruction, e.g., such sequences force all older operations to complete and prevent any younger operations from starting), once a decode cluster passes control to the microcode sequencer 128, the decode cluster must wait until the instruction's sequence of micro-operations completes (e.g., and the microcode sequencer releases control back to the decode cluster). In certain embodiments, microcode sequencer 128 includes arbitration circuitry 130 (e.g., arbitration logic circuitry) for granting access to a single decode cluster at a time.
For example, to ensure that processor core 100 does not hang (e.g., due to architectural serialization requirements), one embodiment of arbitration circuit 130 grants permission to use microcode sequencer 128 only to the oldest decode cluster (e.g., the decode cluster that is decoding the oldest instruction block in program order). Referring to FIG. 2, code block C is considered to be older in program order than code block D. In this embodiment, if a younger code block contains an instruction that requires the use of the microcode sequencer 128, the decode cluster stalls once it detects that instruction, and the stalled decode cluster resumes decoding once the code block it is decoding becomes the oldest code block among all of the decode clusters. However, in certain embodiments, such decode cluster stalls hurt performance, e.g., they prevent the decode clusters from operating in parallel.
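The in-order policy above, in which only the cluster decoding the oldest block may use the sequencer, can be sketched as follows. The dictionary fields (`id`, `block_age`, `wants_sequencer`) and the function name `arbitrate_in_order` are assumptions for illustration; a lower `block_age` means an older block in program order.

```python
def arbitrate_in_order(clusters):
    """Grant the microcode sequencer only to the cluster decoding the oldest block.

    Returns the granted cluster id, or None if the oldest cluster is not requesting
    (any younger requester stays stalled in that case).
    """
    oldest = min(clusters, key=lambda cluster: cluster["block_age"])
    return oldest["id"] if oldest["wants_sequencer"] else None
```

In this sketch a younger cluster that requests the sequencer receives no grant until its block becomes the oldest, which is the stall the text identifies as a performance cost.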
In some instruction sets, there are (e.g., performance-sensitive) (e.g., CISC) instructions that have fewer than a threshold number (e.g., 10) (e.g., arbitration threshold 134) of micro-operations and that, for example, do not require complex protocols (e.g., fences and/or serialization). In some embodiments, while the decode cluster determines (e.g., computes) the entry point (e.g., address) that such an instruction uses to index into memory 132 (e.g., ROM), the decode cluster may also generate the characteristics of the instruction (e.g., for determining whether it is eligible for out-of-order (e.g., out of code block order) access to the shared microcode sequencer). By knowing these characteristics (e.g., the number of cycles the microcode sequencer takes to generate the micro-operations of the instruction and/or the maximum number of micro-operations that the particular instruction (e.g., its microcode sequence) will require), in some embodiments it is possible to pre-allocate all necessary resources within the pipeline to ensure that control will be released deterministically after it is passed to the microcode sequencer 128. For example, in some embodiments, using this information, a younger decode cluster (e.g., a decode cluster that is decoding a younger block of instructions in program order) can access microcode sequencer 128 for one of these instructions without causing a functional problem.
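The eligibility condition described above (fewer micro-operations than the arbitration threshold and no fence or serialization requirement) can be written as a small predicate. The function name and parameter names are hypothetical; the default of 10 is the example threshold value the text mentions.

```python
ARBITRATION_THRESHOLD = 10  # example value given in the text


def eligible_for_out_of_order(num_micro_ops, needs_serialization,
                              threshold=ARBITRATION_THRESHOLD):
    """Return True if an instruction may use the sequencer from a younger cluster."""
    return num_micro_ops < threshold and not needs_serialization
```

An instruction with exactly the threshold number of micro-operations is not eligible in this sketch, since the text speaks of being below the threshold.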
In some embodiments, each decode cluster (e.g., each decoder circuit in some embodiments) includes a data structure (e.g., a programmable logic array) that stores, for one or more instructions, a corresponding entry point value (e.g., an address) into the memory 132 of the microcode sequencer 128 and/or a number of bits (e.g., indicating the number of cycles to generate the corresponding micro-operations of the instruction and/or the number of micro-operations of the instruction). For example, (1) data structure 114A of decode cluster 108A includes one or more entries that each indicate (e.g., for a single instruction) an entry point 116A for the instruction and/or bits 118A for the instruction (e.g., bypass enable or bypass eligibility bits that indicate, e.g., as an encoded value, the number of cycles to generate the corresponding micro-operations of the instruction and/or the number of micro-operations of the instruction), and/or (2) data structure 114B of decode cluster 108B includes one or more entries that each indicate (e.g., for a single instruction) an entry point 116B for the instruction and/or bits 118B for the instruction (e.g., bypass enable or bypass eligibility bits that indicate, e.g., as an encoded value, the number of cycles to generate the corresponding micro-operations of the instruction and/or the number of micro-operations of the instruction). See, for example, the discussion below of the encoded values for two bits. In certain embodiments, data structure 114A and data structure 114B are copies of each other, e.g., they contain the same data. In one embodiment, data structures 114A and 114B are loaded with their data at the time of manufacture. In one embodiment, data structures 114A and 114B are loaded with their data during processor boot, for example, by executing basic input/output system (BIOS) firmware or Unified Extensible Firmware Interface (UEFI) firmware.
In one embodiment, data structure 114A and data structure 114B are programmable logic arrays. As discussed below, in certain embodiments, the arbitration circuitry 130 uses data from the data structure 114A and/or the data structure 114B to arbitrate access to the microcode sequencer 128, e.g., access to a (e.g., unique) shared read port 131 of a memory 132 (e.g., a microcode sequencer ROM (MS-ROM)) (e.g., a (e.g., dedicated) memory within a processor core).
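The per-cluster lookup described above can be pictured with a minimal software sketch. It is an illustration only, not the disclosed hardware: the class names `MsEntry` and `MsRequest`, the function `make_request`, and every opcode, address, and bit value in the table are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MsEntry:
    """One entry of a per-cluster table (compare data structures 114A/114B)."""
    entry_point: int  # address into the microcode sequencer memory (compare 116A/116B)
    bypass_bits: int  # encoded cycle or micro-operation count (compare 118A/118B)


# Hypothetical contents: the opcodes, addresses, and bit values are invented.
MS_TABLE: Dict[int, MsEntry] = {
    0xA4: MsEntry(entry_point=0x0120, bypass_bits=0b00),
    0xC8: MsEntry(entry_point=0x0340, bypass_bits=0b11),
}


@dataclass(frozen=True)
class MsRequest:
    """What a decode cluster would send to the arbitration circuitry."""
    cluster_id: int
    entry_point: int
    bypass_bits: int


def make_request(cluster_id: int, opcode: int) -> Optional[MsRequest]:
    """Build a microcode sequencer request, or None if the opcode is not in the table."""
    entry = MS_TABLE.get(opcode)
    if entry is None:
        return None  # assumed to be handled by the cluster's own decoder circuits
    return MsRequest(cluster_id, entry.entry_point, entry.bypass_bits)
```

In this sketch both clusters would share the same table contents, mirroring the embodiment in which data structures 114A and 114B are copies of each other.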
Embodiments herein opportunistically allow out-of-order access to (e.g., read ports of) a shared microcode sequencer, e.g., reducing the amount of time a decode cluster is stalled, which improves decode bandwidth and performance. Embodiments herein allow out-of-order access to (e.g., a read port of) a shared microcode sequencer by a decode cluster that is decoding a younger code block when the instruction corresponds to a sequence of micro-operations below a threshold, such as a threshold number of clock cycles for the microcode sequencer to service the instruction and/or a threshold number of corresponding micro-operations in the microcode sequencer for the instruction. Some embodiments allow a decode cluster that is decoding a younger block of code to access the microcode sequencer if the decode cluster detects that the instruction requires the microcode sequencer for fewer than a certain number of clock cycles. Embodiments herein improve the performance of decode clusters by allowing the decode cluster(s) that are decoding younger blocks of code to access the microcode sequencer, so that, for example, the cluster is not stalled and can continue decoding instructions. A shared microcode sequencer may (or may not) be used together with (e.g., "templated") decoder circuitry that decodes (e.g., via a programmable logic array) some instructions into a (e.g., short, e.g., fewer than 5 μops) stream of micro-operations within the decoder circuitry (e.g., within a decoder channel). However, to use such templated decoders, some embodiments rely on hints that tell the set of decoders how to align an instruction with a particular decoder channel, e.g., where the hints are generated by running an Instruction Length Decode (ILD) block earlier in the pipeline; running the ILD may itself be a disadvantage because it can cost performance and power.
After the instructions are decoded into their respective micro-operations (e.g., by a decoder circuit or a microcode sequencer), in some embodiments, these micro-operations are stored in an instruction decode queue. In fig. 1 (e.g., at the end of the decode stage), decode cluster 108A includes an instruction decode queue 124A (e.g., an instruction queue), instruction decode queue 124A receives respective micro-operations from decoder circuits 120A-122A and from microcode sequencer 128 (e.g., when decode cluster 108A is arbitrated for access to memory 132), and decode cluster 108B includes an instruction decode queue 124B (e.g., an instruction queue), instruction decode queue 124B receives respective micro-operations from decoder circuits 120B-122B and from microcode sequencer 128 (e.g., when decode cluster 108B is arbitrated for access to memory 132). Optionally, a switch 136 is included to couple the output(s) of instruction decode queues 124A-124B to the input(s) of distribution circuit 138. In some embodiments, the distribution circuitry 138 is used to send micro-operations (e.g., in program order) from the instruction decode queues 124A-124B to execution circuitry in the execution circuitry 140 (e.g., based on the type of micro-operation and the type of execution circuitry, e.g., integer, vector, floating point, etc.). In one embodiment, one or more instruction decode queues are loaded out of program order but read in program order. Execution circuitry 140 may access storage, such as registers 142 and/or data cache 144 (e.g., one or more levels of cache hierarchy). Retirement circuitry 138 may then retire the corresponding instruction once the results are generated by execution circuitry 140.
FIG. 3 illustrates an arbitration circuit 130 for arbitrating access by multiple decode clusters 108A-108B to a microcode sequencer memory 132 (e.g., via a single read port 131), according to an embodiment of the present disclosure. Arbitration circuitry 130 may be part of microcode sequencer 128 or located elsewhere, e.g., as a separate component of processor core 100 in FIG. 1. A cluster request may include, for example, the following indications looked up in a data structure as discussed herein: an indication of the number of cycles for generating the corresponding micro-operations of the requesting instruction and/or an indication of the number of micro-operations of the requesting instruction. The arbitration circuit 130 may determine the number of cycles for generating the corresponding micro-operations of the requesting instruction and/or the number of micro-operations of the requesting instruction, e.g., via its own data structure storing such information. In some embodiments, the arbitration threshold 134 (which, in some embodiments, may be set by a user) is a value that indicates when out-of-order decode access to the memory 132 is allowed, for example, a maximum number of cycles for generating the corresponding micro-operations of the requesting instruction and/or a maximum number of micro-operations of the requesting instruction for which out-of-order access to the memory 132 is permitted. In some embodiments, the decode cluster request is generated by a decode cluster that detects an instruction to be serviced by the microcode sequencer. In some embodiments, the decode cluster uses the opcode and/or other information encoded in the macro instruction to read a data structure to determine whether the macro instruction should be serviced by the microcode sequencer 128.
FIG. 4 illustrates a flow diagram 400 for arbitrating in-order access of microcode sequencer memory by multiple decode clusters in accordance with an embodiment of the present disclosure. Referring to fig. 2, code block D is "older" than code block E because code block D precedes code block E in program order. Thus, if decoding cluster 1 is decoding code block D while cluster 0 is decoding code block E, decoding cluster 1 is decoding the older code block. One method of arbitrating in-order access allows only the decoding cluster that is decoding the oldest code block to access the microcode sequencer, e.g., if a decoding cluster is decoding a younger code block and it requests the microcode sequencer, the cluster is stalled until its code block becomes the oldest, and once the code block is the oldest, the decode cluster may access the microcode sequencer. The flow chart 400 includes: at 402, a decode cluster (e.g., "X," where X is an identifier of one of a plurality of decode clusters) requests the microcode sequencer. At 404, it is determined whether the request is from the oldest code block being decoded. If yes at 404, access to the microcode sequencer is granted for the request at 406. If no at 404, the cluster is stalled at 408 (e.g., for some number of cycles), and then at 410 it is rechecked whether the request is now from the oldest code block being decoded (e.g., because decoding of the code block that previously blocked the request has completed). If no at 410, the cluster is stalled again at 408 (e.g., for some number of cycles) and the check at 410 is repeated; if yes at 410, access to the microcode sequencer is granted for the request at 406.
As an example, according to fig. 2, it is assumed that decoding cluster 1 is decoding code block D while cluster 0 is decoding code block E. If decode cluster 1 decodes an instruction that requires a microcode sequencer, then decode cluster 1 will be granted access to the microcode sequencer. However, in this example, if decode cluster 0 were to decode an instruction requiring a microcode sequencer, cluster 0 would be stalled until cluster 1 begins decoding code block F, e.g., once decode cluster 1 begins decoding code block F, decode cluster 0 would be granted access to the microcode sequencer because code block E is "older" than code block F.
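The in-order rule and the D/E/F example can be sketched as follows. This is a simplified software model under stated assumptions, not the circuit itself: code blocks are named by single letters in program order, and `PROGRAM_ORDER`, `age_rank`, and `in_order_decision` are hypothetical names introduced for illustration.

```python
from typing import Iterable

PROGRAM_ORDER = "ABCDEF"  # code blocks of the example, oldest first


def age_rank(block: str) -> int:
    """Smaller rank means older in program order."""
    return PROGRAM_ORDER.index(block)


def in_order_decision(requesting_block: str, blocks_being_decoded: Iterable[str]) -> str:
    """Grant only when the request comes from the oldest block being decoded."""
    oldest = min(blocks_being_decoded, key=age_rank)
    return "grant" if requesting_block == oldest else "stall"


# Cluster 1 decodes D while cluster 0 decodes E.
print(in_order_decision("D", ["E", "D"]))  # D is the oldest block in flight
print(in_order_decision("E", ["E", "D"]))  # E must wait for D
# Once cluster 1 has moved on to F, E is the oldest block being decoded.
print(in_order_decision("E", ["E", "F"]))
```

A stalled request would simply be re-evaluated with the same function on a later cycle, which corresponds to the loop between 408 and 410.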
However, in some embodiments, this arbitration scheme has drawbacks. For example, if the microcode sequencer is not in use by the decoding cluster that is decoding an older code block, a decoding cluster that is decoding a younger code block could use the microcode sequencer for (e.g., CISC) instructions while the other decoding cluster continues decoding the older code block. With the arbitration scheme in FIG. 4, however, in some embodiments, when a decoder decoding a younger block of code detects any (e.g., CISC) instruction that is to use the microcode sequencer, the decoder must stall even if the microcode sequencer is not in use, and such stalling reduces the decode bandwidth and starves the allocation and execution circuitry, resulting in performance issues.
Next, an example is described with reference to fig. 2, where decoding cluster 0 will decode code blocks A, C and E (e.g., one code block at a time), while decoding cluster 1 will decode code blocks B, D and F (e.g., one code block at a time). In one embodiment, code blocks A, C and E each contain a certain number (e.g., 12-15) of instructions that do not require the microcode sequencer, while code blocks B, D and F each contain a certain number (e.g., 4-5) of instructions, one of which (e.g., among the first three) is an instruction that is to use the microcode sequencer. For this code structure, the processor core may encounter the following scenario: decoding cluster 0 starts decoding code block A and decoding cluster 1 starts decoding code block B, but decoding cluster 1 immediately stalls because it detects (e.g., attempts to decode) an instruction that is to use the microcode sequencer and code block B is younger than code block A; decoding cluster 0 completes decoding code block A and starts decoding code block C, and decoding cluster 1 resumes decoding code block B, since code block B is now the oldest block and now has access to the microcode sequencer; decoding cluster 1 completes decoding code block B and starts decoding code block D while decoding cluster 0 has not yet completed decoding code block C, so decoding cluster 1 immediately stalls again because it detects (e.g., attempts to decode) an instruction that is to use the microcode sequencer and code block D is younger than code block C. Such stalling (e.g., stuttering) prevents decode cluster 0 from decoding instructions in parallel with decode cluster 1 and yields lower performance than decode clusters operating in parallel.
Embodiments herein address this performance issue by improving microcode sequencer access arbitration, such as shown in FIG. 5.
FIG. 5 illustrates a flow diagram 500 for arbitrating out-of-order access of multiple decode clusters to a microcode sequencer memory according to an embodiment of the present disclosure. The flow chart 500 includes: at 502, a decode cluster (e.g., "X," where X is an identifier of one of a plurality of decode clusters) requests the microcode sequencer. At 504, it is determined whether the request is from the oldest code block being decoded. If yes at 504, it is checked at 514 whether the microcode sequencer is in use; if yes at 514, the cluster is stalled at 516 (e.g., for a certain number of cycles) and the check at 514 is repeated, and if no at 514, access to the microcode sequencer is granted for the request at 512. If no at 504, it is determined at 506 whether the instruction is eligible for access from a younger code block. If yes at 506, the flow proceeds to the in-use check at 514 described above. If no at 506, the cluster is stalled at 508 (e.g., for a certain number of cycles), and then it is checked at 510 whether the request is now from the oldest code block being decoded (e.g., because decoding of the code block that previously blocked the request has completed); if no at 510, the cluster is stalled again at 508 (e.g., for a certain number of cycles) and the check at 510 is repeated, and if yes at 510, access to the microcode sequencer is granted for the request at 512.
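One reading of flow diagram 500 can be written as a single decision function. The sketch assumes that a "no" at 514 (sequencer not in use) leads to the grant at 512, and it evaluates one arbitration attempt rather than modeling the stall loops cycle by cycle. The function name and the returned labels are hypothetical; the numerals in the labels only point back to the blocks of the flow diagram.

```python
def out_of_order_decision(is_oldest: bool, is_eligible: bool, sequencer_in_use: bool) -> str:
    """Evaluate one arbitration attempt for a decode cluster's request (502)."""
    if is_oldest or is_eligible:
        # 504 yes, or 506 yes: only the in-use check at 514 remains.
        return "stall_516" if sequencer_in_use else "grant_512"
    # 504 no and 506 no: wait at 508 until the block becomes the oldest (510).
    return "stall_508"


# A younger block with an eligible instruction and an idle sequencer is granted.
print(out_of_order_decision(is_oldest=False, is_eligible=True, sequencer_in_use=False))
# A younger block with an ineligible instruction falls back to in-order waiting.
print(out_of_order_decision(is_oldest=False, is_eligible=False, sequencer_in_use=False))
```

A caller would re-invoke the function with updated inputs after each stall, which stands in for the 514/516 and 508/510 loops.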
In certain embodiments, the checking for eligibility at 506 includes: (1) checking whether the number of cycles (or other time period) for generating the corresponding micro-operations of the requesting instruction is below a threshold (e.g., arbitration threshold 134 in figs. 1-2), and/or (2) checking whether the number of micro-operations of the requesting instruction is below a threshold (e.g., arbitration threshold 134 in figs. 1-2). In some embodiments, the checking for eligibility at 506 includes checking whether the corresponding instruction decode queue (e.g., the instruction decode queue of the requesting decode cluster) has available storage space for the micro-operations of the instruction. In some embodiments, the microcode sequencer includes a single write port 133 that is switched between the instruction decode queues, e.g., to select the instruction decode queue of the decode cluster whose instruction has been granted access to the microcode sequencer (e.g., to shared read port 131 in figs. 1-2), so that the instruction's corresponding micro-operations are stored in the correct instruction decode queue. In some embodiments, the arbitration threshold 134 is stored in a memory such as that of FIGS. 1-2.
In some embodiments, (1) the arbitration threshold 134 (e.g., a threshold number of cycles) is compared to the number of cycles (or other time period) for generating the corresponding micro-operations of the requesting instruction, and/or (2) the arbitration threshold 134 (e.g., a threshold number of micro-operations) is compared to the number of micro-operations of the requesting instruction, and the request is granted, for example, when the arbitration threshold(s) is not exceeded (and/or is equaled). In one embodiment, the number of cycles (or other time period) for generating the corresponding micro-operations of the requesting instruction is indicated by one or more bits (e.g., bypass qualifier bits) provided by the request to the microcode sequencer, for example, as determined by searching the data structures (e.g., data structures 114A-114B). In one embodiment, the request includes: (i) an entry point (e.g., an address) into a memory (e.g., memory 132 (e.g., MS-ROM)) that stores the micro-operations, and (ii) one or more bits (e.g., bypass qualifier bits) that indicate whether an arbitration threshold has been exceeded. In some embodiments, the one or more bits (e.g., the bypass qualifier bits) are a plurality of bits, e.g., a two-bit encoded value, e.g., where 00 indicates a first number of cycles (e.g., one cycle) or micro-operations (e.g., three micro-operations) for the particular instruction, where 01 indicates a second number of cycles (e.g., two cycles) or micro-operations (e.g., six micro-operations) for the particular instruction, where 10 indicates a third number of cycles (e.g., three cycles) or micro-operations (e.g., nine micro-operations) for the particular instruction, and where 11 indicates a fourth number of cycles (e.g., four or more cycles) or micro-operations (e.g., ten or more micro-operations) for the particular instruction. In one embodiment, the arbitration threshold is ten micro-operations, e.g., at 506 in FIG. 5, e.g., such that one or more bits (e.g., bypass qualification bits) of 00, 01, or 10 in the above example indicate that the instruction qualifies for out-of-order use of the microcode sequencer, and one or more bits (e.g., bypass qualification bits) of 11 in the above example indicate that the instruction is ineligible for out-of-order use of the microcode sequencer.
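The two-bit example encoding and the ten-micro-operation threshold can be checked with a short sketch. It assumes that each encoded value is treated as an upper bound on the micro-operation count and that 11 means "ten or more"; the names `UOP_BOUND`, `ARBITRATION_THRESHOLD_UOPS`, and `is_eligible` are hypothetical.

```python
from typing import Dict, Optional

# Example encoding from the text: 00 -> 3, 01 -> 6, 10 -> 9 micro-operations;
# 11 -> ten or more (no finite bound, represented here as None).
UOP_BOUND: Dict[int, Optional[int]] = {0b00: 3, 0b01: 6, 0b10: 9, 0b11: None}

ARBITRATION_THRESHOLD_UOPS = 10  # example threshold of ten micro-operations


def is_eligible(bypass_bits: int, threshold: int = ARBITRATION_THRESHOLD_UOPS) -> bool:
    """True when the encoded micro-operation bound is below the threshold."""
    bound = UOP_BOUND[bypass_bits]
    return bound is not None and bound < threshold


for bits in (0b00, 0b01, 0b10, 0b11):
    print(format(bits, "02b"), is_eligible(bits))
```

With the example values, 00, 01, and 10 come out eligible and 11 does not, which matches the eligibility outcome stated in the text. A different threshold would change which encodings qualify.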
Additionally or alternatively, for example, at 506 in fig. 5, the available storage space (e.g., slots) in the target instruction decode queue is used in the eligibility check, e.g., the number of micro-operations of the instruction (e.g., as determined from the one or more bits (e.g., bypass qualifier bits)) is compared to the available storage space (e.g., slots) in the target instruction decode queue, and if that amount of storage space is not available, the instruction is ineligible for out-of-order use of the microcode sequencer (e.g., out-of-order decode access to the memory 132 is not allowed).
Embodiments herein include arbitration of access to a microcode sequencer that allows a decode cluster (e.g., its decoder circuitry) that is decoding a younger code block to access the microcode sequencer when the microcode sequencer is not in use and the decode cluster (e.g., its decoder circuitry) detects a (e.g., CISC) instruction that is eligible for access from the younger code block. Embodiments herein thus improve the performance of the code described above (e.g., in fig. 2) because, when decode cluster 1 detects an instruction that is to use the microcode sequencer, decode cluster 1 does not stall and the decode clusters can operate in parallel, essentially doubling the decode bandwidth compared to the in-order arbitration scheme.
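The combined check (threshold plus queue space) might be sketched as below. The `DecodeQueue` class, its fields, and the capacity of 32 slots used in the usage lines are hypothetical and are not taken from the disclosure; the threshold default of ten micro-operations follows the example value given earlier.

```python
from dataclasses import dataclass


@dataclass
class DecodeQueue:
    """Toy stand-in for an instruction decode queue (compare 124A/124B)."""
    capacity: int
    occupied: int = 0

    @property
    def free_slots(self) -> int:
        return self.capacity - self.occupied


def eligible_out_of_order(uop_count: int, queue: DecodeQueue, threshold: int = 10) -> bool:
    """Eligible only if the count is below the threshold and the queue can hold it."""
    return uop_count < threshold and uop_count <= queue.free_slots


print(eligible_out_of_order(6, DecodeQueue(capacity=32, occupied=20)))  # 12 slots free
print(eligible_out_of_order(9, DecodeQueue(capacity=32, occupied=28)))  # only 4 slots free
```

Checking queue space before granting access is what lets the sequencer release control deterministically: every micro-operation it emits for the younger block already has a slot reserved.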
FIG. 6 is a flow diagram illustrating operations 600 for arbitrating out-of-order access to microcode sequencer memory by multiple decode clusters according to embodiments of the present disclosure. Some or all of the operations 600 (or other processes, or variations, and/or combinations thereof described herein) are performed under control of arbitration circuitry (e.g., of a microcode sequencer).
The operations 600 include: at block 602, a first instruction block is sent to a first decode cluster of a processor for decoding, the first decode cluster including a plurality of decoder circuits. The operations 600 further include: at block 604, a second instruction block younger in program order than the first instruction block is sent to a second decode cluster of the processor for decoding, the second decode cluster including a plurality of decoder circuits. The operations 600 further include: at block 606, access of the first decode cluster and the second decode cluster to a shared read port of a microcode sequencer including a memory storing a plurality of micro-operations is arbitrated, via an arbitration circuit of the processor, to allow access to the shared read port by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
Exemplary architectures, systems, etc. that may be used above are detailed below.
At least some embodiments of the disclosed technology may be described in terms of the following examples:
Example 1. a hardware processor core, comprising:
a first decoding cluster comprising a plurality of decoder circuits;
a second decoding cluster comprising a plurality of decoder circuits;
fetch circuitry to fetch a first instruction block and send the first instruction block to the first decode cluster for decoding, and to fetch a second instruction block younger in program order than the first instruction block and send the second instruction block to the second decode cluster for decoding;
a microcode sequencer including a memory storing a plurality of micro-operations; and
arbitration circuitry to arbitrate access by the first decoding cluster and the second decoding cluster to a shared read port of the memory, wherein the arbitration circuitry is to: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
Example 2 the hardware processor core of example 1, wherein the arbitration circuitry is to: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is above or equal to the arbitration threshold, stall access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the hardware processor core.
Example 3 the hardware processor core of example 2, wherein the stall is a stall of decoding by the second decoding cluster.
Example 4 the hardware processor core of example 1, wherein the arbitration circuitry is to: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
Example 5 the hardware processor core of example 1, wherein the second decode cluster includes a data structure to store one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for instructions of the second instruction block, and to send the one or more bits to the arbitration circuitry in response to a request to decode instructions of the second instruction block.
Example 6 the hardware processor core of example 5, wherein the data structure of the second decode cluster is to store an entry point value indicating an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block, and the second decode cluster is to send the entry point value and the one or more bits to the arbitration circuitry in response to the request to decode an instruction of the second instruction block.
Example 7 the hardware processor core of example 1, wherein the arbitration circuitry is to: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, access to the shared read port of the memory is allowed to the first decode cluster that decodes the first instruction block but not the second decode cluster that decodes the second instruction block.
Example 8 the hardware processor core of example 1, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
Example 9. a method, comprising:
sending a first instruction block to a first decode cluster of a processor for decoding, the first decode cluster comprising a plurality of decoder circuits;
sending a second instruction block younger in program order than the first instruction block to a second decode cluster of the processor for decoding, the second decode cluster comprising a plurality of decoder circuits; and
arbitrating, via an arbitration circuit of the processor, access by the first decoding cluster and the second decoding cluster to a shared read port of a microcode sequencer that includes a memory storing a plurality of micro-operations to allow access to the shared read port by the second decoding cluster decoding the second instruction block but not the first decoding cluster decoding the first instruction block when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
Example 10 the method of example 9, wherein the arbitrating comprises: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is greater than or equal to the arbitration threshold, stalling access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the first decode cluster and the second decode cluster.
Example 11 the method of example 10, wherein the stalling is stalling of decoding by the second decoding cluster.
Example 12 the method of example 9, wherein the arbitrating comprises: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
Example 13. the method of example 9, further comprising:
reading one or more bits of a data structure from the second decode cluster, the one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for an instruction of the second instruction block; and
sending the one or more bits to the arbitration circuitry in response to a request to decode an instruction of the second instruction block.
Example 14 the method of example 13, further comprising:
reading an entry point value of the data structure from the second decode cluster, the entry point value indicating an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block; and
sending the entry point value and the one or more bits to the arbitration circuit in response to the request to decode an instruction of the second instruction block.
Example 15 the method of example 9, wherein the arbitrating comprises: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, access to the shared read port of the memory is allowed to the first decode cluster that decodes the first instruction block but not the second decode cluster that decodes the second instruction block.
Example 16. the method of example 9, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
Example 17 a hardware processor core, comprising:
a first decoding cluster comprising a plurality of decoder circuits;
a second decoding cluster comprising a plurality of decoder circuits;
a branch predictor to identify a first instruction block and a second instruction block younger in program order than the first instruction block, to cause the first instruction block to be sent to the first decode cluster for decoding, and to cause the second instruction block to be sent to the second decode cluster for decoding;
a microcode sequencer including a memory storing a plurality of micro-operations; and
arbitration circuitry to arbitrate access by the first decoding cluster and the second decoding cluster to a shared read port of the memory, wherein the arbitration circuitry is to: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
Example 18 the hardware processor core of example 17, wherein the arbitration circuitry is to: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is above or equal to the arbitration threshold, stall access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the hardware processor core.
Example 19 the hardware processor core of example 18, wherein the stall is a stall of decoding by the second decoding cluster.
Example 20 the hardware processor core of example 17, wherein the arbitration circuitry is to: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
Example 21 the hardware processor core of example 17, wherein the second decode cluster includes a data structure to store one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for instructions of the second instruction block, and to send the one or more bits to the arbitration circuitry in response to a request to decode instructions of the second instruction block.
Example 22 the hardware processor core of example 21, wherein the data structure of the second decode cluster is to store an entry point value indicating an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block, and the second decode cluster is to send the entry point value and the one or more bits to the arbitration circuitry in response to the request to decode the instruction of the second instruction block.
Example 23 the hardware processor core of example 17, wherein the arbitration circuitry is to: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, access to the shared read port of the memory is allowed to the first decode cluster that decodes the first instruction block but not the second decode cluster that decodes the second instruction block.
Example 24 the hardware processor core of example 17, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
In yet another embodiment, an apparatus comprises a data storage device storing code that, when executed by a hardware processor, causes the hardware processor to perform any of the methods disclosed herein. The apparatus may be as described in the detailed description. The method may be as described in the detailed description.
The instruction set may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed, and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included) and/or to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source 1/destination and source 2); an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. A set of SIMD extensions referred to as the Advanced Vector Extensions (AVX) (AVX1 and AVX2), which use the Vector Extensions (VEX) coding scheme, has been released and/or published (see, e.g., the Intel® 64 and IA-32 Architectures Software Developer's Manual, November 2018; and the Intel® Architecture Instruction Set Extensions Programming Reference, October 2018).
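The notion of an instruction format as a set of fields at fixed bit positions can be illustrated with a small sketch. The 16-bit layout below (a 4-bit opcode followed by two 6-bit operand fields) and the opcode value are hypothetical and chosen only for illustration; they are not the x86 or VEX encoding.

```python
# Hypothetical 16-bit instruction format (not a real ISA encoding):
#   bits 15..12  opcode
#   bits 11..6   source 1 / destination register
#   bits  5..0   source 2 register
OPCODE_ADD = 0x1  # assumed opcode value for ADD


def encode(opcode: int, src1_dst: int, src2: int) -> int:
    """Pack the three fields into one 16-bit instruction word."""
    assert 0 <= opcode < 16 and 0 <= src1_dst < 64 and 0 <= src2 < 64
    return (opcode << 12) | (src1_dst << 6) | src2


def decode(word: int) -> dict:
    """Extract each field from its fixed bit position."""
    return {
        "opcode": (word >> 12) & 0xF,
        "src1_dst": (word >> 6) & 0x3F,
        "src2": word & 0x3F,
    }


# An ADD appearing in the instruction stream carries specific operand contents.
add_r3_r5 = encode(OPCODE_ADD, 3, 5)
fields = decode(add_r3_r5)
```

Here the opcode field selects the operation, while the two operand fields select which registers a particular occurrence of ADD uses.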
Exemplary core architecture, processor, and computer architecture
Processor cores can be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU comprising one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor comprising one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores); and 4) a system on a chip that may include, on the same die, the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Exemplary core architecture
In-order and out-of-order core block diagrams
FIG. 7A is a block diagram illustrating both an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline according to embodiments of the disclosure. FIG. 7B is a block diagram illustrating both an example embodiment of an in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the disclosure. The solid-line boxes in FIGS. 7A-7B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed-line boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.
In FIG. 7A, a processor pipeline 700 includes a fetch stage 702, a length decode stage 704, a decode stage 706, an allocation stage 708, a renaming stage 710, a scheduling (also known as dispatch or issue) stage 712, a register read/memory read stage 714, an execution stage 716, a write back/memory write stage 718, an exception handling stage 722, and a commit stage 724.
Fig. 7B shows a processor core 790, the processor core 790 including a front end unit 730, the front end unit 730 coupled to an execution engine unit 750, and both the front end unit 730 and the execution engine unit 750 coupled to a memory unit 770. The core 790 may be a Reduced Instruction Set Computing (RISC) core, a Complex Instruction Set Computing (CISC) core, a Very Long Instruction Word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 790 may be a special-purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.
Front end unit 730 includes a branch prediction unit 732, the branch prediction unit 732 coupled to an instruction cache unit 734, the instruction cache unit 734 coupled to an instruction Translation Lookaside Buffer (TLB) 736, the instruction translation lookaside buffer 736 coupled to an instruction fetch unit 738, the instruction fetch unit 738 coupled to a decode unit 740. Decode unit 740 (or a decoder or decoder unit) may decode instructions (e.g., macro-instructions) and generate as output one or more micro-operations, microcode entry points, micro-instructions, other instructions, or other control signals decoded from or otherwise reflective of the original instructions. Decode unit 740 may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, Programmable Logic Arrays (PLAs), microcode read-only memories (ROMs), and the like. In one embodiment, the core 790 includes a microcode ROM or other medium (e.g., within the decode unit 740, or otherwise within the front end unit 730) that stores microcode for certain macro-instructions. The decode unit 740 is coupled to a rename/allocator unit 752 in the execution engine unit 750.
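As a rough illustration of a look-up-table decoder backed by a microcode ROM, the sketch below maps simple macro-instructions directly to micro-operations and complex ones to a microcode entry point. The instruction names, micro-operation names, and ROM address are hypothetical and do not describe any actual decoder.

```python
# Hypothetical decode table: simple macro-instructions map directly to
# micro-operations; complex ones map to a microcode ROM entry point.
DECODE_TABLE = {
    "ADD": ("uops", ["add"]),
    "PUSH": ("uops", ["store", "sub_sp"]),
    "REP_MOVS": ("microcode", 0x40),  # assumed ROM entry address
}

# Hypothetical microcode ROM: entry point -> micro-operation sequence.
MICROCODE_ROM = {
    0x40: ["load", "store", "inc_ptrs", "dec_count", "loop_if_nonzero"],
}


def decode(macro_instruction: str) -> list:
    """Return the micro-operations for one macro-instruction."""
    kind, payload = DECODE_TABLE[macro_instruction]
    if kind == "uops":
        return list(payload)  # decoded directly by the table
    return list(MICROCODE_ROM[payload])  # fetched from the microcode ROM


simple = decode("ADD")
complex_flow = decode("REP_MOVS")
```

In this sketch the table lookup yields either the micro-operations themselves or an entry point into the ROM that stores the microcode for that macro-instruction.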
The execution engine unit 750 includes a rename/allocator unit 752, the rename/allocator unit 752 being coupled to a retirement unit 754 and a set 756 of one or more scheduler units. Scheduler unit(s) 756 represent any number of different schedulers, including reservation stations, central instruction windows, and the like. Scheduler unit(s) 756 are coupled to physical register file unit(s) 758. Each physical register file unit of physical register file unit(s) 758 represents one or more physical register files, where different physical register files store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, state (e.g., an instruction pointer that is an address of a next instruction to be executed), and so forth. In one embodiment, physical register file unit(s) 758 include vector register units, writemask register units, and scalar register units. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. Physical register file unit(s) 758 are overlapped by retirement unit 754 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s), retirement register file(s); using register maps and register pools, etc.). Retirement unit 754 and physical register file unit(s) 758 are coupled to execution cluster(s) 760. Execution cluster(s) 760 include a set of one or more execution units 762 and a set of one or more memory access units 764. The execution units 762 may perform various operations (e.g., shifts, additions, subtractions, multiplications) and may perform on various data types (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). 
While some embodiments may include multiple execution units dedicated to a particular function or set of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. Scheduler unit(s) 756, physical register file unit(s) 758, and execution cluster(s) 760 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file unit(s), and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has the memory access unit(s) 764). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set 764 of memory access units is coupled to the memory unit 770, the memory unit 770 including a data TLB unit 772, the data TLB unit 772 coupled to a data cache unit 774, the data cache unit 774 coupled to a second level (L2) cache unit 776. In one example embodiment, the memory access unit 764 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 772 in the memory unit 770. The instruction cache unit 734 is also coupled to a second level (L2) cache unit 776 in the memory unit 770. The L2 cache unit 776 is coupled to one or more other levels of cache and ultimately to main memory.
In some embodiments, prefetch circuitry 778 is included to prefetch data, e.g., to predict access addresses and to bring data for those addresses (e.g., from memory 780) into one or more caches.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 700 as follows: 1) the instruction fetch unit 738 performs the fetch stage 702 and the length decode stage 704; 2) the decode unit 740 performs the decode stage 706; 3) the rename/allocator unit 752 performs the allocation stage 708 and the renaming stage 710; 4) the scheduler unit(s) 756 perform the scheduling stage 712; 5) the physical register file unit(s) 758 and the memory unit 770 perform the register read/memory read stage 714, and the execution cluster 760 performs the execution stage 716; 6) the memory unit 770 and the physical register file unit(s) 758 perform the write back/memory write stage 718; 7) various units may be involved in the exception handling stage 722; and 8) the retirement unit 754 and the physical register file unit(s) 758 perform the commit stage 724.
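The stage-to-unit assignment listed above can be written down as a small table, which makes it easy to ask which stages a given unit takes part in. This is only a restatement of the mapping in the text as data; the helper name `stages_for` is an illustrative assumption.

```python
# (stage reference numeral, stage name, unit(s) performing the stage),
# taken from the mapping of pipeline 700 onto core 790 described in the text.
PIPELINE_700 = [
    (702, "fetch", ["instruction fetch unit 738"]),
    (704, "length decode", ["instruction fetch unit 738"]),
    (706, "decode", ["decode unit 740"]),
    (708, "allocation", ["rename/allocator unit 752"]),
    (710, "renaming", ["rename/allocator unit 752"]),
    (712, "scheduling", ["scheduler unit(s) 756"]),
    (714, "register read/memory read",
     ["physical register file unit(s) 758", "memory unit 770"]),
    (716, "execute", ["execution cluster 760"]),
    (718, "write back/memory write",
     ["memory unit 770", "physical register file unit(s) 758"]),
    (722, "exception handling", ["various units"]),
    (724, "commit",
     ["retirement unit 754", "physical register file unit(s) 758"]),
]


def stages_for(unit: str) -> list:
    """Hypothetical helper: names of the stages a unit participates in."""
    return [name for _, name, units in PIPELINE_700 if unit in units]


memory_stages = stages_for("memory unit 770")
```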
The core 790 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instruction(s) described herein. In one embodiment, the core 790 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be appreciated that a core may support multithreading (executing two or more parallel sets of operations or threads), and that multithreading may be accomplished in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, such as in Intel® Hyper-Threading Technology).
Although register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. Although the illustrated embodiment of the processor also includes a separate instruction and data cache unit 734/774 and a shared L2 cache unit 776, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level one (L1) internal cache or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache external to the core and/or the processor. Alternatively, all caches may be external to the core and/or processor.
Specific exemplary in-order core architecture
Fig. 8A-8B illustrate block diagrams of more specific example in-order core architectures that would be one of several logic blocks in a chip, including other cores of the same type and/or different types. Depending on the application, the logic blocks communicate with some fixed function logic, memory I/O interfaces, and other necessary I/O logic over a high bandwidth interconnection network (e.g., a ring network).
Figure 8A is a block diagram of a single processor core and its connection to the on-die interconnect network 802 and its local subset of the second level (L2) cache 804, according to an embodiment of the present disclosure. In one embodiment, the instruction decode unit 800 supports the x86 instruction set with a packed data instruction set extension. The L1 cache 806 allows low latency access to cache memory into scalar and vector units. While in one embodiment (to simplify the design), a scalar unit 808 and a vector unit 810 use separate register sets (respectively, scalar registers 812 and vector registers 814), and data transferred between these registers is written to memory and then read back in from a level one (L1) cache 806, alternative embodiments of the present disclosure may use different approaches (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).
The local subset 804 of the L2 cache is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset 804 of the L2 cache. Data read by a processor core is stored in its L2 cache subset 804 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 804 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bidirectional to allow agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 512 bits wide per direction.
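The read and write behavior of the per-core L2 subsets can be sketched as follows. This is a deliberately simplified model under stated assumptions: the class name `GlobalL2`, the dictionary-based storage, and the write-through to backing memory are all illustrative choices, and the ring network that carries the flush messages is not modeled.

```python
class GlobalL2:
    """Toy model of a global L2 cache split into one local subset per core."""

    def __init__(self, num_cores: int, memory: dict):
        self.memory = memory  # backing memory (assumed, for illustration)
        self.subsets = [{} for _ in range(num_cores)]

    def read(self, core: int, addr: int) -> int:
        # Data read by a core is stored in that core's own subset.
        subset = self.subsets[core]
        if addr not in subset:
            subset[addr] = self.memory[addr]
        return subset[addr]

    def write(self, core: int, addr: int, value: int) -> None:
        # Data written by a core is flushed from the other subsets...
        for other, subset in enumerate(self.subsets):
            if other != core:
                subset.pop(addr, None)
        # ...and stored in the writer's own subset.
        self.subsets[core][addr] = value
        self.memory[addr] = value  # write-through is a simplifying assumption


l2 = GlobalL2(num_cores=2, memory={0x10: 7})
first = l2.read(0, 0x10)
l2.read(1, 0x10)
l2.write(0, 0x10, 9)
flushed = 0x10 not in l2.subsets[1]
reread = l2.read(1, 0x10)
```

After core 0 writes the line, core 1's stale copy is gone, so its next read observes the new value.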
Figure 8B is an expanded view of a portion of the processor core in figure 8A according to an embodiment of the present disclosure. FIG. 8B includes the L1 data cache 806A, which is part of the L1 cache 806, as well as more detail regarding the vector unit 810 and the vector registers 814. In particular, vector unit 810 is a 16-wide Vector Processing Unit (VPU) (see 16-wide ALU 828) that executes one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports blending of register inputs through blending unit 820, numerical conversion through numerical conversion units 822A-B, and replication of memory inputs through replication unit 824. Write mask registers 826 allow the resulting vector writes to be masked.
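The effect of a write mask on a 16-wide vector result can be sketched as below. The text does not say whether masked-off lanes keep their old value or are zeroed; this sketch assumes the old destination value is kept (merging behavior), and the function name `masked_add` is hypothetical.

```python
VECTOR_WIDTH = 16  # lanes, matching the 16-wide VPU described in the text


def masked_add(dst: list, a: list, b: list, write_mask: int) -> list:
    """Return dst with lane i replaced by a[i] + b[i] where mask bit i is set.

    Lanes whose mask bit is clear keep their previous dst value
    (merging behavior, assumed for illustration).
    """
    assert len(dst) == len(a) == len(b) == VECTOR_WIDTH
    return [
        a[i] + b[i] if (write_mask >> i) & 1 else dst[i]
        for i in range(VECTOR_WIDTH)
    ]


dst = [0] * VECTOR_WIDTH
a = list(range(VECTOR_WIDTH))
b = [100] * VECTOR_WIDTH
result = masked_add(dst, a, b, 0b0000_0000_0000_0101)  # lanes 0 and 2 enabled
```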
Fig. 9 is a block diagram of a processor 900 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to an embodiment of the disclosure. The solid-line boxes in fig. 9 illustrate a processor 900 having a single core 902A, a system agent 910, and a set 916 of one or more bus controller units, while the optional addition of the dashed-line boxes illustrates an alternative processor 900 having multiple cores 902A-N, a set 914 of one or more integrated memory controller units in the system agent unit 910, and special-purpose logic 908.
Thus, different implementations of processor 900 may include: 1) a CPU, where the special-purpose logic 908 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and cores 902A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor, where cores 902A-N are a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor, where cores 902A-N are a large number of general-purpose in-order cores. Thus, the processor 900 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general-purpose graphics processing unit), high-throughput Many Integrated Core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 900 may be part of and/or may be implemented on one or more substrates using any of a variety of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache 904A-904N within the cores, a set of one or more shared cache units 906, and external memory (not shown) coupled to the set of integrated memory controller units 914. The set of shared cache units 906 may include one or more mid-level caches, such as a level two (L2), level three (L3), level four (L4), or other level of cache, a Last Level Cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 912 interconnects the integrated graphics logic 908, the set of shared cache units 906, and the system agent unit 910/integrated memory controller unit(s) 914, alternative embodiments may interconnect such units using any number of well-known techniques. In one embodiment, coherency is maintained between one or more cache units 906 and cores 902A-N.
In some embodiments, one or more of the cores 902A-N are capable of implementing multithreading. System agent 910 includes those components that coordinate and operate cores 902A-N. The system agent unit 910 may include, for example, a Power Control Unit (PCU) and a display unit. The PCU may be or may include logic and components needed to regulate the power states of cores 902A-N and integrated graphics logic 908. The display unit is used to drive one or more externally connected displays.
The cores 902A-N may be homogeneous or heterogeneous in terms of the architectural instruction set; that is, two or more of the cores 902A-N may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of the instruction set or a different instruction set.
Exemplary computer architecture
Fig. 10-13 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network appliances, hubs, switches, embedded processors, Digital Signal Processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of containing a processor and/or other execution logic as disclosed herein are generally suitable.
Referring now to fig. 10, shown is a block diagram of a system 1000 in accordance with one embodiment of the present disclosure. System 1000 may include one or more processors 1010, 1015 coupled to a controller hub 1020. In one embodiment, the controller hub 1020 includes a Graphics Memory Controller Hub (GMCH) 1090 and an input/output hub (IOH) 1050 (which may be on separate chips); the GMCH 1090 includes memory and graphics controllers to which the memory 1040 and the coprocessor 1045 are coupled; the IOH 1050 couples input/output (I/O) devices 1060 to the GMCH 1090. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1040 and the coprocessor 1045 are coupled directly to the processor 1010, and the controller hub 1020 and the IOH 1050 are in a single chip. The memory 1040 may include Microcode Sequencer (MS) arbitration code 1040A, e.g., to store code that, when executed, causes the processor to perform any of the methods of the present disclosure.
The optional nature of additional processors 1015 is indicated in FIG. 10 by the dashed lines. Each processor 1010, 1015 may include one or more of the processing cores described herein and may be some version of the processor 900.
The memory 1040 may be, for example, a Dynamic Random Access Memory (DRAM), a Phase Change Memory (PCM), or a combination of the two. For at least one embodiment, controller hub 1020 communicates with processor(s) 1010, 1015 via a multi-drop bus such as a front-side bus (FSB), a point-to-point interface such as a Quick Path Interconnect (QPI), or similar connection 1095.
In one embodiment, the coprocessor 1045 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1020 may include an integrated graphics accelerator.
There may be a variety of differences between the physical resources 1010, 1015 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.
In one embodiment, processor 1010 executes instructions that control data processing operations of a general type. Embedded within these instructions may be coprocessor instructions. Processor 1010 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1045. Accordingly, processor 1010 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to coprocessor 1045. Coprocessor(s) 1045 accept and execute the received coprocessor instructions.
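The division of work between a processor and an attached coprocessor can be sketched as a simple dispatch loop. The opcode names and class names below are hypothetical, and the coprocessor bus is reduced to a direct method call.

```python
# Hypothetical set of opcodes the processor recognizes as coprocessor type.
COPROCESSOR_OPCODES = {"cp_matmul", "cp_compress"}


class Coprocessor:
    def __init__(self):
        self.executed = []

    def accept(self, instruction: str) -> None:
        """Accept and execute a received coprocessor instruction."""
        self.executed.append(instruction)


class Processor:
    def __init__(self, coprocessor: Coprocessor):
        self.coprocessor = coprocessor
        self.executed = []

    def run(self, instruction_stream: list) -> None:
        for instruction in instruction_stream:
            if instruction in COPROCESSOR_OPCODES:
                # Recognized as coprocessor type: issue it to the coprocessor
                # (stands in for the coprocessor bus or other interconnect).
                self.coprocessor.accept(instruction)
            else:
                self.executed.append(instruction)


cop = Coprocessor()
cpu = Processor(cop)
cpu.run(["add", "cp_matmul", "load", "cp_compress", "store"])
```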
Referring now to fig. 11, shown is a block diagram of a first more specific exemplary system 1100 in accordance with an embodiment of the present disclosure. As shown in FIG. 11, multiprocessor system 1100 is a point-to-point interconnect system, and includes a first processor 1170 and a second processor 1180 coupled via a point-to-point interconnect 1150. Each of processors 1170 and 1180 may be some version of the processor 900. In one embodiment of the disclosure, processors 1170 and 1180 are processors 1010 and 1015, respectively, and coprocessor 1138 is coprocessor 1045. In another embodiment, processors 1170 and 1180 are respectively processor 1010 and coprocessor 1045.
Processors 1170 and 1180 are shown including Integrated Memory Controller (IMC) units 1172 and 1182, respectively. Processor 1170 also includes as part of its bus controller units point-to-point (P-P) interfaces 1176 and 1178; similarly, the second processor 1180 includes P-P interfaces 1186 and 1188. Processors 1170, 1180 may exchange information via a point-to-point (P-P) interface 1150 using P-P interface circuits 1178, 1188. As shown in FIG. 11, IMCs 1172 and 1182 couple the processors to respective memories, namely a memory 1132 and a memory 1134, which may be portions of main memory locally attached to the respective processors.
Processors 1170, 1180 may each exchange information with a chipset 1190 via individual P-P interfaces 1152, 1154 using point to point interface circuits 1176, 1194, 1186, 1198. Chipset 1190 may optionally exchange information with the coprocessor 1138 via a high-performance interface 1139. In one embodiment, the coprocessor 1138 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor, or external to both processors but connected with the processors via a P-P interconnect, such that if a processor is placed in a low power mode, local cache information for either or both processors may be stored in the shared cache.
Chipset 1190 may be coupled to a first bus 1116 via an interface 1196. In one embodiment, first bus 1116 may be a Peripheral Component Interconnect (PCI) bus or a bus such as a PCI express bus or another third generation I/O interconnect bus, although the scope of the present disclosure is not so limited.
As shown in fig. 11, various I/O devices 1114 may be coupled to first bus 1116, along with a bus bridge 1118, which couples first bus 1116 to a second bus 1120. In one embodiment, one or more additional processors 1115, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or Digital Signal Processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1116. In one embodiment, second bus 1120 may be a Low Pin Count (LPC) bus. In one embodiment, various devices may be coupled to second bus 1120 including, for example, a keyboard and/or mouse 1122, communication devices 1127, and a storage unit 1128, such as a disk drive or other mass storage device, which may include instructions/code and data 1130. Further, an audio I/O 1124 may be coupled to the second bus 1120. Note that other architectures are possible. For example, instead of the point-to-point architecture of fig. 11, a system may implement a multi-drop bus or other such architecture.
Referring now to fig. 12, shown is a block diagram of a second more specific example system 1200 in accordance with an embodiment of the present disclosure. Like elements in fig. 11 and 12 bear like reference numerals, and certain aspects of fig. 11 have been omitted from fig. 12 to avoid obscuring other aspects of fig. 12.
Fig. 12 illustrates that processors 1170, 1180 may include integrated memory and I/O control logic ("CL") 1172 and 1182, respectively. Thus, the CL 1172, 1182 include integrated memory controller units and include I/O control logic. Fig. 12 illustrates that not only are the memories 1132, 1134 coupled to the CL 1172, 1182, but also that the I/O devices 1214 are coupled to the control logic 1172, 1182. Legacy I/O devices 1215 are coupled to the chipset 1190.
Referring now to fig. 13, shown is a block diagram of a SoC 1300 in accordance with an embodiment of the present disclosure. Like elements in fig. 9 bear like reference numerals. In addition, dashed-line boxes are optional features on more advanced SoCs. In fig. 13, interconnect unit(s) 1302 are coupled to: an application processor 1310 that includes a set of one or more cores 902A-N and shared cache unit(s) 906; a system agent unit 910; bus controller unit(s) 916; integrated memory controller unit(s) 914; a set of one or more coprocessors 1320 which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random Access Memory (SRAM) unit 1330; a Direct Memory Access (DMA) unit 1332; and a display unit 1340 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1320 include a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, high-throughput MIC processor, embedded processor, or the like.
The various embodiments disclosed herein (e.g., of mechanisms) may be implemented in hardware, software, firmware, or a combination of such implementations. Embodiments of the disclosure may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 1130 illustrated in FIG. 11, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor, such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represent various logic in a processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores" may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, but are not limited to, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as Read Only Memory (ROM), Random Access Memory (RAM) such as Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM), Erasable Programmable Read Only Memory (EPROM), flash memory, and Electrically Erasable Programmable Read Only Memory (EEPROM); Phase Change Memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the present disclosure also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, devices, processors, and/or system features described herein. These embodiments are also referred to as program products.
Emulation (including binary translation, code morphing, etc.)
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may transform (e.g., using static binary transformations, dynamic binary transformations including dynamic compilation), morph, emulate, or otherwise convert the instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off-processor, or partially on and partially off-processor.
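A software instruction converter of the kind described above can be sketched as a table of translation rules, where one source instruction may expand into one or more target instructions. The source and target mnemonics and register names below are hypothetical and do not correspond to any real instruction set; a real converter would also have to handle encodings, flags, and control flow.

```python
# Hypothetical translation rules from a source instruction set to a target
# instruction set. Each rule returns one or more target instructions.
RULES = {
    "mov": lambda ops: [("or", ops[0], ops[1], "zero")],
    "inc": lambda ops: [("addi", ops[0], ops[0], 1)],
    "push": lambda ops: [("addi", "sp", "sp", -4), ("store", ops[0], "sp")],
}


def convert(source_program: list) -> list:
    """Statically translate a list of (mnemonic, *operands) tuples."""
    target_program = []
    for mnemonic, *operands in source_program:
        target_program.extend(RULES[mnemonic](operands))
    return target_program


converted = convert([("mov", "r1", "r2"), ("inc", "r1"), ("push", "r1")])
```

Here three source instructions become four target instructions, since `push` has no single-instruction equivalent in the assumed target set.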
FIG. 14 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the disclosure. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 14 shows that a program in a high-level language 1402 may be compiled using an x86 compiler 1404 to generate x86 binary code 1406 that may be natively executed by a processor 1416 having at least one x86 instruction set core. The processor 1416 having at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor having at least one x86 instruction set core by compatibly executing or otherwise processing 1) a substantial portion of the instruction set of the Intel x86 instruction set core, or 2) object code versions of applications or other software targeted to run on an Intel processor having at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor having at least one x86 instruction set core. The x86 compiler 1404 represents a compiler operable to generate x86 binary code 1406 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1416 having at least one x86 instruction set core. Similarly, FIG. 14 shows that the program in the high-level language 1402 may be compiled using an alternative instruction set compiler 1408 to generate alternative instruction set binary code 1410 that may be natively executed by a processor 1414 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California, and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1412 is used to convert the x86 binary code 1406 into code that may be natively executed by the processor 1414 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1410, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1412 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1406.
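As an informal illustration of the instruction conversion described above, the following is a minimal Python sketch of a static, table-driven converter. Everything in it is an assumption made for illustration: the function name `convert`, the tuple encoding of instructions, and both toy instruction sets (`inc`/`push`/`mov` as the source set, `addi`/`store` as the target set) are hypothetical and do not correspond to real x86, MIPS, or ARM encodings. It shows only the one point the text makes, namely that a single source instruction may be converted into one or more target instructions.

```python
# Toy static instruction converter (illustrative only; both ISAs are hypothetical).
# A source instruction is a tuple: (mnemonic, operand, ...).


def convert(source_program):
    """Convert a list of toy source instructions into toy target instructions."""
    target = []
    for op, *args in source_program:
        if op == "inc":
            # One source instruction maps to one target instruction.
            (reg,) = args
            target.append(("addi", reg, reg, 1))
        elif op == "mov":
            # Register move expressed as an add-immediate of zero.
            dst, src = args
            target.append(("addi", dst, src, 0))
        elif op == "push":
            # One source instruction maps to two target instructions:
            # adjust the stack pointer, then store the register.
            (reg,) = args
            target.append(("addi", "sp", "sp", -4))
            target.append(("store", reg, "sp", 0))
        else:
            raise ValueError(f"unsupported source instruction: {op}")
    return target


if __name__ == "__main__":
    for insn in convert([("inc", "r1"), ("push", "r1")]):
        print(insn)
```

A dynamic converter would perform the same kind of mapping at run time, typically on blocks of code as they are first reached, rather than ahead of time over the whole program.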

Claims (24)

1. A hardware processor core, comprising:
a first decode cluster comprising a plurality of decoder circuits;
a second decode cluster comprising a plurality of decoder circuits;
fetch circuitry to fetch a first instruction block and send the first instruction block to the first decode cluster for decoding, and to fetch a second instruction block younger in program order than the first instruction block and send the second instruction block to the second decode cluster for decoding;
a microcode sequencer including a memory storing a plurality of micro-operations; and
arbitration circuitry to arbitrate access by the first decode cluster and the second decode cluster to a shared read port of the memory, wherein the arbitration circuitry is to allow access to the shared read port of the memory by the second decode cluster decoding the second instruction block, but not by the first decode cluster decoding the first instruction block, when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
2. The hardware processor core of claim 1, wherein the arbitration circuitry is to: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is greater than or equal to the arbitration threshold, stall access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the hardware processor core.
3. The hardware processor core of claim 2, wherein the stall is a stall of decoding by the second decode cluster.
4. The hardware processor core of claim 1, wherein the arbitration circuitry is to allow access to the shared read port of the memory by the second decode cluster decoding the second instruction block, but not by the first decode cluster decoding the first instruction block, when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
5. The hardware processor core of claim 1, wherein the second decode cluster comprises a data structure to store one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for instructions of the second instruction block, and to send the one or more bits to the arbitration circuitry in response to a request to decode instructions of the second instruction block.
6. The hardware processor core of claim 5, wherein the data structure of the second decode cluster is to store an entry point value that is to indicate an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block, and the second decode cluster is to send the entry point value and the one or more bits to the arbitration circuitry in response to the request to decode an instruction of the second instruction block.
7. The hardware processor core of claim 1, wherein the arbitration circuitry is to: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, allow access to the shared read port of the memory by the first decode cluster decoding the first instruction block but not by the second decode cluster decoding the second instruction block.
8. The hardware processor core of any one of claims 1-7, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
9. A method, comprising:
sending a first instruction block to a first decode cluster of a processor for decoding, the first decode cluster comprising a plurality of decoder circuits;
sending a second instruction block younger in program order than the first instruction block to a second decode cluster of the processor for decoding, the second decode cluster comprising a plurality of decoder circuits; and
arbitrating, via an arbitration circuit of the processor, access by the first decode cluster and the second decode cluster to a shared read port of a microcode sequencer that includes a memory storing a plurality of micro-operations, to allow access to the shared read port by the second decode cluster decoding the second instruction block, but not by the first decode cluster decoding the first instruction block, when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
10. The method of claim 9, wherein the arbitrating comprises: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is greater than or equal to the arbitration threshold, stalling access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the first decode cluster and the second decode cluster.
11. The method of claim 10, wherein the stalling is a stalling of decoding by the second decode cluster.
12. The method of claim 9, wherein the arbitrating comprises: allowing access to the shared read port of the memory by the second decode cluster decoding the second instruction block but not the first decode cluster decoding the first instruction block when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
13. The method of claim 9, further comprising:
reading one or more bits of a data structure from the second decode cluster, the one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for an instruction of the second instruction block; and
sending the one or more bits to the arbitration circuit in response to a request to decode an instruction of the second instruction block.
14. The method of claim 13, further comprising:
reading an entry point value of the data structure from the second decode cluster, the entry point value indicating an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block; and
sending the entry point value and the one or more bits to the arbitration circuit in response to the request to decode an instruction of the second instruction block.
15. The method of claim 9, wherein the arbitrating comprises: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, allowing access to the shared read port of the memory by the first decode cluster decoding the first instruction block but not by the second decode cluster decoding the second instruction block.
16. The method of any of claims 9-15, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
17. A hardware processor core, comprising:
a first decode cluster comprising a plurality of decoder circuits;
a second decode cluster comprising a plurality of decoder circuits;
a branch predictor to identify a first instruction block and a second instruction block younger in program order than the first instruction block, to cause the first instruction block to be sent to the first decode cluster for decoding, and to cause the second instruction block to be sent to the second decode cluster for decoding;
a microcode sequencer including a memory storing a plurality of micro-operations; and
arbitration circuitry to arbitrate access by the first decode cluster and the second decode cluster to a shared read port of the memory, wherein the arbitration circuitry is to allow access to the shared read port of the memory by the second decode cluster decoding the second instruction block, but not by the first decode cluster decoding the first instruction block, when a number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below an arbitration threshold.
18. The hardware processor core of claim 17, wherein the arbitration circuitry is to: when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is greater than or equal to the arbitration threshold, stall access by the second decode cluster to the shared read port of the memory of the microcode sequencer until the second instruction block is the oldest instruction block in program order being decoded by the hardware processor core.
19. The hardware processor core of claim 18, wherein the stall is a stall of decoding by the second decode cluster.
20. The hardware processor core of claim 17, wherein the arbitration circuitry is to allow access to the shared read port of the memory by the second decode cluster decoding the second instruction block, but not by the first decode cluster decoding the first instruction block, when the number of corresponding micro-operations of instructions of the second instruction block in the microcode sequencer is below the arbitration threshold and an instruction decode queue of the second decode cluster has available storage space for the number of corresponding micro-operations.
21. The hardware processor core of claim 17, wherein the second decode cluster comprises a data structure to store one or more bits indicating the number of corresponding micro-operations in the microcode sequencer for instructions of the second instruction block, and to send the one or more bits to the arbitration circuitry in response to a request to decode instructions of the second instruction block.
22. The hardware processor core of claim 21, wherein the data structure of the second decode cluster is to store an entry point value that is to indicate an entry point in the memory for a corresponding micro-operation of an instruction of the second instruction block, and the second decode cluster is to send the entry point value and the one or more bits to the arbitration circuitry in response to the request to decode an instruction of the second instruction block.
23. The hardware processor core of claim 17, wherein the arbitration circuitry is to: when an instruction of the first instruction block has one or more corresponding micro-operations in the microcode sequencer, allow access to the shared read port of the memory by the first decode cluster decoding the first instruction block but not by the second decode cluster decoding the second instruction block.
24. The hardware processor core of any one of claims 17-23, wherein the shared read port of the memory is the only read port into the memory of the microcode sequencer.
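
As an informal illustration only, and not part of the claims, the arbitration behavior recited above can be sketched as a small software model in Python. The names `PortRequest` and `arbitrate`, the field layout, the convention that a smaller `block_age` means an older block, and the tie-break that picks the oldest eligible younger requester are all assumptions introduced for this sketch; the claims describe hardware circuitry and do not specify any of them. The sketch grants the single shared read port to the cluster decoding the oldest block when it requests; otherwise a younger cluster is granted access only if its micro-operation count is below the arbitration threshold and its instruction decode queue has room, and is stalled if not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PortRequest:
    """A decode cluster's request for the shared microcode read port (hypothetical model)."""
    cluster_id: int
    block_age: int        # assumed convention: smaller value = older in program order
    uop_count: int        # micro-operations the instruction needs from the sequencer
    entry_point: int      # entry point into the microcode memory
    idq_free_slots: int   # free entries in the cluster's instruction decode queue


def arbitrate(requests: List[PortRequest],
              oldest_block_age: int,
              threshold: int) -> Optional[PortRequest]:
    """Return the request granted the shared read port this cycle, or None."""
    # The cluster decoding the oldest block in flight gets the port when it asks.
    for req in requests:
        if req.block_age == oldest_block_age:
            return req
    # Otherwise a younger cluster may go out of order, but only for a short flow
    # that also fits in its instruction decode queue.
    eligible = [
        r for r in requests
        if r.uop_count < threshold and r.idq_free_slots >= r.uop_count
    ]
    # Assumed tie-break: prefer the oldest of the eligible younger requesters.
    # Requests that are not eligible stay stalled until their block becomes oldest.
    return min(eligible, key=lambda r: r.block_age, default=None)


if __name__ == "__main__":
    young_short = PortRequest(cluster_id=1, block_age=11, uop_count=2,
                              entry_point=0x200, idq_free_slots=8)
    granted = arbitrate([young_short], oldest_block_age=10, threshold=4)
    print(granted.cluster_id if granted else "stalled")
```

In this model the threshold value of 4 used in the usage lines is arbitrary; the document does not state a value.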
CN202110982471.7A 2020-09-25 2021-08-25 Method, system, and apparatus for out-of-order access to shared microcode sequencers by a clustered decode pipeline Pending CN114253607A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/033,649 2020-09-25
US17/033,649 US11907712B2 (en) 2020-09-25 2020-09-25 Methods, systems, and apparatuses for out-of-order access to a shared microcode sequencer by a clustered decode pipeline

Publications (1)

Publication Number Publication Date
CN114253607A true CN114253607A (en) 2022-03-29

Family

ID=80791344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982471.7A Pending CN114253607A (en) 2020-09-25 2021-08-25 Method, system, and apparatus for out-of-order access to shared microcode sequencers by a clustered decode pipeline

Country Status (2)

Country Link
US (1) US11907712B2 (en)
CN (1) CN114253607A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117580145A (en) * 2023-11-21 2024-02-20 白盒子(上海)微电子科技有限公司 Radio frequency control method for high-precision timing

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12423103B2 (en) * 2021-12-13 2025-09-23 Intel Corporation Instruction decode cluster offlining
US20230315473A1 (en) * 2022-04-02 2023-10-05 Intel Corporation Variable-length instruction steering to instruction decode clusters
US12260214B1 (en) * 2022-09-30 2025-03-25 Amazon Technologies, Inc. Throughput increase for compute engine
CN115525344B (en) * 2022-10-31 2023-06-27 海光信息技术股份有限公司 Decoding method, processor, chip and electronic equipment
CN115525343B (en) * 2022-10-31 2023-07-25 海光信息技术股份有限公司 Parallel decoding method, processor, chip and electronic equipment
US20250306937A1 (en) * 2024-03-29 2025-10-02 Intel Corporation Concurrent decode of complex instructions having varying numbers of decoded instructions

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5630083A (en) * 1994-03-01 1997-05-13 Intel Corporation Decoder for decoding multiple instructions in parallel
US5913049A (en) * 1997-07-31 1999-06-15 Texas Instruments Incorporated Multi-stream complex instruction set microprocessor
US20010032307A1 (en) * 1998-12-30 2001-10-18 Joseph Rohlman Micro-instruction queue for a microprocessor instruction pipeline
US6363471B1 (en) * 2000-01-03 2002-03-26 Advanced Micro Devices, Inc. Mechanism for handling 16-bit addressing in a processor
US6968444B1 (en) * 2002-11-04 2005-11-22 Advanced Micro Devices, Inc. Microprocessor employing a fixed position dispatch unit
US9710277B2 (en) * 2010-09-24 2017-07-18 Intel Corporation Processor power management based on class and content of instructions
US20180173534A1 (en) * 2016-12-20 2018-06-21 Intel Corporation Branch Predictor with Branch Resolution Code Injection
US11467838B2 (en) * 2018-05-22 2022-10-11 Advanced Micro Devices, Inc. Fastpath microcode sequencer
US11748649B2 (en) * 2019-12-13 2023-09-05 Intel Corporation Apparatus and method for specifying quantum operation parallelism for a quantum control processor

Also Published As

Publication number Publication date
US20220100500A1 (en) 2022-03-31
US11907712B2 (en) 2024-02-20

Similar Documents

Publication Publication Date Title
US11243775B2 (en) System, apparatus and method for program order queue (POQ) to manage data dependencies in processor having multiple instruction queues
US10503505B2 (en) Read and write masks update instruction for vectorization of recursive computations over independent data
CN107408036B (en) User-Level Forking and Combining Processors, Methods, Systems and Instructions
US11907712B2 (en) Methods, systems, and apparatuses for out-of-order access to a shared microcode sequencer by a clustered decode pipeline
CN106708753B (en) Apparatus and method for accelerating operation in processor using shared virtual memory
US9122475B2 (en) Instruction for shifting bits left with pulling ones into less significant bits
CN111752616A (en) System, apparatus and method for symbolic memory address generation
EP3547119B1 (en) Apparatus and method for speculative conditional move operation
CN113535236A (en) Method and apparatus for instruction set architecture based and automated load tracing
US20180365022A1 (en) Dynamic offlining and onlining of processor cores
US11941409B2 (en) Methods, systems, and apparatuses for a multiprocessor boot flow for a faster boot process
CN104050415B (en) The sane and high performance instruction called for system
EP3330863A1 (en) Apparatuses, methods, and systems to share translation lookaside buffer entries
CN112241288A (en) Detecting dynamic control flow reconvergence points for conditional branches in hardware
US12190157B2 (en) Methods, systems, and apparatuses for scalable port-binding for asymmetric execution ports and allocation widths of a processor
EP3109754A1 (en) Systems, methods, and apparatuses for improving performance of status dependent computations
EP4020170A1 (en) Methods, systems, and apparatuses to optimize partial flag updating instructions via dynamic two-pass execution in a processor
US9886318B2 (en) Apparatuses and methods to translate a logical thread identification to a physical thread identification
US10437590B2 (en) Inter-cluster communication of live-in register values
US12572358B2 (en) System, apparatus and methods for minimum serialization in response to non-serializing register write instruction
CN115858022A (en) Scalable switch point control circuitry for clustered decoding pipeline
US12505043B1 (en) Methods and apparatus for timed hardware delay for reductions in instruction fetch traffic
US11275588B2 (en) Context save with variable save state size
CN120723303A (en) Apparatus and method for remote atomic floating-point operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination