CN105453028B - Vector accumulation method and equipment - Google Patents

Publication number
CN105453028B
Authority
CN
China
Prior art keywords
output
vector
elements
input
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201480043504.XA
Other languages
Chinese (zh)
Other versions
CN105453028A (en)
Inventor
阿贾伊·阿南特·英格尔
马克·默里·霍夫曼
迪帕克·马修
曾贸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Inc
Publication of CN105453028A
Application granted
Publication of CN105453028B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
    • G06F9/3897 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros with adaptable data path

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

In a particular embodiment, a method includes executing a vector instruction at a processor. The vector instruction includes a vector input that includes a plurality of elements. Executing the vector instruction includes providing a first element of the plurality of elements as a first output. Executing the vector instruction further includes performing an arithmetic operation on the first element and a second element of the plurality of elements to provide a second output. Executing the vector instruction further includes storing the first output and the second output in an output vector.

Description

Vector accumulation method and apparatus
Cross-reference to related applications
The present application claims priority from commonly owned U.S. Non-Provisional Patent Application No. 13/967,191, filed on August 14, 2013, the entire contents of which are expressly incorporated herein by reference.
Technical field
The present invention relates generally to vector arithmetic reduction.
Background
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless computing devices such as portable wireless telephones, personal digital assistants (PDAs), tablet computers, and paging devices, that are small, lightweight, and easily carried by users. Many such computing devices include other devices incorporated therein. For example, a wireless telephone may also include a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such computing devices can process executable instructions, including software applications, such as a web browser application that can be used to access the Internet, and multimedia applications that use a still or video camera and provide multimedia playback functionality.
Many such computing devices include a vector processor for handling wireless transmissions and other activities that involve large numbers of repeated computations. A vector processor executes instructions that operate on multiple inputs, which may be arranged as a one-dimensional array or vector. Executing a vector instruction enables a given operation to be performed on multiple inputs. For example, executing a conventional vector addition reduction instruction computes a single total based on multiple inputs. Other operations (e.g., integration functions and cumulative density functions) may use one or more partial sums (e.g., sums of fewer than all of the inputs) in addition to the single total. To generate and output the partial sums, multiple vector instructions are executed. Compared to executing a single vector addition reduction instruction to generate and output a single sum, executing multiple vector instructions typically increases memory usage and power consumption.
Summary of the invention
A method of executing a cumulative vector arithmetic reduction instruction is disclosed. The cumulative vector arithmetic reduction instruction may be executed at a processor to perform multiple progressive arithmetic operations (e.g., progressive addition operations) on an input vector. The input vector may include multiple input elements stored in a sequential order. Executing the cumulative vector arithmetic reduction instruction may produce an output vector having multiple output elements. Each output element may be based on the result of applying the arithmetic operation to the corresponding input element of the input vector and to any sequentially preceding input elements of the input vector. Thus, the output values may correspond to multiple partial sums of the input elements and to the sum of all of the input elements. At least one of the input elements or the output elements may be masked to prevent one or more input elements from being included in the cumulative vector arithmetic reduction operation, or to prevent one or more output elements from storing a cumulative vector arithmetic reduction result.
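The input-masking behavior mentioned above can be illustrated with a minimal software sketch. It assumes that a masked-out input element simply contributes nothing to the running sum while its output slot is still written; the function name `masked_cumulative_add` and the boolean mask encoding are hypothetical and are not taken from the patent.

```python
def masked_cumulative_add(elements, input_mask):
    """Cumulative addition in which masked-out inputs are skipped.

    Assumption for illustration: input_mask[i] is True when element i
    takes part in the reduction and False when it is excluded.
    """
    if len(elements) != len(input_mask):
        raise ValueError("elements and input_mask must have the same length")
    running = 0
    outputs = []
    for value, enabled in zip(elements, input_mask):
        if enabled:
            running += value      # only unmasked inputs are accumulated
        outputs.append(running)   # one partial result per output element
    return outputs


print(masked_cumulative_add([1, 2, 3, 4], [True, False, True, True]))
# [1, 1, 4, 8]
```

An output mask could be modeled in a similar way by leaving the masked output slots unchanged instead of writing the partial result.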
A reduction tree may be selectively configured, based on a section grouping size of a segmented vector arithmetic reduction instruction, to execute the segmented vector arithmetic reduction instruction. The reduction tree may include multiple adders arranged in multiple rows. One or more adders in the rows may be selectively enabled based on the section grouping size, and multiple output values may be generated by the selectively enabled adders. The output values may be generated concurrently by performing an arithmetic (e.g., addition) operation on the inputs of one or more groups. Each group may have the section grouping size, as determined by the selectively enabled adders. Thus, a single reduction tree may be configured to execute multiple segmented vector arithmetic reduction instructions, each having a different section grouping size.
In a particular embodiment, a method includes executing a vector instruction at a processor. The vector instruction includes a vector input that includes a plurality of elements. Executing the vector instruction includes providing a first element of the plurality of elements as a first output. Executing the vector instruction further includes performing a first arithmetic operation on the first element and a second element of the plurality of elements to provide a second output. Executing the vector instruction further includes storing the first output and the second output in an output vector.
In another particular embodiment, an apparatus includes a processor that includes a reduction tree. During execution of a vector instruction that identifies a vector input including a plurality of elements, the reduction tree is configured to provide a first element of the plurality of elements as a first output element. The reduction tree is further configured to perform a first arithmetic operation on the first element and a second element of the plurality of elements to provide a second output element. The reduction tree is further configured to store the first output element and the second output element in an output vector.
In another particular embodiment, an apparatus includes means for providing a first element of a plurality of elements as a first output. A vector instruction indicates a vector input that includes the plurality of elements. The apparatus further includes means for generating a second output based on the first element and a second element of the plurality of elements. The apparatus further includes means for storing the first output and the second output in an output vector.
In another particular embodiment, a non-transitory computer-readable medium includes instructions that, when executed by a processor, cause the processor to: provide a first element of a plurality of elements as a first output; perform an arithmetic operation on the first element and a second element of the plurality of elements to provide a second output; and store the first output and the second output in an output vector. The plurality of elements is included in a vector input indicated by a vector instruction.
In another particular embodiment, an apparatus includes a reduction tree that includes a plurality of inputs, a plurality of adders, and a plurality of outputs. A processor is configured to use the reduction tree during execution of a first instruction that includes a first section grouping size and during execution of a second instruction that includes a second section grouping size. The reduction tree is configured to generate multiple output elements concurrently.
In another particular embodiment, a method includes receiving, at a processor, a vector instruction that includes a section grouping size. The processor includes a reduction tree. The reduction tree includes a plurality of inputs, a plurality of arithmetic operation units, and a plurality of outputs. The method further includes determining the section grouping size. The method further includes executing the vector instruction using the reduction tree, based on the section grouping size, to generate the plurality of outputs concurrently. The reduction tree is selectively configurable for use with multiple different section grouping sizes.
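As a rough software model of a segmented reduction driven by a section grouping size, the sketch below sums each consecutive group of inputs and returns one output per group. The function name `segmented_add_reduce`, and the assumption that groups are consecutive and that the input length is a multiple of the grouping size, are illustrative choices only; in hardware the same outputs would be produced concurrently by the selectively enabled adders.

```python
def segmented_add_reduce(elements, group_size):
    """Sum each consecutive group of `group_size` input elements.

    Assumption for illustration: groups are formed from consecutive
    elements and the input length is a multiple of group_size.
    """
    if group_size <= 0 or len(elements) % group_size != 0:
        raise ValueError("input length must be a positive multiple of group_size")
    return [
        sum(elements[start:start + group_size])
        for start in range(0, len(elements), group_size)
    ]


inputs = [1, 2, 3, 4, 5, 6, 7, 8]
print(segmented_add_reduce(inputs, 2))  # [3, 7, 11, 15]
print(segmented_add_reduce(inputs, 4))  # [10, 26]
```

Calling the same function with two different grouping sizes mirrors the idea of one reduction tree serving instructions with different section grouping sizes.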
In yet another particular embodiment, a method includes executing a vector instruction that includes a plurality of input elements. Executing the vector instruction includes grouping a first subset of the plurality of input elements to form a first set of input elements. Executing the vector instruction further includes grouping a second subset of the plurality of input elements to form a second set of input elements. Executing the vector instruction further includes performing a first arithmetic operation on the input elements of the first set and performing a second arithmetic operation on the input elements of the second set. Executing the vector instruction further includes rotating the contents of an output register and, after rotating the contents of the output register, inserting a first result of the first arithmetic operation and a second result of the second arithmetic operation into the output register.
One particular advantage provided by at least one of the disclosed embodiments is a reduction tree configured to generate multiple partial results during execution of a single cumulative vector arithmetic reduction instruction. Compared to executing multiple vector instructions to produce a similar output, executing the single cumulative vector arithmetic reduction instruction may use less space in memory and may reduce power consumption. Another particular advantage provided by at least one of the disclosed embodiments is a processor that can be configured to use a single reduction tree during execution of a first instruction having a first section grouping size and during execution of a second instruction having a second section grouping size. Compared to using multiple reduction trees during execution of multiple instructions having different section grouping sizes, using the single reduction tree may reduce the chip area and power consumption of the processor.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief description of the drawings, Detailed description, and the Claims.
Brief description of the drawings
Fig. 1 is a diagram of an illustrative process of executing a cumulative vector arithmetic reduction instruction;
Fig. 2 is a block diagram of an illustrative embodiment of a system to execute vector instructions;
Figs. 3 to 6 are block diagrams of illustrative embodiments of a reduction tree;
Fig. 7 is a block diagram of an illustrative embodiment of a portion of a reduction tree;
Fig. 8 is a block diagram of another illustrative embodiment of a reduction tree;
Fig. 9 is a diagram of an illustrative process of executing a segmented vector arithmetic reduction instruction;
Fig. 10 is a diagram of an illustrative process of executing a rotated segmented vector arithmetic reduction instruction;
Figs. 11A to 11B are diagrams of illustrative processes of executing a cumulative vector arithmetic reduction instruction that includes masking;
Fig. 12 is a flowchart of an illustrative embodiment of a method of executing a first cumulative vector arithmetic reduction instruction;
Fig. 13 is a flowchart of an illustrative embodiment of a method of executing a vector instruction using a reduction tree;
Fig. 14 is a flowchart of an illustrative embodiment of a method of executing a rotated segmented vector arithmetic reduction instruction; and
Fig. 15 is a block diagram of a portable device that includes a reduction tree.
Detailed description
Referring to Fig. 1, a diagram of an illustrative process of executing a vector instruction is disclosed and generally designated 100. The vector instruction may include a cumulative vector arithmetic reduction instruction, such as illustrative cumulative vector arithmetic reduction instruction 101. Cumulative vector arithmetic reduction instruction 101 may be executed at a processor, such as a pipelined vector processor, as described with reference to Fig. 2. The processor may receive input vector 122, which includes a plurality of elements 102. The processor may process input vector 122 and generate output vector 120. Output vector 120 (e.g., the output elements stored in output vector 120) may be based on cumulative vector arithmetic reduction instruction 101. For example, executing cumulative vector arithmetic reduction instruction 101 may generate a particular output by adding a particular element of the plurality of elements 102 to one or more other elements of the plurality of elements 102 that sequentially precede the particular element in the sequential order of input vector 122 (e.g., a cumulative addition).
The plurality of elements 102 (e.g., input vector 122) and output vector 120 may each include N elements, where N is an integer greater than one. The plurality of elements 102 may include first element 104 (s0), second element 106 (s1), third element 108 (s2), and Nth element 110 (s(N-1)). The plurality of elements 102 may be stored in a sequential order, such as "s0, s1, s2, ... s(N-1)", where s0 is the first sequential element and s(N-1) is the last sequential element in the sequential order. Although four elements are shown, the number of elements (e.g., N) in the plurality of elements 102 may be more or fewer than four. In a particular embodiment, before cumulative vector arithmetic reduction instruction 101 is executed, a vector permute instruction is executed on input vector 122 to arrange the plurality of elements 102 in the sequential order.
Executing cumulative vector arithmetic reduction instruction 101 may generate multiple output elements (e.g., multiple output values) that are stored in output vector 120. Output vector 120 may have the same number of elements (e.g., N) as input vector 122. Executing cumulative vector arithmetic reduction instruction 101 may include providing N output elements, which may be stored in output vector 120. For example, first output element 112, second output element 114, third output element 116, and Nth output element 118 may be stored in output vector 120. Output elements 112 to 118 may be stored in output vector 120 concurrently. For example, first output element 112 and second output element 114 may be stored in output vector 120 during a single execution cycle in which the processor executes cumulative vector arithmetic reduction instruction 101.
Each output element of output elements 112 to 118 (e.g., the N output elements) may be based on an arithmetic operation (e.g., an addition operation) performed on one or more elements of the plurality of elements 102. After cumulative vector arithmetic reduction instruction 101 is executed using the plurality of elements 102 ordered in the particular sequential order "s0, s1, s2, ... s(N-1)", first output element 112 may equal s0, second output element 114 may equal s0+s1, third output element 116 may equal s0+s1+s2, and Nth output element 118 may equal the sum of all of the elements in the plurality of elements 102 (s0+s1+...+s(N-1)). For example, executing cumulative vector arithmetic reduction instruction 101 may include providing (e.g., generating) first element 104 as first output element 112, and adding first element 104 to second element 106 to provide (e.g., generate) second output element 114. First output element 112 and second output element 114 may be stored in different output elements of output vector 120. Executing cumulative vector arithmetic reduction instruction 101 may further include adding first element 104 and second element 106 to third element 108 to provide third output element 116, and storing third output element 116 in output vector 120. Executing cumulative vector arithmetic reduction instruction 101 may further include adding together each of the elements in the plurality of elements 102 to provide Nth output element 118, and storing Nth output element 118 in output vector 120.
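The relationship between the input elements s0 ... s(N-1) and the N output elements described above is a running (prefix) sum. A minimal software sketch follows; the function name `cumulative_add_reduce` is hypothetical, and the sequential loop inside `itertools.accumulate` only models the result, whereas the described hardware is said to store the output elements concurrently.

```python
from itertools import accumulate


def cumulative_add_reduce(elements):
    """Return [s0, s0+s1, ..., s0+s1+...+s(N-1)] for the input elements."""
    return list(accumulate(elements))


vu = [3, 1, 4, 1]                # input vector: s0, s1, s2, s3
vd = cumulative_add_reduce(vu)   # output vector
print(vd)  # [3, 4, 8, 9]
```

The last output element equals the total that a conventional addition reduction would produce, while the earlier elements carry the partial sums.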
As illustrated in Fig. 1, cumulative vector arithmetic reduction instruction 101 may include instruction name 180 (vrcadd) (e.g., an opcode). Cumulative vector arithmetic reduction instruction 101 may also include one or more fields, such as first field 182 (Vu), second field 184 (Vd), third field 186 (Q), fourth field 188 (Op), fifth field 190 (sc32), and sixth field 192 (sat). A first value stored in first field 182 may indicate input vector 122 (e.g., vector Vu) to be used during execution of cumulative vector arithmetic reduction instruction 101, and a second value stored in second field 184 may indicate output vector 120 (e.g., vector Vd) to be used during execution of cumulative vector arithmetic reduction instruction 101. A third value stored in third field 186 may indicate a mask (e.g., mask Q), as described in further detail with reference to Figs. 11A to 11B; a fourth value stored in fourth field 188 may indicate an operation vector (e.g., operation vector Op); a fifth value stored in fifth field 190 may indicate an input value type, as described in further detail with reference to Figs. 3 to 4; and a sixth value stored in sixth field 192 may indicate whether saturation is to be performed during the cumulative vector arithmetic reduction, as described with reference to Fig. 7.
Although addition operations have been described, cumulative vector arithmetic reduction instruction 101 is not limited to performing only addition operations. For example, cumulative vector arithmetic reduction instruction 101 may indicate one or more arithmetic operations to be performed on the plurality of elements 102. The one or more arithmetic operations may include an addition operation, a subtraction operation, or a combination thereof. For example, the arithmetic reduction may be performed using one or more addition operations, using one or more subtraction operations, or using a combination of one or more addition operations and one or more subtraction operations. The one or more arithmetic operations may be indicated by a value in a particular field (e.g., a particular parameter), such as fourth field 188. For example, fourth field 188 may include a pointer to a location in memory, or to a register, that stores the operation vector (e.g., a vector indicating the one or more arithmetic operations). Each element of the operation vector may indicate a particular operation (e.g., an addition operation or a subtraction operation) to be performed on the corresponding element of the plurality of elements 102 during execution of cumulative vector arithmetic reduction instruction 101. When at least one of the one or more arithmetic operations is a subtraction operation, one or more elements of the plurality of elements 102 may be complemented before the output elements are generated. For example, one or more elements of the plurality of elements 102 may be complemented based on cumulative vector arithmetic reduction instruction 101 (e.g., based on the fourth value stored in fourth field 188) before first output element 112 and second output element 114 are provided (e.g., before the output elements are generated).
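A per-element operation vector can be modeled in software as follows. This is a sketch under stated assumptions: the `'+'`/`'-'` encoding of the operation vector and the name `cumulative_reduce_with_ops` are hypothetical, and negating an element before accumulation stands in for the complement step that the text describes for subtraction.

```python
def cumulative_reduce_with_ops(elements, ops):
    """Cumulative reduction where ops[i] selects add or subtract for element i.

    Assumption for illustration: ops[i] is '+' or '-'. Elements marked '-'
    are negated first (modeling the complement step), then everything is
    accumulated by addition.
    """
    if len(elements) != len(ops):
        raise ValueError("elements and ops must have the same length")
    if any(op not in ("+", "-") for op in ops):
        raise ValueError("each op must be '+' or '-'")
    adjusted = [x if op == "+" else -x for x, op in zip(elements, ops)]
    outputs = []
    running = 0
    for value in adjusted:
        running += value
        outputs.append(running)
    return outputs


print(cumulative_reduce_with_ops([5, 2, 7, 1], ["+", "-", "+", "-"]))
# [5, 3, 10, 9]
```

Negating first keeps the accumulation itself a pure addition, which is one plausible reason for complementing inputs before they enter an adder-based reduction tree.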
During operation, the processor may receive cumulative vector arithmetic reduction instruction 101. The processor may execute cumulative vector arithmetic reduction instruction 101 using the plurality of elements 102 to generate multiple output elements and store them in output vector 120. The output elements may represent multiple partial results of the cumulative vector arithmetic reduction operation.
Compared to generating the partial results during execution of multiple vector instructions, cumulative vector arithmetic reduction instruction 101 may provide storage and power consumption benefits by generating the partial results (e.g., output elements 112 to 118) during execution of a single vector instruction. For example, compared to generating the partial results during execution of multiple vector instructions, generating the partial results during execution of a single vector instruction may use less storage space in memory or in a register file, and may reduce the power consumption of the processor.
Fig. 2 is a block diagram of an embodiment of system 200 configured to execute vector instructions. System 200 may include processor 202, which is configured to receive vector instruction 220 and input vector 122 and to provide output vector 120. Vector instruction 220 may be cumulative vector arithmetic reduction instruction 101 of Fig. 1. Alternatively, as illustrative non-limiting examples, vector instruction 220 may be a segmented vector arithmetic reduction instruction (as described with reference to Fig. 9) or a rotated segmented vector arithmetic reduction instruction (as described with reference to Fig. 10).
Processor 202 may include arithmetic logic unit (ALU) 204 and control logic 210. ALU 204 may include reduction tree 206 and rotation unit 208. ALU 204 may be configured to receive input vector 122 and to perform one or more arithmetic operations on input vector 122 using reduction tree 206. Reduction tree 206 may provide output vector 120. Output vector 120 may be provided to a location identified by vector instruction 220, such as a register or a location in memory. For example, output vector 120 may be provided to a location based on a particular field of vector instruction 220 (e.g., second field 184 of Fig. 1).
ALU 204 and reduction tree 206 may be part of an execution pipeline. For example, processor 202 may be a pipelined vector processor that includes one or more pipelines. Reduction tree 206 may be included in the one or more pipelines. Reduction tree 206 may have a number of stages (e.g., a stage depth) that is based on the number of input elements of input vector 122. The number of stages of reduction tree 206 may correspond to the base-2 logarithm of the number of input elements. For example, when the number of input elements is 32, reduction tree 206 may have five stages. Reduction tree 206 may include multiple arithmetic operation units arranged in one or more rows. Each stage of reduction tree 206 may correspond to a row of arithmetic operation units of reduction tree 206.
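The stage count follows directly from the base-2 logarithm relationship stated above. The short sketch below computes it with integer arithmetic; the function name `reduction_tree_stages` and the restriction to power-of-two input counts are assumptions made for illustration.

```python
def reduction_tree_stages(num_inputs):
    """Return log2(num_inputs), the number of stages in the modeled tree.

    Assumption for illustration: num_inputs is a power of two and at least 2.
    """
    if num_inputs < 2 or num_inputs & (num_inputs - 1) != 0:
        raise ValueError("num_inputs must be a power of two and at least 2")
    return num_inputs.bit_length() - 1


print(reduction_tree_stages(32))  # 5
print(reduction_tree_stages(16))  # 4
```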
Control logic 210 may be configured to select (e.g., selectively enable) one or more adders of the multiple adders of reduction tree 206 based on vector instruction 220 (e.g., cumulative vector arithmetic reduction instruction 101 of Fig. 1), as described with reference to Figs. 3 to 7. Selectively enabling one or more arithmetic operation units may cause reduction tree 206 to provide (e.g., generate) one or more output elements for insertion into output vector 120.
Rotation unit 208 may be configured to receive rotation vector 280 and to selectively rotate rotation vector 280 based on vector instruction 220, as further described with reference to Fig. 10. Rotation unit 208 may be configured to rotate rotation vector 280 before one or more output elements are inserted (e.g., stored) into output vector 120. For example, rotation unit 208 may rotate rotation vector 280 in parallel with reduction tree 206 generating one or more output elements based on input vector 122. The rotated rotation vector and the one or more output elements may be provided to multiplexer 212 for insertion into output vector 120 (e.g., to generate output vector 120). For example, when input vector 122 and rotation vector 280 each include 16 elements, and execution of vector instruction 220 generates eight output elements using reduction tree 206, multiplexer 212 may select the eight output elements and eight rotated elements from the rotated rotation vector for insertion into output vector 120. Other selections may be made based on input vector 122 and/or rotation vector 280 having other sizes, or based on execution of vector instruction 220 generating a different number of output elements. In an alternative embodiment, rotation vector 280 may be input vector 122, and multiple input elements from input vector 122 may be provided to both rotation unit 208 and reduction tree 206.
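The rotate-then-insert step can be sketched in software as shown below. The rotation direction, the rotation amount (taken here to equal the number of new outputs), and the slots that receive the new outputs are all assumptions chosen for illustration, since this passage does not specify them; the name `rotate_and_insert` is likewise hypothetical.

```python
def rotate_and_insert(rotation_vector, new_outputs):
    """Model of rotating previous contents and inserting new results.

    Assumptions for illustration: the previous contents are rotated left by
    len(new_outputs), and the new outputs replace the last len(new_outputs)
    slots of the rotated vector (the multiplexer selection).
    """
    n = len(rotation_vector)
    k = len(new_outputs)
    if k > n:
        raise ValueError("more new outputs than slots in the rotation vector")
    rotated = rotation_vector[k:] + rotation_vector[:k]   # rotate left by k
    return rotated[:n - k] + list(new_outputs)            # mux: old + new


previous = [10, 11, 12, 13]   # contents left by an earlier instruction
results = [1, 2]              # results of the current reduction
print(rotate_and_insert(previous, results))  # [12, 13, 1, 2]
```

With a 16-element rotation vector and eight new outputs, the same function keeps eight rotated elements and eight new results, matching the element counts in the example above.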
As an illustrative example, rotation unit 208 can be a rotator or a barrel vector shifter. Rotation vector 280 may include multiple previous elements (e.g., multiple elements generated by executing a previous vector instruction). Rotation vector 280 can be identified by vector instruction 220. For example, rotation vector 280 can be stored in a location (e.g., a register or a location in memory) identified by a field of vector instruction 220. In a particular embodiment, a first location associated with rotation vector 280 is the same as a second location associated with output vector 120. For example, vector instruction 220 can identify a particular register as output vector 120, and the previously stored elements (e.g., contents) of the particular register can be used as rotation vector 280. The previously stored value at the particular register can be the result of a previous vector arithmetic reduction instruction. In another embodiment, the first location associated with rotation vector 280 is the same as a third location associated with input vector 122. In other embodiments, rotation vector 280 can be identified by another value stored in another field of vector instruction 220 (e.g., by a value stored in a field different from the field for output vector 120), or the rotation vector can be predetermined based on the instruction name (e.g., the operation code (opcode)) of vector instruction 220.
During operation, processor 202 can be configured to receive and execute vector instruction 220 to perform a vector arithmetic reduction (e.g., an accumulating vector arithmetic reduction or a segmented vector arithmetic reduction) on input vector 122 using reduction tree 206. Reduction tree 206 can perform the vector arithmetic reduction on input vector 122 to generate multiple results simultaneously (e.g., during a single execution cycle of processor 202). During the execution of vector instruction 220, the multiple results generated by reduction tree 206 can be stored in output vector 120.
Compared to other systems that generate multiple partial results over the execution of multiple vector instructions, system 200 can provide improvements in storage and power consumption by generating multiple partial results (e.g., multiple results) during the execution of a single vector instruction (e.g., vector instruction 220).
Referring to Fig. 3, a block diagram of a first illustrative embodiment of reduction tree 300 is disclosed. For example, reduction tree 300 may include reduction tree 206 of Fig. 2. Reduction tree 300 can be used to execute an accumulating vector arithmetic reduction instruction, such as accumulating vector arithmetic reduction instruction 101 of Fig. 1 or vector instruction 220 of Fig. 2. Reduction tree 300 can be configured to receive multiple input elements stored in input vector 122 (including first input element 302 and second input element 304) and to provide (e.g., generate) multiple output elements to be stored in output vector 120. Output vector 120 may include first output element 306 and second output element 308.
Each of the multiple input elements and each of the multiple output elements may include one or more sub-elements. For example, first input element 302 may include a first plurality of input sub-elements 330 to 336 (s0 to s3), such as first input sub-element 330 (s0), second input sub-element 332 (s1), third input sub-element 334 (s2), and fourth input sub-element 336 (s3). Second input element 304 may include a second plurality of input sub-elements 338 to 344 (s4 to s7), such as fifth input sub-element 338 (s4), sixth input sub-element 340 (s5), seventh input sub-element 342 (s6), and eighth input sub-element 344 (s7). In addition, first output element 306 may include a first plurality of output sub-elements 366 to 372 (d0 to d3), such as first output sub-element 366 (d0), second output sub-element 368 (d1), third output sub-element 370 (d2), and fourth output sub-element 372 (d3). Second output element 308 may include a second plurality of output sub-elements 374 to 380 (d4 to d7), such as fifth output sub-element 374 (d4), sixth output sub-element 376 (d5), seventh output sub-element 378 (d6), and eighth output sub-element 380 (d7). Each input element and each output element can have the same size (e.g., the same number of bits). In addition, each input sub-element can have the same size (e.g., the same number of bits) as each output sub-element. For example, each input element (e.g., first input element 302) and each output element can be 64 bits and may include four 16-bit sub-elements (e.g., input sub-elements 330 to 336). In an alternative embodiment, each of input sub-elements 330 to 344 is an individual input element and each of output sub-elements 366 to 380 is an individual output element, so that input vector 122 includes multiple input elements 330 to 344 and output vector 120 includes multiple output elements 366 to 380.
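The relationship between a 64-bit element and its four 16-bit sub-elements can be illustrated with a short Python sketch. The bit ordering (s0 in the least significant bits) is an assumption made for illustration, and `pack` and `unpack` are hypothetical helper names.

```python
SUB_BITS = 16                    # width of one sub-element
SUBS_PER_ELEMENT = 4             # four sub-elements per 64-bit element
MASK = (1 << SUB_BITS) - 1


def pack(sub_elements):
    """Pack four 16-bit sub-elements into one 64-bit element.

    Assumption: s0 occupies the least significant 16 bits.
    """
    value = 0
    for i, sub in enumerate(sub_elements):
        value |= (sub & MASK) << (i * SUB_BITS)
    return value


def unpack(element):
    """Split a 64-bit element back into its four 16-bit sub-elements."""
    return [(element >> (i * SUB_BITS)) & MASK for i in range(SUBS_PER_ELEMENT)]


first_input_element = pack([1, 2, 3, 4])   # s0..s3
```

With this layout, `unpack(first_input_element)` returns the original four sub-elements in order.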
Reduction tree 300 may include multiple arithmetic operation units. In a particular embodiment, the multiple arithmetic operation units can be multiple adders, including first adder 320 and second adder 321. In other embodiments, the multiple arithmetic operation units can include subtractors or a combination of adders and subtractors. The multiple adders may include (e.g., be arranged as) one or more rows of adders. For example, the multiple adders may include (e.g., be arranged as) first row 312. Although depicted as including a single row, the multiple adders may include more than one row.
One or more adders of the multiple adders can be selectively enabled based on the received accumulating vector arithmetic reduction instruction, as described with reference to Fig. 7. Adders that are not selectively enabled (illustrated with shading in Fig. 3, such as second adder 321) can be configured to output a particular input received at the adder (e.g., zero is added to the particular input), as described with reference to Fig. 7. For example, second adder 321 can be configured to receive first input element 302 and to output first input element 302 to be stored in output vector 120. Adders that are selectively enabled (illustrated without shading in Fig. 3, such as first adder 320) can be configured to perform an add operation. For example, first adder 320 can perform an add operation based on first input element 302 and second input element 304. First adder 320 can generate an adder output equal to the sum of first input element 302 and second input element 304. The adder output can be provided as an output element (e.g., second output element 308) to be stored in output vector 120. Through selective enabling, the multiple adders can generate (e.g., provide) the multiple output elements stored in output vector 120.
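The enable and pass-through behavior described above can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are hypothetical, elements are treated as plain integers, and the disabled adder is modeled as adding zero to the input it passes through.

```python
def adder(passthrough, other, enabled):
    """Enabled: return passthrough + other. Not enabled: return passthrough + 0."""
    return passthrough + (other if enabled else 0)


def reduction_tree_fig3(first_input, second_input):
    """Two-element accumulating reduction in the style of Fig. 3."""
    # Second adder 321 is not enabled and forwards first input element 302.
    first_output = adder(first_input, 0, enabled=False)
    # First adder 320 is enabled and sums both input elements.
    second_output = adder(second_input, first_input, enabled=True)
    return [first_output, second_output]


outputs = reduction_tree_fig3(5, 7)
```

For inputs 5 and 7 the outputs are the first input and the running sum, which is the pair of partial results the paragraph describes.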
The multiple input elements can have an input type indicated by the accumulating vector arithmetic reduction instruction (e.g., by a value stored in fifth field 190 of accumulating vector arithmetic reduction instruction 101 of Fig. 1). The input type can identify real numbers, imaginary numbers, or complex numbers (e.g., a combination of real and imaginary numbers), and can additionally be associated with an element size. When the input type is real, each sub-element of an element can represent a real value. When the input type is imaginary, each sub-element of an element can represent an imaginary value. When the input type is complex, for each element, at least one sub-element can represent a real value and at least one other sub-element can represent an imaginary value. Therefore, reduction tree 300 can support multiple different input types, such as 64-bit real, 64-bit imaginary, 32-bit real, 32-bit imaginary, 16-bit real, 16-bit imaginary, 32-bit complex, 16-bit complex, one or more other input types, or any combination thereof.
For example, when the input type is 16-bit complex, each input element 302 and 304 can be 64 bits, each of input sub-elements s0, s2, s4, and s6 can represent a 16-bit real value, and each of input sub-elements s1, s3, s5, and s7 can represent a 16-bit imaginary value. Therefore, each 64-bit input element can be associated with two 16-bit complex input sub-elements (e.g., a first pair s0 and s1 and a second pair s2 and s3). As another example, when the input type identifies 32-bit complex, each input element 302 and 304 can be 64 bits, a first pair of input sub-elements s0 and s1 and a second pair of input sub-elements s4 and s5 can represent 32-bit real values, and a third pair of input sub-elements s2 and s3 and a fourth pair of input sub-elements s6 and s7 can represent 32-bit imaginary values. Therefore, each 64-bit input element can be associated with a 32-bit complex input sub-element (e.g., a first pair of input sub-elements s0 and s1 and a second pair of input sub-elements s2 and s3, or a third pair of input sub-elements s4 and s5 and a fourth pair of input sub-elements s6 and s7). In each example, the multiple output elements may include output elements and output sub-elements of a type similar to the input elements (e.g., the output elements can have the type identified by the input type).
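The 16-bit complex layout can be made concrete with a small Python sketch that interprets the four sub-elements of one element as two complex values. The helper name `as_complex16` is hypothetical, and sub-elements are shown as plain integers rather than 16-bit fields.

```python
def as_complex16(sub_elements):
    """Pair sub-elements as (real, imaginary): s0 + j*s1, s2 + j*s3, and so on."""
    return [complex(sub_elements[i], sub_elements[i + 1])
            for i in range(0, len(sub_elements), 2)]


first_input = [1, 2, 3, 4]             # s0..s3 of one 64-bit input element
values = as_complex16(first_input)     # two complex values per element
```

Even-numbered sub-elements supply the real parts and odd-numbered sub-elements supply the imaginary parts, matching the 16-bit complex case in the text.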
Each adder of the multiple adders may include multiple sub-adders. For example, first adder 320 may include first sub-adder 322, second sub-adder 324, third sub-adder 326, and fourth sub-adder 328. In a particular embodiment, first adder 320 is a 64-bit adder that is segmented to perform four 16-bit add operations (e.g., each of sub-adders 322 to 328 represents one segment of first adder 320). In an alternative embodiment, each of sub-adders 322 to 328 is a 16-bit adder, and first adder 320 represents a group of four 16-bit adders. Each adder of the multiple adders can have a configuration similar to first adder 320 (e.g., second adder 321 may include four sub-adders). Although 64-bit adders and 16-bit sub-adders are described, adders and sub-adders of other sizes can be used, for example adders and sub-adders sized according to the input elements of input vector 122.
Each adder can be configured to perform multiple add operations in an interleaved manner via its multiple sub-adders. For example, first adder 320 can be configured to add first input sub-element 330 (s0) and fifth input sub-element 338 (s4) using first sub-adder 322, to add second input sub-element 332 (s1) and sixth input sub-element 340 (s5) using second sub-adder 324, to add third input sub-element 334 (s2) and seventh input sub-element 342 (s6) using third sub-adder 326, and to add fourth input sub-element 336 (s3) and eighth input sub-element 344 (s7) using fourth sub-adder 328. Therefore, reduction tree 300 can be configured to perform an accumulating vector arithmetic reduction operation on first input element 302 and second input element 304 sub-element by sub-element in an interleaved manner. Performing the interleaved addition sub-element by sub-element can enable the reduction tree to perform add operations on sub-elements having different data types (e.g., real, imaginary, or complex).
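A segmented 64-bit adder of this kind can be sketched in Python as four independent 16-bit lanes. The overflow behavior is not specified here, so the sketch assumes that each lane wraps around and that no carry crosses into the neighboring lane; `segmented_add` is a hypothetical name.

```python
SUB_BITS = 16
MASK = (1 << SUB_BITS) - 1


def segmented_add(a, b, lanes=4):
    """Add two 64-bit elements as four independent 16-bit lanes.

    Assumption: each lane wraps on overflow and carries never propagate
    from one lane into the next.
    """
    result = 0
    for i in range(lanes):
        shift = i * SUB_BITS
        lane_sum = ((a >> shift) & MASK) + ((b >> shift) & MASK)
        result |= (lane_sum & MASK) << shift
    return result


first = 0x0004000300020001     # s3..s0 = 4, 3, 2, 1
second = 0x0040003000200010    # s7..s4 = 0x40, 0x30, 0x20, 0x10
total = segmented_add(first, second)
```

Because each lane is added separately, real parts only ever meet real parts and imaginary parts only meet imaginary parts when the lanes carry interleaved complex data.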
The adder outputs of the bottom row (e.g., first row 312) of the multiple adders can be provided as output elements (e.g., output elements 306 and 308) and stored in output vector 120. For example, each output of each sub-adder of second adder 321 can be provided as the corresponding output sub-element of first output element 306, and each output of each of sub-adders 322 to 328 of first adder 320 can be provided as the corresponding output sub-element of second output element 308. The multiple output elements 306 and 308 (e.g., the multiple output sub-elements 366 to 380) can represent multiple partial results of the accumulating vector arithmetic reduction.
Executing the received accumulating vector arithmetic reduction instruction can generate multiple partial results having the input type identified by the instruction. For example, when the accumulating vector arithmetic reduction instruction is associated with a complex operation (e.g., indicates a complex operation) and the input type is 16-bit complex (e.g., input sub-elements s0, s2, s4, and s6 represent real values and input sub-elements s1, s3, s5, and s7 represent imaginary values), executing the instruction may include generating a first real sub-element of first output element 306 (e.g., first output sub-element 366 (d0)) and a first imaginary sub-element of first output element 306 (e.g., second output sub-element 368 (d1)). Executing the instruction can further include generating a second real sub-element of second output element 308 (e.g., fifth output sub-element 374 (d4)) and a second imaginary sub-element of second output element 308 (e.g., sixth output sub-element 376 (d5)). Therefore, when the input type identifies input elements 302 and 304 as complex, output elements 306 and 308 can be complex.
During operation, reduction tree 300 can be used to execute a received accumulating vector arithmetic reduction instruction. During execution of the instruction, one or more adders of the multiple adders can be selectively enabled based on the instruction to generate multiple output elements including output elements 306 and 308 (e.g., including multiple output sub-elements 366 to 380 (d0 to d7)). For example, first adder 320 can be selectively enabled in whole or at least in part (e.g., one or more of sub-adders 322 to 328 can be selectively enabled based on the instruction). During execution of the instruction, one or more outputs of the multiple adders can be provided as output elements 306 and 308 (e.g., multiple output sub-elements 366 to 380 (d0 to d7)) for storage in output vector 120.
Referring to Fig. 4, a block diagram of a second illustrative embodiment of reduction tree 400 is disclosed. Reduction tree 400 can be used during execution of an accumulating vector arithmetic reduction instruction (e.g., accumulating vector arithmetic reduction instruction 101 of Fig. 1 or vector instruction 220 of Fig. 2). As an illustrative, non-limiting example, reduction tree 400 may include reduction tree 206 of Fig. 2 or reduction tree 300 of Fig. 3. For example, reduction tree 400 can illustrate an extension of reduction tree 300 of Fig. 3 that supports an embodiment in which input vector 122 has four input elements. Reduction tree 400 may include multiple adders, including first adder 320, second adder 321, and adders 402 to 408, that are configured to be selectively enabled based on the accumulating vector arithmetic reduction instruction to generate output vector 120. Although Fig. 4 illustrates multiple adders, reduction tree 400 may include multiple other arithmetic operation units.
Input vector 122 may include first input element 302, second input element 304, third input element 410, and fourth input element 412. Each input element may include multiple input sub-elements. For example, first input element 302 may include input sub-elements s0 to s3, second input element 304 may include input sub-elements s4 to s7, third input element 410 may include input sub-elements s8 to s11, and fourth input element 412 may include input sub-elements s12 to s15. Output vector 120 may include four output elements. For example, output vector 120 may include first output element 306, second output element 308, third output element 422, and fourth output element 424. Each output element may include multiple output sub-elements. For example, first output element 306 may include output sub-elements d0 to d3, second output element 308 may include output sub-elements d4 to d7, third output element 422 may include output sub-elements d8 to d11, and fourth output element 424 may include output sub-elements d12 to d15.
The multiple adders may include (e.g., be arranged as) multiple rows, such as first row 312 and second row 414. Although two rows are shown, in other embodiments the multiple adders may include more or fewer rows, for example based on the number of input elements in input vector 122. Although each of rows 312 and 414 is illustrated as having four adders, in other embodiments each row can have more than four or fewer than four adders, for example based on the number of input elements in input vector 122. Each of adders 402 to 408 may include four sub-adders, as described with reference to adders 320 and 321 of Fig. 3.
One or more adders of the multiple adders can be selectively enabled based on the received accumulating vector arithmetic reduction instruction, as described with reference to Fig. 7. Adders that are not selectively enabled (illustrated with shading in Fig. 4, such as second adder 321 and third adder 402) can be configured to output a particular input received at the adder (e.g., zero is added to the particular input), as described with reference to Fig. 7. For example, second adder 321 can be configured to receive first input element 302 and to output first input element 302 to an adder in second row 414. Adders that are selectively enabled (illustrated without shading in Fig. 4, such as first adder 320, fourth adder 404, fifth adder 406, and sixth adder 408) can be configured to perform an add operation. For example, first adder 320 can perform an add operation based on first input element 302 and second input element 304, and fourth adder 404 can be configured to perform an add operation based on third input element 410 and fourth input element 412. Fifth adder 406 can perform an add operation based on a first adder output of first adder 320 and a second adder output of third adder 402 (e.g., the value of third input element 410), and sixth adder 408 can perform an add operation based on the first adder output and a third adder output of fourth adder 404.
The adder outputs of second row 414 can be provided as multiple output elements (e.g., output elements 306, 308, 422, and 424) to be stored in output vector 120. Through selective enabling, the multiple adders can generate (e.g., provide) the multiple output elements stored in output vector 120. Output elements 306, 308, 422, and 424 (e.g., output sub-elements d0 to d15) can represent one or more partial results of the accumulating vector arithmetic reduction. For example, first output element 306 can be first input element 302; second output element 308 can be the sum of first input element 302 and second input element 304; third output element 422 can be the sum of first input element 302, second input element 304, and third input element 410; and fourth output element 424 can be the sum of first input element 302, second input element 304, third input element 410, and fourth input element 412. Output elements 306, 308, 422, and 424 can be generated sub-element by sub-element, with the add operations performed in an interleaved manner to generate output sub-elements d0 to d15, as explained with reference to Fig. 3. For example, output sub-element d8 can be equal to the sum of input sub-elements s0, s4, and s8, and output sub-element d12 can be equal to the sum of input sub-elements s0, s4, s8, and s12. Each output sub-element can be generated in a similar manner.
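The two-row arrangement for four input elements can be sketched in Python as follows. This is a behavioral model under simple assumptions: each element is a list of four integers, `vadd` is a hypothetical helper standing in for a segmented adder, and adders that are not enabled are modeled as plain pass-throughs.

```python
def vadd(a, b):
    """Sub-element-wise addition, standing in for one segmented adder."""
    return [x + y for x, y in zip(a, b)]


def reduction_tree_fig4(e0, e1, e2, e3):
    """Accumulating reduction over four input elements in two rows."""
    # First row: adders 321 and 402 are not enabled and forward their input.
    r0 = e0               # second adder 321 (pass-through)
    r1 = vadd(e0, e1)     # first adder 320
    r2 = e2               # third adder 402 (pass-through)
    r3 = vadd(e2, e3)     # fourth adder 404
    # Second row: adders 406 and 408 fold in the first-row sum r1.
    return [r0, r1, vadd(r1, r2), vadd(r1, r3)]


s = list(range(16))       # s0..s15 holding the values 0..15
outputs = reduction_tree_fig4(s[0:4], s[4:8], s[8:12], s[12:16])
d = [sub for element in outputs for sub in element]   # d0..d15
```

With these inputs, `d[8]` equals s0 + s4 + s8 and `d[12]` equals s0 + s4 + s8 + s12, in line with the example in the text.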
Although Fig. 4 illustrates a single reduction tree 400 (e.g., a reduction network), in other embodiments reduction tree 400 can be logically divided into multiple parallel accumulating reduction networks that operate in an interleaved manner. For example, in an alternative embodiment, each accumulating reduction network may include a particular sub-adder of each adder (e.g., a first accumulating reduction network may include the corresponding first sub-adder of each adder). Each accumulating reduction network can operate in parallel with the other accumulating reduction networks, and the result from each accumulating reduction network can be stored in output vector 120. For example, reduction tree 400 can be logically divided into four 16-bit accumulating reduction networks. In another example, reduction tree 400 can be logically divided into two 32-bit accumulating reduction networks.
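The logical split into parallel per-lane networks can be sketched in Python. The sketch treats each lane (the k-th sub-element of every element) as its own accumulating network and uses `itertools.accumulate` as a stand-in for that network; `lane_networks` is a hypothetical name, and the hardware would evaluate the lanes in parallel rather than in a loop.

```python
from itertools import accumulate


def lane_networks(elements):
    """Run one accumulating reduction per lane, then restore the element layout."""
    lanes = list(zip(*elements))                       # lane k = sub-element k of every element
    lane_results = [list(accumulate(lane)) for lane in lanes]
    return [list(column) for column in zip(*lane_results)]


inputs = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
outputs = lane_networks(inputs)
```

Viewing the tree as four 16-bit lane networks gives the same output elements as viewing it as a single tree of segmented 64-bit adders, which is why the division is only a logical one.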
During operation, reduction tree 400 can be used to execute a received accumulating vector arithmetic reduction instruction. During execution of the instruction, one or more adders of the multiple adders can be selectively enabled based on the instruction to generate multiple output elements 306, 308, 422, and 424. During execution of the instruction, the multiple output elements 306, 308, 422, and 424 can be stored in output vector 120.
Referring to Fig. 5, a block diagram of a third illustrative embodiment of reduction tree 500 is disclosed. Reduction tree 500 can be used during execution of an accumulating vector arithmetic reduction instruction (e.g., accumulating vector arithmetic reduction instruction 101 of Fig. 1 or vector instruction 220 of Fig. 2). As an illustrative, non-limiting example, reduction tree 500 may include reduction tree 206 of Fig. 2, reduction tree 300 of Fig. 3, or reduction tree 400 of Fig. 4. Reduction tree 500 can be configured to receive multiple input elements 502 stored in input vector 122 and to provide (e.g., generate) multiple output elements 506 to be stored in output vector 120.
Reduction tree 500 may include multiple input elements 502, multiple adders 504, and multiple output elements 506. Although Fig. 5 illustrates multiple adders 504, reduction tree 500 may include multiple other arithmetic operation units. The multiple input elements 502 may include input elements s0 to s15 of input vector 122. The multiple output elements 506 may include output elements d0 to d15 of output vector 120. The multiple input elements 502 (s0 to s15) can be ordered sequentially (e.g., "s0, s1, s2 ... s15"), where s0 is the first sequential element and s15 is the last sequential element. The multiple output elements 506 (d0 to d15) can be arranged in a similar sequential order "d0, d1, d2 ... d15".
Each input element of the multiple input elements 502 can have the same size. For example, each input element of the multiple input elements 502 can be 64 bits. Each output element of the multiple output elements 506 can also have the same size. For example, each output element of the multiple output elements 506 can be 64 bits. In a particular embodiment, each input element can have the same size (e.g., 64 bits) as each output element. The number of input elements can be equal to the number of output elements. For example, input vector 122 can have 16 input elements, and output vector 120 can have 16 output elements. The number and size of the elements are illustrative; the input elements and output elements can have sizes other than those illustrated, and the vectors (e.g., input vector 122 and output vector 120) can have sizes other than those illustrated (e.g., other numbers of elements). Although not illustrated, each input element may include multiple input sub-elements (e.g., four input sub-elements), and each output element may include four output sub-elements, as described with reference to Figs. 3 to 4. Based on the type indicated by the accumulating vector arithmetic reduction instruction, each input element and each output element can be real, imaginary, or complex, as described with reference to Figs. 3 to 4.
The multiple adders 504 may be arranged as multiple rows of adders, including first row 512, second row 514, third row 516, and fourth row 518. Although four rows of adders are illustrated, in other embodiments reduction tree 500 may include (e.g., be arranged as) fewer than four rows or more than four rows, for example based on the number of input elements and output elements. Each adder of the multiple adders 504 can have the same size. For example, each adder of the multiple adders 504 can be a 64-bit adder. Although not shown, each adder of the multiple adders 504 may include multiple sub-adders and can be configured to perform add operations on sub-elements in an interleaved manner, as described with reference to Figs. 3 to 4.
Each adder can provide its output to the adder in the same column of the next row, and the output can also be routed to other adders as shown in Fig. 5, so that reduction tree 500 can generate the multiple output elements 506 (d0 to d15). For example, the output of a first adder of first row 512 (e.g., the adder of first row 512 below input element s1) can be routed to a second adder of second row 514 (e.g., the adder of second row 514 below input element s2) and to a third adder of second row 514 (e.g., the adder of second row 514 below input element s3). The output of the third adder can be routed to a fourth adder, a fifth adder, a sixth adder, and a seventh adder of third row 516 (e.g., the adders of third row 516 below input elements s4 to s7, respectively). In addition, the output of the seventh adder can be routed to eight adders of fourth row 518 (e.g., the adders of fourth row 518 below input elements s8 to s15).
One or more adders of the multiple adders 504 can be selectively enabled based on the accumulating vector arithmetic reduction instruction. For example, one or more adders (illustrated as the unshaded adders of Fig. 5) can be selectively enabled by control logic (not shown) (e.g., control logic 210 of Fig. 2). One or more adders that are not enabled (illustrated as the shaded adders of Fig. 5) can be configured to output a received input (e.g., zero is added to a particular input), as described with reference to Fig. 7.
Reduction tree 500 can be configured to generate multiple output elements d0 to d15 simultaneously based on multiple input elements s0 to s15 and the accumulating vector arithmetic reduction instruction. For example, reduction tree 500 can be configured to provide first input element s0 as first output element d0, to add first input element s0 to second input element s1 to provide second output element d1, and to store first output element d0 and second output element d1 in output vector 120. Reduction tree 500 can be configured to add first element s0 and second element s1 to third element s2 to provide third output element d2. In addition, reduction tree 500 can be configured to generate output element d15 as the sum of all input elements s0 to s15. Output elements d3 to d14 can be generated in a similar manner as partial cumulative sums.
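The four-row, 16-element arrangement can be modeled in Python as a parallel-prefix network. The sketch below is an interpretation rather than a transcription of Fig. 5: the enable pattern is inferred from the routing described for rows 512 to 518 (each row forwards the last result of one block of columns to every adder in the following block), elements are plain integers, and `prefix_network` is a hypothetical name.

```python
from itertools import accumulate


def prefix_network(inputs):
    """Cumulative sums using log2(n) rows of adders (n must be a power of two)."""
    values = list(inputs)
    n = len(values)
    span = 1
    while span < n:                      # one loop iteration per row of adders
        previous_row = list(values)
        for i in range(n):
            if i & span:                 # adder in this column is enabled
                source = (i // span) * span - 1   # last column of the preceding block
                values[i] = previous_row[i] + previous_row[source]
            # otherwise the adder forwards its input unchanged (adds zero)
        span *= 2
    return values


s = list(range(1, 17))                   # s0..s15 holding the values 1..16
d = prefix_network(s)                    # d0..d15
reference = list(accumulate(s))          # sequential cumulative sums for comparison
```

All sixteen partial sums come out of four rows of additions, whereas a sequential running sum would need fifteen dependent additions; the cost is that several adders in each row share the same routed input.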
During operation, reduction tree 500 can be used to execute a received accumulating vector arithmetic reduction instruction. During execution of the instruction, reduction tree 500 can receive multiple input elements 502 from input vector 122. During execution of the instruction, multiple adders of the multiple adders 504 can be selectively enabled to provide (e.g., generate) multiple output elements d0 to d15, and the multiple output elements d0 to d15 can be stored in output vector 120.
Referring to Fig. 6, a block diagram of a fourth illustrative embodiment of reduction tree 600 is disclosed. Reduction tree 600 can be used during execution of an accumulating vector arithmetic reduction instruction (e.g., accumulating vector arithmetic reduction instruction 101 of Fig. 1 or vector instruction 220 of Fig. 2). Reduction tree 600 may include reduction tree 206 of Fig. 2, reduction tree 300 of Fig. 3, reduction tree 400 of Fig. 4, reduction tree 500 of Fig. 5, or a combination thereof. Reduction tree 600 can be configured to receive multiple input elements from input vector 122 and to generate multiple output elements of output vector 610 based on the accumulating vector arithmetic reduction instruction. Although Fig. 6 illustrates multiple adders, reduction tree 600 may include multiple other arithmetic operation units.
Reduction tree 600 can receive multiple input elements from input vector 122, including first input element 302 and second input element 304. First input element 302 may include input sub-elements s0 to s3, and second input element 304 may include input sub-elements s4 to s7. The input elements and input sub-elements can have sizes indicated by the accumulating vector arithmetic reduction instruction. For example, input elements 302 and 304 can be 64 bits, and input sub-elements s0 to s7 can be 16 bits. Output vector 610 may include first output element 306 and second output element 608. First output element 306 may include output sub-elements d0 to d3, and second output element 608 may include output sub-elements d4 to d7. The output elements and output sub-elements can have sizes indicated by the accumulating vector arithmetic reduction instruction. For example, output elements 306 and 608 can be 64 bits, and output sub-elements d0 to d7 can be 16 bits. Although depicted as including two elements, input vector 122 and output vector 610 may include any number of elements (e.g., any number of sub-elements) and can have sizes other than 64 bits.
Reduction tree 600 may include multiple adders that are configured to be selectively enabled based on the accumulating vector arithmetic reduction instruction to generate output vector 610, including first adder 320, second adder 321, third adder 618, and fourth adder 619. The multiple adders may include (e.g., be arranged as) multiple rows, including first row 312, second row 614, and third row 616. Each adder of the multiple adders may include multiple sub-adders. For example, each adder of the multiple adders can be a 64-bit adder and may include four 16-bit sub-adders. One or more adders of the multiple adders can be selectively enabled based on the accumulating vector arithmetic reduction instruction. For example, first adder 320 (e.g., sub-adders 322 to 328) can be selectively enabled as described with reference to Fig. 3.
The third adder 618 in the second row 614 may include a fifth sub-adder 625 configured to add the output of the first sub-adder 322 to the output of the third sub-adder 326. The third adder 618 may also include a sixth sub-adder 627 configured to add the output of the second sub-adder 324 to the output of the fourth sub-adder 328. By adding the sub-adder outputs, the third adder 618 may apply an arithmetic reduction to the outputs of sub-adders 322, 324, 326, and 328 to generate two reduced outputs, one from sub-adder 625 and one from sub-adder 627. Similarly, the fourth adder 619 in the third row 616 may use a seventh sub-adder 629 to apply an arithmetic reduction to the outputs of sub-adders 625 and 627 and generate an additional reduced value. Accordingly, the second output element 608 may include a 16-bit reduced value and other partial values based on the multiple input sub-elements s0 to s7. For example, output sub-element d4 may equal the sum of input sub-elements s0 and s4, output sub-element d5 may equal the sum of input sub-elements s1 and s5, output sub-element d6 may equal the sum of input sub-elements s0, s2, s4, and s6, and output sub-element d7 may equal the sum of input sub-elements s0 to s7.
During operation, reduction tree 600 may be used to execute the accumulating vector arithmetic reduction instruction. During execution of the instruction, one or more adders of the multiple adders may be selectively enabled, based on the instruction, to generate the multiple output elements 306 and 608 (e.g., the multiple output sub-elements d0 to d7) to be stored in output vector 610.
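The data flow that produces the second output element 608 can be sketched in software as follows. This is only an illustrative model, not the described circuit: the function name is hypothetical, the first-row sub-adders 322 to 328 are assumed to add s[i] and s[i+4] lane by lane (an inference from the stated values of d4 and d5), and each 16-bit lane is assumed to wrap on overflow.

```python
MASK16 = 0xFFFF  # assumption: each sub-element is a 16-bit lane that wraps on overflow


def second_output_element(s):
    """Return [d4, d5, d6, d7] for eight 16-bit input sub-elements s0..s7."""
    assert len(s) == 8
    # First row (sub-adders 322, 324, 326, 328): assumed to compute s[i] + s[i+4].
    row1 = [(s[i] + s[i + 4]) & MASK16 for i in range(4)]
    # Second row: sub-adder 625 adds the outputs of 322 and 326,
    # and sub-adder 627 adds the outputs of 324 and 328.
    out_625 = (row1[0] + row1[2]) & MASK16
    out_627 = (row1[1] + row1[3]) & MASK16
    # Third row: sub-adder 629 adds the outputs of 625 and 627.
    out_629 = (out_625 + out_627) & MASK16
    # d4 = s0+s4, d5 = s1+s5, d6 = s0+s2+s4+s6, d7 = s0+...+s7
    return [row1[0], row1[1], out_625, out_629]
```

With inputs s0 to s7 equal to 1 through 8, the model yields d4 = 6, d5 = 8, d6 = 16, and d7 = 36, matching the sums listed above.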
Referring to Fig. 7, a block diagram of an illustrative embodiment of a portion of a reduction tree 700 is disclosed. The portion of reduction tree 700 may be a part of reduction tree 206 of Fig. 2, reduction tree 300 of Fig. 3, reduction tree 400 of Fig. 4, reduction tree 500 of Fig. 5, or reduction tree 600 of Fig. 6. The portion of reduction tree 700 may be used during execution of a vector instruction (e.g., the accumulating vector arithmetic reduction instruction 101 of Fig. 1, the vector instruction 220 of Fig. 2, the segmented vector arithmetic reduction instruction 901 described with reference to Fig. 9, or the rotation segmented vector arithmetic reduction instruction 1001 described with reference to Fig. 10). The portion of reduction tree 700 may be configured to receive, based on the vector instruction, a first input element 702 (s0) from an input vector, and to generate a first output element 706 (d0) to be stored in an output vector.
The portion of reduction tree 700 may include a first multiplexer 720 coupled to a first adder 712. The first multiplexer 720 is configured to receive the first input element 702 (s0) as a first mux input and to receive a zero input (e.g., an input having a value equal to logical zero) as a second mux input. Although the first adder 712 is illustrated, in other embodiments the portion of reduction tree 700 may include a different arithmetic operation unit (e.g., a subtractor). The first multiplexer 720 may be configured to receive a first control signal 744 from control logic (e.g., control logic 210 of Fig. 2). The first multiplexer 720 may be configured to select between the first mux input and the second mux input based on the first control signal 744, and to provide the mux output as a first adder input 732 of the first adder 712. For example, when the first control signal 744 has a particular value, the first multiplexer 720 may provide the first input element 702 to the first adder 712 as the first adder input 732. When the first control signal 744 has a different value, the first multiplexer 720 may provide the zero input to the first adder 712 as the first adder input 732. Accordingly, the control logic may be configured (e.g., by setting the first control signal 744) to cause a subset of the multiple adders to receive the zero input (e.g., a value equal to logical zero) based on the vector instruction.
The portion of reduction tree 700 may include a first saturation logic circuit 730 coupled to the first adder 712 and configured to saturate the output of the first adder 712. Saturating the output of the first adder 712 may prevent the output from exceeding a maximum value or falling below a minimum value. The first saturation logic circuit 730 may be configured to provide a saturated output (e.g., a value) based on the output of the first adder 712. For example, when the output of the first adder 712 is between the minimum value and the maximum value, the saturated output may have a value equal to the output of the first adder 712. When the output of the first adder 712 exceeds the maximum value, the saturated output may have the maximum value, and when the output of the first adder 712 is less than the minimum value, the saturated output may have the minimum value.
The portion of reduction tree 700 may include a second multiplexer 724 coupled to the first saturation logic circuit 730. The second multiplexer 724 may be configured to receive the saturated output of the first saturation logic circuit 730 as a third mux input and to receive the output of the first multiplexer 720 as a fourth mux input. The second multiplexer 724 may be configured to select between the third mux input and the fourth mux input based on a second control signal 746, and to provide the mux output as the first output element 706 to be stored in the output vector. When the second control signal 746 has a particular value, the second multiplexer 724 may bypass the first adder 712 (e.g., provide the fourth mux input as the mux output). When the first adder 712 is not bypassed, the first adder 712 adds the first adder input 732 to a second adder input 734. The second adder input 734 may be a value received from the output of another adder, zero, or some other value. By selecting the fourth mux input, the second multiplexer 724 may skip the add operation on the first adder input 732 and the second adder input 734 and may provide the output of the first multiplexer 720 as the mux output. Accordingly, the control logic may be configured to bypass the first adder 712 based on the vector instruction. In an alternative embodiment, the first adder 712 may be bypassed by disabling a clock input (not shown).
Although only one input element is shown, the portion of reduction tree 700 may operate on any number of input elements. For example, the portion of reduction tree 700 may include additional circuitry (e.g., multiplexers, adders, saturation logic circuits, and connections) to operate on an input vector having more than one input element. For example, the portion of reduction tree 700 may include additional rows of adders, where each additional adder includes a corresponding first multiplexer, saturation logic circuit, and third multiplexer. The additional circuitry and adders may be controlled by additional control signals from the control logic. Accordingly, the portion of reduction tree 700 may be included in each of the reduction trees 300 to 600 of Figs. 3 to 6.
During execution of the vector instruction, the portion of reduction tree 700 may be configured to receive the first input element 702 and to generate the first output element 706 to be stored in the output vector. The first multiplexer 720 may provide the zero input to the first adder 712 based on the first control signal 744. The first saturation logic circuit 730 may saturate the output of the first adder 712. The second multiplexer 724 may bypass the first adder 712 based on the second control signal 746.
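The behavior of this slice (input multiplexer, adder, saturation logic, and bypass multiplexer) can be summarized with a small software model. This is a sketch under stated assumptions rather than the circuit itself: the function and parameter names are hypothetical, the two control signals are modeled as booleans, and the saturation bounds are assumed to be those of a signed 16-bit value.

```python
def saturate(value, lo=-(1 << 15), hi=(1 << 15) - 1):
    """Clamp value to [lo, hi]; the signed 16-bit bounds are an assumption."""
    return max(lo, min(hi, value))


def tree_slice(s0, second_adder_input, select_input, bypass_adder):
    """Model one slice of the reduction tree portion of Fig. 7.

    select_input stands in for first control signal 744 and
    bypass_adder stands in for second control signal 746.
    """
    # First multiplexer 720: pass the input element or the zero input.
    first_adder_input = s0 if select_input else 0
    # First adder 712: add the two adder inputs.
    total = first_adder_input + second_adder_input
    # First saturation logic circuit 730: clamp the adder output.
    saturated = saturate(total)
    # Second multiplexer 724: take the saturated sum, or bypass the adder
    # and forward the output of the first multiplexer.
    return first_adder_input if bypass_adder else saturated
```

For instance, adding 30000 and 10000 with the adder enabled returns the assumed maximum of 32767, while setting bypass_adder forwards 30000 unchanged.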
Referring to Fig. 8, a block diagram of a fifth illustrative embodiment of a reduction tree 800 is disclosed. Reduction tree 800 may include reduction tree 206 of Fig. 2, one or more of reduction trees 300 to 600 of Figs. 3 to 6 (as further described herein), the portion of reduction tree 700 of Fig. 7, or any combination thereof. Reduction tree 800 may be used during execution of a segmented vector arithmetic reduction instruction (e.g., the segmented vector arithmetic reduction instruction 901 described with reference to Fig. 9, or the rotation segmented vector arithmetic reduction instruction 1001 described with reference to Fig. 10). Reduction tree 800 may be selectively configured, based on a section grouping size included in the segmented vector arithmetic reduction instruction, to enable execution of the vector instruction. The section grouping size may be associated with the size of one or more groups of multiple input elements 802. For example, executing the segmented vector arithmetic reduction instruction may include grouping the multiple input elements 802 into one or more groups having the section grouping size and then performing one or more segmented vector arithmetic reduction operations on the one or more groups. Reduction tree 800 may be configured to enable execution of multiple segmented vector arithmetic reduction instructions that each have a different section grouping size. For example, reduction tree 800 may be configured to enable execution of a first segmented vector arithmetic reduction instruction having a section grouping size of two and a second segmented vector arithmetic reduction instruction having a section grouping size of four. Although section grouping sizes of two and four are described, reduction tree 800 may support other section grouping sizes.
Reduction tree 800 may include multiple input elements 802 (e.g., multiple input elements s0 to s15), multiple adders 804, and multiple outputs (e.g., the outputs of the bottom row of adders) configured to output multiple output elements 806 (d0 to d15). Although Fig. 8 illustrates multiple adders 804, reduction tree 800 may include other arithmetic operation units in other embodiments. A processor (e.g., processor 202 of Fig. 2) may be configured to use reduction tree 800 during execution of a first segmented vector arithmetic reduction instruction that includes a first section grouping size and during execution of a second segmented vector arithmetic reduction instruction that includes a second section grouping size. Reduction tree 800 may be configured to generate the multiple output elements 806 (d0 to d15) concurrently. For example, the multiple output elements 806 (d0 to d15) may be generated during a single processor execution cycle associated with execution of the first segmented vector arithmetic reduction instruction.
Reduction tree 800 may be configured to receive the multiple input elements 802 (s0 to s15) from input vector 822. Reduction tree 800 may be configured to generate the multiple output elements 806 (d0 to d15) to be stored in output vector 820. The multiple input elements 802 (s0 to s15) may be ordered sequentially (e.g., "s0, s1, s2 ... s15"), where s0 is the first sequential element and s15 is the last sequential element. The multiple output elements 806 (d0 to d15) may be ordered in a similar sequence (e.g., "d0, d1, d2 ... d15"), where d0 is the first sequential element and d15 is the last sequential element.
Reduction tree 800 may have the same number of input elements as output elements, and each input element may have the same size as each output element. For example, input vector 822 may include sixteen 64-bit input elements, and output vector 820 may include sixteen 64-bit output elements. Although not shown, each input element may include multiple 16-bit input sub-elements, and each output element may include multiple 16-bit output sub-elements, as described with reference to Figs. 3 to 4. The multiple input elements and the multiple output elements may represent real values, imaginary values, or a combination thereof. In a particular embodiment, when the input type is complex, each input element of the multiple input elements may include a corresponding real part and a corresponding imaginary part. Each output element may be generated by performing, in an interleaved manner, a first arithmetic operation on one or more real parts and a second arithmetic operation on one or more imaginary parts, as described with reference to Figs. 3 to 4.
Although 64-bit elements and 16-bit sub-elements are described, each input element and each output element may have a size other than 64 bits, and each input sub-element and each output sub-element may have a size other than 16 bits.
The multiple adders 804 may be arranged in multiple rows of adders, as shown. The multiple adders 804 may include (e.g., be arranged in) a first row 812, a second row 814, a third row 816, and a fourth row 818. Although four rows of adders are illustrated, reduction tree 800 may alternatively include (e.g., be arranged in) fewer than four rows or more than four rows (e.g., based on the number of input elements and the number of output elements). Each adder of the multiple adders 804 may have the same size. For example, each adder of the multiple adders 804 may be a 64-bit adder. Although not shown, each adder of the multiple adders 804 may include multiple sub-adders and may be configured to perform add operations on sub-elements in an interleaved manner, as described with reference to Figs. 3 to 4.
Adder outputs from one or more adders in one or more rows may be selectively routed via multiple paths 830 to 844 (shown as dashed paths in Fig. 8), so that reduction tree 800 can generate the multiple output elements 806 (d0 to d15). For example, a first value generated by a first adder 850 may be provided to a second adder 852 via a first path 830, a second value generated by the second adder 852 may be provided to a third adder 854 via a second path 840, and a third value generated by the third adder 854 may be provided to a fourth adder 856 via a third path 844. Other values may similarly be provided between one or more adders via paths 832 to 836 and 842. Each path of the multiple paths 830 to 844 may be selectively enabled based on the section grouping size of the segmented vector arithmetic reduction instruction. For example, based on the segmented vector arithmetic reduction instruction (e.g., based on the section grouping size), the first path 830 may be enabled by selecting the first value generated by the first adder 850 as an adder input of the second adder 852, and the first path 830 may be disabled by selecting a zero input as the adder input of the second adder 852. One or more adders of the multiple adders 804 may have a corresponding multiplexer (not shown) configured to select an adder input, such as the first multiplexer 720 described with reference to Fig. 7, which selects the adder input from between a zero input and a value provided by the corresponding path. The corresponding multiplexer may, based on a control signal, enable the corresponding path (e.g., select the input provided by the corresponding path) or disable the corresponding path (e.g., select the zero input), as described with reference to Fig. 7.
The processor may include control logic (e.g., control logic 210 of Fig. 2) configured to selectively configure reduction tree 800 based on the section grouping size of the segmented vector arithmetic reduction instruction. Selectively configuring reduction tree 800 may include, based on the section grouping size, selectively enabling one or more adders (illustrated as the unshaded adders in Fig. 8) and selecting the corresponding adder inputs. For example, during execution of the first segmented vector arithmetic reduction instruction, the control logic may be configured to selectively enable a first subset of the multiple adders 804 based on the first section grouping size and to select a corresponding first subset of adder inputs (e.g., reduction tree 800 may be placed in a first configuration). During execution of the second segmented vector arithmetic reduction instruction, the control logic may be configured to selectively enable a second subset of the multiple adders 804 based on the second section grouping size and to select a corresponding second subset of adder inputs (e.g., reduction tree 800 may be placed in a second configuration). A particular configuration of reduction tree 800 may be associated with a particular subset of enabled adders and a particular subset of selected adder inputs. The control logic may use one or more control signals to selectively enable the particular subset of the multiple adders 804 and to select the corresponding subset of adder inputs (e.g., to selectively enable a particular subset of the multiple paths 830 to 844), as described with reference to Fig. 7. For example, when the section grouping size is two, each of the multiple paths 830 to 844 may be disabled (e.g., the zero input may be selected for each adder input associated with each of the multiple paths 830 to 844), and only the unshaded adders in the first row 812 may be enabled. When the section grouping size is four, only a first subset of the paths (830 to 836) and the unshaded adders in rows 812 to 814 may be enabled. When the section grouping size is eight, only a second subset of the paths (830 to 842) and the unshaded adders in rows 812 to 816 may be enabled. When the section grouping size is sixteen, all of the multiple paths 830 to 844 and all of the unshaded adders in rows 812 to 818 may be enabled. Accordingly, the control logic may be configured to selectively enable a subset of the adders and a subset of the paths (e.g., to select a subset of the corresponding adder inputs) based on the section grouping size.
By selectively enabling one or more adders of the multiple adders 804 and selecting one or more corresponding adder inputs, reduction tree 800 may be configured to generate the multiple output elements 806 (d0 to d15) concurrently, based on the multiple input elements 802 (s0 to s15) and the section grouping size included in the segmented vector arithmetic reduction instruction (e.g., the first segmented vector arithmetic reduction instruction or the second segmented vector arithmetic reduction instruction). For example, when the section grouping size is two, reduction tree 800 may generate (e.g., provide) a first output element d1 equal to s0+s1, a second output element d3 equal to s2+s3, a third output element d5 equal to s4+s5, a fourth output element d7 equal to s6+s7, a fifth output element d9 equal to s8+s9, a sixth output element d11 equal to s10+s11, a seventh output element d13 equal to s12+s13, and an eighth output element d15 equal to s14+s15. When the section grouping size is four, reduction tree 800 may generate the second output element d3 equal to s0+s1+s2+s3, the fourth output element d7 equal to s4+s5+s6+s7, the sixth output element d11 equal to s8+s9+s10+s11, and the eighth output element d15 equal to s12+s13+s14+s15. When the section grouping size is eight, reduction tree 800 may generate the fourth output element d7 equal to s0+s1+s2+s3+s4+s5+s6+s7 and the eighth output element d15 equal to s8+s9+s10+s11+s12+s13+s14+s15. When the section grouping size is sixteen, reduction tree 800 may generate the eighth output element d15 equal to the sum of all input elements s0 to s15. Accordingly, based on the section grouping size, reduction tree 800 may be configured to selectively enable one or more adders of the multiple rows 812 to 818 and to select one or more corresponding adder inputs, so as to generate the multiple output elements 806 concurrently.
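The row-by-row behavior described above can be modeled in software as a binary tree in which each additional row doubles the span being summed, and a row is active only while its span is smaller than the section grouping size. The following is a minimal sketch, not the circuit of Fig. 8: the function name and the exact pairing of adders within each row are assumptions, and overflow and saturation are ignored.

```python
def segmented_reduce(inputs, group_size):
    """Return {output index: sum} for each group of `group_size` inputs.

    Assumes group_size is a power of two that divides len(inputs).
    """
    n = len(inputs)
    assert group_size >= 2 and group_size & (group_size - 1) == 0
    assert n % group_size == 0
    values = list(inputs)
    span = 1
    # One pass per enabled adder row: a size of 2 uses one row,
    # 4 uses two rows, 8 uses three rows, and 16 uses four rows.
    while span < group_size:
        step = 2 * span
        for i in range(step - 1, n, step):
            # Enabled path: feed the partial sum from `span` positions back.
            values[i] = values[i] + values[i - span]
        span = step
    # Each group's total lands in the last position of that group.
    return {i: values[i] for i in range(group_size - 1, n, group_size)}
```

With inputs 0 through 15, a section grouping size of four yields d3 = 6, d7 = 22, d11 = 38, and d15 = 54, and a size of sixteen yields d15 = 120.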
During operation, reduction tree 800 may be used to execute a segmented vector arithmetic reduction instruction. During execution of the segmented vector arithmetic reduction instruction, reduction tree 800 may receive the multiple input elements 802 (s0 to s15) from input vector 822. For example, during execution of the first segmented vector arithmetic reduction instruction, the multiple input elements 802 (s0 to s15) may be grouped into one or more first groups having the first section grouping size, and during execution of the second segmented vector arithmetic reduction instruction, the multiple input elements may be grouped into one or more second groups having the second section grouping size. During execution of the segmented vector arithmetic reduction instruction, one or more adders of the multiple adders 804 may be selectively enabled to generate the multiple output elements 806 (d0 to d15) using the multiple outputs (e.g., the outputs of the adders of the fourth row 818), and the multiple output elements 806 (d0 to d15) may be stored in output vector 820.
Reduction tree 800 makes it possible to use a single reduction tree to execute both the first segmented vector arithmetic reduction instruction having the first section grouping size and the second segmented vector arithmetic reduction instruction having the second section grouping size. Compared with a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Fig. 9, a diagram of a particular illustrative process of executing a vector instruction is disclosed and generally designated 900. The vector instruction may include a segmented vector arithmetic reduction instruction, such as an illustrative segmented vector arithmetic reduction instruction 901. The segmented vector arithmetic reduction instruction 901 may be executed at a processor (e.g., processor 202 of Fig. 2) that includes a reduction tree, such as reduction tree 206 of Fig. 2, one or more of reduction trees 300 to 600 of Figs. 3 to 6, the portion of reduction tree 700 of Fig. 7, reduction tree 800 of Fig. 8, or any combination thereof. The processor may receive an input vector that includes multiple input elements 902 stored in an input register 910. The processor may process the multiple input elements 902 and concurrently generate multiple output elements 924 (e.g., contents) of an output register 920.
The multiple output elements 924 may be based on the segmented vector arithmetic reduction instruction 901. For example, executing the segmented vector arithmetic reduction instruction 901 may generate a particular output element by adding a particular input element of the multiple input elements 902 to one or more other input elements of the multiple input elements 902, based on the section grouping size of the segmented vector arithmetic reduction instruction 901.
Input register 910 may include the multiple input elements 902. For example, the multiple input elements 902 (e.g., the input vector) may include N elements, where N is an integer greater than one. The multiple input elements 902 may include input elements s0 to s(N-1). The multiple input elements 902 may be stored in sequential order (e.g., "s0, s1, s2 ... s(N-1)"), where s0 is the first sequential input element and s(N-1) is the last sequential input element. Although five input elements are shown, the number of the multiple input elements 902 (e.g., N) may be more than five or fewer than five.
Before the segmented vector arithmetic reduction instruction 901 is executed, output register 920 may include multiple previous elements 922. The multiple previous elements 922 may include previous elements d0 to d(N-1). The multiple previous elements 922 may be included in another vector (e.g., rotation vector 280 of Fig. 2) or in a different vector. The multiple previous elements 922 may be stored at a location identified by the segmented vector arithmetic reduction instruction 901, such as another register or a location in memory. The multiple previous elements may be included in the segmented vector arithmetic reduction instruction 901, or may be indicated (e.g., by index) by a value stored in a field or parameter of the segmented vector arithmetic reduction instruction 901. Before the segmented vector arithmetic reduction instruction is executed, the multiple previous elements 922 may be stored in sequential order. For example, the multiple previous elements 922 may be stored in the particular sequential order "d0, d1, d2, d3 ... d(N-1)" (e.g., d0 is the first sequential previous element and d(N-1) is the last sequential previous element).
Process 900 illustrates execution of the segmented vector arithmetic reduction instruction 901 having an illustrative section grouping size of two. Executing the segmented vector arithmetic reduction instruction may include grouping the multiple input elements 902 into multiple groups, such as a first set of input elements 904 and a second set of input elements 906. A first arithmetic (e.g., add) operation may be performed on the first set of input elements 904 to generate a first result equal to s0+s1, and a second arithmetic (e.g., add) operation may be performed on the second set of input elements 906 to generate a second result equal to s2+s3. The first result (s0+s1) may be inserted into a first output element 916 of output register 920, and the second result (s2+s3) may be inserted into a second output element 918 of output register 920. When the number of results generated is less than the number of output elements in output register 920, one or more previous elements of the multiple previous elements 922 may remain in output register 920 (e.g., may not be overwritten). For example, when the first output element 916 and the second output element 918 are inserted into output register 920, the multiple output elements 924 may include previous elements d0 and d2. When the section grouping size of the segmented vector arithmetic reduction instruction 901 is a different size, the multiple input elements 902 may be grouped into different sets of input elements, and different results may be generated.
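The effect of process 900 on the output register can be sketched as follows. The function name is hypothetical, and the placement of each group's result in the last position of the group is an assumption, chosen because it is consistent with previous elements d0 and d2 being retained.

```python
def execute_segmented_add(previous, inputs, group_size):
    """Model the output register after a segmented add with `group_size`.

    `previous` holds the output register contents before execution;
    positions that receive no result keep their previous values.
    """
    assert len(previous) == len(inputs)
    assert len(inputs) % group_size == 0
    output = list(previous)  # start from the previous elements d0..d(N-1)
    for start in range(0, len(inputs), group_size):
        last = start + group_size - 1
        # Overwrite only the last position of each group with the group sum.
        output[last] = sum(inputs[start:last + 1])
    return output
```

For example, with previous elements [100, 200, 300, 400], inputs [1, 2, 3, 4], and a section grouping size of two, the modeled output register holds [100, 3, 300, 7]: s0+s1 and s2+s3 are inserted while the previous d0 and d2 remain.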
As illustrated in Fig. 9, the segmented vector arithmetic reduction instruction 901 may include an instruction name 980 (e.g., an opcode), depicted as vraddw. The segmented vector arithmetic reduction instruction 901 may also include a first field 982 (Vu), a second field 984 (Vd), a third field 986 (Q), a fourth field 988 (Op), a fifth field 990 (s2), a sixth field 992 (sc32), and a seventh field 994 (sat). A first value stored in the first field 982 may indicate the input vector, such as the input vector stored in input register 910. In an alternative embodiment, the first value stored in the first field 982 may indicate a pair of input vectors (e.g., a vector Vu and an additional vector Vv), where the first vector of the pair (e.g., Vu) is associated with real numbers and the second vector of the pair (e.g., Vv) is associated with imaginary numbers. A second value in the second field 984 may indicate the output vector stored in output register 920 that is used during execution of the segmented vector arithmetic reduction instruction 901. A third value stored in the third field 986 may indicate a mask (e.g., mask Q), as described with reference to Figs. 11A to B. A fourth value stored in the fourth field 988 may indicate an operation vector (e.g., operation vector Op). A fifth value stored in the fifth field 990 may indicate the section grouping size (e.g., "s2" may indicate a section grouping size of two). A sixth value stored in the sixth field 992 may indicate the type of the input values (e.g., "sc32" may indicate a 32-bit complex input type). A seventh value stored in the seventh field 994 may indicate whether saturation is performed during execution of the segmented vector arithmetic reduction instruction. Although seven fields are described, the segmented vector arithmetic reduction instruction may include more fields or fewer fields.
Although add operations have been described, the segmented vector arithmetic reduction instruction 901 is not limited to performing only add operations. For example, the segmented vector arithmetic reduction instruction 901 may indicate one or more arithmetic operations to be performed on the multiple input elements 902. The one or more arithmetic operations may include add operations and subtract operations. The one or more arithmetic operations may be indicated by a value in a particular field (e.g., a particular parameter), such as the fourth field 988. For example, the fourth field 988 may include a pointer to a location in memory that stores an operation vector (e.g., a vector indicating the one or more arithmetic operations) or to a register that stores the operation vector. Each element of the operation vector may indicate a particular operation (e.g., an add operation or a subtract operation) to be performed on the corresponding element of the multiple input elements 902 during execution of the segmented vector arithmetic reduction instruction 901. For example, executing the segmented vector arithmetic reduction instruction may include grouping the multiple input elements 902 into one or more input groups based on the section grouping size and performing the one or more arithmetic operations on the one or more input groups to generate the multiple output elements 924. When at least one of the one or more arithmetic operations is a subtract operation, one or more elements of the multiple input elements 902 may be complemented before the multiple output elements 924 are generated.
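One way to picture the operation vector is as a per-element add/subtract selector, with subtraction carried out by complementing the input before it enters the adders. The sketch below is illustrative only: the "+"/"-" encoding of the operation vector, the function names, the 16-bit width, and the use of two's complement are all assumptions not stated in the text.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1  # assumption: 16-bit elements


def complement(x):
    """Two's-complement negation within WIDTH bits (assumed encoding)."""
    return (~x + 1) & MASK


def to_signed(x):
    """Interpret a WIDTH-bit pattern as a signed integer."""
    x &= MASK
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x


def reduce_with_ops(inputs, ops, group_size):
    """Reduce each group, adding or subtracting each element per `ops`."""
    assert len(inputs) == len(ops) and len(inputs) % group_size == 0
    results = []
    for start in range(0, len(inputs), group_size):
        acc = 0
        for x, op in zip(inputs[start:start + group_size],
                         ops[start:start + group_size]):
            # A subtract is performed by complementing the input, then adding.
            term = complement(x & MASK) if op == "-" else x & MASK
            acc = (acc + term) & MASK
        results.append(to_signed(acc))
    return results
```

For example, inputs [5, 3, 10, 4] with operations ["+", "-", "+", "+"] and a section grouping size of two reduce to [2, 14], since the first group computes 5 - 3 and the second computes 10 + 4.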
During operation, the processor may receive the segmented vector arithmetic reduction instruction 901. The processor may execute the segmented vector arithmetic reduction instruction 901 using the multiple input elements 902 to generate the multiple output elements 924 and store them in output register 920. The multiple output elements 924 may represent results based on grouping the multiple input elements 902 into one or more groups of input elements, where the grouping is based on the section grouping size of the segmented vector arithmetic reduction instruction 901.
By generating the multiple output elements 924 based on the section grouping size of the segmented vector arithmetic reduction instruction 901, the segmented vector arithmetic reduction instruction 901 makes it possible to use a single reduction tree to execute multiple segmented vector arithmetic reduction instructions having different section grouping sizes. Compared with a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Fig. 10, a diagram of a particular illustrative process of executing a rotation segmented vector arithmetic reduction instruction is disclosed and generally designated 1000. The rotation segmented vector arithmetic reduction instruction may be a single vector instruction and may include an illustrative rotation segmented vector arithmetic reduction instruction 1001. The rotation segmented vector arithmetic reduction instruction 1001 may be executed at a processor (e.g., processor 202 of Fig. 2) that includes a reduction tree, such as reduction tree 206 of Fig. 2, one or more of reduction trees 300 to 600 of Figs. 3 to 6, the portion of reduction tree 700 of Fig. 7, reduction tree 800 of Fig. 8, or any combination thereof. The processor may receive an input vector that includes the multiple input elements 902 stored in input register 910. The processor may process the multiple input elements 902 and concurrently generate multiple output elements 1024 (e.g., contents) of output register 920.
The rotate segmented vector arithmetic reduction instruction 1001 may include an instruction name 1080 (for example, an opcode) depicted as vraddw. The rotate segmented vector arithmetic reduction instruction 1001 may also include a first field 1082 (Vu), a second field 1084 (Vd), a third field 1086 (Q), a fourth field 1088 (Op), a fifth field 1090 (s2), a sixth field 1092 (sc32), a seventh field 1094 (sat), and an eighth field 1096 (rot). Although eight fields are illustrated, the rotate segmented vector arithmetic reduction instruction 1001 may include more fields or fewer fields. The fields 1082 to 1094 may correspond to the fields of the segmented vector arithmetic reduction instruction 901 of Fig. 9. The value stored in the eighth field 1096 may indicate whether a rotation is to occur. For example, the value stored in the eighth field 1096 may indicate the direction and the size of the rotation to be performed. The rotation may have a rotation amount equal to the size of one input element (for example, 64 bits), and may be to the left. In other embodiments, the value stored in the eighth field 1096 may indicate other rotation sizes and directions. As another example, the value stored in the eighth field 1096 may indicate that no rotation is to occur (for example, the rotate segmented vector arithmetic reduction instruction 1001 may operate similarly to the segmented vector arithmetic reduction instruction 901 of Fig. 9). In a particular embodiment, a value (not shown) stored in a ninth field may indicate whether to overwrite (for example, set equal to zero) the multiple previous elements 922 (for example, the contents) in the output register 920 before the results of the arithmetic operations are stored in the output register 920. In an alternative embodiment, a value stored in a different field (for example, the eighth field 1096) may indicate whether to overwrite the multiple previous elements 922 in the output register 920.
The rotate segmented vector arithmetic reduction instruction 1001 may be executed in the same manner as the segmented vector arithmetic reduction instruction 901, with an added rotation step. For example, executing the rotate segmented vector arithmetic reduction instruction 1001 may include determining, before the results of the arithmetic operations are generated, whether to rotate the multiple previous elements 922 in the output register 920. In response to a first determination to rotate the multiple previous elements 922 (for example, based on the value stored in the eighth field 1096), the multiple previous elements 922 (for example, the contents) in the output register 920 may be rotated by the rotation amount indicated by the eighth field 1096. For example, when the rotation amount is 64 bits and the direction is to the right, the multiple previous elements 922 may be rotated to the right by one previous element. Thus, during execution of the rotate segmented vector arithmetic reduction instruction 1001 (for example, before the results are generated and stored in the output register 920), the first sequential element of the output register 920 may store d(N-1), the second sequential element may store d(0), the third sequential element may store d(1), and the last sequential element may store d(N-2). As another example, when the direction is to the left, the multiple previous elements 922 may be rotated to the left by the rotation amount. In response to a second determination not to rotate the multiple previous elements 922 (for example, based on the value stored in the eighth field 1096), the multiple previous elements 922 may be maintained in their previous sequential order (for example, d(0) ... d(N-1)). For example, when the value stored in the eighth field 1096 is zero or a null value (for example, when the eighth field 1096 is not included in the rotate segmented vector arithmetic reduction instruction 1001), the multiple previous elements 922 may not be rotated. Thus, the multiple previous elements 922 may be selectively (for example, optionally) rotated based on the rotate segmented vector arithmetic reduction instruction 1001.
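The rotation step described above can be sketched as follows, as an illustrative, non-limiting software model. The function name `rotate_elements`, the `direction` strings, and the choice to express the rotation amount in whole elements (rather than bits) are assumptions made for illustration; the actual encoding of the eighth field 1096 is not modeled.

```python
def rotate_elements(elements, amount, direction):
    """Rotate a list of register elements by `amount` element positions.

    `amount` is counted in elements (a 64-bit rotation of 64-bit elements
    corresponds to amount=1). An amount of 0 leaves the order unchanged.
    """
    elements = list(elements)
    n = len(elements)
    if n == 0:
        return elements
    k = amount % n
    if k == 0:
        return elements
    if direction == "right":
        # The last k elements wrap around to the front.
        return elements[n - k:] + elements[:n - k]
    if direction == "left":
        # The first k elements wrap around to the back.
        return elements[k:] + elements[:k]
    raise ValueError("direction must be 'left' or 'right'")


previous = ["d0", "d1", "d2", "d3"]                 # d(0) .. d(N-1), N = 4
rotated_right = rotate_elements(previous, 1, "right")
rotated_left = rotate_elements(previous, 1, "left")
```

With N = 4, the right rotation yields the order d(N-1), d(0), d(1), d(N-2), matching the ordering given in the text.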
Executing the rotate segmented vector arithmetic reduction instruction 1001 may also include determining whether to overwrite the multiple previous elements 922. For example, based on the rotate segmented vector arithmetic reduction instruction 1001 (for example, based on the value stored in the ninth field), each of the multiple previous elements 922 that is not replaced by a result of the arithmetic operations may be set to zero (for example, overwritten). A particular previous element may be set to zero by a corresponding adder in the reduction tree whose two inputs both receive zero (as illustrated by the adder below input element s0 in the first row of adders 812 of Fig. 8). In other embodiments, the multiple previous elements 922 may be set to (for example, overwritten with) a different value.
After the multiple previous elements 922 in the output register 920 have been rotated, the results of the arithmetic operations may be generated based on the multiple input elements 902 and inserted into the output register 920. Executing the rotate segmented vector arithmetic reduction instruction 1001 may include grouping the multiple input elements 902 into multiple groups, such as a first set of input elements 904 and a second set of input elements 906. A first arithmetic (for example, addition) operation may be performed on the first set of input elements 904 to generate a first result s0+s1, and a second arithmetic (for example, addition) operation may be performed on the second set of input elements 906 to generate a second result s2+s3. The first result (s0+s1) may be inserted into a first output element 1016 of the output register 920, and the second result (s2+s3) may be inserted into a second output element 1018 of the output register 920. The first output element 1016 and the second output element 1018 may be different output elements of the output register 920.
A first number of input elements in the first set of input elements 904 and a second number of input elements in the second set of input elements 906 may be based on the section grouping size identified by the rotate segmented vector arithmetic reduction instruction 1001. For example, the first number of elements and the second number of elements may be the same. When the number of generated results is less than the number of output elements in the output register 920, one or more rotated previous elements of the multiple previous elements 922 (or one or more zeros, when the multiple previous elements 922 are overwritten before the results are generated) may remain in the output register 920 (for example, may not be overwritten). For example, when the first output element 1016 and the second output element 1018 are inserted into the output register 920, the multiple output elements 1024 may include the rotated previous elements d(N-1) and d(1). When the section grouping size of the rotate segmented vector arithmetic reduction instruction 1001 is a different size, the multiple input elements 902 may be grouped into different sets of input elements, and different results may be generated.
During operation, the processor may receive the rotate segmented vector arithmetic reduction instruction 1001. The processor may execute the rotate segmented vector arithmetic reduction instruction 1001 using the multiple input elements 902 to generate multiple output elements 1024 and store them in the output register 920. The contents of the output register (for example, the multiple previous elements 922) may be selectively rotated based on the rotate segmented vector arithmetic reduction instruction 1001, results may be generated from the multiple input elements 902 grouped into one or more groups of input elements based on the section grouping size, and the results may be inserted into the output register 920.
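The overall operation can be sketched end to end as an illustrative, non-limiting software model. The function and parameter names are hypothetical. The placement rule used here (each group's result lands in the slot of the last element of its group) is an assumption; it is one placement consistent with the example in which d(N-1) and d(1) remain in the output, and is not confirmed by the text.

```python
def rotate_segmented_reduce(inputs, previous, group_size,
                            rotate_right_by=0, zero_previous=False):
    """Model: optionally rotate the old register contents, optionally
    zero them, then insert one sum per input group.

    Assumed placement: the sum of a group is written to the output slot
    that lines up with the last input element of that group.
    """
    n = len(previous)
    if len(inputs) != n:
        raise ValueError("inputs and previous contents must be equally long")
    if group_size <= 0 or n % group_size != 0:
        raise ValueError("group_size must evenly divide the vector length")

    k = rotate_right_by % n
    out = list(previous[n - k:]) + list(previous[:n - k])  # rotate right by k
    if zero_previous:
        out = [0] * n                                      # overwrite option

    for start in range(0, n, group_size):
        out[start + group_size - 1] = sum(inputs[start:start + group_size])
    return out


s = [1, 2, 3, 4]          # stand-ins for s0 .. s3
d = [10, 20, 30, 40]      # stand-ins for d(0) .. d(3)
result = rotate_segmented_reduce(s, d, group_size=2, rotate_right_by=1)
```

In this example `result` holds d(3), s0+s1, d(1), s2+s3: two slots receive new sums and two slots keep rotated previous elements.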
Referring to Figure 11A, a diagram of a first illustrative embodiment of executing an accumulating vector arithmetic reduction instruction with masking is disclosed and generally designated 1100. In an illustrative, non-limiting example, the accumulating vector arithmetic reduction instruction may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1. The accumulating vector arithmetic reduction instruction may identify a mask 1130 (for example, a vector mask). As explained with reference to Fig. 1, the mask 1130 may be indicated by a value stored in the third field 186 (Q) of the accumulating vector arithmetic reduction instruction 101. For example, the mask 1130 may be included in the accumulating vector arithmetic reduction instruction, or may be indicated by a pointer included in the instruction, where the pointer points to a location or a register that stores a data structure holding the mask 1130. An individual value (for example, an element) of the multiple elements 102 may be masked (for example, provided as zero to the reduction tree that generates the one or more output elements) based on the corresponding element of the mask 1130 being equal to zero. Alternatively, the value may be masked based on the corresponding element of the mask 1130 being equal to one.
During execution of the accumulating vector arithmetic reduction instruction, the mask 1130 may be applied to the multiple elements 102 before the first element 104 is provided as the first output element 112. Applying the mask 1130 may include providing zero for a particular element of the multiple elements 102, depending on the corresponding mask value of the mask 1130. As shown, before the mask 1130 is applied to the multiple elements 102, the input vector 122 includes elements s0, s1, s2, and s(N-1). After the mask 1130 is applied, the multiple elements 102 include s0, zero (provided in place of s1 because the corresponding element of the mask 1130 is equal to zero), s2, and s(N-1). In another embodiment, applying the mask 1130 to the multiple elements may include modifying the value of one or more of the multiple elements 102 in the input vector 122. After the mask 1130 is applied to the multiple elements 102, execution of the accumulating vector arithmetic reduction instruction may proceed as explained with reference to Fig. 1. Thus, the output vector 120 may include a first output element 112 equal to s0, a second output element 114 equal to 0+s0 (for example, s0), a third output element 116 equal to s2+s0, and an Nth output element 118 equal to s0+s2+...+s(N-1).
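As an illustrative, non-limiting sketch, input-side masking followed by accumulation can be modeled as below. The function name `masked_accumulate` and the convention that a mask value of zero masks the element are assumptions for illustration (the text notes the opposite convention is also possible); `itertools.accumulate` is used only as a software stand-in for the reduction tree.

```python
from itertools import accumulate


def masked_accumulate(inputs, mask):
    """Replace masked-off inputs with zero, then form running sums.

    Convention assumed here: mask value 0 means the input is masked
    (fed to the sums as zero); any non-zero mask value keeps the input.
    """
    if len(inputs) != len(mask):
        raise ValueError("inputs and mask must have the same length")
    masked_inputs = [x if m else 0 for x, m in zip(inputs, mask)]
    return list(accumulate(masked_inputs))


s = [5, 7, 11, 13]        # stand-ins for s0, s1, s2, s3
mask = [1, 0, 1, 1]       # s1 is masked off
outputs = masked_accumulate(s, mask)   # s0, s0, s0+s2, s0+s2+s3
```

Because s1 is replaced by zero before any addition, it contributes to none of the later partial sums.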
Referring to Figure 11B, a diagram of a second illustrative embodiment of executing an accumulating vector arithmetic reduction instruction that includes masking is disclosed and generally designated 1101. Executing the accumulating vector arithmetic reduction instruction may include applying the mask 1130 to the output vector 120.
During execution of the accumulating vector arithmetic reduction instruction, the mask 1130 may be applied to the output vector 120 to generate a masked output vector 1126. As shown, applying the mask 1130 may yield a masked output vector 1126 having elements s0, zero, s0+s1+s2, and s0+s1+s2+...+s(N-1). Although Figure 11B shows the mask 1130 being applied after the output elements are stored in the output vector 120, the mask 1130 may instead be applied to the results of the arithmetic operations before they are inserted into the output vector 120. For example, one or more outputs (for example, s0+s1) may be prevented from being stored in the output vector 120 based on the mask 1130, so that a previous value in the output vector 120 is not overwritten. In a particular embodiment, the output vector 120 and the masked output vector 1126 may be stored at the same location, such as in the same register.
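Output-side masking can be sketched in the same illustrative, non-limiting way. The function name and the optional `previous` argument are hypothetical: when `previous` is omitted, a masked slot is shown as zero, and when it is supplied, a masked slot keeps its old value, modeling the variant in which a masked result is never stored.

```python
from itertools import accumulate


def accumulate_then_mask(inputs, mask, previous=None):
    """Form all running sums first, then apply the mask to the outputs.

    Convention assumed here: mask value 0 suppresses the output slot.
    """
    if len(inputs) != len(mask):
        raise ValueError("inputs and mask must have the same length")
    sums = list(accumulate(inputs))
    if previous is None:
        return [s if m else 0 for s, m in zip(sums, mask)]
    if len(previous) != len(mask):
        raise ValueError("previous and mask must have the same length")
    # A masked slot keeps whatever the output vector held before.
    return [s if m else p for s, m, p in zip(sums, mask, previous)]


s = [5, 7, 11, 13]
mask = [1, 0, 1, 1]
zeroed = accumulate_then_mask(s, mask)                    # s0, 0, s0+s1+s2, ...
kept = accumulate_then_mask(s, mask, previous=[9, 9, 9, 9])
```

Unlike input-side masking, s1 still contributes to the later sums here; only the second output slot is affected.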
In addition, the masking shown in Figures 11A to 11B may be applied in a similar manner to the segmented vector arithmetic reduction instruction 901 of Fig. 9 or to the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10. For example, during execution of the segmented vector arithmetic reduction instruction 901, the mask 1130 may be applied to the multiple elements 102 before the multiple elements 102 are grouped. As another example, during execution of the rotate segmented vector arithmetic reduction instruction 1001, the mask 1130 may be applied to the output vector 120 after the contents of the output register storing the output vector 120 are rotated (for example, after the contents of the output vector 120 are rotated). The masked output vector 1126 may include a first output element 1142 equal to s0, a second output element 1144 equal to 0, a third output element 1146 equal to s0+s1+s2, and an Nth output element 1148 equal to s0+s1+...+s(N-1).
Referring to Figure 12, a flow chart of an illustrative embodiment of a method 1200 of executing an accumulating vector arithmetic reduction instruction is shown. The accumulating vector arithmetic reduction instruction may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the vector instruction 220 of Fig. 2. In a particular embodiment, the method 1200 may be performed by the processor 202 of Fig. 2.
At 1202, a vector instruction may be executed at a processor. The vector instruction may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1. The vector instruction may include a vector input that includes multiple input elements. For example, the vector input may be the input vector 122 of Figs. 1 to 6. The vector input may include the multiple input elements 102 of Fig. 1. The multiple input elements (for example, the vector input) may be stored in a sequential order. The vector input may be identified by the vector instruction. For example, the vector input may be identified by a value stored in a particular field (for example, a parameter), such as the third field 184 of the vector arithmetic reduction instruction 101 of Fig. 1.
At 1204, a first input element of the multiple input elements may be provided as a first output element. The first input element may be the first element 104 (s0) of Fig. 1, and the first output element may be the first output element 112 (s0) of Fig. 1. For example, the first input element may be provided (for example, generated) as the first output element by adding a zero input (for example, a value equal to logical zero) to the first input element. The zero input may be added based on a control signal from control logic included in the processor, as described with reference to Fig. 7.
At 1206, a first arithmetic operation may be performed on the first input element and a second input element of the multiple input elements to provide (for example, generate) a second output element. For example, the first arithmetic operation may be an addition operation. In other embodiments, the first arithmetic operation may be a subtraction operation. The second input element may be the second element 106 (s1) of Fig. 1, and the second output element may be the second output element 114 (s0+s1) of Fig. 1. For example, a value equal to the sum of the first input element and the second input element may be generated (for example, provided) as the second output element. Each input element and each output element may include multiple sub-elements, and the sub-elements may be added in an interleaved manner, as described with reference to Figs. 3 to 4.
At 1208, the first output element and the second output element may be stored in an output vector. The output vector may be the output vector 120 of Figs. 1 to 6. For example, the first output element (for example, equal to the value of the first input element) and the second output element (for example, equal to the sum of the first input element and the second input element) may be stored in different output elements of the output vector, as shown in Fig. 1.
Additional output elements may be generated in this manner. For example, a second arithmetic operation may be performed on the first input element, the second input element, and a third input element of the multiple input elements to generate (for example, provide) a third output element. Thus, a particular output element may be generated by performing a particular arithmetic operation on a particular input element of the multiple input elements and on one or more other input elements that precede the particular input element in the sequential order of the multiple elements.
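The steps of method 1200 can be sketched as an illustrative, non-limiting software model. The function name `accumulating_reduce` is hypothetical, addition is assumed as the arithmetic operation, and the sequential loop is a modeling convenience only: it does not represent how a reduction tree produces the outputs in hardware.

```python
def accumulating_reduce(inputs):
    """Return every partial sum of `inputs`, in sequential order.

    Output k is the sum of input elements 0 .. k. The running value
    starts at zero, so the first output equals the first input, which
    mirrors providing the first element by adding a zero input to it.
    """
    outputs = []
    running = 0                      # the zero input added to the first element
    for element in inputs:
        running = running + element  # include every earlier element
        outputs.append(running)
    return outputs


s = [1, 2, 3, 4]                     # stand-ins for s0 .. s3
partials = accumulating_reduce(s)    # s0, s0+s1, s0+s1+s2, s0+s1+s2+s3
```

A single call yields all partial results at once, which is the property the text attributes to executing one vector instruction rather than several.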
According to the method 1200, multiple output elements (for example, the first output element and the second output element) may be generated, and the output elements may represent multiple partial results of the accumulating vector arithmetic reduction. By generating the multiple partial results during execution of a single vector instruction, rather than during execution of multiple vector instructions, the method 1200 may provide improvements in storage and power consumption.
Referring to Figure 13, a flow chart of an illustrative embodiment of a method 1300 of executing a vector instruction using a reduction tree is shown. The vector instruction may be the vector instruction 220 of Fig. 2 or the segmented vector arithmetic reduction instruction 901 of Fig. 9. In a particular embodiment, the method 1300 may be performed by the processor 202 of Fig. 2.
At 1302, a vector instruction that includes a section grouping size may be received at a processor. For example, the vector instruction may be the segmented vector arithmetic reduction instruction 901 of Fig. 9, with the section grouping size indicated by the fifth field 990. The processor may include a reduction tree. The reduction tree may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, part of the reduction tree 700 of Fig. 7, the reduction tree 800 of Fig. 8, or any combination thereof. The reduction tree may include multiple inputs, multiple arithmetic operation units, and multiple outputs. As illustrative examples, the multiple inputs may be the multiple input elements 802 of Fig. 8 or the multiple input elements 902 of Fig. 9; the multiple arithmetic operation units may be the multiple adders 804 of Fig. 8; and the multiple outputs may be the multiple output elements 806 of Fig. 8 or the multiple output elements 924 of Fig. 9.
At 1304, the section grouping size may be determined. For example, the section grouping size may be determined based on a particular field of the vector instruction (for example, the fifth field 990 of Fig. 9). The section grouping size may indicate the size of one or more groups associated with the multiple input elements during execution of the vector instruction.
At 1306, the vector instruction may be executed using the reduction tree, based on the section grouping size, to simultaneously generate multiple outputs. For example, executing the vector instruction may include grouping the multiple input elements into one or more groups having the section grouping size, and performing one or more arithmetic operations on the one or more groups to generate the multiple outputs. Based on the vector reduction instruction, the multiple outputs may be generated during a single processing cycle of the processor.
The reduction tree may be selectively configurable for use with multiple different section grouping sizes. For example, a configuration of the reduction tree may be associated with a particular section grouping size. The configuration of the reduction tree may be associated with enabling a particular subset of the arithmetic operation units and selecting a particular subset of the inputs to the arithmetic operation units (for example, a particular subset of enabled paths), such as a subset of the multiple adders 804 and of the multiple paths 830 to 844 of Fig. 8. After determining the section grouping size in the vector instruction, the processor may determine whether the reduction tree is configured for use with that section grouping size (for example, whether the reduction tree is in the particular configuration associated with the section grouping size). In response to determining that the reduction tree is not configured for use with the section grouping size, the configuration of the reduction tree may be changed based on the section grouping size. For example, based on the section grouping size, one or more of the multiple arithmetic operation units may be enabled, and one or more inputs to the arithmetic operation units may be selected. In response to determining that the reduction tree is configured for use with the section grouping size, the vector instruction may be executed using the reduction tree. For example, when the reduction tree is already in the particular configuration associated with the section grouping size, the reduction tree need not be changed before the vector instruction is executed.
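The check-then-reconfigure behavior can be sketched as an illustrative, non-limiting software model. The class name `ReductionTreeModel` and its attributes are hypothetical; enabling adders and selecting paths is abstracted to recording the configured size, and the `reconfigurations` counter exists only to make the behavior observable in the example.

```python
class ReductionTreeModel:
    """Toy model of one reduction tree shared by several grouping sizes."""

    def __init__(self, width):
        self.width = width               # number of input elements
        self.configured_size = None      # grouping size currently set up
        self.reconfigurations = 0        # how often the configuration changed

    def _configure(self, group_size):
        # Stand-in for enabling a subset of adders and selecting paths.
        if group_size <= 0 or self.width % group_size != 0:
            raise ValueError("group_size must evenly divide the tree width")
        self.configured_size = group_size
        self.reconfigurations += 1

    def execute(self, inputs, group_size):
        if len(inputs) != self.width:
            raise ValueError("wrong number of input elements")
        # Reconfigure only when the tree is not already set up for this size.
        if self.configured_size != group_size:
            self._configure(group_size)
        return [sum(inputs[i:i + group_size])
                for i in range(0, self.width, group_size)]


tree = ReductionTreeModel(width=4)
first = tree.execute([1, 2, 3, 4], group_size=2)   # configures for size 2
second = tree.execute([5, 6, 7, 8], group_size=2)  # reuses the configuration
third = tree.execute([1, 2, 3, 4], group_size=4)   # reconfigures for size 4
```

Two consecutive instructions with the same grouping size trigger a single configuration change, while a new size triggers another.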
According to the method 1300, the reduction tree may be selectively configurable for use with multiple instructions having different section grouping sizes. Compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Figure 14, a flow chart of an illustrative embodiment of a method 1400 of executing a rotate segmented vector arithmetic reduction instruction is shown. The rotate segmented vector arithmetic reduction instruction may be the vector instruction 220 of Fig. 2 or the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10. In a particular embodiment, the method 1400 may be performed by the processor 202 of Fig. 2.
At 1402, a vector instruction that includes multiple input elements may be executed. For example, the vector instruction may be the rotate segmented vector arithmetic reduction instruction 1001, and the multiple input elements may be the multiple input elements 902 of Figure 10.
At 1404, a first subset of the multiple input elements may be grouped to form a first set of input elements. For example, the first set of input elements may be the first set of input elements 1004 of Figure 10. The first subset of the multiple input elements may be grouped to form the first set of input elements based on a section grouping size included in the rotate segmented vector arithmetic reduction instruction. For example, the section grouping size may be identified by a particular field (for example, a parameter) of the rotate segmented vector arithmetic reduction instruction, such as the fifth field 1090 of the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10.
At 1406, a second subset of the multiple input elements may be grouped to form a second set of input elements. For example, the second set of input elements may be the second set of input elements 1006 of Figure 10. The second subset of the multiple input elements may be grouped to form the second set of input elements based on the section grouping size included in the rotate segmented vector arithmetic reduction instruction. In a particular embodiment, the size of the first set of input elements may be the same as the size of the second set of input elements. In an alternative embodiment, the size of the first set of input elements and the size of the second set of input elements may be different.
At 1408, a first arithmetic operation may be performed on the first set of input elements. For example, a first addition operation may be performed on the first set of input elements. In a particular embodiment, the first arithmetic operation may be indicated by an operation vector. The operation vector may be indicated by a value stored in a particular field (for example, a parameter) of the rotate segmented vector arithmetic reduction instruction, such as the fourth field 1088 of the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10.
At 1410, a second arithmetic operation may be performed on the second set of input elements. For example, a second addition operation may be performed on the second set of input elements. In a particular embodiment, the second arithmetic operation may be indicated by the operation vector.
At 1412, the contents of an output register may be rotated. For example, the output register may be the output register 1020 of Figure 10, and may contain multiple previous elements (for example, contents), such as the multiple previous elements 922 of Figure 10. The output register may be identified by a value stored in a particular field (for example, a parameter) of the rotate segmented vector arithmetic reduction instruction, such as the second field 1084 of the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10. As illustrative examples, the multiple previous elements may be results generated by a previously executed vector instruction, or may be multiple null values. In a particular embodiment, the multiple previous elements may be the results of a previously executed rotate segmented vector arithmetic reduction instruction. Rotating the contents of the output register may include selectively (for example, optionally) rotating the contents of the output register based on a value stored in a particular field (for example, a parameter) of the rotate segmented vector arithmetic reduction instruction, such as the eighth field 1096 (for example, a rotation field) of the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10. For example, the value stored in the rotation field may indicate the size of the rotation and the direction of the rotation, and the contents of the output register may be rotated by that size in that direction. The contents of the output register may be overwritten (for example, set equal to zero) based on a particular field of the rotate segmented vector arithmetic reduction instruction.
At 1414, after the contents of the output register are rotated, a first result of the first arithmetic operation and a second result of the second arithmetic operation may be inserted into the output register. For example, the first result may be inserted into a first output element of the output register, and the second result may be inserted into a second output element of the output register. The first output element may be the first output element 1016 of Figure 10, and the second output element may be the second output element 1018 of Figure 10. The first result and the second result may overwrite values that were previously stored in the output register (and rotated at 1412).
According to the method 1400, rotation and segmented vector arithmetic reduction may be performed for multiple section grouping sizes by executing a single vector instruction using a single reduction tree. Compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Figure 15, a block diagram of a particular illustrative embodiment of a device (for example, a communication device) that includes a reduction tree 1580 is depicted and generally designated 1500. The reduction tree 1580 is used to execute an accumulating vector arithmetic reduction instruction 1562 and a segmented vector arithmetic reduction instruction 1564. As illustrative examples, the reduction tree 1580 may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, part of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8. The device 1500 may be a wireless electronic device, and may include a processor coupled to a memory 1532, such as a digital signal processor (DSP) 1510.
The processor 1510 may be configured to execute computer-executable instructions 1560 (for example, a program of one or more instructions) stored in the memory 1532 (for example, a computer-readable storage medium). The instructions 1560 may include the accumulating vector arithmetic reduction instruction 1562 and/or the segmented vector arithmetic reduction instruction 1564. The accumulating vector arithmetic reduction instruction 1562 may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the vector instruction 220 of Fig. 2. The segmented vector arithmetic reduction instruction 1564 may be the vector instruction 220 of Fig. 2, the segmented vector arithmetic reduction instruction 901 of Fig. 9, or the rotate segmented vector arithmetic reduction instruction 1001 of Figure 10.
A camera interface 1568 is coupled to the processor 1510 and is also coupled to a camera (for example, a video camera 1570). A display controller 1526 is coupled to the processor 1510 and to a display 1528. An encoder/decoder (codec) 1534 may also be coupled to the processor 1510. A loudspeaker 1536 and a microphone 1538 may be coupled to the codec 1534. A wireless interface 1540 may be coupled to the processor 1510 and to an antenna 1542, so that wireless data received via the antenna 1542 and the wireless interface 1540 may be provided to the processor 1510.
In a particular embodiment, the processor 1510 may be configured to execute computer-executable instructions 1560 stored at a non-transitory computer-readable medium (for example, the memory 1532), the instructions being executable to cause a computer (for example, the processor 1510) to provide a first element of multiple elements as a first output element. The computer-executable instructions 1560 may include the accumulating vector arithmetic reduction instruction 1562. The multiple elements may be the multiple elements 102 of Fig. 1, and may be stored in an input vector (for example, the input vector 122 of Figs. 1 to 6). The computer-executable instructions 1560 may be further executable by the computer to perform an arithmetic operation on the first element and a second element of the multiple elements to provide a second output. The computer-executable instructions 1560 may be further executable by the computer to store the first output and the second output in an output vector. The output vector may be the output vector 120 of Figs. 1 to 6.
In a particular embodiment, the processor 1510 may be configured to execute computer-executable instructions 1560 stored at a non-transitory computer-readable medium (for example, the memory 1532), the instructions being executable to cause a computer (for example, the processor 1510) to receive a vector instruction that includes a section grouping size. The vector instruction may be the segmented vector arithmetic reduction instruction 1564. The computer-executable instructions 1560 may be further executable to determine the section grouping size. The computer-executable instructions 1560 may be further executable to execute the vector instruction using a reduction tree, based on the section grouping size, to simultaneously generate multiple outputs. As illustrative examples, the reduction tree may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, part of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8. The reduction tree may include multiple inputs, multiple arithmetic operation units, and multiple outputs. The reduction tree may be selectively configurable for use with multiple different section grouping sizes.
In a particular embodiment, the processor 1510, the display controller 1526, the memory 1532, the codec 1534, the wireless interface 1540, and the camera interface 1568 are included in a system-in-package or system-on-chip device 1522. In a particular embodiment, an input device 1530 and a power supply 1544 are coupled to the system-on-chip device 1522. In addition, in a particular embodiment, as illustrated in Figure 15, the display 1528, the input device 1530, the loudspeaker 1536, the microphone 1538, the antenna 1542, the video camera 1570, and the power supply 1544 are external to the system-on-chip device 1522. However, each of the display 1528, the input device 1530, the loudspeaker 1536, the microphone 1538, the antenna 1542, the video camera 1570, and the power supply 1544 may be coupled to a component of the system-on-chip device 1522 (for example, an interface or a controller).
The methods 1200 to 1400 of Figs. 12 to 14 may be implemented by a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit such as a central processing unit (CPU), a digital signal processor (DSP), a controller, another hardware device, a firmware device, or any combination thereof. As an example, the method 1200 of Fig. 12, the method 1300 of Fig. 13, the method 1400 of Fig. 14, or any combination thereof, can be initiated by a processor that executes instructions stored in the memory 1532, as described with respect to Fig. 15.
In conjunction with one or more of described embodiment, announcement may include for providing the first element in multiple elements For the equipment of the device of the first output.Device for offer may include one or more adders of reduction tree, such as the pact of Fig. 2 Letter tree 206, the reduction tree 300 to 600 of Fig. 3 to 6, the part of the reduction tree of Fig. 7 700, Fig. 8 reduction tree 800, be configured to by First element is provided as one or more other devices or circuit of the first output, or any combination thereof.Equipment can further include For the device based on the second output of the first element and second element generation in multiple elements.Device for generation may include One or more adders of reduction tree, such as the reduction tree of the reduction tree 206 of Fig. 2, the reduction tree 300 to 600 of Fig. 3 to 6, Fig. 7 700 part, Fig. 8 reduction tree 800, be configured to generate based on the first element and second element the second output one or more Other devices or circuit, or any combination thereof.Equipment can further include defeated for the first output and the second output to be stored in Device in outgoing vector.Device for storage may include the reduction tree 206 of Fig. 2, the reduction tree 300 to 600 of Fig. 3 to 6, Fig. 7 The part of reduction tree 700, Fig. 8 reduction tree 800, be configured to that one or more being stored in output vector will be exported it is other Device or circuit, or any combination thereof.
The apparatus may also include means for saturating the second output. The means for saturating the second output may include the first saturation logic circuit 730 or the second saturation logic circuit 732 of Fig. 7, one or more other devices or circuits configured to saturate an output, or any combination thereof.
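Saturation, as mentioned above, generally means clamping a result to the range that the output element can represent instead of letting it wrap around. The following sketch assumes signed two's-complement outputs and a default width of 16 bits; neither the width nor the function name `saturate` comes from the text, and the circuits 730 and 732 may behave differently.

```python
def saturate(value, bits=16):
    """Clamp `value` to the signed two's-complement range of `bits` bits.

    Assumption for illustration: signed outputs, 16 bits by default.
    """
    highest = (1 << (bits - 1)) - 1   # e.g. 32767 for 16 bits
    lowest = -(1 << (bits - 1))       # e.g. -32768 for 16 bits
    return max(lowest, min(highest, value))


clamped_high = saturate(40000)    # too large for 16 bits
clamped_low = saturate(-40000)    # too small for 16 bits
unchanged = saturate(123)         # already in range
```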
In conjunction with one or more of described embodiment, announcement may include for being based on vector instruction while generating multiple defeated The equipment of device out.For simultaneously generate device may include the reduction tree 206 of Fig. 2, the reduction tree 300 to 600 of Fig. 3 to 6, The part of the reduction tree 700 of Fig. 7, Fig. 8 reduction tree 800, be configured to based on vector instruction while generating the one of multiple outputs Or a number of other devices or circuit, or any combination thereof.It can be by processor in the first instruction comprising the first section packets size Execution during and comprising the second section packets size second instruction execution during using for simultaneously generation device.
One or more of the disclosed embodiments may be implemented in a system or apparatus (for example, the device 1500) that may include a set-top box, an entertainment unit, a navigation device, a communications device, a personal digital assistant (PDA), a fixed-location data unit, a mobile-location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a tablet computer, a desktop computer, a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, or a combination thereof. As another illustrative, non-limiting example, the system or apparatus may include remote units, such as mobile phones, hand-held personal communication systems (PCS) units, portable data units such as personal digital assistants, global positioning system (GPS) enabled devices, navigation devices, fixed-location data units such as meter reading equipment, any other device that stores or retrieves data or computer instructions, or any combination thereof. Although one or more of Figs. 1 to 15 may illustrate systems, apparatuses, and/or methods according to the teachings of the present invention, the invention is not limited to these illustrated systems, apparatuses, and/or methods. Embodiments of the invention may be suitably employed in any device that includes integrated circuitry, including memory and on-chip circuitry.
One or more of the disclosed embodiments may be implemented in a system or apparatus (for example, the device 1500) that may include a communications device, a fixed-location data unit, a mobile-location data unit, a mobile phone, a cellular phone, a computer, a tablet computer, a portable computer, or a desktop computer. In addition, the device 1500 may include a set-top box, an entertainment unit, a navigation device, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, any other device that stores or retrieves data or computer instructions, or a combination thereof. As another illustrative, non-limiting example, the system or apparatus may include remote units, such as mobile phones, hand-held personal communication systems (PCS) units, portable data units such as personal digital assistants, global positioning system (GPS) enabled devices, navigation devices, fixed-location data units such as meter reading equipment, any other device that stores or retrieves data or computer instructions, or any combination thereof.
Although one or more of Figs. 1 to 15 may illustrate systems, apparatuses, and/or methods according to the teachings of the present invention, the invention is not limited to these illustrated systems, apparatuses, and/or methods. Embodiments of the invention may be suitably employed in any device that includes integrated circuitry, including memory, a processor, and on-chip circuitry.
Those skilled in the art will further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as executable software depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary non-transitory (for example, tangible) storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or a user terminal.
The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.

Claims (46)

1. A method comprising:
executing a vector instruction at a processor, wherein the vector instruction includes a vector input comprising a plurality of elements, and wherein executing the vector instruction includes:
determining whether to perform a rotation operation on an element of a rotation vector, wherein the element of the rotation vector is configured for use with a first output generated with a first element of the plurality of elements, and wherein the element of the rotation vector is stored in a register or memory before an arithmetic operation is performed based on the first element and a second element of the plurality of elements to provide a second output; and
storing, during a single execution cycle of the processor, the first output and the second output in an output vector at positions adjacent to the element of the rotation vector, wherein the register or memory is identified as the output vector.

2. The method of claim 1, wherein executing the vector instruction further comprises:
performing a second arithmetic operation on the first element, the second element, and a third element of the plurality of elements to provide a third output; and
storing the first output, the second output, and the third output in the output vector during the single execution cycle of the processor.

3. The method of claim 1, wherein executing the vector instruction further comprises storing each of a plurality of outputs in a different output element of the output vector, and wherein the plurality of outputs includes the first output and the second output.

4. The method of claim 1, wherein the plurality of elements is stored in an order, wherein executing the vector instruction further comprises performing a second arithmetic operation on a particular element of the plurality of elements and on one or more other elements of the plurality of elements to generate a particular output, and wherein the one or more other elements precede the particular element in the order.

5. The method of claim 4, wherein a first size of the vector input is the same as a second size of the output vector.

6. The method of claim 5, wherein executing the vector instruction includes generating a plurality of outputs that includes the first output and the second output, wherein a last output of the plurality of outputs is equal to a sum of each element of the plurality of elements, and wherein the plurality of outputs are real numbers, imaginary numbers, or complex numbers.

7. The method of claim 1, wherein executing the vector instruction further comprises applying a mask to the plurality of elements prior to providing the first output.

8. The method of claim 7, wherein executing the vector instruction includes generating a plurality of outputs that includes the first output and the second output, and wherein applying the mask comprises providing, according to a corresponding mask value of the mask, a zero value for a particular element of the plurality of elements for use in generating the plurality of outputs.

9. The method of claim 7, wherein the mask is identified by the vector instruction.

10. The method of claim 1, wherein executing the vector instruction further comprises applying a mask to the output vector.

11. The method of claim 10, wherein executing the vector instruction further comprises preventing, based on the mask, one or more outputs from being stored in the output vector.

12. The method of claim 1, wherein, when the vector instruction is associated with a complex-number operation, executing the vector instruction further comprises:
generating a first real sub-element of the first output and a first imaginary sub-element of the first output; and
generating a second real sub-element of the second output and a second imaginary sub-element of the second output,
wherein storing the first output and the second output in the output vector includes storing, during the single execution cycle of the processor, the first real sub-element and the first imaginary sub-element of the first output and the second real sub-element and the second imaginary sub-element of the second output in the output vector.

13. An apparatus comprising:
a processor comprising a reduction tree, wherein, during execution of a vector instruction that identifies a vector input comprising a plurality of elements, the processor is configured to determine whether to perform a rotation operation on an element of a rotation vector stored in a register or memory, and the reduction tree is configured to:
access the element of the rotation vector, the element being configured for use with a first output element generated with at least a first element of the plurality of elements;
perform an arithmetic operation based on the first element and a second element of the plurality of elements to provide a second output element; and
store, during a single execution cycle of the processor, the first output element and the second output element in an output vector at positions adjacent to the element of the rotation vector, wherein the register or memory is identified as the output vector.

14. The apparatus of claim 13, wherein the reduction tree comprises a plurality of arithmetic operation units, a plurality of inputs, and a plurality of outputs, and wherein the reduction tree is configured to perform a second arithmetic operation on the first element, the second element, and a third element of the plurality of elements to provide a third output element.

15. The apparatus of claim 14, wherein a particular arithmetic operation unit of the plurality of arithmetic operation units is coupled to a saturation logic circuit configured to saturate an output of the particular arithmetic operation unit.

16. The apparatus of claim 14, wherein the processor further comprises control logic configured to selectively enable, based on the vector instruction, one or more arithmetic operation units of the plurality of arithmetic operation units, and wherein the first output element and the second output element are provided via the one or more arithmetic operation units.

17. The apparatus of claim 16, wherein the control logic is configured to enable, based on the vector instruction, a subset of the plurality of arithmetic operation units to receive a zero input, the zero input having a logical value equal to logical zero.

18. The apparatus of claim 16, wherein the control logic is configured to bypass, based on the vector instruction, at least one arithmetic operation unit of the plurality of arithmetic operation units.

19. The apparatus of claim 13, wherein the reduction tree is logically partitioned into a plurality of cumulative parallel reduction networks, wherein a first cumulative reduction network of the plurality of cumulative parallel reduction networks includes a sub-adder of each adder of a plurality of adders, wherein each cumulative reduction network of the plurality of cumulative parallel reduction networks operates in parallel, and wherein each cumulative reduction network of the plurality of cumulative parallel reduction networks stores an output in the output vector.

20. The apparatus of claim 13, wherein the reduction tree supports a plurality of input types, and wherein the plurality of input types includes sixty-four-bit real numbers, sixty-four-bit imaginary numbers, thirty-two-bit real numbers, thirty-two-bit imaginary numbers, sixteen-bit real numbers, sixteen-bit imaginary numbers, thirty-two-bit complex numbers, sixteen-bit complex numbers, one or more other input types, or any combination thereof.

21. An apparatus comprising:
means for storing data;
means for determining whether to perform a rotation operation on an element of a rotation vector, wherein the element of the rotation vector is configured for use with a first output generated with a first element of a plurality of elements, wherein a vector instruction indicates a vector input comprising the plurality of elements, and wherein the element of the rotation vector is stored in a register or memory before a second output is generated based on an arithmetic operation and based on the first element and a second element of the plurality of elements; and
means for storing, during a single execution cycle, the first output and the second output in an output vector at positions adjacent to the element of the rotation vector, wherein the register or memory is identified as the output vector.

22. The apparatus of claim 21, further comprising means for saturating the second output, the means for saturating being coupled to the means for generating.

23. A non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to:
determine whether to perform a rotation operation on an element of a rotation vector, wherein the element of the rotation vector is configured for use with a first output generated with at least a first element of a plurality of elements, the plurality of elements being included in a vector input of a vector instruction, and wherein the element of the rotation vector is stored in a register or memory before an arithmetic operation is performed based on the first element and a second element of the plurality of elements to provide a second output; and
store, during a single execution cycle of the processor, the first output and the second output in an output vector at positions adjacent to the element of the rotation vector, wherein the register or memory is identified as the output vector.

24. The non-transitory computer-readable medium of claim 23, wherein the instructions are further executable to cause the processor to complement, based on the vector instruction, one or more elements of the plurality of elements before the one or more elements are used to provide the first output and the second output.

25. An apparatus comprising:
a reduction tree comprising a plurality of input elements, a plurality of arithmetic operation units, and a plurality of output elements, wherein the plurality of arithmetic operation units includes multiple rows of arithmetic operation units, wherein a first row of the multiple rows of arithmetic operation units is configured to receive a number of input elements of the plurality of input elements, wherein a last row of the multiple rows of arithmetic operation units is configured to output a number of output elements of the plurality of output elements, wherein a processor is configured to use the reduction tree during execution of a first instruction having a first section grouping size and during execution of a second instruction having a second section grouping size, wherein the reduction tree is configured to generate a plurality of output values simultaneously, and wherein the first section grouping size corresponds to a size of a first group of the plurality of input elements and the second section grouping size corresponds to a size of a second group of the plurality of input elements; and
a rotation unit configured to rotate an output vector before the number of output elements is stored in the output vector, wherein the rotation unit comprises a rotator or a barrel vector shifter.

26. The apparatus of claim 25, wherein the plurality of arithmetic operation units comprises a plurality of adders.

27. The apparatus of claim 25, further comprising control logic configured to:
selectively enable, during execution of the first instruction, a first subset of the plurality of arithmetic operation units based on the first section grouping size; and
selectively enable, during execution of the second instruction, a second subset of the plurality of arithmetic operation units based on the second section grouping size.

28. The apparatus of claim 25, wherein the reduction tree is included in an arithmetic logic unit (ALU) of the processor, and wherein a number of levels of the reduction tree depends on a number of input elements of the plurality of input elements.

29. The apparatus of claim 28, wherein each row of the multiple rows of arithmetic operation units is associated with a corresponding level of the reduction tree.

30. The apparatus of claim 28, wherein the number of levels of the reduction tree is equal to the base-2 logarithm of the number of input elements.

31. The apparatus of claim 25, further comprising one or more saturation circuits, wherein a particular saturation circuit of the one or more saturation circuits is configured to receive a particular output from a particular arithmetic operation unit and to output a saturated value based on the particular output.

32. The apparatus of claim 25, wherein the reduction tree is configured to generate the plurality of output values simultaneously, using a plurality of arithmetic operations, during execution of the first instruction and of the second instruction.

33. A method comprising:
receiving, at a processor, a vector instruction having a section grouping size, wherein the processor includes a reduction tree, wherein the reduction tree includes a plurality of input elements, a plurality of arithmetic operation units, and a plurality of output elements, wherein the plurality of arithmetic operation units includes multiple rows of arithmetic operation units, wherein a first row of the multiple rows of arithmetic operation units is configured to receive a number of input elements of the plurality of input elements, wherein a last row of the multiple rows of arithmetic operation units is configured to output a number of output elements of the plurality of output elements, and wherein the section grouping size corresponds to a size of one or more groups of the plurality of input elements;
determining the section grouping size; and
executing the vector instruction using the reduction tree, based on the section grouping size, to generate a plurality of output values simultaneously, wherein the reduction tree is selectively configurable for use with multiple different section grouping sizes.

34. The method of claim 33, further comprising:
determining whether the reduction tree is configured for use with the determined section grouping size; and
in response to determining that the reduction tree is not configured for use with the determined section grouping size, changing the configuration based on the section grouping size.

35. The method of claim 34, further comprising:
determining whether the reduction tree is configured for use with the determined section grouping size; and
in response to determining that the reduction tree is configured for use with the determined section grouping size, executing the vector instruction using the reduction tree.

36. The method of claim 33, wherein executing the vector instruction comprises:
grouping a plurality of inputs into one or more groups having the section grouping size; and
performing one or more arithmetic operations on the one or more groups to generate the plurality of output values, wherein the vector instruction indicates the one or more arithmetic operations.

37. The method of claim 36, wherein each input of the plurality of inputs includes a corresponding real part and a corresponding imaginary part, and wherein each output value of the plurality of output values is generated by performing, in an interleaved manner, a first arithmetic operation on one or more real parts and a second arithmetic operation on one or more imaginary parts.

38. The method of claim 33, wherein the plurality of input elements and the plurality of output elements represent real values, imaginary values, or a combination thereof.

39. A method comprising:
executing a vector instruction that includes a plurality of input elements, wherein executing the vector instruction includes:
grouping a first subset of the plurality of input elements to form a first set of input elements;
grouping a second subset of the plurality of input elements to form a second set of input elements;
performing a first arithmetic operation on the first set of input elements;
performing a second arithmetic operation on the second set of input elements;
rotating contents of an output register based on a value stored in a particular field of the vector instruction; and
after rotating the contents of the output register, inserting a first result of the first arithmetic operation and a second result of the second arithmetic operation into the output register during a single execution cycle.

40. The method of claim 39, wherein the vector instruction is a single vector instruction, wherein each of the plurality of input elements is stored in an input vector, and wherein the first result and the second result are generated simultaneously.

41. The method of claim 39, wherein inserting the first result and the second result into the output register comprises overwriting corresponding contents of the output register, and wherein rotating the contents of the output register comprises selectively rotating the contents of the output register based on the vector instruction.

42. The method of claim 39, wherein a first number of elements in the first set of input elements and a second number of elements in the second set of input elements are based on a section grouping size identified by the vector instruction.

43. The method of claim 42, wherein the first number of elements in the first set of input elements is the same as the second number of elements in the second set of input elements.

44. The method of claim 39, wherein the first result is inserted into a first output element of the output register, wherein the second result is inserted into a second output element of the output register, and wherein the first output element and the second output element are different output elements of the output register.

45. The method of claim 39, wherein executing the vector instruction further comprises applying a mask to the plurality of input elements prior to grouping the plurality of input elements.

46. The method of claim 39, wherein executing the vector instruction further comprises applying a mask to the output register after rotating the contents.
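The rotate-then-insert sequence recited in claims 39 and 41 can be pictured with a short Python sketch. The claims do not state the rotation direction or which output elements receive the new results, so the sketch assumes a rotation toward higher indices and insertion at the lowest indices; the function name `rotate_and_insert` and its parameters are likewise hypothetical.

```python
def rotate_and_insert(output_register, results, rotate_by):
    """Rotate the register contents, then overwrite the low elements.

    Assumptions for illustration: rotation moves element i to index
    i + rotate_by (wrapping around), and new results land at index 0
    upward, overwriting whatever was rotated into those positions.
    """
    size = len(output_register)
    if size == 0 or len(results) > size:
        raise ValueError("results must fit in a non-empty output register")
    shift = rotate_by % size
    # Slicing with shift == 0 would return an empty head, so handle it apart.
    if shift:
        rotated = output_register[-shift:] + output_register[:-shift]
    else:
        rotated = list(output_register)
    rotated[:len(results)] = results
    return rotated


# Earlier results 10 and 20 are rotated up by two positions, and the two
# new group results take their place at the bottom of the register.
register = rotate_and_insert([10, 20, 30, 40], results=[1, 2], rotate_by=2)
```

Under these assumptions, repeated instructions can fill one output register with the results of several reductions without a separate merge step, which appears to be the purpose of rotating before inserting.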
CN201480043504.XA 2013-08-14 2014-08-04 Vector accumulation method and equipment Active CN105453028B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13/967,191 US20150052330A1 (en) 2013-08-14 2013-08-14 Vector arithmetic reduction
US13/967,191 2013-08-14
PCT/US2014/049604 WO2015023465A1 (en) 2013-08-14 2014-08-04 Vector accumulation method and apparatus

Publications (2)

Publication Number Publication Date
CN105453028A CN105453028A (en) 2016-03-30
CN105453028B CN105453028B (en) 2019-04-09

Family

ID=51492424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480043504.XA Active CN105453028B (en) 2013-08-14 2014-08-04 Vector accumulation method and equipment

Country Status (6)

Country Link
US (1) US20150052330A1 (en)
EP (1) EP3033670B1 (en)
JP (1) JP2016530631A (en)
CN (1) CN105453028B (en)
TW (1) TWI507982B (en)
WO (1) WO2015023465A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9678715B2 (en) * 2014-10-30 2017-06-13 Arm Limited Multi-element comparison and multi-element addition
US20160179530A1 (en) * 2014-12-23 2016-06-23 Elmoustapha Ould-Ahmed-Vall Instruction and logic to perform a vector saturated doubleword/quadword add
US10296342B2 (en) 2016-07-02 2019-05-21 Intel Corporation Systems, apparatuses, and methods for cumulative summation
US10466967B2 (en) * 2016-07-29 2019-11-05 Qualcomm Incorporated System and method for piecewise linear approximation
US10108581B1 (en) 2017-04-03 2018-10-23 Google Llc Vector reduction processor
US10331445B2 (en) 2017-05-24 2019-06-25 Microsoft Technology Licensing, Llc Multifunction vector processor circuits
GB2574817B (en) * 2018-06-18 2021-01-06 Advanced Risc Mach Ltd Data processing systems
US11294670B2 (en) * 2019-03-27 2022-04-05 Intel Corporation Method and apparatus for performing reduction operations on a plurality of associated data element values
CN110807521B (en) * 2019-10-29 2022-06-24 Zhonghao Xinying (Hangzhou) Technology Co., Ltd. Processing device, chip, electronic equipment and method supporting vector operation
GB2601466A (en) * 2020-02-10 2022-06-08 Xmos Ltd Rotating accumulator
US20240004647A1 (en) * 2022-07-01 2024-01-04 Andes Technology Corporation Vector processor with vector and element reduction method
US20240176617A1 (en) * 2022-11-28 2024-05-30 International Business Machines Corporation Vector reduce instruction
US20250208878A1 (en) * 2023-12-20 2025-06-26 Advanced Micro Devices, Inc. Accumulation apertures

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5845112A (en) * 1997-03-06 1998-12-01 Samsung Electronics Co., Ltd. Method for performing dead-zone quantization in a single processor instruction
US20080016321A1 (en) * 2006-07-11 2008-01-17 Pennock James D Interleaved hardware multithreading processor architecture
CN101398753A (en) * 2007-09-27 2009-04-01 Nvidia Corp System, method and computer program product for performing a scan operation
CN101436121A (en) * 2007-11-15 2009-05-20 Nvidia Corp Method and device for performing a scan operation on parallel processor architecture
US20100049950A1 (en) * 2008-08-15 2010-02-25 Apple Inc. Running-sum instructions for processing vectors

Family Cites Families (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4996661A (en) * 1988-10-05 1991-02-26 United Technologies Corporation Single chip complex floating point numeric processor
US5542074A (en) * 1992-10-22 1996-07-30 Maspar Computer Corporation Parallel processor system with highly flexible local control capability, including selective inversion of instruction signal and control of bit shift amount
US5717947A (en) * 1993-03-31 1998-02-10 Motorola, Inc. Data processing system and method thereof
US6058473A (en) * 1993-11-30 2000-05-02 Texas Instruments Incorporated Memory store from a register pair conditional upon a selected status bit
US5727229A (en) * 1996-02-05 1998-03-10 Motorola, Inc. Method and apparatus for moving data in a parallel processor
US6542918B1 (en) * 1996-06-21 2003-04-01 Ramot At Tel Aviv University Ltd. Prefix sums and an application thereof
US5864703A (en) * 1997-10-09 1999-01-26 Mips Technologies, Inc. Method for providing extended precision in SIMD vector arithmetic operations
US7395302B2 (en) * 1998-03-31 2008-07-01 Intel Corporation Method and apparatus for performing horizontal addition and subtraction
US6418529B1 (en) * 1998-03-31 2002-07-09 Intel Corporation Apparatus and method for performing intra-add operation
US6295597B1 (en) * 1998-08-11 2001-09-25 Cray, Inc. Apparatus and method for improved vector processing to support extended-length integer arithmetic
US6192384B1 (en) * 1998-09-14 2001-02-20 The Board Of Trustees Of The Leland Stanford Junior University System and method for performing compound vector operations
US6324638B1 (en) * 1999-03-31 2001-11-27 International Business Machines Corporation Processor having vector processing capability and method for executing a vector instruction in a processor
US7624138B2 (en) * 2001-10-29 2009-11-24 Intel Corporation Method and apparatus for efficient integer transform
US6920545B2 (en) * 2002-01-17 2005-07-19 Raytheon Company Reconfigurable processor with alternately interconnected arithmetic and memory nodes of crossbar switched cluster
US7376812B1 (en) * 2002-05-13 2008-05-20 Tensilica, Inc. Vector co-processor for configurable and extensible processor architecture
US7159099B2 (en) * 2002-06-28 2007-01-02 Motorola, Inc. Streaming vector processor with reconfigurable interconnection switch
US7051186B2 (en) * 2002-08-29 2006-05-23 International Business Machines Corporation Selective bypassing of a multi-port register file
TWI221562B (en) * 2002-12-12 2004-10-01 Chung Shan Inst Of Science C6x_VSP-C6x vector signal processor
US7293056B2 (en) * 2002-12-18 2007-11-06 Intel Corporation Variable width, at least six-way addition/accumulation instructions
US20040193847A1 (en) * 2003-03-31 2004-09-30 Lee Ruby B. Intra-register subword-add instructions
EP1623307B1 (en) * 2003-05-09 2015-07-01 QUALCOMM Incorporated Processor reduction unit for accumulation of multiple operands with or without saturation
TW200504592A (en) * 2003-07-24 2005-02-01 Ind Tech Res Inst Reconfigurable apparatus with high hardware efficiency
US7797363B2 (en) * 2004-04-07 2010-09-14 Sandbridge Technologies, Inc. Processor having parallel vector multiply and reduce operations with sequential semantics
DE102006027181B4 (en) * 2006-06-12 2010-10-14 Universität Augsburg Processor with internal grid of execution units
US7725518B1 (en) * 2007-08-08 2010-05-25 Nvidia Corporation Work-efficient parallel prefix sum algorithm for graphics processing units
US7895419B2 (en) * 2008-01-11 2011-02-22 International Business Machines Corporation Rotate then operate on selected bits facility and instructions therefore
CN102047219A (en) * 2008-05-30 2011-05-04 NXP B.V. Method for vector processing
US9176735B2 (en) * 2008-11-28 2015-11-03 Intel Corporation Digital signal processor having instruction set with one or more non-linear complex functions
US8595467B2 (en) * 2009-12-29 2013-11-26 International Business Machines Corporation Floating point collect and operate
US8667042B2 (en) * 2010-09-24 2014-03-04 Intel Corporation Functional unit for vector integer multiply add instruction
US8868885B2 (en) * 2010-11-18 2014-10-21 Ceva D.S.P. Ltd. On-the-fly permutation of vector elements for executing successive elemental instructions
EP2695054B1 (en) * 2011-04-01 2018-08-15 Intel Corporation Vector friendly instruction format and execution thereof
US9760372B2 (en) * 2011-09-01 2017-09-12 Hewlett Packard Enterprise Development Lp Parallel processing in plural processors with result register each performing associative operation on respective column data
CN104040488B (en) * 2011-12-22 2017-06-09 Intel Corp Vector instruction to give the complex conjugate of the corresponding complex number
WO2013095634A1 (en) * 2011-12-23 2013-06-27 Intel Corporation Systems, apparatuses, and methods for performing a horizontal partial sum in response to a single instruction
WO2013095631A1 (en) * 2011-12-23 2013-06-27 Intel Corporation Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or substract in response to a single instruction
US9823924B2 (en) * 2013-01-23 2017-11-21 International Business Machines Corporation Vector element rotate and insert under mask instruction
JP6079433B2 (en) * 2013-05-23 2017-02-15 富士通株式会社 Moving average processing program and processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Parallel prefix adders"; Kostas Vitoroulis; <http://users.encs.concordia.ca/~asim/COEN_6501/Lecture_Notes/Parallel%20prefix%20adders%20presentation.ppt>; 20061231; presentation slides 1-35 *

Also Published As

Publication number Publication date
EP3033670B1 (en) 2019-11-06
US20150052330A1 (en) 2015-02-19
TWI507982B (en) 2015-11-11
CN105453028A (en) 2016-03-30
WO2015023465A1 (en) 2015-02-19
EP3033670A1 (en) 2016-06-22
JP2016530631A (en) 2016-09-29
TW201519090A (en) 2015-05-16

Similar Documents

Publication Publication Date Title
CN105453028B (en) Vector accumulation method and equipment
EP3026549B1 (en) Systems and methods of data extraction in a vector processor
EP2909713B1 (en) Selective coupling of an address line to an element bank of a vector register file
US8713285B2 (en) Address generation unit for accessing a multi-dimensional data structure in a desired pattern
US7962718B2 (en) Methods for performing extended table lookups using SIMD vector permutation instructions that support out-of-range index values
CN105009075B (en) The vertical addressing mode of the indirect element of vector with horizontal substitution
US11372804B2 (en) System and method of loading and replication of sub-vector values
TW201502978A (en) Vector processing engine with programmable data path for providing multimode base-2X butterfly vector processing circuitry and associated vector processors, systems and methods
CN100437547C (en) Digital signal processor with cascaded SIMD architecture and signal processing method thereof
CN101061460B (en) Micro processor device and method for shuffle operations
CN108319559A (en) Data processing equipment for controlling vector memory access and method
US8843730B2 (en) Executing instruction packet with multiple instructions with same destination by performing logical operation on results of instructions and storing the result to the destination
US8427952B1 (en) Microcode engine for packet processing
JP6687803B2 (en) Systems and methods for piecewise linear approximation
JP2014238859A (en) System and method of processing hierarchical very long instruction packets
CN107229446A (en) A kind of audio data processor
US7441099B2 (en) Configurable SIMD processor instruction specifying index to LUT storing information for different operation and memory location for each processing unit
US20140281421A1 (en) Arbitrary size table lookup and permutes with crossbar
US10162752B2 (en) Data storage at contiguous memory addresses
US8046569B2 (en) Processing element having dual control stores to minimize branch latency
CA3033960C (en) Data storage at contiguous memory addresses

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant