The present application claims priority from commonly owned U.S. Non-Provisional Patent Application No. 13/967,191, filed on August 14, 2013, the contents of which are expressly incorporated herein by reference in their entirety.
Detailed Description
Referring to Fig. 1, a diagram of an illustrative process of executing a vector instruction is disclosed and generally designated 100. The vector instruction may include an accumulating vector arithmetic reduction instruction, such as the illustrative accumulating vector arithmetic reduction instruction 101. The accumulating vector arithmetic reduction instruction 101 may be executed at a processor, such as a pipelined vector processor, as described with reference to Fig. 2. The processor may receive an input vector 122 that includes multiple elements 102. The processor may process the input vector 122 and generate an output vector 120. The output vector 120 (e.g., multiple output elements stored in the output vector 120) may be based on the accumulating vector arithmetic reduction instruction 101. For example, executing the accumulating vector arithmetic reduction instruction 101 may generate a particular output by adding a particular element of the multiple elements 102 to one or more other elements of the multiple elements 102 that precede the particular element in the sequential order of the input vector 122 (e.g., the additions are cumulative).
The multiple elements 102 (e.g., the input vector 122) and the output vector 120 may each include N elements, where N is an integer greater than one. The multiple elements 102 may include a first element 104 (s0), a second element 106 (s1), a third element 108 (s2), and an Nth element 110 (s(N-1)). The multiple elements 102 may be stored in a sequential order, such as "s0, s1, s2, ... s(N-1)", where s0 is the first sequential element and s(N-1) is the last sequential element according to the sequential order. Although four elements are shown, the number of elements (e.g., N) in the multiple elements 102 may be more or fewer than four. In a particular embodiment, before the accumulating vector arithmetic reduction instruction 101 is executed, a vector permutation instruction is executed using the input vector 122 to arrange the multiple elements 102 in the sequential order.
Executing the accumulating vector arithmetic reduction instruction 101 may produce multiple output elements (e.g., multiple output values) that are stored in the output vector 120. The output vector 120 may have the same number of elements (e.g., N) as the input vector 122. Executing the accumulating vector arithmetic reduction instruction 101 may include providing N output elements. The N output elements may be stored in the output vector 120. For example, a first output element 112, a second output element 114, a third output element 116, and an Nth output element 118 may be stored in the output vector 120. The output elements 112 to 118 may be stored in the output vector 120 concurrently. For example, the first output element 112 and the second output element 114 may be stored in the output vector 120 during a single execution cycle in which the processor executes the accumulating vector arithmetic reduction instruction 101.
Each output element of the multiple output elements 112 to 118 (e.g., the N output elements) may be based on an arithmetic operation (e.g., an addition operation) performed on one or more elements of the multiple elements 102. After the accumulating vector arithmetic reduction instruction 101 is executed using the multiple elements 102 ordered in the particular sequential order "s0, s1, s2, ... s(N-1)", the first output element 112 may be equal to s0, the second output element 114 may be equal to s0+s1, the third output element 116 may be equal to s0+s1+s2, and the Nth output element 118 may be equal to the sum of every element of the multiple elements 102 (s0+s1+...+s(N-1)). For example, executing the accumulating vector arithmetic reduction instruction 101 may include providing (e.g., generating) the first element 104 as the first output element 112, and adding the first element 104 to the second element 106 to provide (e.g., generate) the second output element 114. The first output element 112 and the second output element 114 may be stored in different output elements of the output vector 120. Executing the accumulating vector arithmetic reduction instruction 101 may further include adding the first element 104 and the second element 106 to the third element 108 to provide the third output element 116, and storing the third output element 116 in the output vector 120. Executing the accumulating vector arithmetic reduction instruction 101 may further include adding together each of the elements of the multiple elements 102 to provide the Nth output element 118, and storing the Nth output element 118 in the output vector 120.
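The relationship between the input elements and the output elements described above is a running (prefix) sum. The following is a minimal software sketch of that behavior only, not of the hardware; the function name `accumulating_reduction` is hypothetical and does not appear in the instruction set.

```python
from itertools import accumulate


def accumulating_reduction(elements):
    """Return [s0, s0+s1, ..., s0+...+s(N-1)] for the given input elements.

    Illustrative model only: the described processor may produce all N
    outputs during a single execution cycle, which this loop does not model.
    """
    return list(accumulate(elements))


# Four input elements s0..s3 produce four partial results.
print(accumulating_reduction([3, 1, 4, 1]))  # [3, 4, 8, 9]
```

The last output equals the sum of all inputs, and every earlier output is one of the partial results that would otherwise require a separate reduction.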
As illustrated in Fig. 1, the accumulating vector arithmetic reduction instruction 101 may include an instruction name 180 (vrcadd) (e.g., an opcode). The accumulating vector arithmetic reduction instruction 101 may also include one or more fields, such as a first field 182 (Vu), a second field 184 (Vd), a third field 186 (Q), a fourth field 188 (Op), a fifth field 190 (sc32), and a sixth field 192 (sat). A first value stored in the first field 182 may indicate the input vector 122 (e.g., vector Vu) to be used during execution of the accumulating vector arithmetic reduction instruction 101, and a second value stored in the second field 184 may indicate the output vector 120 (e.g., vector Vd) to be used during execution of the accumulating vector arithmetic reduction instruction 101. A third value stored in the third field 186 may indicate a mask (e.g., mask Q), as described in further detail with reference to Figs. 11A to 11B; a fourth value stored in the fourth field 188 may indicate an operation vector (e.g., operation vector Op); a fifth value stored in the fifth field 190 may indicate an input value type, as described in further detail with reference to Figs. 3 to 4; and a sixth value stored in the sixth field 192 may indicate whether saturation is to be performed during the accumulating vector arithmetic reduction, as described with reference to Fig. 7.
Although an addition operation has been described, the accumulating vector arithmetic reduction instruction 101 is not limited to performing only addition operations. For example, the accumulating vector arithmetic reduction instruction 101 may indicate one or more arithmetic operations to be performed on the multiple elements 102. The one or more arithmetic operations may include an addition operation, a subtraction operation, or a combination thereof. For example, the arithmetic reduction may be performed using one or more addition operations, using one or more subtraction operations, or using a combination of one or more addition operations and one or more subtraction operations. The one or more arithmetic operations may be indicated by a value in a particular field (e.g., a particular parameter), such as the fourth field 188. For example, the fourth field 188 may include a pointer to a location in memory that stores the operation vector (e.g., a vector indicating the one or more arithmetic operations) or to a register that stores the operation vector. Each element of the operation vector may indicate a particular operation (e.g., an addition operation or a subtraction operation) to be performed on the corresponding element of the multiple elements 102 during execution of the accumulating vector arithmetic reduction instruction 101. When at least one of the one or more arithmetic operations is a subtraction operation, one or more elements of the multiple elements 102 may be complemented before the multiple output elements are generated. For example, before the first output element 112 and the second output element 114 are provided (e.g., before the multiple output elements are generated), one or more elements of the multiple elements 102 may be complemented based on the accumulating vector arithmetic reduction instruction 101 (e.g., based on the fourth value stored in the fourth field 188).
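The use of an operation vector can be sketched in software as follows. This is an illustration under stated assumptions: the encoding of the operation vector as the strings "+" and "-" is hypothetical, and complementing an element is modeled here as arithmetic negation before accumulation.

```python
from itertools import accumulate


def accumulating_reduction_with_ops(elements, ops):
    """Accumulate elements, subtracting those whose op is '-'.

    `ops` is a hypothetical stand-in for operation vector Op: ops[i] is
    '+' (add) or '-' (subtract) for elements[i]. Elements marked '-' are
    negated first, modeling the complement step, and then accumulated.
    """
    if len(elements) != len(ops):
        raise ValueError("operation vector must have one entry per element")
    signed = [-e if op == "-" else e for e, op in zip(elements, ops)]
    return list(accumulate(signed))


# 5, 5-2, 5-2+7, 5-2+7-1
print(accumulating_reduction_with_ops([5, 2, 7, 1], ["+", "-", "+", "-"]))  # [5, 3, 10, 9]
```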
During operation, the processor may receive the accumulating vector arithmetic reduction instruction 101. The processor may execute the accumulating vector arithmetic reduction instruction using the multiple elements 102 to generate the multiple output elements and store them in the output vector 120. The multiple output elements may represent multiple partial results of the accumulating vector arithmetic reduction operation.
Compared with generating the multiple partial results during execution of multiple vector instructions, the accumulating vector arithmetic reduction instruction 101 may provide storage and power consumption benefits by generating the multiple partial results (e.g., the multiple output elements 112 to 118) during execution of a single vector instruction. For example, generating the multiple partial results during execution of a single vector instruction may use less storage space in a memory or a register file, and may reduce the power consumption of the processor, as compared with generating the multiple partial results during execution of multiple vector instructions.
Fig. 2 is a block diagram of an embodiment of a system 200 configured to execute vector instructions. The system 200 may include a processor 202 configured to receive a vector instruction 220 and the input vector 122 and to provide the output vector 120. The vector instruction 220 may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1. Alternatively, as illustrative, non-limiting examples, the vector instruction 220 may be a segmented vector arithmetic reduction instruction (as described with reference to Fig. 9) or a rotated segmented vector arithmetic reduction instruction (as described with reference to Fig. 10).
The processor 202 may include an arithmetic logic unit (ALU) 204 and control logic 210. The ALU 204 may include a reduction tree 206 and a rotation unit 208. The ALU 204 may be configured to receive the input vector 122 and to perform one or more arithmetic operations on the input vector 122 using the reduction tree 206. The reduction tree 206 may provide the output vector 120. The output vector 120 may be provided to a location identified by the vector instruction 220, such as a register or a location in memory. For example, the output vector 120 may be provided to a location based on a particular field of the vector instruction 220 (e.g., the second field 184 of Fig. 1).
The ALU 204 and the reduction tree 206 may be part of an execution pipeline. For example, the processor 202 may be a pipelined vector processor that includes one or more pipelines. The reduction tree 206 may be included in the one or more pipelines. The reduction tree 206 may have a number of stages (e.g., a stage depth) that is based on the number of input elements (of the input vector 122). The number of stages of the reduction tree 206 may correspond to the base-2 logarithm of the number of input elements. For example, when the number of input elements is 32, the reduction tree 206 may have five stages. The reduction tree 206 may include multiple arithmetic operation units arranged in one or more rows. Each stage of the reduction tree 206 may correspond to a row of arithmetic operation units of the reduction tree 206.
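The stage count stated above can be checked with a short calculation. This sketch assumes the number of input elements is a power of two, which the text does not require; the function name `reduction_tree_stages` is hypothetical.

```python
def reduction_tree_stages(num_inputs):
    """Return log2(num_inputs), the stage depth for a power-of-two input count."""
    if num_inputs < 1 or num_inputs & (num_inputs - 1):
        raise ValueError("this sketch assumes a power-of-two number of inputs")
    return num_inputs.bit_length() - 1


print(reduction_tree_stages(32))  # 5
print(reduction_tree_stages(4))   # 2
```

The second call matches a tree with two rows of adders for four input elements.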
The control logic 210 may be configured to select (e.g., selectively enable) one or more adders of the multiple adders of the reduction tree 206 based on the vector instruction 220 (e.g., the accumulating vector arithmetic reduction instruction 101 of Fig. 1), as described with reference to Figs. 3 to 7. Selectively enabling one or more arithmetic operation units may cause the reduction tree 206 to provide (e.g., generate) one or more output elements for insertion into the output vector 120.
The rotation unit 208 may be configured to receive a rotation vector 280 and to selectively rotate the rotation vector 280 based on the vector instruction 220, as further described with reference to Fig. 10. The rotation unit 208 may be configured to rotate the rotation vector 280 before the one or more output elements are inserted into (e.g., stored in) the output vector 120. For example, the rotation unit 208 may rotate the rotation vector 280 in parallel with the reduction tree 206 generating the one or more output elements based on the input vector 122. The rotated rotation vector and the one or more output elements may be provided to a multiplexer 212 for insertion into the output vector 120 (e.g., to generate the output vector 120). For example, when the input vector 122 and the rotation vector 280 each include 16 elements, and execution of the vector instruction 220 generates eight output elements using the reduction tree 206, the multiplexer 212 may select the eight output elements and eight rotated elements from the rotated rotation vector for insertion into the output vector 120. Other selections may be made based on an input vector 122 and/or a rotation vector 280 having other sizes, or based on execution of a vector instruction 220 that generates a different number of output elements. In an alternative embodiment, the rotation vector 280 may be the input vector 122, and multiple input elements from the input vector 122 may be provided to both the rotation unit 208 and the reduction tree 206.
As an illustrative example, the rotation unit 208 may be a rotator or a barrel vector shifter. The rotation vector 280 may include multiple prior elements (e.g., multiple elements produced by executing a previous vector instruction). The rotation vector 280 may be identified by the vector instruction 220. For example, the rotation vector 280 may be stored at a location (e.g., a register or a location in memory) identified by a field of the vector instruction 220. In a particular embodiment, a first location associated with the rotation vector 280 is the same as a second location associated with the output vector 120. For example, the vector instruction 220 may identify a particular register as the output vector 120, and the previously stored elements (e.g., contents) of the particular register may be used as the rotation vector 280. The previously stored values at the particular register may be the result of a previous vector arithmetic reduction instruction. In another embodiment, the first location associated with the rotation vector 280 is the same as a third location associated with the input vector 122. In other embodiments, the rotation vector 280 may be identified by another value stored in another field of the vector instruction 220 (e.g., by a value stored in a field different from the field that identifies the output vector 120), or the rotation vector may be predetermined based on the instruction name (e.g., the opcode) of the vector instruction 220.
During operation, the processor 202 may be configured to receive and execute the vector instruction 220 to perform a vector arithmetic reduction (e.g., an accumulating vector arithmetic reduction or a segmented vector arithmetic reduction) on the input vector 122 using the reduction tree 206. The reduction tree 206 may perform the vector arithmetic reduction on the input vector 122 to generate multiple results concurrently (e.g., during a single execution cycle of the processor 202). During execution of the vector instruction 220, the multiple results produced by the reduction tree 206 may be stored in the output vector 120.
Compared with other systems that generate multiple partial results during execution of multiple vector instructions, the system 200 may provide improvements in storage and power consumption by generating the multiple partial results (e.g., the multiple results) during execution of a single vector instruction (e.g., the vector instruction 220).
Referring to Fig. 3, a block diagram of a first illustrative embodiment of a reduction tree 300 is disclosed. For example, the reduction tree 300 may include the reduction tree 206 of Fig. 2. The reduction tree 300 may be used to execute an accumulating vector arithmetic reduction instruction, such as the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the vector instruction 220 of Fig. 2. The reduction tree 300 may be configured to receive multiple input elements stored in the input vector 122 (including a first input element 302 and a second input element 304) and to provide (e.g., generate) multiple output elements to be stored in the output vector 120. The output vector 120 may include a first output element 306 and a second output element 308.
Each input element of the multiple input elements and each output element of the multiple output elements may include one or more sub-elements. For example, the first input element 302 may include a first plurality of input sub-elements 330 to 336 (s0 to s3), such as a first input sub-element 330 (s0), a second input sub-element 332 (s1), a third input sub-element 334 (s2), and a fourth input sub-element 336 (s3). The second input element 304 may include a second plurality of input sub-elements 338 to 344 (s4 to s7), such as a fifth input sub-element 338 (s4), a sixth input sub-element 340 (s5), a seventh input sub-element 342 (s6), and an eighth input sub-element 344 (s7). In addition, the first output element 306 may include a first plurality of output sub-elements 366 to 372 (d0 to d3), such as a first output sub-element 366 (d0), a second output sub-element 368 (d1), a third output sub-element 370 (d2), and a fourth output sub-element 372 (d3). The second output element 308 may include a second plurality of output sub-elements 374 to 380 (d4 to d7), such as a fifth output sub-element 374 (d4), a sixth output sub-element 376 (d5), a seventh output sub-element 378 (d6), and an eighth output sub-element 380 (d7). Each input element and each output element may have the same size (e.g., the same number of bits). In addition, each input sub-element may have the same size (e.g., the same number of bits) as each output sub-element. For example, each input element (e.g., the first input element 302) and each output element may be 64 bits and may include four 16-bit sub-elements (e.g., the input sub-elements 330 to 336). In an alternative embodiment, each of the input sub-elements 330 to 344 is an individual input element and each of the output sub-elements 366 to 380 is an individual output element, such that the input vector 122 includes the multiple input elements 330 to 344 and the output vector 120 includes the multiple output elements 366 to 380.
The reduction tree 300 may include multiple arithmetic operation units. In a particular embodiment, the multiple arithmetic operation units may be multiple adders, including a first adder 320 and a second adder 321. In other embodiments, the multiple arithmetic operation units may include subtractors or a combination of adders and subtractors. The multiple adders may include (e.g., be arranged into) one or more rows of adders. For example, the multiple adders may include (e.g., be arranged into) a first row 312. Although depicted as including a single row, the multiple adders may include more than one row.
One or more adders of the multiple adders may be selectively enabled based on a received accumulating vector arithmetic reduction instruction, as described with reference to Fig. 7. An adder that is not selectively enabled (illustrated in Fig. 3 by shading, such as the second adder 321) may be configured to output a particular input received at the adder (e.g., zero is added to the particular input), as described with reference to Fig. 7. For example, the second adder 321 may be configured to receive the first input element 302 and to output the first input element 302 to be stored in the output vector 120. An adder that is selectively enabled (illustrated in Fig. 3 by an unshaded adder, such as the first adder 320) may be configured to perform an addition operation. For example, the first adder 320 may perform an addition operation based on the first input element 302 and the second input element 304. The first adder 320 may produce an adder output equal to the sum of the first input element 302 and the second input element 304. The adder output may be provided as an output element (e.g., the second output element 308) to be stored in the output vector 120. Through selective enabling, the multiple adders may produce (e.g., provide) the multiple output elements stored in the output vector 120.
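The selective enabling described for the single row of Fig. 3 can be modeled as follows. This is a behavioral sketch only; the helper `adder`, its argument names, and the exact wiring of the second operand are assumptions, since the text states only that an adder that is not enabled adds zero to a particular input.

```python
def adder(forwarded, other, enabled):
    """Model one adder of the reduction tree.

    Enabled: return forwarded + other.
    Not enabled: add zero, so `forwarded` passes through unchanged.
    """
    return forwarded + (other if enabled else 0)


def first_row(e0, e1):
    """Model the first row 312 for two input elements e0 and e1."""
    out0 = adder(e0, e1, enabled=False)  # stands in for the second adder 321
    out1 = adder(e0, e1, enabled=True)   # stands in for the first adder 320
    return [out0, out1]


print(first_row(7, 5))  # [7, 12]
```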
The multiple input elements may have an input type indicated by the accumulating vector arithmetic reduction instruction (e.g., by the value stored in the fifth field 190 of the accumulating vector arithmetic reduction instruction 101 of Fig. 1). The input type may identify real, imaginary, or complex (e.g., a combination of real and imaginary) values, and may additionally be associated with an element size. When the input type is real, each sub-element of an element may represent a real value. When the input type is imaginary, each sub-element of an element may represent an imaginary value. When the input type is complex, for each element, at least one sub-element may represent a real value and at least one other sub-element may represent an imaginary value. Accordingly, the reduction tree 300 may support multiple different input types, such as 64-bit real, 64-bit imaginary, 32-bit real, 32-bit imaginary, 16-bit real, 16-bit imaginary, 32-bit complex, 16-bit complex, one or more other input types, or any combination thereof.
For example, when the input type is 16-bit complex, each input element 302 and 304 may be 64 bits, each of the input sub-elements s0, s2, s4, and s6 may represent a 16-bit real value, and each of the input sub-elements s1, s3, s5, and s7 may represent a 16-bit imaginary value. Accordingly, each 64-bit input element may be associated with two 16-bit complex input sub-elements (e.g., a first pair s0 and s1 and a second pair s2 and s3). As another example, when the input type identifies 32-bit complex, each input element 302 and 304 may be 64 bits, a first pair of input sub-elements s0 and s1 and a second pair of input sub-elements s4 and s5 may represent 32-bit real values, and a third pair of input sub-elements s2 and s3 and a fourth pair of input sub-elements s6 and s7 may represent 32-bit imaginary values. Accordingly, each 64-bit input element may be associated with one 32-bit complex input sub-element (e.g., the pair s0 and s1 with the pair s2 and s3, or the pair s4 and s5 with the pair s6 and s7). In each example, the multiple output elements may include output elements and output sub-elements of a type similar to that of the input elements (e.g., the output elements may have the type identified by the input type).
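The two complex layouts described above can be illustrated by interpreting the four 16-bit sub-elements of one 64-bit element. This is a sketch under assumptions: the type names "complex16" and "complex32" are hypothetical labels, and treating the first sub-element of each pair as the low half of a 32-bit value is an assumed ordering that the text does not specify.

```python
def as_complex_pairs(sub, input_type):
    """Group four 16-bit sub-elements into (real, imaginary) pairs.

    sub: [s0, s1, s2, s3], unsigned 16-bit values of one 64-bit element.
    "complex16": s0/s2 are real parts, s1/s3 are imaginary parts.
    "complex32": s0,s1 form one 32-bit real part and s2,s3 form one
                 32-bit imaginary part (low half first, by assumption).
    """
    if input_type == "complex16":
        return [(sub[0], sub[1]), (sub[2], sub[3])]
    if input_type == "complex32":
        real = sub[0] | (sub[1] << 16)
        imag = sub[2] | (sub[3] << 16)
        return [(real, imag)]
    raise ValueError("unsupported input type in this sketch")


print(as_complex_pairs([1, 2, 3, 4], "complex16"))  # [(1, 2), (3, 4)]
print(as_complex_pairs([1, 2, 3, 4], "complex32"))  # [(131073, 262147)]
```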
Each adder of the multiple adders may include multiple sub-adders. For example, the first adder 320 may include a first sub-adder 322, a second sub-adder 324, a third sub-adder 326, and a fourth sub-adder 328. In a particular embodiment, the first adder 320 is a 64-bit adder that is segmented to perform four 16-bit addition operations (e.g., each sub-adder 322 to 328 represents one segment of the first adder 320). In an alternative embodiment, each sub-adder 322 to 328 is a 16-bit adder, and the first adder 320 represents a group of four 16-bit adders. Each adder of the multiple adders may have a configuration similar to that of the first adder 320 (e.g., the second adder 321 may include four sub-adders). Although 64-bit adders and 16-bit sub-adders are described, adders and sub-adders of other sizes may be used, such as adders and sub-adders sized based on the input elements of the input vector 122.
Each adder may be configured to perform multiple addition operations in an interleaved manner via the multiple sub-adders. For example, the first adder 320 may be configured to add the first input sub-element 330 (s0) to the fifth input sub-element 338 (s4) using the first sub-adder 322, to add the second input sub-element 332 (s1) to the sixth input sub-element 340 (s5) using the second sub-adder 324, to add the third input sub-element 334 (s2) to the seventh input sub-element 342 (s6) using the third sub-adder 326, and to add the fourth input sub-element 336 (s3) to the eighth input sub-element 344 (s7) using the fourth sub-adder 328. Accordingly, the reduction tree 300 may be configured to perform the accumulating vector arithmetic reduction operation in an interleaved manner, sub-element by sub-element, using the first input element 302 and the second input element 304. Performing the interleaved addition sub-element by sub-element enables the reduction tree to perform addition operations on sub-elements having different data types (e.g., real, imaginary, or complex).
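The interleaved, sub-element-wise addition can be sketched as a segmented 64-bit add with four independent 16-bit lanes. The helper names are hypothetical, placing s0 in the lowest 16 bits is an assumed layout, and each lane wraps on overflow because saturation (the sat field) is not modeled here.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def split_lanes(element):
    """Split a 64-bit element into four 16-bit sub-elements (lane 0 = low bits)."""
    return [(element >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def join_lanes(lanes):
    """Pack four 16-bit sub-elements back into one 64-bit element."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (LANE_BITS * i)
    return value


def lanewise_add(a, b):
    """Add two 64-bit elements lane by lane; no carry crosses a lane boundary."""
    return join_lanes([(x + y) & LANE_MASK
                       for x, y in zip(split_lanes(a), split_lanes(b))])


first = join_lanes([0xFFFF, 2, 3, 4])   # s0..s3
second = join_lanes([1, 20, 30, 40])    # s4..s7
print(split_lanes(lanewise_add(first, second)))  # [0, 22, 33, 44]
```

Lane 0 overflows to zero without disturbing lane 1, which is what allows real and imaginary sub-elements to share one adder.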
The adder outputs of the bottom row of adders (e.g., the first row 312) of the multiple adders may be provided as output elements (e.g., the output elements 306 and 308) and stored in the output vector 120. For example, each output of each sub-adder of the second adder 321 may be provided as a corresponding output sub-element of the first output element 306, and each output of each sub-adder 322 to 328 of the first adder 320 may be provided as a corresponding output sub-element of the second output element 308. The multiple output elements 306 and 308 (e.g., the multiple output sub-elements 366 to 380) may represent multiple partial results of the accumulating vector arithmetic reduction.
Executing the received accumulating vector arithmetic reduction instruction may produce multiple partial results having the input type identified by the accumulating vector arithmetic reduction instruction. For example, when the accumulating vector arithmetic reduction instruction is associated with a complex operation (e.g., indicates a complex operation) and the input type is 16-bit complex (e.g., the input sub-elements s0, s2, s4, and s6 represent real values and the input sub-elements s1, s3, s5, and s7 represent imaginary values), executing the accumulating vector arithmetic reduction instruction may include generating a first real sub-element of the first output element 306 (e.g., the first output sub-element 366 (d0)) and a first imaginary sub-element of the first output element 306 (e.g., the second output sub-element 368 (d1)). Executing the accumulating vector arithmetic reduction instruction may further include generating a second real sub-element of the second output element 308 (e.g., the fifth output sub-element 374 (d4)) and a second imaginary sub-element of the second output element 308 (e.g., the sixth output sub-element 376 (d5)). Accordingly, when the input type identifies the input elements 302 and 304 as complex, the output elements 306 and 308 may be complex.
During operation, the reduction tree 300 may be used to execute a received accumulating vector arithmetic reduction instruction. During execution of the accumulating vector arithmetic reduction instruction, one or more adders of the multiple adders may be selectively enabled based on the accumulating vector arithmetic reduction instruction to generate multiple output elements including the output elements 306 and 308 (e.g., including the multiple output sub-elements 366 to 380 (d0 to d7)). For example, the first adder 320 may be selectively enabled in whole or at least in part (e.g., one or more of the sub-adders 322 to 328 may be selectively enabled based on the accumulating vector arithmetic reduction instruction). During execution of the accumulating vector arithmetic reduction instruction, one or more outputs of the multiple adders may be provided as the output elements 306 and 308 (e.g., the multiple output sub-elements 366 to 380 (d0 to d7)) to be stored in the output vector 120.
Referring to Fig. 4, a block diagram of a second illustrative embodiment of a reduction tree 400 is disclosed. The reduction tree 400 may be used during execution of an accumulating vector arithmetic reduction instruction (e.g., the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the vector instruction 220 of Fig. 2). As illustrative, non-limiting examples, the reduction tree 400 may include the reduction tree 206 of Fig. 2 or the reduction tree 300 of Fig. 3. For example, the reduction tree 400 may illustrate an extension of the reduction tree 300 of Fig. 3 to support an embodiment in which the input vector 122 has four input elements. The reduction tree 400 may include multiple adders, including the first adder 320, the second adder 321, and adders 402 to 408, that are configured to be selectively enabled based on the accumulating vector arithmetic reduction instruction to produce the output vector 120. Although Fig. 4 illustrates multiple adders, the reduction tree 400 may include a number of other arithmetic operation units.
The input vector 122 may include the first input element 302, the second input element 304, a third input element 410, and a fourth input element 412. Each input element may include multiple input sub-elements. For example, the first input element 302 may include input sub-elements s0 to s3, the second input element 304 may include input sub-elements s4 to s7, the third input element 410 may include input sub-elements s8 to s11, and the fourth input element 412 may include input sub-elements s12 to s15. The output vector 120 may include four output elements. For example, the output vector 120 may include the first output element 306, the second output element 308, a third output element 422, and a fourth output element 424. Each output element may include multiple output sub-elements. For example, the first output element 306 may include output sub-elements d0 to d3, the second output element 308 may include output sub-elements d4 to d7, the third output element 422 may include output sub-elements d8 to d11, and the fourth output element 424 may include output sub-elements d12 to d15.
The multiple adders may include (e.g., be arranged into) multiple rows, such as the first row 312 and a second row 414. Although two rows are shown, in other embodiments the multiple adders may include more rows or fewer rows, for example based on the number of input elements in the input vector 122. Although each row 312, 414 is illustrated as having four adders, in other embodiments each row may have more than four or fewer than four adders, for example based on the number of input elements in the input vector 122. Each of the adders 402 to 408 may include four sub-adders, as described with reference to the adders 320 and 321 of Fig. 3.
One or more adders of the multiple adders may be selectively enabled based on a received accumulating vector arithmetic reduction instruction, as described with reference to Fig. 7. An adder that is not selectively enabled (illustrated in Fig. 4 by shading, such as the second adder 321 and a third adder 402) may be configured to output a particular input received at the adder (e.g., by adding zero to the particular input), as described with reference to Fig. 7. For example, the second adder 321 may be configured to receive the first input element 302 and to output the first input element 302 to an adder in the second row 414. An adder that is selectively enabled (illustrated in Fig. 4 by an unshaded adder, such as the first adder 320, a fourth adder 404, a fifth adder 406, and a sixth adder 408) may be configured to perform an addition operation. For example, the first adder 320 may perform an addition operation based on the first input element 302 and the second input element 304, and the fourth adder 404 may be configured to perform an addition operation based on the third input element 410 and the fourth input element 412. The fifth adder 406 may perform an addition operation based on a first adder output of the first adder 320 and a second adder output of the third adder 402 (e.g., the value of the third input element 410), and the sixth adder 408 may perform an addition operation based on the first adder output and a third adder output of the fourth adder 404.
The adder outputs of the second row 414 may be provided as the multiple output elements to be stored in
the output vector 120 (for example, the output elements 306, 308, 422 and 424). Through selective
enabling, the multiple adders may generate (for example, provide) the multiple output elements stored in
the output vector 120. The output elements 306, 308, 422 and 424 (for example, the output sub-elements d0
to d15) may represent one or more partial products of the accumulating vector arithmetic reduction. For
example, the first output element 306 may be the first input element 302, the second output element 308
may be the sum of the first input element 302 and the second input element 304, the third output element
422 may be the sum of the first input element 302, the second input element 304 and the third input
element 410, and the fourth output element 424 may be the sum of the first input element 302, the second
input element 304, the third input element 410 and the fourth input element 412. The output elements 306,
308, 422 and 424 may be generated sub-element by sub-element, with the add operations performed in an
interleaved manner to generate the output sub-elements d0 to d15, as explained with reference to Fig. 3.
For example, the output sub-element d8 may be equal to the sum of the input sub-elements s0, s4 and s8,
and the output sub-element d12 may be equal to the sum of the input sub-elements s0, s4, s8 and s12. Each
output sub-element may be generated in a similar manner.
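The interleaved accumulation described for Fig. 4 can be sketched in software as follows. This is only a behavioral illustration, not the disclosed circuit: the function name `interleaved_prefix_sums`, the `lanes` parameter and the sample values are assumptions introduced here, and sub-element width and overflow are ignored.

```python
def interleaved_prefix_sums(sub_elements, lanes=4):
    """Return running sums taken separately over each sub-element position.

    Output d[i] is the sum of s[i % lanes], s[i % lanes + lanes], ..., s[i],
    so d8 = s0 + s4 + s8 and d12 = s0 + s4 + s8 + s12 when lanes == 4.
    """
    out = []
    for i, value in enumerate(sub_elements):
        # The running sum of a lane lives `lanes` positions back in the output.
        previous = out[i - lanes] if i >= lanes else 0
        out.append(previous + value)
    return out


s = list(range(1, 17))          # hypothetical values for s0..s15
d = interleaved_prefix_sums(s)  # d0..d15
```

With four sub-elements per element, `d[0:4]` reproduces the first input element, matching the statement that the first output element 306 may be the first input element 302.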
Although Fig. 4 shows a single reduction tree 400 (for example, a reduction network), in other embodiments
the reduction tree 400 may be logically partitioned into multiple parallel accumulating reduction networks
that operate in an interleaved manner. For example, in an alternative embodiment, each accumulating
reduction network may include a specific sub-adder of each adder (for example, the first accumulating
reduction network may include the corresponding first sub-adder of each adder). Each accumulating
reduction network may operate in parallel with the other accumulating reduction networks, and the result
from each accumulating reduction network may be stored in the output vector 120. For example, the
reduction tree 400 may be logically partitioned into four sixteen-bit accumulating reduction networks. In
another example, the reduction tree 400 may be logically partitioned into two 32-bit accumulating
reduction networks.
During operation, the reduction tree 400 may be used to execute a received accumulating vector arithmetic
reduction instruction. During execution of the accumulating vector arithmetic reduction instruction, one
or more adders of the multiple adders may be selectively enabled based on the instruction to generate the
multiple output elements 306, 308, 422 and 424, and the multiple output elements 306, 308, 422 and 424
may be stored in the output vector 120.
Referring to Fig. 5, a block diagram of a third illustrative embodiment of a reduction tree 500 is
disclosed. The reduction tree 500 may be used during execution of an accumulating vector arithmetic
reduction instruction (for example, the accumulating vector arithmetic reduction instruction 101 of Fig. 1
or the vector instruction 220 of Fig. 2). As illustrative, non-limiting examples, the reduction tree 500
may include the reduction tree 206 of Fig. 2, the reduction tree 300 of Fig. 3 or the reduction tree 400
of Fig. 4. The reduction tree 500 may be configured to receive multiple input elements 502 stored in the
input vector 122 and to provide (for example, generate) multiple output elements 506 to be stored in the
output vector 120.
The reduction tree 500 may include the multiple input elements 502, multiple adders 504 and the multiple
output elements 506. Although Fig. 5 illustrates multiple adders 504, the reduction tree 500 may include
multiple other arithmetic operation units. The multiple input elements 502 may include input elements s0
to s15 of the input vector 122. The multiple output elements 506 may include output elements d0 to d15 of
the output vector 120. The multiple input elements 502 (s0 to s15) may be arranged in a sequential order
(such as "s0, s1, s2 ... s15"), where s0 is the first sequential element and s15 is the last sequential
element according to the sequential order. The multiple output elements 506 (d0 to d15) may be arranged
in a similar sequential order "d0, d1, d2 ... d15".
Each input element of the multiple input elements 502 may have the same size. For example, each input
element of the multiple input elements 502 may be 64 bits. Each output element of the multiple output
elements 506 may also have the same size. For example, each output element of the multiple output
elements 506 may be 64 bits. In a particular embodiment, each input element may have the same size (for
example, 64 bits) as each output element. The number of input elements may be equal to the number of
output elements. For example, the input vector 122 may have 16 input elements and the output vector 120
may have 16 output elements. The numbers and sizes of elements are illustrative; the input elements and
the output elements may have sizes other than those illustrated, and the vectors (for example, the input
vector 122 and the output vector 120) may have sizes other than those illustrated (for example, other
numbers of elements). Although not illustrated, each input element may include multiple input
sub-elements (for example, four input sub-elements), and each output element may include four output
sub-elements, as described with reference to Figs. 3 to 4. Based on the type indicated by the
accumulating vector arithmetic reduction instruction, each input element and each output element may be
a real number, an imaginary number or a complex number, as described with reference to Figs. 3 to 4.
The multiple adders 504 may be arranged in multiple rows of adders, including a first row 512, a second
row 514, a third row 516 and a fourth row 518. Although four rows of adders are illustrated, in other
embodiments the reduction tree 500 may include (for example, be arranged in) fewer than four rows or more
than four rows, for example based on the number of input elements and output elements. Each adder of the
multiple adders 504 may have the same size. For example, each adder of the multiple adders 504 may be a
64-bit adder. Although not shown, each adder of the multiple adders 504 may include multiple sub-adders
and may be configured to perform add operations on sub-elements in an interleaved manner, as described
with reference to Figs. 3 to 4.
The output of each adder may be provided to the adder in the same column of the next row, and may also be
routed to other adders as shown in Fig. 5, so that the reduction tree 500 can generate the multiple
output elements 506 (d0 to d15). For example, the output of a first adder of the first row 512 (for
example, the adder of the first row 512 below input element s1) may be routed to a second adder of the
second row 514 (for example, the adder of the second row 514 below input element s2) and to a third adder
of the second row 514 (for example, the adder of the second row 514 below input element s3). The output
of the third adder may be routed to a fourth adder, a fifth adder, a sixth adder and a seventh adder of
the third row 516 (for example, the adders of the third row 516 below input elements s4 to s7,
respectively). In addition, the output of the seventh adder may be routed to eight adders of the fourth
row 518 (for example, the adders of the fourth row 518 below input elements s8 to s15).
One or more adders of the multiple adders 504 may be selectively enabled based on the accumulating vector
arithmetic reduction instruction. For example, the one or more adders may be selectively enabled by
control logic (not shown) (for example, the control logic 210 of Fig. 2), as illustrated by the unshaded
adders of Fig. 5. One or more adders that are not enabled (shown as the shaded adders of Fig. 5) may be
configured to output a received input (for example, by adding zero to the specific input), as described
with reference to Fig. 7.
The reduction tree 500 may be configured to generate the multiple output elements d0 to d15
simultaneously, based on the multiple input elements s0 to s15 and the accumulating vector arithmetic
reduction instruction. For example, the reduction tree 500 may be configured to provide the first input
element s0 as the first output element d0, to add the first input element s0 to the second input element
s1 to provide the second output element d1, and to store the first output element d0 and the second
output element d1 in the output vector 120. The reduction tree 500 may be configured to add the first
element s0 and the second element s1 to the third element s2 to provide the third output element d2. In
addition, the reduction tree 500 may be configured to generate the output element d15 by generating the
sum of the input elements s0 to s15. The output elements d3 to d14 may be generated as partial
accumulated sums in a similar manner.
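The row-by-row routing described for Fig. 5 resembles a parallel prefix-sum network. The sketch below is a software model under that assumption: the function name `prefix_sum_tree`, the index arithmetic and the sample values are illustrative choices, not taken from the disclosure. It checks that four rows of selectively enabled adders yield d0 to d15 as running sums of s0 to s15.

```python
from itertools import accumulate


def prefix_sum_tree(inputs):
    """Model rows of adders in which row r routes the output below
    s(k * 2**r - 1) to the next 2**r columns (s1 -> s2, s3; s3 -> s4..s7; ...)."""
    values = list(inputs)
    row = 0
    while (1 << row) < len(values):
        next_values = list(values)  # a disabled adder passes its input through
        for column in range(len(values)):
            if (column >> row) & 1:  # this adder is enabled in this row
                source = ((column >> row) << row) - 1
                next_values[column] = values[column] + values[source]
        values = next_values
        row += 1
    return values


s = list(range(1, 17))          # hypothetical values for s0..s15
d = prefix_sum_tree(s)          # d0..d15 after four rows
expected = list(accumulate(s))  # reference running sums
```

For 16 inputs the loop runs four times, one pass per row 512 to 518, and `d` equals `expected`.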
During operation, the reduction tree 500 may be used to execute a received accumulating vector arithmetic
reduction instruction. During execution of the accumulating vector arithmetic reduction instruction, the
reduction tree 500 may receive the multiple input elements 502 from the input vector 122. During
execution of the instruction, multiple adders of the multiple adders 504 may be selectively enabled to
provide (for example, generate) the multiple output elements d0 to d15, and the multiple output elements
d0 to d15 may be stored in the output vector 120.
Referring to Fig. 6, a block diagram of a fourth illustrative embodiment of a reduction tree 600 is
disclosed. The reduction tree 600 may be used during execution of an accumulating vector arithmetic
reduction instruction (for example, the accumulating vector arithmetic reduction instruction 101 of Fig. 1
or the vector instruction 220 of Fig. 2). The reduction tree 600 may include the reduction tree 206 of
Fig. 2, the reduction tree 300 of Fig. 3, the reduction tree 400 of Fig. 4, the reduction tree 500 of
Fig. 5, or a combination thereof. The reduction tree 600 may be configured, based on the accumulating
vector arithmetic reduction instruction, to receive multiple input elements from the input vector 122 and
to generate multiple output elements of an output vector 610. Although Fig. 6 illustrates multiple
adders, the reduction tree 600 may include multiple other arithmetic operation units.
The reduction tree 600 may receive multiple input elements from the input vector 122, including the first
input element 302 and the second input element 304. The first input element 302 may include input
sub-elements s0 to s3, and the second input element 304 may include input sub-elements s4 to s7. The
input elements and input sub-elements may have sizes indicated by the accumulating vector arithmetic
reduction instruction. For example, the input elements 302 and 304 may be 64 bits, and the input
sub-elements s0 to s7 may be 16 bits. The output vector 610 may include the first output element 306 and
a second output element 608. The first output element 306 may include output sub-elements d0 to d3, and
the second output element 608 may include output sub-elements d4 to d7. The output elements and output
sub-elements may have sizes indicated by the accumulating vector arithmetic reduction instruction. For
example, the output elements 306 and 608 may be 64 bits, and the output sub-elements d0 to d7 may be 16
bits. Although depicted as including two elements, the input vector 122 and the output vector 610 may
include any number of elements (for example, any number of sub-elements) and may have sizes other than
64 bits.
The reduction tree 600 may include multiple adders configured to be selectively enabled, based on the
accumulating vector arithmetic reduction instruction, to generate the output vector 610, including the
first adder 320, the second adder 321, a third adder 618 and a fourth adder 619. The multiple adders may
be arranged in multiple rows, including the first row 312, a second row 614 and a third row 616. Each
adder of the multiple adders may include multiple sub-adders. For example, each adder of the multiple
adders may be a 64-bit adder and may include four sixteen-bit sub-adders. One or more adders of the
multiple adders may be selectively enabled based on the accumulating vector arithmetic reduction
instruction. For example, the first adder 320 (for example, the sub-adders 322 to 328) may be selectively
enabled as described with reference to Fig. 3.
The third adder 618 in the second row 614 may include a fifth sub-adder 625 configured to add the output
of the first sub-adder 322 to the output of the third sub-adder 326. The third adder 618 may also include
a sixth sub-adder 627 configured to add the output of the second sub-adder 324 to the output of the
fourth sub-adder 328. By adding the sub-adder outputs, the third adder 618 may apply arithmetic reduction
to the outputs of the sub-adders 322, 324, 326 and 328 to generate two reduced outputs of the sub-adders
625 and 627. Similarly, the fourth adder 619 of the third row 616 may use a seventh sub-adder 629 to
apply arithmetic reduction to the outputs of the sub-adders 625 and 627 to generate a further reduced
value. Therefore, the second output element 608 may include a sixteen-bit reduced value and other partial
values based on the multiple input sub-elements s0 to s7. For example, the output sub-element d4 may be
equal to the sum of the input sub-element s0 and the input sub-element s4, the output sub-element d5 may
be equal to the sum of the input sub-element s1 and the input sub-element s5, the output sub-element d6
may be equal to the sum of the input sub-elements s0, s2, s4 and s6, and the output sub-element d7 may be
equal to the sum of the input sub-elements s0 to s7.
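The sub-adder chain of Fig. 6 can be traced with a short sketch. It assumes, as the stated values of d4 and d5 imply, that the first-row sub-adders 322 to 328 each add a pair of matching sub-elements (s0+s4, s1+s5, s2+s6, s3+s7); that pairing, the function name `second_output_element` and the variable names are inferences made for illustration only.

```python
def second_output_element(s):
    """Given input sub-elements s0..s7, return [d4, d5, d6, d7]."""
    # Row 312: sub-adders 322, 324, 326 and 328 (pairing inferred, see lead-in).
    out_322, out_324, out_326, out_328 = (s[i] + s[i + 4] for i in range(4))
    # Row 614: sub-adders 625 and 627 of the third adder 618.
    out_625 = out_322 + out_326   # s0 + s2 + s4 + s6
    out_627 = out_324 + out_328   # s1 + s3 + s5 + s7
    # Row 616: sub-adder 629 of the fourth adder 619.
    out_629 = out_625 + out_627   # s0 + ... + s7
    return [out_322, out_324, out_625, out_629]


d4, d5, d6, d7 = second_output_element([1, 2, 3, 4, 5, 6, 7, 8])
```

In this model d7 is the full sixteen-bit reduction of s0 to s7, while d4 to d6 are the partial values left over from the earlier rows.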
During operation, the reduction tree 600 may be used to execute an accumulating vector arithmetic
reduction instruction. During execution of the accumulating vector arithmetic reduction instruction, one
or more adders of the multiple adders may be selectively enabled based on the instruction to generate the
multiple output elements 306 and 608 (for example, the multiple output sub-elements d0 to d7) to be
stored in the output vector 610.
Referring to Fig. 7, a block diagram of an illustrative embodiment of a portion of a reduction tree 700
is disclosed. The portion of the reduction tree 700 may be a portion of the reduction tree 206 of Fig. 2,
the reduction tree 300 of Fig. 3, the reduction tree 400 of Fig. 4, the reduction tree 500 of Fig. 5 or
the reduction tree 600 of Fig. 6. The portion of the reduction tree 700 may be used during execution of a
vector instruction (for example, the accumulating vector arithmetic reduction instruction 101 of Fig. 1,
the vector instruction 220 of Fig. 2, the segmented vector arithmetic reduction instruction 901 described
with reference to Fig. 9, or the rotated segmented vector arithmetic reduction instruction 1001 described
with reference to Fig. 10). The portion of the reduction tree 700 may be configured, based on the vector
instruction, to receive a first input element 702 (s0) from an input vector and to generate a first
output element 706 (d0) to be stored in an output vector.
The portion of the reduction tree 700 may include a first multiplexer 720 that is coupled to a first
adder 712 and configured to receive the first input element 702 (s0) as a first mux input and to receive
a zero input (for example, an input having a value equal to logical zero) as a second mux input. Although
the first adder 712 is illustrated, in other embodiments the portion of the reduction tree 700 may
include a different arithmetic operation unit (for example, a subtraction unit). The first multiplexer
720 may be configured to receive a first control signal 744 from control logic (for example, the control
logic 210 of Fig. 2). The first multiplexer 720 may be configured to select between the first mux input
and the second mux input based on the first control signal 744 and to provide the mux output as a first
adder input 732 of the first adder 712. For example, when the first control signal 744 has a particular
value, the first multiplexer 720 may provide the first input element 702 to the first adder 712 as the
first adder input 732. When the first control signal 744 has a different value, the first multiplexer 720
may provide the zero input to the first adder 712 as the first adder input 732. Therefore, the control
logic may be configured (for example, by setting the first control signal 744) to enable a subset of the
multiple adders to receive the zero input (for example, a value equal to logical zero) based on the
vector instruction.
The portion of the reduction tree 700 may include a first saturation logic circuit 730 that is coupled to
the first adder 712 and configured to saturate the output of the first adder 712. Saturating the output
of the first adder 712 may prevent the output from exceeding a maximum value or falling below a minimum
value. The first saturation logic circuit 730 may be configured to output a saturated output (for
example, a value) based on the output of the first adder 712. For example, when the output of the first
adder 712 is between the minimum value and the maximum value, the saturated output may have a value equal
to the output of the first adder 712. When the output of the first adder 712 exceeds the maximum value,
the saturated output may have the maximum value, and when the output of the first adder 712 is less than
the minimum value, the saturated output may have the minimum value.
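The saturation behavior just described amounts to clamping the adder output to a range. A minimal sketch follows, assuming a signed two's-complement range for a sixteen-bit sub-element; the text does not state the numeric format, so the `bits` parameter and the resulting limits are assumptions, as is the function name `saturate`.

```python
def saturate(adder_output, bits=16):
    """Clamp an adder output to the assumed signed range of `bits` bits."""
    minimum = -(1 << (bits - 1))     # -32768 for 16 bits
    maximum = (1 << (bits - 1)) - 1  # 32767 for 16 bits
    if adder_output > maximum:
        return maximum
    if adder_output < minimum:
        return minimum
    return adder_output              # in range: pass through unchanged
```

For example, `saturate(30000 + 10000)` returns the maximum value rather than a wrapped-around result.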
The portion of the reduction tree 700 may include a second multiplexer 724 coupled to the first
saturation logic circuit 730. The second multiplexer 724 may be configured to receive the saturated
output of the first saturation logic circuit 730 as a third mux input and to receive the output of the
first multiplexer 720 as a fourth mux input. The second multiplexer 724 may be configured to select
between the third mux input and the fourth mux input based on a second control signal 746 and to provide
the mux output as the first output element 706 to be stored in the output vector. When the second control
signal 746 has a particular value, the second multiplexer 724 may skip the first adder 712 (for example,
by providing the fourth mux input as the mux output). When the first adder 712 is not skipped, the first
adder 712 adds the first adder input 732 to a second adder input 734. The second adder input 734 may be a
value received from the output of another adder, zero, or some other value. By selecting the fourth mux
input, the second multiplexer 724 may skip the add operation that uses the first adder input 732 and the
second adder input 734, and may provide the output of the first multiplexer 720 as the mux output.
Therefore, the control logic may be configured to skip the first adder 712 based on the vector
instruction. In an alternative embodiment, the first adder 712 may be skipped by disabling a clock input
(not shown).
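The data path through the first multiplexer 720, the first adder 712, the first saturation logic circuit 730 and the second multiplexer 724 can be modeled as a single function. This is a behavioral sketch only: the boolean parameters `select_zero` and `skip_adder` stand in for the control signals 744 and 746, whose actual encodings are not given, and the saturation limits are assumed sixteen-bit signed values.

```python
def adder_stage(input_element, second_adder_input, select_zero, skip_adder,
                minimum=-32768, maximum=32767):
    """Model one multiplexer/adder/saturation/multiplexer stage of Fig. 7."""
    # First multiplexer 720: choose the input element or the zero input.
    first_adder_input = 0 if select_zero else input_element
    # First adder 712: add the two adder inputs 732 and 734.
    total = first_adder_input + second_adder_input
    # First saturation logic circuit 730: clamp to the assumed range.
    saturated = max(minimum, min(maximum, total))
    # Second multiplexer 724: output the saturated sum, or skip the adder
    # and forward the output of the first multiplexer.
    return first_adder_input if skip_adder else saturated


summed = adder_stage(5, 7, select_zero=False, skip_adder=False)
passed = adder_stage(5, 7, select_zero=False, skip_adder=True)
```

Setting `select_zero` makes the stage forward `second_adder_input` unchanged, which is one way a disabled path can be modeled.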
Although only one input element is shown, the portion of the reduction tree 700 may operate on any number
of input elements. For example, the portion of the reduction tree 700 may include additional circuitry
(for example, multiplexers, adders, saturation logic circuits and connections) to operate on an input
vector having more than one input element. For example, the portion of the reduction tree 700 may include
additional rows of adders, where each additional adder includes a corresponding first multiplexer,
saturation logic circuit and third multiplexer. The additional circuitry and adders may be controlled by
additional control signals from the control logic. Therefore, the portion of the reduction tree 700 may
be included in each of the reduction trees 300 to 600 of Figs. 3 to 6.
During execution of a vector instruction, the portion of the reduction tree 700 may be configured to
receive the first input element 702 and to generate the first output element 706 to be stored in the
output vector. The first multiplexer 720 may provide the zero input to the first adder 712 based on the
first control signal 744. The first saturation logic circuit 730 may saturate the output of the first
adder 712. The second multiplexer 724 may skip the first adder 712 based on the second control signal 746.
Referring to Fig. 8, a block diagram of a fifth illustrative embodiment of a reduction tree 800 is
disclosed. The reduction tree 800 may include the reduction tree 206 of Fig. 2, one or more of the
reduction trees 300 to 600 of Figs. 3 to 6 (as further described herein), the portion of the reduction
tree 700 of Fig. 7, or any combination thereof. The reduction tree 800 may be used during execution of a
segmented vector arithmetic reduction instruction (for example, the segmented vector arithmetic reduction
instruction 901 described with reference to Fig. 9, or the rotated segmented vector arithmetic reduction
instruction 1001 described with reference to Fig. 10). The reduction tree 800 may be selectively
configured, based on a section grouping size included in the segmented vector arithmetic reduction
instruction, to enable execution of the vector instruction. The section grouping size may be associated
with the size of one or more groups of multiple input elements 802. For example, executing the segmented
vector arithmetic reduction instruction may include grouping the multiple input elements 802 into one or
more groups having the section grouping size and then performing one or more segmented vector arithmetic
reduction operations on the one or more groups. The reduction tree 800 may be configured to enable
execution of multiple segmented vector arithmetic reduction instructions that each have a different
section grouping size. For example, the reduction tree 800 may be configured to enable execution of a
first segmented vector arithmetic reduction instruction having a section grouping size of two and of a
second segmented vector arithmetic reduction instruction having a section grouping size of four. Although
section grouping sizes of two and four are described, the reduction tree 800 may support other section
grouping sizes.
The reduction tree 800 may include the multiple input elements 802 (for example, input elements s0 to
s15), multiple adders 804, and multiple outputs configured to output multiple output elements 806 (d0 to
d15) (for example, the outputs of the adders of the bottom row). Although Fig. 8 illustrates multiple
adders 804, in other embodiments the reduction tree 800 may include multiple other arithmetic operation
units. A processor (for example, the processor 210 of Fig. 2) may be configured to use the reduction tree
800 during execution of a first segmented vector arithmetic reduction instruction that includes a first
section grouping size and during execution of a second segmented vector arithmetic reduction instruction
that includes a second section grouping size. The reduction tree 800 may be configured to generate the
multiple output elements 806 (d0 to d15) simultaneously. For example, the multiple output elements 806
(d0 to d15) may be generated during a single processor execution cycle associated with execution of the
first segmented vector arithmetic reduction instruction.
The reduction tree 800 may be configured to receive the multiple input elements 802 (s0 to s15) from an
input vector 822. The reduction tree 800 may be configured to generate the multiple output elements 806
(d0 to d15) to be stored in an output vector 820. The multiple input elements 802 (s0 to s15) may be
arranged in a sequential order (such as "s0, s1, s2 ... s15"), where s0 is the first sequential element
and s15 is the last sequential element according to the sequential order. The multiple output elements
806 (d0 to d15) may be arranged in a similar sequential order (such as "d0, d1, d2 ... d15"), where d0 is
the first sequential element and d15 is the last sequential element.
The reduction tree 800 may have the same number of input elements as output elements, and each input
element may have the same size as each output element. For example, the input vector 822 may include
16 64-bit input elements, and the output vector 820 may include 16 64-bit output elements. Although not
shown, each input element may include multiple sixteen-bit input sub-elements, and each output element
may include multiple sixteen-bit output sub-elements, as described with reference to Figs. 3 to 4. The
multiple input elements and the multiple output elements may represent real values, imaginary values or
a combination thereof. In a particular embodiment, when the input type is complex, each input element of
the multiple input elements may include a corresponding real part and a corresponding imaginary part.
Each output element may be generated by performing, in an interleaved manner, a first arithmetic
operation on one or more real parts and a second arithmetic operation on one or more imaginary parts, as
described with reference to Figs. 3 to 4.
Although 64-bit elements and sixteen-bit sub-elements are described, each input element and each output
element may have a size other than 64 bits, and each input sub-element and each output sub-element may
have a size other than 16 bits.
The multiple adders 804 may be arranged in multiple rows of adders, as shown. The multiple adders 804 may
include (for example, be arranged in) a first row 812, a second row 814, a third row 816 and a fourth row
818. Although four rows of adders are illustrated, the reduction tree 800 may alternatively include (for
example, be arranged in) fewer than four rows or more than four rows, for example based on the number of
input elements and the number of output elements. Each adder of the multiple adders 804 may have the same
size. For example, each adder of the multiple adders 804 may be a 64-bit adder. Although not shown, each
adder of the multiple adders 804 may include multiple sub-adders and may be configured to perform add
operations on sub-elements in an interleaved manner, as described with reference to Figs. 3 to 4.
Outputs of one or more adders of one or more rows of adders may be selectively routed via multiple paths
830 to 844 (shown as dashed paths in Fig. 8) so that the reduction tree 800 can generate the multiple
output elements 806 (d0 to d15). For example, a first value generated by a first adder 850 may be
provided to a second adder 852 via a first path 830, a second value generated by the second adder 852 may
be provided to a third adder 854 via a second path 840, and a third value generated by the third adder
854 may be provided to a fourth adder 856 via a third path 844. Other values may similarly be provided
between one or more adders via paths 832 to 836 and 842. Each path of the multiple paths 830 to 844 may
be selectively enabled based on the section grouping size of the segmented vector arithmetic reduction
instruction. For example, based on the segmented vector arithmetic reduction instruction (for example,
based on the section grouping size), the first path 830 may be enabled by selecting the first value
generated by the first adder 850 as an adder input to the second adder 852, and may be disabled by
selecting a zero input as the adder input to the second adder 852. One or more adders of the multiple
adders 804 may have a corresponding multiplexer (not shown) configured to select an adder input, such as
the first multiplexer 720 described with reference to Fig. 7, which selects the adder input from the zero
input and the value provided by the corresponding path. The corresponding multiplexer may, based on a
control signal, enable the corresponding path (for example, by selecting the input provided by the
corresponding path) or disable the corresponding path (for example, by selecting the zero input), as
described with reference to Fig. 7.
Processor may include the section packets size for being configured to instruct based on segmented vector arithmetic reduction, selectively
Configure the control logic (for example, control logic 210 of Fig. 2) of reduction tree 800. Selectively
configuring reduction tree 800 may include selectively enabling one or more adders based on the section
packets size (as illustrated by the one or more unshaded adders in Fig. 8) and selecting the corresponding
adder inputs. For example, the control logic may be configured, during execution of a first segmented
vector arithmetic reduction instruction, to selectively enable a first subset of the multiple adders 804
based on a first section packets size and to select a corresponding first subset of adder inputs (for
example, reduction tree 800 may be placed in a first configuration), and, during execution of a second
segmented vector arithmetic reduction instruction, to selectively enable a second subset of the multiple
adders 804 based on a second section packets size and to select a corresponding second subset of adder
inputs (for example, reduction tree 800 may be placed in a second configuration). A specific configuration
of reduction tree 800 may be associated with a specific subset of enabled adders and a specific subset of
selected adder inputs. The control logic may use one or more control signals to selectively enable the
specific subset of the multiple adders 804 and to select the corresponding subset of adder inputs (for
example, by selectively enabling a specific subset of the multiple paths 830 to 844), as described with
reference to Fig. 7. For example, when the section packets size is two, each of the multiple paths 830 to
844 may be disabled (for example, a zero input may be selected for each adder associated with each of the
multiple paths 830 to 844), and only the unshaded adders in the first row 812 may be enabled. When the
section packets size is four, only the first subset (830 to 836) of the paths and the unshaded adders in
rows 812 to 814 may be enabled. When the section packets size is eight, only the second subset (830 to 842)
of the paths and the unshaded adders in rows 812 to 816 may be enabled. When the section packets size is
16, all of the multiple paths 830 to 844 and all of the unshaded adders in rows 812 to 818 may be enabled.
Therefore, the control logic may be configured to selectively enable a subset of the adders and a subset of
the paths (for example, by selecting a subset of the corresponding adder inputs) based on the section
packets size.
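As a rough illustration of the row selection described above, the following Python sketch maps a section
packets size to the adder rows that would be enabled. It assumes that the four adder rows of Fig. 8 carry
the reference numerals 812, 814, 816 and 818; the name `enabled_rows` is hypothetical, and the sketch does
not model the paths 830 to 844 or the actual control signals.

```python
# Assumed reference numerals for the four adder rows of Fig. 8
# (the text names a first row 812 and a fourth row 818).
ROWS = (812, 814, 816, 818)


def enabled_rows(group_size):
    """Return the adder rows assumed to be enabled for a section packets size."""
    if group_size not in (2, 4, 8, 16):
        raise ValueError("group size must be 2, 4, 8 or 16 for a 16-input tree")
    levels = group_size.bit_length() - 1  # log2 of a power of two
    return ROWS[:levels]
```

Under these assumptions a size of two enables only row 812, and a size of 16 enables all four rows.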
By selectively enabling one or more adders of the multiple adders 804 and selecting one or more
corresponding adder inputs, reduction tree 800 may be configured to generate multiple output elements 806
(d0 to d15) concurrently, based on multiple input elements 802 (s0 to s15) and on the section packets size
included in a segmented vector arithmetic reduction instruction (for example, the first segmented vector
arithmetic reduction instruction or the second segmented vector arithmetic reduction instruction). For
example, when the section packets size is two, reduction tree 800 may generate (for example, provide) a
first output element d1 equal to s0+s1, a second output element d3 equal to s2+s3, a third output element
d5 equal to s4+s5, a fourth output element d7 equal to s6+s7, a fifth output element d9 equal to s8+s9, a
sixth output element d11 equal to s10+s11, a seventh output element d13 equal to s12+s13, and an eighth
output element d15 equal to s14+s15. When the section packets size is four, reduction tree 800 may generate
the second output element d3 equal to s0+s1+s2+s3, the fourth output element d7 equal to s4+s5+s6+s7, the
sixth output element d11 equal to s8+s9+s10+s11, and the eighth output element d15 equal to
s12+s13+s14+s15. When the section packets size is eight, reduction tree 800 may generate the fourth output
element d7 equal to s0+s1+s2+s3+s4+s5+s6+s7 and the eighth output element d15 equal to
s8+s9+s10+s11+s12+s13+s14+s15. When the section packets size is 16, reduction tree 800 may generate the
eighth output element d15 equal to the sum of the input elements s0 to s15. Therefore, based on the section
packets size, reduction tree 800 may be configured to selectively enable one or more adders of the rows 812
to 818 and to select one or more corresponding adder inputs, so as to generate the multiple output elements
806 concurrently.
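The behaviour above can be modelled in software. The following Python sketch is a simplified model, not
the wiring of Fig. 8: each pass of the outer loop stands in for one enabled adder row, and the function
name `reduce_tree` is hypothetical. It assumes that the section packets size is a power of two and divides
the number of inputs, and that each group result appears at the last position of its group (d1, d3, ... for
size two; d3, d7, ... for size four), as in the examples above.

```python
def reduce_tree(inputs, group_size):
    """Model a configurable reduction tree; return {output index: group sum}."""
    n = len(inputs)
    if group_size < 2 or group_size & (group_size - 1) or n % group_size:
        raise ValueError("group size must be a power of two that divides len(inputs)")
    vals = list(inputs)
    span = 1
    while span < group_size:
        # One adder row: each enabled adder adds the partial sum located
        # `span` positions earlier to its own partial sum.
        for i in range(2 * span - 1, n, 2 * span):
            vals[i] += vals[i - span]
        span *= 2
    # Only the last position of each group holds a complete group sum.
    return {i: vals[i] for i in range(group_size - 1, n, group_size)}


s = list(range(16))        # stand-in values for s0..s15
print(reduce_tree(s, 4))   # {3: 6, 7: 22, 11: 38, 15: 54}
```

A size of two uses one pass and a size of 16 uses four, which mirrors the enabling of rows 812 to 818.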
During operation, reduction tree 800 may be used to execute a segmented vector arithmetic reduction
instruction. During execution of the segmented vector arithmetic reduction instruction, reduction tree 800
may receive multiple input elements 802 (s0 to s15) from input vector 822. For example, during execution
of the first segmented vector arithmetic reduction instruction, the multiple input elements 802 (s0 to s15)
may be grouped into one or more first groups having the first section packets size, and during execution of
the second segmented vector arithmetic reduction instruction, the multiple input elements may be grouped
into one or more second groups having the second section packets size. During execution of the segmented
vector arithmetic reduction instruction, one or more adders of the multiple adders 804 may be selectively
enabled to generate the multiple output elements 806 (d0 to d15) using multiple outputs (for example, the
outputs of the adders of the fourth row 818), and the multiple output elements 806 (d0 to d15) may be
stored in output vector 820.
Reduction tree 800 enables a single reduction tree to execute both the first segmented vector arithmetic
reduction instruction having the first section packets size and the second segmented vector arithmetic
reduction instruction having the second section packets size. Compared to a processor that includes
multiple reduction trees for use during execution of multiple instructions having different section packets
sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Fig. 9, a diagram of a particular illustrative process of executing a vector instruction is
disclosed and generally designated 900. The vector instruction may include a segmented vector arithmetic
reduction instruction, such as an illustrative segmented vector arithmetic reduction instruction 901.
Segmented vector arithmetic reduction instruction 901 may be executed at a processor (for example,
processor 202 of Fig. 2) that includes a reduction tree, such as one or more of the reduction tree 206 of
Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a part of the reduction tree 700 of Fig. 7, the
reduction tree 800 of Fig. 8, or any combination thereof. The processor may receive an input vector that
includes multiple input elements 902 stored in an input register 910. The processor may process the
multiple input elements 902 and concurrently generate multiple output elements 924 (for example, the
content) of an output register 920.
The multiple output elements 924 may be based on segmented vector arithmetic reduction instruction 901. For
example, executing segmented vector arithmetic reduction instruction 901 may generate a specific output
element by adding a specific input element of the multiple input elements 902 to one or more other input
elements of the multiple input elements 902, based on the section packets size of segmented vector
arithmetic reduction instruction 901.
Input register 910 may include the multiple input elements 902. For example, the multiple input elements
902 (for example, the input vector) may include N elements, where N is an integer greater than one. The
multiple input elements 902 may include input elements s0 to s(N-1). The multiple input elements 902 may be
stored in a sequential order (such as "s0, s1, s2 ... s(N-1)"), where s0 is the first sequential input
element and s(N-1) is the last sequential input element. Although five input elements are shown, the number
of input elements 902 (for example, N) may be more or fewer than five.
Before segmented vector arithmetic reduction instruction 901 is executed, output register 920 may include
multiple foregoing elements 922. The multiple foregoing elements 922 may include foregoing elements d0 to
d(N-1). The multiple foregoing elements 922 may be included in another vector (for example, the rotating
vector 280 of Fig. 2) or in a different vector. The multiple foregoing elements 922 may be stored at a
position identified by segmented vector arithmetic reduction instruction 901, such as another register or a
location in memory. The multiple foregoing elements may be included in segmented vector arithmetic
reduction instruction 901, or may be indicated by a value (for example, a pointer) stored in a field or
parameter of segmented vector arithmetic reduction instruction 901. Before the segmented vector arithmetic
reduction instruction is executed, the multiple foregoing elements 922 may be stored in a sequential order.
For example, the multiple foregoing elements 922 may be stored in the specific sequential order "d0, d1,
d2, d3 ... d(N-1)" (for example, d0 is the first sequential foregoing element and d(N-1) is the last
sequential foregoing element).
Process 900 illustrates execution of segmented vector arithmetic reduction instruction 901 with an
illustrative section packets size of two. Executing the segmented vector arithmetic reduction instruction
may include grouping the multiple input elements 902 into multiple groups, such as a first set of input
elements 904 and a second set of input elements 906. A first arithmetic (for example, addition) operation
may be performed on the first set of input elements 904 to generate a first result equal to s0+s1, and a
second arithmetic (for example, addition) operation may be performed on the second set of input elements
906 to generate a second result equal to s2+s3. The first result (s0+s1) may be inserted into a first
output element 916 of output register 920, and the second result (s2+s3) may be inserted into a second
output element 918 of output register 920. When the number of generated results is less than the number of
output elements in output register 920, one or more of the multiple foregoing elements 922 may remain in
output register 920 (for example, may not be overwritten). For example, when the first output element 916
and the second output element 918 are inserted into output register 920, the multiple output elements 924
may include the foregoing elements d0 and d2. When the section packets size of segmented vector arithmetic
reduction instruction 901 is a different size, the multiple input elements 902 may be grouped into
different sets of input elements, and different results may be generated.
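The example above, in which results replace only some register positions while the foregoing elements d0
and d2 remain, can be sketched in Python as follows. The function name `segmented_reduce_into` is
hypothetical, and the sketch assumes that each result is written to the last position of its group, which
matches the described outcome for a section packets size of two.

```python
def segmented_reduce_into(inputs, foregoing, group_size):
    """Sum each group of inputs and write the result over the output register.

    Positions that receive no result keep their foregoing content.
    """
    if len(inputs) != len(foregoing) or len(inputs) % group_size:
        raise ValueError("lengths must match and be a multiple of group_size")
    out = list(foregoing)
    for start in range(0, len(inputs), group_size):
        # Assumption: the group result lands on the last position of the group.
        out[start + group_size - 1] = sum(inputs[start:start + group_size])
    return out


print(segmented_reduce_into([1, 2, 3, 4], ["d0", "d1", "d2", "d3"], 2))
# ['d0', 3, 'd2', 7]
```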
As illustrated in Fig. 9, segmented vector arithmetic reduction instruction 901 may include an instruction
name 980 (for example, an opcode), depicted as the name vraddw. Segmented vector arithmetic reduction
instruction 901 may also include a first field 982 (Vu), a second field 984 (Vd), a third field 986 (Q), a
fourth field 988 (Op), a fifth field 990 (s2), a sixth field 992 (sc32), and a seventh field 994 (sat). A
first value stored in the first field 982 may indicate an input vector, such as the input vector stored in
input register 910. In an alternative embodiment, the first value stored in the first field 982 may
indicate a pair of input vectors (for example, a vector Vu and an additional vector Vv), where the first
vector of the pair (for example, Vu) is associated with real values and the second vector of the pair (for
example, Vv) is associated with imaginary values. A second value stored in the second field 984 may
indicate the output vector, stored in output register 920, that is used during execution of segmented
vector arithmetic reduction instruction 901. A third value stored in the third field 986 may indicate a
mask (for example, a mask Q), as described with reference to Figs. 11A to 11B; a fourth value stored in the
fourth field 988 may indicate an operation vector (for example, an operation vector Op); a fifth value
stored in the fifth field 990 may indicate the section packets size (for example, "s2" may designate a
section packets size of two); a sixth value stored in the sixth field 992 may indicate the type of the
input values (for example, "sc32" may indicate a 32-bit complex input type); and a seventh value stored in
the seventh field 994 may indicate whether saturation is to be performed during execution of the segmented
vector arithmetic reduction instruction. Although seven fields are described, the segmented vector
arithmetic reduction instruction may include more fields or fewer fields.
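For orientation, the fields listed above can be pictured as a simple record. The following Python sketch is
purely illustrative: the class name, the attribute names, the register names "V0" and "V1", and the rule
that a size field of the form "s<number>" designates a section packets size of that number are all
assumptions that generalize the single example ("s2" meaning two) given in the text. It does not describe
an actual instruction encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SegmentedReduceInstruction:
    """Hypothetical record mirroring fields 980 to 994 of Fig. 9."""
    name: str                 # instruction name 980, e.g. "vraddw"
    vu: str                   # first field 982: input vector
    vd: str                   # second field 984: output vector
    q: Optional[str] = None   # third field 986: mask
    op: Optional[str] = None  # fourth field 988: operation vector
    size: str = "s2"          # fifth field 990: section packets size
    dtype: str = "sc32"       # sixth field 992: input value type
    sat: bool = False         # seventh field 994: saturation


def section_packets_size(instr):
    """Decode the size field, assuming the form 's<number>'."""
    field = instr.size
    if not field.startswith("s") or not field[1:].isdigit():
        raise ValueError(f"unrecognized size field: {field!r}")
    return int(field[1:])


instr = SegmentedReduceInstruction(name="vraddw", vu="V0", vd="V1")
print(section_packets_size(instr))  # 2
```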
Although addition operations have been described, segmented vector arithmetic reduction instruction 901 is
not limited to performing only addition operations. For example, segmented vector arithmetic reduction
instruction 901 may indicate one or more arithmetic operations to be performed on the multiple input
elements 902. The one or more arithmetic operations may include addition operations and subtraction
operations. The one or more arithmetic operations may be indicated by a value in a specific field (for
example, a specific parameter), such as the fourth field 988. For example, the fourth field 988 may include
a pointer to a location in memory that stores an operation vector (for example, a vector indicating the one
or more arithmetic operations) or to a register that stores the operation vector. Each element of the
operation vector may indicate a specific operation (for example, an addition operation or a subtraction
operation) to be performed on the corresponding element of the multiple input elements 902 during execution
of segmented vector arithmetic reduction instruction 901. For example, executing the segmented vector
arithmetic reduction instruction may include grouping the multiple input elements 902 into one or more
input groups based on the section packets size, and performing the one or more arithmetic operations on the
one or more input groups to generate the multiple output elements 924. When at least one of the one or more
arithmetic operations is a subtraction operation, one or more of the multiple input elements 902 may be
complemented before the multiple output elements 924 are generated.
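A minimal Python sketch of a per-element operation vector follows. It assumes that the operation vector is
encoded as a list of "+" and "-" characters and uses arithmetic negation as a stand-in for complementing an
input element before the addition; the function name `reduce_with_ops` and this encoding are hypothetical.

```python
def reduce_with_ops(inputs, ops, group_size):
    """Apply a per-element add/subtract operation vector, then sum each group."""
    if len(inputs) != len(ops) or len(inputs) % group_size:
        raise ValueError("lengths must match and be a multiple of group_size")
    if any(op not in "+-" or len(op) != 1 for op in ops):
        raise ValueError("each operation must be '+' or '-'")
    # Subtraction is modelled by negating the element first, so that the
    # reduction itself only ever adds.
    signed = [x if op == "+" else -x for x, op in zip(inputs, ops)]
    return [sum(signed[i:i + group_size]) for i in range(0, len(signed), group_size)]


print(reduce_with_ops([5, 3, 10, 4], ["+", "-", "+", "+"], 2))  # [2, 14]
```

Negating first keeps the adders unchanged, which is one plausible reading of why the inputs are
complemented before the output elements are generated.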
During operation, the processor may receive segmented vector arithmetic reduction instruction 901. The
processor may execute segmented vector arithmetic reduction instruction 901 using the multiple input
elements 902 to generate the multiple output elements 924 and store them in output register 920. The
multiple output elements 924 may represent results based on grouping the multiple input elements 902 into
one or more groups of input elements, where the grouping is based on the section packets size of segmented
vector arithmetic reduction instruction 901.
Because the multiple output elements 924 are generated based on the section packets size of segmented
vector arithmetic reduction instruction 901, a single reduction tree may be used to execute multiple
segmented vector arithmetic reduction instructions having different section packets sizes. Compared to a
processor that includes multiple reduction trees for use during execution of multiple instructions having
different section packets sizes, using a single reduction tree may reduce device size and power
consumption.
Referring to Fig. 10, a diagram of a particular illustrative process of executing a rotation segmented
vector arithmetic reduction instruction is disclosed and generally designated 1000. The rotation segmented
vector arithmetic reduction instruction may be a single vector instruction, such as an illustrative
rotation segmented vector arithmetic reduction instruction 1001. Rotation segmented vector arithmetic
reduction instruction 1001 may be executed at a processor (for example, processor 202 of Fig. 2) that
includes a reduction tree, such as one or more of the reduction tree 206 of Fig. 2, the reduction trees 300
to 600 of Figs. 3 to 6, a part of the reduction tree 700 of Fig. 7, the reduction tree 800 of Fig. 8, or
any combination thereof. The processor may receive an input vector that includes the multiple input
elements 902 stored in input register 910. The processor may process the multiple input elements 902 and
concurrently generate multiple output elements 1024 (for example, the content) of output register 920.
Rotation segmented vector arithmetic reduction instruction 1001 may include an instruction name 1080 (for
example, an opcode), depicted as the name vraddw. Rotation segmented vector arithmetic reduction
instruction 1001 may also include a first field 1082 (Vu), a second field 1084 (Vd), a third field 1086
(Q), a fourth field 1088 (Op), a fifth field 1090 (s2), a sixth field 1092 (sc32), a seventh field 1094
(sat), and an eighth field 1096 (rot). Although eight fields are illustrated, rotation segmented vector
arithmetic reduction instruction 1001 may include more fields or fewer fields. Fields 1082 to 1094 may
correspond to the fields of the segmented vector arithmetic reduction instruction 901 of Fig. 9. A value
stored in the eighth field 1096 may indicate whether a rotation is to be performed. For example, the value
stored in the eighth field 1096 may indicate the direction and the size of the rotation to be performed.
The rotation may have a rotation amount equal to the size of one input element (for example, 64 bits) and
may be to the left. In other embodiments, the value stored in the eighth field 1096 may indicate other
sizes and directions of rotation. As another example, the value stored in the eighth field 1096 may
indicate no rotation (for example, rotation segmented vector arithmetic reduction instruction 1001 may
operate similarly to the segmented vector arithmetic reduction instruction 901 of Fig. 9). In a particular
embodiment, a value stored in a ninth field (not shown) may indicate whether the multiple foregoing
elements 922 (for example, the content) in output register 920 are to be overwritten (for example, set
equal to zero) before the results of the arithmetic operations are stored in output register 920. In an
alternative embodiment, a value stored in a different field (for example, the eighth field 1096) may
indicate whether the multiple foregoing elements 922 in output register 920 are to be overwritten.
Execution of rotation segmented vector arithmetic reduction instruction 1001 may proceed in the same way as
execution of segmented vector arithmetic reduction instruction 901, with an added rotation step. For
example, execution of rotation segmented vector arithmetic reduction instruction 1001 may include
determining, before the results of the arithmetic operations are generated, whether to rotate the multiple
foregoing elements 922 in output register 920. In response to a first determination to rotate the multiple
foregoing elements 922 (for example, based on the value stored in the eighth field 1096), the multiple
foregoing elements 922 (for example, the content) in output register 920 may be rotated by the rotation
amount indicated by the eighth field 1096. For example, when the rotation amount is 64 bits and the
direction is to the right, the multiple foregoing elements 922 may be rotated to the right by one foregoing
element. Therefore, during execution of rotation segmented vector arithmetic reduction instruction 1001
(for example, before the results are generated and stored in output register 920), the first sequential
element of output register 920 may store d(N-1), the second sequential element of output register 920 may
store d(0), the third sequential element of output register 920 may store d(1), and the last sequential
element of output register 920 may store d(N-2). As another example, when the direction is to the left, the
multiple foregoing elements 922 may be rotated to the left by the rotation amount. In response to a second
determination not to rotate the multiple foregoing elements 922 (for example, based on the value stored in
the eighth field 1096), the multiple foregoing elements 922 may be maintained in their prior sequential
order (for example, d(0) ... d(N-1)). For example, when the value stored in the eighth field 1096 is zero
or a null value (for example, when the eighth field 1096 is not included in rotation segmented vector
arithmetic reduction instruction 1001), the multiple foregoing elements 922 may not be rotated. Therefore,
the multiple foregoing elements 922 may be selectively (for example, optionally) rotated based on rotation
segmented vector arithmetic reduction instruction 1001.
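The rotation step can be sketched in Python as follows. The sketch expresses the rotation amount in whole
elements rather than in bits, and the function name `rotate_elements` and the "left"/"right" direction
strings are hypothetical.

```python
def rotate_elements(elements, amount, direction):
    """Rotate register content by `amount` element positions."""
    items = list(elements)
    n = len(items)
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if n == 0 or amount % n == 0:
        return items  # no rotation: prior sequential order is kept
    k = amount % n
    if direction == "right":
        return items[-k:] + items[:-k]
    return items[k:] + items[:k]


print(rotate_elements(["d0", "d1", "d2", "d3"], 1, "right"))
# ['d3', 'd0', 'd1', 'd2']
```

With N = 4, rotating right by one element places d(N-1) first and d(N-2) last, as in the example above.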
Executing rotation segmented vector arithmetic reduction instruction 1001 may also include determining
whether to overwrite the multiple foregoing elements 922. For example, based on rotation segmented vector
arithmetic reduction instruction 1001 (for example, based on the value stored in the ninth field), each
element of the multiple foregoing elements 922 that is not replaced by a result of the arithmetic
operations may be set to zero (for example, overwritten). A specific foregoing element may be set to zero
by a corresponding adder of the reduction tree whose two inputs both receive zero (for example, as
illustrated by the adder below element s0 in the first row 812 of adders of Fig. 8). In other embodiments,
the multiple foregoing elements 922 may be set to (for example, overwritten with) a different value.
After the multiple foregoing elements 922 in output register 920 have been rotated, results of the
arithmetic operations may be generated based on the multiple input elements 902 and inserted into output
register 920. Execution of rotation segmented vector arithmetic reduction instruction 1001 may include
grouping the multiple input elements 902 into multiple groups, such as the first set of input elements 904
and the second set of input elements 906. A first arithmetic (for example, addition) operation may be
performed on the first set of input elements 904 to generate a first result s0+s1, and a second arithmetic
(for example, addition) operation may be performed on the second set of input elements 906 to generate a
second result s2+s3. The first result (s0+s1) may be inserted into a first output element 1016 of output
register 920, and the second result (s2+s3) may be inserted into a second output element 1018 of output
register 920. The first output element 1016 and the second output element 1018 may be different output
elements of output register 920.
A first number of input elements in the first set of input elements 904 and a second number of input
elements in the second set of input elements 906 may be based on the section packets size identified by
rotation segmented vector arithmetic reduction instruction 1001. For example, the first number of elements
and the second number of elements may be the same. When the number of generated results is less than the
number of output elements in output register 920, one or more rotated foregoing elements of the multiple
foregoing elements 922 (or one or more zeros, when the multiple foregoing elements 922 are overwritten
before the results are generated) may remain in output register 920 (for example, may not be overwritten).
For example, when the first output element 1016 and the second output element 1018 are inserted into
output register 920, the multiple output elements 1024 may include the rotated foregoing elements d(N-1)
and d1. When the section packets size of rotation segmented vector arithmetic reduction instruction 1001 is
a different size, the multiple input elements 902 may be grouped into different sets of input elements,
and different results may be generated.
During operation, the processor may receive rotation segmented vector arithmetic reduction instruction
1001. The processor may execute rotation segmented vector arithmetic reduction instruction 1001 using the
multiple input elements 902 to generate the multiple output elements 1024 and store them in output
register 920. The content of the output register (for example, the multiple foregoing elements 922) may be
selectively rotated based on rotation segmented vector arithmetic reduction instruction 1001, results may
be generated from the multiple input elements 902 grouped into one or more groups of input elements based
on the section packets size, and the results may be inserted into output register 920.
Referring to Fig. 11A, a diagram of a first illustrative embodiment of executing an accumulating vector
arithmetic reduction instruction with masking is disclosed and generally designated 1100. In an
illustrative, non-limiting example, the accumulating vector arithmetic reduction instruction may be the
accumulating vector arithmetic reduction instruction 101 of Fig. 1. The accumulating vector arithmetic
reduction instruction may identify a mask 1130 (for example, a vector mask). As explained with reference to
Fig. 1, the mask 1130 may be indicated by a value stored in the third field 186 (Q) of accumulating vector
arithmetic reduction instruction 101. For example, the mask 1130 may be included in the accumulating vector
arithmetic reduction instruction, or may be indicated by a pointer included in the instruction, where the
pointer points to a memory location or a register that stores the mask 1130. An individual value (for
example, an element) of the multiple elements 102 may be masked (for example, provided as zero to the
reduction tree that generates the one or more output elements) based on the corresponding element of the
mask 1130 being equal to zero. Alternatively, the value may be masked based on the corresponding element of
the mask 1130 being equal to one.
During execution of the accumulating vector arithmetic reduction instruction, the mask 1130 may be applied
to the multiple elements 102 before the first element 104 is provided as the first output element 112.
Applying the mask 1130 may include providing zero in place of a specific element of the multiple elements
102, depending on the corresponding mask value of the mask 1130. As shown, before the mask 1130 is applied
to the multiple elements 102, input vector 122 includes the elements s0, s1, s2 and s(N-1). After the mask
1130 is applied, the multiple elements 102 include s0, zero (provided in place of s1, based on the
corresponding element of the mask 1130 being equal to zero), s2 and s(N-1). In another embodiment, applying
the mask 1130 to the multiple elements may include modifying the value of one or more of the multiple
elements 102 in input vector 122. After the mask 1130 is applied to the multiple elements 102, execution of
the accumulating vector arithmetic reduction instruction may proceed as explained with reference to Fig. 1.
Therefore, output vector 120 may include the first output element 112 equal to s0, the second output
element 114 equal to 0+s0 (for example, s0), the third output element 116 equal to s2+s0, and the Nth
output element 118 equal to s0+s2+ ... +s(N-1).
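The input-side masking described above can be sketched in Python as follows. The sketch assumes the
convention in which a mask element equal to zero suppresses the corresponding input; the function name
`masked_accumulate` is hypothetical.

```python
from itertools import accumulate


def masked_accumulate(inputs, mask):
    """Zero out masked inputs, then form the running (accumulated) sums."""
    if len(inputs) != len(mask):
        raise ValueError("inputs and mask must have the same length")
    # A mask element of zero replaces the input with zero before accumulation.
    masked = [x if m else 0 for x, m in zip(inputs, mask)]
    return list(accumulate(masked))


print(masked_accumulate([1, 2, 3, 4], [1, 0, 1, 1]))  # [1, 1, 4, 8]
```

With s1 masked, the outputs are s0, s0, s0+s2 and s0+s2+s3, matching the pattern given in the text.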
Referring to Fig. 11B, a diagram of a second illustrative embodiment of executing an accumulating vector
arithmetic reduction instruction with masking is disclosed and generally designated 1101. Executing the
accumulating vector arithmetic reduction instruction may include applying the mask 1130 to output vector
120.
During execution of the accumulating vector arithmetic reduction instruction, the mask 1130 may be applied
to output vector 120 to generate a masked output vector 1126. As shown, applying the mask 1130 may yield a
masked output vector 1126 having the elements s0, zero, s0+s1+s2 and s0+s1+s2+ ... +s(N-1). Although
Fig. 11B shows the mask 1130 being applied after the output elements are stored in output vector 120, the
mask 1130 may instead be applied to the results of the arithmetic operations before they are inserted into
output vector 120. For example, one or more outputs (for example, s0+s1) may be prevented, based on the
mask 1130, from being stored in output vector 120, so that a previous value in output vector 120 is not
overwritten. In a particular embodiment, output vector 120 and masked output vector 1126 may be stored at
the same location, such as in the same register.
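Output-side masking can be sketched in Python in a similar way. The sketch below covers both variants
mentioned above: a masked position is either set to zero or, when previous register content is supplied,
left holding its previous value. The function name `accumulate_with_output_mask` and the optional
`previous` parameter are hypothetical.

```python
from itertools import accumulate


def accumulate_with_output_mask(inputs, mask, previous=None):
    """Accumulate all inputs, then suppress the outputs whose mask element is zero."""
    if len(inputs) != len(mask):
        raise ValueError("inputs and mask must have the same length")
    if previous is None:
        previous = [0] * len(inputs)  # masked outputs read as zero
    elif len(previous) != len(inputs):
        raise ValueError("previous must have the same length as inputs")
    sums = accumulate(inputs)  # unmasked running sums
    return [s if m else p for s, m, p in zip(sums, mask, previous)]


print(accumulate_with_output_mask([1, 2, 3, 4], [1, 0, 1, 1]))  # [1, 0, 6, 10]
```

Unlike input-side masking, the suppressed element s1 still contributes to the later sums here.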
In addition, the masking shown in Figs. 11A to 11B may be applied in a similar manner to the segmented
vector arithmetic reduction instruction 901 of Fig. 9 or to the rotation segmented vector arithmetic
reduction instruction 1001 of Fig. 10. For example, during execution of segmented vector arithmetic
reduction instruction 901, the mask 1130 may be applied to the multiple elements 102 before the multiple
elements 102 are grouped. As another example, during execution of rotation segmented vector arithmetic
reduction instruction 1001, the mask 1130 may be applied to output vector 120 after the content of the
output register that stores output vector 120 has been rotated (for example, after the content of output
vector 120 has been rotated). The masked output vector 1126 may include a first output element 1142 equal
to s0, a second output element 1144 equal to 0, a third output element 1146 equal to s0+s1+s2, and an Nth
output element 1148 equal to s0+s1+ ... +s(N-1).
Referring to Fig. 12, a flow chart of an illustrative embodiment of a method 1200 of executing an
accumulating vector arithmetic reduction instruction is illustrated. The accumulating vector arithmetic
reduction instruction may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the
vector instruction 220 of Fig. 2. In a particular embodiment, method 1200 may be performed by the processor
202 of Fig. 2.
At 1202, a vector instruction may be executed at a processor. The vector instruction may be the
accumulating vector arithmetic reduction instruction 101 of Fig. 1. The vector instruction may include a
vector input that includes multiple input elements. For example, the vector input may be the input vector
122 of Figs. 1 to 6. The vector input may include the multiple input elements 102 of Fig. 1. The multiple
input elements (for example, the vector input) may be stored in a sequential order. The vector input may be
identified by the vector instruction. For example, the vector input may be identified by a value stored in
a specific field (for example, a parameter), such as the second field 184 of the vector arithmetic
reduction instruction 101 of Fig. 1.
At 1204, a first input element of the multiple input elements may be provided as a first output element.
The first input element may be the first element 104 (s0) of Fig. 1, and the first output element may be
the first output element 112 (s0) of Fig. 1. For example, the first input element may be provided (for
example, generated) as the first output element by adding a zero input (for example, a value equal to
logical zero) to the first input element. The zero input may be added based on a control signal from
control logic included in the processor, as described with reference to Fig. 7.
At 1206, a first arithmetic operation may be performed on the first input element and a second input
element of the multiple input elements to provide (for example, generate) a second output element. For
example, the first arithmetic operation may be an addition operation. In other embodiments, the first
arithmetic operation may be a subtraction operation. The second input element may be the second element 106
(s1) of Fig. 1, and the second output element may be the second output element 114 (s0+s1) of Fig. 1. For
example, a value equal to the sum of the first input element and the second input element may be generated
(for example, provided) as the second output element. Each input element and each output element may
include multiple sub-elements, and the addition of the sub-elements may be performed in an interleaved
manner, as described with reference to Figs. 3 to 4.
At 1208, the first output element and the second output element may be stored in an output vector. The
output vector may be the output vector 120 of Figs. 1 to 6. For example, the first output element (for
example, a value equal to the first input element) and the second output element (for example, a value
equal to the sum of the first input element and the second input element) may be stored in different
output elements of the output vector, as shown in Fig. 1.
Additional output elements may be generated in this manner. For example, a second arithmetic operation may
be performed on the first input element, the second input element and a third input element of the
multiple input elements to generate (for example, provide) a third output element. Therefore, a specific
output element may be generated by performing a specific arithmetic operation on a specific input element
of the multiple input elements and on one or more other input elements that sequentially precede the
specific input element in the sequential order of the multiple input elements.
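Steps 1204 to 1208 can be summarized in a short Python sketch. It assumes addition as the arithmetic
operation and ignores sub-elements and interleaving; the function name `accumulate_outputs` is
hypothetical.

```python
def accumulate_outputs(inputs):
    """Return one output per input: the sum of that input and all preceding inputs."""
    outputs = []
    running = 0  # zero input added to the first element (step 1204)
    for value in inputs:
        running = running + value   # add the next sequential input (step 1206)
        outputs.append(running)     # store into the output vector (step 1208)
    return outputs


print(accumulate_outputs([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Each output is a partial result of the full reduction, and the last output equals the sum of all inputs.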
According to method 1200, multiple output elements (for example, the first output element and the second
output element) may be generated, and the output elements may represent multiple partial results of an
accumulating vector arithmetic reduction. By generating the multiple partial results during execution of a
single vector instruction, rather than during execution of multiple vector instructions, method 1200 may
provide improvements in storage and power consumption.
Referring to Fig. 13, a flow chart of an illustrative embodiment of a method 1300 of executing a vector
instruction using a reduction tree is illustrated. The vector instruction may be the vector instruction 220
of Fig. 2 or the segmented vector arithmetic reduction instruction 901 of Fig. 9. In a particular
embodiment, method 1300 may be performed by the processor 202 of Fig. 2.
At 1302, a vector instruction that includes a section packets size may be received at a processor. For example, the vector instruction may be the segmented vector arithmetic reduction instruction 901 of Fig. 9, with the section packets size indicated by the fifth field 990. The processor may include a reduction tree. The reduction tree may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, the reduction tree 800 of Fig. 8, or any combination thereof. The reduction tree may include multiple inputs, multiple arithmetic operation units, and multiple outputs. As an illustrative example, the multiple inputs may be the multiple input elements 802 of Fig. 8 or the multiple input elements 902 of Fig. 9; the multiple arithmetic operation units may be the multiple adders 804 of Fig. 8; and the multiple outputs may be the multiple output elements 806 of Fig. 8 or the multiple output elements 924 of Fig. 9.
At 1304, the section packets size may be determined. For example, the section packets size may be determined based on a specific field of the vector instruction (e.g., the fifth field 990 of Fig. 9). During execution of the vector instruction, the section packets size may indicate the size of one or more groups associated with the multiple input elements.
At 1306, the vector instruction may be executed using the reduction tree, based on the section packets size, to generate multiple outputs simultaneously. For example, executing the vector instruction may include grouping the multiple input elements into one or more groups having the section packets size and performing one or more arithmetic operations on the one or more groups to generate the multiple outputs. The multiple outputs may be generated during a single processing cycle of the processor based on the vector reduction instruction.
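The grouping and per-group arithmetic described at 1306 can be sketched as follows. This Python sketch is a software model under stated assumptions: the name `segmented_reduction`, the choice of addition as the arithmetic operation, and the requirement that the input length be a multiple of the group size are assumptions for illustration and are not taken from the figures.

```python
from typing import List


def segmented_reduction(inputs: List[int], group_size: int) -> List[int]:
    """Sum each consecutive group of `group_size` input elements.

    Assumes len(inputs) is a multiple of group_size; one output is
    produced per group.
    """
    if group_size <= 0 or len(inputs) % group_size != 0:
        raise ValueError("group_size must be positive and divide len(inputs)")
    return [
        sum(inputs[start:start + group_size])       # reduce one group
        for start in range(0, len(inputs), group_size)
    ]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(segmented_reduction(data, 4))  # two groups of four elements
    print(segmented_reduction(data, 2))  # four groups of two elements
```

The same input vector yields a different number of outputs for each group size, which is the behavior the section packets size controls.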
The reduction tree may be selectively configurable for use with multiple different section packets sizes. For example, a configuration of the reduction tree may be associated with a particular section packets size. The configuration of the reduction tree may be associated with enabling a particular subset of the arithmetic operation units and selecting a particular subset of the arithmetic operation unit inputs (e.g., a particular subset of enabled paths), such as a subset of the multiple adders 804 and of the multiple paths 830 to 844 of Fig. 8. After determining the section packets size in the vector instruction, the processor may determine whether the reduction tree is configured for use with the section packets size (e.g., whether the reduction tree is in a particular configuration associated with the section packets size). In response to determining that the reduction tree is not configured for use with the section packets size, the configuration of the reduction tree may be changed based on the section packets size. For example, based on the section packets size, one or more of the multiple arithmetic operation units may be enabled, and one or more arithmetic operation unit inputs may be selected. In response to determining that the reduction tree is configured for use with the section packets size, the vector instruction may be executed using the reduction tree. For example, when the reduction tree is already in the particular configuration associated with the section packets size, the reduction tree need not be changed before the vector instruction is executed.
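The check-then-reconfigure behavior described above can be modeled with a small class. This Python sketch is a simplified illustration: the class name `ReductionTreeModel`, its attributes, and the `reconfigurations` counter are hypothetical, and the model tracks only the currently configured group size rather than individual adders or paths.

```python
from typing import List, Optional


class ReductionTreeModel:
    """Toy model of a reduction tree that is reconfigured only when needed."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.group_size: Optional[int] = None  # no configuration selected yet
        self.reconfigurations = 0              # counts configuration changes

    def configure(self, group_size: int) -> bool:
        """Switch to `group_size`; return True only if a change was made."""
        if group_size <= 0 or self.num_inputs % group_size != 0:
            raise ValueError("group_size must be positive and divide num_inputs")
        if self.group_size == group_size:
            return False                       # already configured, no change
        self.group_size = group_size
        self.reconfigurations += 1
        return True

    def execute(self, inputs: List[int], group_size: int) -> List[int]:
        """Reduce `inputs` by addition in groups of `group_size`."""
        if len(inputs) != self.num_inputs:
            raise ValueError("unexpected number of input elements")
        self.configure(group_size)
        return [
            sum(inputs[start:start + group_size])
            for start in range(0, len(inputs), group_size)
        ]


if __name__ == "__main__":
    tree = ReductionTreeModel(num_inputs=8)
    print(tree.execute(list(range(8)), 4), tree.reconfigurations)
    print(tree.execute(list(range(8)), 4), tree.reconfigurations)
    print(tree.execute(list(range(8)), 2), tree.reconfigurations)
```

Executing two instructions with the same group size in a row leaves the configuration untouched, while a different group size triggers one change.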
According to method 1300, the reduction tree may be selectively configurable for use with multiple instructions having different section packets sizes. Compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section packets sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Fig. 14, a flowchart of an illustrative embodiment of a method 1400 of executing a rotation segmented vector arithmetic reduction instruction is illustrated. The rotation segmented vector arithmetic reduction instruction may be the vector instruction 220 of Fig. 2 or the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10. In a particular embodiment, method 1400 may be performed by the processor 202 of Fig. 2.
At 1402, a vector instruction that includes multiple input elements may be executed. For example, the vector instruction may be the rotation segmented vector arithmetic reduction instruction 1001, and the multiple input elements may be the multiple input elements 902 of Fig. 10.
At 1404, a first subset of the multiple input elements may be grouped to form a first set of input elements. For example, the first set of input elements may be the first set of input elements 1004 of Fig. 10. The first subset of the multiple input elements may be grouped to form the first set of input elements based on a section packets size included in the rotation segmented vector arithmetic reduction instruction. For example, the section packets size may be identified by a specific field (e.g., a parameter) of the rotation segmented vector arithmetic reduction instruction, such as the fifth field 1090 of the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10.
At 1406, a second subset of the multiple input elements may be grouped to form a second set of input elements. For example, the second set of input elements may be the second set of input elements 1006 of Fig. 10. The second subset of the multiple input elements may be grouped to form the second set of input elements based on the section packets size included in the rotation segmented vector arithmetic reduction instruction. In a particular embodiment, the size of the first set of input elements may be the same as the size of the second set of input elements. In an alternative embodiment, the size of the first set of input elements and the size of the second set of input elements may differ.
At 1408, a first arithmetic operation may be performed on the first set of input elements. For example, a first addition operation may be performed on the first set of input elements. In a particular embodiment, the first arithmetic operation may be indicated by an operation vector. The operation vector may be indicated by a value stored in a specific field (e.g., a parameter) of the rotation segmented vector arithmetic reduction instruction, such as the fourth field 1088 of the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10.
At 1410, a second arithmetic operation may be performed on the second set of input elements. For example, a second addition operation may be performed on the second set of input elements. In a particular embodiment, the second arithmetic operation may be indicated by the operation vector.
At 1412, the content of an output register may be rotated. For example, the output register may be the output register 1020 of Fig. 10 and may contain multiple foregoing elements (e.g., content), such as the multiple foregoing elements 922 of Fig. 10. The output register may be identified by a value stored in a specific field (e.g., a parameter) of the rotation segmented vector arithmetic reduction instruction, such as the second field 1084 of the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10. As an illustrative example, the multiple foregoing elements may be results produced by a previously executed vector instruction, or may be multiple null values. In a particular embodiment, the multiple foregoing elements may be the results of a previously executed rotation segmented vector arithmetic reduction instruction. Rotating the content of the output register may include selectively rotating the content of the output register based on a value stored in a specific field (e.g., a parameter) of the rotation segmented vector arithmetic reduction instruction, such as the eighth field 1096 (e.g., a rotation field) of the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10. For example, the value stored in the rotation field may indicate a rotation amount and a rotation direction, and the content of the output register may be rotated by the rotation amount in the rotation direction. The content of the output register may be overwritten (e.g., set equal to zero) based on a specific field of the rotation segmented vector arithmetic reduction instruction.
At 1414, after the content of the output register is rotated, a first result of the first arithmetic operation and a second result of the second arithmetic operation may be inserted into the output register. For example, the first result may be inserted into a first output element of the output register, and the second result may be inserted into a second output element of the output register. The first output element may be the first output element 1016 of Fig. 10, and the second output element may be the second output element 1018 of Fig. 10. The first result and the second result may overwrite values that were previously stored in the output register (and rotated at 1412).
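The rotate-then-insert sequence of 1412 and 1414 can be sketched as follows. This Python sketch makes several assumptions that the text does not fix: the function name `rotate_and_insert` is hypothetical, the rotation moves elements toward higher indices, and the new results are written to the lowest-index output elements.

```python
from typing import List


def rotate_and_insert(
    output_register: List[int], results: List[int], rotate_by: int
) -> List[int]:
    """Rotate the register contents, then overwrite the leading elements.

    Elements are rotated toward higher indices by `rotate_by` positions
    (wrapping around), after which `results` replaces the first
    len(results) elements.
    """
    size = len(output_register)
    if size == 0 or len(results) > size:
        raise ValueError("results must fit in a non-empty output register")
    shift = rotate_by % size
    if shift:
        rotated = output_register[-shift:] + output_register[:-shift]
    else:
        rotated = list(output_register)
    rotated[:len(results)] = results  # insert new results over rotated content
    return rotated


if __name__ == "__main__":
    # Prior results 10 and 20 are moved out of the way before 1 and 2 are inserted.
    print(rotate_and_insert([10, 20, 30, 40], [1, 2], rotate_by=2))
```

Rotating before inserting lets successive instructions accumulate their results in one register without overwriting the results of the immediately preceding instruction.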
According to method 1400, rotation and segmented vector arithmetic reduction may be performed by executing a single vector instruction using a single reduction tree for multiple section packets sizes. Compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section packets sizes, using a single reduction tree may reduce device size and power consumption.
Referring to Fig. 15, a block diagram of a particular illustrative embodiment of a device (e.g., a communication device) is depicted and generally designated 1500. The device includes a reduction tree 1580 for executing an accumulating vector arithmetic reduction instruction 1562 and a segmented vector arithmetic reduction instruction 1564. As an illustrative example, the reduction tree 1580 may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8. Device 1500 may be a wireless electronic device and may include a processor, such as a digital signal processor (DSP) 1510, coupled to memory 1532.
Processor 1510 may be configured to execute computer-executable instructions 1560 (e.g., a program of one or more instructions) stored in memory 1532 (e.g., a computer-readable storage medium). The instructions 1560 may include the accumulating vector arithmetic reduction instruction 1562 and/or the segmented vector arithmetic reduction instruction 1564. The accumulating vector arithmetic reduction instruction 1562 may be the accumulating vector arithmetic reduction instruction 101 of Fig. 1 or the vector instruction 220 of Fig. 2. The segmented vector arithmetic reduction instruction 1564 may be the vector instruction 220 of Fig. 2, the segmented vector arithmetic reduction instruction 901 of Fig. 9, or the rotation segmented vector arithmetic reduction instruction 1001 of Fig. 10.
A camera interface 1568 is coupled to processor 1510 and is also coupled to a camera, such as video camera 1570. A display controller 1526 is coupled to processor 1510 and to display 1528. An encoder/decoder (codec) 1534 may also be coupled to processor 1510. A loudspeaker 1536 and a microphone 1538 may be coupled to codec 1534. A wireless interface 1540 may be coupled to processor 1510 and to antenna 1542, so that wireless data received via antenna 1542 and wireless interface 1540 can be provided to processor 1510.
In a particular embodiment, processor 1510 may be configured to execute computer-executable instructions 1560 stored at a non-transitory computer-readable medium (e.g., memory 1532), the instructions being executable to cause a computer (e.g., processor 1510) to provide a first element of multiple elements as a first output element. The computer-executable instructions 1560 may include the accumulating vector arithmetic reduction instruction 1562. The multiple elements may be the multiple elements 102 of Fig. 1 and may be stored in an input vector (e.g., the input vector 122 of Figs. 1 to 6). The computer-executable instructions 1560 may further be executable by the computer to perform an arithmetic operation on the first element and a second element of the multiple elements to provide a second output. The computer-executable instructions 1560 may further be executable by the computer to store the first output and the second output in an output vector. The output vector may be the output vector 120 of Figs. 1 to 6.
In a particular embodiment, processor 1510 may be configured to execute computer-executable instructions 1560 stored at a non-transitory computer-readable medium (e.g., memory 1532), the instructions being executable to cause a computer (e.g., processor 1510) to receive a vector instruction that includes a section packets size. The vector instruction may be the segmented vector arithmetic reduction instruction 1564. The computer-executable instructions 1560 may further be executable to determine the section packets size. The computer-executable instructions 1560 may further be executable to execute the vector instruction using a reduction tree, based on the section packets size, to generate multiple outputs simultaneously. As an illustrative example, the reduction tree may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8. The reduction tree may include multiple inputs, multiple arithmetic operation units, and multiple outputs. The reduction tree may be selectively configurable for use with multiple different section packets sizes.
In a particular embodiment, processor 1510, display controller 1526, memory 1532, codec 1534, wireless interface 1540, and camera interface 1568 are included in a system-in-package or system-on-chip device 1522. In a particular embodiment, an input device 1530 and a power supply 1544 are coupled to the system-on-chip device 1522. Moreover, in a particular embodiment, as illustrated in Fig. 15, display 1528, input device 1530, loudspeaker 1536, microphone 1538, antenna 1542, video camera 1570, and power supply 1544 are external to the system-on-chip device 1522. However, each of display 1528, input device 1530, loudspeaker 1536, microphone 1538, antenna 1542, video camera 1570, and power supply 1544 may be coupled to a component of the system-on-chip device 1522, such as an interface or a controller.
The methods 1200 to 1400 of Figs. 12 to 14 may be implemented by a field programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit such as a central processing unit (CPU), a digital signal processor (DSP), a controller, another hardware device, a firmware device, or any combination thereof. As an example, the method 1200 of Fig. 12, the method 1300 of Fig. 13, the method 1400 of Fig. 14, or any combination thereof may be initiated by a processor that executes instructions stored in memory 1532, as described with respect to Fig. 15.
In conjunction with one or more of the described embodiments, an apparatus is disclosed that may include means for providing a first element of multiple elements as a first output. The means for providing may include one or more adders of a reduction tree, such as the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8; one or more other devices or circuits configured to provide the first element as the first output; or any combination thereof. The apparatus may further include means for generating a second output based on the first element and a second element of the multiple elements. The means for generating may include one or more adders of a reduction tree, such as the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, or the reduction tree 800 of Fig. 8; one or more other devices or circuits configured to generate the second output based on the first element and the second element; or any combination thereof. The apparatus may further include means for storing the first output and the second output in an output vector. The means for storing may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, the reduction tree 800 of Fig. 8, one or more other devices or circuits configured to store outputs in the output vector, or any combination thereof.
The apparatus may also include means for saturating the second output. The means for saturating the second output may include the first saturation logic circuit 730 or the second saturation logic circuit 732 of Fig. 7, one or more other devices or circuits configured to saturate an output, or any combination thereof.
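As a rough illustration of what saturating an output can mean, the following Python sketch clamps a sum to the range of a signed fixed-width integer. The function name `saturate` and the default 16-bit signed width are assumptions for this example; the text above does not specify the element width or signedness handled by the saturation logic circuits.

```python
def saturate(value: int, bits: int = 16) -> int:
    """Clamp `value` to the range of a signed two's-complement integer.

    With the assumed default of 16 bits the range is -32768..32767.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    upper = (1 << (bits - 1)) - 1   # largest representable value
    lower = -(1 << (bits - 1))      # smallest representable value
    return max(lower, min(upper, value))


if __name__ == "__main__":
    # A sum that overflows 16 bits is held at the maximum instead of wrapping.
    print(saturate(30000 + 10000))
```

Clamping instead of wrapping keeps an overflowing partial sum at the nearest representable value.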
In conjunction with one or more of the described embodiments, an apparatus is disclosed that may include means for simultaneously generating multiple outputs based on a vector instruction. The means for simultaneously generating may include the reduction tree 206 of Fig. 2, the reduction trees 300 to 600 of Figs. 3 to 6, a portion of the reduction tree 700 of Fig. 7, the reduction tree 800 of Fig. 8, one or more other devices or circuits configured to simultaneously generate multiple outputs based on a vector instruction, or any combination thereof. The means for simultaneously generating may be used by a processor during execution of a first instruction that includes a first section packets size and during execution of a second instruction that includes a second section packets size.
One or more of the disclosed embodiments may be implemented in a system or an apparatus (e.g., device 1500) that may include a set-top box, an entertainment unit, a navigation device, a communication device, a personal digital assistant (PDA), a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a tablet computer, a desktop computer, a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, or a combination thereof. As another illustrative, non-limiting example, the system or apparatus may include remote units, such as mobile phones, hand-held personal communication systems (PCS) units, portable data units such as personal digital assistants, global positioning system (GPS) enabled devices, navigation devices, fixed location data units such as meter reading equipment, or any other device that stores or retrieves data or computer instructions, or any combination thereof. Although one or more of Figs. 1 to 15 may illustrate systems, apparatuses, and/or methods according to the teachings of the present invention, the present invention is not limited to these illustrated systems, apparatuses, and/or methods. Embodiments of the present invention may be suitably employed in any device that includes integrated circuitry (including memory and on-chip circuitry).
One or more of the disclosed embodiments may be implemented in a system or an apparatus (e.g., device 1500) that may include a communication device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a tablet computer, a portable computer, or a desktop computer. In addition, device 1500 may include a set-top box, an entertainment unit, a navigation device, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, any other device that stores or retrieves data or computer instructions, or a combination thereof. As another illustrative, non-limiting example, the system or apparatus may include remote units, such as mobile phones, hand-held personal communication systems (PCS) units, portable data units such as personal digital assistants, global positioning system (GPS) enabled devices, navigation devices, fixed location data units such as meter reading equipment, or any other device that stores or retrieves data or computer instructions, or any combination thereof.
Although one or more of Figs. 1 to 15 may illustrate systems, apparatuses, and/or methods according to the teachings of the present invention, the present invention is not limited to these illustrated systems, apparatuses, and/or methods. Embodiments of the present invention may be suitably employed in any device that includes integrated circuitry (including memory, a processor, and on-chip circuitry).
Those skilled in the art will further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as processor-executable software depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary non-transitory (e.g., tangible) storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed embodiments is provided to enable those skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those of ordinary skill in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.