CN114548387B - Methods for performing multiplication operations in neural network processors and neural network processors - Google Patents
Methods for performing multiplication operations in neural network processors and neural network processors
- Publication number
- CN114548387B
- Authority
- CN
- China
- Prior art keywords
- product
- activation value
- multiplier
- weight
- weights
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/01—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
- G06F5/012—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising in floating-point computations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/4836—Computations with rational numbers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/485—Adding; Subtracting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/487—Multiplying; Dividing
- G06F7/4876—Multiplying
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Nonlinear Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Neurology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
- Advance Control (AREA)
Abstract
A method for a neural network processor to perform a multiplication operation and a neural network processor. In some embodiments, the method includes forming a first set of products and forming a second set of products. The step of forming the first set of products may include multiplying the first activation value with the least significant sub-word and the most significant sub-word of the first weight in a first multiplier to form a first partial product and a second partial product, and adding the first partial product to the second partial product. The step of forming the second set of products may include multiplying the second activation value with the first and second subwords of mantissas in the first multiplier to form a third partial product and a fourth partial product and adding the third partial product to the fourth partial product.
Description
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/112,271, entitled "SYSTEM AND METHOD FOR IMPROVING AREA AND POWER EFFICIENCY BY REDISTRIBUTING WEIGHT NIBBLES", filed 11/2020, and U.S. Application No. 17/131,357, filed 12/22/2020, the entire contents of which are incorporated herein by reference.
Technical Field
One or more aspects in accordance with embodiments of the present disclosure relate to processing circuitry, and more particularly, to systems and methods for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.
Background
Processors for neural networks may perform a large number of multiplication and addition operations, some of which may waste processing resources because a large portion of the numbers being processed may be relatively small and only a small portion, the outliers, may be relatively large. Further, some operations in such systems may be integer operations and some may be floating-point operations, which may consume a significant amount of chip area and power if performed on separate sets of dedicated hardware.
Thus, there is a need for a system and method for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.
Disclosure of Invention
According to an embodiment of the invention, there is provided a method comprising forming a first set of products, each product of the first set of products being an integer product of a first activation value and a corresponding weight of a first plurality of weights, and forming a second set of products, each product of the second set of products being a floating-point product of a second activation value and a corresponding weight of a second plurality of weights, the step of forming the first set of products comprising multiplying the first activation value with a least significant sub-word of a first weight of the first plurality of weights in a first multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the first weight in a second multiplier to form a second partial product, and adding the first partial product to the second partial product, the step of forming the second set of products comprising multiplying the second activation value with a first sub-word of the mantissa of a first weight of the second plurality of weights in the first multiplier to form a third partial product, multiplying the second activation value with a second sub-word of the mantissa in the second multiplier to form a fourth partial product, and adding the third partial product to the fourth partial product.
In some embodiments, the second activation value is a nibble of the mantissa of a floating-point activation value.
In some embodiments, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.
In some embodiments, the step of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.
In some embodiments, the offset of the offset adder is equal to the width of the first subword of the mantissa.
In some embodiments, the step of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.
In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.
In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.
In some embodiments, the first activation value is the most significant sub-word of the integer activation value.
In some embodiments, the method further comprises shifting the sum of the first partial product and the second partial product to the left by a number of bits equal to the size (bit width) of the first activation value.
According to an embodiment of the invention, there is provided a system comprising a processing circuit comprising a first multiplier, a second multiplier and a third multiplier, the processing circuit being configured to form a first set of products, each product of the first set of products being an integer product of a first activation value with a respective weight of a first plurality of weights, and to form a second set of products, each product of the second set of products being a floating-point product of a second activation value with a respective weight of a second plurality of weights, the process of forming the first set of products comprising multiplying the first activation value with a least significant sub-word of a first weight of the first plurality of weights in the first multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the first weight in the second multiplier to form a second partial product, and adding the first partial product to the second partial product, the process of forming the second set of products comprising multiplying the second activation value with a first sub-word of the mantissa of a first weight of the second plurality of weights in the first multiplier to form a third partial product, multiplying the second activation value with a second sub-word of the mantissa in the second multiplier to form a fourth partial product, and adding the third partial product to the fourth partial product.
In some embodiments, the second activation value is a nibble of the mantissa of a floating-point activation value.
In some embodiments, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.
In some embodiments, the process of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.
In some embodiments, the offset of the offset adder is equal to the width of the first subword of the mantissa.
In some embodiments, the process of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.
In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.
In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.
In some embodiments, the first activation value is the most significant sub-word of the integer activation value.
According to an embodiment of the invention, there is provided a system comprising means for processing comprising a first multiplier, a second multiplier and a third multiplier, the means for processing being configured to form a first set of products, each product of the first set of products being an integer product of a first activation value with a respective weight of a first plurality of weights, and to form a second set of products, each product of the second set of products being a floating-point product of a second activation value with a respective weight of a second plurality of weights, the process of forming the first set of products comprising multiplying the first activation value with a least significant sub-word of a first weight of the first plurality of weights in the first multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the first weight in the second multiplier to form a second partial product, and adding the first partial product to the second partial product, the process of forming the second set of products comprising multiplying the second activation value with a first sub-word of the mantissa of a first weight of the second plurality of weights in the first multiplier to form a third partial product, multiplying the second activation value with a second sub-word of the mantissa in the second multiplier to form a fourth partial product, and adding the third partial product to the fourth partial product.
Drawings
These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims and appended drawings, wherein:
FIG. 1 is a block diagram of a portion of a neural network processor, according to an embodiment of the present disclosure;
FIG. 2A is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure;
FIG. 2B is a data map according to an embodiment of the present disclosure;
FIG. 2C is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure, and
FIG. 3 is a data map according to an embodiment of the present disclosure.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of processors for fine-grain sparse integer and floating-point operations provided in accordance with the present disclosure, and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. However, it is to be understood that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. Like element numbers are intended to indicate like elements or features as shown elsewhere herein.
A neural network (e.g., when performing inference) may perform a large number of computations in which an activation (or "activation value"), which is an element of an input feature map (IFM), is multiplied by a weight. The products of the activations and the weights may form a multi-dimensional array, which may be summed along one or more axes to form an array or "tensor" (which may be referred to as an output feature map (OFM)). Referring to FIG. 1, dedicated hardware may be employed to perform such calculations. The activations may be stored in a static random access memory (SRAM) 105 and fed into a multiplier-accumulator (MAC) array, which may include (i) a plurality of blocks (which may be referred to as "bricks" 110), each of which may include a plurality of multipliers for multiplying the activations with weights, (ii) one or more adder trees for adding together the products generated by the bricks, and (iii) one or more accumulators for accumulating the sums generated by the adder trees. Each activation value may be broadcast to a plurality of multipliers conceptually arranged in a row in the representation of FIG. 1. A plurality of adder trees 115 may be employed to form the sums.
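As a rough software analogy of the data flow described above, the following Python sketch broadcasts each activation across one row of multipliers and lets one accumulator per column collect the sums. The function name `mac_array`, the row and column layout, and the sample numbers are illustrative assumptions and are not taken from FIG. 1.

```python
def mac_array(activations, weights):
    """Multiply-accumulate sketch: one activation per row, one sum per column.

    `weights[i][j]` is the (hypothetical) weight held by the multiplier in
    row i, column j; activation i is broadcast to every multiplier in row i.
    """
    n_cols = len(weights[0])
    accumulators = [0] * n_cols
    for activation, weight_row in zip(activations, weights):
        # Broadcast: the same activation feeds every multiplier in the row.
        products = [activation * w for w in weight_row]
        # Adder tree plus accumulator: sum the products column by column.
        for col, product in enumerate(products):
            accumulators[col] += product
    return accumulators


ofm = mac_array([1, 2, 3], [[1, 0], [0, 1], [2, 2]])
```

Each entry of `ofm` plays the role of one element of the output feature map.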
In operation, the weights may fall within a range of values, and the distribution of the values of the weights may be such that relatively small weights are significantly more common than relatively large weights. For example, if each weight is represented as an 8-bit number (e.g., an 8-bit integer (INT8) or an 8-bit floating-point number), then many weights (e.g., most of the weights, or more than 3/4 of the weights) may have a value less than 16 (i.e., the most significant nibble is zero), and a weight with a non-zero most significant nibble may be referred to as an "outlier". In some embodiments, appropriately configured hardware may achieve increased speed and power efficiency by taking advantage of these characteristics of the weights. However, the inventive concept is not so limited, and, for example, each weight may also be represented as a 16-bit number (e.g., a 16-bit integer or a 16-bit floating-point number (FP16)).
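The outlier criterion above can be illustrated with a minimal Python sketch, assuming unsigned 8-bit weight magnitudes. The helper names `split_nibbles` and `is_outlier` and the sample weights are made up for illustration.

```python
def split_nibbles(weight):
    """Return (least significant nibble, most significant nibble) of an 8-bit magnitude."""
    if not 0 <= weight <= 0xFF:
        raise ValueError("expected an unsigned 8-bit magnitude")
    return weight & 0xF, weight >> 4


def is_outlier(weight):
    """A weight is an outlier when its most significant nibble is non-zero (value >= 16)."""
    _, most_significant = split_nibbles(weight)
    return most_significant != 0


weights = [3, 200, 15, 16, 0, 7, 129, 9]  # hypothetical sample, not from the source
outliers = [w for w in weights if is_outlier(w)]
outlier_fraction = len(outliers) / len(weights)
```

For a non-outlier weight only the least significant nibble has to be multiplied, which is the property the hardware described below exploits.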
FIG. 2A shows a portion of a hybrid processing circuit (referred to as "hybrid" because it is suitable for both integer and floating-point operations). Referring to FIG. 2A, in some embodiments, a plurality of multipliers 205 are used to multiply the weights with the activations, e.g., one nibble at a time. Each multiplier may be a 4 x 4 (i.e., 4 bits by 4 bits) multiplier, with a first input 210 configured to receive a corresponding weight nibble and a second input 215 configured to receive an activation nibble (which may be broadcast to all of the multipliers). An embodiment with nine multipliers is shown; in some embodiments there are more multipliers (yielding a more capable but more costly circuit) and in some embodiments there are fewer multipliers (yielding a less capable but less costly circuit). The weight buffer 220 (only the output row of which is shown) may include a respective register for each multiplier 205. The outputs of the multipliers 205 may be fed to a plurality of combining circuits 225, each of which may include one or more multiplexers 230, adders, and inverting-shifting circuits 235. The system may include the same number of combining circuits 225 as multipliers 205, fewer combining circuits 225 than multipliers 205 (as shown), or more combining circuits 225 than multipliers 205.
In operation, each multiplier may produce a partial product during each clock cycle. These partial products may be added together to form integer products in an integer operation (or partial products in a floating-point operation), each of which may be processed by an inverting-shifting circuit 235, and the results (e.g., Unit 0, Unit 1, ..., Unit 7) may be sent to adder trees (e.g., adder tree 0, adder tree 1, ..., adder tree 7) to be added to other integer products. For example, as shown in FIG. 2A, the first two values in the output row of the weight buffer 220 may be the least significant nibble L0 and the most significant nibble M0 of the first weight, respectively, and the activation value nibble being broadcast may be the least significant nibble of the first (8-bit) activation value. However, the inventive concept is not limited thereto, and, in another example, the first two values in the output row of the weight buffer 220 may be the most significant nibble M0 and the least significant nibble L0 of the first weight, respectively. The activation value nibble may be multiplied with the least significant nibble L0 of the first weight to form a first partial product P0, and the activation value nibble may be multiplied with the most significant nibble M0 of the first weight to form a second partial product P1. Although FIG. 2A illustrates an example in which the first partial product P0 and the second partial product P1 are calculated using two multipliers (e.g., a first multiplier and a second multiplier) during one clock cycle, the inventive concept is not limited thereto, and, in another example, the first partial product P0 and the second partial product P1 may be calculated using the same multiplier (e.g., a third multiplier) during two respective clock cycles. In FIG. 2A, the output row of the weight buffer 220 may include nibbles L2, L3, M4, L7, M7, and 0, and the partial products P2, P3, ..., Pn-1 may be formed in a similar manner. These partial products may be routed to the first combining circuit 225 (the leftmost one in FIG. 2A) through the connection structure (or "connection fabric") 240. The connection structure 240 may include the multiplexers 230 (e.g., UL0, UM0, UL1, UM1, ..., UL7, and UM7); the multiplexers 230 are depicted as separate elements in FIG. 2A to facilitate the illustration (using arrows) of the data routing performed by the multiplexers 230. In the first combining circuit 225, the product of (i) the weight (two nibbles) and (ii) the activation value nibble may be calculated as an offset sum of the first partial product and the second partial product (calculated by the corresponding offset adder 245).
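The nibble-wise product described above can be sketched in Python as follows. This is a minimal illustration assuming unsigned values; the names `offset_add` and `weight_times_activation_nibble` are hypothetical, and the four-bit offset matches the nibble width used in the example.

```python
def offset_add(low_part, high_part, offset=4):
    """Offset sum: add low_part to high_part shifted left by `offset` bits."""
    return low_part + (high_part << offset)


def weight_times_activation_nibble(weight, activation_nibble):
    """Product of an 8-bit weight and a 4-bit activation nibble via two 4 x 4 multiplies."""
    l0 = weight & 0xF          # least significant nibble of the weight
    m0 = weight >> 4           # most significant nibble of the weight
    p0 = activation_nibble * l0   # first partial product (first multiplier)
    p1 = activation_nibble * m0   # second partial product (second multiplier)
    return offset_add(p0, p1)


product = weight_times_activation_nibble(0xC8, 0x5)
```

Because the weight equals `l0 + (m0 << 4)`, the offset sum equals the ordinary product of the weight and the activation nibble.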
As used herein, an "offset sum" of two values is the result of an "offset addition", which is the forming of the sum of (i) a first one of the two values and (ii) a second one of the two values shifted to the left by a number of bits (e.g., four bits), and an "offset adder" is an adder that adds two numbers with an offset between the positions of their least significant bits (this offset being referred to as the "offset" of the offset adder). As used herein, the "significance" of a nibble (or, more generally, of a sub-word (discussed in further detail below)) is the position that the nibble occupies in the word of which it is a part (e.g., whether the nibble is the most significant nibble or the least significant nibble of an 8-bit word). Thus, the most significant nibble of an 8-bit word has a significance four bits greater than that of the least significant nibble. For example, for an integer operation, the difference between the significance of the first sub-word of a weight and the significance of the second sub-word of the weight is equal to the width of the first sub-word of the weight, and the offset of the offset adder is equal to the width of the first sub-word of the weight. Similarly, for a floating-point operation, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa, and the offset of the offset adder is equal to the width of the first sub-word of the mantissa. The difference between the significance of the first sub-word and the significance of the second sub-word corresponds to the offset between the position occupied by the first sub-word in the word and the position occupied by the second sub-word in the word. A word (e.g., of N bits, N being an integer greater than 1) may be divided into a plurality of sub-words. The sub-word that occupies the most significant position in the word is referred to as the most significant sub-word, and the sub-word that occupies the least significant position in the word is referred to as the least significant sub-word.
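The definitions of sub-word and significance can be made concrete with a short Python sketch. It assumes equal-width sub-words and expresses significance as the bit position of each sub-word's least significant bit; `split_subwords` and `join_subwords` are illustrative names.

```python
def split_subwords(word, word_bits=8, sub_bits=4):
    """Split `word` into (significance, value) pairs, least significant sub-word first."""
    if word_bits % sub_bits:
        raise ValueError("word width must be a multiple of the sub-word width")
    mask = (1 << sub_bits) - 1
    return [(sig, (word >> sig) & mask) for sig in range(0, word_bits, sub_bits)]


def join_subwords(subwords):
    """Rebuild the word by shifting each sub-word left by its significance."""
    return sum(value << sig for sig, value in subwords)


byte_parts = split_subwords(0xA7)              # two nibbles of an 8-bit word
mantissa_parts = split_subwords(0xABC, 12, 4)  # three nibbles of a 12-bit word
```

Adjacent sub-words differ in significance by exactly the sub-word width, which is why an offset adder whose offset equals that width can combine their partial products.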
Each of the inverting-shifting circuits 235 may convert between (i) a sign and magnitude representation and (ii) a two's complement representation, and may shift the result as needed for proper addition to occur in the adder tree. For example, if the activation value nibble is the most significant nibble, the output of the offset adder 245 may be shifted (e.g., by 4 bits to the left) so that, in the adder tree, the output bits will be properly aligned with the bits of other products (e.g., the product of a weight with the least significant nibble of an activation value). The conversion between the sign and magnitude representation and the two's complement representation may be performed, for example, if the multipliers 205 are unsigned integer multipliers and the adder tree is a two's complement adder tree.
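The following Python sketch shows how these steps could combine into a full signed 8-bit multiplication. It assumes unsigned 4 x 4 multiplies on magnitudes, a 16-bit two's complement accumulator, and a 4-bit left shift for the most significant activation nibble; all function names are hypothetical, and the sketch models the arithmetic only, not the circuit timing.

```python
def to_twos_complement(negative, magnitude, bits=16):
    """Convert a sign and magnitude pair to a `bits`-wide two's complement pattern."""
    value = -magnitude if negative else magnitude
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern, bits=16):
    """Interpret a `bits`-wide two's complement pattern as a Python int."""
    return pattern - (1 << bits) if pattern & (1 << (bits - 1)) else pattern


def multiply_int8(weight, activation):
    """Signed 8-bit product built from unsigned nibble products."""
    negative = (weight < 0) != (activation < 0)
    w_mag, a_mag = abs(weight), abs(activation)
    total = 0
    for a_shift in (0, 4):  # least, then most significant activation nibble
        a_nibble = (a_mag >> a_shift) & 0xF
        p0 = a_nibble * (w_mag & 0xF)   # partial product with weight nibble L
        p1 = a_nibble * (w_mag >> 4)    # partial product with weight nibble M
        offset_sum = p0 + (p1 << 4)     # offset adder, offset of four bits
        shifted = offset_sum << a_shift  # align by activation nibble significance
        total = (total + to_twos_complement(negative, shifted)) & 0xFFFF
    return from_twos_complement(total)


result = multiply_int8(-100, 57)
```

The two loop iterations correspond to processing the two activation nibbles in turn, with the adder tree modeled by the running 16-bit total.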
As shown in FIG. 2B, the arrangement of weight nibbles in the weight buffer 220 may be the result of preprocessing. As shown, the original weight array 250 may include a first row of least significant nibbles (labeled "L0", etc.) and a second row of most significant nibbles (labeled "M0", etc.). Some of the nibbles may be zero, as shown by the blank cells in FIG. 2B. The preprocessing may rearrange these nibbles as the weight buffer is filled (e.g., as indicated by the arrows in FIG. 2B) such that the weight buffer contains a smaller proportion of zero-valued nibbles than the original weight array 250. In the example of FIG. 2B, the eight weights (each consisting of a least significant nibble and a most significant nibble) are rearranged such that the zero-valued nibbles are discarded and the non-zero nibbles are placed into eight positions of one row of the weight buffer (the ninth position contains zero), so that when this row of the weight buffer is processed by the array of nine multipliers 205 (FIG. 2A), eight of the multipliers are used and only one (the ninth) is unused. Thus, as shown in FIG. 2B, the most significant nibble of weight 2 in the original weight array 250 is 0, so that the weight buffer includes only the least significant nibble L2 of weight 2 and does not include the most significant nibble of weight 2. In this case, the activation value nibble and the least significant nibble L2 of weight 2 may be multiplied in one multiplier 205 to form a partial product P2, and the partial product P2 and 0 may be fed into the combining circuit 225 to be added. Furthermore, in some cases, the sparsity of the original weight array 250 may not be sufficient to allow all of the most significant nibbles to be in the same row of the weight buffer as the corresponding least significant nibbles, and some or all of the products may be formed in two clock cycles, with the activation value remaining the same for both cycles.
The preprocessing may also generate an array of control signals that may be used to control the connection structure 240 (e.g., the multiplexers 230) such that each partial product is sent to the appropriate input of an offset adder 245 according to the significance of the factors forming it.
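A simplified software model of this preprocessing is sketched below in Python. It assumes unsigned 8-bit weights, a nine-entry buffer row, and a control signal represented as a (weight index, significance) pair per buffer slot; this encoding, the function names, and the sample weights are illustrative assumptions rather than the control format of the disclosure.

```python
def pack_row(weights, width=9):
    """Pack the non-zero nibbles of 8-bit weights into one buffer row.

    Returns (row, controls), where controls[i] is (weight index, significance)
    for the nibble in slot i, or None for an unused (zero-padded) slot.
    """
    row, controls = [], []
    for index, weight in enumerate(weights):
        for significance in (0, 4):
            nibble = (weight >> significance) & 0xF
            if nibble:  # zero-valued nibbles are discarded
                row.append(nibble)
                controls.append((index, significance))
    if len(row) > width:
        raise ValueError("not sparse enough to fit in a single row")
    padding = width - len(row)
    return row + [0] * padding, controls + [None] * padding


def multiply_row(row, controls, activation_nibble, n_weights):
    """Multiply every slot by the broadcast nibble and route by the control signals."""
    products = [0] * n_weights
    for nibble, control in zip(row, controls):
        if control is None:
            continue
        index, significance = control
        products[index] += (activation_nibble * nibble) << significance
    return products


weights = [0x23, 0x00, 0x05, 0x0A, 0x40, 0x07, 0x00, 0x91]  # hypothetical sample
row, controls = pack_row(weights)
products = multiply_row(row, controls, 0x6, len(weights))
```

In this sample, eight non-zero nibbles fill eight of the nine slots, so all eight weight-by-nibble products are formed in a single pass.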
As shown in FIG. 2C, the hybrid processing circuit may also include a plurality of variable shifting units (or "shifting units" or "variable shifting circuits") 260, which enable the hybrid processing circuit to perform floating-point operations on floating-point activations and floating-point weights in a floating-point mode of the hybrid processing circuit. Each such floating-point number may be an FP16 floating-point number (using a format such as that of the IEEE 754-2008 standard) having one sign bit, an 11-bit mantissa (or "significand") (represented by 10 bits and one implicit leading bit, or "hidden bit"), and a five-bit exponent. The 11-bit mantissa may be padded with one zero bit and divided into three nibbles: a "high" (most significant) nibble, a "low" (least significant) nibble, and a "medium" nibble (of intermediate significance), such that concatenating the high nibble, the medium nibble, and the low nibble, in that order, produces the 12-bit (padded) mantissa.
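The field layout and nibble split described above can be illustrated with Python's standard `struct` module, which supports the IEEE 754 half-precision format code `e`. The sketch assumes that the padding zero is placed above the hidden bit (so the padded mantissa keeps its numeric value), which the text does not specify; the function names are illustrative.

```python
import struct


def fp16_fields(x):
    """Return (sign, exponent field, 11-bit mantissa) of x encoded as FP16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    hidden = 1 if exponent != 0 else 0  # no hidden bit for zero and subnormals
    return sign, exponent, (hidden << 10) | fraction


def mantissa_nibbles(mantissa):
    """Split the zero-padded 12-bit mantissa into (high, medium, low) nibbles."""
    padded = mantissa & 0xFFF  # assumption: the pad bit is a leading zero
    return (padded >> 8) & 0xF, (padded >> 4) & 0xF, padded & 0xF


sign, exponent, mantissa = fp16_fields(1.5)
high, medium, low = mantissa_nibbles(mantissa)
```

With this padding choice the high nibble never exceeds 7, because its top bit is the pad bit.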
Floating point multiplication may then be performed by the hybrid processing circuit of FIG. 2C by multiplying, in each of the multipliers 205, one nibble of a weight mantissa by one nibble of an activation mantissa at a time, to form the partial products of the high, mid, and low nibbles of the mantissa of each weight with the high, mid, and low nibbles of the mantissa of the activation. The 12-bit-wide output of each inverting-shifting circuit 235 may be fed to a corresponding variable shifting unit 260. In floating point mode, the variable shifting unit 260 may shift its received data to the right by between 0 and N bits, where N may be 8, or a greater or smaller number, depending in part on the size (i.e., the number of bits) of the mantissa used in the adder tree, which may be selected based on the precision to be achieved. In integer mode, the variable shifting unit 260 may shift its received data to the left by 0 or M bits, where M may be the size (i.e., the number of bits) of the most significant sub-word of the activation value. For example, in integer mode the variable shifting unit 260 may shift its received data to the left by M bits when the activation value is the most significant sub-word of the integer activation value, and by 0 bits (i.e., not at all) when the activation value is the least significant sub-word of the integer activation value. Additional optional shifting may be obtained by selecting one or the other input of the offset adder 245 and by selecting the amount of shifting applied in the inverting-shifting circuit 235. Thus, the hybrid processing circuit of FIG. 2C is able to produce properly aligned outputs for each combination of the significances of the four input nibbles to the two multipliers 205 that feed either of the offset adders 245 during a given clock cycle (subject to the constraint that the significances of the two input values to the offset adder 245 differ by four bits, because the offset adder 245 shifts one input by four bits relative to the other input).
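The arithmetic behind the nibble-wise partial products can be checked with a short Python sketch. This is only the underlying identity, not a model of the circuit: the function names are hypothetical, and the sketch aligns partial products with left shifts of 4·(i + j) bits, whereas the variable shifting units 260 described above align them with right shifts relative to a fixed adder-tree width.

```python
def split_nibbles(value, count=3):
    """Return the nibbles of value, least significant first."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]


def mantissa_product_by_nibbles(act_mantissa, weight_mantissa):
    """Multiply two 12-bit padded mantissas using only 4-bit x 4-bit products."""
    total = 0
    for i, a in enumerate(split_nibbles(act_mantissa)):
        for j, w in enumerate(split_nibbles(weight_mantissa)):
            # Each nibble pair contributes one 8-bit partial product whose
            # significance is the sum of the significances of its two nibbles.
            total += (a * w) << (4 * (i + j))
    return total


# Nine partial products (3 x 3 nibbles) reproduce the full mantissa product.
assert mantissa_product_by_nibbles(0x600, 0x580) == 0x600 * 0x580
```

Pairs with a zero nibble contribute nothing to the sum, which is why skipping them, as the sparsity handling in this disclosure aims to do, leaves the result unchanged.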
FIG. 3 illustrates an example of preprocessing for an array of floating point weights. In a floating point representation, nibble sparsity (which may be relatively common for integer weights, e.g., weights whose most significant nibble is zero) may be relatively rare, but the majority of the weights may be equal to zero, as shown for the original weight array 305, in which all three nibbles (low (L), medium (M), and high (H)) of each of three of the weights are zero. FIG. 3 shows how the three nibbles of the mantissa of each non-zero weight (weights 0, 2, 4, 5, and 6) may be rearranged, first forming a first intermediate matrix 310, then a second intermediate matrix 315, and then a final matrix 320, which may be suitable for storage in a weight buffer. In the final matrix 320, all non-zero elements are in the first two rows, and all products can be formed in two operations (e.g., in two clock cycles), whereas if the original weight array 305 were loaded into the weight buffer, three operations would be used.
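The saving from three operations to two can be reproduced with a simple counting sketch in Python. It does not implement the rearrangement steps of FIG. 3, which are not detailed here; it only computes a lower bound under two assumptions that are not from the disclosure: each operation consumes one row of the array (one nibble slot per weight lane), and a nibble may be moved into any free slot. The function name and the nibble values are hypothetical.

```python
import math

NIBBLES_PER_WEIGHT = 3  # low, mid, and high mantissa nibbles


def operations_needed(weights, lanes):
    """Return (rows without packing, lower bound on rows with packing).

    weights: list of (low, mid, high) nibble tuples; a weight is treated as
             zero when all three of its nibbles are zero.
    lanes:   number of nibble slots processed per operation (assumed).
    """
    nonzero = [w for w in weights if any(w)]
    if not nonzero:
        return 0, 0
    unpacked = NIBBLES_PER_WEIGHT
    packed = math.ceil(len(nonzero) * NIBBLES_PER_WEIGHT / lanes)
    return unpacked, packed


# Eight weights, with weights 1, 3, and 7 equal to zero (values are made up).
weights = [
    (1, 2, 3), (0, 0, 0), (4, 5, 6), (0, 0, 0),
    (7, 8, 9), (1, 1, 1), (2, 2, 2), (0, 0, 0),
]
unpacked_ops, packed_ops = operations_needed(weights, lanes=8)
```

With five non-zero weights there are 15 nibbles to place in rows of 8 slots, so at least two rows are needed, which agrees with the two operations stated for the final matrix 320. Real hardware may restrict how far a nibble can move between lanes, so the bound is not always attainable.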
Although some examples are presented herein for embodiments having 8-bit weights, 8-bit activation values, weight buffers that are four weights wide, and weights and activations that are processed one nibble at a time, it will be understood that these parameters and other similar parameters in this disclosure are used only as specific examples, for ease of explanation, and that any of these parameters may be changed. Thus, for example, the size of a weight may be a "word" and the size of a portion of a weight may be a "sub-word", with the word size being one byte and the sub-word size being one nibble in the embodiment of FIG. 2A. In other embodiments, for example, the word may be 12 bits and the sub-word 6 bits, or the word may be 16 bits and the sub-word one byte. Further, the method of performing integer operations and floating point operations described in the present application may be a method of performing multiplication operations by a neural network processor, and the components described in the present application (e.g., SRAM, MAC, multiplier, combining circuit, adder tree, multiplexer, adder, inverting-shifting circuit, variable shifting unit, weight buffer, etc.) may be included in the neural network processor.
As used herein, a "portion" of an item means "at least some" of the item, and thus may represent less than all of the item or all of the item. Thus, as a special case, "a portion" of an item includes the entire item (i.e., the entire item is an example of a portion of the item). As used herein, the term "or" should be interpreted as "and/or", such that, for example, "A or B" means any one of "A", "B", or "A and B".
The terms "processing circuit" and "means for processing" as used herein each refer to any combination of hardware, firmware, and software for processing data or digital signals. The processing circuit hardware may include, for example, Application Specific Integrated Circuits (ASICs), general-purpose or special-purpose Central Processing Units (CPUs), Digital Signal Processors (DSPs), Graphics Processing Units (GPUs), and programmable logic devices such as Field Programmable Gate Arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured (i.e., hardwired) to perform that function, or by more general-purpose hardware (such as a CPU) configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single Printed Circuit Board (PCB) or distributed over several interconnected PCBs. A processing circuit may include other processing circuits; for example, a processing circuit may include two processing circuits (an FPGA and a CPU) interconnected on a PCB.
As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being "based on" a second quantity (e.g., a second variable), this means that the second quantity is an input to the method or influences the first quantity; for example, the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as the second quantity (e.g., stored at the same location or locations in memory).
It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Accordingly, a first element, first component, first region, first layer or first section discussed herein could be termed a second element, second component, second region, second layer or second section without departing from the spirit and scope of the inventive concept.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms "substantially," "about," and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Expressions such as "at least one of," when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Furthermore, the use of "may" when describing embodiments of the inventive concept refers to "one or more embodiments of the present disclosure." Furthermore, the term "exemplary" is intended to refer to an example or illustration. As used herein, the term "use" and variants thereof may be considered synonymous with the term "utilize" and variants thereof.
It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to," or "directly adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges subsumed with the same numerical precision within the range recited. For example, a range of "1.0 to 10.0" or "between 1.0 and 10.0" is intended to include all subranges between (and including) the minimum value of 1.0 and the maximum value of 10.0 listed (i.e., having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0 (such as, for example, 2.4 to 7.6)). Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited herein is intended to include all higher numerical limitations subsumed therein.
Although exemplary embodiments of processors for fine-grained sparse integer and floating point operations have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a processor for fine-grained sparse integer and floating point operations constructed according to the principles of the present disclosure may be embodied other than as specifically described herein. The invention is also defined in the appended claims and equivalents thereof.
Claims (20)
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063112271P | 2020-11-11 | 2020-11-11 | |
| US63/112,271 | 2020-11-11 | | |
| US17/131,357 | 2020-12-22 | | |
| US17/131,357 US11861327B2 (en) | 2020-11-11 | 2020-12-22 | Processor for fine-grain sparse integer and floating-point operations |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114548387A (en) | 2022-05-27 |
| CN114548387B (en) | 2026-01-02 |
Family
ID=77998845
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111326201.7A (Active, CN114548387B (en)) | 2020-11-11 | 2021-11-10 | Methods for performing multiplication operations in neural network processors and neural network processors |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US11861327B2 (en) |
| EP (1) | EP4002093B1 (en) |
| KR (1) | KR102832860B1 (en) |
| CN (1) | CN114548387B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12229659B2 (en) | 2020-10-08 | 2025-02-18 | Samsung Electronics Co., Ltd. | Processor with outlier accommodation |
| US11861327B2 (en) | 2020-11-11 | 2024-01-02 | Samsung Electronics Co., Ltd. | Processor for fine-grain sparse integer and floating-point operations |
| US11861328B2 (en) * | 2020-11-11 | 2024-01-02 | Samsung Electronics Co., Ltd. | Processor for fine-grain sparse integer and floating-point operations |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114546333A (en) * | 2020-11-11 | 2022-05-27 | 三星电子株式会社 | Processor and method of operating a processor |
Family Cites Families (34)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7428566B2 (en) | 2004-11-10 | 2008-09-23 | Nvidia Corporation | Multipurpose functional unit with multiply-add and format conversion pipeline |
| US8042073B1 (en) | 2007-11-28 | 2011-10-18 | Marvell International Ltd. | Sorted data outlier identification |
| US9628107B2 (en) | 2014-04-07 | 2017-04-18 | International Business Machines Corporation | Compression of floating-point data by identifying a previous loss of precision |
| EP3230899A4 (en) * | 2014-12-10 | 2018-08-01 | Kyndi, Inc. | Weighted subsymbolic data encoding |
| EP3274930A4 (en) | 2015-03-24 | 2018-11-21 | Hrl Laboratories, Llc | Sparse inference modules for deep learning |
| EP3465550B1 (en) * | 2016-05-26 | 2023-09-27 | Samsung Electronics Co., Ltd. | Accelerator for deep neural networks |
| CN109328361B (en) * | 2016-06-14 | 2020-03-27 | 多伦多大学管理委员会 | Accelerator for deep neural network |
| CN119251375A (en) | 2016-08-19 | 2025-01-03 | 莫维迪厄斯有限公司 | Dynamic culling of matrix operations |
| US10360163B2 (en) * | 2016-10-27 | 2019-07-23 | Google Llc | Exploiting input data sparsity in neural network compute units |
| US11544545B2 (en) | 2017-04-04 | 2023-01-03 | Hailo Technologies Ltd. | Structured activation based sparsity in an artificial neural network |
| WO2018199721A1 (en) | 2017-04-28 | 2018-11-01 | 서울대학교 산학협력단 | Method and apparatus for accelerating data processing in neural network |
| US10725740B2 (en) * | 2017-08-31 | 2020-07-28 | Qualcomm Incorporated | Providing efficient multiplication of sparse matrices in matrix-processor-based devices |
| EP3704638A1 (en) * | 2017-10-30 | 2020-09-09 | Fraunhofer Gesellschaft zur Förderung der Angewand | Neural network representation |
| GB2568083B (en) | 2017-11-03 | 2021-06-02 | Imagination Tech Ltd | Histogram-based per-layer data format selection for hardware implementation of deep neutral network |
| US11144815B2 (en) * | 2017-12-04 | 2021-10-12 | Optimum Semiconductor Technologies Inc. | System and architecture of neural network accelerator |
| US20190171927A1 (en) | 2017-12-06 | 2019-06-06 | Facebook, Inc. | Layer-level quantization in neural networks |
| US11630666B2 (en) | 2018-02-13 | 2023-04-18 | Shanghai Cambricon Information Technology Co., Ltd | Computing device and method |
| US10678508B2 (en) * | 2018-03-23 | 2020-06-09 | Amazon Technologies, Inc. | Accelerated quantized multiply-and-add operations |
| US10769526B2 (en) | 2018-04-24 | 2020-09-08 | Intel Corporation | Machine learning accelerator architecture |
| US12099912B2 (en) * | 2018-06-22 | 2024-09-24 | Samsung Electronics Co., Ltd. | Neural processor |
| US11151769B2 (en) | 2018-08-10 | 2021-10-19 | Intel Corporation | Graphics architecture including a neural network pipeline |
| US10871946B2 (en) | 2018-09-27 | 2020-12-22 | Intel Corporation | Methods for using a multiplier to support multiple sub-multiplication operations |
| US11341369B2 (en) | 2018-11-15 | 2022-05-24 | Nvidia Corporation | Distributed batch normalization using partial populations |
| KR102775183B1 (en) * | 2018-11-23 | 2025-03-04 | 삼성전자주식회사 | Neural network device for neural network operation, operating method of neural network device and application processor comprising neural network device |
| US12045724B2 (en) | 2018-12-31 | 2024-07-23 | Microsoft Technology Licensing, Llc | Neural network activation compression with outlier block floating-point |
| US20200226444A1 (en) | 2019-01-15 | 2020-07-16 | BigStream Solutions, Inc. | Systems, apparatus, methods, and architecture for precision heterogeneity in accelerating neural networks for inference and training |
| CN109901814A (en) * | 2019-02-14 | 2019-06-18 | 上海交通大学 | Custom floating point number and its calculation method and hardware structure |
| US12165038B2 (en) | 2019-02-14 | 2024-12-10 | Microsoft Technology Licensing, Llc | Adjusting activation compression for neural network training |
| US12182577B2 (en) | 2019-05-01 | 2024-12-31 | Samsung Electronics Co., Ltd. | Neural-processing unit tile for shuffling queued nibbles for multiplication with non-zero weight nibbles |
| US11880760B2 (en) | 2019-05-01 | 2024-01-23 | Samsung Electronics Co., Ltd. | Mixed-precision NPU tile with depth-wise convolution |
| US11726950B2 (en) | 2019-09-28 | 2023-08-15 | Intel Corporation | Compute near memory convolution accelerator |
| CN110852422B (en) * | 2019-11-12 | 2022-08-16 | 吉林大学 | Convolutional neural network optimization method and device based on pulse array |
| US11714998B2 (en) * | 2020-05-05 | 2023-08-01 | Intel Corporation | Accelerating neural networks with low precision-based multiplication and exploiting sparsity in higher order bits |
| US11861327B2 (en) | 2020-11-11 | 2024-01-02 | Samsung Electronics Co., Ltd. | Processor for fine-grain sparse integer and floating-point operations |
2020
- 2020-12-22: US application US17/131,357, patent US11861327B2 (en), status Active

2021
- 2021-09-28: EP application EP21199360.5A, patent EP4002093B1 (en), status Active
- 2021-11-10: CN application CN202111326201.7A, patent CN114548387B (en), status Active
- 2021-11-11: KR application KR1020210154771A, patent KR102832860B1 (en), status Active
Also Published As
| Publication number | Publication date |
|---|---|
| US11861327B2 (en) | 2024-01-02 |
| KR102832860B1 (en) | 2025-07-10 |
| EP4002093B1 (en) | 2026-04-29 |
| US20220147312A1 (en) | 2022-05-12 |
| EP4002093A1 (en) | 2022-05-25 |
| CN114548387A (en) | 2022-05-27 |
| KR20220064337A (en) | 2022-05-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11042360B1 (en) | Multiplier circuitry for multiplying operands of multiple data types | |
| KR102743265B1 (en) | Multiplication and accumulation circuit | |
| US11907719B2 (en) | FPGA specialist processing block for machine learning | |
| CN114548387B (en) | Methods for performing multiplication operations in neural network processors and neural network processors | |
| Jovanovic et al. | FPGA accelerator for floating-point matrix multiplication | |
| US11809798B2 (en) | Implementing large multipliers in tensor arrays | |
| US20220075598A1 (en) | Systems and Methods for Numerical Precision in Digital Multiplier Circuitry | |
| EP4459454A2 (en) | Numerical precision in digital multiplier circuitry | |
| JP2019121398A (en) | Accelerated computing method and system using lookup table | |
| US11609741B2 (en) | Apparatus and method for processing floating-point numbers | |
| KR102832857B1 (en) | Processor for fine-grain sparse integer and floating-point operations | |
| WO2023287589A1 (en) | Multiplier and adder in systolic array | |
| CN114341796B (en) | Signed multiple word multiplier | |
| US20210034327A1 (en) | Apparatus and Method for Processing Floating-Point Numbers | |
| GB2641274A (en) | Shared partial products for a dot product operation in hardware | |
| JPH05173761A (en) | Binary integer multiplier | |
| US12229659B2 (en) | Processor with outlier accommodation | |
| US20230176819A1 (en) | Pipelined processing of polynomial computation | |
| WO2024072251A1 (en) | System and method for matrix multiplication | |
| HK40072869B (en) | Signed multiword multiplier | |
| HK40072869A (en) | Signed multiword multiplier | |
| Milutinovic | FPGA accelerator for floating-point matrix multiplication | |
| Jafari et al. | Design of a Multiplier for Similar Base Numbers Without Converting Base Using a Data Oriented Memory |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||