CN114548387B - Methods for performing multiplication operations in neural network processors and neural network processors - Google Patents

Methods for performing multiplication operations in neural network processors and neural network processors

Info

Publication number
CN114548387B
Authority
CN
China
Prior art keywords
product
activation value
multiplier
weight
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111326201.7A
Other languages
Chinese (zh)
Other versions
CN114548387A (en)
Inventor
阿里·沙菲·阿得斯塔尼
约瑟夫·哈松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Publication of CN114548387A
Application granted
Publication of CN114548387B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G06F5/012 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising in floating-point computations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/4836 Computations with rational numbers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487 Multiplying; Dividing
    • G06F7/4876 Multiplying
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Nonlinear Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

A method for performing a multiplication operation in a neural network processor, and a neural network processor, are provided. In some embodiments, the method includes forming a first set of products and forming a second set of products. The step of forming the first set of products may include multiplying a first activation value by the least significant sub-word of a first weight in a first multiplier and by the most significant sub-word of the first weight in a second multiplier, to form a first partial product and a second partial product, and adding the first partial product to the second partial product. The step of forming the second set of products may include multiplying a second activation value by a first sub-word of a mantissa in the first multiplier and by a second sub-word of the mantissa in the second multiplier, to form a third partial product and a fourth partial product, and adding the third partial product to the fourth partial product.

Description

Method for executing multiplication operation by neural network processor and neural network processor
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/112,271, entitled "SYSTEM AND METHOD FOR IMPROVING AREA AND POWER EFFICIENCY BY REDISTRIBUTING WEIGHT NIBBLES", filed 11/2020, and U.S. Application No. 17/131,357, filed 12/22/2020, the entire contents of which are incorporated herein by reference.
Technical Field
One or more aspects in accordance with embodiments of the present disclosure relate to processing circuitry, and more particularly, to systems and methods for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.
Background
Processors for neural networks may perform a large number of multiplication and addition operations, some of which may waste processing resources because a large portion of the numbers being processed may be relatively small and only a small portion (the outliers) may be relatively large. Further, some operations in such systems may be integer operations and some may be floating point operations, which may consume a significant amount of chip area and power if performed on separate sets of dedicated hardware.
Thus, there is a need for a system and method for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.
Disclosure of Invention
According to an embodiment of the invention, there is provided a method comprising: forming a first set of products, each product of the first set of products being an integer product of a first activation value and a corresponding weight of a first plurality of weights; and forming a second set of products, each product of the second set of products being a floating point product of a second activation value and a corresponding weight of a second plurality of weights. The step of forming the first set of products comprises: multiplying, in a first multiplier, the first activation value by a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in a second multiplier, the first activation value by a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The step of forming the second set of products comprises: multiplying, in the first multiplier, the second activation value by a first sub-word of the mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value by a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.
In some embodiments, the second activation value is a nibble of the mantissa of a floating point activation value.
In some embodiments, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.
In some embodiments, the step of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.
In some embodiments, the offset of the offset adder is equal to the width of the first subword of the mantissa.
In some embodiments, the step of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.
In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.
In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.
In some embodiments, the first activation value is the most significant sub-word of the integer activation value.
In some embodiments, the method further comprises shifting the sum of the first partial product and the second partial product to the left by a number of bits equal to the size (bit width) of the first activation value.
According to an embodiment of the invention, there is provided a system comprising a processing circuit comprising a first multiplier, a second multiplier and a third multiplier, the processing circuit being configured to: form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights; and form a second set of products, each product of the second set of products being a floating point product of a second activation value and a respective weight of a second plurality of weights. The process of forming the first set of products comprises: multiplying, in the first multiplier, the first activation value by a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in the second multiplier, the first activation value by a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The process of forming the second set of products comprises: multiplying, in the first multiplier, the second activation value by a first sub-word of the mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value by a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.
In some embodiments, the second activation value is a nibble of the mantissa of a floating point activation value.
In some embodiments, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.
In some embodiments, the process of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.
In some embodiments, the offset of the offset adder is equal to the width of the first subword of the mantissa.
In some embodiments, the process of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.
In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.
In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.
In some embodiments, the first activation value is the most significant sub-word of the integer activation value.
According to an embodiment of the invention, there is provided a system comprising means for processing comprising a first multiplier, a second multiplier and a third multiplier, the means for processing being configured to: form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights; and form a second set of products, each product of the second set of products being a floating point product of a second activation value and a respective weight of a second plurality of weights. The process of forming the first set of products comprises: multiplying, in the first multiplier, the first activation value by a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in the second multiplier, the first activation value by a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The process of forming the second set of products comprises: multiplying, in the first multiplier, the second activation value by a first sub-word of the mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value by a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.
Drawings
These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims and appended drawings, wherein:
FIG. 1 is a block diagram of a portion of a neural network processor, according to an embodiment of the present disclosure;
FIG. 2A is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure;
FIG. 2B is a data map according to an embodiment of the present disclosure;
FIG. 2C is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure; and
FIG. 3 is a data map according to an embodiment of the present disclosure.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of processors for fine-grained sparse integer and floating-point operations provided in accordance with the present disclosure, and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. However, it is to be understood that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. Like element numbers are intended to indicate like elements or features as shown elsewhere herein.
A neural network (e.g., when performing inference) may perform a large number of computations in which an activation (or "activation value"), which is an element of the input feature map (IFM), is multiplied by a weight. The products of the activations and the weights may form a multi-dimensional array, which may be summed along one or more axes to form an array or "tensor" (which may be referred to as an output feature map (OFM)). Referring to FIG. 1, dedicated hardware may be employed to perform such calculations. The activations may be stored in a static random access memory (SRAM) 105 and fed into a multiplier accumulator (MAC) array, which may include (i) a plurality of blocks (which may be referred to as "bricks" 110), each of which may include a plurality of multipliers for multiplying the activations by the weights, (ii) one or more adder trees for adding together the products generated by the bricks, and (iii) one or more accumulators for accumulating the sums generated by the adder trees. Each activation value may be broadcast to a plurality of multipliers conceptually arranged in a row in the representation of FIG. 1. A plurality of adder trees 115 may be employed to form the sums.
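The dataflow described above can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit of FIG. 1: the function name `mac_array` and the choice of one adder tree per column of multipliers are assumptions made for the example.

```python
def mac_array(activations, weights):
    """Multiply each broadcast activation by a row of weights and sum each column.

    activations: list of ints, one per row of multipliers.
    weights: list of rows; weights[i][j] feeds the multiplier in row i, column j.
    Returns one sum per column (assumption: one adder tree per column).
    """
    n_cols = len(weights[0])
    sums = [0] * n_cols
    for a, row in zip(activations, weights):
        for j, w in enumerate(row):
            sums[j] += a * w  # activation a is broadcast to every multiplier in its row
    return sums


print(mac_array([1, 2, 3], [[1, 0], [0, 1], [2, 2]]))  # [7, 8]
```

In hardware the per-column sums would be accumulated over many cycles; the sketch shows a single pass only.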
In operation, the weights may fall within a range of values, and the distribution of the weight values may be such that relatively small weights are significantly more common than relatively large weights. For example, if each weight is represented as an 8-bit number (e.g., an 8-bit integer, INT8, or an 8-bit floating point number), then many weights (e.g., most of the weights, or more than 3/4 of the weights) may have a value less than 16 (i.e., the most significant nibble is zero); a weight with a non-zero most significant nibble may then be referred to as an "outlier". In some embodiments, appropriately configured hardware may achieve increased speed and power efficiency by taking advantage of these characteristics of the weights. However, the inventive concept is not so limited, and, for example, each weight may also be represented as a 16-bit number (e.g., a 16-bit integer or a 16-bit floating point number, FP16).
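A minimal sketch of the outlier test described above, assuming unsigned 8-bit weight magnitudes (sign handling is left out). The helper names `split_nibbles` and `is_outlier` and the sample weights are hypothetical.

```python
def split_nibbles(w):
    """Return (most significant nibble, least significant nibble) of an 8-bit value."""
    if not 0 <= w <= 0xFF:
        raise ValueError("expected an unsigned 8-bit value")
    return w >> 4, w & 0xF


def is_outlier(w):
    """A weight is an outlier when its most significant nibble is non-zero (w >= 16)."""
    msn, _ = split_nibbles(w)
    return msn != 0


weights = [3, 15, 16, 200, 7, 0, 9, 31]  # made-up sample values
print([w for w in weights if is_outlier(w)])  # [16, 200, 31]
```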
FIG. 2A shows a portion of a hybrid processing circuit (referred to as "hybrid" because it is suitable for both integer and floating point operations). Referring to FIG. 2A, in some embodiments, a plurality of multipliers 205 are used to multiply the weights by the activations, e.g., one nibble at a time. Each multiplier may be a 4 x 4 (i.e., 4-bit by 4-bit) multiplier, with a first input 210 configured to receive a respective weight nibble and a second input 215 configured to receive an activation nibble (which may be broadcast to all of the multipliers). An embodiment with nine multipliers is shown; in some embodiments there are more multipliers (resulting in a more capable but more costly circuit), and in some embodiments there are fewer multipliers (resulting in a less capable but less costly circuit). The weight buffer 220 (only the output row of which is shown) may include a respective register for each multiplier 205. The outputs of the multipliers 205 may be fed to a plurality of combining circuits 225, each of which may include one or more multiplexers 230, an adder, and an inverting-shifting circuit 235. The system may include the same number of combining circuits 225 as multipliers 205, fewer combining circuits 225 than multipliers 205 (as shown), or more combining circuits 225 than multipliers 205.
In operation, each multiplier may produce a partial product during each clock cycle. These partial products may be added together to form integer products in an integer operation (or further partial products in a floating point operation), each of which may be processed by an inverting-shifting circuit 235, and the results (e.g., Unit 0, Unit 1, ..., Unit 7) may be sent to adder trees (e.g., adder tree 0, adder tree 1, ..., adder tree 7) to be added to other integer products. For example, as shown in FIG. 2A, the first two values in the output row of the weight buffer 220 may be the least significant nibble L0 and the most significant nibble M0 of the first weight, respectively, and the activation value nibble being broadcast may be the least significant nibble of a first (8-bit) activation value. However, the inventive concept is not limited thereto, and, in another example, the first two values in the output row of the weight buffer 220 may be the most significant nibble M0 and the least significant nibble L0 of the first weight, respectively. The activation value nibble may be multiplied by the least significant nibble L0 of the first weight to form a first partial product P0, and the activation value nibble may be multiplied by the most significant nibble M0 of the first weight to form a second partial product P1. Although FIG. 2A illustrates an example in which the first partial product P0 and the second partial product P1 are calculated using two multipliers (e.g., a first multiplier and a second multiplier) during one clock cycle, the inventive concept is not limited thereto, and, in another example, the first partial product P0 and the second partial product P1 may be calculated using the same multiplier (e.g., a third multiplier) during two respective clock cycles. In FIG. 2A, the output row of the weight buffer 220 may include nibbles L2, L3, M4, L7, M7, and 0, and the partial products P2, P3, ..., Pn-1 may be formed in a similar manner. The partial products P0 and P1 may be routed to the first combining circuit 225 (the leftmost one in FIG. 2A) through the connection structure (or "connection fabric") 240. The connection structure 240 may include the multiplexers 230 (e.g., UL0, UM0, UL1, UM1, ..., UL7, and UM7), which are depicted as separate elements in FIG. 2A to make it easier to illustrate (using arrows) the data routing they perform. In the first combining circuit 225, the product of (i) the weight (two nibbles) and (ii) the activation value nibble may be calculated as an offset sum of the first partial product and the second partial product (calculated by the corresponding offset adder 245).
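The formation of P0 and P1 and their combination can be modeled as follows. This is a sketch assuming unsigned nibbles and a four-bit offset; `mul4x4` and `weight_times_activation_nibble` are hypothetical names, not elements of FIG. 2A.

```python
def mul4x4(x, y):
    """Software model of a 4 x 4 unsigned multiplier; both inputs must be nibbles."""
    if not (0 <= x <= 0xF and 0 <= y <= 0xF):
        raise ValueError("inputs must be 4-bit values")
    return x * y  # the result fits in 8 bits


def weight_times_activation_nibble(weight, act_nibble):
    """Product of an 8-bit weight and one activation nibble from two partial products."""
    l0, m0 = weight & 0xF, weight >> 4
    p0 = mul4x4(l0, act_nibble)   # first partial product (least significant nibble)
    p1 = mul4x4(m0, act_nibble)   # second partial product (most significant nibble)
    return p0 + (p1 << 4)         # offset sum: P1 is shifted left by four bits
```

For every 8-bit weight and 4-bit activation nibble the result equals the ordinary product, which is what lets two small multipliers stand in for one wider multiplier.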
As used herein, an "offset sum" of two values is the result of an "offset addition", which is the forming of the sum of (i) a first one of the two values and (ii) the second one of the two values shifted to the left by a number of bits (e.g., four bits), and an "offset adder" is an adder that performs the addition of two numbers with an offset between the positions of the least significant bits of the two numbers (referred to as the "offset" of the offset adder). As used herein, the "significance" of a nibble (or, more generally, of a sub-word, discussed in further detail below) is the position that the nibble occupies in the word of which it is a part (e.g., whether the nibble is the most significant nibble or the least significant nibble of an 8-bit word). Thus, the most significant nibble of an 8-bit word has a significance four bits greater than that of the least significant nibble. For example, for an integer operation, the difference between the significance of the first sub-word of a weight and the significance of the second sub-word of the weight is equal to the width of the first sub-word of the weight, and the offset of the offset adder is equal to the width of the first sub-word of the weight. Similarly, for a floating point operation, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa, and the offset of the offset adder is equal to the width of the first sub-word of the mantissa. The difference between the significance of the first sub-word and the significance of the second sub-word corresponds to the offset between the position occupied by the first sub-word in the word and the position occupied by the second sub-word in the word. A word (e.g., of N bits, N being an integer greater than 1) may be divided into a plurality of sub-words. The sub-word that occupies the most significant position in the word is referred to as the most significant sub-word, and the sub-word that occupies the least significant position in the word is referred to as the least significant sub-word.
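The definitions above can be restated compactly. The notation below (sub-word width k, sub-words s_i, and the operator name "offsetsum") is introduced here only for illustration and does not appear in the original text.

```latex
% A word w split into sub-words s_0 (least significant), s_1, ..., each k bits wide:
w = \sum_{i} 2^{ik}\, s_i ,
\qquad \text{significance of } s_i = ik \text{ bits}.

% Offset sum with offset k (k = 4 for nibbles):
\operatorname{offsetsum}_k(x, y) = x + 2^{k}\, y .

% For an 8-bit weight w = 2^4 M_0 + L_0 and an activation nibble a:
P_0 = a\,L_0, \qquad P_1 = a\,M_0, \qquad
a\,w = \operatorname{offsetsum}_4(P_0, P_1) = P_0 + 2^{4} P_1 .

% Worked example: w = \mathtt{0xB7} = 183,\ a = 5:
L_0 = 7,\ M_0 = 11,\quad P_0 = 35,\ P_1 = 55,\quad 35 + 16 \cdot 55 = 915 = 5 \cdot 183 .
```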
Each of the inverting-shifting circuits 235 may convert between (i) a sign and magnitude representation and (ii) a two's complement representation, and each of the inverting-shifting circuits 235 may shift the result as needed for proper addition to occur in the adder tree. For example, if the activation value nibble is the most significant nibble, the output of the offset adder 245 may be shifted (e.g., by 4 bits to the left) so that, in the adder tree, the output bits are properly aligned with the bits of other products (e.g., the product of the weight and the least significant nibble of the activation value). For example, if the multipliers 205 are unsigned integer multipliers and the adder tree is a two's complement adder tree, then conversion between the sign and magnitude representation and the two's complement representation may be performed.
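The role of the inverting-shifting step can be illustrated with a small software model. The sketch assumes an unsigned magnitude from the offset adder, a separate sign flag, and a 16-bit two's complement adder tree; the 16-bit width and the names `invert_shift` and `to_signed` are assumptions chosen for the example.

```python
def invert_shift(magnitude, negative, shift, width=16):
    """Shift an unsigned magnitude left, then encode it in two's complement."""
    value = magnitude << shift
    if negative:
        value = -value
    return value & ((1 << width) - 1)  # wrap to `width` bits


def to_signed(bits, width=16):
    """Decode a two's complement bit pattern into a Python int."""
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits


# Weight of magnitude 43 with a negative sign, unsigned activation 0xC5 (197).
low = 43 * 0x5    # product with the least significant activation nibble
high = 43 * 0xC   # product with the most significant activation nibble
total = (invert_shift(low, True, 0) + invert_shift(high, True, 4)) & 0xFFFF
print(to_signed(total))  # -8471, i.e. -43 * 197
```

The four-bit shift applied to `high` is what aligns the two products before they meet in the adder tree.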
As shown in FIG. 2B, the arrangement of weight nibbles in the weight buffer 220 may be the result of preprocessing. As shown, the original weight array 250 may include a first row of least significant nibbles (labeled "L0" and so on) and a second row of most significant nibbles (labeled "M0" and so on). Some of the nibbles may be zero, as shown by the blank cells in FIG. 2B. As the weight buffer is filled, the preprocessing may rearrange these nibbles (e.g., as indicated by the arrows in FIG. 2B) so that the weight buffer contains a smaller proportion of zero-valued nibbles than the original weight array 250. In the example of FIG. 2B, the eight weights (each consisting of a least significant nibble and a most significant nibble) are rearranged so that the zero-valued nibbles are discarded and the non-zero nibbles are placed into eight positions of one row of the weight buffer (the ninth position contains zero); when this row is processed by the array of nine multipliers 205 (FIG. 2A), eight of the multipliers are used and only one (the ninth) is unused. For example, as shown in FIG. 2B, the most significant nibble of weight 2 in the original weight array 250 is 0, so the weight buffer includes only the least significant nibble L2 of weight 2 and not its most significant nibble. In this case, the activation value nibble and the least significant nibble L2 of weight 2 may be multiplied in one multiplier 205 to form a partial product P2, and the partial product P2 and 0 may be fed into the combining circuit 225 to be added. Furthermore, in some cases the original weight array 250 may not be sparse enough to allow every most significant nibble to be placed in the same row of the weight buffer as the corresponding least significant nibble; some or all of the products may then be formed over two clock cycles, with the activation value remaining the same for both cycles.
The preprocessing may also generate an array of control signals that may be used to control the connection structure 240 (e.g., the multiplexers 230) so that each partial product is sent to the appropriate input of an offset adder 245 according to the significance of the factors that formed it.
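One way to picture the preprocessing is the following sketch, which packs the non-zero nibbles of eight weights into a nine-entry row and records, for each entry, which weight it belongs to and whether it is a most significant nibble. The data layout, the names `pack_row` and `multiply_row`, and the sample weights are hypothetical, and the sketch ignores any routing restrictions of the real connection structure.

```python
def pack_row(weights, n_multipliers=9):
    """Pack the non-zero nibbles of 8-bit weights into one weight-buffer row.

    Returns (row, control); control[k] is (weight index, is_most_significant)
    for a used slot and None for a padding slot.
    Raises ValueError if the weights are not sparse enough to fit in one row.
    """
    row, control = [], []
    for i, w in enumerate(weights):
        for nibble, is_ms in ((w & 0xF, False), (w >> 4, True)):
            if nibble:  # zero-valued nibbles are dropped
                row.append(nibble)
                control.append((i, is_ms))
    if len(row) > n_multipliers:
        raise ValueError("too many non-zero nibbles for one row")
    pad = n_multipliers - len(row)
    return row + [0] * pad, control + [None] * pad


def multiply_row(row, control, act_nibble, n_weights):
    """Multiply every packed nibble by the broadcast activation nibble and recombine."""
    products = [0] * n_weights
    for nibble, ctl in zip(row, control):
        if ctl is None:
            continue  # unused multiplier
        i, is_ms = ctl
        products[i] += (nibble * act_nibble) << (4 if is_ms else 0)
    return products


weights = [0x12, 0x03, 0x05, 0x40, 0x01, 0x00, 0x07, 0x0A]  # made-up values
row, control = pack_row(weights)
print(row)  # [2, 1, 3, 5, 4, 1, 7, 10, 0]
```

Here eight of the nine slots are used, mirroring the occupancy described for FIG. 2B, although the specific nibble values are invented.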
As shown in FIG. 2C, the hybrid processing circuit may also include a plurality of variable shifting units (or "shifting units" or "variable shifting circuits") 260, which enable the hybrid processing circuit, in a floating point mode, to perform floating point operations on floating point activations and floating point weights. Each such floating point number may be an FP16 floating point number (e.g., in a format according to the IEEE 754-2008 standard) having one sign bit, an 11-bit mantissa (or "significand") (represented by 10 bits and one implicit leading bit, or "hidden bit"), and a five-bit exponent. The 11-bit mantissa may be padded with one zero bit and divided into three nibbles: a "high" (most significant) nibble, a "middle" nibble, and a "low" (least significant) nibble, such that concatenating the high nibble, the middle nibble, and the low nibble, in that order, produces the 12-bit (padded) mantissa.
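The mantissa split can be tried out with Python's `struct` module, whose `"e"` format packs IEEE 754 half-precision values. The sketch assumes the padding zero is placed above the hidden bit (the text does not say on which side the padding goes), and the function names are hypothetical.

```python
import struct


def fp16_fields(x):
    """Return (sign, biased exponent, 11-bit mantissa with hidden bit) of x as FP16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    hidden = 1 if exponent != 0 else 0  # zero and subnormals have no hidden bit
    return sign, exponent, (hidden << 10) | fraction


def mantissa_nibbles(mantissa):
    """Split an 11-bit mantissa, zero-padded at the top to 12 bits, into (high, mid, low)."""
    return (mantissa >> 8) & 0xF, (mantissa >> 4) & 0xF, mantissa & 0xF


sign, exponent, mantissa = fp16_fields(-2.75)
print(sign, exponent, hex(mantissa), mantissa_nibbles(mantissa))  # 1 16 0x580 (5, 8, 0)
```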
Floating point multiplication may then be performed by the hybrid processing circuit of fig. 2C by multiplying, in each of the multipliers 205, one nibble of a weight by one nibble of an activation at a time, so as to form the partial products of the high, mid, and low nibbles of the mantissa of each weight with the high, mid, and low nibbles of the mantissa of the activation. The 12-bit-wide output of each inverting-shifting circuit 235 may be fed to a corresponding variable shifting unit 260. In floating point mode, the variable shifting unit 260 may shift the data it receives to the right by between 0 and N bits, where N may be 8, or a greater or smaller number, depending in part on the size (in bits) of the mantissa used in the adder tree, which may be selected based on the precision to be achieved. In integer mode, the variable shifting unit 260 may shift the data it receives to the left by 0 or M bits, where M may be the size (in bits) of the most significant sub-word of the activation value: for example, the shift may be M bits when the activation value is the most significant sub-word of an integer activation value, and 0 bits when the activation value is the least significant sub-word of the integer activation value. Additional optional shifting may be obtained by selecting one or the other input of the offset adder 245 and by selecting the amount of shifting applied in the inverting-shifting circuit 235. Thus, the hybrid processing circuit of fig. 2C is able to produce properly aligned outputs for each combination of the significances of the four nibbles input to the two multipliers 205 that feed any one of the offset adders 245 during a given clock cycle (subject to the constraint that the significances of the two input values to the offset adder 245 differ by four bits, the offset adder 245 shifting one input by four bits relative to the other).
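The alignment arithmetic described above can be checked with a short Python sketch. It is a behavioral model only, under several assumptions: operands are unsigned (sign handling in the inverting-shifting circuit is omitted), M is 4, and the floating point case keeps exact partial products instead of right-shifting them to a limited adder-tree width. The helper names are hypothetical.

```python
def nibbles(value: int, count: int):
    """Return `count` nibbles of `value`, least significant first."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]

def offset_add(lower: int, upper: int, offset: int = 4) -> int:
    """Add two partial products whose significances differ by `offset` bits."""
    return lower + (upper << offset)

def int8_product(activation: int, weight: int) -> int:
    """Unsigned 8-bit x 8-bit product built from four 4-bit x 4-bit products."""
    a_lo, a_hi = nibbles(activation, 2)
    w_lo, w_hi = nibbles(weight, 2)
    total = 0
    # One activation nibble is processed at a time; the left shift is 0 bits
    # for the least significant nibble and M = 4 bits for the most significant.
    for a_nib, shift in ((a_lo, 0), (a_hi, 4)):
        p_lo = a_nib * w_lo                  # first multiplier
        p_hi = a_nib * w_hi                  # second multiplier
        total += offset_add(p_lo, p_hi) << shift
    return total

def mantissa_product(m_a: int, m_w: int) -> int:
    """Exact product of two 12-bit padded mantissas from nine nibble products."""
    total = 0
    for i, a_nib in enumerate(nibbles(m_a, 3)):
        for j, w_nib in enumerate(nibbles(m_w, 3)):
            # Nibble i of one operand times nibble j of the other has
            # significance 4 * (i + j) bits.
            total += (a_nib * w_nib) << (4 * (i + j))
    return total

example = int8_product(0xA7, 0x3C)
```

In both functions the only operations are 4-bit multiplications, fixed four-bit offsets, and shifts by multiples of four bits, which is why a single array of nibble multipliers can serve the integer mode and the floating point mode.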
FIG. 3 illustrates an example of preprocessing for an array of floating point weights. In a floating point representation, nibble sparsity (which may be relatively common for integer weights, e.g., in the form of weights whose most significant nibble is zero) may be relatively rare, but a significant fraction of the weights may be equal to zero, as shown for the original weight array 305, in which all three nibbles (low (L), mid (M), and high (H)) of each of three of the weights are zero. FIG. 3 shows how the three nibbles of the mantissa of each non-zero weight (weights 0, 2, 4, 5, and 6) may be rearranged, first forming a first intermediate matrix 310, then a second intermediate matrix 315, and then a final matrix 320, which may be suitable for storage in the weight buffer. In the final matrix, all of the non-zero elements are in the first two rows, so that all of the products can be formed in two operations (e.g., in two clock cycles), whereas if the original weight array 305 were loaded into the weight buffer unchanged, three operations would be used.
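The row count in this example can be reproduced with a small Python sketch. It assumes an eight-wide buffer row and a simple left-to-right packing of the nibble labels; this is not the specific rearrangement drawn in FIG. 3 (the intermediate matrices are not described in enough detail to reproduce), and the function name is hypothetical.

```python
def pack_mantissa_nibbles(nonzero_weights, width):
    """Pack the L, M, and H mantissa nibbles of each non-zero weight into rows."""
    labels = [f"{part}{w}" for w in nonzero_weights for part in ("L", "M", "H")]
    return [labels[i:i + width] for i in range(0, len(labels), width)]

# Weights 0, 2, 4, 5, and 6 are non-zero; weights 1, 3, and 7 are zero.
packed = pack_mantissa_nibbles([0, 2, 4, 5, 6], width=8)
rows_without_packing = 3      # one row each for the L, M, and H nibbles
rows_with_packing = len(packed)
```

Five non-zero weights contribute fifteen nibbles, which fit into two rows of eight positions, matching the two operations stated above in place of three.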
Although some examples are presented herein for embodiments having 8-bit weights, 8-bit activation values, a weight buffer that is four weights wide, and weights and activations that may be processed one nibble at a time, it will be understood that these parameters and other similar parameters in this disclosure are used only as specific examples, for ease of explanation, and that any of these parameters may be changed. Thus, for example, the size of a weight may be referred to as a "word" and the size of a portion of a weight as a "sub-word", with the word size being one byte and the sub-word size being one nibble in the embodiment of FIG. 2A. In other embodiments, for example, the word may be 12 bits and the sub-word 6 bits, or the word may be 16 bits and the sub-word one byte. Further, the methods of performing integer operations and floating point operations described in the present application may be methods of performing multiplication operations by a neural network processor, and the components described in the present application (e.g., the SRAM, MAC, multipliers, combining circuit, adder tree, multiplexers, adders, inverting-shifting circuits, variable shifting units, weight buffer, etc.) may be included in the neural network processor.
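For arbitrary word and sub-word sizes, the decomposition into partial products can be written in one line. The notation below is an assumption introduced for illustration and does not appear in the original text: $s$ is the sub-word size in bits, $k$ is the number of sub-words per word, and $a_i$ and $w_j$ are the unsigned sub-words of the activation $a$ and the weight $w$, with index 0 the least significant.

```latex
a = \sum_{i=0}^{k-1} a_i \, 2^{s i}, \qquad
w = \sum_{j=0}^{k-1} w_j \, 2^{s j}, \qquad
a \cdot w = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i \, w_j \, 2^{s (i + j)}
```

With $s = 4$ and $k = 2$ this reduces to the byte and nibble case of FIG. 2A: the inner sum over $j$ corresponds to the four-bit offset addition of two partial products, and the factor $2^{s i}$ to the left shift by 0 or $M = s$ bits selected by the significance of the activation sub-word.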
As used herein, "a portion of" something means "at least some of" the thing, and as such may mean less than all of, or all of, the thing. As such, "a portion of" a thing includes the entire thing as a special case (i.e., the entire thing is an example of a portion of the thing). As used herein, the term "or" should be interpreted as "and/or", such that, for example, "A or B" means any one of "A", "B", and "A and B".
The terms "processing circuit" and "means for processing" are each used herein to mean any combination of hardware, firmware, and software employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured (i.e., hard-wired) to perform that function, or by more general-purpose hardware (such as a CPU) configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.
As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being "based on" a second quantity (e.g., a second variable), this means that the second quantity is an input to the method or influences the first quantity; for example, the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as the second quantity (e.g., stored at the same location or locations in memory).
It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Accordingly, a first element, first component, first region, first layer or first section discussed herein could be termed a second element, second component, second region, second layer or second section without departing from the spirit and scope of the inventive concept.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms "substantially," "about," and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Expressions such as "at least one of," when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of "may" when describing embodiments of the inventive concept refers to "one or more embodiments of the present disclosure." Also, the term "exemplary" is intended to refer to an example or illustration. As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively.
It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to," or "directly adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of "1.0 to 10.0" or "between 1.0 and 10.0" is intended to include all sub-ranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited herein is intended to include all higher numerical limitations subsumed therein.
Although exemplary embodiments of a processor for fine-grain sparse integer and floating-point operations have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a processor for fine-grain sparse integer and floating-point operations constructed according to the principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

Claims (20)

1. A method of performing multiplication operations by a neural network processor, the method comprising:
forming a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights; and/or
forming a second set of products, each product of the second set of products being a floating-point product of a second activation value and a respective weight of a second plurality of weights,
wherein forming the first set of products comprises:
multiplying, in a first multiplier, the first activation value and a least significant sub-word of a first weight of the first plurality of weights to form a first partial product,
multiplying, in a second multiplier, the first activation value and a most significant sub-word of the first weight of the first plurality of weights to form a second partial product, and
adding the first partial product and the second partial product to form a product of the first set of products, and
wherein forming the second set of products comprises:
multiplying, in the first multiplier, the second activation value and a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product,
multiplying, in the second multiplier, the second activation value and a second sub-word of the mantissa to form a fourth partial product, and
adding the third partial product and the fourth partial product to form a product of the second set of products.

2. The method of claim 1, wherein the second activation value is a nibble of a mantissa of a floating-point activation value.

3. The method of claim 1, wherein a difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.

4. The method of claim 1, wherein adding the third partial product and the fourth partial product comprises performing an offset addition in a first offset adder.

5. The method of claim 4, wherein the offset of the first offset adder is equal to the width of the first sub-word of the mantissa.

6. The method of claim 4, wherein adding the first partial product and the second partial product comprises performing an offset addition in the first offset adder.

7. The method of any one of claims 1 to 6, wherein forming the first set of products further comprises:
multiplying, in a third multiplier, the first activation value and a least significant sub-word of a second weight of the first plurality of weights to form a fifth partial product;
multiplying, in the third multiplier, the first activation value and a most significant sub-word of the second weight of the first plurality of weights to form a sixth partial product; and
adding the fifth partial product and the sixth partial product to form a product of the first set of products.

8. The method of any one of claims 1 to 6, wherein forming the first set of products further comprises:
multiplying, in a fourth multiplier, the first activation value and a least significant sub-word of a third weight of the first plurality of weights to form a seventh partial product, the third weight having a most significant nibble equal to zero; and
adding the seventh partial product and zero to form a product of the first set of products.

9. The method of any one of claims 1 to 6, wherein the first activation value is a most significant sub-word of an integer activation value.

10. The method of claim 9, further comprising: shifting the sum of the first partial product and the second partial product to the left by a number of bits equal to the size of the first activation value.

11. A neural network processor, comprising: a processing circuit, the processing circuit comprising a first multiplier and a second multiplier,
the processing circuit being configured to:
form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights, and/or
form a second set of products, each product of the second set of products being a floating-point product of a second activation value and a respective weight of a second plurality of weights,
wherein forming the first set of products comprises:
multiplying, in the first multiplier, the first activation value and a least significant sub-word of a first weight of the first plurality of weights to form a first partial product,
multiplying, in the second multiplier, the first activation value and a most significant sub-word of the first weight of the first plurality of weights to form a second partial product, and
adding the first partial product and the second partial product to form a product of the first set of products, and
wherein forming the second set of products comprises:
multiplying, in the first multiplier, the second activation value and a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product,
multiplying, in the second multiplier, the second activation value and a second sub-word of the mantissa to form a fourth partial product, and
adding the third partial product and the fourth partial product to form a product of the second set of products.

12. The neural network processor of claim 11, wherein the second activation value is a nibble of a mantissa of a floating-point activation value.

13. The neural network processor of claim 11, wherein a difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.

14. The neural network processor of claim 11, wherein adding the third partial product and the fourth partial product comprises performing an offset addition in a first offset adder of the processing circuit.

15. The neural network processor of claim 14, wherein the offset of the first offset adder is equal to the width of the first sub-word of the mantissa.

16. The neural network processor of claim 14, wherein adding the first partial product and the second partial product comprises performing an offset addition in the first offset adder.

17. The neural network processor of any one of claims 11 to 16, wherein the processing circuit further comprises a third multiplier, and forming the first set of products further comprises:
multiplying, in the third multiplier, the first activation value and a least significant sub-word of a second weight of the first plurality of weights to form a fifth partial product;
multiplying, in the third multiplier, the first activation value and a most significant sub-word of the second weight of the first plurality of weights to form a sixth partial product; and
adding the fifth partial product and the sixth partial product to form a product of the first set of products.

18. The neural network processor of any one of claims 11 to 16, wherein the processing circuit further comprises a fourth multiplier, and forming the first set of products further comprises:
multiplying, in the fourth multiplier, the first activation value and a least significant sub-word of a third weight of the first plurality of weights to form a seventh partial product, the third weight having a most significant nibble equal to zero; and
adding the seventh partial product and zero to form a product of the first set of products.

19. The neural network processor of any one of claims 11 to 16, wherein the first activation value is a most significant sub-word of an integer activation value, and
the processing circuit is further configured to shift the sum of the first partial product and the second partial product to the left by a number of bits equal to the size of the first activation value.

20. A neural network processor, comprising: means for processing, the means for processing comprising a first multiplier, a second multiplier, and a third multiplier,
the means for processing being configured to:
form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights, and/or
form a second set of products, each product of the second set of products being a floating-point product of a second activation value and a respective weight of a second plurality of weights,
wherein forming the first set of products comprises:
multiplying, in the first multiplier, the first activation value and a least significant sub-word of a first weight of the first plurality of weights to form a first partial product,
multiplying, in the second multiplier, the first activation value and a most significant sub-word of the first weight of the first plurality of weights to form a second partial product, and
adding the first partial product and the second partial product to form a product of the first set of products, and
wherein forming the second set of products comprises:
multiplying, in the first multiplier, the second activation value and a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product,
multiplying, in the second multiplier, the second activation value and a second sub-word of the mantissa to form a fourth partial product, and
adding the third partial product and the fourth partial product to form a product of the second set of products.

CN202111326201.7A 2020-11-11 2021-11-10 Methods for performing multiplication operations in neural network processors and neural network processors Active CN114548387B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063112271P 2020-11-11 2020-11-11
US63/112,271 2020-11-11
US17/131,357 2020-12-22
US17/131,357 US11861327B2 (en) 2020-11-11 2020-12-22 Processor for fine-grain sparse integer and floating-point operations

Publications (2)

Publication Number Publication Date
CN114548387A CN114548387A (en) 2022-05-27
CN114548387B CN114548387B (en) 2026-01-02

Family

ID=77998845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111326201.7A Active CN114548387B (en) 2020-11-11 2021-11-10 Methods for performing multiplication operations in neural network processors and neural network processors

Country Status (4)

Country Link
US (1) US11861327B2 (en)
EP (1) EP4002093B1 (en)
KR (1) KR102832860B1 (en)
CN (1) CN114548387B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12229659B2 (en) 2020-10-08 2025-02-18 Samsung Electronics Co., Ltd. Processor with outlier accommodation
US11861327B2 (en) 2020-11-11 2024-01-02 Samsung Electronics Co., Ltd. Processor for fine-grain sparse integer and floating-point operations
US11861328B2 (en) * 2020-11-11 2024-01-02 Samsung Electronics Co., Ltd. Processor for fine-grain sparse integer and floating-point operations

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114546333A (en) * 2020-11-11 2022-05-27 三星电子株式会社 Processor and method of operating a processor

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428566B2 (en) 2004-11-10 2008-09-23 Nvidia Corporation Multipurpose functional unit with multiply-add and format conversion pipeline
US8042073B1 (en) 2007-11-28 2011-10-18 Marvell International Ltd. Sorted data outlier identification
US9628107B2 (en) 2014-04-07 2017-04-18 International Business Machines Corporation Compression of floating-point data by identifying a previous loss of precision
EP3230899A4 (en) * 2014-12-10 2018-08-01 Kyndi, Inc. Weighted subsymbolic data encoding
EP3274930A4 (en) 2015-03-24 2018-11-21 Hrl Laboratories, Llc Sparse inference modules for deep learning
EP3465550B1 (en) * 2016-05-26 2023-09-27 Samsung Electronics Co., Ltd. Accelerator for deep neural networks
CN109328361B (en) * 2016-06-14 2020-03-27 多伦多大学管理委员会 Accelerator for deep neural network
CN119251375A (en) 2016-08-19 2025-01-03 莫维迪厄斯有限公司 Dynamic culling of matrix operations
US10360163B2 (en) * 2016-10-27 2019-07-23 Google Llc Exploiting input data sparsity in neural network compute units
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
WO2018199721A1 (en) 2017-04-28 2018-11-01 서울대학교 산학협력단 Method and apparatus for accelerating data processing in neural network
US10725740B2 (en) * 2017-08-31 2020-07-28 Qualcomm Incorporated Providing efficient multiplication of sparse matrices in matrix-processor-based devices
EP3704638A1 (en) * 2017-10-30 2020-09-09 Fraunhofer Gesellschaft zur Förderung der Angewand Neural network representation
GB2568083B (en) 2017-11-03 2021-06-02 Imagination Tech Ltd Histogram-based per-layer data format selection for hardware implementation of deep neutral network
US11144815B2 (en) * 2017-12-04 2021-10-12 Optimum Semiconductor Technologies Inc. System and architecture of neural network accelerator
US20190171927A1 (en) 2017-12-06 2019-06-06 Facebook, Inc. Layer-level quantization in neural networks
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
US10769526B2 (en) 2018-04-24 2020-09-08 Intel Corporation Machine learning accelerator architecture
US12099912B2 (en) * 2018-06-22 2024-09-24 Samsung Electronics Co., Ltd. Neural processor
US11151769B2 (en) 2018-08-10 2021-10-19 Intel Corporation Graphics architecture including a neural network pipeline
US10871946B2 (en) 2018-09-27 2020-12-22 Intel Corporation Methods for using a multiplier to support multiple sub-multiplication operations
US11341369B2 (en) 2018-11-15 2022-05-24 Nvidia Corporation Distributed batch normalization using partial populations
KR102775183B1 (en) * 2018-11-23 2025-03-04 삼성전자주식회사 Neural network device for neural network operation, operating method of neural network device and application processor comprising neural network device
US12045724B2 (en) 2018-12-31 2024-07-23 Microsoft Technology Licensing, Llc Neural network activation compression with outlier block floating-point
US20200226444A1 (en) 2019-01-15 2020-07-16 BigStream Solutions, Inc. Systems, apparatus, methods, and architecture for precision heterogeneity in accelerating neural networks for inference and training
CN109901814A (en) * 2019-02-14 2019-06-18 上海交通大学 Custom floating point number and its calculation method and hardware structure
US12165038B2 (en) 2019-02-14 2024-12-10 Microsoft Technology Licensing, Llc Adjusting activation compression for neural network training
US12182577B2 (en) 2019-05-01 2024-12-31 Samsung Electronics Co., Ltd. Neural-processing unit tile for shuffling queued nibbles for multiplication with non-zero weight nibbles
US11880760B2 (en) 2019-05-01 2024-01-23 Samsung Electronics Co., Ltd. Mixed-precision NPU tile with depth-wise convolution
US11726950B2 (en) 2019-09-28 2023-08-15 Intel Corporation Compute near memory convolution accelerator
CN110852422B (en) * 2019-11-12 2022-08-16 吉林大学 Convolutional neural network optimization method and device based on pulse array
US11714998B2 (en) * 2020-05-05 2023-08-01 Intel Corporation Accelerating neural networks with low precision-based multiplication and exploiting sparsity in higher order bits
US11861327B2 (en) 2020-11-11 2024-01-02 Samsung Electronics Co., Ltd. Processor for fine-grain sparse integer and floating-point operations

Also Published As

Publication number Publication date
US11861327B2 (en) 2024-01-02
KR102832860B1 (en) 2025-07-10
EP4002093B1 (en) 2026-04-29
US20220147312A1 (en) 2022-05-12
EP4002093A1 (en) 2022-05-25
CN114548387A (en) 2022-05-27
KR20220064337A (en) 2022-05-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant