CN114548387B - Methods for performing multiplication operations in neural network processors and neural network processors

Info

Publication number: CN114548387B
Application number: CN202111326201.7A
Authority: CN
Inventors: 阿里·沙菲·阿得斯塔尼; 约瑟夫·哈松
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2020-11-11
Filing date: 2021-11-10
Publication date: 2026-01-02
Anticipated expiration: 2041-11-10
Also published as: US11861327B2; KR102832860B1; EP4002093B1; US20220147312A1; EP4002093A1; CN114548387A; KR20220064337A

Abstract

A method for a neural network processor to perform a multiplication operation, and a neural network processor. In some embodiments, the method includes forming a first set of products and forming a second set of products. The step of forming the first set of products may include multiplying a first activation value with a least significant sub-word and with a most significant sub-word of a first weight to form a first partial product and a second partial product, and adding the first partial product to the second partial product. The step of forming the second set of products may include multiplying a second activation value with a first sub-word and with a second sub-word of a mantissa to form a third partial product and a fourth partial product, and adding the third partial product to the fourth partial product.

Description

Method for executing multiplication operation by neural network processor and neural network processor

The present application claims priority to and the benefit of U.S. Provisional Application No. 63/112,271, entitled "SYSTEM AND METHOD FOR IMPROVING AREA AND POWER EFFICIENCY BY REDISTRIBUTING WEIGHT NIBBLES", filed in November 2020, and U.S. Application No. 17/131,357, filed on December 22, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

One or more aspects in accordance with embodiments of the present disclosure relate to processing circuitry, and more particularly, to systems and methods for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.

Background

Processors for neural networks may perform a large number of multiplication and addition operations, some of which may be a poor use of processing resources because a large fraction of the numbers being processed may be relatively small, with only a small fraction of outliers being relatively large. Further, some operations in such systems may be integer operations and some may be floating point operations, which, if performed on separate sets of dedicated hardware, may consume a significant amount of chip area and power.

Thus, there is a need for a system and method for performing multiple sets of multiplications in a manner that accommodates outliers and is capable of performing both integer and floating point operations.

Disclosure of Invention

According to an embodiment of the invention, there is provided a method comprising: forming a first set of products, each product of the first set of products being an integer product of a first activation value and a corresponding weight of a first plurality of weights; and forming a second set of products, each product of the second set of products being a floating point product of a second activation value and a corresponding weight of a second plurality of weights. The step of forming the first set of products comprises multiplying, in a first multiplier, the first activation value with a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in a second multiplier, the first activation value with a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The step of forming the second set of products comprises multiplying, in the first multiplier, the second activation value with a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value with a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.

In some embodiments, the second activation value is a nibble of the mantissa of a floating point activation value.

In some embodiments, the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.

In some embodiments, the step of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.

In some embodiments, the offset of the offset adder is equal to the width of the first subword of the mantissa.

In some embodiments, the step of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.

In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.

In some embodiments, the step of forming the first set of products further comprises multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.

In some embodiments, the first activation value is the most significant sub-word of the integer activation value.

In some embodiments, the method further comprises shifting the sum of the first partial product and the second partial product to the left by a number of bits equal to the size (i.e., the bit width) of the first activation value.

According to an embodiment of the invention, there is provided a system comprising a processing circuit comprising a first multiplier, a second multiplier, and a third multiplier, the processing circuit being configured to form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights, and to form a second set of products, each product of the second set of products being a floating point product of a second activation value and a respective weight of a second plurality of weights. The process of forming the first set of products comprises multiplying, in the first multiplier, the first activation value with a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in the second multiplier, the first activation value with a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The process of forming the second set of products comprises multiplying, in the first multiplier, the second activation value with a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value with a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.

In some embodiments, the process of adding the third partial product to the fourth partial product includes performing an offset addition in a first offset adder.

In some embodiments, the process of adding the first partial product to the second partial product includes performing an offset addition in a first offset adder.

In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a second weight of the first plurality of weights in a third multiplier to form a first partial product, multiplying the first activation value with a most significant sub-word of the second weight in the third multiplier to form a second partial product, and adding the first partial product to the second partial product.

In some embodiments, the process of forming the first set of products further includes multiplying the first activation value with a least significant sub-word of a third weight of the first plurality of weights in a fourth multiplier to form a first partial product, the third weight having a most significant nibble equal to zero, and adding the first partial product to zero.

According to an embodiment of the invention, there is provided a system comprising means for processing comprising a first multiplier, a second multiplier, and a third multiplier, the means for processing being configured to form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights, and to form a second set of products, each product of the second set of products being a floating point product of a second activation value and a respective weight of a second plurality of weights. The process of forming the first set of products comprises multiplying, in the first multiplier, the first activation value with a least significant sub-word of a first weight of the first plurality of weights to form a first partial product; multiplying, in the second multiplier, the first activation value with a most significant sub-word of the first weight to form a second partial product; and adding the first partial product to the second partial product. The process of forming the second set of products comprises multiplying, in the first multiplier, the second activation value with a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product; multiplying, in the second multiplier, the second activation value with a second sub-word of the mantissa to form a fourth partial product; and adding the third partial product to the fourth partial product.

Drawings

These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims and appended drawings, wherein:

FIG. 1 is a block diagram of a portion of a neural network processor, according to an embodiment of the present disclosure;

FIG. 2A is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure;

FIG. 2B is a data map according to an embodiment of the present disclosure;

FIG. 2C is a block diagram of a portion of a hybrid processing circuit according to an embodiment of the present disclosure; and

FIG. 3 is a data map according to an embodiment of the present disclosure.

Detailed Description

The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of processors for fine-grain sparse integer and floating-point operations provided in accordance with the present disclosure, and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. However, it is to be understood that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.

A neural network (e.g., when performing inference) may perform a large number of computations in which an activation (or "activation value"), i.e., an element of an input feature map (IFM), is multiplied by a weight. The products of the activations and the weights may form a multi-dimensional array, which may be summed along one or more axes to form an array, or "tensor", which may be referred to as an output feature map (OFM). Referring to FIG. 1, dedicated hardware may be employed to perform such calculations. The activations may be stored in a static random access memory (SRAM) 105 and fed into a multiplier accumulator (MAC) array, which may include (i) a plurality of blocks (which may be referred to as "bricks" 110), each of which may include a plurality of multipliers for multiplying the activations with weights, (ii) one or more adder trees for adding together the products generated by the bricks, and (iii) one or more accumulators for accumulating the sums generated by the adder trees. Each activation value may be broadcast to a plurality of multipliers, conceptually arranged in a row in the representation of FIG. 1. A plurality of adder trees 115 may be employed to form the sums.

In operation, the weights may fall within a range of values, and the distribution of the values of the weights may be such that relatively small weights are significantly more common than relatively large weights. For example, if each weight is represented as an 8-bit number (e.g., an 8-bit integer (INT8) or an 8-bit floating point number), then many of the weights (e.g., a majority of the weights, or more than 3/4 of the weights) may have a value less than 16 (i.e., a most significant nibble equal to zero); the weights with a non-zero most significant nibble may then be referred to as "outliers". In some embodiments, suitably configured hardware may achieve increased speed and power efficiency by taking advantage of these characteristics of the weights. However, the inventive concept is not limited thereto, and, for example, each weight may also be represented as a 16-bit number (e.g., a 16-bit integer or a 16-bit floating point number (FP16)).
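The outlier criterion described above can be illustrated with a minimal Python sketch. It assumes unsigned 8-bit weight magnitudes; the helper names `split_nibbles` and `is_outlier` and the sample values are hypothetical and are introduced only for illustration.

```python
def split_nibbles(weight: int) -> tuple[int, int]:
    """Return (least significant nibble, most significant nibble) of an 8-bit weight."""
    if not 0 <= weight <= 0xFF:
        raise ValueError("weight must fit in 8 unsigned bits")
    return weight & 0xF, weight >> 4


def is_outlier(weight: int) -> bool:
    """A weight counts as an outlier when its most significant nibble is non-zero."""
    _, msn = split_nibbles(weight)
    return msn != 0


# Hypothetical sample of weight magnitudes (illustrative values only).
weights = [3, 7, 0, 12, 200, 5, 15, 16]
outliers = [w for w in weights if is_outlier(w)]
print(outliers)
```

In this sample only the weights 200 and 16 have a non-zero most significant nibble, so only those two would need a second multiplier (or a second cycle).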

FIG. 2A shows a portion of a hybrid processing circuit (referred to as "hybrid" because it is suitable for both integer and floating point operations). Referring to FIG. 2A, in some embodiments, a plurality of multipliers 205 is used to multiply the weights with the activations, e.g., one nibble at a time. Each multiplier may be a 4 x 4 (i.e., 4-bit by 4-bit) multiplier, with a first input 210 configured to receive a respective weight nibble and a second input 215 configured to receive an activation nibble (which may be broadcast to all of the multipliers). An embodiment with nine multipliers is shown; in some embodiments there are more multipliers (resulting in a more powerful but more costly circuit) and in some embodiments there are fewer multipliers (resulting in a less powerful but less costly circuit). The weight buffer 220 (only the output row of which is shown) may include a respective register for each multiplier 205. The outputs of the multipliers 205 may be fed to a plurality of combining circuits 225, each of which may include one or more multiplexers 230, adders, and inverting-shifting circuits 235. The system may include the same number of combining circuits 225 as multipliers 205, or fewer combining circuits 225 than multipliers 205 (as shown), or more combining circuits 225 than multipliers 205.

In operation, each multiplier may produce a partial product during each clock cycle. These partial products may be added together to form integer products in an integer operation (or partial products in a floating point operation), each of which may be processed by an inverting-shifting circuit 235, and the results (e.g., Unit 0, Unit 1, ..., Unit 7) may be sent to adder trees (e.g., adder tree 0, adder tree 1, ..., adder tree 7) to be added to other integer products. For example, as shown in FIG. 2A, the first two values in the output row of the weight buffer 220 may be the least significant nibble L0 and the most significant nibble M0, respectively, of a first weight, and the activation value nibble being broadcast may be the least significant nibble of a first (8-bit) activation value. However, the inventive concept is not limited thereto, and, in another example, the first two values in the output row of the weight buffer 220 may be the most significant nibble M0 and the least significant nibble L0, respectively, of the first weight. The activation value nibble may be multiplied with the least significant nibble L0 of the first weight to form a first partial product P0, and the activation value nibble may be multiplied with the most significant nibble M0 of the first weight to form a second partial product P1. Although FIG. 2A illustrates an example in which the first partial product P0 and the second partial product P1 are calculated using two multipliers (e.g., a first multiplier and a second multiplier) during one clock cycle, the inventive concept is not limited thereto, and, in another example, the first partial product P0 and the second partial product P1 may be calculated using the same multiplier (e.g., a third multiplier) during two respective clock cycles. In FIG. 2A, the output row of the weight buffer 220 may also include the nibbles L2, L3, M4, L7, M7, and 0, and the partial products P2, P3, ..., Pn-1 may be formed in a similar manner. These partial products may be routed to the first combining circuit 225 (the leftmost one in FIG. 2A) through a connection fabric 240. The connection fabric 240 may include the multiplexers 230 (e.g., UL0, UM0, UL1, UM1, ..., UL7, and UM7); the multiplexers 230 are depicted as separate elements in FIG. 2A to facilitate the illustration (using arrows) of the data routing they perform. In the first combining circuit 225, the product of (i) the weight (both nibbles) and (ii) the activation value nibble may be calculated as an offset sum of the first partial product and the second partial product (calculated by the corresponding offset adder 245).

As used herein, an "offset sum" of two values is the result of an "offset addition", which is the forming of the sum of (i) a first one of the two values and (ii) a second one of the two values, shifted to the left by a number of bits (e.g., four bits), and an "offset adder" is an adder that performs the addition of two numbers with an offset between the bit positions of the two numbers (referred to as the "offset" of the offset adder). As used herein, the "significance" of a nibble (or, more generally, of a sub-word (discussed in further detail below)) is the position that the nibble occupies in the word of which it is a part (e.g., whether the nibble is the most significant nibble or the least significant nibble of an 8-bit word). Thus, the most significant nibble of an 8-bit word has a significance four bits greater than that of the least significant nibble. For example, for integer operations, the difference between the significance of a first sub-word of a weight and the significance of a second sub-word of the weight is equal to the width of the first sub-word of the weight, and the offset of the offset adder is equal to the width of the first sub-word of the weight. Similarly, for floating point operations, the difference between the significance of a first sub-word of a mantissa and the significance of a second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa, and the offset of the offset adder is equal to the width of the first sub-word of the mantissa. The difference between the significance of the first sub-word and the significance of the second sub-word corresponds to the offset between the position occupied by the first sub-word in the word and the position occupied by the second sub-word in the word. A word (e.g., of N bits, N being an integer greater than 1) may be divided into a plurality of sub-words. The sub-word that occupies the most significant position in the word is referred to as the most significant sub-word, and the sub-word that occupies the least significant position in the word is referred to as the least significant sub-word.
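The offset addition defined above can be illustrated with a short Python sketch that forms the product of an unsigned 8-bit weight and a 4-bit activation nibble from two 4 x 4 partial products. The function names (`multiply_4x4`, `offset_add`, `weight_times_activation_nibble`) are hypothetical, and the sketch models only the arithmetic, not the timing or routing of the circuit.

```python
NIBBLE_BITS = 4  # sub-word width assumed here, matching the 4 x 4 multipliers


def multiply_4x4(a: int, b: int) -> int:
    """Model of one unsigned 4-bit by 4-bit multiplier."""
    if not (0 <= a <= 0xF and 0 <= b <= 0xF):
        raise ValueError("both operands must be unsigned 4-bit values")
    return a * b


def offset_add(low: int, high: int, offset: int = NIBBLE_BITS) -> int:
    """Offset sum: `low` plus `high` shifted to the left by `offset` bits."""
    return low + (high << offset)


def weight_times_activation_nibble(weight: int, act_nibble: int) -> int:
    """Multiply an unsigned 8-bit weight by a 4-bit activation nibble."""
    l0 = weight & 0xF                   # least significant nibble L0
    m0 = (weight >> 4) & 0xF            # most significant nibble M0
    p0 = multiply_4x4(act_nibble, l0)   # first partial product P0
    p1 = multiply_4x4(act_nibble, m0)   # second partial product P1
    return offset_add(p0, p1)           # P0 + (P1 << 4)


print(weight_times_activation_nibble(0xA7, 0xC) == 0xA7 * 0xC)
```

Because the two weight nibbles differ in significance by exactly one nibble width, a fixed offset of four bits is enough to align the two partial products.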

Each of the inverting-shifting circuits 235 may convert between (i) a sign and magnitude representation and (ii) a two's complement representation, and may shift the result as needed for proper addition to occur in the adder tree. For example, if the activation value nibble is a most significant nibble, the output of the offset adder 245 may be shifted (e.g., by 4 bits to the left) so that, in the adder tree, the bits of the output will be properly aligned with the bits of other products (e.g., the product of a weight with the least significant nibble of the activation value). The conversion between the sign and magnitude representation and the two's complement representation may be performed if, for example, the multipliers 205 are unsigned integer multipliers and the adder tree is a two's complement adder tree.
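The following Python sketch illustrates, under stated assumptions, the two roles described above: shifting by four bits when the broadcast activation nibble is the most significant one, and converting a sign and magnitude value to two's complement before accumulation. The 20-bit word width `WIDTH` and all function names are hypothetical; the text does not specify the adder tree width.

```python
WIDTH = 20  # assumed adder-tree word width in bits (hypothetical, not from the text)
MASK = (1 << WIDTH) - 1


def invert_and_shift(sign: int, magnitude: int, act_is_msn: bool) -> int:
    """Shift left by 4 bits if the activation nibble is the most significant one,
    then convert sign-and-magnitude to a WIDTH-bit two's complement pattern."""
    shifted = (magnitude << 4) if act_is_msn else magnitude
    value = -shifted if sign else shifted
    return value & MASK


def decode(bits: int) -> int:
    """Interpret a WIDTH-bit pattern as a signed two's complement integer."""
    return bits - (1 << WIDTH) if bits & (1 << (WIDTH - 1)) else bits


def signed_product(sign: int, weight_magnitude: int, activation: int) -> int:
    """Signed weight times unsigned 8-bit activation, one activation nibble per pass."""
    total = 0
    passes = ((False, activation & 0xF), (True, activation >> 4))
    for act_is_msn, act_nibble in passes:
        magnitude = weight_magnitude * act_nibble  # stands in for the offset adder output
        total = (total + invert_and_shift(sign, magnitude, act_is_msn)) & MASK
    return decode(total)


print(signed_product(1, 0xA7, 0x5C))
```

The modular addition plays the role of a two's complement adder tree; the result matches the signed product as long as it fits in the assumed width.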

As shown in FIG. 2B, the arrangement of weight nibbles in the weight buffer 220 may be the result of preprocessing. As shown, the original weight array 250 may include a first row of least significant nibbles (labeled "L0", etc.) and a second row of most significant nibbles (labeled "M0", etc.). Some of the nibbles may be zero, as indicated by the blank cells in FIG. 2B. The preprocessing may rearrange these nibbles as the weight buffer is filled (e.g., as indicated by the arrows in FIG. 2B) such that the weight buffer contains a smaller proportion of zero-valued nibbles than the original weight array 250. In the example of FIG. 2B, the eight weights (each consisting of a least significant nibble and a most significant nibble) are rearranged such that the zero-valued nibbles are discarded and the non-zero nibbles are placed into eight positions of one row of the weight buffer (the ninth position containing zero), so that when this row of the weight buffer is processed by the array of nine multipliers 205 (FIG. 2A), eight of the multipliers are used and only one (the ninth) is unused. For example, as shown in FIG. 2B, the most significant nibble of weight 2 in the original weight array 250 is 0, so that the weight buffer includes only the least significant nibble L2 of weight 2 and does not include the most significant nibble of weight 2. In this case, the activation value nibble and the least significant nibble L2 of weight 2 may be multiplied in one multiplier 205 to form a partial product P2, and the partial product P2 and 0 may be fed into a combining circuit 225 to be added. Furthermore, in some cases, the original weight array 250 may not be sufficiently sparse to allow every most significant nibble to be in the same row of the weight buffer as the corresponding least significant nibble, and some or all of the products may be formed over two clock cycles, with the activation value remaining the same for both cycles. The preprocessing may also generate an array of control signals that may be used to control the connection fabric 240 (e.g., the multiplexers 230) such that each partial product is sent to the appropriate input of an offset adder 245 according to the significance of the factors that formed it.
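The packing step can be sketched in Python as follows. This is a simplified, hypothetical model: it uses a greedy left-to-right placement (the exact placement shown in FIG. 2B is not reproduced here), stores the routing information as a `(weight index, is most significant)` tag next to each nibble in place of the control signal array, and treats each buffer row as one clock cycle. The names `pack_nibbles` and `multiply_rows` and the sample weights are assumptions.

```python
NUM_MULTIPLIERS = 9  # row width, matching the nine-multiplier example


def pack_nibbles(weights, row_width=NUM_MULTIPLIERS):
    """Greedily pack non-zero nibbles into weight-buffer rows.

    Each entry is (nibble, weight_index, is_msn); padding is (0, None, False).
    """
    entries = []
    for index, weight in enumerate(weights):
        lsn, msn = weight & 0xF, (weight >> 4) & 0xF
        if lsn:
            entries.append((lsn, index, False))
        if msn:
            entries.append((msn, index, True))
    rows = [entries[i:i + row_width] for i in range(0, len(entries), row_width)] or [[]]
    for row in rows:
        row.extend([(0, None, False)] * (row_width - len(row)))
    return rows


def multiply_rows(rows, act_nibble, num_weights):
    """Multiply every packed nibble by the broadcast activation nibble and
    route each partial product to its weight with the proper 4-bit offset."""
    products = [0] * num_weights
    for row in rows:  # one clock cycle per buffer row
        for nibble, index, is_msn in row:
            if index is None:
                continue  # unused multiplier
            partial = nibble * act_nibble
            products[index] += (partial << 4) if is_msn else partial
    return products


# Hypothetical weights with eight non-zero nibbles in total.
weights = [0x21, 0x03, 0x05, 0x07, 0x40, 0x00, 0x00, 0x91]
rows = pack_nibbles(weights)
print(len(rows), multiply_rows(rows, 0xB, len(weights)))
```

With these sample weights all non-zero nibbles fit into a single nine-wide row, so the eight products are formed in one pass, whereas the unpacked two-row layout would need two.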

As shown in FIG. 2C, the hybrid processing circuit may also include a plurality of variable shifting units (or "shifting units" or "variable shifting circuits") 260, which enable the hybrid processing circuit, in a floating point mode, to perform floating point operations on floating point activations and floating point weights. Each such floating point number may be an FP16 floating point number (e.g., in a format according to the IEEE 754-2008 standard) having one sign bit, an 11-bit mantissa (or "significand") (represented by 10 bits and one implicit leading bit, or "hidden bit"), and a five-bit exponent. The 11-bit mantissa may be padded with one zero bit and divided into three nibbles: a "high" (most significant) nibble, a "middle" nibble (of intermediate significance), and a "low" (least significant) nibble, such that concatenating the high nibble, the middle nibble, and the low nibble, in that order, produces the 12-bit (padded) mantissa.
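As an illustration of the mantissa split described above, the Python sketch below extracts the FP16 fields with the standard `struct` module and divides the padded 12-bit mantissa into three nibbles. It assumes that the zero pad bit is placed on the most significant side (the text does not say where the pad goes), and it does not handle infinities or NaNs; the function names are hypothetical.

```python
import struct


def fp16_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, 5-bit exponent, 10-bit fraction) of x encoded as IEEE 754 binary16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def mantissa_nibbles(x: float) -> tuple[int, int, int]:
    """Return the (high, middle, low) nibbles of the padded 12-bit mantissa of x."""
    _, exponent, fraction = fp16_fields(x)
    hidden = 1 if exponent != 0 else 0     # zero and subnormals have no implicit leading 1
    mantissa = (hidden << 10) | fraction   # 11-bit mantissa
    padded = mantissa                      # assumed: one zero pad bit above bit 10
    return (padded >> 8) & 0xF, (padded >> 4) & 0xF, padded & 0xF


high, middle, low = mantissa_nibbles(1.5)
print(hex(high), hex(middle), hex(low))
```

Concatenating the three nibbles as `(high << 8) | (middle << 4) | low` recovers the padded mantissa, which for 1.5 is `0x600`.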

Floating point multiplication may then be performed by the hybrid processing circuit of FIG. 2C by multiplying, in each of the multipliers 205, one nibble of a weight with one nibble of an activation at a time, so as to form the partial products of the high, middle, and low nibbles of the mantissa of each weight with the high, middle, and low nibbles of the mantissa of the activation. The (12-bit-wide) output of each inverting-shifting circuit 235 may be fed to a corresponding variable shifting unit 260. In the floating point mode, the variable shifting unit 260 may shift the data it receives to the right by between 0 and N bits, where N may be 8, or a larger or smaller number, depending in part on the size (i.e., the number of bits) of the mantissa used in the adder tree, which may be selected based on the precision to be achieved. In the integer mode, the variable shifting unit 260 may shift the data it receives to the left by 0 or M bits, where M may be the size (i.e., the number of bits) of the most significant sub-word of the integer activation value: for example, the variable shifting unit 260 may shift the data it receives to the left by M bits when the activation value is the most significant sub-word of the integer activation value, and by 0 bits when the activation value is the least significant sub-word of the integer activation value. Additional optional shifting may be obtained by selecting one or the other input of the offset adder 245 and by selecting the amount of shift applied in the inverting-shifting circuit 235. Thus, the hybrid processing circuit of FIG. 2C is able to produce suitably aligned outputs for every combination of the significances of the four input nibbles of the two multipliers 205 that feed any one of the offset adders 245 during a given clock cycle (subject to the constraint that the significances of the two values input to the offset adder 245 differ by four bits, the offset adder 245 shifting one input by four bits relative to the other).
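The nibble-wise formation of a mantissa product can be sketched in Python as below. This is exact integer arithmetic with left shifts chosen by nibble significance; the right shifts, truncation, and exponent handling performed by the hardware in floating point mode are deliberately not modeled, and the function names are hypothetical.

```python
def nibbles_of(mantissa12: int) -> list[int]:
    """Split a padded 12-bit mantissa into [low, middle, high] nibbles."""
    if not 0 <= mantissa12 <= 0xFFF:
        raise ValueError("mantissa must fit in 12 bits")
    return [(mantissa12 >> (4 * i)) & 0xF for i in range(3)]


def mantissa_product(weight_mantissa: int, act_mantissa: int) -> int:
    """Form the product of two padded mantissas from nine 4 x 4 partial products."""
    total = 0
    for i, act_nibble in enumerate(nibbles_of(act_mantissa)):      # one broadcast nibble at a time
        for j, weight_nibble in enumerate(nibbles_of(weight_mantissa)):
            partial = act_nibble * weight_nibble                   # one 4 x 4 multiplier
            total += partial << (4 * (i + j))                      # align by combined significance
    return total


print(hex(mantissa_product(0x600, 0x600)))
```

Each partial product is aligned by the sum of the significances of its two nibbles, which is why products whose weight nibbles differ by one nibble position can share an offset adder with a four-bit offset.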

FIG. 3 illustrates an example of preprocessing for an array of floating point weights. In a floating point representation, nibble sparsity (which may be relatively common for integer weights, e.g., weights whose most significant nibble has a value of zero) may be relatively rare, but a significant fraction of the weights may be equal to zero, as shown for the original weight array 305, in which all three nibbles (low (L), middle (M), and high (H)) of each of three weights are zero. FIG. 3 shows how the three nibbles of the mantissa of each non-zero weight (weights 0, 2, 4, 5, and 6) may be rearranged, first forming a first intermediate matrix 310, then a second intermediate matrix 315, and then a final matrix 320, which may be suitable for storage in the weight buffer. In the final matrix, all of the non-zero elements are in the first two rows, and all of the products may be formed in two operations (e.g., in two clock cycles), whereas if the original weight array 305 were loaded into the weight buffer unchanged, three operations would be used.

Although some examples are presented herein for embodiments having 8-bit weights, 8-bit activation values, a weight buffer four weights wide, and weights and activations that are processed one nibble at a time, it will be understood that these parameters, and other similar parameters in this disclosure, are used only as specific examples for ease of explanation, and any of these parameters may be changed. Thus, for example, the size of a weight may be a "word" and the size of a portion of a weight may be a "sub-word", with the word size being one byte and the sub-word size being one nibble in the embodiment of FIG. 2A. In other embodiments, for example, the word may be 12 bits and the sub-word 6 bits, or the word may be 16 bits and the sub-word one byte. Further, the method of performing integer operations and floating point operations described in the present application may be a method of performing multiplication operations by a neural network processor, and the components described in the present application (e.g., the SRAM, MAC array, multipliers, combining circuits, adder trees, multiplexers, adders, inverting-shifting circuits, variable shifting units, weight buffer, etc.) may be included in the neural network processor.

As used herein, "a portion of" something means "at least some of" the thing, and thus may mean less than all of, or all of, the thing. Thus, as a special case, "a portion of" a thing includes the entire thing (i.e., the entire thing is an example of a portion of the thing). As used herein, the term "or" should be interpreted as "and/or", such that, for example, "A or B" means any one of "A", "B", or "A and B".

The terms "processing circuit" and "means for processing" as used herein each refer to any combination of hardware, firmware, and software employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general-purpose or special-purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured (i.e., hardwired) to perform that function, or by more general-purpose hardware (such as a CPU) configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits (an FPGA and a CPU) interconnected on a PCB.

As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being "based on" a second quantity (e.g., a second variable), this means that the second quantity is an input to the method or influences the first quantity (e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as the second quantity (e.g., stored at the same location or locations in memory)).

It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Accordingly, a first element, first component, first region, first layer or first section discussed herein could be termed a second element, second component, second region, second layer or second section without departing from the spirit and scope of the inventive concept.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms "substantially," "about," and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Expressions such as "at least one of," when used with a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of "may" when describing embodiments of the inventive concept refers to "one or more embodiments of the present disclosure". Also, the term "exemplary" is intended to refer to an example or illustration. As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively.

It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to," or "directly adjacent to" another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of "1.0 to 10.0" or "between 1.0 and 10.0" is intended to include all sub-ranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited herein is intended to include all higher numerical limitations subsumed therein.

Although exemplary embodiments of processors for fine-grained sparse integer and floating point operations have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that processors for fine-grained sparse integer and floating point operations constructed according to the principles of the present disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims and equivalents thereof.

Claims

1. A method for a neural network processor to perform multiplication operations, the method comprising:

forming a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights; and/or

forming a second set of products, each product of the second set of products being a floating-point product of a second activation value and a respective weight of a second plurality of weights,

wherein the forming of the first set of products comprises:

multiplying, in a first multiplier, the first activation value by a least significant sub-word of a first weight of the first plurality of weights to form a first partial product;

multiplying, in a second multiplier, the first activation value by a most significant sub-word of the first weight of the first plurality of weights to form a second partial product; and

adding the first partial product to the second partial product to form a product of the first set of products, and

wherein the forming of the second set of products comprises:

multiplying, in the first multiplier, the second activation value by a first sub-word of a mantissa of a first weight of the second plurality of weights to form a third partial product;

multiplying, in the second multiplier, the second activation value by a second sub-word of the mantissa to form a fourth partial product; and

adding the third partial product to the fourth partial product to form a product of the second set of products.
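
The arithmetic recited in claim 1 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed circuit: it assumes unsigned 8-bit weights split into two 4-bit sub-words (nibbles) and a 4-bit activation value, and the names `SUB_WORD_BITS`, `split_weight`, and `product_from_partials` are hypothetical names introduced here for illustration.

```python
from typing import Tuple

SUB_WORD_BITS = 4  # assumed sub-word (nibble) width; the claims do not fix it


def split_weight(weight: int) -> Tuple[int, int]:
    """Split an unsigned 8-bit weight into (least, most) significant nibbles."""
    if not 0 <= weight < 256:
        raise ValueError("expected an unsigned 8-bit weight")
    return weight & 0xF, weight >> SUB_WORD_BITS


def product_from_partials(activation: int, weight: int) -> int:
    """Form activation * weight from one partial product per weight nibble."""
    low_nibble, high_nibble = split_weight(weight)
    first_partial = activation * low_nibble    # models the first multiplier
    second_partial = activation * high_nibble  # models the second multiplier
    # The second partial product is more significant by one sub-word width.
    return first_partial + (second_partial << SUB_WORD_BITS)
```

Under the same assumptions, the routine also models the second set of products if `weight` is taken to hold two adjacent 4-bit sub-words of a mantissa and `activation` a nibble of an activation mantissa; exponent and sign handling are outside the scope of this sketch.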

2. The method according to claim 1, wherein the second activation value is a nibble of a mantissa of a floating-point activation value.

3. The method according to claim 1, wherein the difference between the significance of the first sub-word of the mantissa and the significance of the second sub-word of the mantissa is equal to the width of the first sub-word of the mantissa.

4. The method of claim 1, wherein the adding of the third partial product to the fourth partial product comprises performing an offset addition in a first offset adder.

5. The method of claim 4, wherein the offset of the first offset adder is equal to the width of the first sub-word of the mantissa.

6. The method of claim 4, wherein the adding of the first partial product to the second partial product comprises performing an offset addition in the first offset adder.
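
The offset addition of claims 4 to 6 can be sketched in software as follows. This is an illustrative model only: it assumes unsigned operands and an offset of 4 bits (one nibble), and the function name `offset_add` is a hypothetical name introduced here. The sketch highlights one property of adding with an offset: the lowest `offset` bits of the less significant partial product pass through unchanged, so only the remaining upper bits take part in the addition.

```python
def offset_add(low_partial: int, high_partial: int, offset: int) -> int:
    """Return low_partial + (high_partial << offset) for unsigned operands."""
    if low_partial < 0 or high_partial < 0 or offset < 0:
        raise ValueError("this sketch models unsigned operands only")
    mask = (1 << offset) - 1
    passthrough = low_partial & mask                   # bits below the offset
    upper_sum = (low_partial >> offset) + high_partial  # the only real addition
    return (upper_sum << offset) | passthrough


# Assumed example: activation nibble 0xB, mantissa sub-words 0x6 (first)
# and 0xD (second), with the offset equal to the first sub-word's width.
third_partial = 0xB * 0x6
fourth_partial = 0xB * 0xD
assert offset_add(third_partial, fourth_partial, offset=4) == 0xB * 0xD6
```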

7. The method according to any one of claims 1 to 6, wherein the forming of the first set of products further comprises:

multiplying, in a third multiplier, the first activation value by a least significant sub-word of a second weight of the first plurality of weights to form a fifth partial product;

multiplying, in the third multiplier, the first activation value by a most significant sub-word of the second weight of the first plurality of weights to form a sixth partial product; and

adding the fifth partial product to the sixth partial product to form a product of the first set of products.

8. The method according to any one of claims 1 to 6, wherein the forming of the first set of products further comprises:

multiplying, in a fourth multiplier, the first activation value by a least significant sub-word of a third weight of the first plurality of weights to form a seventh partial product, the third weight having a most significant nibble equal to zero; and

adding the seventh partial product to zero to form a product of the first set of products.
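
The case in claim 8, in which the most significant nibble of a weight is zero, can be sketched as follows. This is an illustrative model under assumptions: unsigned 8-bit weights, 4-bit sub-words, and a hypothetical function `product_with_zero_skip` that also reports how many multiplications were needed. It is not the claimed hardware; it only shows that a single partial product added to zero already gives the full product when the upper nibble is zero.

```python
from typing import Tuple


def product_with_zero_skip(activation: int, weight: int) -> Tuple[int, int]:
    """Return (activation * weight, number of multiplications performed)."""
    if not 0 <= weight < 256:
        raise ValueError("expected an unsigned 8-bit weight")
    low_nibble = weight & 0xF
    high_nibble = weight >> 4
    low_partial = activation * low_nibble
    if high_nibble == 0:
        # Upper nibble is zero: the partial product is added to zero and
        # no second multiplication is needed for this weight.
        return low_partial + 0, 1
    high_partial = activation * high_nibble
    return low_partial + (high_partial << 4), 2
```

In this model, a weight such as `0x07` needs one multiplication while `0x17` needs two.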

9. The method according to any one of claims 1 to 6, wherein the first activation value is the most significant subword of an integer activation value.

10. The method of claim 9, further comprising: shifting the sum of the first partial product and the second partial product to the left by a number of bits equal to the size of the first activation value.
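
The shift recited in claim 10 can be illustrated with a short model. This is a sketch under assumptions, not the claimed implementation: it assumes an unsigned 8-bit integer activation split into two 4-bit sub-words and an unsigned 8-bit weight, and the names `nibble_times_weight` and `full_product` are hypothetical. The sum formed for the most significant activation sub-word is shifted left by 4 bits, the assumed size of that sub-word, before being combined with the sum for the least significant sub-word.

```python
ACT_SUB_WORD_BITS = 4  # assumed size of each activation sub-word


def nibble_times_weight(act_nibble: int, weight: int) -> int:
    """Sum of the two partial products of one activation nibble and a weight."""
    low_partial = act_nibble * (weight & 0xF)
    high_partial = act_nibble * (weight >> 4)
    return low_partial + (high_partial << 4)


def full_product(activation: int, weight: int) -> int:
    """Form an 8-bit by 8-bit unsigned product from nibble-level sums."""
    if not (0 <= activation < 256 and 0 <= weight < 256):
        raise ValueError("expected unsigned 8-bit operands")
    act_low = activation & 0xF
    act_high = activation >> ACT_SUB_WORD_BITS
    low_sum = nibble_times_weight(act_low, weight)
    # Shift the sum for the most significant activation sub-word to the left
    # by the size of that sub-word.
    high_sum = nibble_times_weight(act_high, weight) << ACT_SUB_WORD_BITS
    return low_sum + high_sum
```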

11. A neural network processor, comprising: processing circuitry, the processing circuitry including a first multiplier and a second multiplier,

the processing circuitry being configured to:

form a first set of products, each product of the first set of products being an integer product of a first activation value and a respective weight of a first plurality of weights, and/or

wherein the forming of the first set of products comprises:

wherein the forming of the second set of products comprises:

12. The neural network processor of claim 11, wherein the second activation value is a nibble of a mantissa of a floating-point activation value.

13. The neural network processor of claim 11, wherein the difference between the validity of the first subword of the mantissa and the validity of the second subword of the mantissa is equal to the width of the first subword of the mantissa.

14. The neural network processor of claim 11, wherein the adding of the third partial product to the fourth partial product comprises performing an offset addition in a first offset adder of the processing circuitry.

15. The neural network processor of claim 14, wherein the offset of the first offset adder is equal to the width of the first subword of the mantissa.

16. The neural network processor of claim 14, wherein the process of adding the first partial product to the second partial product includes performing offset addition in the first offset adder.

17. The neural network processor according to any one of claims 11 to 16, wherein the processing circuitry further comprises a third multiplier, and the processing of forming the first product set further comprises:

18. The neural network processor according to any one of claims 11 to 16, wherein the processing circuitry further comprises a fourth multiplier, and the processing of forming the first product set further comprises:

19. The neural network processor according to any one of claims 11 to 16, wherein the first activation value is the most significant subword of an integer activation value.

The processing circuitry is further configured to shift the sum of the first partial product and the second partial product to the left by a number of bits equal to the size of the first activation value.

20. A neural network processor, comprising: means for processing, the means for processing including a first multiplier, a second multiplier, and a third multiplier.

The means for processing is configured to:

The process of forming the first product set includes:

The process of forming the second product set includes: