LSTM accelerator based on pulse array and acceleration method
Technical Field
The invention relates to the technical field of artificial intelligence cyclic neural networks, in particular to an LSTM accelerator based on a pulse array and an acceleration method.
Background
Long Short-Term Memory (LSTM) is a classical variant of Recurrent Neural Networks (RNN) and is commonly used for processing sequence data, such as text analysis, speech recognition, and sequence-related recognition tasks such as language translation. Compared with the traditional RNN network, the LSTM network has a stronger memory function and can better process long sequences and long-term dependency. The key feature of LSTM is that it incorporates gating elements and cell states for controlling the flow of information. The introduction of the gating unit enables the LSTM to capture sequence information for a longer time interval and maintain a long-term dependency relationship, so that the problem of gradient disappearance or explosion is avoided. Meanwhile, the gating unit is introduced to cause that the calculated amount and the storage complexity in the LSTM network are larger and larger, so that the problem of low efficiency exists when the traditional general processor executes the LSTM model, and the huge storage requirement also restricts the deployment of the LSTM model in the resource-limited embedded equipment.
The Field Programmable Gate Array (FPGA) has the characteristics of programmability, high parallelism, low power consumption and the like. Compared with the traditional CPU and GPU, the FPGA has higher flexibility and shorter development period, can be used for designing models of different network structures, and can effectively improve the calculation efficiency of LSTM through reasonable parallel calculation, algorithm mapping and memory optimization.
Due to the problems of huge memory requirement and low calculation efficiency of the LSTM, various compression methods aiming at the LSTM network can effectively reduce the parameter quantity and calculation quantity of the LSTM, but at the same time, the problems of accuracy reduction and unbalanced load can be caused. Moreover, the LSTM accelerator generally adopts simple parallel and pipeline designs, cannot be suitable for the LSTM network model after compression, and has the condition that an operation unit is idle or data flow arrangement is unreasonable, so that the overall efficiency of the model is low.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art and providing an LSTM accelerator and an acceleration method based on a pulsation array, which converts the multiplication operation of a key weight matrix and an input vector in the LSTM into the multiplication operation of the matrix and the matrix by reasonably optimizing a data structure, processes the multiplication operation by adopting the pulsation array, and flows weight data and input data in the pulsation array among basic processing units PE, thereby achieving high parallelism and high throughput and effectively improving the performance of the accelerator.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides an LSTM accelerator based on a pulsation array, which comprises a weight buffer module, a vector-matrix conversion module, K pulsation arrays and an Element-wise module, wherein the pulsation array comprises M multiplied by N processing unit PE modules, M, N are integers larger than 1, and the number of the pulsation arrays is the same as the number of LSTM gating units in an LSTM model; wherein,
The vector-matrix conversion module is used for converting the input vector into a matrix form, obtaining an input matrix and outputting the input matrix to the pulsation array;
The impulse array is used for receiving the weight matrix from the weight buffer module and the input matrix of the vector-matrix conversion module, wherein the weight matrix and the input matrix flow between the PE modules, the weight matrix is pruned to obtain a sparse weight matrix, multiplication of the sparse weight matrix and the input matrix of the LSTM gating unit is completed in the impulse array, and the vector is output to the Element-phase module;
and the Element-wise module is used for receiving the vector output from the pulse array and calculating the cell state value and the hidden value of the current time step.
As a further optimization scheme of the LSTM accelerator based on the systolic array, the PE module comprises J sparse weight multiply-accumulate SMAC modules, and the SMAC modules are used for processing multiply-accumulate operation of single hidden nodes of the LSTM units.
As a further optimization of the LSTM accelerator based on systolic arrays according to the present invention, the SMAC module comprises a selector, a shifter and an adder, and is used for performing decoding, shifting and adding operations.
As a further optimization scheme of the LSTM accelerator based on the systolic array, the parallelism J of the SMAC module depends on the block size and the sparsity of the block balanced sparse pruning.
The acceleration method of the LSTM accelerator based on the pulsation array comprises a calculation method of the pulsation array, wherein the calculation method of the pulsation array is as follows:
Step 1, K pulse arrays correspond to K LSTM gate control units, and the pulse arrays are used for carrying out corresponding multiplication operation on a sparse weight matrix and an input matrix output by a vector-matrix conversion module;
step 2, the PE module receives parameters in a sparse weight matrix and parameters in an input matrix of the weight cache module or parameters in a sparse weight matrix and parameters in an input matrix of an adjacent PE module, and decoding and multiply-accumulate operation is carried out by an SMAC module in the PE module;
step 3, after the PE module finishes the operation of the current period, reserving a part and a result of the current operation, and transmitting the parameters of the current sparse weight matrix and the parameters in the input matrix to the adjacent PE module for decoding and multiply-accumulate operation by the adjacent PE module;
Step 4, after receiving the parameters of the sparse weight matrix and the parameters in the input matrix, the adjacent PE module repeats the step 2 and the step 3 until the pulse array completes multiplication operation of the sparse weight matrix and the input matrix of the LSTM gate control unit; and transmitting part and result in the PE module to the Element-wise module to calculate the current cell state value and the hidden value, wherein the part and result in the PE module refers to the output vector of the systolic array.
As a further optimization scheme of the acceleration method of the LSTM accelerator based on the pulse array, firstly, compressing an original weight matrix in an LSTM network by adopting a block cyclic matrix compression algorithm, dividing the weight matrix into blocks, replacing the original weight matrix by adopting a cyclic matrix, compensating the precision loss by retraining, and rearranging and storing the head line vector of each block cyclic matrix by adopting head line vector representation by adopting the cyclic matrix;
And pruning the stored first row vectors by adopting a block balance pruning algorithm, reserving the same sparsity for each row vector, coding non-zero weights in a sparse weight matrix after compression pruning, combining and storing index values and parameter values, storing index values in high order, storing parameter values in low order, and taking the high order index values to decode corresponding input data values in the input vectors.
Finally, the non-zero weight in the compressed pruned sparse weight matrix is quantized into the form of power of 2 by adopting exponential quantization, and shift operation can be adopted to replace multiplication operation in the subsequent hardware acceleration process.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The invention mainly aims at the forward reasoning process of the sparse LSTM neural network after compression pruning to accelerate hardware, in order to improve the utilization efficiency of data and the performance of an accelerator, the original matrix-vector multiplication operation in the LSTM is converted into matrix-matrix multiplication operation through a vector-matrix conversion module, the structure and parallelism of basic processing units PE in a pulse array are reasonably arranged according to the cyclic matrix block size and the block balance pruning sparsity, the pipeline design is reasonably utilized, the weight data and the input data flow among the PE of the pulse array, the weight data and the input data are multiplexed efficiently, and different PE are operated simultaneously, so that the operation state with high parallelism and high throughput is achieved, and the operation performance of the accelerator is effectively improved.
Drawings
FIG. 1 is a diagram of a LSTM accelerator architecture based on a systolic array;
FIG. 2 is a LSTM network compression flow diagram;
FIG. 3a is a diagram of a systolic array structure, and FIG. 3b is a schematic diagram of a systolic array basic building block PE;
Fig. 4 is a schematic diagram of an SMAC architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an LSTM accelerator based on a systolic array comprises a vector-matrix conversion module, a systolic array, an Element-wise module, a storage module and a control module;
And the LSTM network is compressed through a compression algorithm, so that the storage requirement of the LSTM network model deployed on the embedded equipment is effectively reduced, and the calculation process is accelerated. Wherein the vector-matrix conversion module is responsible for converting the input vector into a matrix form and mapping onto the input data stream of the systolic array. The pulse array consists of M multiplied by N PE modules, receives the weight and the input data from the weight buffer module and the vector-matrix conversion module, and performs operation with high parallelism among PE to complete multiplication operation of the sparse weight matrix and the input matrix of the LSTM gate control unit. The PE module consists of J SMAC modules, is responsible for multiply-accumulate operation of LSTM hidden nodes, and the parallelism of the SMAC modules depends on the block size and sparsity of block balance pruning.
The Element-wise module consists of a tanh module and a sigmoid module and is responsible for the operation of an activating function and the like of an LSTM gate output vector.
The storage module comprises an off-chip DRAM, a weight cache module, an input cache module and an output cache module. Off-chip DRAM stores off-chip data. The weight caching module stores weight parameters required by LSTM model calculation, and the weight parameters comprise weight values and weight index values. The input buffer module is used for buffering the input data read in the off-chip DRAM. The output buffer module comprises a c t buffer and an h t buffer, and is used for storing an LSTM cell state value output c t and a hidden value output h t in each time step.
The control module outputs an enabling signal, an accelerator system state conversion signal and a read-write control signal of the storage unit to control the operation of the whole accelerator.
As shown in fig. 2, the multiple hybrid compression algorithm comprises the following steps:
Step 1, firstly, an original LSTM network model is built, and a weight matrix of the original LSTM network model is obtained through training of a corresponding data set.
And step 2, replacing an original LSTM model weight matrix by adopting a block cyclic matrix, retraining the replaced LSTM network, updating weight parameters and recovering the precision loss.
And 3, carrying out block balance pruning on the weight matrix after retraining, wherein the pruning and blocking size is consistent with the first row vector size of the cyclic matrix row, each weight block keeps the same sparsity, and retraining is carried out until the recovery precision is lost.
And 4, quantifying the non-zero weight parameters after pruning, and quantifying the parameters into a form of power of 2, and continuously retraining in the process to compensate for the precision loss.
And 5, encoding the non-zero weight and simultaneously storing the index value and the weight value.
Fig. 3a is a schematic diagram of the structure of the systolic array. The systolic array is composed of M multiplied by N PE units, and adjacent PE units are separated by a register only, so that the communication distance is short. The impulse array receives the data from the weight buffer module and the vector-matrix conversion module, the input data and the weight parameters directionally flow between the adjacent PE units of the impulse array, the input data is reused by the PE units of each row, the weight parameters are reused by the PE units of each column, and different PE units operate simultaneously, so that the method has higher parallelism, accelerates the calculation process, and improves the overall throughput rate of the system.
Fig. 3b is a schematic diagram of the structure of the basic element PE of the systolic array. The PE mainly comprises an SMAC module, an adder and a register. The PE receives data from the cache or neighboring PEs for operation, in this example, the 7-bit weight parameter includes a 4-bit weight value and a 3-bit index value, the input vector includes 8 input data, as shown in fig. 4, the data is decoded in the SMAC module, the corresponding input data is decoded by the index value of the non-zero weight, a multiply-accumulate operation is performed with the weight value, the multiply-accumulate result is stored in the partial sum register. After the multiply-accumulate operation of all weights and inputs is completed, the PE unit adds partial sum results in all SMAC units to obtain an output value of the current node, and transmits the output value to an Element-phase module for operation, and the cell state value and the hidden value of the current time step are output.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.