CN117610624B - A LSTM accelerator and acceleration method based on systolic array - Google Patents

A LSTM accelerator and acceleration method based on systolic array Download PDF

Info

Publication number
CN117610624B
CN117610624B CN202311535097.1A CN202311535097A CN117610624B CN 117610624 B CN117610624 B CN 117610624B CN 202311535097 A CN202311535097 A CN 202311535097A CN 117610624 B CN117610624 B CN 117610624B
Authority
CN
China
Prior art keywords
matrix
module
lstm
input
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311535097.1A
Other languages
Chinese (zh)
Other versions
CN117610624A (en
Inventor
葛芬
王浩
周芳
叶剑涛
龚文强
张国辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202311535097.1A priority Critical patent/CN117610624B/en
Publication of CN117610624A publication Critical patent/CN117610624A/en
Application granted granted Critical
Publication of CN117610624B publication Critical patent/CN117610624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

本发明公开了一种基于脉动阵列的LSTM加速器,加速器包括向量‑矩阵转化模块、脉动阵列、Element‑wise模块、控制模块以及存储模块;向量‑矩阵转化模块负责将输入向量转化为矩阵形式,并映射至脉动阵列的输入数据流上;脉动阵列负责LSTM门控单元的稀疏权重矩阵与输入矩阵的乘法运算,由多个PE模块组成,通过输入数据流和权重数据流在PE模块间流动,高度复用数据,有效避免数据频繁读取。本发明还公开了一种基于脉动阵列的LSTM加速器的加速方法,本发明通过优化数据结构,将原有的矩阵与向量乘法运算转化为矩阵与矩阵乘法运算,并通过脉动阵列对稀疏权重矩阵与输入矩阵的乘法运算进行处理,实现了PE模块的高并行度和高吞吐率,从而提升了加速器的运算性能。

The present invention discloses an LSTM accelerator based on a systolic array, the accelerator includes a vector-matrix conversion module, a systolic array, an Element-wise module, a control module and a storage module; the vector-matrix conversion module is responsible for converting an input vector into a matrix form and mapping it to an input data stream of the systolic array; the systolic array is responsible for the multiplication operation of the sparse weight matrix of the LSTM gating unit and the input matrix, and is composed of a plurality of PE modules, and the input data stream and the weight data stream flow between the PE modules, highly multiplexing data, and effectively avoiding frequent data reading. The present invention also discloses an acceleration method for an LSTM accelerator based on a systolic array, the present invention converts the original matrix-vector multiplication operation into a matrix-matrix multiplication operation by optimizing the data structure, and processes the multiplication operation of the sparse weight matrix and the input matrix through the systolic array, realizes the high parallelism and high throughput of the PE module, thereby improving the computing performance of the accelerator.

Description

LSTM accelerator based on pulse array and acceleration method
Technical Field
The invention relates to the technical field of artificial intelligence cyclic neural networks, in particular to an LSTM accelerator based on a pulse array and an acceleration method.
Background
Long Short-Term Memory (LSTM) is a classical variant of Recurrent Neural Networks (RNN) and is commonly used for processing sequence data, such as text analysis, speech recognition, and sequence-related recognition tasks such as language translation. Compared with the traditional RNN network, the LSTM network has a stronger memory function and can better process long sequences and long-term dependency. The key feature of LSTM is that it incorporates gating elements and cell states for controlling the flow of information. The introduction of the gating unit enables the LSTM to capture sequence information for a longer time interval and maintain a long-term dependency relationship, so that the problem of gradient disappearance or explosion is avoided. Meanwhile, the gating unit is introduced to cause that the calculated amount and the storage complexity in the LSTM network are larger and larger, so that the problem of low efficiency exists when the traditional general processor executes the LSTM model, and the huge storage requirement also restricts the deployment of the LSTM model in the resource-limited embedded equipment.
The Field Programmable Gate Array (FPGA) has the characteristics of programmability, high parallelism, low power consumption and the like. Compared with the traditional CPU and GPU, the FPGA has higher flexibility and shorter development period, can be used for designing models of different network structures, and can effectively improve the calculation efficiency of LSTM through reasonable parallel calculation, algorithm mapping and memory optimization.
Due to the problems of huge memory requirement and low calculation efficiency of the LSTM, various compression methods aiming at the LSTM network can effectively reduce the parameter quantity and calculation quantity of the LSTM, but at the same time, the problems of accuracy reduction and unbalanced load can be caused. Moreover, the LSTM accelerator generally adopts simple parallel and pipeline designs, cannot be suitable for the LSTM network model after compression, and has the condition that an operation unit is idle or data flow arrangement is unreasonable, so that the overall efficiency of the model is low.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art and providing an LSTM accelerator and an acceleration method based on a pulsation array, which converts the multiplication operation of a key weight matrix and an input vector in the LSTM into the multiplication operation of the matrix and the matrix by reasonably optimizing a data structure, processes the multiplication operation by adopting the pulsation array, and flows weight data and input data in the pulsation array among basic processing units PE, thereby achieving high parallelism and high throughput and effectively improving the performance of the accelerator.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides an LSTM accelerator based on a pulsation array, which comprises a weight buffer module, a vector-matrix conversion module, K pulsation arrays and an Element-wise module, wherein the pulsation array comprises M multiplied by N processing unit PE modules, M, N are integers larger than 1, and the number of the pulsation arrays is the same as the number of LSTM gating units in an LSTM model; wherein,
The vector-matrix conversion module is used for converting the input vector into a matrix form, obtaining an input matrix and outputting the input matrix to the pulsation array;
The impulse array is used for receiving the weight matrix from the weight buffer module and the input matrix of the vector-matrix conversion module, wherein the weight matrix and the input matrix flow between the PE modules, the weight matrix is pruned to obtain a sparse weight matrix, multiplication of the sparse weight matrix and the input matrix of the LSTM gating unit is completed in the impulse array, and the vector is output to the Element-phase module;
and the Element-wise module is used for receiving the vector output from the pulse array and calculating the cell state value and the hidden value of the current time step.
As a further optimization scheme of the LSTM accelerator based on the systolic array, the PE module comprises J sparse weight multiply-accumulate SMAC modules, and the SMAC modules are used for processing multiply-accumulate operation of single hidden nodes of the LSTM units.
As a further optimization of the LSTM accelerator based on systolic arrays according to the present invention, the SMAC module comprises a selector, a shifter and an adder, and is used for performing decoding, shifting and adding operations.
As a further optimization scheme of the LSTM accelerator based on the systolic array, the parallelism J of the SMAC module depends on the block size and the sparsity of the block balanced sparse pruning.
The acceleration method of the LSTM accelerator based on the pulsation array comprises a calculation method of the pulsation array, wherein the calculation method of the pulsation array is as follows:
Step 1, K pulse arrays correspond to K LSTM gate control units, and the pulse arrays are used for carrying out corresponding multiplication operation on a sparse weight matrix and an input matrix output by a vector-matrix conversion module;
step 2, the PE module receives parameters in a sparse weight matrix and parameters in an input matrix of the weight cache module or parameters in a sparse weight matrix and parameters in an input matrix of an adjacent PE module, and decoding and multiply-accumulate operation is carried out by an SMAC module in the PE module;
step 3, after the PE module finishes the operation of the current period, reserving a part and a result of the current operation, and transmitting the parameters of the current sparse weight matrix and the parameters in the input matrix to the adjacent PE module for decoding and multiply-accumulate operation by the adjacent PE module;
Step 4, after receiving the parameters of the sparse weight matrix and the parameters in the input matrix, the adjacent PE module repeats the step 2 and the step 3 until the pulse array completes multiplication operation of the sparse weight matrix and the input matrix of the LSTM gate control unit; and transmitting part and result in the PE module to the Element-wise module to calculate the current cell state value and the hidden value, wherein the part and result in the PE module refers to the output vector of the systolic array.
As a further optimization scheme of the acceleration method of the LSTM accelerator based on the pulse array, firstly, compressing an original weight matrix in an LSTM network by adopting a block cyclic matrix compression algorithm, dividing the weight matrix into blocks, replacing the original weight matrix by adopting a cyclic matrix, compensating the precision loss by retraining, and rearranging and storing the head line vector of each block cyclic matrix by adopting head line vector representation by adopting the cyclic matrix;
And pruning the stored first row vectors by adopting a block balance pruning algorithm, reserving the same sparsity for each row vector, coding non-zero weights in a sparse weight matrix after compression pruning, combining and storing index values and parameter values, storing index values in high order, storing parameter values in low order, and taking the high order index values to decode corresponding input data values in the input vectors.
Finally, the non-zero weight in the compressed pruned sparse weight matrix is quantized into the form of power of 2 by adopting exponential quantization, and shift operation can be adopted to replace multiplication operation in the subsequent hardware acceleration process.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The invention mainly aims at the forward reasoning process of the sparse LSTM neural network after compression pruning to accelerate hardware, in order to improve the utilization efficiency of data and the performance of an accelerator, the original matrix-vector multiplication operation in the LSTM is converted into matrix-matrix multiplication operation through a vector-matrix conversion module, the structure and parallelism of basic processing units PE in a pulse array are reasonably arranged according to the cyclic matrix block size and the block balance pruning sparsity, the pipeline design is reasonably utilized, the weight data and the input data flow among the PE of the pulse array, the weight data and the input data are multiplexed efficiently, and different PE are operated simultaneously, so that the operation state with high parallelism and high throughput is achieved, and the operation performance of the accelerator is effectively improved.
Drawings
FIG. 1 is a diagram of a LSTM accelerator architecture based on a systolic array;
FIG. 2 is a LSTM network compression flow diagram;
FIG. 3a is a diagram of a systolic array structure, and FIG. 3b is a schematic diagram of a systolic array basic building block PE;
Fig. 4 is a schematic diagram of an SMAC architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an LSTM accelerator based on a systolic array comprises a vector-matrix conversion module, a systolic array, an Element-wise module, a storage module and a control module;
And the LSTM network is compressed through a compression algorithm, so that the storage requirement of the LSTM network model deployed on the embedded equipment is effectively reduced, and the calculation process is accelerated. Wherein the vector-matrix conversion module is responsible for converting the input vector into a matrix form and mapping onto the input data stream of the systolic array. The pulse array consists of M multiplied by N PE modules, receives the weight and the input data from the weight buffer module and the vector-matrix conversion module, and performs operation with high parallelism among PE to complete multiplication operation of the sparse weight matrix and the input matrix of the LSTM gate control unit. The PE module consists of J SMAC modules, is responsible for multiply-accumulate operation of LSTM hidden nodes, and the parallelism of the SMAC modules depends on the block size and sparsity of block balance pruning.
The Element-wise module consists of a tanh module and a sigmoid module and is responsible for the operation of an activating function and the like of an LSTM gate output vector.
The storage module comprises an off-chip DRAM, a weight cache module, an input cache module and an output cache module. Off-chip DRAM stores off-chip data. The weight caching module stores weight parameters required by LSTM model calculation, and the weight parameters comprise weight values and weight index values. The input buffer module is used for buffering the input data read in the off-chip DRAM. The output buffer module comprises a c t buffer and an h t buffer, and is used for storing an LSTM cell state value output c t and a hidden value output h t in each time step.
The control module outputs an enabling signal, an accelerator system state conversion signal and a read-write control signal of the storage unit to control the operation of the whole accelerator.
As shown in fig. 2, the multiple hybrid compression algorithm comprises the following steps:
Step 1, firstly, an original LSTM network model is built, and a weight matrix of the original LSTM network model is obtained through training of a corresponding data set.
And step 2, replacing an original LSTM model weight matrix by adopting a block cyclic matrix, retraining the replaced LSTM network, updating weight parameters and recovering the precision loss.
And 3, carrying out block balance pruning on the weight matrix after retraining, wherein the pruning and blocking size is consistent with the first row vector size of the cyclic matrix row, each weight block keeps the same sparsity, and retraining is carried out until the recovery precision is lost.
And 4, quantifying the non-zero weight parameters after pruning, and quantifying the parameters into a form of power of 2, and continuously retraining in the process to compensate for the precision loss.
And 5, encoding the non-zero weight and simultaneously storing the index value and the weight value.
Fig. 3a is a schematic diagram of the structure of the systolic array. The systolic array is composed of M multiplied by N PE units, and adjacent PE units are separated by a register only, so that the communication distance is short. The impulse array receives the data from the weight buffer module and the vector-matrix conversion module, the input data and the weight parameters directionally flow between the adjacent PE units of the impulse array, the input data is reused by the PE units of each row, the weight parameters are reused by the PE units of each column, and different PE units operate simultaneously, so that the method has higher parallelism, accelerates the calculation process, and improves the overall throughput rate of the system.
Fig. 3b is a schematic diagram of the structure of the basic element PE of the systolic array. The PE mainly comprises an SMAC module, an adder and a register. The PE receives data from the cache or neighboring PEs for operation, in this example, the 7-bit weight parameter includes a 4-bit weight value and a 3-bit index value, the input vector includes 8 input data, as shown in fig. 4, the data is decoded in the SMAC module, the corresponding input data is decoded by the index value of the non-zero weight, a multiply-accumulate operation is performed with the weight value, the multiply-accumulate result is stored in the partial sum register. After the multiply-accumulate operation of all weights and inputs is completed, the PE unit adds partial sum results in all SMAC units to obtain an output value of the current node, and transmits the output value to an Element-phase module for operation, and the cell state value and the hidden value of the current time step are output.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.

Claims (4)

1. The acceleration method of the LSTM accelerator based on the pulse array is characterized in that the LSTM accelerator comprises a weight buffer module, a vector-matrix conversion module, K pulse arrays and an Element-wise module, wherein the pulse arrays comprise M multiplied by N processing unit PE modules, M, N are integers which are larger than 1, and the number of the pulse arrays is the same as the number of LSTM gating units in the LSTM model; wherein,
The vector-matrix conversion module is used for converting the input vector into a matrix form, obtaining an input matrix and outputting the input matrix to the pulsation array;
The impulse array is used for receiving the weight matrix from the weight buffer module and the input matrix of the vector-matrix conversion module, wherein the weight matrix and the input matrix flow between the PE modules, the weight matrix is pruned to obtain a sparse weight matrix, multiplication of the sparse weight matrix and the input matrix of the LSTM gating unit is completed in the impulse array, and the vector is output to the Element-phase module;
the Element-wise module is used for receiving the vector output by the pulse array and calculating the cell state value and the hidden value of the current time step;
The method for calculating the systolic array is as follows:
Step 1, K pulse arrays correspond to K LSTM gate control units, and the pulse arrays are used for carrying out corresponding multiplication operation on a sparse weight matrix and an input matrix output by a vector-matrix conversion module;
step 2, the PE module receives parameters in a sparse weight matrix and parameters in an input matrix of the weight cache module or parameters in a sparse weight matrix and parameters in an input matrix of an adjacent PE module, and decoding and multiply-accumulate operation is carried out by an SMAC module in the PE module;
step 3, after the PE module finishes the operation of the current period, reserving a part and a result of the current operation, and transmitting the parameters of the current sparse weight matrix and the parameters in the input matrix to the adjacent PE module for decoding and multiply-accumulate operation by the adjacent PE module;
Step 4, after receiving the parameters of the sparse weight matrix and the parameters in the input matrix, the adjacent PE module repeats the step 2 and the step 3 until the pulse array completes multiplication operation of the sparse weight matrix and the input matrix of the LSTM gate control unit; transmitting part and result in the PE module to an Element-wise module to calculate the current cell state value and the hidden value, wherein the part and result in the PE module refer to output vectors of the pulse array;
Firstly, compressing an original weight matrix in an LSTM (least squares) network by adopting a blocking cyclic matrix compression algorithm, blocking the weight matrix, replacing the original weight matrix by adopting a cyclic matrix, compensating the precision loss by retraining, and rearranging and storing the head line vector of each blocking cyclic matrix by adopting the head line vector representation of the cyclic matrix;
Secondly, pruning the stored first row vectors by adopting a block balance pruning algorithm, reserving the same sparsity for each row vector, coding non-zero weights in a sparse weight matrix after compression pruning, combining and storing index values and parameter values, storing index values in high order, storing parameter values in low order, and decoding corresponding input data values in input vectors by taking the high order index values;
and finally, quantizing the nonzero weight in the compressed pruned sparse weight matrix into a form of power of 2 by adopting exponential quantization, and replacing multiplication operation by adopting shift operation in the subsequent hardware acceleration process.
2. The method of claim 1, wherein the PE module includes J sparse weight multiply-accumulate SMAC modules for processing multiply-accumulate operations of single hidden nodes of LSTM cells.
3. The method of claim 1, wherein the SMAC module includes a selector, a shifter, and an adder, and wherein the SMAC module is configured to perform decoding, shifting, and adding operations.
4. The acceleration method of claim 1, wherein the parallelism J of the SMAC module depends on the block size and sparsity of the block-balanced sparse pruning.
CN202311535097.1A 2023-11-16 2023-11-16 A LSTM accelerator and acceleration method based on systolic array Active CN117610624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311535097.1A CN117610624B (en) 2023-11-16 2023-11-16 A LSTM accelerator and acceleration method based on systolic array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311535097.1A CN117610624B (en) 2023-11-16 2023-11-16 A LSTM accelerator and acceleration method based on systolic array

Publications (2)

Publication Number Publication Date
CN117610624A CN117610624A (en) 2024-02-27
CN117610624B true CN117610624B (en) 2024-11-12

Family

ID=89947399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311535097.1A Active CN117610624B (en) 2023-11-16 2023-11-16 A LSTM accelerator and acceleration method based on systolic array

Country Status (1)

Country Link
CN (1) CN117610624B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118114733A (en) * 2024-03-11 2024-05-31 重庆邮电大学 A LSTM forward propagation acceleration method based on systolic array
CN119761434B (en) * 2024-12-10 2026-04-14 电子科技大学 Neural network matrix multiplication accelerator and deployment method thereof
CN121456272B (en) * 2026-01-08 2026-04-03 新紫光集团有限公司 Fine granularity quantization matrix multiplication device and method based on systolic array

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329910A (en) * 2020-10-09 2021-02-05 东南大学 A deep convolutional neural network compression method for structure pruning combined with quantization
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070198815A1 (en) * 2005-08-11 2007-08-23 Coresonic Ab Programmable digital signal processor having a clustered SIMD microarchitecture including a complex short multiplier and an independent vector load unit
CN103700061B (en) * 2013-12-30 2016-08-17 东北大学 A kind of digital figure watermark based on compressed sensing embeds and extracting method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329910A (en) * 2020-10-09 2021-02-05 东南大学 A deep convolutional neural network compression method for structure pruning combined with quantization
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Also Published As

Publication number Publication date
CN117610624A (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN117610624B (en) A LSTM accelerator and acceleration method based on systolic array
CN113424202B (en) Tuning activation compression for neural network training
CN116663626B (en) Sparse pulse neural network accelerator based on ping-pong architecture
CN109635944B (en) A sparse convolutional neural network accelerator and implementation method
US11562247B2 (en) Neural network activation compression with non-uniform mantissas
Gupta et al. Masr: A modular accelerator for sparse rnns
CN107688849B (en) A dynamic strategy fixed-point training method and device
US10691996B2 (en) Hardware accelerator for compressed LSTM
Yang et al. Legonet: Efficient convolutional neural networks with lego filters
CN113273082A (en) Neural network activation compression with exception block floating point
CN111368988B (en) A Deep Learning Training Hardware Accelerator Exploiting Sparsity
CN106650922A (en) Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN107330514A (en) A kind of Air Quality Forecast method based on integrated extreme learning machine
CN113222133A (en) FPGA-based compressed LSTM accelerator and acceleration method
CN111985626B (en) System, method and storage medium for accelerating RNN (radio network node)
WO2018228399A1 (en) Computing device and method
CN113487012A (en) Deep convolution neural network accelerator facing FPGA and design method
CN116664705A (en) Heterogeneous calculation acceleration method for intelligent lossless compression algorithm
CN112561036B (en) HE-LSTM network structure and its corresponding FPGA hardware accelerator
CN116743182A (en) A data lossless compression method
CN118626408A (en) An in-memory computing system
Wu et al. A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation
Yin et al. Minimizing area and energy of deep learning hardware design using collective low precision and structured compression
KR20230080305A (en) Optimization for ann model and npu
CN118095381B (en) Low-delay low-energy-consumption pulse neural network processor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant