CN104320668B - HEVC/H.265 DCT Transformation and Inverse Transformation SIMD Optimization Method

Info

Publication number: CN104320668B
Application number: CN201410608208.1A
Authority: CN
Inventors: 张小云; 黎凌宇; 高志勇; 陈立
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2014-10-31
Filing date: 2014-10-31
Publication date: 2017-08-01
Anticipated expiration: 2034-10-31
Also published as: CN104320668A

Abstract

The present invention provides a SIMD optimization method for the DCT and inverse DCT of HEVC/H.265. First, the input data is preprocessed: it is loaded from memory into registers, treated as data vectors, and the vector data is interleaved and rearranged; for the vertical-direction DCT, the data is also right-shifted with rounding to fit the limited register bit width and increase computational parallelism. Next, a butterfly operation is performed on the preprocessed data, computing the sums and differences of corresponding data stage by stage. Dot products are then performed, summing the products of the intermediate values from the butterfly operation with the corresponding transform coefficients, which yields the product of the input data and the transform matrix. Finally, the matrix-product results are rounded to meet the bit-width limit of the output data and are output. The invention effectively accelerates the DCT and inverse-transform modules of an HEVC/H.265 video encoder on the Tilera platform.

Description

HEVC/H.265 DCT Transformation and Inverse Transformation SIMD Optimization Method

Technical Field

The present invention relates to the technical field of video coding and, in particular, to a SIMD acceleration and optimization method (based on the Tilera platform) for the DCT and inverse DCT of the HEVC/H.265 video coding standard. The method implements the HEVC DCT and inverse-transform modules with Tilera's SIMD instruction set to increase running speed.

Background

With the growth of video content and the rapid development of video products, the video content industry chain faces increasing pressure. AVC (Advanced Video Coding) compression can no longer meet the requirements of video transmission, and more efficient video compression technology has emerged in response. Moreover, the future video market is moving toward requirements, such as 3D TV and 4K TV, that exceed the capabilities of AVC coding. For 4K TV, even with H.264 encoding a bit rate of 24-32 Mbit/s is required, so AVC has become a bottleneck for the development of 4K TV services. Against this background, the new video coding standard High Efficiency Video Coding (HEVC) came into being. The development of HEVC dates back to 2004; after nearly ten years of work, a complete committee draft was formed in February 2012, and HEVC formally became an international standard in January 2013. The goal of HEVC is to improve coding efficiency by 50% over AVC, at 2 to 10 times the complexity of AVC. HEVC mainly targets HD, UHD, and 3D TV services, whose data volume is far larger than that of earlier video. In addition, HEVC calls for a much higher compression ratio, and higher compression comes at the cost of greater algorithmic complexity. For both reasons, an HEVC encoder places higher demands on the computing performance of the system.

To reduce HEVC encoding complexity, the usual approaches are algorithm optimization, instruction-set optimization, and parallel optimization. Instruction-set optimization implements computing modules with the instruction set of the target platform. SIMD (single instruction, multiple data) technology processes several data items in parallel within one instruction cycle; compared with a conventional implementation it greatly reduces the number of instruction cycles and increases running speed, while still producing correct results. In video coding, SIMD is widely used for data-intensive computation in modules such as sub-pixel interpolation, SAD, DCT/IDCT, and residual calculation.

To implement an HEVC encoder on the Tilera platform, we ported the DCT/IDCT implementation of the HEVC reference software HM. The HEVC DCT module is far more complex than that of H.264. The H.264 transform coefficients are 1 and 2, so the H.264 transform needs only simple shifts and additions. HEVC supports transform blocks from 4x4 to 32x32, and its DCT coefficients are larger and more complex. The HEVC DCT therefore requires many multiplications, and its intermediate variables take larger values and need a greater bit width. In the vertical-direction DCT, the intermediate variables exceed the 16-bit range and need 17-19 bits; storing them in 32 bits sharply reduces the degree of data parallelism. In existing SIMD implementations of the HEVC DCT and IDCT on Intel and ARM, the vertical-direction DCT is slower than the horizontal-direction DCT.

Summary of the Invention

In view of the defects in the prior art, the object of the present invention is to provide a SIMD optimization method for the DCT and inverse DCT of HEVC/H.265. The method addresses the high computational complexity and slow encoding speed of the HEVC DCT and IDCT modules implemented in conventional C on the Tilera platform, and implements these modules with Tilera's SIMD instruction set to increase running speed.

To achieve the above object, the present invention provides a SIMD optimization method for the DCT and inverse DCT of HEVC/H.265, comprising the following steps:

First step: load the one-dimensional DCT input data from memory into registers and treat it as vector data.

Second step: rearrange and combine the vector data, perform the butterfly operation by adding and subtracting the input data stage by stage, and compute the intermediate-variable vectors.

Third step: if the one-dimensional DCT is a horizontal-direction transform, skip directly to the fifth step.

Fourth step: right-shift the intermediate variables with rounding to limit their bit width.

Fifth step: perform a dot product of each intermediate-variable vector with the corresponding coefficient vector.

Sixth step: rearrange and combine the dot-product results, perform a right shift with rounding, and save the output to the target memory.
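For orientation, the following is a scalar C sketch of the computation that the six steps accelerate, shown for the 4-point case. It assumes the 4x4 core transform matrix used by the HM reference software and HM's convention of writing the output transposed; the function name `dct4_scalar` is hypothetical, and the vertical-direction pre-rounding of the fourth step is omitted.

```c
#include <stdint.h>

/* 4-point HEVC core transform matrix (values assumed from the HM reference software). */
static const int16_t T4[4][4] = {
    {64,  64,  64,  64},
    {83,  36, -36, -83},
    {64, -64, -64,  64},
    {36, -83,  83, -36},
};

/* Scalar reference for one 1-D pass over `line` rows of 4 samples.
 * The output is written transposed, so a second call performs the other direction.
 * Relies on arithmetic right shift of negative values, as common compilers provide. */
static void dct4_scalar(const int16_t *src, int16_t *dst, int shift, int line)
{
    const int32_t add = 1 << (shift - 1);   /* rounding offset of the sixth step */
    for (int j = 0; j < line; j++) {
        /* Second step: butterfly (sums feed even rows, differences feed odd rows). */
        int32_t E0 = src[0] + src[3], O0 = src[0] - src[3];
        int32_t E1 = src[1] + src[2], O1 = src[1] - src[2];

        /* Fifth and sixth steps: two-element dot products, then rounding right shift. */
        dst[0 * line] = (int16_t)((T4[0][0] * E0 + T4[0][1] * E1 + add) >> shift);
        dst[2 * line] = (int16_t)((T4[2][0] * E0 + T4[2][1] * E1 + add) >> shift);
        dst[1 * line] = (int16_t)((T4[1][0] * O0 + T4[1][1] * O1 + add) >> shift);
        dst[3 * line] = (int16_t)((T4[3][0] * O0 + T4[3][1] * O1 + add) >> shift);

        src += 4;
        dst++;
    }
}
```

The butterfly halves the number of multiplications per output row: each output needs two products instead of four.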

Preferably, in the second step, the butterfly operation is completed by parallel addition and subtraction of the input data, performed stage by stage, to compute the intermediate-variable vectors used in the dot-product calculation.

Preferably, in the fourth step, the one-dimensional DCT applies rounding pre-processing to the intermediate variables. To preserve parallelism and achieve good acceleration, for the vertical-direction one-dimensional DCT the intermediate variables are right-shifted with rounding so that their bit width stays within 16 bits. The right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted and rounded, MIVO is the maximum order of the intermediate variable, and y is the result after the right shift with rounding.

At the same time, the input variable shift must be adjusted according to MIVO so that it compensates for the right shift already applied to the intermediate variables. The shift value is the number of bits by which the data is right-shifted and rounded at the DCT output.

Preferably, in the fifth step, a single dot-product instruction replaces multiple multiplications and additions, which effectively accelerates the sum-of-products of the intermediate variables and the corresponding coefficients. Further, a dot product of four intermediate variables with the corresponding coefficients is completed directly by one four-element vector dot-product instruction; a dot product of more than four intermediate variables with the corresponding coefficients is completed by several four-element vector dot products; and a dot product of two intermediate variables with the corresponding coefficients is completed with parallel multiplication and addition instructions.

Preferably, in the sixth step, parallel addition, parallel right shift, and data recombination are used to right-shift and round the dot-product results in parallel. The right shift with rounding is computed as in formula (2):

y = (x + offset) >> shift
offset = 1 << (shift - 1)
error = 2^(shift-1)    (2)

where x is the data to be right-shifted and rounded, and y is the result after the right shift with rounding. offset is the compensation value used for rounding and is derived from the right-shift amount shift. The shift value is derived from the DCT block size N (4, 8, 16, 32) and the data bit width B (8 bit, 10 bit): the horizontal-direction shift depends on N and B, while the vertical-direction shift depends only on N. The shift value can therefore be used to determine whether a one-dimensional DCT is horizontal or vertical. error is the maximum error introduced by the right shift with rounding.

Compared with the prior art, the present invention has the following beneficial effects:

The method provided by the present invention uses the instruction set of the Tilera platform to apply SIMD optimization to the DCT and IDCT modules, effectively accelerating the HEVC DCT and IDCT modules on the Tilera platform with only a slight performance loss. Verification shows that, compared with an ordinary C implementation, the invention reduces the instruction-cycle count of the HEVC DCT and IDCT modules on the Tilera platform by 40%-70% on average, while the BD-PSNR (PSNR at the same quality) loss is less than 0.003.

Brief Description of the Drawings

Other features, objects, and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a flow chart of the one-dimensional DCT of a preferred embodiment of the present invention;

Fig. 2 is a flow chart of the one-dimensional IDCT of a preferred embodiment of the present invention;

Fig. 3 is a schematic diagram of the butterfly operation computing the intermediate variables in a preferred embodiment of the present invention;

Fig. 4 is a flow chart of the pre-processing of the intermediate variables in a preferred embodiment of the present invention;

Fig. 5 is a schematic diagram of the vector dot products performed in the butterfly operation of a preferred embodiment of the present invention, in which (a) shows a four-element vector dot product and (b) shows a two-element vector dot product;

Fig. 6 is a schematic diagram of the subsequent rounding operation applied to the butterfly results in a preferred embodiment of the present invention;

Fig. 7 is a schematic diagram of the data transposition in the one-dimensional IDCT of a preferred embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention; these all fall within the protection scope of the present invention.

As shown in Figs. 1 and 2, this embodiment provides a SIMD optimization method (based on the Tilera platform) for the DCT and inverse DCT of HEVC/H.265. The specific implementation steps are as follows:

Step (1): load the one-dimensional DCT input data from memory into registers and treat it as vector data. Specifically:

Data is read in batches from the memory block that holds the input. Several adjacent narrow input values are loaded into one 64-bit register, forming a data vector.
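A minimal portable C sketch of step (1), assuming 16-bit samples and a lane order with sample 0 in the lowest bits. The helper names `load_4x16` and `lane16` are hypothetical; on the target platform the four explicit shifts would normally collapse into a single 64-bit load.

```c
#include <stdint.h>

/* Pack four adjacent 16-bit samples into one 64-bit word.
 * Lane 0 occupies bits 0..15, lane 3 occupies bits 48..63. */
static uint64_t load_4x16(const int16_t *p)
{
    return  (uint64_t)(uint16_t)p[0]
         | ((uint64_t)(uint16_t)p[1] << 16)
         | ((uint64_t)(uint16_t)p[2] << 32)
         | ((uint64_t)(uint16_t)p[3] << 48);
}

/* Read lane i (0..3) back as a signed 16-bit value. */
static int16_t lane16(uint64_t v, int i)
{
    return (int16_t)(uint16_t)(v >> (16 * i));
}
```

Treating the 64-bit word as a vector of four lanes is what lets one later instruction act on four samples at once.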

Step (2): rearrange and combine the vector data, perform the butterfly operation by adding and subtracting the input data stage by stage, and compute the intermediate-variable vectors, as shown in Fig. 3. Specifically:

The HEVC DCT coefficients have strong odd-even symmetry. Exploiting this symmetry, the data is first added and subtracted stage by stage, and the results are then multiplied by the corresponding coefficients, which effectively reduces the number of multiplications. This is the principle of the butterfly operation. The data vectors loaded from memory into registers cannot be added and subtracted directly: the vector data must first be rearranged so that it matches the required operand layout, after which the subsequent computation proceeds. The Tilera platform has instructions that support arbitrary byte permutations drawn from two 64-bit values.

In the subsequent steps, the data must be rearranged repeatedly to match the operand layout of each operation.

Parallel add and subtract instructions then perform the vector additions and subtractions, so that a single instruction adds or subtracts several data items at once, reducing instruction cycles and achieving acceleration. As the DCT block grows, the number of butterfly stages also grows; while the intermediate variables are computed stage by stage, the results must be rearranged and recombined to match the operand layout of the following operations.
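The rearrangement and parallel add/subtract of step (2) can be sketched in portable C for one 4-point row. This is an emulation under stated assumptions: the lane-wise add and subtract use a generic SWAR (SIMD within a register) bit trick rather than the platform's native packed instructions, and all function names are hypothetical.

```c
#include <stdint.h>

#define LANE_MSB 0x8000800080008000ULL   /* top bit of each 16-bit lane */

/* Pack four 16-bit values, lane 0 in the lowest bits. */
static uint64_t pack4(int16_t a, int16_t b, int16_t c, int16_t d)
{
    return  (uint64_t)(uint16_t)a        | ((uint64_t)(uint16_t)b << 16)
         | ((uint64_t)(uint16_t)c << 32) | ((uint64_t)(uint16_t)d << 48);
}

/* Rearrangement: reverse the lane order, [x0 x1 x2 x3] -> [x3 x2 x1 x0]. */
static uint64_t reverse_4x16(uint64_t v)
{
    return (v >> 48)
         | ((v >> 16) & 0x00000000FFFF0000ULL)
         | ((v << 16) & 0x0000FFFF00000000ULL)
         | (v << 48);
}

/* Lane-wise add: each 16-bit lane wraps independently, no carry crosses lanes. */
static uint64_t add_4x16(uint64_t a, uint64_t b)
{
    return ((a & ~LANE_MSB) + (b & ~LANE_MSB)) ^ ((a ^ b) & LANE_MSB);
}

/* Lane-wise subtract: no borrow crosses lanes. */
static uint64_t sub_4x16(uint64_t a, uint64_t b)
{
    return ((a | LANE_MSB) - (b & ~LANE_MSB)) ^ ((a ^ ~b) & LANE_MSB);
}
```

With `x = [x0 x1 x2 x3]` and `r = reverse_4x16(x)`, lanes 0 and 1 of `add_4x16(x, r)` hold the even-part sums `x0+x3` and `x1+x2`, and lanes 0 and 1 of `sub_4x16(x, r)` hold the odd-part differences `x0-x3` and `x1-x2`.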

Step (3): if the one-dimensional DCT is a horizontal-direction transform, skip directly to step (5), as shown in Fig. 4. Specifically:

The input variable shift differs between the horizontal-direction and vertical-direction one-dimensional DCTs, so the value of shift indicates whether the current transform is horizontal or vertical.
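As an illustration of this test, the sketch below assumes the shift values used by the HM reference software, which the text does not spell out: log2(N) - 1 + (B - 8) for the first (horizontal) pass and log2(N) + 6 for the second (vertical) pass. The function names are hypothetical.

```c
#include <stdbool.h>

/* Integer log2 for power-of-two block sizes 4, 8, 16, 32. */
static int log2_int(int n)
{
    int k = 0;
    while (n > 1) { n >>= 1; k++; }
    return k;
}

/* Assumed HM first-pass shift: depends on block size N and bit depth B. */
static int shift_horizontal(int n, int bit_depth)
{
    return log2_int(n) - 1 + (bit_depth - 8);
}

/* Assumed HM second-pass shift: depends on block size N only. */
static int shift_vertical(int n)
{
    return log2_int(n) + 6;
}

/* Decide the direction from the shift value alone. */
static bool is_vertical_pass(int shift, int n)
{
    return shift == shift_vertical(n);
}
```

Under these assumptions the horizontal shift is at most 6 (N = 32, B = 10) and the vertical shift is at least 8 (N = 4), so the two ranges never overlap for 8-bit and 10-bit video.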

Step (4): right-shift the intermediate variables with rounding to limit their bit width, as shown in Fig. 4. Specifically:

The two-dimensional DCT consists of two one-dimensional DCTs: a horizontal-direction DCT and a vertical-direction DCT. In HEVC, the horizontal one-dimensional DCT is performed first; its input is the residual of the pixel values, a signed number 9 bits wide. While the butterfly operation computes the intermediate variables stage by stage, their bit width is guaranteed to stay within 16 bits. For the vertical one-dimensional DCT, however, the input is the output of the horizontal DCT and is 16 bits wide, so the intermediate variables of its butterfly operation cannot be stored in 16 bits without serious overflow errors. Storing them in 32 bits reduces parallelism and gives poor acceleration. To solve this problem, the intermediate variables are right-shifted with rounding, discarding the low-order bits so that they fit in 16 bits; a small rounding error is accepted in exchange for parallelism.

The right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted and rounded, MIVO is the maximum order of the intermediate variable, and y is the result after the right shift with rounding.

In addition, the input variable shift must be adjusted according to MIVO so that it compensates for the right shift already applied to the intermediate variables.
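Formula (1) and the matching shift adjustment can be written as the following C sketch. It assumes that MIVO is the number of low-order bits discarded (at least 1) and that the compensation is a plain subtraction from the output shift; both helper names are hypothetical.

```c
#include <stdint.h>

/* Formula (1): rounding right shift that brings a wide intermediate value
 * back into the 16-bit range. Requires mivo >= 1. */
static int16_t preround(int32_t x, int mivo)
{
    return (int16_t)((x + (1 << (mivo - 1))) >> mivo);
}

/* Assumed compensation: the output stage shifts by mivo fewer bits,
 * because those bits were already removed from the intermediate values. */
static int compensated_shift(int shift, int mivo)
{
    return shift - mivo;
}
```

For example, a 17-bit intermediate value of 100000 with MIVO = 2 becomes 25000, which fits in 16 bits, and a vertical-pass shift of 9 would then be reduced to 7.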

Step (5): perform a dot product of each intermediate-variable vector with the corresponding coefficient vector, as shown in Fig. 5 (a) and (b). Specifically:

The Tilera platform provides vector dot-product instructions. A single instruction multiplies several data items and sums the products, which gives a good acceleration.

A dot product of four intermediate variables with the corresponding coefficients is completed directly by one four-element vector dot-product instruction. A dot product of more than four intermediate variables with the corresponding coefficients is completed by several four-element vector dot products. For a dot product of two intermediate variables with the corresponding coefficients, Tilera has no direct instruction; the present invention uses parallel multiplication and addition instructions instead.

The dot product of the intermediate variables is the core computation of the DCT and is dominated by multiplications. All operations before the dot product serve to prepare the data in the layout that the dot-product instruction expects.
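The three dot-product cases can be emulated in scalar C as follows. This sketch reproduces only the arithmetic; it does not use or name any platform instruction, and the function names are hypothetical.

```c
#include <stdint.h>

/* Pack four 16-bit values, lane 0 in the lowest bits. */
static uint64_t pack4(int16_t a, int16_t b, int16_t c, int16_t d)
{
    return  (uint64_t)(uint16_t)a        | ((uint64_t)(uint16_t)b << 16)
         | ((uint64_t)(uint16_t)c << 32) | ((uint64_t)(uint16_t)d << 48);
}

static int16_t lane16(uint64_t v, int i)
{
    return (int16_t)(uint16_t)(v >> (16 * i));
}

/* Four-element case: one dot product of four signed 16-bit lanes, 32-bit result. */
static int32_t dot4_16(uint64_t a, uint64_t c)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)lane16(a, i) * lane16(c, i);
    return acc;
}

/* More than four elements: chain several four-element dot products. */
static int32_t dot8_16(uint64_t a_lo, uint64_t a_hi, uint64_t c_lo, uint64_t c_hi)
{
    return dot4_16(a_lo, c_lo) + dot4_16(a_hi, c_hi);
}

/* Two-element case: lane-wise multiply, then add neighbouring products,
 * giving two independent two-element dot products per register. */
static void dot2x2_16(uint64_t a, uint64_t c, int32_t out[2])
{
    int32_t p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (int32_t)lane16(a, i) * lane16(c, i);
    out[0] = p[0] + p[1];
    out[1] = p[2] + p[3];
}
```

With HEVC coefficients no larger than 90 in magnitude and 16-bit intermediate values, each sum stays well inside the 32-bit range.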

Step (6): rearrange and combine the dot-product results, perform a right shift with rounding, and save the output to the target memory, as shown in Fig. 6. Specifically:

HEVC specifies that the output of the one-dimensional DCT is 16 bits wide. The dot-product results are 32 bits wide, so they must be rounded and stored as 16-bit values. The HEVC rounding is computed as in formula (2):

y = (x + offset) >> shift
offset = 1 << (shift - 1)
error = 2^(shift-1)    (2)

The dot-product results from step (5) are kept as 32-bit data, two results per 64-bit register. The offset value is first added to the results in parallel, and the results are then right-shifted by shift in parallel, which yields the 16-bit outputs. The outputs are rearranged and combined, and finally stored to the output memory.
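A scalar C sketch of this output stage, assuming two signed 32-bit dot products per 64-bit word and an arithmetic right shift for negative values; the function name `round_pack_2x32` is hypothetical.

```c
#include <stdint.h>

/* Apply formula (2) to two 32-bit results held in one 64-bit word and
 * repack them as two adjacent 16-bit outputs (result 0 in the low half). */
static uint32_t round_pack_2x32(uint64_t v, int shift)
{
    const int32_t offset = 1 << (shift - 1);          /* rounding offset */

    int32_t x0 = (int32_t)(uint32_t)v;                /* low 32-bit lane  */
    int32_t x1 = (int32_t)(uint32_t)(v >> 32);        /* high 32-bit lane */

    int16_t y0 = (int16_t)((x0 + offset) >> shift);
    int16_t y1 = (int16_t)((x1 + offset) >> shift);

    return (uint32_t)(uint16_t)y0 | ((uint32_t)(uint16_t)y1 << 16);
}
```

Two such packed results fill one 64-bit register, so four outputs can be written with a single store.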

The above are the detailed steps of the one-dimensional DCT. The one-dimensional IDCT is the inverse operation of the one-dimensional DCT, and its flow is similar; the difference is that the input data must be transposed, as shown in Fig. 7. In the one-dimensional DCT, the data used in one dot product is contiguous in memory, so it can be loaded into a register and used directly. In the one-dimensional IDCT, the input data used in one dot product is not contiguous in memory, so it cannot be used directly after loading. The present invention interleaves the loaded vectors, which are adjacent in memory, pairwise and stage by stage until the data is transposed: the n-th (n = 1, 2, 3, 4) elements of four four-element vectors form the n-th new four-element vector. The transposed data vectors can then be used directly in the subsequent computation.
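The stage-by-stage pairwise interleaving can be sketched in portable C for four vectors of four 16-bit elements. The interleave helpers are generic emulations with hypothetical names, not platform instructions, and the pairing order (rows 0 with 2, rows 1 with 3, then the results pairwise) is one arrangement that yields a transpose in two stages.

```c
#include <stdint.h>

/* Pack four 16-bit values, lane 0 in the lowest bits. */
static uint64_t pack4(int16_t a, int16_t b, int16_t c, int16_t d)
{
    return  (uint64_t)(uint16_t)a        | ((uint64_t)(uint16_t)b << 16)
         | ((uint64_t)(uint16_t)c << 32) | ((uint64_t)(uint16_t)d << 48);
}

/* Interleave the low two lanes: [a0 a1 . .], [b0 b1 . .] -> [a0 b0 a1 b1]. */
static uint64_t interleave_lo16(uint64_t a, uint64_t b)
{
    return  (a & 0xFFFFULL)
         | ((b & 0xFFFFULL) << 16)
         | ((a & 0xFFFF0000ULL) << 16)
         | ((b & 0xFFFF0000ULL) << 32);
}

/* Interleave the high two lanes: [. . a2 a3], [. . b2 b3] -> [a2 b2 a3 b3]. */
static uint64_t interleave_hi16(uint64_t a, uint64_t b)
{
    return interleave_lo16(a >> 32, b >> 32);
}

/* Transpose a 4x4 block of 16-bit elements in two interleave stages. */
static void transpose_4x4_16(const uint64_t r[4], uint64_t c[4])
{
    /* Stage 1: interleave rows 0 with 2 and rows 1 with 3. */
    uint64_t t0 = interleave_lo16(r[0], r[2]);
    uint64_t t1 = interleave_hi16(r[0], r[2]);
    uint64_t t2 = interleave_lo16(r[1], r[3]);
    uint64_t t3 = interleave_hi16(r[1], r[3]);

    /* Stage 2: interleave the stage-1 results; column n ends up in c[n]. */
    c[0] = interleave_lo16(t0, t2);
    c[1] = interleave_hi16(t0, t2);
    c[2] = interleave_lo16(t1, t3);
    c[3] = interleave_hi16(t1, t3);
}
```

After the call, `c[n]` holds element n of each of the four input vectors, which is the layout the dot product of the IDCT needs.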

On the Tilera platform, the present invention effectively accelerates the DCT and inverse-transform modules of an HEVC/H.265 video encoder, achieves a good acceleration and optimization effect, and reduces HEVC encoding complexity.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims

1. A SIMD optimization method for the DCT and inverse DCT of HEVC/H.265, characterized by comprising the following steps:

First step: loading the one-dimensional DCT input data from memory into registers and treating it as vector data;

Second step: rearranging and combining the vector data, performing a butterfly operation by adding and subtracting the input data stage by stage, and computing intermediate-variable vectors;

Third step: if the one-dimensional DCT is a horizontal-direction transform, skipping directly to the fifth step;

Fourth step: right-shifting the intermediate variables with rounding to limit their bit width;

Fifth step: performing a dot product of each intermediate-variable vector with the corresponding coefficient vector, wherein a dot product of four intermediate variables with the corresponding coefficients is completed directly by one four-element vector dot-product instruction; a dot product of more than four intermediate variables with the corresponding coefficients is completed by several four-element vector dot products; and a dot product of two intermediate variables with the corresponding coefficients is completed with parallel multiplication and addition instructions;

Sixth step: rearranging and combining the dot-product results, performing a right shift with rounding, and saving the output to the target memory.

2. The SIMD optimization method for the DCT and inverse DCT of HEVC/H.265 according to claim 1, characterized in that, in the second step, the butterfly operation is completed by parallel addition and subtraction of the input data, performed stage by stage, to compute the intermediate-variable vectors used in the dot-product calculation.

3. The SIMD optimization method for the DCT and inverse DCT of HEVC/H.265 according to claim 1, characterized in that, in the fourth step, the one-dimensional DCT applies rounding pre-processing to the intermediate variables, and for the vertical-direction one-dimensional DCT the intermediate variables are right-shifted with rounding so that their bit width stays within 16 bits.

4. The SIMD optimization method for the DCT and inverse DCT of HEVC/H.265 according to claim 3, characterized in that, in the fourth step, the right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted and rounded, MIVO is the maximum order of the intermediate variable, and y is the result after the right shift with rounding;

at the same time, the input variable shift is adjusted according to MIVO so that it compensates for the right shift applied to the intermediate variables, the shift value being the number of bits by which the data is right-shifted and rounded at the DCT output.

5. The SIMD optimization method for the DCT and inverse DCT of HEVC/H.265 according to claim 1, characterized in that, in the sixth step, parallel addition, parallel right shift, and data recombination are used to right-shift and round the dot-product results in parallel.

6. The SIMD optimization method for the DCT and inverse DCT of HEVC/H.265 according to claim 5, characterized in that, in the sixth step, the right shift with rounding is computed as in formula (2):

y = (x + offset) >> shift
offset = 1 << (shift - 1)
error = 2^(shift-1)    (2)

where x is the data to be right-shifted and rounded, and y is the result after the right shift with rounding; offset is the compensation value used for rounding and is derived from the right-shift amount shift; the shift value is derived from the DCT block size N (4, 8, 16, 32) and the data bit width B (8 bit, 10 bit), the horizontal-direction shift depending on N and B and the vertical-direction shift depending on N; the direction of the one-dimensional DCT, horizontal or vertical, is determined from the shift value; and error is the maximum error introduced by the right shift with rounding.