JP5928914B2

JP5928914B2 - Graphics processing apparatus and graphics processing method

Info

Publication number: JP5928914B2
Application number: JP2014054021A
Authority: JP
Inventors: 佐藤　仁; 仁佐藤; 丈博冨永; 鹿子木　朋睦; 朋睦鹿子木
Original assignee: Sony Interactive Entertainment Inc
Current assignee: Sony Interactive Entertainment Inc
Priority date: 2014-03-17
Filing date: 2014-03-17
Publication date: 2016-06-01
Anticipated expiration: 2034-03-17
Also published as: JP2015176492A

Description

The present invention relates to a graphics processing technique for decompressing compressed textures.

High-quality graphics are increasingly used on personal computers and dedicated game consoles, for example to run applications such as games and simulations that use high-quality 3D computer graphics, or to play back video content that combines live action with computer graphics.

In general, graphics processing is carried out by a CPU and a graphics processing unit (GPU) working together. The CPU is a general-purpose processor that performs general-purpose operations, whereas the GPU is a dedicated processor for advanced graphics operations. The CPU performs geometry calculations such as projection transformation based on a three-dimensional model of an object, and the GPU receives vertex data and the like from the CPU and performs rendering. The GPU consists of dedicated hardware such as a rasterizer and a pixel shader, and performs graphics processing as a pipeline. Some recent GPUs have programmable shader functions, known as programmable shaders, and graphics libraries are generally provided to support shader programming.

In graphics processing, texture mapping is performed, in which a texture is applied to the surface of an object to express the feel of that surface. As the images used in applications such as games become more detailed, higher-resolution data is also used for textures, and texture data is growing in size. For example, the textures used in a game are on the order of GiB (gibibytes), and it is difficult to hold all of the necessary texture data in memory.

For this reason, uncompressed textures, or lightly compressed textures that the GPU can handle directly, are stored on a storage device such as a hard disk and loaded into a texture buffer in memory as needed for drawing. The time required to load a texture from the hard disk is unstable: it is usually tens of milliseconds but can sometimes reach several seconds. Consequently, when a texture cannot be loaded from the hard disk in time, the texture that was meant to be displayed is not available.

With highly compressed textures, on the other hand, even textures that would exceed the main memory capacity can be held in memory and handled without loading from the hard disk. However, highly compressed textures generally cannot be handled directly by the GPU, so dedicated hardware is needed to decompress them in real time. If no dedicated hardware is available, the CPU must decompress the compressed texture and expand it into the texture buffer, but decompression then takes time and real-time drawing becomes difficult.

The present invention has been made in view of these problems, and its object is to provide a graphics processing technique capable of decompressing compressed textures efficiently.

To solve the above problems, a graphics processing apparatus according to one aspect of the present invention includes a main memory and a graphics processing unit. The graphics processing unit includes a run-length decompression unit that performs run-length decompression of a compressed texture, and an inverse spatial frequency transform unit that restores the texture by applying an inverse spatial frequency transform to the run-length-decompressed texture. The main memory includes a texture pool that partially caches the restored texture.

Another aspect of the present invention is a graphics processing method. This is a graphics processing method for a graphics processing apparatus including a main memory and a graphics processing unit, in which the graphics processing unit, by means of a compute shader, performs run-length decompression of a compressed texture, restores the texture by applying an inverse spatial frequency transform to the run-length-decompressed texture, and stores the restored texture in a texture pool in the main memory that partially caches textures.

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, an apparatus, a system, a computer program, a data structure, a recording medium, and the like, are also effective as aspects of the present invention.

According to the present invention, compressed textures can be decompressed efficiently.

FIG. 1 is a configuration diagram of a graphics processing apparatus according to an embodiment.
FIGS. 2(a) to 2(c) are diagrams explaining mipmap textures.
FIG. 3 is a diagram explaining the PRT mechanism of the present embodiment.
FIGS. 4(a) to 4(e) are diagrams explaining the data amount of run-length-compressed textures.
FIGS. 5(a) to 5(c) are diagrams explaining run-length compression and decompression in the present embodiment.
FIG. 6 is a flowchart explaining the flow of run-length decompression in the present embodiment.
FIG. 7 is a diagram explaining, for comparison, the thread execution process when the branch destinations are not biased.
FIG. 8 is a diagram explaining the thread execution process when the branch destinations are biased.
FIG. 9 is a configuration diagram of a graphics processing apparatus according to another embodiment.
FIGS. 10(a) to 10(f) are diagrams explaining the data amount of Zlib-compressed textures.
FIGS. 11(a) and 11(b) are diagrams explaining the advantages of run-length compressing textures in the present embodiment.
FIG. 12 is a diagram explaining the performance of compressed-texture decompression by the graphics processing apparatus of the present embodiment.

(First Embodiment)
FIG. 1 is a configuration diagram of the graphics processing apparatus according to the first embodiment. The graphics processing apparatus includes a main processor 100, a graphics processing unit (GPU) 200, and a main memory 300.

The main processor 100 may be a single main processor, a multiprocessor system including a plurality of processors, or a multicore processor in which a plurality of processor cores are integrated in one package. The main processor 100 can read data from and write data to the main memory 300 via a bus.

The GPU 200 is a graphics chip containing a graphics processor core, and can read data from and write data to the main memory 300 via a bus.

The main processor 100 and the GPU 200 are connected by a bus and can exchange data with each other over it.

FIG. 1 shows only the configuration relating to texture processing within graphics processing; the configuration relating to other processing is omitted.

The memory area of the main memory 300 is memory-mapped into the address space referenced by the GPU 200 so that the GPU 200 can access it, and the GPU 200 can read texture data from the main memory 300. Texture data is partially cached in the main memory 300 using a method called PRT (Partially Resident Textures).

The main processor 100 includes a graphics calculation unit 20 and a PRT control unit 10. The graphics calculation unit 20 receives, from the graphics processing unit 50 of the GPU 200, an LOD (level of detail) value indicating the level of detail of a texture, and passes the LOD value to the PRT control unit 10. Based on the LOD value received from the graphics processing unit 50, the PRT control unit 10 calculates the mipmap textures that are likely to be needed, and updates the PRT mapping by instructing that they be expanded into the PRT cache 320, which is a texture pool, and by unmapping pages that are no longer used.

FIGS. 2(a) to 2(c) are diagrams explaining mipmap textures. Mipmap textures are a set of textures whose resolutions differ according to the level of detail (LOD). The mipmap texture 340 in FIG. 2(a) is a high-resolution texture. The mipmap texture 342 in FIG. 2(b) is a medium-resolution texture obtained by halving both the height and the width of the high-resolution mipmap texture 340 in FIG. 2(a). The mipmap texture 344 in FIG. 2(c) is a low-resolution texture obtained by halving both the height and the width of the medium-resolution mipmap texture 342 in FIG. 2(b).
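The halving relationship between the three mipmap levels can be sketched as follows. This is only an illustration: the function name `mip_chain` and the 2048 x 2048 starting size are assumptions, not values given in the source.

```python
def mip_chain(width, height, levels):
    """Return (width, height) per mipmap level, halving each side per level."""
    sizes = []
    for _ in range(levels):
        sizes.append((width, height))
        # Each coarser level halves both dimensions, never going below 1 texel.
        width = max(1, width // 2)
        height = max(1, height // 2)
    return sizes


# Hypothetical high-resolution size; the source does not state texture dimensions.
chain = mip_chain(2048, 2048, 3)
for level, (w, h) in enumerate(chain):
    print(f"level {level}: {w} x {h}")
```

Each level therefore holds one quarter of the texels of the level before it.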

Returning to FIG. 1, the PRT control unit 10 instructs the GPU 200 to read out the mipmap texture at the level of detail specified by the graphics calculation unit 20. More specifically, the PRT control unit 10 controls the run-length decompression unit 30 and the inverse discrete cosine transform (IDCT) unit 40 of the GPU 200, and also controls swap-in and swap-out of the PRT cache 320 stored in the main memory 300.

The GPU 200 includes a run-length decompression unit 30, an IDCT unit 40, and a graphics processing unit 50.

The run-length decompression unit 30 reads, from the main memory 300, the compressed texture 310 corresponding to the level of detail specified by the PRT control unit 10, run-length decompresses it, and stores the result in the DCT block ring buffer 80.

The IDCT unit 40 applies an inverse discrete cosine transform to the DCT blocks of the run-length-decompressed texture stored in the DCT block ring buffer 80, and stores the result in the PRT cache 320.

The graphics processing unit 50 reads the required mipmap textures from the PRT cache 320. The PRT cache 320 is a texture tile pool that partially caches textures: required textures are swapped in and unneeded ones are swapped out.

FIG. 3 is a diagram explaining the PRT mechanism of the present embodiment.

Regions for the mipmap textures 340, 342, and 344 are laid out in virtual memory. Each texture region is divided into chunks of a fixed size, and a page table 330 is used so that only the required texture regions are stored in the texture tile pool 360. Because the textures reside in the main memory 300 as compressed textures 310, caching a texture region in the texture tile pool 360 requires decompressing the compressed texture 310. The PRT control unit 10 controls the run-length decompression unit 30 and the IDCT unit 40 in accordance with requests from the graphics processing unit 50, and has the compressed texture 310 decompressed as needed.

In the example in the figure, chunk 352 of the high-resolution mipmap texture 340 and chunk 358 of the medium-resolution mipmap texture 342 are associated with pages 332 and 338 of the page table 330, respectively, and physical memory from the texture tile pool 360 is mapped to them.

On the other hand, chunk 354 of the high-resolution mipmap texture 340 and chunk 356 of the medium-resolution mipmap texture 342 are associated with pages 334 and 336 of the page table 330, respectively, but neither has yet had physical memory mapped from the texture tile pool 360. In this case, as described above, the PRT control unit 10 performs control, based on the LOD value received from the graphics processing unit 50, so that the required textures are present in the texture tile pool 360: physical memory of the texture tile pool 360 is allocated, and the required texture data is decompressed from the compressed texture 310 and stored in the texture tile pool 360. Meanwhile, the graphics processing unit 50 reads mipmap textures from the texture tile pool 360 using the LOD value it has calculated itself, without going through the main processor 100. If the mipmap texture corresponding to the calculated LOD value is not present in the texture tile pool 360, the graphics processing unit 50 falls back: it lowers the requested level of detail, reads a lower-resolution mipmap texture from the texture tile pool 360, and draws with it.
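The lookup-with-fallback behavior described above can be modeled with a small sketch. Everything here is hypothetical: the class `TileCache`, the oldest-first eviction policy, and the convention that level 0 is the finest level and chunk coordinates halve at each coarser level are assumptions chosen for illustration, not details given in the source.

```python
class TileCache:
    """Toy model of a page table backed by a fixed-size texture tile pool."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.resident = {}  # (lod, x, y) -> decompressed tile data

    def map_tile(self, lod, x, y, data):
        key = (lod, x, y)
        # Assumed policy: when the pool is full, unmap the oldest mapped tile.
        if key not in self.resident and len(self.resident) >= self.pool_size:
            oldest = next(iter(self.resident))
            del self.resident[oldest]
        self.resident[key] = data

    def sample(self, lod, x, y, max_lod):
        """Return (lod_used, tile); fall back to coarser levels if unmapped."""
        while lod <= max_lod:
            tile = self.resident.get((lod, x, y))
            if tile is not None:
                return lod, tile
            # Requested tile is not resident: lower the level of detail.
            lod, x, y = lod + 1, x // 2, y // 2
        return None


cache = TileCache(pool_size=4)
cache.map_tile(1, 0, 0, b"medium-resolution tile")
first = cache.sample(0, 1, 1, max_lod=2)   # level 0 not mapped yet, falls back
cache.map_tile(0, 1, 1, b"high-resolution tile")
second = cache.sample(0, 1, 1, max_lod=2)  # now served from level 0
```

In the apparatus itself the mapping is driven by the PRT control unit 10 and the fallback by the graphics processing unit 50; the sketch collapses both roles into one object for brevity.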

FIGS. 4(a) to 4(e) are diagrams explaining the data amount of run-length-compressed textures. As shown in FIG. 4(a), suppose the original texture data is in RGB 32-bit format and is, for example, 16 MiB (mebibytes) in size. FIG. 4(b) shows a texture compressed by the texture compression scheme called BC5 or BC7. Its compression ratio is about 1/4 relative to the original texture data, so the data amount can be reduced to 4 MiB while keeping relatively good quality. If relatively low quality is acceptable, a texture compressed by the texture compression scheme called BC1 or DXT1 may be used, as in FIG. 4(c). In this case the compression ratio is about 1/8 relative to the original texture data, and the data amount can be reduced to 2 MiB. FIGS. 4(a) to 4(c) are all texture formats that the GPU 200 can handle directly.

On the other hand, if a JPEG-compressed texture is used as in FIG. 4(d), a compression ratio of about 1/20 relative to the original texture data is obtained and the data amount can be reduced to 0.5 to 1 MiB, although the GPU 200 can no longer handle the texture directly. Running a complex algorithm such as JPEG decompression on the compute shader of the GPU 200 is inefficient, so without dedicated hardware capable of JPEG decompression it is difficult to decompress compressed textures in real time for use in graphics processing.

In contrast, as shown in FIG. 4(e), applying a discrete cosine transform (DCT) followed by run-length compression yields a compression ratio of about 1/10, reducing the data amount to 1 to 2 MiB. With compression this high, the compressed texture 310 can be kept resident in the main memory 300. The GPU 200 can read the compressed texture 310 from the main memory 300 and restore the texture in real time by performing run-length decompression and an inverse discrete cosine transform (IDCT) on the compute shader.
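The data amounts quoted for FIGS. 4(a) to 4(e) follow from the stated approximate ratios applied to the 16 MiB example. A quick arithmetic check, with variable names chosen here for illustration:

```python
MIB = 1024 * 1024
original_bytes = 16 * MIB  # uncompressed RGB 32-bit texture from the example

# Approximate compression ratios quoted in the text, as 1/denominator.
denominators = {
    "BC5/BC7": 4,
    "BC1/DXT1": 8,
    "DCT + run-length": 10,
    "JPEG": 20,
}

sizes_mib = {name: original_bytes / d / MIB for name, d in denominators.items()}
for name, size in sizes_mib.items():
    print(f"{name}: about {size:.1f} MiB")
```

The DCT plus run-length figure lands inside the 1 to 2 MiB range given in the text, and the JPEG figure inside 0.5 to 1 MiB; the ratios themselves are approximate, so these are indicative values only.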

A JPEG-compressed texture cannot be used directly by the GPU 200 and must first be decoded by a JPEG decoder. A graphics apparatus equipped with a JPEG codec can handle JPEG-compressed textures, but a JPEG codec is generally not available. JPEG compression applies a discrete cosine transform to the image, quantizes the result, and then performs Huffman coding. Because Huffman coding is a complex compression algorithm, the amount of computation would be enormous if the compute shader of the GPU 200 were to Huffman-decode JPEG-compressed textures.

In contrast, a simple computation such as run-length decompression can be executed efficiently by the compute shader of the GPU 200. The following describes, with reference to FIGS. 5 to 8, how the compute shader of the GPU 200 can perform run-length decompression efficiently.

FIGS. 5(a) to 5(c) are diagrams explaining run-length compression and decompression in the present embodiment. FIG. 5(a) shows the original data string, FIG. 5(b) the run-length-compressed data string, and FIG. 5(c) the run-length-decompressed data string.

Run-length compression in the present embodiment operates on bytes, and outputs unchanged any input value other than hexadecimal "00" and "ff". The first six input bytes "3f", "4d", "e8", "02", "a5", and "01", indicated by reference numeral 410 in FIG. 5(a), are neither "00" nor "ff", so they are encoded unchanged as the six output bytes "3f", "4d", "e8", "02", "a5", and "01", as shown in FIG. 5(b).

In run-length compression in the present embodiment, when the input value "00" occurs n times in succession, the run is encoded as the two output bytes "ff" and "n-1". For example, when seven "00" bytes occur in succession, as indicated by reference numeral 420 in FIG. 5(a), they are encoded as the two-byte output value "ff06", as indicated by reference numeral 422 in FIG. 5(b).

In run-length compression in the present embodiment, when a literal "ff" is input, it is converted to the two bytes "ff00" so that it can be identified as a literal "ff". The input value "ff" indicated by reference numeral 430 in FIG. 5(a) is encoded as the two-byte output value "ff00", as indicated by reference numeral 432 in FIG. 5(b).
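The three encoding rules above can be sketched as a small encoder. Two points are assumptions rather than statements from the source: a single isolated "00" is emitted unchanged (the pair "ff00" is reserved for a literal "ff", so a run of length one cannot use the "ff", "n-1" form), and runs longer than 256 are split because the count has to fit in one byte. The function name `rle_encode` is likewise illustrative.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode runs of 0x00 as (0xff, n-1) and a literal 0xff as (0xff, 0x00)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0xFF:
            # Literal ff is escaped as ff 00.
            out += b"\xff\x00"
            i += 1
        elif b == 0x00:
            # Count the run of zeros, capped at 256 so n-1 fits in one byte.
            run = 1
            while run < 256 and i + run < len(data) and data[i + run] == 0x00:
                run += 1
            if run == 1:
                out.append(0x00)  # assumption: a lone zero is passed through
            else:
                out += bytes([0xFF, run - 1])
            i += run
        else:
            # Any other byte is output unchanged.
            out.append(b)
            i += 1
    return bytes(out)


# Sample built from the values discussed for FIG. 5: six literal bytes,
# seven zeros, then a literal ff.
sample = bytes.fromhex("3f4de802a501") + bytes(7) + b"\xff"
encoded = rle_encode(sample)
print(encoded.hex())
```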

Run-length decompression in the present embodiment simply performs the inverse of run-length compression. The first six input bytes "3f", "4d", "e8", "02", "a5", and "01" in FIG. 5(b) are output unchanged, as shown in FIG. 5(c). For "ff06", indicated by reference numeral 422 in FIG. 5(b), the leading "ff" is converted to "00" and then six more "00" bytes are output, as indicated by reference numeral 424 in FIG. 5(c). The "ff00" indicated by reference numeral 432 in FIG. 5(b) denotes a literal "ff", so a single byte "ff" is output, as indicated by reference numeral 434 in FIG. 5(c).

FIG. 6 is a flowchart explaining the flow of run-length decompression in the present embodiment.

The variable RL indicates the number of times (n-1) that the output of "00" is to be repeated. Since its initial value is RL = 0 (No in S10), one byte is read from the input data string (S20). If the data read in step S20 is not "ff" (Yes in S22), the data is output unchanged (S24) and the process returns to step S10. If the data read in step S20 is "ff" (No in S22), the next byte is also read (S30).

If the data read in step S30 is "00" (Yes in S32), the "ff" read immediately before it is a literal value, so "ff" is output (S34) and the process returns to step S10.

If the data read in step S30 is not "00" (No in S32), the data is assigned to the variable RL (S40). RL now holds the number of times (n-1) that the output of "00" is to be repeated. The first "00" is then output (S42), and the process returns to step S10.

When the process returns to step S10 from step S24 or step S34, the variable RL is 0 (No in step S10), so the process proceeds to step S20 and the subsequent steps are repeated.

When the process returns to step S10 from step S42, the variable RL is n-1 (Yes in step S10), so 1 is subtracted from RL (S12), "00" is output (S14), and the process returns to step S10. Steps S12 and S14 are repeated until RL reaches 0, so that "00" is output (n-1) times.
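The flowchart steps S10 to S42 translate fairly directly into a loop with a single counter. The sketch below follows the step numbering in the comments; the function name `rle_decode` and the error raised on truncated input are assumptions added for illustration.

```python
def rle_decode(data: bytes) -> bytes:
    """Decode the byte stream following flowchart steps S10 to S42."""
    out = bytearray()
    pos = 0
    rl = 0  # remaining number of 00 bytes to repeat
    while rl > 0 or pos < len(data):
        if rl > 0:                      # S10: Yes
            rl -= 1                     # S12
            out.append(0x00)            # S14
            continue
        b = data[pos]                   # S20: read one byte
        pos += 1
        if b != 0xFF:                   # S22: not ff
            out.append(b)               # S24: output unchanged
            continue
        if pos >= len(data):
            raise ValueError("truncated input after ff")  # assumed handling
        nxt = data[pos]                 # S30: read the next byte
        pos += 1
        if nxt == 0x00:                 # S32: Yes, the ff was a literal
            out.append(0xFF)            # S34
        else:
            rl = nxt                    # S40: RL = n-1
            out.append(0x00)            # S42: output the first 00
    return bytes(out)


decoded = rle_decode(bytes.fromhex("3f4de802a501ff06ff00"))
print(decoded.hex())
```

For the input "ff06" the loop outputs one "00" at S42 and six more through S12 and S14, giving seven in total.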

In texture compression in the present embodiment, each image block undergoes a discrete cosine transform (DCT), is then quantized, and is finally run-length compressed. When a natural image is subjected to a discrete cosine transform, most of the frequency components concentrate in the low-frequency region and the high-frequency components become negligibly small. After quantization in particular, the DCT coefficients of the high-frequency components are almost all zero. As a result, the input to run-length compression often contains long runs of zeros.
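The effect of DCT followed by quantization can be seen on a tiny example. This sketch uses a one-dimensional 8-sample DCT-II for brevity, whereas the embodiment works on image blocks; the sample values, the quantization step of 16, and the function names are all assumptions.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a list of samples."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(
            x * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i, x in enumerate(samples)
        )
        coeffs.append(scale * total)
    return coeffs


def quantize(coeffs, step):
    """Uniform quantization: divide by the step and round to an integer."""
    return [round(c / step) for c in coeffs]


# A smooth ramp, standing in for a slowly varying region of a natural image.
ramp = [100, 102, 104, 106, 108, 110, 112, 114]
quantized = quantize(dct_1d(ramp), step=16)
print(quantized)
```

For this smooth input only the two lowest-frequency coefficients survive quantization, and the remaining six form a run of zeros of the kind the run-length coder is designed to collapse.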

Let steps S10, S12, and S14 in FIG. 6 be branch A, steps S20, S22, and S24 be branch B, steps S30, S32, and S34 be branch C, and steps S40 and S42 be branch D. Because texture data after the discrete cosine transform often contains long runs of zeros, run-length decompression passes through branch A extremely often. For the texture of a typical natural image, it has been confirmed experimentally that roughly 80% or more of the passes go through branch A. This property of run-length decompression allows the compute shader of the GPU 200 to perform it efficiently. The reason is that the GPU 200 has a SIMD (Single Instruction Multiple Data) architecture in which multiple threads execute the same instruction simultaneously on different data, so when the branch conditions are biased, the degree of parallelism rises and execution efficiency improves.
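To see how strongly a zero-heavy stream favors branch A, the decoder loop can be instrumented to count which branch each pass ends in. This is a sketch on a constructed input (four nonzero bytes followed by sixty zeros, roughly what one quantized 8 x 8 block might look like); it does not reproduce the experimental figure quoted above, and the function name is hypothetical.

```python
from collections import Counter


def branch_histogram(encoded: bytes) -> Counter:
    """Count how many decoder passes end in each of branches A to D."""
    hist = Counter()
    pos, rl = 0, 0
    while rl > 0 or pos < len(encoded):
        if rl > 0:
            rl -= 1
            hist["A"] += 1      # S10, S12, S14: repeat a zero
            continue
        b = encoded[pos]
        pos += 1
        if b != 0xFF:
            hist["B"] += 1      # S20, S22, S24: literal byte
            continue
        nxt = encoded[pos]      # assumes well-formed input
        pos += 1
        if nxt == 0x00:
            hist["C"] += 1      # S30, S32, S34: literal ff
        else:
            rl = nxt
            hist["D"] += 1      # S40, S42: start of a zero run
    return hist


# Four literal bytes, then a run of 60 zeros encoded as ff 3b (0x3b = 59).
hist = branch_histogram(bytes.fromhex("1203fe01ff3b"))
share_a = hist["A"] / sum(hist.values())
print(hist, f"branch A share: {share_a:.0%}")
```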

In the GPU 200, a single program counter (PC) refers to an instruction stored in the instruction cache and, for example, 16 ALUs (arithmetic logic units) simultaneously execute the instruction referred to by the PC. For each branch of an if-then-else construct, a different instruction is set for the 16 threads and executed simultaneously. In the if branch, the threads responsible for pixels for which the if condition holds (True) are enabled and executed in parallel; in the else branch, the threads responsible for pixels for which the else condition holds (False) are enabled and executed in parallel. When the if and else conditions hold about equally often, the set of enabled threads has to be switched frequently between the True and False cases. When the outcomes are skewed, for example 80% if and 20% else, the set of threads enabled for the True case can be reused repeatedly, which raises execution efficiency. This point is explained in more detail with reference to FIGS. 7 and 8.

FIG. 7 is a diagram explaining, for comparison, the thread execution process when the branch destinations are not biased.

The GPU 200 includes a plurality of computing units. The number of threads executed simultaneously in one computing unit of the GPU 200 is determined by the number of arithmetic units in the computing unit; here it is assumed to be 16. A group of up to 16 threads that can be issued to one computing unit at the same time is called a "thread set". Every thread in a thread set executes the same shader program, but each processes different data, and when the program contains branches the threads may take different branch destinations. In a given cycle, one computing unit executes one thread set (here, up to 16 threads) in parallel.

For example, even if each branch destination requires only a few instructions, there is only one program counter and all arithmetic units in the computing unit execute the same instruction in a SIMD structure. The instructions of each branch are therefore executed one at a time, with a thread mask changing which threads are executed.

As an example, suppose that in the flowchart of FIG. 6 branch A is executed in 3 instructions, branch B in 4 instructions, branch C in 2 instructions, and branch D in 5 instructions. The example in FIG. 7 illustrates the case where the branch destinations of the 16 threads in thread set 450 are, in order, A, A, C, A, A, A, C, B, C, A, C, A, C, A, C, D.

Starting at cycle 1, only the threads that execute branch A (eight threads in this case) are enabled, and the three instructions A-1, A-2, and A-3 of branch A are executed while the program counter advances one step at a time.

In cycle 4, only the thread that executes branch B (one thread in this case) is enabled, and the four instructions B-1, B-2, B-3, and B-4 of branch B are executed while the program counter advances one step at a time.

In cycle 8, only the threads that execute branch C (six threads in this case) are enabled, and the two instructions C-1 and C-2 of branch C are executed while the program counter advances one step at a time.

In cycle 10, only the thread that executes branch D (one thread in this case) is enabled, and the five instructions D-1, D-2, D-3, D-4, and D-5 of branch D are executed while the program counter advances one step at a time.

Thus, in the example of FIG. 7, the 16 threads in the thread set need 14 cycles to execute all the instructions of the four branches A to D.
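The cycle count for FIG. 7 can be reproduced with a short simulation. This is only a sketch: it assumes one instruction per cycle and that the branches are serialized in the order A, B, C, D, as in the walkthrough above. The names `BRANCH_LENGTH` and `schedule` are illustrative and do not appear in the original.

```python
from collections import Counter

# Instruction counts per branch, as stated for the flowchart of FIG. 6.
BRANCH_LENGTH = {"A": 3, "B": 4, "C": 2, "D": 5}


def schedule(targets, order="ABCD"):
    """Serialize the branches taken by one thread set.

    Returns a list of (branch, active_threads, first_cycle, last_cycle)
    and the total number of cycles. Assumes one instruction per cycle.
    """
    counts = Counter(targets)
    cycle = 1
    plan = []
    for branch in order:
        if counts[branch] == 0:
            continue  # no thread took this branch, so it costs no cycles
        last = cycle + BRANCH_LENGTH[branch] - 1
        plan.append((branch, counts[branch], cycle, last))
        cycle = last + 1
    return plan, cycle - 1


# Branch destinations of the 16 threads in thread set 450 (FIG. 7).
fig7 = list("AACAAACBCACACACD")
plan, total = schedule(fig7)
for branch, active, first, last in plan:
    print(f"branch {branch}: {active} active thread(s), cycles {first}-{last}")
print("total cycles:", total)
```

Under these assumptions the total comes to 14 cycles, which agrees with the count given for FIG. 7.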

FIG. 8 is a diagram explaining how threads are executed when the branch destinations are biased. FIG. 8 illustrates the case where the branch destinations of the 16 threads in the thread set 452 are, in order, A, A, C, A, A, A, C, C, C, A, C, A, C, A, C, A. In this example the shader program has four possible branch destinations, but the pixels that satisfy each branch condition are unevenly distributed, so only two destinations, branch A and branch C, actually occur. The 16 threads in the thread set need to execute only these two branches.

In cycle 1, only the threads that execute branch A (nine threads in this case) are enabled, and the three instructions A-1, A-2, and A-3 of branch A are executed while the program counter advances one step at a time.

In cycle 4, only the threads that execute branch C (seven threads in this case) are enabled, and the two instructions C-1 and C-2 of branch C are executed while the program counter advances one step at a time.

Thus, in the example of FIG. 8, the 16 threads in the thread set only need to execute the instructions of the two branches A and C, and the required number of cycles falls to 5.

When the nature of the input data biases the branch destinations of the program in this way, instructions can be executed repeatedly with the same thread mask, and execution efficiency improves. When the branch destinations vary, the thread mask must be switched for every branch, and execution efficiency falls.
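One way to put a number on this difference is to compare how many of the 16 lanes do useful work per cycle in the two examples. The sketch below uses a "lane utilization" metric that is not part of the original text; it is introduced here purely for illustration and again assumes one instruction per cycle.

```python
from collections import Counter

# Instruction counts per branch, as stated for the flowchart of FIG. 6.
BRANCH_LENGTH = {"A": 3, "B": 4, "C": 2, "D": 5}
LANES = 16  # threads per thread set, as assumed in the text


def lane_utilization(targets):
    """Return (cycles, fraction of lane-cycles spent on active threads)."""
    counts = Counter(targets)
    # Every branch that at least one thread takes must be executed once.
    cycles = sum(BRANCH_LENGTH[b] for b in counts)
    # Only the threads that took a branch are unmasked while it runs.
    useful = sum(BRANCH_LENGTH[b] * n for b, n in counts.items())
    return cycles, useful / (LANES * cycles)


fig7 = list("AACAAACBCACACACD")  # scattered destinations (thread set 450)
fig8 = list("AACAAACCCACACACA")  # biased destinations (thread set 452)

cycles7, util7 = lane_utilization(fig7)
cycles8, util8 = lane_utilization(fig8)
print(f"FIG. 7: {cycles7} cycles, utilization {util7:.1%}")
print(f"FIG. 8: {cycles8} cycles, utilization {util8:.1%}")
```

With the branch lengths given earlier, the scattered case keeps roughly a fifth of the lane-cycles busy, while the biased case keeps about half of them busy.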

This is where the advantage of applying run-length compression after the discrete cosine transform of the texture lies. Because of the characteristics of DCT coefficients derived from natural images, non-zero values concentrate in the low-frequency components at the upper left of the DCT coefficient matrix, and zeros run consecutively through the high-frequency components at the lower right. Consequently, when an image block after the discrete cosine transform is rearranged into a one-dimensional array along a zigzag pattern, the DCT coefficients of every block tend to form a data sequence that begins with a run of non-zero values and ends with a run of consecutive zeros.
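The effect of the zigzag pattern can be sketched as follows. The 8x8 block size and the JPEG-style scan direction are assumptions made for illustration, and the coefficient values are made up; `zigzag_order` and `flatten` are hypothetical helper names.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in zigzag scan order.

    Cells are visited one anti-diagonal at a time (row + col constant),
    and the direction alternates from one diagonal to the next.
    """
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def flatten(block):
    """Rearrange a square block into a 1-D list along the zigzag pattern."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Hypothetical coefficient block: non-zero only in the low-frequency
# (upper-left) corner, zero everywhere else.
N = 8
block = [[(9 - 3 * (r + c)) if r + c < 3 else 0 for c in range(N)]
         for r in range(N)]

flat = flatten(block)
print(flat[:8])                      # leading non-zero values, then zeros
print("trailing zeros:", flat[6:].count(0))
```

In this made-up block the six low-frequency coefficients come first and the remaining 58 entries form a single run of zeros, which is the shape that run-length compression handles well.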

In view of this tendency of the DCT coefficients, run-length compressed data is assigned to the threads of a thread set so that each thread processes the DCT coefficients of a different DCT block, and the thread set is configured so that every thread performs run-length decompression of the DCT coefficient at the same relative position within its DCT block. The branch destination is one of branches A to D depending on whether the value of the DCT coefficient is "00", "ff", or something else. Because DCT coefficients at the same relative position within a DCT block tend to be similar, this thread-set configuration biases the branch destinations of the threads in the set toward the same branch. As a result, the branch destinations are concentrated as in FIG. 8 rather than scattered as in FIG. 7, and the thread set can remain in an efficient execution state for a long time. Run-length decompression is thus executed efficiently by the thread set.
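The thread-set configuration described above can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: blocks are 64 coefficients long, the branch decision is reduced to "zero" versus "non-zero", and the per-block counts of leading non-zero coefficients are invented. It works on already-expanded coefficients and does not model the compressed stream itself.

```python
LANES = 16      # threads per thread set, as assumed in the text
BLOCK_LEN = 64  # assumed 8x8 DCT block flattened in zigzag order


def make_block(nonzero_count):
    """Hypothetical block: leading non-zero values, then zeros."""
    return [1] * nonzero_count + [0] * (BLOCK_LEN - nonzero_count)


def classify(value):
    """Simplified stand-in for the branch decision on a coefficient."""
    return "zero" if value == 0 else "nonzero"


def divergent_steps(blocks):
    """Count positions at which the lanes do not all take the same branch.

    Lane i owns blocks[i]; at step k every lane looks at coefficient k
    of its own block, i.e. the same relative position in each block.
    """
    steps = 0
    for k in range(BLOCK_LEN):
        classes = {classify(block[k]) for block in blocks}
        if len(classes) > 1:
            steps += 1
    return steps


# Invented counts of leading non-zero coefficients for 16 blocks.
nonzero_counts = [10, 12, 9, 11, 10, 13, 8, 10, 12, 11, 9, 10, 14, 10, 11, 12]
blocks = [make_block(n) for n in nonzero_counts]

mixed = divergent_steps(blocks)
print(f"{mixed} of {BLOCK_LEN} positions have mixed branch destinations")
```

In this model the lanes disagree only in the narrow band of positions where some blocks have already reached their zero run and others have not; at every other position all 16 lanes take the same branch.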

According to the graphics processing apparatus of this embodiment, textures that are run-length compressed after the discrete cosine transform are used, so the texture size can be greatly reduced. Because the compute shader of the GPU 200 performs run-length decompression and the inverse discrete cosine transform on the compressed texture, the compressed texture can be decompressed at high speed and fed into graphics processing. A highly compressed texture can be kept resident in memory, so there is no need to read large textures from a storage device such as a hard disk, and PRT can be executed on-memory. Because the compressed texture is held in memory, latency is short and texture processing can run in real time even in a configuration where the compressed texture is read as needed, decompressed, and swapped into the PRT cache.

(Second Embodiment)
FIG. 9 is a configuration diagram of the graphics processing apparatus according to the second embodiment. The graphics processing apparatus according to the second embodiment differs from the first embodiment in that it includes a Zlib engine 60. Description of the configuration shared with the first embodiment is omitted where appropriate, and the explanation focuses on the configuration that differs from the first embodiment.

The Zlib engine 60 is a dedicated circuit that performs Zlib decompression. Zlib is a data compression and decompression library that implements the lossless compression algorithm called Deflate.

In this embodiment, the compressed texture 310 is a texture that has been discrete cosine transformed, run-length compressed, and then losslessly compressed with Zlib. The compressed texture 310 is stored in the main memory 300.

The Zlib engine 60 Zlib decompresses the compressed texture 310 stored in the main memory 300 and stores the result in the run-length block ring buffer 70.

The run-length decompression unit 30 performs run-length decompression on the Zlib-decompressed texture stored in the run-length block ring buffer 70 and stores the result in the DCT block ring buffer 80. The subsequent processing is the same as in the first embodiment.
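The order of the two decompression stages can be sketched in software using Python's standard `zlib` module in place of the dedicated Zlib engine. The run-length byte layout below is an assumption made for illustration: the text only says that a run of "00" is encoded as the code "ff" plus the run length, so the one-byte count, the minimum run length of 3, and the `ff 00` escape for a literal `ff` are all hypothetical. The inverse discrete cosine transform that would follow is omitted.

```python
import zlib

MARK = 0xFF  # code named in the text; the exact byte layout here is assumed


def rle_encode(data: bytes) -> bytes:
    """Encode zero runs as MARK + length (assumed layout, see lead-in)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            if run >= 3:
                out += bytes([MARK, run])
            else:
                out += bytes(run)  # short zero runs stay literal
            i += run
        elif b == MARK:
            out += bytes([MARK, 0])  # assumed escape for a literal 0xff
            i += 1
        else:
            out.append(b)
            i += 1
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == MARK:
            count = data[i + 1]
            out += bytes(count) if count else bytes([MARK])
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


# One hypothetical 64-byte block of coefficients with a long zero tail.
coeffs = bytes([52, 17, 255, 3, 0, 1] + [0] * 58)

stored = zlib.compress(rle_encode(coeffs))  # what would sit in main memory
after_zlib = zlib.decompress(stored)        # stage 1: Zlib decompression
restored = rle_decode(after_zlib)           # stage 2: run-length decompression

print(len(coeffs), "->", len(after_zlib), "bytes after the Zlib stage")
print("round trip ok:", restored == coeffs)
```

The point of the ordering is visible in the sizes: the Zlib stage only has to emit the short run-length stream, and the large expansion happens in the run-length stage.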

FIGS. 10(a) to 10(f) are diagrams explaining the data size of Zlib-compressed textures. The compressed textures of FIGS. 10(a) to 10(c), which the GPU 200 can handle, are the same as those of FIGS. 4(a) to 4(c), so their description is omitted.

FIGS. 10(d) to 10(f) show compressed textures that the GPU 200 cannot handle directly. The texture of FIG. 10(e), which is discrete cosine transformed and Zlib compressed, achieves a compression ratio of roughly 1/20, like the JPEG-compressed texture of FIG. 10(d). However, as described later, Zlib-compressing the DCT coefficients as they are imposes a load on the Zlib engine 60 at decompression time that exceeds normal hardware performance, which is inefficient. This embodiment therefore uses, as shown in FIG. 10(f), a texture that is Zlib compressed after the discrete cosine transform and run-length compression.

FIGS. 11(a) and 11(b) are diagrams explaining the advantage of run-length compressing the texture in this embodiment.

For comparison, FIG. 11(a) shows the case where a texture that has not been run-length compressed is Zlib decompressed by the Zlib engine 60. When the compression ratio of the compressed texture is 1/20 and the Zlib engine 60 receives the compressed texture at a transfer rate of 50 MB/s (megabytes per second), it must output the Zlib-decompressed texture at a transfer rate of 1333 MB/s. The Zlib-decompressed texture is inverse discrete cosine transformed by the IDCT unit 40, and the restored texture is output at a transfer rate of 1000 MB/s.

The normal output-to-input ratio of the Zlib engine 60 is 2 to 4. A texture that has not been run-length compressed, by contrast, demands an output performance of about 20 times the input, which exceeds the normal hardware limits of the Zlib engine 60 and is not realistic to implement. If a Zlib engine 60 with normal output performance is used, it cannot deliver the required output rate, so the output of the Zlib engine 60 becomes a bottleneck and restoring the texture takes an extremely long time.

FIG. 11(b) shows the case where a run-length compressed texture is Zlib decompressed by the Zlib engine 60. In this case, when the Zlib engine 60 receives the compressed texture at a transfer rate of 50 MB/s (megabytes per second), it only needs to output the Zlib-decompressed texture at a transfer rate of 125 MB/s. This is because the run-length decompression unit 30 then performs run-length decompression on the Zlib-decompressed texture and can output it at a transfer rate of 1333 MB/s. The run-length decompressed texture is inverse discrete cosine transformed by the IDCT unit 40, and the restored texture is output at a transfer rate of 1000 MB/s.

The run-length decompression unit 30 and the IDCT unit 40 are both executed by the compute shader of the GPU 200, so the data transfer bandwidth between them is sufficiently large, and a transfer rate of 1333 MB/s from the run-length decompression unit 30 to the IDCT unit 40 is achievable. In this case the Zlib engine 60 needs an output performance of only about twice its input, so it can be implemented within normal hardware limits.
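The expansion ratios implied by the transfer rates quoted for FIGS. 11(a) and 11(b) can be checked with simple arithmetic. The sketch below only divides the stated rates; the helper names are illustrative, and the 2 to 4 range is the normal ratio of the Zlib engine 60 given in the text.

```python
# Normal output/input ratio of the Zlib engine 60, as stated in the text.
TYPICAL_RATIO = (2.0, 4.0)


def expansion(input_rate, output_rate):
    """Output/input ratio a stage must sustain (rates in MB/s)."""
    return output_rate / input_rate


def within_typical(ratio):
    low, high = TYPICAL_RATIO
    return low <= ratio <= high


# FIG. 11(a): Zlib must expand 50 MB/s directly to 1333 MB/s.
zlib_without_rle = expansion(50, 1333)
# FIG. 11(b): Zlib expands 50 MB/s to 125 MB/s, then the compute shader
# expands 125 MB/s to 1333 MB/s by run-length decompression.
zlib_with_rle = expansion(50, 125)
rle_stage = expansion(125, 1333)

print(f"Zlib ratio without run-length compression: {zlib_without_rle:.1f}")
print(f"Zlib ratio with run-length compression:    {zlib_with_rle:.1f}")
print(f"run-length stage ratio on the GPU:         {rle_stage:.1f}")
print("within normal Zlib range:", within_typical(zlib_with_rle))
```

Computed directly from the quoted rates, the Zlib stage needs a ratio well above 20 without run-length compression and 2.5 with it, so only the latter falls inside the normal range; the remaining expansion of roughly a factor of ten is carried by the compute shader.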

FIG. 12 is a diagram explaining the performance of compressed texture decompression by the graphics processing apparatus of this embodiment. As an example, consider decompressing a compressed texture of 640 by 640 pixels. Here the GPU 200 has, as an example, 18 computing units (CUs). Because the Zlib engine 60 is also used for purposes other than texture decompression, only part of the resources of a Zlib engine 60 with an output performance of 200 MiB/s is used here, and the compressed texture is Zlib decompressed at 26 MiB/s. This takes 6.2 ms (milliseconds). Run-length decompression is then performed using one CU, which takes 1.3 ms. The inverse discrete cosine transform is then performed using 18 CUs, which takes 0.3 ms. In total, decompressing the compressed texture in the main memory 300 has a latency of about 8 ms, so the compressed texture can be decompressed in real time and fed into graphics processing.
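The latency budget of FIG. 12 is a plain sum of the three stage times. The sketch below adds them up and also derives two figures that are not stated explicitly in the text: the share of the Zlib engine's output capacity that is used, and the share of the total latency spent in the Zlib stage. The dictionary keys are illustrative names.

```python
# Stage times for a 640 x 640 texture, as stated for FIG. 12 (milliseconds).
stage_ms = {
    "zlib_decompression": 6.2,        # part of the Zlib engine, 26 MiB/s
    "run_length_decompression": 1.3,  # one computing unit
    "inverse_dct": 0.3,               # 18 computing units
}

total_ms = sum(stage_ms.values())

# Derived figures (not stated in the text): engine share and stage share.
engine_share = 26 / 200
zlib_share = stage_ms["zlib_decompression"] / total_ms

print(f"total latency: {total_ms:.1f} ms")
print(f"Zlib engine capacity used: {engine_share:.0%}")
print(f"share of latency in the Zlib stage: {zlib_share:.0%}")
```

The sum is 7.8 ms, which rounds to the latency of about 8 ms quoted above, and the Zlib stage accounts for most of it even though only a fraction of the engine is used.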

If a texture without run-length compression were used, a Zlib engine 60 with normal output performance would need about 62 ms, roughly ten times as long, to Zlib decompress the compressed texture, which is not practical. Using a run-length compressed texture lightens the load on the Zlib engine 60, and performing run-length decompression at high speed in the compute shader shortens the latency of decompressing the compressed texture.

According to the graphics processing apparatus of the second embodiment, textures that are Zlib compressed after the discrete cosine transform and run-length compression are used, so the texture size can be greatly reduced, as with JPEG compression. Such highly compressed textures can be kept resident in memory, and PRT can be executed on-memory.

In a graphics processing apparatus that includes a Zlib decoder, using textures that are run-length compressed before Zlib compression reduces the load placed on the Zlib decoder when the compressed texture is decompressed.

In addition, as in the first embodiment, the compute shader of the GPU 200 performs run-length decompression and the inverse discrete cosine transform on the compressed texture, so a highly compressed texture can be decompressed in real time and fed into graphics processing.

The present invention has been described above on the basis of embodiments. The embodiments are examples, and those skilled in the art will understand that various modifications are possible in the combinations of their components and processing steps, and that such modifications also fall within the scope of the present invention.

In the embodiments above the compressed texture is stored in memory, but it may instead be stored on a recording medium such as a hard disk or an optical disc. Because the texture is highly compressed, the required storage capacity is reduced, and the latency of reading from the recording medium can also be reduced to some extent, although it cannot match the on-memory case.

In the embodiments above, the discrete cosine transform is used as an example of a spatial frequency transform that converts the spatial domain of an image into the spatial frequency domain, but another spatial frequency transform, such as the discrete Fourier transform, may be used.

In the embodiments above, as an example of run-length compression, a run of consecutive "00" values is encoded as the combination of the specific code "ff" and the length of the "00" run, but other run-length compression methods may be used. For example, a run of consecutive values other than "00" may be encoded with a specific code and the run length.

The embodiments above describe the case where a Zlib decoder is available as hardware, but the embodiments of the present invention can also be applied when a decoder that decompresses data compressed with a compression algorithm other than Zlib is implemented as hardware.

10 PRT control unit, 20 graphics operation unit, 30 run-length decompression unit, 40 inverse discrete cosine transform unit, 50 graphics processing unit, 60 Zlib engine, 70 run-length block ring buffer, 80 DCT block ring buffer, 100 main processor, 200 GPU, 300 main memory, 310 compressed texture, 320 PRT cache, 330 page table, 340 mipmap texture, 360 texture tile pool.

Claims

A graphics processing apparatus comprising a main memory and a graphics processing unit, wherein
the graphics processing unit includes a run-length decompression unit that performs run-length decompression of a compressed texture, and an inverse spatial frequency transform unit that restores a texture by applying an inverse spatial frequency transform to the run-length decompressed texture,
the main memory includes a texture pool that partially caches the restored texture, and
the run-length decompression unit is executed by a plurality of threads of a compute shader, each thread performing run-length decompression of spatial frequency transform coefficients at the same position in a spatial frequency transform block of the compressed texture.

The graphics processing apparatus according to claim 1, wherein the compressed texture is stored in the main memory, and the run-length decompression unit reads the compressed texture from the main memory.

The graphics processing apparatus according to claim 1 or 2, further comprising a decompression circuit that decompresses the compressed texture before the run-length decompression by the graphics processing unit, wherein
the run-length decompression unit performs run-length decompression of the texture decompressed by the decompression circuit.

A graphics processing method in a graphics processing apparatus comprising a main memory and a graphics processing unit, wherein
the graphics processing unit performs run-length decompression of a compressed texture by a compute shader, restores a texture by applying an inverse spatial frequency transform to the run-length decompressed texture, and stores the restored texture in a texture pool that is provided in the main memory and partially caches the texture, and
the run-length decompression is performed by a plurality of threads of the compute shader, each thread performing run-length decompression of spatial frequency transform coefficients at the same position in a spatial frequency transform block of the compressed texture.

A program that causes a compute shader of a graphics processing unit to execute: a step of performing run-length decompression of a compressed texture; a step of restoring a texture by applying an inverse spatial frequency transform to the run-length decompressed texture; and a step of storing the restored texture in a texture pool that partially caches the texture, wherein
the run-length decompression is performed by a plurality of threads of the compute shader, each thread performing run-length decompression of spatial frequency transform coefficients at the same position in a spatial frequency transform block of the compressed texture.