Rice decoding on the GPU with Metal

Dec 8, 2018

A previous post demonstrated Huffman decoding implemented on the GPU. This post addresses the same problem space, but focuses on an implementation of adaptive Rice decoding for the GPU. This example of GPGPU programming is designed to implement decompression of a full screen grayscale image at 30 FPS. Traditional decoding would make use the CPU, this approach offloads the processor intensive Rice decoding to the GPU.

This implementation makes use of Rice codes that can be quickly decoded using only bit shifting and count leading zero bit operations. A fast GPU implementation is only possible when code lengths are limited to a maximum bit length. This specific implementation requires a maximum prefix unary code length of 16 bits, so that the 16 bit ushort data type could be used in GPU code (more on that later). Complete source code and Xcode project is available on github.

Grayscale Delta Compression

This implementation of grayscale coding is similar to the Huffman version in that grayscale deltas are encoded instead of grayscale values. The following example shows coding of delta values and how the most common deltas are nearest to zero.

  grayscale = 0  9  7  8  9  7  8
  deltas    = 0 +9 -2 +1 +1 -2 +1

When delta values from a real image are displayed as a graph, the shape is known as TSGD (Two-Sided Geometric Distribution).

    TSGD


The graph above indicates that deltas -1, 0, +1 are more common than -2 and +2 and significantly more common than -10 and +10. Rice coding is able to encode small delta values like -1 and +1 with short bit codes. This encoding process generates a smaller compressed output size, as long as the delta distribution matches the TSGD.

Rice Coding Optimization

Rice coding is adaptive based on a single parameter k. In the case of grayscale delta encoding, data is in the byte range (0, 255) so that the k value (in terms of POT) is in the int range (0, 7). This Rice encoding process selects a block of input and then calculates an optimal parameter k to minimize compressed data size for that specific block.

The block size used for Rice k parameter selection is 8x8. Block size has been extensively explored in image compression research and 8x8 blocks are the best default choice. The optimal k value for each block is calculated at encode time and stored as side channel information for use at decode time.

Use of an optimal k table created at compression time has an extremely important decompression implication. An optimal k table means the decoder does not need to spend any time adapting parameters based on values already seen at decode time. Many decoding approaches spend a significant amount of execution time doing this type of context modeling. This implementation avoids any context modeling, so that the decoder can execute in parallel and as quickly as possible.

Delta Block Optimization

Rice parameter optimization is most effective with 8x8 blocks, but delta compression is actually improved with larger blocks. This implementation makes use of 32x32 sized blocks when decompressing deltas. Each "Big Block" contains 16 of the smaller 8x8 blocks.

Big Block
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Compressed data is grouped in terms of these "Big Block" chunks of 32x32 values for optimal decoding performance. Rice decoding is implemented as a compute shader that executes on 32 SIMD threads inside the GPU (known as warps in GPU terminology). The implementation maps pairs of warp threads to individual blocks inside each "Big Block" such that threads (0,1) process small block 0, threads (2,3) process small block 1, and so on. This efficient mapping of compute shader resources and highly localized data layout results in impressive execution time while still providing excellent data compression.

Results

The example image provided with the MetalRice github project is a 2048 x 1536 grayscale image of the Golden Gate Bridge. When compressed with the Rice approach the file size in bytes is reduced significantly.

  inBytes    3145728  (3.14 Megs)
  riceBytes  1657832  (1.65 Megs)

The example image is the exact pixel size of an iPad Retina screen and provides a good test input at a large but reasonable texture size. The GPU performance (on an A9 processor) as seen in Xcode is:


    GPU Performance Image

Optimizations

The master branch contains an optimized Rice decoder that executes the compute shader stage in 4.4 ms. The previous best execution time for the compute shader (before it was optimized) was 8.9 ms. An interested developer can examine the previous unoptimized version in the github repo on a branch named unopt.

The interesting part about optimizing this shader is that a 2x performance improvement was mostly due to changing the logic so that fewer conditional code blocks would be executed. Some changes were counterintuitive, for example changing the 16 bit input register to a 32 bit register resulted in faster execution time since twice the number of bits could be buffered. Another change to the relative ordering of the bits in the encoded stream meant that an optimized function could read four k sized bits fields more quickly. After optimizing this shader, it was clear that the main issue effecting execution time was thread/warp divergence.

Advantages

This Rice decoder executes almost entirely on the GPU, the process requires only a tiny amount of CPU to send the encoded bit buffer to the GPU. The Rice coding approach produces surprisingly effective results for photographic images and good results for synthetic images. Compute shader performance for the Rice implementation is more than twice as fast as the Huffman based GPU decoder. The Rice decoder is fast enough to support even a 60 FPS frame rate. Relative timings in ms (1/1000 of a second) are indicated below:


    Rice Compare Times

Disadvantages

This Rice compute shader implementation is able to run effectively on A8 and newer processors, but the A7 processor (iPhone 5S/iPad Retina) executes compute shader logic so slowly that the results are effectively unusable. There appears to be no solution for this compute shader execution problem. Either A7 devices would need to be excluded or a software only substitute would need to be used for A7 devices.

The Rice parallel decompression approach requires additional side channel information to store block offsets into the bitstream and to store the k table. The k table is small, but the block offsets can be as large as 200 kB uncompressed. Additional work to reduce the block offset side channel size would be beneficial.

Future Directions

The Rice based GPU decoding approach is highly effective both in terms of compressed size and execution time. Additional work to reduce side channel information would reduce the combined compressed size even further. One method to reduce block bit offset table size would be to convert block offsets into deltas and then subtract the known number of bits needed to store N k bit elements for each block in the combined stream. Another approach to reduce stored block offset table size would be to store 1/2 or 1/4 of the stream offsets and then use multiple render passes that update parsed bit offsets in the stream. These approaches must balance side channel size against impact on the decoder execution time.

This Rice decoder is built specifically for image processing, but there is no reason the approach could not be easily adapted to implement real time audio playback or even multiple audio channel mixing on the GPU. In fact, any process that produces TSGD data can easily be plugged into this Rice compression/decompression approach.