Thuc notes - Deflate algorithm

Deflate

Deflate is a lossless streaming compression/decompression algorithm. It’s used widely in GZIP and PNG.

Deflate is a combination of LZ77 and Huffman coding algorithm to compress data.

LZ77 provides a technique for searching duplicated series of bytes in original data and replacing them with back-reference to the first duplicated part position.

Huffman coding provides a mapping solution for bit reduction by replacing original data with smaller size code.

Deflate stream format

Deflate is designed as a series of blocks that you can compress/decompress as a stream sequence.

┌─────────────┐
│             │
│    block    │
│             │
├─────────────┤
│             │
│    block    │
│             │
├─────────────┤
│             │
│    block    │
│             │
├─────────────┤
│             │
│    block    │
│             │
├─────────────┤
│             │
│    block    │
│             │
├─────────────┤
│             │
│    block    │
│             │
└─────────────┘

Each block has 2 main parts: header and data.

Deflate’s block uses 3 bits header as a starting of the block.

┌─────────────┐                                         
│     ...     │                                         
│    block    │                                         
│             │                                         
├─────────────┼────────┬───────────────────────────────┐
│             │        │Bit (0) - Last-block-in-stream │
│             │        ├───────────────────────────────┤
│             │ Header │                               │
│             │        │  Bit (1) & (2) - Block type   │
│             │        │                               │
│             ├────────┴───────────────────────────────┤
│             │                                        │
│             │                                        │
│             │                                        │
│    block    │                                        │
│             │                                        │
│             │                                        │
│             │                  Data                  │
│             │                                        │
│             │                                        │
│             │                                        │
│             │                                        │
│             │                                        │
│             │                                        │
├─────────────┼────────────────────────────────────────┘
│     ...     │                                         
│    block    │                                         
│             │                                         
└─────────────┘                                         

The first bit (0): define the current block

Value of bit (0)Description
0This is the last block of the stream.
1There are more blocks after this block.

The second bit and the third bit (1) (2): define block type

Value of bit (0)Description
00No compression. Raw data is stored with lengths from 0 to 65535 bytes. It’s used for incompressible data.
01Compressed with fixed Huffman codes. It used a pre-defined/pre-agreed Huffman tree defined in the Deflate RFC. This type is used for short messages, where a pre-defined Huffman tree saved more space instead a custom Huffman tree.
10Compressed with dynamic Huffman codes. Huffman tree will be defined included.
11This file is reserved, and don’t use.

Data

There are two types of data strucutres depending on the header definition, no compression data (00) and compression data (01 and 10).

No compression data block

No compression with the block type.

It consists of three parts: length, one’s complement of length and literal data.

┌─────────────┐                                         
│     ...     │                                         
│    block    │                                         
│             │                                         
├─────────────┼────────┬───────────────────────────────┐
│             │        │Bit (0) - Last-block-in-stream │
│             │        ├───────────────────────────────┤
│             │ Header │                               │
│             │        │      Bit (1) & (2) = 00       │
│             │        │                               │
│             ├────────┼───────────────────────────────┤
│             │        │        2 bits - Length        │
│             │        │                               │
│Non-compresse│        ├───────────────────────────────┤
│  d blocks   │        │ 2 bits - one's complement of  │
│             │        │            length             │
│             │        ├───────────────────────────────┤
│             │  Data  │                               │
│             │        │                               │
│             │        │                               │
│             │        │         literal data          │
│             │        │                               │
│             │        │                               │
│             │        │                               │
├─────────────┼────────┴───────────────────────────────┘
│     ...     │                                         
│    block    │                                         
│             │                                         
└─────────────┘                                         

Compression data block

Compression data block requires 2 things: Huffman code trees and encoded data.

  • Huffman code trees: use to map original data to reduced size codes. With header definition. Deflate support 2 cases to define the Huffman code tree:
    • Fixed Huffman codes: pre-agreed Huffman code trees from spec.
    • Dynamic Huffman codes: dynamic Huffman code trees that build from raw data.
  • Compressed data: They are a series of elements of 2 types: literal bytes and pointer to back duplicate literal bytes. The compressed data are encoded to a reduced bit of Huffman code trees.
┌─────────────┐                                         
│     ...     │                                         
│    block    │                                         
│             │                                         
├─────────────┼────────┬───────────────────────────────┐
│             │        │Bit (0) - Last-block-in-stream │
│             │        ├───────────────────────────────┤
│             │ Header │                               │
│             │        │      Bit (1) & (2) = 01       │
│             │        │                               │
│             ├────────┼───────────────────────────────┤
│             │        │        2 bits - Length        │
│ compressed  │        │                               │
│  blocks -   │        ├───────────────────────────────┤
│Fixed Huffman│        │ 2 bits - one's complement of  │
│    codes    │        │            length             │
│             │        ├───────────────────────────────┤
│             │  Data  │                               │
│             │        │                               │
│             │        │                               │
│             │        │         Encoded data          │
│             │        │                               │
│             │        │                               │
│             │        │                               │
├─────────────┼────────┼───────────────────────────────┤
│             │        │Bit (0) - Last-block-in-stream │
│             │        ├───────────────────────────────┤
│             │ Header │                               │
│             │        │      Bit (1) & (2) = 01       │
│             │        │                               │
│             ├────────┼───────────────────────────────┤
│             │        │        2 bits - Length        │
│             │        │                               │
│             │        ├───────────────────────────────┤
│ compressed  │        │ 2 bits - one's complement of  │
│  blocks -   │        │            length             │
│   Dynamic   │        ├───────────────────────────────┤
│Huffman codes│        │                               │
│             │        │      Huffman code trees       │
│             │  Data  │                               │
│             │        ├───────────────────────────────┤
│             │        │                               │
│             │        │                               │
│             │        │                               │
│             │        │         Encoded data          │
│             │        │                               │
│             │        │                               │
│             │        │                               │
├─────────────┼────────┴───────────────────────────────┘
│     ...     │                                         
│    block    │                                         
│             │                                         
└─────────────┘                                         

Compression stages:

Deflate compress the data in two stages: duplicate elimination and bit reduction.

Duplicate elimination

In the first stage, deflate uses the LZ77 algorithm. It searches duplicated uncompressed bytes and replaces them with reference pointer. The reference pointer is defined by 2 elements:

  • Distance (or offset): the relative back to the first existing bytes occurs in original uncompressed data.
  • Length: the duplicated bytes length.

The reference pointers of LZ77 in Deflate defines by 8-bit for duplicated uncompressed length and a 15-bit distance of duplicated uncompressed position.

LZ77 can search and create reference pointers for duplicated bytes across blocks, as long as the distance appears within the last 32 KB of uncompressed data decoded.

Let’s try an example with original data:

This is


           Length: 2 
           ───       
┌─┬─┬─┬─┬─┬─┬─┐      
│t│h│i│s│ │i│s│      
└─┴─┴─┴─┴─┴─┴─┘      
     ▲     │         
     │     │         
     └─────┘         
     Distance: back 3

We will have a reference pointer with length = 2 and distance = 3

This (3,2)

or

This (back 3, length 2)

Bit reduction

Deflate uses the Huffman coding algorithm to reduce the size of the original data to smaller bits. If you want to know how Huffman coding works, check https://thuc.space/posts/huffman_coding_algorithm/.

For each block, Deflate define 2 Huffman code trees for mapping code to literal data:

  • The first tree for mapping alphabet and duplicated length of pointer LZ77.
    • Code from 0 to 255: the alphabet.
    • Code 256: indicates end-of-block.
    • Code from 257 to 285: represent the length of pointer reference from 3 to 258 (with extra bits support). It’s also the maximum back reference of pointer 258 bytes.
  • Second tree for the back distance of pointer LZ77.
    • Code from 0 to 29: represent back distance of pointer reference from 1 to 32768 (with extra bits support).

Extra bits support bits that follow encoded code, which allows code can contain more possible values from one code.

  • Example length of pointer reference extra bits:
Code (decimal)Code without extra bitsCode with extra bitsDistance
265100001001100001001011
265100001001100001001112
27410001001010001001000043
27410001001010001001000144
  • Example distance of pointer reference extra bits:
Code (decimal)Code without extra bitsCode with extra bitsdistance
6110110009
61101100110
61101101011
61101101112

Finally, data will be encoded to reduced-size code, including the huffman coding tree if it’s explicited in dynamic huffman codes compression data block.

References: