Every time you load a web page, your browser receives data that was compressed on the server and decompressed on your end, usually in milliseconds. The JavaScript, HTML, and CSS you download are typically 60–80% smaller than the raw source. Understanding how that works helps you use compression deliberately rather than just hoping it happens.
Lossless vs Lossy
There's a fundamental split worth understanding before diving into specific algorithms.
Lossless compression means you can reconstruct the original data exactly. Compress a text file, decompress it, and you have bit-for-bit identical bytes. That's what gzip, Brotli, and Zstd do. Required for anything where accuracy matters: code, documents, databases, executables.
Lossy compression discards some data permanently to achieve smaller files. JPEG images, MP3 audio, and H.264 video all work this way — you trade perfect accuracy for much higher compression ratios. The discarded data is chosen to be imperceptible (or at least tolerable) to human senses. You can't recover the original from a lossy-compressed file.
For web transfer of text-based content (HTML, CSS, JSON, JavaScript), you always want lossless. For images, you might use lossy (JPEG, WebP lossy mode) or lossless depending on the content.
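The lossless guarantee is easy to check directly. A quick sketch using Python's stdlib zlib module:

```python
import zlib

original = b"<html><body>Hello, compression!</body></html>"

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless means the round trip is bit-for-bit identical.
assert restored == original
```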
LZ77: The Algorithm Behind Almost Everything
Most modern lossless compressors trace their lineage to two algorithms published by Abraham Lempel and Jacob Ziv in 1977 and 1978: LZ77 and LZ78. The core idea of LZ77 is elegant.
As the compressor reads through data, it maintains a sliding window — a buffer of recently seen bytes. When it encounters a sequence it has seen before, instead of outputting those bytes again, it outputs a reference: "go back X bytes and copy Y bytes from there."
Take the string the cat sat on the mat. When the compressor reaches "the mat" at the end, it has already seen "the" at the start, so it outputs a back-reference ("go back 15 bytes, copy 3 bytes") instead of the literal characters t, h, e. The remaining characters of " mat" come out as literals.
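A toy version of this can be written in a few lines of Python. This is a greedy brute-force search with no entropy coding, purely for illustration; real implementations use hash chains and other tricks to find matches fast:

```python
def lz77_tokens(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy toy LZ77: emit literals and (distance, length) back-references."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window of recently seen bytes.
        for j in range(max(0, i - window), i):
            length = 0
            # Extend the match (capped so it never overlaps position i).
            while (length < i - j and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> bytes:
    """Replay literals and back-references to recover the original bytes."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

tokens = lz77_tokens(b"the cat sat on the mat")
# The second "the " becomes ("ref", 15, 4): go back 15 bytes, copy 4.
assert ("ref", 15, 4) in tokens
assert lz77_decode(tokens) == b"the cat sat on the mat"
```

(The toy finds a slightly longer match than the prose example, pulling in the trailing space as well.)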
Repetitive data compresses extremely well this way. Source code, HTML with repeated tag patterns, JSON with repeated field names — all compress to a fraction of their original size. Random data compresses almost not at all, because there are no patterns to reference.
Huffman Coding: Encoding Frequencies
LZ77 reduces redundancy by replacing repeated sequences with references. Huffman coding goes further — it encodes frequent symbols with fewer bits.
In ASCII text, every character takes 8 bits. But if the letter e appears 13% of the time and z appears 0.07% of the time, why give them the same number of bits? Huffman coding builds a frequency table, then assigns shorter bit patterns to common symbols and longer ones to rare ones.
In typical English text, common letters might be encoded in 3–4 bits while rare ones get 8–12. On character-level English text, Huffman coding alone saves roughly 40% on average.
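A minimal sketch of the table construction, using Python's heapq. This is the classic "merge the two least frequent nodes" algorithm; production DEFLATE uses canonical, length-limited codes, but the idea is the same:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tiebreaker, {symbol: code-so-far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Prefix every code in the two merged subtrees with 0 / 1.
        merged = {ch: "0" + c for ch, c in left.items()}
        merged |= {ch: "1" + c for ch, c in right.items()}
        heapq.heappush(heap, (n1 + n2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes("this sentence is the data being encoded")
# Frequent symbols (like 'e') end up with codes no longer than rare ones.
assert len(codes["e"]) <= len(codes["g"])
```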
Most practical compressors combine LZ77 (or a variant) with Huffman coding. Find repetitions, represent them as references, then encode the references and literals efficiently. This is the core of DEFLATE — the algorithm inside gzip.
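Python's zlib module implements DEFLATE, so the combined effect is easy to observe:

```python
import os
import zlib

# Repetitive, JSON-like text: LZ77 back-references do most of the work.
repetitive = b'{"name": "a", "value": 1}' * 200   # 5000 bytes
# Random bytes: no repeats to reference, no frequency skew to exploit.
random_data = os.urandom(5000)

print(len(zlib.compress(repetitive)))   # far smaller than 5000
print(len(zlib.compress(random_data)))  # roughly 5000, or slightly more
```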
gzip: DEFLATE with Headers
gzip is DEFLATE plus a standardized container format. The .gz file wraps the compressed data with a header carrying optional metadata (the original filename, modification time, and operating system) and a trailer containing a CRC32 checksum and the original length, used to verify integrity on decompression.
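A quick sketch with Python's stdlib gzip module shows the container at work:

```python
import gzip

data = gzip.compress(b"hello, hello, hello")

# Every gzip stream starts with the magic bytes 0x1f 0x8b,
# followed by the compression method (8 = DEFLATE).
assert data[:2] == b"\x1f\x8b"
assert data[2] == 8

# The trailer carries the CRC32 and original length;
# gzip.decompress verifies both before returning.
assert gzip.decompress(data) == b"hello, hello, hello"
```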
On the web, the Content-Encoding: gzip header tells browsers the response body is gzip-compressed. The browser decompresses transparently — you never see the compressed bytes.
# Check if a server is sending gzip
curl -I -H "Accept-Encoding: gzip" https://example.com
# Compress a file manually
gzip -k myfile.txt # keeps original, creates myfile.txt.gz
gzip -d myfile.txt.gz # decompress
# Check compression ratio
gzip -l myfile.txt.gz
gzip has been the default web compression format since the mid-1990s. Support is universal. Its weakness is that DEFLATE is showing its age — there are better algorithms now.
Brotli: Built for the Web
Google released Brotli in 2015, designed specifically for web content. It uses a combination of LZ77, Huffman coding, and a second-order context modelling technique. More importantly, it ships with a built-in dictionary of common web strings — HTML tag names, CSS properties, common JavaScript tokens, HTTP header values.
Because the compressor and decompressor share this dictionary, Brotli doesn't need to transmit it as part of the compressed data. It just references items by ID. This gives Brotli a significant advantage on typical web payloads.
The practical numbers: Brotli compresses web text roughly 15–25% better than gzip at comparable speeds. For JavaScript files specifically, savings of 20%+ are common. Google's original benchmark reported a 20–26% density improvement, measured against Zopfli, itself a best-in-class DEFLATE encoder.
One catch: Brotli should only be used over HTTPS. Older HTTP/1.1 middleboxes — corporate proxies, some CDN edge nodes — don't recognize the br content-encoding and may corrupt the response before it reaches the browser. HTTPS prevents these intermediaries from tampering with the payload. All major browsers only advertise Accept-Encoding: br on HTTPS connections, and modern CDNs negotiate Brotli automatically when the client signals support.
# Check Accept-Encoding negotiation
curl -v -H "Accept-Encoding: br, gzip" https://example.com 2>&1 | grep -i "content-encoding"
Zstd: Facebook's Speed-First Algorithm
Facebook open-sourced Zstandard (Zstd) in 2016. It uses a different approach from gzip and Brotli, optimizing hard for both compression and decompression speed rather than just ratio.
For its entropy-coding stage, Zstd uses Finite State Entropy (FSE), a tabled variant of Asymmetric Numeral Systems (ANS), alongside Huffman coding for literals. ANS represents symbol probabilities more precisely than Huffman's whole-bit codes, which means better compression with less waste.
The headline numbers are striking. In Facebook's published benchmarks, Zstd at its default level compresses faster than gzip at level 1 while achieving better ratios than gzip at level 6, and decompression is consistently faster than both gzip and Brotli.
Zstd supports custom training dictionaries — you can train a dictionary on a sample of your actual data, giving it the same advantage Brotli gets from its built-in web dictionary. This is particularly powerful for compressing large numbers of similar small files (database records, API responses, log lines).
For web transfer, Zstd is the newest addition. The Content-Encoding: zstd header is supported in modern Chrome, Firefox, and Safari. Caddy can serve zstd-encoded content out of the box; nginx needs a third-party module. The wire savings over gzip are real, typically in the 10–20% range for text content.
Setting Content-Encoding Headers
If you're configuring a web server to serve compressed content, the basics look like this.
nginx:
gzip on;
gzip_types text/plain text/css application/json application/javascript text/xml;
gzip_min_length 1000;
# Brotli requires the ngx_brotli module
brotli on;
brotli_types text/plain text/css application/json application/javascript;
Express (Node.js):
import compression from 'compression';
app.use(compression({
level: 6, // gzip level 1-9
threshold: 1024 // don't compress responses smaller than 1KB
}));
Caddy handles this with a single Caddyfile directive, encode zstd gzip, which enables both algorithms with sensible defaults.
The browser signals what it can accept via the Accept-Encoding request header. The server picks the best option it supports and responds with the matching Content-Encoding header.
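The negotiation logic, reduced to a sketch: a hypothetical gzip-only server, using the same 1 KB threshold as the Express example above. Names and thresholds here are illustrative, not any framework's API:

```python
import gzip

def negotiate_encoding(accept_encoding: str, body: bytes):
    """Pick a Content-Encoding this (hypothetical) server supports."""
    # Parse "gzip, br;q=0.9" into the set of offered tokens.
    offered = {token.split(";")[0].strip()
               for token in accept_encoding.split(",")}
    if "gzip" in offered and len(body) >= 1024:
        # Skip tiny bodies: gzip's framing overhead can outweigh the savings.
        return "gzip", gzip.compress(body)
    return "identity", body

encoding, payload = negotiate_encoding("br, gzip", b"<html>" * 500)
# encoding == "gzip"; payload is much smaller than the 3000-byte original
```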
Dictionary Compression in Practice
Both Brotli and Zstd can use dictionaries to dramatically improve compression of small, similar payloads. This matters for APIs returning many small JSON responses — at small sizes, compression headers can cost more than the compression saves without a dictionary.
# Train a Zstd dictionary from sample data
zstd --train samples/*.json -o api-responses.dict
# Use the dictionary
zstd -D api-responses.dict response.json -o response.json.zst
Dictionary-based compression can shrink a 100-byte API response to around 20 bytes, a ratio that's impossible without shared context. It's overkill for most projects but worth knowing it exists.
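Python's stdlib exposes the same idea at small scale: zlib accepts a preset dictionary via its zdict parameter. The dictionary below is a hypothetical example built from strings the payloads are known to share:

```python
import zlib

# Preset dictionary: boilerplate the small payloads have in common.
# (Hypothetical API response shape, for illustration.)
zdict = b'{"status": "ok", "user_id": , "items": []}'

payload = b'{"status": "ok", "user_id": 42, "items": []}'

plain = zlib.compress(payload)

co = zlib.compressobj(zdict=zdict)
with_dict = co.compress(payload) + co.flush()

# The shared dictionary lets the tiny payload compress much further.
print(len(plain), len(with_dict))

# Decompression must supply the same dictionary.
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(with_dict) == payload
```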
Verifying Compression Is Working
# Check what encoding a server sends
curl -s -o /dev/null -w "%{size_download}" -H "Accept-Encoding: gzip" https://example.com
curl -s -o /dev/null -w "%{size_download}" https://example.com
# View headers explicitly
curl -I -H "Accept-Encoding: gzip, br" https://example.com
# Measure transferred vs actual size
curl -w "Size: %{size_download} bytes\n" -H "Accept-Encoding: gzip, br" -o /dev/null -s https://example.com/app.js
Wrapping Up
gzip, Brotli, and Zstd all find patterns and encode them efficiently, but they differ in age, optimization target, and ecosystem support. gzip works everywhere. Brotli wins on web text over HTTPS. Zstd wins on speed and is increasingly available for static asset serving.
For file-level compression, the PDF Compress and Image Compress tools apply these principles to binary formats — worth using when you need to reduce file size before attaching or sharing.