A PDF you export from Word or InDesign is rarely the smallest it can be. The file often contains uncompressed image data, fonts with thousands of unused glyphs, and redundant internal objects. PDF compression strips or recompresses that content — sometimes achieving a 10x reduction, sometimes only 5%, depending on what's inside. Here's what actually happens under the hood.
Why PDFs Get Large
What makes files big in the first place:
Uncompressed or over-quality images are the biggest culprit by far. A single high-resolution photo embedded in a PDF can be several megabytes on its own. Word processors and design tools often embed images at the full original resolution, even if the document is displayed at a fraction of that size. A 24MP camera photo printed at 2 inches on a page doesn't need 24 megapixels of data.
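As a rough worked example of that last point, the arithmetic below compares the pixels a 24MP photo carries with what a 2-inch-wide placement needs. The 6000 x 4000 dimensions and the 300 DPI print target are assumptions chosen for illustration.

```python
# Assumed source image: a 24MP photo at 6000 x 4000 pixels (3:2 aspect ratio).
src_w, src_h = 6000, 4000

# Assumed placement: 2 inches wide on the page, printed at 300 DPI.
placed_width_in = 2.0
target_dpi = 300

# Pixels needed at the placed size, keeping the aspect ratio.
need_w = round(placed_width_in * target_dpi)
need_h = round(need_w * src_h / src_w)

# Resolution the embedded image actually has at that placed size.
effective_dpi = src_w / placed_width_in
excess = (src_w * src_h) / (need_w * need_h)

print(f"needed: {need_w} x {need_h} px")
print(f"effective resolution as embedded: {effective_dpi:.0f} DPI")
print(f"embedded pixels vs needed: {excess:.0f}x")
```

Under these assumptions the embedded photo holds about 100 times more pixels than the printed page can use.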
Unsubsetted fonts add significant overhead. When you embed a complete font file, you're including the outlines for every glyph the font contains — potentially thousands of characters for a Unicode font, even if your document only uses 40 of them. A single unsubsetted OpenType font can be 200–600KB.
Redundant objects accumulate when PDFs are incrementally updated. The PDF format allows appending new content without rewriting the whole file. After multiple save operations, the file can contain multiple versions of the same object — earlier revisions that are no longer referenced but still take up space.
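One quick, imperfect way to spot a file that has been incrementally updated is to count its end-of-file markers, since each appended revision normally ends with its own %%EOF. The sketch below is a heuristic only: linearized PDFs contain an extra marker, and the byte sequence could in principle appear inside stream data. The sample bytes are a made-up stand-in, not a valid PDF.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each incremental save normally appends one."""
    return pdf_bytes.count(b"%%EOF")


# Synthetic stand-in for a file saved once and then updated twice.
# These bytes are not a valid PDF; they only mimic the marker layout.
fake_pdf = (
    b"%PDF-1.4\n...original body...\nstartxref\n123\n%%EOF\n"
    b"...first update...\nstartxref\n456\n%%EOF\n"
    b"...second update...\nstartxref\n789\n%%EOF\n"
)

revisions = count_eof_markers(fake_pdf)
print(f"revisions found: {revisions}")
```

A count above one suggests the file carries older revisions that a full rewrite could drop.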
Uncompressed content streams waste space too. Content streams hold the page description data (the PostScript-like commands that describe where to draw text and graphics). They are usually compressed but sometimes aren't, particularly in PDFs produced by older or minimal-compliance tools.
Image Compression Inside PDFs
Images in PDFs use a filter chain — one or more compression algorithms applied to the raw pixel data. The choice of filter depends on the image content.
DCT (JPEG) compression is used for photographs and continuous-tone images. It's a lossy algorithm: it discards high-frequency detail that human vision is less sensitive to. Quality can be tuned from nearly invisible loss to heavy artifacting. Optimizers usually pair it with downsampling: most target 72–150 DPI for screen viewing and 150–300 DPI for print, then apply JPEG at quality 60–80. A 5MB embedded photo can become 200KB this way.
FLATE (ZIP/Deflate) compression is the lossless option, used for graphics, diagrams, screenshots, and images with text. It's the same algorithm as ZIP files — it finds repeating patterns in the data and encodes them more efficiently. FLATE is the right choice when you can't afford any quality loss, like a logo or a chart with precise colors.
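To see why flat-color graphics suit FLATE, here is a minimal sketch using Python's zlib module, which implements the same Deflate algorithm. The tiny synthetic "chart" is an assumption for the demo. The point is that the data shrinks a lot and comes back byte-for-byte identical.

```python
import zlib

# Synthetic 200 x 100 RGB "chart": white background with one solid blue bar.
# The pixel values are made up for the demonstration.
WIDTH, HEIGHT = 200, 100
WHITE, BLUE = bytes([255, 255, 255]), bytes([0, 70, 200])

rows = []
for y in range(HEIGHT):
    if 40 <= y < 60:
        row = WHITE * 20 + BLUE * 160 + WHITE * 20
    else:
        row = WHITE * WIDTH
    rows.append(row)
raw = b"".join(rows)

# Compress at the highest level, then decompress to check nothing was lost.
compressed = zlib.compress(raw, 9)
restored = zlib.decompress(compressed)

print(f"raw: {len(raw)} bytes, deflated: {len(compressed)} bytes")
print(f"lossless round trip: {restored == raw}")
```

A photograph run through the same code would shrink far less, because neighboring pixels rarely repeat exactly.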
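JBIG2 is specialized for black-and-white (1-bit) images such as scanned text pages and fax output. It achieves dramatically better compression than FLATE for bitonal content by recognizing repeating symbol patterns (the same letter e appears many times on a page) and storing each symbol once. JBIG2 has been part of the format since PDF 1.4, so PDF/A-1 and the later PDF/A standards all allow it. Its lossy mode can substitute one similar-looking symbol for another, so archival workflows generally stick to lossless JBIG2.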
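The symbol-dictionary idea can be sketched with a toy model. This is not real JBIG2: characters stand in for glyph bitmaps, and the byte costs (32 bytes per bitmap, 4 bytes per placement) are assumptions picked only to show the shape of the saving.

```python
# Toy model: each character stands in for a 16 x 16 one-bit glyph bitmap.
# Real JBIG2 matches pixel shapes, not characters.
BITMAP_BYTES = 32      # assumed size of one glyph bitmap
PLACEMENT_BYTES = 4    # assumed cost of one (symbol index, x, y) record

page_text = "the quick brown fox jumps over the lazy dog " * 50
glyphs = [ch for ch in page_text if ch != " "]

# Store each distinct symbol once, then reference it by index.
dictionary = sorted(set(glyphs))
index_of = {sym: i for i, sym in enumerate(dictionary)}
placements = [index_of[g] for g in glyphs]

naive_size = len(glyphs) * BITMAP_BYTES
symbol_size = len(dictionary) * BITMAP_BYTES + len(placements) * PLACEMENT_BYTES

print(f"{len(glyphs)} glyphs on the page, {len(dictionary)} distinct symbols")
print(f"one bitmap per occurrence: {naive_size} bytes")
print(f"dictionary plus placements: {symbol_size} bytes")
```

The saving grows with page length, since the dictionary cost is paid once while each extra occurrence only adds a small placement record.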
CCITT Group 4 is an older standard for bilevel images, used in many scanned document workflows. It's lossless and efficient for text scans, but JBIG2 achieves significantly better ratios on most real-world documents.
Font Subsetting
Font subsetting is one of the highest-impact optimizations for text-heavy PDFs. Instead of embedding the full font file, subsetting embeds only the glyphs actually used in the document.
If your document uses a font to display the word "Hello World", a subsetted version of that font contains outlines for H, e, l, o, W, r, d: just those seven letter glyphs, plus the space between the words. A typical English business document uses maybe 80–100 distinct characters out of a font that might contain 1,200.
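The "which glyphs are used" step can be sketched in a few lines. This is a simplification that works on characters; the function name `used_glyphs` is hypothetical and not part of any font library.

```python
def used_glyphs(text: str) -> list[str]:
    """Return the distinct visible characters a subset font must keep."""
    return sorted({ch for ch in text if not ch.isspace()})


subset = used_glyphs("Hello World")
print(subset)       # ['H', 'W', 'd', 'e', 'l', 'o', 'r']
print(len(subset))  # 7
```

A real subsetter works on glyph IDs rather than characters and typically keeps a few extra glyphs, such as .notdef and the space.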
PDF generators like Acrobat, InDesign, and LaTeX subset fonts by default. The problem arises with older tools, some Word-to-PDF converters, and PDFs that have been combined from multiple sources. A merged PDF can end up with several copies of the same font family, each slightly different, none subsetted.
Subsetting is always lossless. The visual output is identical — the glyph shapes are the same, just fewer of them. The only downside is that if you need to edit the PDF later, the font may not have the glyphs you're trying to add.
Object Streams and Cross-Reference Streams
PDF 1.5 introduced two structural improvements that reduce file size without touching content.
Object streams pack multiple PDF objects into a single compressed stream. A PDF contains many small objects — page dictionaries, resource lists, metadata records. Individually they're tiny, but the overhead of storing each as a separate top-level object adds up. Grouping them into object streams and compressing the whole group achieves better compression than compressing each object alone.
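The benefit of grouping is easy to demonstrate with zlib. The sketch below compresses 200 made-up page dictionaries one at a time and then as a single block. It is simplified: a real object stream also carries a small index of object numbers and offsets.

```python
import zlib

# 200 made-up page dictionaries, similar to what a real file might hold.
objects = [
    f"<< /Type /Page /Parent 2 0 R /Contents {n} 0 R /Resources 3 0 R >>".encode()
    for n in range(10, 210)
]

# Compressed one by one: every object pays its own overhead and cannot
# reuse patterns seen in its neighbours.
separate = sum(len(zlib.compress(obj, 9)) for obj in objects)

# Compressed as one block, the way an object stream does it.
grouped = len(zlib.compress(b"\n".join(objects), 9))

print(f"compressed separately: {separate} bytes")
print(f"compressed as one group: {grouped} bytes")
```

The grouped result is much smaller because the near-identical dictionaries become back-references to each other.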
Cross-reference streams replace the traditional xref table (a plain text index of byte offsets for every object in the file) with a compressed binary stream. On a document with thousands of objects, the xref table can be substantial. The compressed stream format is both smaller and more efficient to parse.
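To get a feel for the numbers, the sketch below builds both forms for 5,000 objects. Classic xref entries are fixed-width 20-byte text lines. The byte offsets are made up, and the 1/4/2-byte binary field widths are an assumption, since writers choose their own.

```python
import struct
import zlib

# Made-up byte offsets for 5,000 objects, roughly 300 bytes apart.
offsets = [15 + 300 * n + (n * 37) % 101 for n in range(5000)]

# Classic table: one fixed-width 20-byte text entry per object.
classic = b"".join(f"{off:010d} 00000 n \n".encode() for off in offsets)

# Stream form: binary fields (type, offset, generation), then Flate.
# The 1/4/2-byte field widths are an assumption for this sketch.
binary = b"".join(struct.pack(">BIH", 1, off, 0) for off in offsets)
stream = zlib.compress(binary, 9)

print(f"classic xref entries: {len(classic)} bytes")
print(f"binary before Flate: {len(binary)} bytes")
print(f"xref stream data: {len(stream)} bytes")
```

Even before Flate, the binary layout is about a third of the text table's size in this setup.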
Tools that produce PDF 1.4 and earlier output miss both of these. Optimizers that target PDF 1.5+ rewrite the file structure to take advantage of them.
Lossless vs Lossy: The Real Trade-off
PDF optimization tools typically offer two modes, and the difference matters depending on your use case.
Lossless optimization: remove duplicate objects, recompress content streams with better settings, subset fonts, apply FLATE to uncompressed graphics, and upgrade the file structure to use object/xref streams. Nothing visual changes. This usually achieves a 10–30% size reduction on typical documents.
Lossy optimization: downsample images to a lower resolution or recompress them at lower quality. This is where the big reductions happen. Downsampling a 300 DPI image to 150 DPI halves the resolution in each direction, which throws away 75% of the pixels. Recompressing photos that were stored losslessly (as FLATE in the PDF) to JPEG at quality 75 can reduce image size by 80%. But the changes are irreversible: if you later need the high-resolution version, you have to go back to the source.
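The 75% figure follows from the fact that resolution applies to both axes, so the pixel count scales with the square of the DPI ratio. A short check, with a hypothetical helper name:

```python
def pixels_kept(src_dpi: float, dst_dpi: float) -> float:
    """Fraction of pixels left after downsampling both axes."""
    return (dst_dpi / src_dpi) ** 2


for dst in (150, 72):
    kept = pixels_kept(300, dst)
    print(f"300 -> {dst} DPI keeps {kept:.1%}, discards {1 - kept:.1%}")
```

Going from 300 to 72 DPI is far more drastic than it sounds: only about 6% of the original pixels survive.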
The practical rule: use lossless for archival documents, contracts, and anything that might be reprinted. Use lossy for email attachments, web PDFs, and documents that will only ever be viewed on screen at normal zoom.
What Ghostscript Actually Does
Ghostscript, the most common open-source PDF processing engine, offers four compression presets through its -dPDFSETTINGS option: /screen, /ebook, /printer, and /prepress. The /screen preset downsamples color images to 72 DPI and applies aggressive JPEG compression. /ebook targets 150 DPI. /printer keeps 300 DPI but still recompresses images and subsets fonts. /prepress is the highest-quality preset, intended for commercial printing.
Running Ghostscript's /ebook preset on a 10MB Word-to-PDF export is a common quick win:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=compressed.pdf input.pdf
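If you want to call this from a script, one possible wrapper looks like the sketch below. It assumes a gs binary on your PATH (on Windows the executable usually has a different name, gswin64c for instance). The function names are hypothetical, and the preset list covers only the four named in this article.

```python
import subprocess

PRESETS = {"/screen", "/ebook", "/printer", "/prepress"}


def build_gs_args(src: str, dst: str, preset: str = "/ebook") -> list[str]:
    """Build the Ghostscript argument list used in the command above."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.5",
        f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src: str, dst: str, preset: str = "/ebook") -> None:
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(build_gs_args(src, dst, preset), check=True)


# Only build and show the command here; call compress_pdf() to run it.
args = build_gs_args("input.pdf", "compressed.pdf")
print(" ".join(args))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.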
For multi-document workflows, merging first and then compressing the combined file is more efficient — the compressor can deduplicate fonts and resources across all the input files at once. See PDF Merge for combining files before running the optimization step.
Realistic Size Reduction Expectations
PDF compression isn't magic. Results depend entirely on what's inside the file.
Image-heavy PDFs — a brochure full of photographs — can shrink by 60–85% with lossy optimization. Lossless-only gets you maybe 5–15%.
Text-only PDFs (contracts, reports, forms) are already mostly compressed. Lossless optimization can get you 10–20%. Lossy optimization adds little on top, because there are few or no images to downsample.
Scanned documents — the big variable. An uncompressed scan at 300 DPI is huge. Recompressing to JBIG2 (for black-and-white text) or JPEG can reduce a 15MB scan to under 1MB with acceptable visual quality.
Already-optimized PDFs — if someone already ran the file through Acrobat's optimizer or Ghostscript, you might get 2–5% additional savings at best.
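To turn those percentage ranges into file sizes, the arithmetic is simple. The sketch below uses a hypothetical 20MB photo brochure and the 60–85% lossy range quoted above; the helper name is made up.

```python
def expected_size_mb(
    original_mb: float, low_pct: float, high_pct: float
) -> tuple[float, float]:
    """Return (best case, worst case) size after a low..high percent cut."""
    best = original_mb * (1 - high_pct / 100)
    worst = original_mb * (1 - low_pct / 100)
    return best, worst


# Hypothetical 20 MB photo brochure with a 60-85% lossy reduction.
best, worst = expected_size_mb(20, 60, 85)
print(f"expect roughly {best:.1f} to {worst:.1f} MB")
```

Treat the result as a planning range, not a promise: the actual outcome depends on what the file contains.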
The PDF Compress tool handles the most impactful operations — image resampling, FLATE compression for graphics, and object stream optimization — without requiring you to install Ghostscript locally.
For a deeper understanding of the PDF format itself, including how content streams and the cross-reference table are structured, see How PDF Works. If you're curious about the compression algorithms PDF borrows from general data compression, How Data Compression Works covers the underlying theory.