How PDF Works: Inside the World's Most Portable Document Format

What PDF Was Designed To Be

PDF stands for Portable Document Format. Adobe created it in 1993 with one goal: a document that looks identical on every device, regardless of operating system, fonts, or screen size. Unlike HTML, which reflows, or a Word document, which renders differently based on installed fonts, a PDF is a precise description of a page. Place a word at a specific point and that's exactly where it appears on every viewer.

Technically, PDF is a page-description format descended from PostScript, Adobe's earlier printer language. Where PostScript is a full programming language that a printer executes to draw a page, PDF keeps PostScript's imaging model but drops the programming constructs: it is a fixed description of what's on each page, encoded into a structured binary file. The page description has no loops and no conditionals, just content and positioning.

The File Structure

A PDF file has four major sections that appear in order: the header, the body, the cross-reference table, and the trailer.

The header is just one line. It identifies the file as a PDF and declares the version:

%PDF-1.7

The second line is conventionally a comment (it starts with %) containing at least four bytes with values of 128 or higher. This hints to transfer tools that the file is binary, not plain text, so they don't attempt line-ending conversion on it.
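As a rough illustration of the header rules above, here is a minimal Python sketch that reads the version and checks for the binary marker line. The function name `parse_pdf_header` is made up for this example, and the sketch assumes `\n` or `\r\n` line endings; real files may also use a bare `\r`.

```python
import re

def parse_pdf_header(data: bytes):
    """Return (version, has_binary_marker) for the first bytes of a PDF file.

    Hypothetical helper: assumes lines end in \\n or \\r\\n.
    """
    lines = data.split(b"\n", 2)
    match = re.match(rb"%PDF-(\d)\.(\d)", lines[0])
    if match is None:
        raise ValueError("file does not start with a %PDF-x.y header")
    version = f"{int(match.group(1))}.{int(match.group(2))}"

    second = lines[1] if len(lines) > 1 else b""
    # The marker is a comment line holding at least four bytes >= 128.
    high_bytes = sum(1 for b in second if b >= 128)
    has_binary_marker = second.startswith(b"%") and high_bytes >= 4
    return version, has_binary_marker
```

Calling it on the first kilobyte of a file is enough, since both lines sit at the very start.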

The body contains all the objects that make up the document — pages, images, fonts, annotations, metadata. Each object has a unique object number and generation number:

12 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj

Object 12, generation 0, is a Page object. It references its parent (object 2) and declares its dimensions (612×792 points, which is US Letter at 72 points per inch).
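To make the arithmetic concrete, the sketch below pulls the object number, generation number, and /MediaBox out of an object like the one above and converts the page size to inches. It is a regex-based toy, not a real PDF parser, and `describe_page_object` is a name invented for this example.

```python
import re

OBJ_HEADER = re.compile(r"(\d+)\s+(\d+)\s+obj\b")
MEDIA_BOX = re.compile(r"/MediaBox\s*\[\s*([-+\d.\s]+?)\s*\]")

def describe_page_object(text: str) -> dict:
    """Extract object number, generation, and page size in inches.

    Hypothetical helper: assumes the object is plain text with a
    direct (not referenced) four-number /MediaBox array.
    """
    header = OBJ_HEADER.search(text)
    box = MEDIA_BOX.search(text)
    if header is None or box is None:
        raise ValueError("expected an object header and a /MediaBox entry")
    llx, lly, urx, ury = (float(v) for v in box.group(1).split())
    return {
        "object_number": int(header.group(1)),
        "generation": int(header.group(2)),
        "width_in": (urx - llx) / 72.0,   # 72 points per inch
        "height_in": (ury - lly) / 72.0,
    }
```

For the example object this gives 8.5 by 11 inches, which is US Letter.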

The cross-reference table (xref) is an index. It maps each object number to its byte offset in the file, enabling random access — you can jump directly to any object without scanning the entire file:

xref
0 13
0000000000 65535 f
0000000009 00000 n
...
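A small sketch of how those entries can be read, assuming the classic text-based xref table shown here (not the compressed cross-reference streams that newer files may use). The function names are hypothetical. Note that for free entries (`f`) the first number is the next free object number rather than a byte offset.

```python
def parse_xref_entry(line: str) -> dict:
    """Parse one entry: 10-digit number, 5-digit generation, then 'n' or 'f'."""
    first, generation, kind = line.split()
    if kind == "n":  # in-use object: first field is its byte offset
        return {"in_use": True, "offset": int(first), "generation": int(generation)}
    if kind == "f":  # free object: first field is the next free object number
        return {"in_use": False, "next_free": int(first), "generation": int(generation)}
    raise ValueError(f"unexpected entry type: {kind!r}")

def parse_xref_subsection(lines: list) -> dict:
    """Parse a 'first count' header line followed by that many entries."""
    first, count = (int(v) for v in lines[0].split())
    entries = {}
    for i, line in enumerate(lines[1:1 + count]):
        entries[first + i] = parse_xref_entry(line)
    return entries
```

With the table in hand, a reader can seek straight to `entries[n]["offset"]` and start parsing object `n` there.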

The trailer points to the xref table and to the document's root object (the catalog), which is the entry point for navigating the document tree:

trailer
<< /Size 13 /Root 1 0 R /Info 11 0 R >>
startxref
8492
%%EOF
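Because the pointer to the xref table sits at the very end, readers typically start from the tail of the file. Here is a minimal sketch of that lookup; the 1024-byte window and the name `find_startxref` are assumptions made for this example.

```python
def find_startxref(data: bytes, window: int = 1024) -> int:
    """Return the byte offset that follows the last 'startxref' keyword.

    Hypothetical helper: only searches the last `window` bytes.
    """
    tail = data[-window:]
    idx = tail.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword near the end of the file")
    after = tail[idx + len(b"startxref"):]
    # The first whitespace-separated token after the keyword is the offset.
    return int(after.split()[0].decode("ascii"))
```

Using `rfind` matters: a file that has been updated incrementally can contain several `startxref` keywords, and the last one is the current one.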

Content Streams and Operators

The actual visible content on a page (text, lines, images) lives in content streams. A content stream is a sequence of PDF operators, short keywords that each follow their operands in postfix order. Think of it as a very low-level drawing instruction set:

BT                       % Begin text object
/F1 12 Tf               % Use font F1 at 12pt
100 700 Td              % Move text position: 100 right, 700 up from line start
(Hello, PDF World) Tj  % Show text string
ET                       % End text object
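For illustration, a content stream like this one can be generated with a few lines of Python. This is a sketch under simplifying assumptions: the function name `text_content_stream` is invented, the font resource name (`F1`) must already exist in the page's resources, and the text is assumed to fit the font's single-byte encoding.

```python
def text_content_stream(text: str, x: float, y: float,
                        font: str = "F1", size: float = 12) -> str:
    """Build a minimal text-showing content stream.

    Hypothetical helper: backslashes and parentheses are escaped so
    the literal string stays well-formed.
    """
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return (
        "BT\n"                      # begin text object
        f"/{font} {size} Tf\n"      # select font and size
        f"{x} {y} Td\n"             # move to the text position
        f"({escaped}) Tj\n"         # show the string
        "ET\n"                      # end text object
    )
```

Operands come first and the operator last on every line, which is the postfix order PDF inherited from PostScript.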

The coordinate system has its origin at the bottom-left corner of the page, with y increasing upward. This is the PostScript/mathematical convention, opposite to what most screen rendering systems use, and it trips people up when they work with PDF coordinates programmatically: y=700 on a Letter page (792pt tall) puts you 700 points from the bottom, or 92 points from the top.
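The flip between the two conventions is a single subtraction. A minimal sketch, assuming an untransformed page whose origin is at (0, 0); the function names are made up for this example:

```python
def pdf_to_top_left(x: float, y: float, page_height: float = 792.0):
    """Convert a PDF point (origin bottom-left) to top-left screen style."""
    return x, page_height - y

def top_left_to_pdf(x: float, y: float, page_height: float = 792.0):
    """Inverse conversion: top-left coordinates back to PDF coordinates."""
    return x, page_height - y
```

Both directions use the same formula because flipping twice returns the original value.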

Distances are measured in points: 1 point = 1/72 of an inch. A US Letter page is 612×792pt. An A4 page (210×297 mm) is approximately 595×842pt.
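The page sizes above follow directly from the unit definition. A short sketch of the conversions, with helper names chosen for this example:

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def inches_to_points(inches: float) -> float:
    return inches * POINTS_PER_INCH

def mm_to_points(mm: float) -> float:
    return mm / MM_PER_INCH * POINTS_PER_INCH

def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH
```

US Letter (8.5 by 11 inches) comes out to whole numbers of points, while A4 (210 by 297 mm) does not, which is why its size in points is usually quoted rounded.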

How Fonts Are Embedded

Font embedding is what makes PDFs portable. When a font is embedded, the font data travels with the file — the viewer doesn't need that font installed. The tradeoff is file size, but it's the right tradeoff for document exchange.

PDF supports several font formats:

  • Type 1: Adobe's original outline font format, still common in older PDFs
  • TrueType: the format most Windows and older Mac fonts use
  • OpenType: the modern standard, which can contain either TrueType or CFF (Compact Font Format) outlines
  • CIDFont: for large character sets (such as Chinese, Japanese, and Korean), where character codes map to glyphs through a CMap rather than a simple encoding vector

Full embedding stores the entire font. Subsetting stores only the glyphs actually used in the document, with a randomized tag prepended to the font name (e.g. ABCDEF+Inter) to indicate it's a subset. A report that only uses ASCII can have a subsetted font under 20KB even for a complex typeface.
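Spotting a subset from the font name is straightforward. The sketch below assumes the usual tag shape of six uppercase letters followed by a plus sign; `split_subset_tag` is a name invented for this example.

```python
import re

SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_tag(base_font: str):
    """Return (tag, font_name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font
```

The tag itself carries no meaning. It only keeps two different subsets of the same typeface from being mistaken for the same font.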

Fonts that are not embedded rely on the viewer having that font installed. When it doesn't, the viewer substitutes — and the layout shifts, often badly. This is why "embed all fonts" is standard practice before sharing PDFs.
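One rough way to check for embedding is to look at a font's descriptor dictionary: embedded font programs are referenced through a /FontFile, /FontFile2, or /FontFile3 key. The sketch below works on the dictionary as plain text, which is a simplification; `is_embedded` is a hypothetical helper, and a real tool would walk the parsed object tree instead.

```python
import re

# /FontFile (Type 1), /FontFile2 (TrueType), /FontFile3 (CFF and OpenType)
FONT_FILE_KEY = re.compile(r"/FontFile[23]?\b")

def is_embedded(font_descriptor_text: str) -> bool:
    """Return True if the descriptor references an embedded font program."""
    return FONT_FILE_KEY.search(font_descriptor_text) is not None
```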

Image Streams

Images in PDF are also stored as streams, typically compressed. PDF supports several image compression filters natively: JPEG (DCTDecode), ZIP/deflate (FlateDecode), CCITT Group 4 for monochrome scans (CCITTFaxDecode), and JBIG2 (JBIG2Decode). When you insert a JPEG into a PDF, it's usually stored as-is with DCTDecode, so there is no re-compression and no quality loss.
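To show what "stored as-is" looks like, here is a sketch that wraps raw JPEG bytes in an image XObject without touching the pixel data. The function name is invented, and the caller is assumed to know the image's width, height, and color space already (this sketch does not read them from the JPEG).

```python
def jpeg_xobject(obj_num: int, jpeg_bytes: bytes, width: int, height: int,
                 color_space: str = "DeviceRGB") -> bytes:
    """Wrap unmodified JPEG data in an image XObject stream.

    Hypothetical helper: assumes 8 bits per component and generation 0.
    """
    header = (
        f"{obj_num} 0 obj\n"
        f"<< /Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        f"/ColorSpace /{color_space} /BitsPerComponent 8 "
        f"/Filter /DCTDecode /Length {len(jpeg_bytes)} >>\n"
        "stream\n"
    ).encode("ascii")
    # The JPEG bytes are copied verbatim; only the wrapper is new.
    return header + jpeg_bytes + b"\nendstream\nendobj\n"
```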

Images are referenced from the page's content stream as XObjects:

/Im1 Do  % Draw image XObject named Im1

Their position and scale come from the current transformation matrix, set before the Do operator.
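An image XObject is drawn into a 1×1 unit square, so the matrix has to scale that square to the desired size and move it into place. A minimal sketch of the operators involved, with `place_image` as a made-up helper name:

```python
def place_image(name: str, x: float, y: float,
                width: float, height: float) -> str:
    """Emit content-stream operators that draw an image XObject.

    The cm operands are [a b c d e f]: scale by (width, height),
    then translate the lower-left corner to (x, y).
    """
    return (
        "q\n"                                   # save graphics state
        f"{width} 0 0 {height} {x} {y} cm\n"    # set the transformation matrix
        f"/{name} Do\n"                         # paint the image
        "Q\n"                                   # restore graphics state
    )
```

Wrapping the call in `q` and `Q` keeps the scaling from leaking into whatever is drawn next.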

PDF Versions and Linearization

PDF went through versions 1.0 (1993) to 1.7 (2006) under Adobe. Version 1.7 was then published as the ISO standard ISO 32000-1 in 2008, and PDF 2.0 followed as ISO 32000-2 in 2017. Each version added features: forms (1.2), digital signatures (1.3), transparency (1.4), optional content layers (1.5). Most tools target 1.4 or 1.7 for broad compatibility.

Linearized PDFs (also called "fast web view") rearrange the file structure so a viewer can begin displaying the first page before the entire file is downloaded. The first page's objects appear at the beginning of the file, and a special linearization hint table helps the viewer request the right byte ranges. The file is built from the same parts (objects, cross-reference data, trailer), but they are reordered for streaming access, with the cross-reference information for the first page placed near the start of the file.

You can tell if a PDF is linearized by the presence of /Linearized 1 near the start of the file. Many PDF generators have an "Optimize for web" option that enables this.
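That check is easy to script. The sketch below only looks for the key in the first 1024 bytes; it is a heuristic, not a validation of the linearization data, and `looks_linearized` is a name chosen for this example.

```python
def looks_linearized(data: bytes, window: int = 1024) -> bool:
    """Return True if a /Linearized key appears near the start of the file.

    Heuristic only: a file that was linearized and later modified may
    still carry the key without being usable for fast web view.
    """
    return b"/Linearized" in data[:window]
```

In practice you would pass it the first kilobyte read from disk rather than the whole file.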

For merging or compressing existing PDFs without needing to understand the internals, PDF Merge and PDF Compress handle the heavy lifting. To go deeper on what compression does to PDF file structure, the follow-up post How PDF Compression Works covers stream-level optimization and re-encoding strategies. And for a practical comparison of when to use PDF at all, PDF vs DOCX covers the workflow decision.