How PDF Redaction Actually Works

In 2009, the Transportation Security Administration released a screening manual to the public with sensitive sections "redacted": black rectangles drawn over names of officials, security procedures, and explosive thresholds. Once the PDF was on the web, readers were copying the supposedly hidden text straight out of it. The black boxes were just shapes drawn on top. The original characters were still in the file, intact, copy-pasteable, indexable by search engines.

It happens constantly. The Department of Justice has done it. Manafort's lawyers did it in 2019, leaking sealed court information about Konstantin Kilimnik. Banks, law firms, and Fortune 500 companies do it every quarter. The pattern is always the same: someone draws a black rectangle in a tool that doesn't understand what redaction actually means, ships the document, and the underlying text travels with it.

This post is about why that keeps happening, what real redaction looks like under the hood, and how to verify your own documents before they leak.

What a PDF Actually Stores

A PDF is not a picture of a page. It's a structured document made of objects — fonts, content streams, images, metadata, annotations, form fields — assembled by a viewer at render time. When Acrobat shows you "Hello, world" on a page, what's stored on disk is roughly:

BT
  /F1 12 Tf
  72 720 Td
  (Hello, world) Tj
ET

BT opens a text object, Tf selects the font and size, Td sets a position, Tj paints a string, and ET closes the object. The literal characters live in the content stream, independent of what ends up visible on screen. A viewer can show or hide things using clipping paths, layers, opacity, or stacking order, but none of that erases the underlying objects.

This is the trap. When you draw a filled black rectangle over (Hello, world), the rectangle gets appended to the content stream as another drawing operation. The string is still right there in the file, still selectable by anything that walks the object tree.
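To make the trap concrete, here is a minimal sketch in Python. It assumes a toy, uncompressed content stream (real streams are usually compressed and use richer string encodings), and the helper name literal_strings_shown is made up for illustration. It appends a filled black rectangle after the text and then pulls the string back out:

```python
import re

# Toy, uncompressed content stream: the text from the example above,
# followed by a black rectangle painted over the same area.
content_stream = b"""BT
  /F1 12 Tf
  72 720 Td
  (Hello, world) Tj
ET
0 0 0 rg
70 715 80 14 re
f
"""


def literal_strings_shown(stream: bytes) -> list[str]:
    """Return literal strings painted by Tj operators (simplified).

    Assumption: strings contain no nested or escaped parentheses.
    """
    matches = re.findall(rb"\(([^()\\]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


# The rectangle changes what is rendered, not what is stored.
print(literal_strings_shown(content_stream))  # ['Hello, world']
```

The rectangle operators (rg, re, f) sit after the text object and never touch it, which is exactly why extraction tools ignore them.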

The Black Box Lie

The most common failure mode is what I'll call "black box redaction": opening a PDF in a generic editor, drawing solid rectangles over sensitive text, and saving. Visually it looks redacted. Functionally, the file still contains every character you tried to hide.

Three trivial ways the data leaks back out:

Copy-paste. Open the PDF in any viewer, drag-select the redacted region, paste into a text editor. The hidden text appears.
Text extraction tools. Run pdftotext from poppler, or any PDF text extractor, and the text in the underlying content stream comes straight out, rectangles ignored.
Search indexing. Google, Bing, and internal enterprise search will index the original text. Your "hidden" content shows up in snippets.

You can confirm this in seconds with our PDF Text Extractor. Drop a black-boxed PDF in, and any text supposedly hidden underneath will print right out.

There's a related failure with images: if a sensitive image was placed on the page and then covered, the image object itself is still embedded. Tools that walk PDF resources will pull it out. Our PDF Image Extractor does exactly that, and is a fast sanity check before publishing anything sensitive.

What Real Redaction Does

Proper redaction has two phases. Most failures skip the second one entirely.

Phase one: marking. The reviewer identifies regions to redact. This produces redaction annotations — metadata that says "the area at coordinates X,Y to X',Y' on page 3 will be redacted." At this stage, nothing has actually been removed. The annotation is a TODO note for the document.

Phase two: applying. This is where real redaction happens, and it has to do all of the following:

Parse the content stream and identify every text-showing operator (Tj, TJ, ', ") whose glyphs intersect the redaction rectangle
Remove or replace those operators in the content stream itself, not just paint over them
Identify any image XObjects that overlap the region and either crop, replace, or remove them
Walk vector path operators (m, l, c, re, etc.) and clip them at the redaction boundary
Repaint a solid fill (typically black) on top so the visual result matches expectation
Remove any separate copy of the text, such as a hidden OCR layer or ActualText entries, so copy-paste and extraction return nothing for the redacted span

If a tool only does the last step — paint a black rectangle — it has not redacted anything. It has decorated the page.
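The difference between decorating and redacting can be sketched in a few lines of Python. This is a toy model under loud assumptions: the stream is uncompressed, every BT ... ET block holds exactly one Td position and one Tj string, and the Td origin stands in for the glyph bounding box that a real tool would compute from font metrics and the text matrix. The function name apply_redaction is hypothetical.

```python
import re

# Toy assumption: one "x y Td" and one "(text) Tj" per BT ... ET block.
TEXT_BLOCK = re.compile(rb"BT\b.*?\bET\b[ \t]*\n?", re.DOTALL)
ORIGIN = re.compile(rb"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\b")


def apply_redaction(stream: bytes, rect: tuple[float, float, float, float]) -> bytes:
    """Delete text blocks whose origin lies inside rect, then paint a black box."""
    x0, y0, x1, y1 = rect

    def drop_if_inside(match: re.Match) -> bytes:
        origin = ORIGIN.search(match.group(0))
        if origin is None:
            return match.group(0)  # no position found: leave the block alone
        x, y = float(origin.group(1)), float(origin.group(2))
        inside = x0 <= x <= x1 and y0 <= y <= y1
        return b"" if inside else match.group(0)

    # Step 1: remove the operators themselves, not just their appearance.
    cleaned = TEXT_BLOCK.sub(drop_if_inside, stream)
    # Step 2: repaint a solid fill so the page looks as expected.
    box = f"q\n0 0 0 rg\n{x0:g} {y0:g} {x1 - x0:g} {y1 - y0:g} re\nf\nQ\n"
    return cleaned + box.encode("ascii")


page = b"""BT
/F1 12 Tf
72 720 Td
(Public heading) Tj
ET
BT
/F1 12 Tf
72 700 Td
(Agent: Jane Doe) Tj
ET
"""

redacted = apply_redaction(page, (60, 695, 300, 712))
print(redacted.decode("ascii"))
```

Only the second text block falls inside the rectangle, so only that block disappears from the stream. Images, vector paths, TJ arrays, and compressed streams are deliberately out of scope here, and they are where real tools earn their keep.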

Adobe Acrobat Pro's redaction workflow makes this explicit: "Mark for Redaction" is one button, "Apply Redactions" is a separate, irreversible action. The two-step UI is not friction for its own sake. It exists because Phase 2 destroys data, and the tool wants you to confirm before doing it. See Adobe's official redaction documentation for the full procedure.

flowchart TB
  START([Original PDF]) --> P1
  subgraph P1[Phase 1: Mark]
    M1[Identify regions]
    M2[Add redaction annotations]
    M3[Reviewer confirms<br/>nothing destroyed yet]
    M1 --> M2 --> M3
  end
  P1 --> P2
  subgraph P2[Phase 2: Apply - irreversible]
    A1[Edit content stream<br/>remove Tj operators]
    A2[Crop or remove<br/>image XObjects]
    A3[Clip vector paths]
    A4[Repaint solid fill]
    A5[Strip metadata + XMP]
    A1 --> A2 --> A3 --> A4 --> A5
  end
  P2 --> V[Verify: pdftotext + pdfimages + pdfinfo]
  V --> END([Safe to publish])

Metadata Is a Second Leak Channel

Even if you successfully remove text from the visible content stream, PDFs carry metadata that often duplicates it. Common offenders:

Document properties: Title, Author, Subject, Keywords, Producer, CreationDate, ModDate
XMP metadata: an embedded XML packet that may contain revision history, internal file paths, original author identity, and tool fingerprints
Form fields: filled-in values stored separately from rendered glyphs
Embedded comments and annotations: review notes, sticky notes, highlights with attached text
Bookmarks and outlines: chapter titles that name what you're trying to hide
Embedded files: original Word docs or spreadsheets attached to the PDF
JavaScript actions: button handlers that contain data references
Optional content groups (layers): content marked invisible but still present

A proper redaction tool will run a sanitization pass that walks all of these and either strips them, replaces them with empty values, or asks the reviewer what to do. Acrobat calls this "Sanitize Document." The NSA's redaction guide covers the methodology in depth and is the authoritative reference for high-stakes work.

If you do not sanitize, the visible page can be perfectly clean and the document properties can still announce "Author: Jane Doe, Title: Internal Memo re: Acquisition of FooCorp."
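A rough triage of these channels is possible without any PDF library, by scanning the raw bytes for the name tokens that introduce them. The sketch below is a heuristic, not a sanitizer: the marker list and the function name find_leak_markers are my own choices, a hit only means "look here", and a miss proves nothing, because objects stored in compressed object streams will not show up in a raw scan.

```python
# PDF name tokens (plus the XMP packet header) that hint at each leak channel.
LEAK_MARKERS = {
    b"/Author": "document properties",
    b"/Title": "document properties (also used by bookmarks)",
    b"<?xpacket": "XMP metadata packet",
    b"/AcroForm": "form fields",
    b"/Annots": "annotations and comments",
    b"/Outlines": "bookmarks and outlines",
    b"/EmbeddedFiles": "embedded files",
    b"/JavaScript": "JavaScript actions",
    b"/OCProperties": "optional content groups (layers)",
}


def find_leak_markers(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each marker that appears in the raw file bytes."""
    return {
        marker.decode("ascii"): pdf_bytes.count(marker)
        for marker in LEAK_MARKERS
        if marker in pdf_bytes
    }


# Tiny hand-written fragment standing in for a real file.
sample = (
    b"<< /Author (Jane Doe) /Title (Internal Memo re: Acquisition of FooCorp) >>\n"
    b"<< /Type /Catalog /Outlines 5 0 R >>\n"
)
print(find_leak_markers(sample))
# {'/Author': 1, '/Title': 1, '/Outlines': 1}
```

For a real file you would pass open(path, "rb").read() and then follow up with exiftool or pdfinfo on anything it flags.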

Cases That Actually Made the News

The Wikipedia article on sanitization of classified information maintains a depressing running list. A few highlights:

2005, US Army report on Nicola Calipari's death. Black boxes covered names of officers and rules of engagement. Recipients hit Ctrl+A, Ctrl+C, paste, and got the lot. The report had to be reissued.
2009, TSA screening manual. Detailed thresholds, exemption lists, and identifying details revealed by basic copy-paste from under the rectangles.
2011, AT&T iPad email leak case. A "redacted" court filing exposed witness names because the redaction was image overlay, not text removal.
2019, Manafort filing. Sealed mentions of Russian intelligence contacts visible on copy-paste. Made global headlines within hours of filing.
2023, multiple law firm leaks. Discovery documents released to opposing counsel containing fully recoverable "redacted" sections — sometimes settling cases on the spot.

These were not amateurs. They were lawyers, intelligence agencies, and Fortune 500 companies using mainstream commercial software. The common factor was a tool or workflow that drew rectangles without applying them. The Electronic Frontier Foundation's transparency work catalogs more cases and pushes agencies to redact correctly when responding to FOIA requests.

Verifying Redaction Before You Ship

Three checks, in order, before any document leaves your control.

Check 1: extract text and grep. Use a real text extractor and search for anything that should be hidden. If a redacted name appears in the output, the redaction is broken.

pdftotext redacted.pdf - | grep -i "smith"
# Empty output = good. Match = your "redaction" is fake.

If you don't have a CLI handy, our PDF Text Extractor does the same thing in the browser — extracted text is searchable directly on the page.

Check 2: extract images. Sensitive logos, signatures, or screenshots may be embedded as image objects under the redaction. Run an extractor:

mkdir -p out
# The last argument is a filename prefix, so images land in out/img-000.png, etc.
pdfimages -all redacted.pdf out/img
ls out/
# Inspect every output. If you see something you tried to hide, you didn't hide it.

Or use PDF Image Extractor for a quick visual audit.

Check 3: dump metadata. Author, title, and producer fields routinely carry sensitive info — internal codenames, original filenames, draft revision history.

pdfinfo redacted.pdf
exiftool redacted.pdf

If your goal is to publish anonymously, the metadata pass is non-negotiable.
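When there is a whole list of names to check, Check 1 is worth scripting. Here is a minimal sketch that assumes poppler's pdftotext is on the PATH; the function names extract_text and leaked_terms are made up for this example. It matches case-insensitively with all whitespace removed, so a name split across a line break still counts as a hit, at the price of a few extra false positives.

```python
import re
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run `pdftotext <file> -` and return what it prints to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def _normalize(text: str) -> str:
    # Drop all whitespace and fold case so "Jane\nDoe" matches "jane doe".
    return re.sub(r"\s+", "", text).casefold()


def leaked_terms(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = _normalize(extracted_text)
    return [term for term in sensitive_terms if _normalize(term) in haystack]


# Usage with a string standing in for real pdftotext output:
sample_output = "Report by Jane\n  Doe, SMITH & Co."
print(leaked_terms(sample_output, ["jane doe", "smith", "Kilimnik"]))
# ['jane doe', 'smith']
```

In a release script you would call leaked_terms(extract_text("redacted.pdf"), terms) and refuse to publish on a non-empty result. It only covers text, so the image and metadata checks still have to run alongside it.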

Practical Workflow

A reliable redaction workflow looks roughly like this:

1. Work on a *copy* — never the original
2. Mark all sensitive regions (Phase 1)
3. Apply redactions (Phase 2) — this is irreversible
4. Run a sanitize/clean-metadata pass
5. Save as a new file (do NOT overwrite the marked-up version)
6. Verify with the three checks above
7. Open the result in a fresh viewer and try to copy-paste
8. Only then publish

flowchart LR
  ORIG[Original PDF] --> COPY[Work on a copy]
  COPY --> MARK[Mark regions]
  MARK --> APPLY[Apply redactions<br/>destroy data]
  APPLY --> SAN[Sanitize metadata,<br/>XMP, comments,<br/>embedded files]
  SAN --> SAVE[Save as new file]
  SAVE --> V1[pdftotext grep]
  SAVE --> V2[pdfimages inspect]
  SAVE --> V3[pdfinfo / exiftool]
  V1 & V2 & V3 --> CHECK{All clean?}
  CHECK -- "yes" --> PUB[Publish]
  CHECK -- "no" --> COPY

Three structural rules worth following regardless of tool choice.

Flatten before redacting when possible. Converting text-bearing pages to flat raster images and then redacting on the image guarantees the underlying text is gone: the page's content stream only paints an image, so there are no text operators left to leak from. The cost is searchability and accessibility. For high-stakes redaction (legal filings, intelligence material) flattening is often the right tradeoff. Our PDF compression tool and other rasterizing pipelines can produce flat PDFs.

Treat watermarks differently. Watermarks are visible by design and are not redaction. If you need a "draft" or "confidential" overlay, use a real watermark tool like our PDF Watermark, but never confuse it with redaction. A watermark adds visible content on top of the page; redaction removes content from the file.

Unlock encrypted PDFs before you redact. Redaction tools cannot rewrite content streams they can't decrypt. Use a tool like PDF Unlock with the correct password (not for breaking encryption you don't own), redact and sanitize the cleaned output, then re-encrypt if needed. Encryption is not redaction — a password-protected PDF that still contains the original sensitive text is one cracked password away from full disclosure. Redaction removes the data; encryption merely guards it.

The Takeaway

PDF redaction has a UX problem masquerading as a security problem. Every commercial PDF tool offers a "draw rectangle" function. Far fewer offer a real "apply redaction and sanitize" pipeline, and the ones that do bury it behind multi-step workflows that are easy to skip. Most leaks happen because a smart, careful person used the wrong button.

Three rules to internalize:

Drawing a black rectangle is not redaction. It's drawing a black rectangle.
Redaction must modify the content stream and strip metadata. If your tool doesn't claim to do both, it doesn't redact.
Always verify with text + image extraction + metadata dump before shipping. The verification step takes ninety seconds and would have caught the copy-paste leaks described above.

Hidden data in documents is a solved problem. The solutions just have to be used correctly, and the cost of getting it wrong is permanent. If you're publishing anything you can't take back, treat redaction the way you'd treat a database migration: review, apply, verify, then commit.

FAQ

Why isn't drawing a black rectangle in Preview or Photoshop real redaction?

Because a rectangle drawn with a shape or annotation tool is just one more object painted on top of the page; the original text and images stay in the file untouched. A reader can copy-paste, run pdftotext, or extract images and recover the data. Real redaction must remove or replace the underlying objects in the content stream, which general-purpose shape tools and image editors don't do. Use Adobe Acrobat Pro's "Apply Redactions" feature or a dedicated redaction tool.

Can OCR'd scanned PDFs leak redacted information?

Yes, in two ways. If the redaction was applied to the visible image but the OCR'd text layer was left untouched, search engines and screen readers still see the original text. If the redaction added a new image overlay without removing the underlying scanned image, image extraction tools recover the unredacted page. Always run a sanitize pass that strips both layers and verify with pdftotext after redacting.
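OCR text layers are typically written with text rendering mode 3 (the operator sequence 3 Tr), which tells the viewer to neither fill nor stroke the glyphs, so the text is invisible but fully extractable. Here is a small Python sketch of spotting that, assuming you already have a decompressed content stream in hand; the helper name has_invisible_text is hypothetical.

```python
import re

# "3 Tr" sets text rendering mode 3: glyphs are neither filled nor stroked.
# The lookbehind avoids matching values such as "13 Tr" or "0.3 Tr".
INVISIBLE_MODE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(stream: bytes) -> bool:
    """Report whether a decompressed content stream switches to invisible text."""
    return INVISIBLE_MODE.search(stream) is not None


# Toy scanned page: a full-page image, then an invisible OCR text layer.
scanned_page = b"""q 612 0 0 792 0 0 cm /Im0 Do Q
BT
3 Tr
/F1 10 Tf
72 700 Td
(Agent: Jane Doe) Tj
ET
"""

visible_page = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n"

print(has_invisible_text(scanned_page))  # True
print(has_invisible_text(visible_page))  # False
```

A black box painted over the scanned image does nothing to that hidden layer. Text can also be hidden in other ways (white fill, clipping, placement off the page), so treat this as one signal among several and keep pdftotext as the final word.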

Is rasterizing a PDF a safe redaction strategy?

It's the most reliable strategy for high-stakes work. Converting text-bearing pages to flat raster images removes the text operators entirely: each page becomes a single image, so there's no underlying text to leak. Rasterizing does not touch document properties or XMP, so a metadata pass is still required. The cost is searchability and accessibility (screen readers can't read the page, search engines can't index it). For legal filings or intelligence material, the tradeoff is worth it; for general business documents, proper redaction with text removal is preferred.

What metadata do I need to strip beyond the visible content?

At minimum: document properties (Title, Author, Subject, Keywords, Producer), XMP metadata (the embedded XML packet often contains internal paths and revision history), embedded comments and annotations, bookmarks and outlines, embedded files, JavaScript actions, and optional content groups (hidden layers). Tools like Acrobat's "Sanitize Document" walk all of these. exiftool and pdfinfo are the command-line equivalents for verification.

How do I verify redaction worked before publishing?

Run three checks in order: (1) pdftotext redacted.pdf - | grep -i "sensitive_term" — should return empty; (2) pdfimages -all redacted.pdf out and inspect every extracted image for hidden content; (3) pdfinfo redacted.pdf and exiftool redacted.pdf to verify metadata is clean. If any of these surface what you tried to hide, your redaction is broken. Always test on a copy, never on the file you're about to publish.

Are there free, browser-based redaction tools I can trust?

Most "free PDF redaction" tools online only do the rectangle-overlay version of redaction, not real content stream removal. Adobe Acrobat Pro is the established gold standard but costs money. Open-source command-line tools can help if you are disciplined: qpdf can uncompress content streams for inspection and editing (its --qdf mode), and mutool clean can rewrite a file and drop unreferenced objects. Neither command is a one-step redactor, so verify the output yourself. For high-stakes work, never trust a tool that claims redaction without explicit "apply" and "sanitize" steps.

What's the difference between encryption and redaction?

Encryption hides content behind a password but leaves it intact in the file. Redaction removes content entirely. A password-protected PDF that still contains the original sensitive text is one cracked password (or one shared password) away from full disclosure. Treat encryption as access control, not data minimization. If something needs to never be recoverable, redact it before encrypting.

Why do major law firms keep getting redaction wrong?

Mostly UX and process issues, not technical incompetence. Most PDF tools' redaction workflow has too many steps, and "Mark for Redaction" looks done before "Apply Redactions" runs. Lawyers under deadline pressure ship marked-but-not-applied files. The fix is process discipline: a checklist that requires text extraction verification before any document leaves the firm, plus institutional defaults that flatten high-sensitivity exhibits to raster.