PDF Redaction Mistakes — How Black Boxes Leak Hidden Info

In January 2019, lawyers for Paul Manafort filed a court document with sensitive sections "redacted" using neat black rectangles: references to a Russian intelligence contact, polling data sharing, meetings in Madrid. Within minutes, journalists were copying the redacted text directly out of the PDF, and the story was national news the same day. The lawyers used a perfectly mainstream tool. They used the wrong button.

Ten years earlier, in 2009, the TSA posted a 93-page screening manual on a federal procurement website with black bars over officials' names and internal thresholds. The text underneath was selectable, and anyone who downloaded the file could copy it out.

These are not edge cases. They are the dominant failure mode of PDF redaction. This post is about the specific mistakes — what each looks like in the file, why mainstream tools allow them, and the verification steps that would have caught every single case.

Why "Drawing a Black Box" Doesn't Redact

A PDF is not a picture of a page. It's a tree of objects: fonts, content streams, images, annotations, metadata. When a viewer renders "John Smith" on the page, the underlying file holds something like this:

BT
  /F1 12 Tf
  72 720 Td
  (John Smith) Tj
ET

The string (John Smith) lives in a content stream as literal text. The viewer paints it. If you draw a filled rectangle over it in a generic editor, the rectangle is added as one more drawing operation, either appended to the same stream (as below) or stored as a separate annotation:

BT
  /F1 12 Tf
  72 720 Td
  (John Smith) Tj
ET
0 0 0 rg
72 715 60 12 re f

Visually, the rectangle covers the text. But (John Smith) is still in the stream, intact. Any tool that walks the object tree, including standard text extractors and search-engine crawlers, finds it instantly. This is the mechanism behind most high-profile redaction failures of the last two decades.
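
As a minimal sketch of how little effort extraction takes, the Python below pulls every literal string shown with Tj out of the example stream. It assumes an uncompressed stream and a simple font encoding; real files usually compress streams with FlateDecode and may use hex strings or the TJ operator, which this regex does not cover. The names STREAM and shown_strings are illustrative only.

```python
import re

# Content stream from the example above: text first, black rectangle on top.
STREAM = b"""BT
  /F1 12 Tf
  72 720 Td
  (John Smith) Tj
ET
0 0 0 rg
72 715 60 12 re f
"""

# A literal string followed by Tj. Escaped parentheses are handled;
# nested unescaped parentheses and hex strings are not.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(stream: bytes) -> list[str]:
    """Return every literal string passed to a Tj operator."""
    return [m.group(1).decode("latin-1") for m in TJ_LITERAL.finditer(stream)]


print(shown_strings(STREAM))  # ['John Smith']
```

The rectangle operators after ET play no part in the result: the extractor never looks at what is painted, only at what is stored.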

You can verify this in seconds with our PDF Text Extractor. Drop a black-boxed PDF in, and any text supposedly hidden underneath prints right out. If your "redacted" document leaks text into that tool, anyone on the internet who downloads it can do the same.

What's Actually in Your PDF After a Black-Box "Redaction"

When you draw a rectangle in a non-redaction-aware tool and save, the file gains the drawing operation but loses nothing. Specifically, you can still recover:

Every character of the supposedly hidden text, via copy-paste or pdftotext
Embedded images under the rectangle, via pdfimages -all or our PDF Image Extractor
Form field values stored separately from rendered text
Document and XMP metadata — author, title, internal paths, software fingerprints
Bookmarks that often summarize redacted sections
Embedded files — original Word docs the PDF was generated from
Optional content layers marked invisible but still in the file
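
A rough triage for the channels listed above is to scan the raw file for the PDF name objects that usually accompany them. This is a sketch under a loud assumption: many PDFs keep objects inside compressed object streams, so a clean result proves nothing, and a hit only means the channel deserves a closer look. MARKERS and leak_channels are illustrative names.

```python
# PDF name objects that hint at a leak channel (keys defined by the PDF spec).
MARKERS = {
    b"/AcroForm": "form fields",
    b"/Metadata": "XMP metadata packet",
    b"/Outlines": "bookmarks",
    b"/EmbeddedFile": "embedded files",
    b"/OCProperties": "optional content layers",
}


def leak_channels(pdf_bytes: bytes) -> list[str]:
    """List the channels whose marker appears anywhere in the raw bytes."""
    return [label for name, label in MARKERS.items() if name in pdf_bytes]


# A hand-written catalog dictionary standing in for a real file.
sample = b"<< /Type /Catalog /Outlines 5 0 R /AcroForm 7 0 R >>"
print(leak_channels(sample))  # ['form fields', 'bookmarks']
```

For a real file, pass the result of `open(path, "rb").read()` and treat the output as a to-do list for manual inspection.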

Adobe's official redaction documentation splits redaction into "Mark for Redaction" and "Apply Redactions" deliberately. The second step destroys data; the first one doesn't. Generic PDF editors that conflate the two let you mark all day without applying anything.

The Manafort 2019 Leak via Copy-Paste

The Manafort filing addressed the special counsel's claim that Manafort had violated his cooperation agreement, with black rectangles drawn over names, dates, and references. Within minutes of the docket entry going live, reporters opened the PDF in Adobe Reader, drag-selected the redacted regions, and pasted into text editors. The selection grabbed the underlying text because the rectangles sat on top of the text instead of replacing it. The sealed information appeared verbatim, including the reference to Konstantin Kilimnik, described by prosecutors as having ties to Russian intelligence, and details of polling data shared during the 2016 campaign.

The Wikipedia article on sanitization of classified information catalogs this case alongside dozens of others. The pattern is always the same: a generic editor, a rectangle tool, no apply-redactions phase.

The TSA 2009 Manual Leak

Ten years before the Manafort filing, the TSA posted its Screening Management Standard Operating Procedures manual on a federal procurement website. Black bars covered senior officials' names, weight thresholds for explosive detection, exemption rules, and internal abbreviation glossaries. Researchers opened the 93-page PDF in Acrobat Reader and copied selectable text from under the bars. Every redacted section was recovered within hours.

The TSA case stacked a second mistake on top: even after the agency reissued the document with proper content-stream redaction, the metadata still contained internal author names and the original software fingerprint. The Electronic Frontier Foundation's transparency reports catalog similar mistakes across federal agencies.

Real Redaction at the Content-Stream Level

Proper redaction has two phases, and both have to happen. Most failures skip phase two.

Phase 1: marking. The reviewer identifies regions as Redact annotations — metadata that flags coordinates as "to be redacted." At this point nothing is removed.

Phase 2: applying. The redaction engine has to:

Parse every content stream on each affected page
Find every text-showing operator (Tj, TJ, ', ") whose glyphs intersect the region
Remove or replace those operators in the stream, not paint over them
Crop or remove image XObjects that overlap the region
Clip vector path operators (m, l, c, re) at the redaction boundary
Repaint a solid fill so the visual result still shows a black box
Regenerate the page's text-extraction layer so copy-paste returns nothing
Strip linked metadata (annotations, bookmarks, form values) that name the redacted content
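
The core of phase 2, removing operators instead of painting over them, can be sketched on a toy content stream. This is a deliberately simplified illustration, not how a production engine works: it assumes only Td moves the text position, ignores glyph widths, and does not handle TJ, ', ", the text matrix, images, or vector paths. The names redact_stream and STREAM are hypothetical.

```python
import re

# One token per match: a whole literal string, or a run of non-space bytes.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|\S+")

STREAM = b"BT /F1 12 Tf 72 720 Td (John Smith) Tj 0 -40 Td (Public text) Tj ET"


def redact_stream(stream: bytes, box: tuple[float, float, float, float]) -> bytes:
    """Drop Tj strings whose text origin lies inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    out: list[bytes] = []
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok == b"BT":
            x = y = 0.0  # each text object starts at the origin
        elif tok == b"Td":
            # Operands tx ty precede the operator; offsets accumulate.
            x += float(out[-2])
            y += float(out[-1])
        elif tok == b"Tj" and x0 <= x <= x1 and y0 <= y <= y1:
            out.pop()  # remove the string operand
            continue   # and skip the operator itself
        out.append(tok)
    # Repaint a solid fill so the page still shows a black box.
    out += [b"0 0 0 rg", b"%g %g %g %g re f" % (x0, y0, x1 - x0, y1 - y0)]
    return b" ".join(out)


print(redact_stream(STREAM, (72, 715, 132, 727)))
```

The difference from the black-box approach is visible in the output: the (John Smith) operand is gone from the stream, while the text outside the box survives untouched.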

The NSA's published redaction guidance walks through the procedure in detail. It is the canonical reference for high-stakes work and explicitly calls out the difference between marking and applying.

Our PDF Redact Text tool follows this content-stream-level approach. The output PDF has the redacted text physically removed, not painted over. Verification with pdftotext returns nothing for the redacted spans — because nothing is there anymore.

Metadata as a Second Leak Channel

Even successful content-stream redaction can leak through metadata. The fields that routinely betray documents:

Title       : Internal Memo re: Acquisition of FooCorp
Author      : Jane Doe (Legal)
Subject     : DRAFT v3 — confidential
Keywords    : project-orion, board-2024, layoffs
Producer    : Acrobat Pro DC 24.001.20629
CreationDate: 2024-11-12T14:23:08Z

XMP metadata is worse: an embedded XML packet that may contain edit history, internal paths, and tool fingerprints. ExifTool dumps it in seconds with exiftool redacted.pdf.
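
To illustrate how exposed the document information fields are, here is a small sketch that reads them straight from raw bytes. It assumes the Info dictionary is uncompressed and uses literal strings without nested parentheses; hex or UTF-16 strings and the XMP packet are out of scope, so use exiftool for real audits. INFO_FIELD, info_fields, and SAMPLE are illustrative names.

```python
import re

# Standard Info-dictionary keys followed by a simple literal string.
INFO_FIELD = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)"
)


def info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Map each Info key found in the raw bytes to its string value."""
    return {
        key.decode("ascii"): value.decode("latin-1")
        for key, value in INFO_FIELD.findall(pdf_bytes)
    }


# Hand-written Info dictionary based on the example fields above.
SAMPLE = (
    b"<< /Title (Internal Memo re: Acquisition of FooCorp) "
    b"/Keywords (project-orion, board-2024, layoffs) "
    b"/Producer (Acrobat Pro DC 24.001.20629) >>"
)

for key, value in info_fields(SAMPLE).items():
    print(f"{key}: {value}")
```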

Form fields are a third channel. If a redacted page contains an AcroForm field whose value is (John Smith), that value lives in the form-field dictionary, not the rendered content stream. Any field-aware tool reads it directly. Our PDF Form Filler is useful for inspecting which fields exist before publication.
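
The form-field channel can be sketched the same way. The snippet below assumes a field dictionary that is uncompressed, lists /T (the field name) before /V (the value), and has no nested dictionary between them; real forms often break all three assumptions, so this only shows where the value lives. FIELD, field_values, and SAMPLE are hypothetical names.

```python
import re

# /T (name) followed, within the same dictionary, by /V (value).
FIELD = re.compile(
    rb"/T\s*\(((?:\\.|[^\\()])*)\)[^>]*?/V\s*\(((?:\\.|[^\\()])*)\)"
)


def field_values(pdf_bytes: bytes) -> dict[str, str]:
    """Map field names to their stored values, read from raw bytes."""
    return {
        name.decode("latin-1"): value.decode("latin-1")
        for name, value in FIELD.findall(pdf_bytes)
    }


# A hand-written text-field dictionary; the page content never mentions the name.
SAMPLE = b"<< /FT /Tx /T (applicant_name) /V (John Smith) /Rect [72 715 132 727] >>"
print(field_values(SAMPLE))  # {'applicant_name': 'John Smith'}
```

Nothing in this lookup touches a content stream, which is why a page can pass a pdftotext check and still leak the same string through its form.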

A complete redaction workflow runs a sanitization pass over all of these. Acrobat calls it "Sanitize Document." If your tool doesn't have a sanitize step, your redaction is incomplete by definition.

Practical Workflow That Doesn't Leak

A reliable redaction workflow:

1. Work on a copy — never the original
2. Decrypt the file if needed, then flatten any layered or form-bearing pages
3. Mark all sensitive regions (Phase 1)
4. Apply redactions (Phase 2) — irreversible
5. Run sanitize/clean-metadata pass
6. Save as a new file
7. Verify: text + image extraction + metadata dump
8. Try copy-paste manually in a fresh viewer
9. Only then publish

Verification is non-negotiable. Three checks:

# 1. Text extraction: should print nothing for redacted spans
pdftotext redacted.pdf - | grep -i "smith"

# 2. Image extraction: inspect every output for sensitive content
# (the last argument is a filename prefix, so create the directory first)
mkdir -p out
pdfimages -all redacted.pdf out/img
ls out/

# 3. Metadata dump
pdfinfo redacted.pdf
exiftool redacted.pdf

In the browser, PDF Text Extractor replaces step 1, PDF Image Extractor replaces step 2, and any viewer's document-properties pane covers step 3.
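
The text check can also be scripted so it runs against a whole list of sensitive terms instead of one grep at a time. This is a sketch that assumes Poppler's pdftotext is installed and on PATH; leaked_terms and extract_text are illustrative names, and the file name and term list in the usage note are placeholders.

```python
import shutil
import subprocess


def leaked_terms(text: str, terms: list[str]) -> list[str]:
    """Return the terms that still appear in extracted text (case-insensitive)."""
    lowered = text.casefold()
    return [term for term in terms if term.casefold() in lowered]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return what it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `leaked_terms(extract_text("redacted.pdf"), ["smith", "orion"])` should return an empty list before publication. An empty list covers only the text layer; images and metadata still need their own checks.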

If you're publishing anonymously, verification is the difference between a clean publication and a career-ending leak. Most journalist-protection guides recommend rasterizing the document (every page to flat image) before redacting; tools like PDF Compress and rasterizing pipelines can flatten pages this way. The cost is searchability — for high-stakes work, the right trade.

If your source PDF is encrypted, decrypt it first with PDF Unlock (assuming you own the password) so the redaction engine can rewrite the content stream. Re-encrypt the cleaned output if needed, but never confuse encryption with redaction. Likewise, a "draft" overlay from PDF Watermark is not redaction: a watermark adds visible content on top, while redaction removes the content underneath.

For deeper mechanics, see How PDF Redaction Actually Works. For broader PDF internals, How PDF Works covers the object model.

FAQ

Is "drawing a black rectangle in Word" the same as redaction?

No, and worse — Word's "Save as PDF" preserves selectable text under any drawing layer you add. The rectangle is rendered as a graphic on top of the text run, not a replacement for it. A PDF made this way leaks every redacted character via copy-paste and pdftotext. Use a real redaction tool (Acrobat Pro's apply-redactions, our PDF Redact Text, or a flattening pipeline) instead.

What about printing a PDF and re-scanning it?

This works because scanning produces a flat raster image — there's no content stream to leak from. The trade-off is that the resulting PDF is unsearchable, inaccessible to screen readers, and has no selectable text. For most modern publishing workflows, that's a problem. The flatten-to-raster approach is the same idea executed digitally and is preferable for high-stakes work where searchability matters less than absolute redaction.

Can OCR recover text from rasterized redactions?

OCR can read only what is visibly on the page. Sensitive text left outside a redaction box is still visible after rasterizing, so OCR will recover it. Text under a solid black box is different: once the page is flattened to pixels, the box's pixels have replaced the glyphs, and OCR returns nothing for that region because there is nothing left to read. Rasterized redaction is robust to OCR as long as the box is fully opaque. Vector "redaction" with selectable text underneath is not.
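
A toy model makes the point concrete. In the sketch below a "page" is a grid of pixels (1 for ink, 0 for paper), and painting an opaque box overwrites whatever was there. Two different originals produce identical output, which is what it means for the information to be destroyed. The grids and the name paint_box are invented for illustration.

```python
Grid = list[list[int]]

# Two different 3x8 "pages": 1 = ink, 0 = paper.
page_a: Grid = [
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
]
page_b: Grid = [
    [1, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 1],
]


def paint_box(pixels: Grid, x0: int, y0: int, x1: int, y1: int) -> Grid:
    """Return a copy with every pixel in [x0, x1) x [y0, y1) set to ink."""
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else value
         for x, value in enumerate(row)]
        for y, row in enumerate(pixels)
    ]


# Cover the whole grid on both pages and compare the results.
print(paint_box(page_a, 0, 0, 8, 3) == paint_box(page_b, 0, 0, 8, 3))  # True
```

A vector PDF with a rectangle drawn over text has no equivalent of this overwrite: both the glyphs and the box are stored, and only the rendering hides one behind the other.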

Are encrypted PDFs automatically safe?

No. Password-protected PDFs are readable to anyone with the password. Encryption is access control, not data removal — the original sensitive text is still in the file. Redact and sanitize first, then optionally encrypt the cleaned output.

Does Adobe Acrobat Pro guarantee proper redaction?

Acrobat Pro's "Apply Redactions" function does perform content-stream-level redaction correctly when you use it. The leaks happen when users mark redactions but never apply them, or when they only paint with the rectangle drawing tool instead of using the redaction workflow. The tool is correct; the failures are workflow mistakes. Always click "Apply Redactions" and run "Sanitize Document" before saving.

What about PDF/A archival format?

PDF/A is an ISO standard for long-term archival. It does not redact anything. A PDF/A document with hidden text under black rectangles is just as leaky as any other PDF. Treat format and redaction as orthogonal questions.

How do I redact a PDF that contains AcroForm or XFA forms?

Forms store values separately from rendered glyphs, so a content-stream-redacted page can still leak via the form-field dictionary. Either flatten the form first (converting field values into static rendered text), or use a tool that walks form fields and sanitizes them. Our PDF Form Filler inspects which fields exist before publishing.

Should journalists trust a "redacted" leak document?

Treat it as untrusted until verified. Run pdftotext, pdfimages -all, and exiftool. If hidden text surfaces, the source's redaction failed and there may be more content they didn't intend to share. Conservative practice: rasterize before quoting, both to confirm what's visible and to prevent your publication from leaking the underlying data downstream.