UtilityKit

500+ fast, free tools. Most run in your browser only; Image & PDF tools upload files to the backend when you run them.

PDF Text Extractor

Extract the text from a PDF in your browser. Preserve or collapse line breaks, then copy the result or download it as a .txt file.

About PDF Text Extractor

The PDF Text Extractor uses Mozilla's PDF.js library to extract the text content from every page of a PDF document directly in your browser. Pages are processed sequentially with a live progress counter. A toggle lets you preserve paragraph line breaks (using EOL markers from PDF.js) or collapse the text into a flowing paragraph for easier copy-paste. The extracted text includes a page header separator (--- Page N ---) before each page's content. You can copy the full output to your clipboard or download it as a .txt file.
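The flow described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the interfaces mirror only the parts of the PDF.js API the sketch relies on (numPages, getPage, getTextContent, and the str and hasEOL fields of a text item), and the names joinItems, extractText, and onProgress are hypothetical.

```typescript
// Structural stand-ins for the PDF.js objects this sketch touches.
interface TextItemLike {
  str?: string;      // text of the fragment (marked-content items have none)
  hasEOL?: boolean;  // true when the fragment is followed by a line break
}
interface PdfPageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PdfPageLike>; // page numbers are 1-based
}

// Hypothetical helper: join one page's fragments, honoring the toggle.
function joinItems(items: TextItemLike[], preserveLineBreaks: boolean): string {
  let out = "";
  for (const item of items) {
    if (typeof item.str !== "string") continue; // skip non-text items
    out += item.str;
    if (item.hasEOL) out += preserveLineBreaks ? "\n" : " ";
  }
  // Collapsed mode squeezes all whitespace runs into single spaces.
  return preserveLineBreaks ? out.trimEnd() : out.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: walk pages in order, report progress, add page headers.
async function extractText(
  doc: PdfDocumentLike,
  preserveLineBreaks: boolean,
  onProgress: (done: number, total: number) => void = () => {},
): Promise<string> {
  const parts: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const { items } = await page.getTextContent();
    parts.push(`--- Page ${n} ---\n${joinItems(items, preserveLineBreaks)}`);
    onProgress(n, doc.numPages); // e.g. update an "N of M pages" counter
  }
  return parts.join("\n\n");
}
```

With the pdfjs-dist package, the document returned by getDocument(...).promise should satisfy PdfDocumentLike, so it could be passed to extractText directly. Processing pages one at a time keeps the progress callback simple.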

Why use PDF Text Extractor

  • Extracts text from all pages in sequence with a live progress indicator
  • Preserve or collapse line breaks depending on your downstream use (script vs paragraph)
  • Page separators (--- Page N ---) make it easy to locate content from specific pages
  • Runs in-browser via PDF.js: the file is read and processed locally rather than uploaded to a server

How to use PDF Text Extractor

  1. Drag and drop a PDF file onto the drop zone, or click 'browse' to select one
  2. Watch the progress indicator as each page is extracted (shows N of M pages)
  3. Toggle 'Preserve line breaks' on or off based on your needs
  4. Click 'Copy' to copy all extracted text to your clipboard
  5. Click '↓ .txt' to download the text as a plain text file

When to use PDF Text Extractor

  • When you need to copy text from a PDF that doesn't allow selection in the viewer
  • When processing PDF content in a text editor, IDE, script, or LLM prompt
  • When indexing PDF content for search or database import
  • When extracting text from scanned-but-OCR'd PDFs

Examples

Single-column report

Input: report.pdf — 5 pages of single-column text

Output: --- Page 1 --- ... --- Page 2 --- ... — clean paragraphs preserved with line breaks

Multi-column research paper

Input: paper.pdf — two-column academic layout

Output: Text may interleave columns; copy to a wider editor and reflow column-by-column
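If you post-process PDF.js text items yourself rather than reflowing by hand, one possible heuristic for a two-column page is sketched below. It is not something this tool does. It assumes each item carries the PDF.js transform array (index 4 is x, index 5 is y, with y growing upward), that the columns split at the horizontal midpoint, and that there are no full-width headings; reorderTwoColumns and pageWidth are hypothetical names.

```typescript
// Stand-in for a PDF.js text item with position data.
interface PositionedItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y] in PDF user space
}

// Hypothetical helper: emit the whole left column, then the whole right column.
function reorderTwoColumns(items: PositionedItem[], pageWidth: number): PositionedItem[] {
  const mid = pageWidth / 2;
  const x = (i: PositionedItem) => i.transform[4] ?? 0;
  const y = (i: PositionedItem) => i.transform[5] ?? 0;
  // Higher y comes first (top of page); ties are broken left to right.
  const readingOrder = (p: PositionedItem, q: PositionedItem) => y(q) - y(p) || x(p) - x(q);
  const left = items.filter((i) => x(i) < mid).sort(readingOrder);
  const right = items.filter((i) => x(i) >= mid).sort(readingOrder);
  return [...left, ...right];
}
```

Real papers often mix full-width titles, figures, and footnotes with the columns, so a midpoint split like this is only a starting point.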

Scanned-and-OCR'd doc

Input: contract.pdf — scanned then run through Adobe OCR

Output: Text layer present; extraction succeeds with possible minor OCR errors (broken words, mismatched chars)

Image-only scan

Input: old-fax.pdf — pure raster scan, no OCR

Output: Empty output — no text layer to extract; run OCR first using a tool like Tesseract or Adobe
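In a script that consumes the extracted text, the image-only case can be caught with a simple emptiness check. The sketch below assumes you already have one string per page; pagesWithoutText is a hypothetical helper, not part of this tool or of PDF.js.

```typescript
// Hypothetical helper: return the 1-based numbers of pages with no extractable text.
function pagesWithoutText(pageTexts: string[]): number[] {
  return pageTexts
    .map((text, index) => ({ page: index + 1, empty: text.trim().length === 0 }))
    .filter((entry) => entry.empty)
    .map((entry) => entry.page);
}
```

If every page is listed, the PDF most likely has no text layer at all and needs OCR before extraction.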

Tips

  • If the output is empty, the PDF likely has no text layer — it's an image-only scan that needs OCR first
  • Multi-column layouts may produce out-of-order text — paste into a wider editor and reflow manually
  • Turn off 'Preserve line breaks' if you need the text as one flowing paragraph, for example to feed it into an LLM
  • Use the page separators (--- Page N ---) to grep or search-jump to specific pages in long extracts
  • Encrypted PDFs need to be unlocked first — use the PDF Unlock tool then return here
  • For very large PDFs, watch the progress counter — extraction can take 5-10 seconds for 200+ pages
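For scripted use of the downloaded .txt file, the page separators can be turned into a page lookup. A minimal sketch, assuming each separator sits on its own line exactly as "--- Page N ---"; splitPages is a hypothetical helper name.

```typescript
// Hypothetical helper: map page number to that page's text.
function splitPages(text: string): Map<number, string> {
  // Splitting on a regex with a capture group keeps the captured page numbers,
  // so parts looks like [preamble, "1", body1, "2", body2, ...].
  const parts = text.split(/^--- Page (\d+) ---\r?$/m);
  const pages = new Map<number, string>();
  for (let i = 1; i + 1 < parts.length; i += 2) {
    pages.set(Number(parts[i]), (parts[i + 1] ?? "").trim());
  }
  return pages;
}
```

For example, splitPages(text).get(12) would return the text of page 12. A line in the PDF body that happens to look like a separator would be misread as a page boundary.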

Frequently Asked Questions

Can it extract text from scanned PDFs (image-based)?
Only if the PDF contains an embedded OCR text layer. Pure image-only PDFs (no text layer) will produce no output — you would need OCR software first.
Why is some text out of order?
PDF text items are stored in drawing order, which may not match reading order for multi-column layouts or complex designs. PDF.js returns items in the order the PDF renderer encounters them.
What does the 'Preserve line breaks' toggle do?
When enabled, PDF.js EOL (end-of-line) markers are converted to newlines in the output, preserving the visual paragraph structure of the PDF.
Is the PDF uploaded to a server?
No. PDF.js runs in your browser. The file is read locally via FileReader and processed page by page.
Are headers, footers, and page numbers included?
Yes. All text in the PDF including headers, footers, captions, and page numbers is extracted. Page separation markers (--- Page N ---) are added by this tool.
What happens with password-protected PDFs?
PDF.js will fail to load an encrypted PDF without the password. Use the PDF Unlock tool first, then extract text from the unlocked file.
Does it preserve tables and columns?
No — output is a flat stream of text. Tables collapse into rows of space-separated values, and multi-column pages may interleave. Use a PDF table extraction tool for tabular data.

Glossary

Text Layer
The machine-readable text data inside a PDF that powers selection, copy-paste, search, and accessibility. Required for text extraction; absent in pure-image scans.
OCR
Optical Character Recognition — software that converts images of text into selectable text by analyzing pixel patterns. Required to add a text layer to scanned PDFs.
PDF.js
Mozilla's open-source JavaScript PDF library used in Firefox and on the web. Renders and parses PDFs entirely in the browser without plugins.
EOL Marker
End-of-line indicator (the hasEOL flag on a PDF.js text item) marking a text fragment that is followed by a line break. Used by this tool's preserve-line-breaks toggle.
Embedded Font
A font included inside the PDF file so it renders identically on any device. Embedding alone does not guarantee correct extraction: PDF.js still relies on the font's encoding or ToUnicode map to recover the right Unicode characters for non-ASCII text.
ToUnicode Map
A PDF table that maps glyph codes back to Unicode characters. Without it, PDF.js may extract garbage symbols for stylized fonts.
Page Tree
The hierarchical PDF structure listing each page; PDF.js iterates this tree to extract text from every page in document order.
Drawing Order
The order in which a PDF's content stream emits text fragments — not necessarily reading order. Multi-column PDFs may appear out of sequence.