UtilityKit

500+ fast, free tools. Most run in your browser only; Image & PDF tools upload files to the backend when you run them.

PDF Text Extractor

Extract all text from any PDF in your browser. Preserve line breaks, copy, or download as a .txt file.

About PDF Text Extractor

The PDF Text Extractor uses Mozilla's PDF.js library to extract the text content from every page of a PDF document directly in your browser. Pages are processed sequentially with a live progress counter. A toggle lets you preserve paragraph line breaks (using EOL markers from PDF.js) or collapse the text into a flowing paragraph for easier copy-paste. The extracted text includes a page header separator (--- Page N ---) before each page's content. You can copy the full output to your clipboard or download it as a .txt file.
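The joining step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: `TextItemLike`, `joinItems`, and `formatPages` are hypothetical names, and the only PDF.js detail assumed is that each text item carries a `str` string and a `hasEOL` flag.

```typescript
// Hypothetical stand-in for the two PDF.js text item fields used here.
interface TextItemLike {
  str: string;
  hasEOL?: boolean;
}

// Join one page's items. An EOL flag becomes a newline when line breaks are
// preserved, or a single space when the text is collapsed.
function joinItems(items: TextItemLike[], preserveLineBreaks: boolean): string {
  let out = "";
  for (const item of items) {
    out += item.str;
    if (item.hasEOL) out += preserveLineBreaks ? "\n" : " ";
  }
  // Collapsed mode also squeezes runs of whitespace into one space.
  return (preserveLineBreaks ? out : out.replace(/\s+/g, " ")).trim();
}

// Prefix each page's text with a "--- Page N ---" separator (pages are 1-based).
function formatPages(pageTexts: string[]): string {
  return pageTexts
    .map((text, i) => `--- Page ${i + 1} ---\n${text}`)
    .join("\n\n");
}
```

Calling `formatPages` on the per-page results of `joinItems` would yield output in the shape described above, with one separator line ahead of each page's text.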

Why use PDF Text Extractor

Extracts text from all pages in sequence with a live progress indicator
Preserve or collapse line breaks depending on your downstream use (script vs paragraph)
Page separators (--- Page N ---) make it easy to locate content from specific pages
Runs in-browser via PDF.js; the file is read locally rather than uploaded to a server

How to use PDF Text Extractor

Drag and drop a PDF file onto the drop zone, or click 'browse' to select one
Watch the progress indicator as each page is extracted (shows N of M pages)
Toggle 'Preserve line breaks' on or off based on your needs
Click 'Copy' to copy all extracted text to your clipboard
Click '↓ .txt' to download the text as a plain text file
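For readers curious how an "N of M pages" counter can be driven, here is a hedged sketch of a sequential page loop. `DocLike`, `PageLike`, and `extractSequentially` are hypothetical names; they mirror only the `numPages`, `getPage`, and `getTextContent` members that a PDF.js document and page expose, and the tool's real code may differ.

```typescript
// Hypothetical structural stand-ins for the PDF.js members used below.
interface PageLike {
  getTextContent(): Promise<{ items: Array<{ str: string }> }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Extract pages one at a time, reporting "done of total" after each page.
async function extractSequentially(
  doc: DocLike,
  onProgress: (done: number, total: number) => void,
): Promise<string[]> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n); // page numbers start at 1
    const { items } = await page.getTextContent();
    pages.push(items.map((item) => item.str).join(""));
    onProgress(n, doc.numPages);
  }
  return pages;
}
```

Awaiting each page before requesting the next keeps the counter monotonic and the pages in document order. Passing a real PDF.js document to this function may need a type cast, since the library's own item type is broader than the stand-in used here.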

When to use PDF Text Extractor

When you need to copy text from a PDF that doesn't allow selection in the viewer
When processing PDF content in a text editor, IDE, script, or LLM prompt
When indexing PDF content for search or database import
When extracting text from scanned PDFs that have already been run through OCR

Examples

Single-column report

Input: report.pdf — 5 pages of single-column text

Output: --- Page 1 --- ... --- Page 2 --- ... — clean paragraphs preserved with line breaks

Multi-column research paper

Input: paper.pdf — two-column academic layout

Output: Text may interleave columns; copy to a wider editor and reflow column-by-column

Scanned-and-OCR'd doc

Input: contract.pdf — scanned then run through Adobe OCR

Output: Text layer present; extraction succeeds with possible minor OCR errors (broken words, mismatched chars)

Image-only scan

Input: old-fax.pdf — pure raster scan, no OCR

Output: Empty output — no text layer to extract; run OCR first using a tool like Tesseract or Adobe

Tips

If the output is empty, the PDF likely has no text layer — it's an image-only scan that needs OCR first
Multi-column layouts may produce out-of-order text — paste into a wider editor and reflow manually
Turn off 'Preserve line breaks' if you need continuous text without hard line breaks, for example to feed into an LLM
Use the page separators (--- Page N ---) to grep or search-jump to specific pages in long extracts
Encrypted PDFs need to be unlocked first — use the PDF Unlock tool then return here
For very large PDFs, watch the progress counter — extraction can take 5-10 seconds for 200+ pages
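The page-separator tip can also be applied in a script. The sketch below splits downloaded output into per-page chunks; `splitByPage` is a hypothetical helper, and it assumes each `--- Page N ---` separator sits on its own line in the .txt file.

```typescript
// Split extractor output into a map of page number -> page text, keyed on
// the "--- Page N ---" separator lines.
function splitByPage(output: string): Map<number, string> {
  const pages = new Map<number, string>();
  const separator = /^--- Page (\d+) ---$/;
  let current: number | null = null;
  let buffer: string[] = [];
  // Store the buffered lines under the page currently being collected.
  const flush = () => {
    if (current !== null) pages.set(current, buffer.join("\n").trim());
  };
  for (const line of output.split(/\r?\n/)) {
    const match = separator.exec(line.trim());
    if (match) {
      flush();
      current = Number(match[1]);
      buffer = [];
    } else {
      buffer.push(line);
    }
  }
  flush();
  return pages;
}
```

For example, `splitByPage(text).get(12)` would return the text that followed the `--- Page 12 ---` line, or `undefined` if the extract has no such page.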

Frequently Asked Questions

Can it extract text from scanned PDFs (image-based)?

Only if the PDF contains an embedded OCR text layer. Pure image-only PDFs (no text layer) will produce no output — you would need OCR software first.
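A script can detect the "no text layer" case by checking whether any extracted item contains visible characters. This is a small sketch under the assumption that each page's items have already been collected into arrays; `hasTextLayer` is a hypothetical name.

```typescript
// Returns true if at least one text item on any page holds non-whitespace text.
function hasTextLayer(pages: Array<Array<{ str: string }>>): boolean {
  return pages.some((items) =>
    items.some((item) => item.str.trim().length > 0),
  );
}
```

When this returns false, the document is probably an image-only scan and would need OCR before extraction can produce anything.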

Why is some text out of order?

PDF text items are stored in drawing order, which may not match reading order for multi-column layouts or complex designs. PDF.js returns items in the order the PDF renderer encounters them.
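If you post-process items yourself, a rough heuristic can restore reading order for simple two-column pages. The sketch below is not something this tool does; it assumes each item carries a PDF.js-style `transform` array whose fifth and sixth entries are the x and y position, that the page has exactly two columns split at its horizontal midpoint, and that the page is not rotated. `PositionedItem` and `reorderTwoColumns` are hypothetical names.

```typescript
// Hypothetical stand-in for a positioned text item.
interface PositionedItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]
}

const xOf = (item: PositionedItem): number => item.transform[4] ?? 0;
const yOf = (item: PositionedItem): number => item.transform[5] ?? 0;

// Rough two-column reorder: left column top to bottom, then right column.
function reorderTwoColumns(items: PositionedItem[], pageWidth: number): PositionedItem[] {
  const mid = pageWidth / 2;
  // PDF y grows upward, so a larger y is higher on the page.
  const byReadingOrder = (a: PositionedItem, b: PositionedItem): number =>
    yOf(b) - yOf(a) || xOf(a) - xOf(b);
  const left = items.filter((item) => xOf(item) < mid).sort(byReadingOrder);
  const right = items.filter((item) => xOf(item) >= mid).sort(byReadingOrder);
  return [...left, ...right];
}
```

Full-width elements such as titles, footnotes, or figures that span both columns will defeat this split, so treat the result as a starting point for manual cleanup.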

What does the 'Preserve line breaks' toggle do?

When enabled, PDF.js EOL (end-of-line) markers are converted to newlines in the output, preserving the visual line structure of the PDF. When disabled, the text is collapsed into a flowing paragraph.

Is the PDF uploaded to a server?

No. PDF.js runs in your browser. The file is read locally via FileReader and processed page by page.

Are headers, footers, and page numbers included?

Yes. All text in the PDF including headers, footers, captions, and page numbers is extracted. Page separation markers (--- Page N ---) are added by this tool.

What happens with password-protected PDFs?

PDF.js will fail to load an encrypted PDF without the password. Use the PDF Unlock tool first, then extract text from the unlocked file.

Does it preserve tables and columns?

No — output is a flat stream of text. Tables collapse into rows of space-separated values, and multi-column pages may interleave. Use a PDF table extraction tool for tabular data.

Glossary

Text Layer: The machine-readable text data inside a PDF that powers selection, copy-paste, search, and accessibility. Required for text extraction; absent in pure-image scans.
OCR: Optical Character Recognition — software that converts images of text into selectable text by analyzing pixel patterns. Required to add a text layer to scanned PDFs.
PDF.js: Mozilla's open-source JavaScript PDF library used in Firefox and on the web. Renders and parses PDFs entirely in the browser without plugins.
EOL Marker: End-of-line flag that PDF.js sets on a text item when that item is followed by a line break. Used by this tool's preserve-line-breaks toggle.
Embedded Font: A font included inside the PDF file so it renders identically on any device. Embedded fonts can help PDF.js map glyphs to the correct Unicode characters for non-ASCII text.
ToUnicode Map: A PDF table that maps glyph codes back to Unicode characters. Without it, PDF.js may extract garbage symbols for stylized fonts.
Page Tree: The hierarchical PDF structure listing each page; PDF.js iterates this tree to extract text from every page in document order.
Drawing Order: The order in which a PDF's content stream emits text fragments, which is not necessarily reading order. Text from multi-column PDFs may come out of sequence.
