Text Cleaner
Remove extra spaces, symbols and more
About Text Cleaner
Text copied from PDFs, web pages, Word documents, and email clients almost always arrives with unwanted baggage: double spaces, inconsistent line breaks, smart quotes that break code, invisible Unicode characters, and stray HTML entities. Fixing these by hand is tedious, and individual problems are easy to miss. Text Cleaner applies a configurable set of cleaning operations in a single pass: it normalizes whitespace, strips extra blank lines, replaces smart quotes with straight quotes, decodes HTML entities, repairs broken line breaks from PDF copy-paste, and more. Each operation is a toggle, so you control exactly what gets cleaned. The result is predictable, clean plain text that pastes correctly into your editor, database, spreadsheet, or any downstream tool without surprises.
Why use Text Cleaner
Configurable Per-Operation Toggles
Turn each cleaning step on or off individually so you never clean more than you intend.
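The toggle idea can be sketched as a table of named operations that are applied only when enabled. This is an illustrative sketch, not the tool's actual implementation: the operation names (`collapse_spaces`, `straight_quotes`, `strip_zero_width`) and the `clean` function are hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical operation names; the real tool's toggles may differ.
OPERATIONS: Dict[str, Callable[[str], str]] = {
    "collapse_spaces": lambda s: re.sub(r" {2,}", " ", s),
    "straight_quotes": lambda s: s.translate(
        str.maketrans("\u201c\u201d\u2018\u2019", "\"\"''")
    ),
    "strip_zero_width": lambda s: s.replace("\u200b", ""),
}


def clean(text: str, toggles: Dict[str, bool]) -> str:
    """Apply each enabled operation in table order; skip disabled ones."""
    for name, operation in OPERATIONS.items():
        if toggles.get(name, False):
            text = operation(text)
    return text


sample = "\u201cHello\u201d   world"
# Only spaces are collapsed; the curly quotes are left alone.
print(clean(sample, {"collapse_spaces": True, "straight_quotes": False}))
```

Treating a missing toggle as "off" means an operation never runs unless it was explicitly enabled, which matches the goal of never cleaning more than intended.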
Smart Quote Normalization
Replaces curly/typographic quotes with straight ASCII quotes — essential before pasting into code or JSON.
PDF Copy-Paste Repair
Fixes broken line breaks inserted by PDF-to-text extraction so paragraphs flow correctly again.
HTML Entity Decoding
Converts &amp;amp;, &amp;nbsp;, &amp;lt;, and other HTML entities to their plain-text equivalents.
Excess Whitespace Removal
Collapses multiple consecutive spaces and blank lines into clean, normalized spacing.
Unicode Control Character Strip
Removes invisible characters like zero-width spaces and soft hyphens that cause subtle rendering bugs.
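One way to strip invisible characters is a single regular-expression character class. The set below is an assumption for illustration; the tool's actual list of removed code points is not documented here.

```python
import re

# Assumed set of invisible characters: zero-width space (U+200B),
# zero-width non-joiner and joiner (U+200C, U+200D), word joiner (U+2060),
# byte order mark (U+FEFF), and soft hyphen (U+00AD).
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")


def strip_invisible(text: str) -> str:
    """Remove every character in the assumed invisible set."""
    return INVISIBLE.sub("", text)


print(strip_invisible("co\u00adoperate\u200b now"))  # cooperate now
```

Note that the zero-width joiner is meaningful in some scripts and in emoji sequences, so a stricter implementation might leave U+200C and U+200D alone.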
How to use Text Cleaner
- Paste your messy or copied text into the input area.
- Enable the cleaning operations you need using the toggle switches.
- The cleaned output appears instantly in the output area below.
- Disable any toggle if a particular cleaning step is changing text you want to keep.
- Click Copy to copy the cleaned text to your clipboard.
- Clear the input and paste new text to clean the next batch.
When to use Text Cleaner
- When copying text from a PDF and the output has line breaks in the middle of every sentence.
- When pasting content from Microsoft Word into a plain-text field that shows smart quote characters.
- When preparing text for import into a database or spreadsheet that chokes on HTML entities.
- When cleaning up scraped web content that contains excess whitespace and invisible characters.
- When normalizing text before running it through another tool like Word Counter or Find & Replace.
- When extracting email body text that has inconsistent line endings and double spacing.
Examples
PDF paragraph restoration
Input: This is a paragraph that was
copied from a PDF file and has
broken line breaks everywhere.
Output: This is a paragraph that was copied from a PDF file and has broken line breaks everywhere.
Smart quote normalization
Input: “Hello,” she said. ‘It’s fine.’
Output: "Hello," she said. 'It's fine.'
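The example above can be reproduced with a small translation table. This is a minimal sketch covering only the four most common curly quote characters; `normalize_quotes` is a hypothetical name, and the real tool may handle more variants.

```python
import json

SMART_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as apostrophe)
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with their straight ASCII counterparts."""
    return text.translate(SMART_QUOTES)


print(normalize_quotes("\u201cHello,\u201d she said. \u2018It\u2019s fine.\u2019"))
# "Hello," she said. 'It's fine.'

# JSON with curly quotes fails to parse until the quotes are normalized.
raw = "{\u201cname\u201d: \u201cAda\u201d}"
print(json.loads(normalize_quotes(raw)))
```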
HTML entity decoding
Input: Tom &amp;amp; Jerry &amp;lt;cartoon&amp;gt; is &amp;copy; 1940
Output: Tom & Jerry <cartoon> is © 1940
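In Python, the standard library's `html.unescape` performs this kind of decoding for both named and numeric entities. Whether the tool uses an equivalent routine is an assumption; the snippet only illustrates the transformation.

```python
import html

encoded = "Tom &amp; Jerry &lt;cartoon&gt; is &copy; 1940"
decoded = html.unescape(encoded)
print(decoded)  # Tom & Jerry <cartoon> is © 1940

# Numeric entities are decoded as well.
print(html.unescape("&#8220;quoted&#8221;"))
```

One detail to keep in mind: `&nbsp;` decodes to a non-breaking space (U+00A0), not a regular space, so a separate whitespace step may still be needed afterward.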
Tips
- When cleaning PDF-copied text, enable the PDF line break fix first; it often resolves most of the formatting issues in one step.
- Disable smart quote normalization if your text contains foreign language dialogue where curly quotes are intentional.
- Run Text Cleaner before using Word Counter or Word Frequency Counter to ensure spacing artifacts do not inflate your word count.
- Use the HTML entity decoder when cleaning text exported from a CMS that stores HTML-encoded content in plain-text fields.
- After cleaning, paste into the Text Diff tool to verify what exactly changed versus your original paste.
Frequently Asked Questions
What are smart quotes and why do they cause problems?
Smart quotes are typographic quotation marks (“” ‘’) used in word processors for visual appeal. They are Unicode characters that differ from the ASCII straight quotes (" ') expected by JSON, code editors, and most programming languages — causing parse errors when pasted into those contexts.
What does 'fix PDF line breaks' actually do?
Text copied from a PDF often carries a hard line break at the end of every visual line. The PDF fix operation joins lines that do not end with sentence-terminating punctuation, restoring the original paragraph flow.
Will it remove intentional line breaks in my text?
The PDF line break fix only removes line breaks within what appears to be a paragraph — not blank lines between paragraphs. You can also disable that specific toggle if you need to preserve all line breaks exactly.
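The rule described in these two answers can be sketched as follows. This is an approximation under stated assumptions: "sentence-terminating punctuation" is taken to mean `.`, `!`, or `?`, and `fix_pdf_line_breaks` is a hypothetical name, not the tool's own function.

```python
# Assumed set of sentence-terminating punctuation.
SENTENCE_END = (".", "!", "?")


def fix_pdf_line_breaks(text: str) -> str:
    """Join lines that do not end a sentence; keep blank lines as paragraph breaks."""
    out = []      # completed output lines
    buffer = ""   # the line currently being assembled
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            # Blank line: flush any pending text and preserve the paragraph break.
            if buffer:
                out.append(buffer)
                buffer = ""
            out.append("")
            continue
        buffer = f"{buffer} {stripped}" if buffer else stripped
        if stripped.endswith(SENTENCE_END):
            out.append(buffer)
            buffer = ""
    if buffer:
        out.append(buffer)
    return "\n".join(out)


broken = (
    "This is a paragraph that was\n"
    "copied from a PDF file and has\n"
    "broken line breaks everywhere.\n"
    "\n"
    "Next paragraph starts\n"
    "here."
)
print(fix_pdf_line_breaks(broken))
```

A heuristic like this leaves a break wherever a visual line happens to end exactly at a sentence boundary, and it does not rejoin words hyphenated across lines, so some manual cleanup may remain.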
What HTML entities does it decode?
Common named entities like &amp;amp;, &amp;lt;, &amp;gt;, &amp;nbsp;, and &amp;quot;, and numeric entities like &amp;#8220;, are all decoded to their plain-text equivalents.
Does it remove all whitespace or just excess whitespace?
The excess whitespace toggle collapses runs of two or more spaces into a single space and removes trailing spaces from line ends. It does not remove single spaces between words.
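A minimal sketch of this behavior with regular expressions is shown below. The blank-line collapsing step follows the feature description earlier on the page, and the exact thresholds are assumptions; `collapse_whitespace` is a hypothetical name.

```python
import re


def collapse_whitespace(text: str) -> str:
    # Remove trailing spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of two or more spaces into one; single spaces are untouched.
    text = re.sub(r" {2,}", " ", text)
    # Assumed: collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


print(repr(collapse_whitespace("a  b   c  \n\n\n\nd ")))  # 'a b c\n\nd'
```

Because the second step treats all space runs alike, it also flattens leading indentation, which is worth knowing before cleaning code or indented lists.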
Can it strip HTML tags as well?
For full HTML tag removal, use the Strip HTML Tags tool which is purpose-built for that operation. Text Cleaner focuses on whitespace, encoding, and formatting normalization rather than tag stripping.
Does it handle Windows-style CRLF line endings?
Yes. CRLF (\r\n) line endings are normalized to LF (\n) as part of the whitespace normalization step.
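The normalization itself is a short string replacement. The sketch below also converts bare carriage returns, which is an assumption beyond what the answer states.

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF to LF, then (assumed) any remaining bare CR to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_endings("a\r\nb\rc\n")))  # 'a\nb\nc\n'
```

Replacing `\r\n` first matters: doing the bare `\r` replacement first would turn each CRLF into two line feeds.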
Is there any text that the cleaner might accidentally modify?
Legitimate em-dashes and en-dashes may be normalized to hyphens if punctuation normalization is enabled. If your text uses these intentionally, disable that specific toggle before cleaning.
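To see what this normalization does to a string, here is a small sketch. The `enabled` flag stands in for the toggle, and both the function name and the choice to map each dash to a single hyphen are assumptions.

```python
# Map en dash (U+2013) and em dash (U+2014) to an ASCII hyphen.
DASHES = str.maketrans({"\u2013": "-", "\u2014": "-"})


def normalize_dashes(text: str, enabled: bool = True) -> str:
    """Replace en and em dashes with hyphens when the toggle is on."""
    return text.translate(DASHES) if enabled else text


original = "pages 10\u201312 \u2014 see notes"
print(normalize_dashes(original))                  # pages 10-12 - see notes
print(normalize_dashes(original, enabled=False))   # unchanged
```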
Glossary
- Smart Quotes
- Typographic quotation marks (“” ‘’) that curve inward. They are Unicode characters distinct from straight ASCII quote characters used in code.
- HTML Entity
- A code sequence starting with & and ending with ; that represents a special character in HTML, such as &amp;amp; for & or &amp;nbsp; for a non-breaking space.
- CRLF
- Carriage Return + Line Feed (\r\n), the Windows line ending convention. Unix systems use LF only (\n). Mixed line endings can cause display and parsing issues.
- Zero-Width Space
- A Unicode character (U+200B) that takes up no visible space but can cause unexpected behavior in text processing, search, and rendering.
- Non-Breaking Space
- A space character (U+00A0, HTML &amp;nbsp;) that prevents a line break at that position. It is often copied along with web page text and can inflate word counts.
- Soft Hyphen
- A Unicode character (U+00AD) that suggests a hyphenation point but is invisible unless the word breaks across a line. Commonly left in text copied from typeset documents.