What is PDF OCR?
PDF OCR (optical character recognition) converts scanned image pages inside a PDF into selectable, searchable text. The OCR engine analyzes each pixel region, identifies characters, and writes a machine-readable text layer behind the image so you can copy, search, and index the document.
OCR PDF - Extract Text from Scanned PDFs
Convert scanned PDFs and images to searchable, editable text with AI OCR. 46+ languages including Chinese, Arabic, and Hindi. Free, 95%+ accuracy.
About OCR PDF
OCR PDF converts scanned PDFs and images into searchable, editable text using Tesseract.js, a WebAssembly port of the open-source Tesseract OCR engine. We run it with a worker pool (4 parallel workers by default) so multi-page documents process quickly without freezing the tab. The pipeline renders each page via pdfjs-dist at a configurable DPI (150 to 600), passes the pixel data to Tesseract, and stitches the recognized text back into a searchable PDF. The result looks identical to the original but lets you search with Ctrl+F, copy and paste, and read the content with a screen reader.
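The worker-pool idea can be illustrated outside the browser. The sketch below is a Python stand-in, not the tool's Tesseract.js code: `recognize_page` is a hypothetical placeholder for "render the page and run OCR", and the pool size of 4 simply mirrors the default mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_page(page_number: int) -> str:
    # Hypothetical placeholder: a real pipeline would render the page
    # and hand the pixels to the OCR engine here.
    return f"text of page {page_number}"


def ocr_document(page_count: int, workers: int = 4) -> list[str]:
    # Pages are processed concurrently by a fixed-size pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so output lines up with page order.
        return list(pool.map(recognize_page, range(1, page_count + 1)))


results = ocr_document(5)
```

Because `map` returns results in input order, the recognized pages can be stitched back together in sequence even if individual pages finish out of order.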
The OCR supports 46+ languages (English, Spanish, French, German, CJK, Arabic, Devanagari, and more) and 7 page segmentation modes: Auto (default), Single Column, Text Block, Single Line, Single Word, Sparse Text, and Sparse Text with OSD (orientation and script detection). For specialized documents you can whitelist character sets: digits-only (for scanning invoice totals), letters-only, alphanumeric, hex, or letters-plus-punctuation. Restricting the character set improves accuracy when you know the document contains only numbers or specific characters, because the engine can no longer confuse look-alikes such as O and 0. Six output formats are supported: searchable PDF, plain text, hOCR (HTML with word coordinates), JSON (structured positions plus confidence scores), DOCX-style paragraphs, and Markdown with per-page confidence annotations.
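The seven modes line up with Tesseract's numbered page segmentation modes (PSM). The mapping below is a sketch: the PSM numbers are Tesseract's own, but pairing them with these exact UI labels is an assumption about how the tool names them.

```python
# Tesseract PSM numbers keyed by the mode labels used on this page.
# The label-to-number pairing is assumed, not confirmed by the tool.
PSM_BY_MODE = {
    "Auto": 3,                   # fully automatic segmentation, no OSD
    "Single Column": 4,          # one column of text of variable sizes
    "Text Block": 6,             # one uniform block of text
    "Single Line": 7,            # a single text line
    "Single Word": 8,            # a single word
    "Sparse Text": 11,           # find as much text as possible, no order
    "Sparse Text with OSD": 12,  # sparse text plus orientation/script detection
}


def psm_flag(mode: str = "Auto") -> str:
    # Build the command-line style flag that Tesseract accepts.
    return f"--psm {PSM_BY_MODE[mode]}"


flag = psm_flag("Sparse Text")
```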
Language-specific OCR pages: Hindi (हिंदी) · Arabic (العربية) · Chinese (中文) · Japanese (日本語) · Korean (한국어) · Thai (ไทย) · Russian (Русский) · Urdu (اردو) · Bengali (বাংলা) · Tamil (தமிழ்) · Hebrew (עברית)
How to Use OCR PDF - Extract Text from Scanned PDFs
- Step 1: Drop your scanned PDF or image into the drop zone. JPG, PNG, TIFF, and PDF all work.
- Step 2: Pick the document language(s). For multi-language documents you can select up to 3 languages simultaneously.
- Step 3: Choose the page segmentation mode. Auto works for most documents; use Single Column for academic papers and Sparse Text for forms and receipts.
- Step 4: Optionally pick a character whitelist (digits-only for invoice totals, alphanumeric for serial numbers).
- Step 5: Pick an output format and click Run OCR. 4 worker threads process pages in parallel; a typical 20-page scan finishes in 15-30 seconds.
Key Features
- 46+ languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, and Devanagari
- 7 page segmentation modes — Auto, Single Column, Text Block, Single Line, Single Word, Sparse Text, Sparse+OSD
- Character whitelist presets — all chars, digits only, letters only, alphanumeric, hexadecimal, letters+punctuation
- DPI range 150-600 — default 300 is the sweet spot for speed/quality; raise to 600 for tiny text
- 4-worker pool for parallel page processing — 20-page documents OCR in seconds, not minutes
- 6 output formats — searchable PDF, plain text, hOCR, JSON, DOCX-style, Markdown
- Per-page confidence scores in structured outputs — identify pages that need manual review
- Handles rotated text via OSD (Orientation and Script Detection) — sideways scans recognized correctly
- 200 MB file size limit — large scanned archives supported
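Per-page confidence scores lend themselves to a simple review filter. The sketch below assumes a hypothetical JSON shape with `page` and `confidence` fields (the tool's actual schema is not documented here) and an arbitrary threshold of 80 on Tesseract's 0 to 100 confidence scale.

```python
def pages_needing_review(pages: list[dict], threshold: float = 80.0) -> list[int]:
    # Return page numbers whose confidence falls below the threshold.
    return [p["page"] for p in pages if p["confidence"] < threshold]


# Hypothetical sample data in the assumed shape.
sample = [
    {"page": 1, "confidence": 96.2},
    {"page": 2, "confidence": 71.5},
    {"page": 3, "confidence": 88.0},
]
flagged = pages_needing_review(sample)
```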
How We Compare
Compared to paid alternatives such as Adobe Acrobat Pro (starting at $19.99/month), Smallpdf ($12/month for unlimited), or iLovePDF ($9/month Premium), PDF AI Tools delivers comparable quality at $0 for the core feature set. We skip the subscription friction by processing most operations directly in your browser with WebAssembly, so there are no server infrastructure costs to pass on to users. Our AI features (summarization, chat, OCR) use a pay-as-you-go backend that keeps your total cost well under $5/month even for power users.
Frequently Asked Questions
What languages does the OCR support?
46+ languages, including English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese (Simplified and Traditional), Japanese, Korean, Arabic, Hindi, Russian, and many other European and Asian languages. You can select up to 3 languages at once for multi-language documents.
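Tesseract identifies languages by short codes and combines several of them with a `+` separator (for example `eng+hin`). The sketch below shows how a selection of up to 3 languages could be turned into such a string; the helper name and the display-name keys are illustrative, while the codes themselves are Tesseract's standard ones.

```python
# A few of Tesseract's standard language codes, keyed by display name.
LANGUAGE_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
    "Hindi": "hin",
    "Russian": "rus",
}

MAX_LANGUAGES = 3  # limit stated on this page


def language_string(names: list[str]) -> str:
    # Join the selected languages into Tesseract's "+"-separated form.
    if not 1 <= len(names) <= MAX_LANGUAGES:
        raise ValueError(f"select between 1 and {MAX_LANGUAGES} languages")
    return "+".join(LANGUAGE_CODES[name] for name in names)


combined = language_string(["English", "Hindi"])
```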
Which page segmentation mode should I pick?
"Auto" is a safe default for most documents. "Single Column" improves accuracy on academic papers and books. "Sparse Text" works best for forms, receipts, and documents with isolated text blocks on a mostly-empty page. "Sparse+OSD" adds orientation detection for sideways or upside-down scans.
What's the character whitelist for?
If your document contains only specific characters, telling the OCR to ignore everything else can substantially improve accuracy. For example, use digits-only for invoice line-item amounts, alphanumeric for serial numbers, and digits plus punctuation for credit card statements. A whitelist cuts recognition errors on ambiguous characters (O vs 0, 1 vs l).
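In Tesseract, a whitelist is passed through the `tessedit_char_whitelist` parameter. The sketch below builds that parameter for a few presets; the preset names are illustrative and may not match the tool's internal identifiers.

```python
import string

# Illustrative preset names mapped to the characters they allow.
WHITELIST_PRESETS = {
    "digits": string.digits,
    "letters": string.ascii_letters,
    "alphanumeric": string.ascii_letters + string.digits,
    "hex": string.hexdigits,
}


def whitelist_params(preset: str) -> dict[str, str]:
    # tessedit_char_whitelist is the Tesseract parameter that restricts
    # which characters the engine is allowed to output.
    return {"tessedit_char_whitelist": WHITELIST_PRESETS[preset]}


params = whitelist_params("digits")
```

With the digits preset active, the letter O is not an allowed output, so a round glyph on an invoice total can only be read as a digit.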
What DPI should I run OCR at?
300 DPI is the sweet spot for speed and accuracy on standard typed documents. 150 DPI works for clean high-contrast scans and is faster. 600 DPI is for small text (footnotes, legal fine print) or degraded scans where you need every character captured.
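The speed difference follows from simple arithmetic: PDF page sizes are measured in points (72 per inch), so the rendered pixel count grows with the square of the DPI. The sketch below works this out for a US Letter page (612 x 792 points); the function names are illustrative, and using `dpi / 72` as the render scale is an assumption about how a renderer would be configured.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def render_scale(dpi: int) -> float:
    # Scale factor relative to one pixel per PDF point.
    if not 150 <= dpi <= 600:
        raise ValueError("DPI must be between 150 and 600")
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    # Rendered bitmap dimensions for a page of the given size in points.
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


letter_300 = pixel_size(612, 792, 300)
letter_600 = pixel_size(612, 792, 600)
```

Doubling the DPI from 300 to 600 quadruples the number of pixels the OCR engine has to process, which is why 600 DPI is best reserved for small or degraded text.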
What's the difference between searchable PDF and hOCR output?
Searchable PDF is a copy of your original scan with an invisible text layer added. It looks identical, but Ctrl+F search and screen readers now work. hOCR is HTML output with each recognized word's bounding box coordinates, which is useful for building custom search interfaces, text analytics, or web archives. JSON output carries similar position data but is easier to parse programmatically.
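To give a sense of what hOCR contains, the sketch below pulls words, bounding boxes, and confidence values out of a small fragment. The sample markup follows the `ocrx_word` / `bbox` / `x_wconf` conventions that Tesseract emits, but the fragment itself is made up, and a regular expression is used only for brevity; a proper HTML parser would be more robust on real files.

```python
import re

# Matches one hOCR word span: bounding box, word confidence, and text.
WORD_RE = re.compile(
    r"<span class=['\"]ocrx_word['\"][^>]*?"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>"
    r"(.*?)</span>"
)


def parse_words(hocr: str) -> list[dict]:
    words = []
    for x0, y0, x1, y1, conf, text in WORD_RE.findall(hocr):
        words.append({
            "text": text,
            "bbox": (int(x0), int(y0), int(x1), int(y1)),  # left, top, right, bottom
            "confidence": int(conf),
        })
    return words


# Made-up fragment in the style of Tesseract's hOCR output.
sample = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 95'>Invoice</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 104 92 150 116; x_wconf 62'>N0.</span>"
)
words = parse_words(sample)
```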
Can it handle rotated text or pages scanned sideways?
Yes. Pick the "Sparse Text with OSD" segmentation mode (OSD = Orientation and Script Detection). Tesseract first detects the page orientation and rotates the input accordingly before running recognition. This works for scans rotated by 90°, 180°, or 270°.
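The correction step amounts to rotating the page by whatever angle completes a full turn. The helper below is a minimal sketch with an illustrative name; it assumes the detected orientation and the correction are measured in the same rotational direction, which may differ between tools.

```python
def correction_angle(detected_orientation: int) -> int:
    # Angle (in degrees) that brings a page back to upright, assuming the
    # detected orientation and the correction share the same direction.
    if detected_orientation not in (0, 90, 180, 270):
        raise ValueError("orientation must be 0, 90, 180, or 270 degrees")
    return (360 - detected_orientation) % 360


sideways = correction_angle(270)
```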
Who Uses This Tool
- Archivists digitizing scanned historical documents into searchable collections
- Law firms making discovery document scans full-text searchable
- Accountants extracting line-item amounts from scanned invoices
- Researchers OCRing scanned book chapters for quotation and citation
- Medical offices digitizing patient record scans for EHR import
- Government offices converting legacy paper archives to searchable PDF