What is PDF OCR?

PDF OCR (optical character recognition) converts scanned image pages inside a PDF into selectable, searchable text. The OCR engine analyzes each pixel region, identifies characters, and writes a machine-readable text layer behind the image so you can copy, search, and index the document.

OCR PDF - Extract Text from Scanned PDFs

Convert scanned PDFs and images to searchable, editable text with AI OCR. 46+ languages including Chinese, Arabic, Hindi. Free, 95%+ accuracy.

About OCR PDF

OCR PDF converts scanned PDFs and images into searchable, editable text using Tesseract.js, a JavaScript library that wraps a WebAssembly port of Google's Tesseract OCR engine. A worker pool (4 parallel workers by default) lets multi-page documents process quickly without freezing the tab. The pipeline renders each page via pdfjs-dist at a configurable DPI (150-600), passes the pixel data to Tesseract, and stitches the recognized text back into a searchable PDF. The result looks identical to the original but lets you search with Ctrl+F, copy and paste, and read the content with a screen reader.
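The worker-pool idea can be sketched as a bounded-concurrency map. This is a minimal illustration, not the tool's actual code: `mapWithPool` and its parameters are hypothetical names, and the `task` callback stands in for "send one rendered page to an OCR worker".

```typescript
// Hypothetical helper: run `task` over `items` with at most `limit` tasks in flight.
// Results come back in input order, so page order is preserved.
async function mapWithPool<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner keeps claiming items until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T, index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With `limit` set to 4, a 20-page document never has more than four pages being recognized at once, and a slow page only delays the runner that claimed it.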

The OCR supports 46+ languages (English, Spanish, French, German, CJK, Arabic, Devanagari, and more) and 7 page segmentation modes: Auto (default), Single Column, Text Block, Single Line, Single Word, Sparse Text, and Sparse Text with OSD (orientation and script detection). For specialized documents you can whitelist character sets: digits-only (invoice total scanning), alphanumeric, hex, or letters-plus-punctuation. Restricting the character set improves accuracy when you know the document only contains numbers or specific characters, because look-alike characters outside the set are never emitted. Six output formats are supported: searchable PDF, plain text, hOCR (HTML with word coordinates), JSON (structured positions plus confidence scores), DOCX-style paragraphs, and Markdown with per-page confidence annotations.
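As an illustration of the Markdown output with per-page confidence annotations, here is a minimal sketch. The `PageResult` shape, the heading layout, and the HTML-comment annotation are assumptions made for the example; the tool's real output format may differ.

```typescript
// Hypothetical per-page OCR result; field names are illustrative.
interface PageResult {
  pageNumber: number; // 1-based page index
  text: string;       // recognized text for the page
  confidence: number; // mean confidence, 0 to 100
}

// Render each page as a Markdown section with its confidence in an HTML comment.
function toMarkdown(pages: readonly PageResult[]): string {
  return pages
    .map(
      (p) =>
        `## Page ${p.pageNumber}\n\n` +
        `<!-- OCR confidence: ${p.confidence.toFixed(1)}% -->\n\n` +
        p.text.trim(),
    )
    .join("\n\n");
}
```

Keeping the annotation in a comment means the confidence travels with the file but stays out of the rendered text.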

Language-specific OCR pages: Hindi (हिंदी) · Arabic (العربية) · Chinese (中文) · Japanese (日本語) · Korean (한국어) · Thai (ไทย) · Russian (Русский) · Urdu (اردو) · Bengali (বাংলা) · Tamil (தமிழ்) · Hebrew (עברית)

How to Use OCR PDF - Extract Text from Scanned PDFs

  1. Drop your scanned PDF or image into the drop zone. JPG, PNG, TIFF, and PDF all work.
  2. Pick the document language(s). For multi-language documents you can select up to 3 languages simultaneously.
  3. Choose the page segmentation mode. Auto works for most documents; use Single Column for academic papers and Sparse Text for forms and receipts.
  4. Optionally pick a character whitelist (digits-only for invoice totals, alphanumeric for serial numbers).
  5. Pick an output format and click Run OCR. 4 worker threads process pages in parallel, so a typical 20-page scan finishes in 15-30 seconds.
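The language selection in step 2 can be sketched as follows. Tesseract accepts several languages joined with `+` (for example `eng+hin`); the helper name `buildLanguageString` and the error messages are hypothetical, and the limit of 3 mirrors the one stated above.

```typescript
const MAX_LANGUAGES = 3; // limit stated in the steps above

// Build a Tesseract-style language string from selected language codes.
// Duplicates and blank entries are dropped; order of first appearance is kept.
function buildLanguageString(codes: readonly string[]): string {
  const unique: string[] = [];
  for (const raw of codes) {
    const code = raw.trim();
    if (code.length > 0 && unique.indexOf(code) === -1) {
      unique.push(code);
    }
  }
  if (unique.length === 0) {
    throw new Error("Select at least one language");
  }
  if (unique.length > MAX_LANGUAGES) {
    throw new Error(`Select at most ${MAX_LANGUAGES} languages`);
  }
  return unique.join("+");
}
```

For example, `buildLanguageString(["eng", "hin"])` yields `eng+hin`, which is the form the Tesseract engine expects for a mixed English and Hindi document.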

Key Features

How We Compare

Paid alternatives such as Adobe Acrobat Pro (starting at $19.99/month), Smallpdf ($12/month for unlimited use), and iLovePDF ($9/month Premium) charge a subscription; PDF AI Tools delivers comparable quality at $0 for the core feature set. We avoid subscription fees by processing most operations directly in your browser with WebAssembly, so there are no server infrastructure costs to pass on to users. Our AI features (summarization, chat, OCR) use a pay-as-you-go backend that keeps your total cost well under $5/month even for power users.

Frequently Asked Questions

What languages does the OCR support?

46+, including English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese (Simplified and Traditional), Japanese, Korean, Arabic, Hindi, Russian, and many other European and Asian languages. You can select up to 3 languages at once for multi-language documents.

Which page segmentation mode should I pick?

"Auto" is a safe default for most documents. "Single Column" improves accuracy on academic papers and books. "Sparse Text" works best for forms, receipts, and documents with isolated text blocks on a mostly-empty page. "Sparse Text with OSD" adds orientation detection for sideways or upside-down scans.
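For readers who know Tesseract's numeric page segmentation modes (PSM), the seven mode names on this page plausibly map onto them as sketched below. The PSM numbers and the `tessedit_pageseg_mode` parameter are Tesseract's own; which number each label here corresponds to is an assumption based on the names, and `pageSegParams` is a hypothetical helper.

```typescript
// Assumed mapping from this page's mode labels to Tesseract PSM values.
const PAGE_SEG_MODES = {
  "Auto": 3,                  // fully automatic segmentation, no OSD
  "Single Column": 4,         // one column of text of variable sizes
  "Text Block": 6,            // one uniform block of text
  "Single Line": 7,           // treat the image as one text line
  "Single Word": 8,           // treat the image as one word
  "Sparse Text": 11,          // find as much text as possible, in no particular order
  "Sparse Text with OSD": 12, // sparse text plus orientation and script detection
} as const;

type SegmentationMode = keyof typeof PAGE_SEG_MODES;

// Build the Tesseract parameter object for a chosen mode.
function pageSegParams(mode: SegmentationMode): { tessedit_pageseg_mode: string } {
  return { tessedit_pageseg_mode: String(PAGE_SEG_MODES[mode]) };
}
```

An object of this shape is what you would typically hand to the engine's parameter-setting call before recognition.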

What's the character whitelist for?

If your document only contains specific characters, restricting the OCR to that set improves accuracy, because look-alike characters outside the set are never emitted. Invoice line-item amounts → digits-only whitelist. Serial numbers → alphanumeric. Credit card statements → digits and punctuation. The main gain is fewer errors on ambiguous characters (O vs 0, 1 vs l).
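A whitelist ultimately becomes a string of allowed characters passed to Tesseract's `tessedit_char_whitelist` parameter. The sketch below shows a few presets and a checker that lists characters falling outside a preset; the preset names, the exact character strings, and both helper functions are illustrative assumptions.

```typescript
// Illustrative presets; the tool's real character sets may differ.
const WHITELIST_PRESETS = {
  digits: "0123456789",
  hex: "0123456789ABCDEFabcdef",
  alphanumeric: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
  digitsAndPunctuation: "0123456789.,-/$",
} as const;

type WhitelistPreset = keyof typeof WHITELIST_PRESETS;

// Build the Tesseract parameter object for a preset.
function whitelistParams(preset: WhitelistPreset): { tessedit_char_whitelist: string } {
  return { tessedit_char_whitelist: WHITELIST_PRESETS[preset] };
}

// List the distinct non-whitespace characters in `text` that the preset would not allow.
function charactersOutsidePreset(text: string, preset: WhitelistPreset): string[] {
  const allowed: string = WHITELIST_PRESETS[preset];
  const outside: string[] = [];
  for (const ch of text.split("")) {
    if (/\s/.test(ch)) continue; // whitespace is layout, not content
    if (allowed.indexOf(ch) === -1 && outside.indexOf(ch) === -1) {
      outside.push(ch);
    }
  }
  return outside;
}
```

Running the checker on an unrestricted OCR result such as `1O0 5l` with the `digits` preset flags `O` and `l`, which are exactly the look-alike misreads a digits-only whitelist rules out.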

What DPI should I run OCR at?

300 DPI is the sweet spot for speed and accuracy on standard typed documents. 150 DPI works for clean high-contrast scans and is faster. 600 DPI is for small text (footnotes, legal fine print) or degraded scans where you need every character captured.
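To see what the DPI setting costs, recall that PDF page sizes are measured in points (1/72 inch), so a render scale of `dpi / 72` converts points to pixels; that scale is the kind of value you would pass to pdfjs-dist's `page.getViewport({ scale })`. The sketch below is an illustration under that assumption; `dpiToScale` and `renderSize` are hypothetical helper names, and the 150-600 range check mirrors the range this page states.

```typescript
const PDF_POINTS_PER_INCH = 72; // PDF default user-space unit is 1/72 inch

// Convert a target DPI into a render scale relative to PDF points.
function dpiToScale(dpi: number): number {
  if (dpi < 150 || dpi > 600) {
    throw new RangeError("dpi must be between 150 and 600");
  }
  return dpi / PDF_POINTS_PER_INCH;
}

// Pixel dimensions of a page (given in points) rendered at the target DPI.
function renderSize(
  widthPt: number,
  heightPt: number,
  dpi: number,
): { width: number; height: number } {
  const scale = dpiToScale(dpi);
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}
```

A US Letter page (612 × 792 points) comes out at 1275 × 1650 pixels at 150 DPI, 2550 × 3300 at 300 DPI, and 5100 × 6600 at 600 DPI. Doubling the DPI quadruples the pixel count, which is why 600 DPI is noticeably slower and heavier on memory.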

What's the difference between searchable PDF and hOCR output?

Searchable PDF is a copy of your original scan with an invisible text layer added. It looks identical, but Ctrl+F search and screen readers now work. hOCR is HTML output that carries each recognized word's bounding box coordinates, which is useful for building custom search interfaces, text analytics, or web archives. JSON output is similar but easier to parse programmatically.
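In hOCR, each word is a `span` with class `ocrx_word` whose `title` attribute holds `bbox x0 y0 x1 y1` and, in Tesseract's output, an `x_wconf` confidence. The sketch below pulls those out with regular expressions. It is a simplified illustration, not a full HTML parser: it assumes `class` appears before `title` (as Tesseract writes it), does not decode HTML entities, and `parseHocrWords` and `HocrWord` are hypothetical names.

```typescript
interface HocrWord {
  text: string;
  bbox: [number, number, number, number]; // x0, y0, x1, y1 in image pixels
  confidence: number | null;              // x_wconf value, if present
}

// Extract word boxes from an hOCR string.
function parseHocrWords(hocr: string): HocrWord[] {
  const spanRe =
    /<span\b[^>]*\bclass=(["'])ocrx_word\1[^>]*\btitle=(["'])(.*?)\2[^>]*>([\s\S]*?)<\/span>/g;
  const words: HocrWord[] = [];
  let m: RegExpExecArray | null;
  while ((m = spanRe.exec(hocr)) !== null) {
    const title = m[3] as string;
    const inner = m[4] as string;
    const bboxMatch = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(title);
    if (bboxMatch === null) continue; // skip spans without coordinates
    const confMatch = /x_wconf (\d+(?:\.\d+)?)/.exec(title);
    words.push({
      text: inner.replace(/<[^>]+>/g, "").trim(), // drop nested <strong>/<em> tags
      bbox: [
        Number(bboxMatch[1]),
        Number(bboxMatch[2]),
        Number(bboxMatch[3]),
        Number(bboxMatch[4]),
      ],
      confidence: confMatch === null ? null : Number(confMatch[1]),
    });
  }
  return words;
}
```

The resulting array is close to what the JSON output format already gives you, which is why JSON is the easier choice when you control the consumer.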

Can it handle rotated text or pages scanned sideways?

Yes. Pick the "Sparse Text with OSD" segmentation mode (OSD = Orientation and Script Detection). Tesseract first detects the page orientation and rotates the input accordingly before running recognition. This works for scans rotated by 90°, 180°, or 270°.
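The rotation step is simple arithmetic once the orientation is known. The sketch below assumes the detector reports how far the page content is rotated clockwise, in multiples of 90°; the sign convention and the helper names `correctionRotation` and `uprightSize` are assumptions for illustration.

```typescript
type QuarterTurn = 0 | 90 | 180 | 270;

// Clockwise rotation that brings a page detected at `detected` degrees back upright.
function correctionRotation(detected: QuarterTurn): QuarterTurn {
  return ((360 - detected) % 360) as QuarterTurn;
}

// Image dimensions after the correction: 90 and 270 degree turns swap width and height.
function uprightSize(
  width: number,
  height: number,
  detected: QuarterTurn,
): { width: number; height: number } {
  const swaps = detected === 90 || detected === 270;
  return swaps ? { width: height, height: width } : { width, height };
}
```

A landscape scan of 3300 × 2550 pixels detected at 90° needs a 270° correction and ends up as a 2550 × 3300 portrait page.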

Who Uses This Tool