How to OCR a Scanned PDF — Make It Searchable Free

A scanned PDF is just a photo of a document — it looks like text but contains no actual characters. You cannot search it, copy from it, or convert it to Word. OCR (Optical Character Recognition) fixes this by reading the image and producing a real text layer.

Text-Based vs Scanned PDFs

Open your PDF and try to highlight a word. If you can select individual words, the file already has a text layer and you do not need OCR. If nothing highlights, or if clicking a single word selects the whole page as one block, the PDF is image-based and needs OCR.
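
The highlight test can also be automated. The sketch below is one possible check, not necessarily how PDF Family does it. It assumes each page's text has already been extracted into the shape that pdf.js returns from `page.getTextContent()` (an `items` array whose text runs carry a `str` field), and it flags pages with almost no extractable characters. The names `pagesNeedingOcr` and `minChars`, and the default threshold of 10 characters, are hypothetical.

```typescript
// Minimal structural type for a page's extracted text. It mirrors the result
// of pdf.js `page.getTextContent()`; marked-content items carry no `str`.
interface TextContentLike {
  items: Array<{ str?: string }>;
}

// Count the characters in a page's text layer, ignoring surrounding whitespace.
function countTextChars(content: TextContentLike): number {
  return content.items.reduce(
    (total, item) => total + (item.str ?? "").trim().length,
    0,
  );
}

// Return the 1-based numbers of pages whose text layer is (nearly) empty.
// `minChars` is an assumed threshold, not a value from any library.
function pagesNeedingOcr(pages: TextContentLike[], minChars = 10): number[] {
  const flagged: number[] = [];
  pages.forEach((page, index) => {
    if (countTextChars(page) < minChars) flagged.push(index + 1);
  });
  return flagged;
}

// Example: page 1 has real text, page 2 is an image with no text layer.
const samplePages: TextContentLike[] = [
  { items: [{ str: "Invoice 2024" }, { str: "Total due" }] },
  { items: [] },
];
console.log(pagesNeedingOcr(samplePages)); // [ 2 ]
```

A genuinely blank page would be flagged as well, so a check like this is best treated as a hint rather than a verdict.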

Common sources of scanned PDFs: physical documents run through a flatbed scanner, photos of pages taken with a phone, and PDFs exported from old scanning software that did not embed text.

How OCR Works in the Browser

PDF Family uses Tesseract.js, a JavaScript port of the open-source Tesseract OCR engine (originally developed at HP and later sponsored by Google), compiled to WebAssembly. It runs entirely in your browser, so no upload is required.

When you submit a PDF, the tool renders each page to a canvas at 2× scale (for higher detail) and passes the canvas image to Tesseract. Pages of a multi-page PDF are processed one at a time, in order. The recognised text is then laid out into a new searchable PDF using pdf-lib.
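
To make the 2× figure concrete: PDF page sizes are measured in points (72 per inch), so rendering at scale 2 gives a canvas with 144 pixels per inch of page. The sketch below computes that canvas size and runs pages through a recogniser one at a time. `PageSize`, `CanvasSize`, `renderSize`, `ocrInSequence`, and the `recognize` callback are hypothetical names used for illustration; in a real tool the callback would wrap the PDF renderer and the Tesseract.js worker.

```typescript
const POINTS_PER_INCH = 72; // default PDF page units per inch

interface PageSize {
  widthPt: number;
  heightPt: number;
}

interface CanvasSize {
  widthPx: number;
  heightPx: number;
  pixelsPerInch: number;
}

// Canvas dimensions for a page rendered at the given scale factor.
function renderSize(page: PageSize, scale: number): CanvasSize {
  return {
    widthPx: Math.round(page.widthPt * scale),
    heightPx: Math.round(page.heightPt * scale),
    pixelsPerInch: POINTS_PER_INCH * scale,
  };
}

// Hypothetical callback: render page `pageNumber` at `size`, return its text.
type Recognize = (pageNumber: number, size: CanvasSize) => Promise<string>;

// Process pages strictly one after another.
async function ocrInSequence(
  pages: PageSize[],
  scale: number,
  recognize: Recognize,
): Promise<string[]> {
  const texts: string[] = [];
  let pageNumber = 1;
  for (const page of pages) {
    // Awaiting inside the loop keeps only one page in flight at a time.
    texts.push(await recognize(pageNumber, renderSize(page, scale)));
    pageNumber += 1;
  }
  return texts;
}

// US Letter is 612 x 792 points.
console.log(renderSize({ widthPt: 612, heightPt: 792 }, 2));
// { widthPx: 1224, heightPx: 1584, pixelsPerInch: 144 }
```

Handling pages sequentially rather than in parallel plausibly keeps memory use modest, since only one rendered canvas needs to exist at a time.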

Tips for Better OCR Accuracy

Scan at 300 DPI minimum

Below about 150 DPI, characters become blurry and OCR misreads them; 300 DPI is the usual recommendation for reliable results. Many scanners default to 300 DPI, but check your scanner settings before scanning.
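
If you are unsure what resolution a scan has, it can be estimated from the image's pixel width and the physical width of the paper. The following sketch does that arithmetic and applies the 150 and 300 DPI thresholds mentioned above. The function names and the "marginal" label for the band between the two thresholds are assumptions made for illustration.

```typescript
type ScanQuality = "too low" | "marginal" | "recommended";

// Estimate scan resolution from pixel width and physical page width (inches).
function estimateDpi(widthPx: number, pageWidthInches: number): number {
  if (pageWidthInches <= 0) {
    throw new RangeError("page width must be positive");
  }
  return Math.round(widthPx / pageWidthInches);
}

// Classify a resolution against the 150 and 300 DPI thresholds.
function classifyDpi(dpi: number): ScanQuality {
  if (dpi < 150) return "too low";
  if (dpi < 300) return "marginal"; // assumed label for the in-between band
  return "recommended";
}

// A US Letter page is 8.5 inches wide.
const letterDpi = estimateDpi(2550, 8.5);
console.log(letterDpi, classifyDpi(letterDpi)); // 300 recommended
console.log(classifyDpi(estimateDpi(1020, 8.5))); // too low
```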

Select the correct language

Tesseract uses language-specific character models. OCRing a French document with the English model will produce errors on accented characters. PDF Family supports 11 languages including English, Spanish, French, German, Chinese, Japanese, and Arabic.
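
Tesseract identifies its language models by short codes rather than names. The lookup below covers only the seven languages named above, using Tesseract's standard model codes. Treat it as a sketch: the `LANGUAGE_CODES` table and `tesseractCode` helper are hypothetical, and mapping "Chinese" to Simplified Chinese (`chi_sim`) rather than Traditional (`chi_tra`) is an assumption, since it is not stated which one PDF Family uses.

```typescript
// Language name -> Tesseract model code, for the languages named in the text.
const LANGUAGE_CODES = new Map<string, string>([
  ["english", "eng"],
  ["spanish", "spa"],
  ["french", "fra"],
  ["german", "deu"],
  ["chinese", "chi_sim"], // assumption: Simplified rather than chi_tra
  ["japanese", "jpn"],
  ["arabic", "ara"],
]);

// Look up the model code for a language name, ignoring case and padding.
function tesseractCode(language: string): string {
  const code = LANGUAGE_CODES.get(language.trim().toLowerCase());
  if (code === undefined) {
    throw new Error(`No model listed for "${language}"`);
  }
  return code;
}

console.log(tesseractCode("French")); // fra
```

In recent Tesseract.js releases this code is the value passed to `createWorker`; check the documentation for the version you use, as the worker setup has changed between major versions.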

Straighten skewed documents

Text tilted more than 5–10 degrees confuses OCR significantly. Many scanner apps include automatic deskew — use it. For phone photos, ensure the page fills the frame and is flat.
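
A quick way to judge skew is to pick two points along one line of text and measure the angle between them. This sketch shows the arithmetic; `skewDegrees`, `needsDeskew`, and the default limit of 5 degrees (the low end of the range above) are illustrative assumptions, not how Tesseract itself detects skew.

```typescript
interface Point {
  x: number;
  y: number;
}

// Angle of the line from `start` to `end`, in degrees from horizontal.
// Image y-coordinates grow downward, so a positive angle means the
// baseline slopes down towards the right.
function skewDegrees(start: Point, end: Point): number {
  return (Math.atan2(end.y - start.y, end.x - start.x) * 180) / Math.PI;
}

// True when the tilt exceeds the limit in either direction.
function needsDeskew(angleDegrees: number, limitDegrees = 5): boolean {
  return Math.abs(angleDegrees) > limitDegrees;
}

// A baseline that drops 140 px over 1000 px is tilted about 8 degrees.
const baselineAngle = skewDegrees({ x: 0, y: 0 }, { x: 1000, y: 140 });
console.log(baselineAngle.toFixed(1), needsDeskew(baselineAngle)); // 8.0 true
```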

High contrast improves accuracy

Black text on white background gives the best results. Avoid scanning under uneven lighting or with shadows crossing the text. Use the scanner's "document" or "text" mode rather than "photo" mode.

What Happens After OCR

The output is a new PDF containing the recognised text. You can search it with Ctrl+F, copy text from it, and use it as input for PDF-to-Word conversion. The visual appearance matches the original scan.

If you want an editable Word document from the scanned PDF: run OCR first, then use the PDF-to-Word converter on the resulting searchable PDF.

Make your scanned PDF searchable now

Tesseract OCR runs in your browser — files never leave your device. 11 languages supported.

Frequently Asked Questions

What is OCR and how does it work?

OCR (Optical Character Recognition) analyses the pixels of an image to locate lines of text, identify the characters in them, and reconstruct the original text. Modern OCR engines like Tesseract use neural-network models (an LSTM-based recogniser since Tesseract 4) trained on large sets of text images, and they achieve high accuracy on clean scans.

Does OCR work on handwritten text?

Standard OCR is designed for printed text. Handwritten text recognition (HTR) is a separate, harder problem. Tesseract's models are built for printed text, so results on handwritten documents are unreliable.

My OCR result has garbled characters — why?

This usually means the scan quality is too low (under 150 DPI), the document is skewed or has shadows, or the selected language does not match the document. Try rescanning at 300 DPI and selecting the correct language.