In the last lesson, we classified the dataset: 44% digital text, 40% scanned with OCR, and 16% image-only. PyMuPDF handles the first two categories, but the image-only files return nothing.
In this lesson, we’ll run fresh OCR on page images to recover text from those files, and compare what happens when we re-OCR files that already have a text layer.
Open 1.2_extracting_with_ocr.ipynb in your notebook environment to follow along.
What you’ll learn
By the end of this lesson, you’ll be able to:
Run OCR on a PDF using PyMuPDF’s built-in method and the Tesseract CLI
Explain the difference between the two approaches and when to use each
Understand what re-OCR can and can’t fix
Build a three-tier extraction strategy for production pipelines
Tesseract
Tesseract is one of the most widely used open-source OCR engines. Originally developed by HP in the 1980s, it’s now maintained by Google and remains the standard tool for extracting text from page images.
It works at the character level — recognising individual letter shapes and assembling them into words. Both of the OCR approaches we’ll use rely on Tesseract under the hood. You’ll need it installed on your machine:
macOS: brew install tesseract
Linux: apt install tesseract-ocr
Codespace: Pre-installed in the workshop environment
In your notebook, run the first two cells to install PyMuPDF and configure the PDF paths.
Recap
In your notebook, run the next two cells to confirm what we saw in the previous lesson — one file with a noisy OCR layer, and one with no text at all.
Option 1: PyMuPDF built-in OCR
300 DPI — the industry standard minimum for OCR on scanned documents
full=True forces fresh OCR, ignoring any existing text layer
The output fragments many words onto separate lines. This is because get_textpage_ocr() uses Tesseract’s default page segmentation mode (PSM 3), which analyses each text region independently.
Option 2: Render + Tesseract CLI
For more control, render the page to an image and call Tesseract directly. In your notebook, run the next cell.
```python
# Tesseract CLI
import subprocess
import tempfile

page = doc[0]
pix = page.get_pixmap(dpi=300)
with tempfile.NamedTemporaryFile(suffix=".png") as f:
    pix.save(f.name)
    result = subprocess.run(
        ["tesseract", f.name, "stdout",
         "--psm", "6",  # (1)
         "-l", "eng"],
        capture_output=True,
        text=True,
    )
cli_text = result.stdout
```
--psm 6 assumes a single uniform block of text — keeps lines intact instead of fragmenting them
Both methods use Tesseract on the same image. The CLI gives you explicit control over page segmentation, which is useful when experimenting with difficult scans. For bulk extraction on clean images, the built-in method is simpler and faster — no subprocess overhead, no temp files.
Side-by-side comparison
In your notebook, run the next cell to see all three methods side by side on the same file.
The existing text layer (get_text()) returns garbled text — Bate:, ENRON CORE., EROTECTIVE. But re-OCR on the same page produces dramatically better results: Date:, ENRON CORP., PROTECTIVE.
The underlying page image is clean. The garbled text layer came from an older or lower-quality OCR pass. Modern Tesseract, running on the clean image, corrects the errors.
This is a common situation in real document dumps: the embedded text layer is noisy, but the page image is fine. Re-OCR recovers what the original OCR missed.
For image-only files (the 795 PDFs with no text layer at all), OCR is the only way to get any text.
What about a really bad scan?
In your notebook, run the next cell to see what happens with a severely degraded file.
ENRON CORP. - PRODUCED PURSUANT → almost unrecognisable
Email address garbled beyond use
The text layer is nearly unreadable. But re-OCR on the clean underlying image recovers the content: CONFIDENTIAL, Enron Corp., Ward, Kim <kim.ward@enron.com>.
Even severely garbled text layers can be fixed by re-OCR — as long as the page image is clean. The quality ceiling is the image, not the existing text layer.
OCR on image-only PDFs
In your notebook, run the next cell to OCR the image-only file that returned nothing in the previous lesson.
```python
# OCR an image-only PDF
import subprocess
import tempfile

page = doc_empty[0]
pix = page.get_pixmap(dpi=300)
with tempfile.NamedTemporaryFile(suffix=".png") as f:
    pix.save(f.name)
    result = subprocess.run(
        ["tesseract", f.name, "stdout",
         "--psm", "6",
         "-l", "eng"],
        capture_output=True,
        text=True,
    )
print(result.stdout[:1000])
```
This is where OCR adds the most value. The 795 image-only files had no text at all — now they have usable content.
The quality depends on the scan, but for clean B&W images at 300 DPI, Tesseract produces good results.
A tiered strategy
In your notebook, run the next two cells to define and test a three-tier extraction function.
```python
# Three-tier extraction
import pymupdf

def extract_text(pdf_path):
    doc = pymupdf.open(pdf_path)
    page_count = len(doc)
    pages = []
    ocr_pages = 0
    for page in doc:
        text = page.get_text().strip()
        if text:
            pages.append(text)  # (1)
        else:
            tp = page.get_textpage_ocr(
                dpi=300, language="eng")  # (2)
            text = page.get_text(textpage=tp).strip()
            pages.append(text)
            ocr_pages += 1
    doc.close()
    return "\n\n".join(pages), page_count, ocr_pages
```
Page has a text layer — use it directly (near-instant)
No text layer — fall back to built-in OCR at 300 DPI
No file-size heuristic needed. Each page is tested individually: if get_text() returns content, use it. If not, OCR it. This handles mixed PDFs where some pages are digital and others are image-only.
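To make the per-page branch concrete without touching a real PDF, the same decision can be exercised against stub pages (StubPage is a hypothetical stand-in for PyMuPDF's Page):

```python
# Stub pages: one with a text layer, one image-only (empty get_text()).
class StubPage:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text

def tier_for(page):
    """Pick the tier for a page: use the text layer if present, else OCR."""
    return "text-layer" if page.get_text().strip() else "ocr"

pages = [StubPage("Date: 2001-05-22"), StubPage("")]
tiers = [tier_for(p) for p in pages]
print(tiers)  # ['text-layer', 'ocr']
```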
84% of pages have text layers and extract instantly. Only the 16% image-only pages trigger OCR.
Speed comparison
In your notebook, run the speed comparison cell to benchmark all three methods.
| Method | Speed |
| --- | --- |
| get_text() (existing layer) | ~200-300 PDFs/sec |
| PyMuPDF built-in OCR | ~1-3 PDFs/sec |
| Tesseract CLI | ~0.3-1 PDFs/sec |
Plain text extraction is roughly 250-300x faster than OCR.
The tiered strategy keeps things fast:
| Tier | Files | Estimated time |
| --- | --- | --- |
| Text layer (84%) | 4,116 | ~15 seconds |
| OCR (16% image-only) | 795 | ~15-30 minutes |
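The estimates follow from the benchmark rates; a quick back-of-envelope check (the specific rates below are assumptions picked from the measured ranges):

```python
text_layer_files = 4116   # files whose pages have a text layer
image_only_files = 795    # image-only files that need OCR

plain_rate = 250          # get_text(): within the ~200-300 PDFs/sec range
ocr_rate_best = 1.0       # OCR: optimistic end of the benchmark ranges
ocr_rate_worst = 0.4      # pessimistic end

plain_secs = text_layer_files / plain_rate
ocr_mins_best = image_only_files / ocr_rate_best / 60
ocr_mins_worst = image_only_files / ocr_rate_worst / 60

print(f"text-layer tier: ~{plain_secs:.0f} s")                     # ~16 s
print(f"OCR tier: ~{ocr_mins_best:.0f}-{ocr_mins_worst:.0f} min")  # ~13-33 min
```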
For larger corpora, you might parallelise the OCR step or consider a vision LLM for the hardest cases.
Test OCR on your hardest files first
In our dataset, re-OCR works well because the underlying page images are clean — the errors came from an older OCR pass. Your data may be different. If your scans are physically degraded (faded ink, creased pages, low resolution), re-OCR won’t help much — the quality ceiling is always the image. Test OCR on a small sample of your hardest files before committing to a full-corpus run. Also consider the language parameter (-l eng) — if your documents contain non-English text, you’ll need the appropriate Tesseract language pack.
Alternative: EasyOCR
EasyOCR is a powerful alternative — pure pip install, no system dependencies. In your notebook, run the next two cells to install it and try it on the same file.
Deep learning-based (CRAFT + CRNN) — significantly faster with a GPU
paragraph=True groups detected text into natural paragraphs
EasyOCR reads body text more naturally, but can struggle with structured header fields and is ~5x slower than Tesseract on CPU.
If your scans sit right at the edge of what Tesseract can handle, a deep-learning package like EasyOCR is worth trying.
Other OCR options
Tesseract and EasyOCR are far from the only choices. Depending on your use case, you might also consider:
docTR — deep learning OCR from Mindee, good accuracy, supports PyTorch and TensorFlow
PaddleOCR — multilingual OCR from Baidu, strong on non-Latin scripts
Surya — newer transformer-based OCR with layout detection
Cloud APIs — Google Cloud Vision, AWS Textract, Azure AI Document Intelligence offer high accuracy at per-page cost
Vision LLMs — GPT-4o, Claude, etc. can read scanned documents directly, with the best context understanding but the highest cost and lowest throughput
Check your understanding
When re-OCR helps
An email PDF has a garbled text layer (ENRON CORE. instead of ENRON CORP.) but the underlying page image is clean. What happens when you re-OCR it?
❏ The same errors appear because the image is degraded
❏ Re-OCR makes things worse by adding new errors on top of old ones
✓ Modern Tesseract reads the clean image and produces correct text — the old errors came from an older OCR pass, not from the image
❏ Re-OCR only works on image-only PDFs
Hint
The quality ceiling is the image, not the existing text layer. If the image is clean, what determines OCR quality?
Solution
The garbled text layer came from an earlier, lower-quality OCR pass. The page image itself is clean. Modern Tesseract running on the clean image produces dramatically better results — ENRON CORP. instead of ENRON CORE., Date: instead of Bate:. Re-OCR recovers what the original OCR missed.
Summary
OCR recovers text from image-only PDFs that have no text layer — this is where it adds the most value
Re-OCR on clean images produces dramatically better results when the existing text layer is garbled — the quality ceiling is the image, not the old text layer
Tesseract CLI with --psm 6 gives you control over page segmentation — useful for experimentation on difficult scans
PyMuPDF built-in OCR is simpler and faster for bulk extraction — no subprocess overhead, no temp files
300 DPI is the industry standard for OCR on scanned documents
OCR is ~250-300x slower than plain text extraction — use it only where needed
A three-tier strategy (digital text → existing OCR → fresh OCR on image-only) balances speed and coverage
EasyOCR is a Python-only alternative — no system install, better with a GPU, but slower on CPU
Next: Before we run the full pipeline, we’ll look at combined extraction packages that handle text reading, OCR, and layout analysis in a single tool.