So far we’ve assembled our own extraction pipeline from separate tools — PyMuPDF for text layers, Tesseract for OCR, and custom logic to decide which to use.
An alternative approach is to use a combined extraction package that handles all of this in one tool: text extraction, OCR, and layout analysis in a single pipeline.
Open 1.3_extracting_with_docling.ipynb in your notebook environment to follow along.
By default, Docling trusts the existing text layer. Since E61D04918.pdf has a noisy OCR layer (Bate: for Date:, ENRON CORE. for ENRON CORP.), the output carries those same errors.
Docling’s export_to_text() internally uses its markdown exporter, which HTML-encodes angle brackets (< becomes &lt;). Since our email headers use the Name <email> format, html.unescape() restores the original characters.
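To see what that decoding step does, here is a minimal example using only the standard library. The sample header line is illustrative, not from the actual dataset:

```python
import html

# A header line as it might come back from export_to_text():
# the markdown exporter has HTML-encoded the angle brackets.
encoded = "From: Jeff Dasovich &lt;jeff.dasovich@enron.com&gt;"

decoded = html.unescape(encoded)
print(decoded)  # From: Jeff Dasovich <jeff.dasovich@enron.com>
```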
This is the same behavior as PyMuPDF’s get_text() — the existing text layer is used as-is. To get better text, we need to tell Docling to ignore the text layer and re-OCR the page image.
Forcing full-page OCR
In your notebook, run the next two cells to configure a second converter with forced OCR and extract the same file.
```python
pipeline_options_ocr = PdfPipelineOptions()
pipeline_options_ocr.do_ocr = True
pipeline_options_ocr.ocr_options = OcrMacOptions(
    force_full_page_ocr=True  # (1)
)
# The notebook detects your platform
# and uses EasyOcrOptions on Linux
converter_ocr = DocumentConverter(...)
```
```python
result = converter_ocr.convert(str(SAMPLE_PDF))
ocr_text = html.unescape(
    result.document.export_to_text()
)
print(ocr_text[:1000])
```
(1) Ignores the existing text layer and re-OCRs the page image using macOS Vision (or EasyOCR on Linux)
Compare the output to the default extraction. In our dataset, the underlying page images are clean — the garbled text layer came from an earlier, lower-quality OCR pass. Forced OCR re-reads the clean image and produces dramatically better results.
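One way to see that improvement concretely is a line diff between the two extractions. A minimal sketch with difflib, using short illustrative strings in place of the real default and forced-OCR outputs:

```python
import difflib

# Illustrative stand-ins for the two extractions
default_text = "Bate: 10/05/2001\nENRON CORE.\n"  # noisy existing text layer
ocr_text = "Date: 10/05/2001\nENRON CORP.\n"      # forced full-page OCR

diff = "\n".join(difflib.unified_diff(
    default_text.splitlines(),
    ocr_text.splitlines(),
    fromfile="default (text layer)",
    tofile="forced OCR",
    lineterm="",
))
print(diff)
```

Each `-`/`+` pair in the output marks a line where forced OCR corrected the text layer.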
OCR on image-only PDFs
In your notebook, run the next cell to extract an image-only file.
```python
result = converter.convert(str(EMPTY_PDF))
empty_text = html.unescape(
    result.document.export_to_text()
)
print(empty_text[:1000])
```
Docling detects there’s no text layer and OCRs automatically — no special configuration needed.
This is where combined packages shine. With our modular approach, we needed a tiered strategy to route image-only files to Tesseract. Docling handles it automatically in a single code path.
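For contrast, the routing decision the modular pipeline had to make explicitly looks roughly like this. A hedged sketch — the function name and character threshold are illustrative, not taken from the notebook:

```python
def choose_extractor(text_layer: str, min_chars: int = 50) -> str:
    """Route a file based on its extracted text layer: keep the text
    layer if it has real content, otherwise fall back to OCR.
    The threshold is illustrative."""
    if len(text_layer.strip()) >= min_chars:
        return "text_layer"  # PyMuPDF get_text() result is usable
    return "ocr"             # image-only file: send to Tesseract

print(choose_extractor("From: Jeff Dasovich <jeff.dasovich@enron.com>\nSubject: CPUC filing update"))
print(choose_extractor(""))  # image-only PDF yields an empty text layer
```

Docling collapses this branch into a single code path: it makes the same decision internally.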
One tool for any PDF
In your notebook, run the next cell to extract a clean digital PDF with the same converter.
The default converter reads the text layer on clean files and OCRs image-only files — no file-size heuristic, no switching tools. One call for any PDF:
```python
result = converter.convert("any_pdf.pdf")
text = html.unescape(result.document.export_to_text())
```
The limitation: it trusts existing text layers, so files with OCR errors carry those errors through. For those, you need force_full_page_ocr.
Quality comparison
In your notebook, run the next cell to compare PyMuPDF, Tesseract CLI, and Docling’s forced OCR on the same file.
Each OCR engine has its own error profile. The existing text layer, a fresh Tesseract re-OCR, and Docling’s Vision OCR will each produce slightly different results on the same image.
Compare the header fields in the output — Case No, Doc No, Date, From, Subject — to see where the engines agree and disagree.
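To line the engines up systematically rather than by eye, you can pull each header field with a regex and compare the values. A sketch with a reduced field list and invented sample outputs (not actual engine results):

```python
import re

FIELDS = ["Date", "From", "Subject"]

def header_fields(text: str) -> dict:
    """Extract 'Field: value' lines from an extraction's header block."""
    fields = {}
    for name in FIELDS:
        m = re.search(rf"^{name}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[name] = m.group(1).strip() if m else None
    return fields

# Illustrative outputs from two engines on the same page
engine_a = "Bate: 10/05/2001\nFrom: Jeff Dasovich\nSubject: CPUC filing"
engine_b = "Date: 10/05/2001\nFrom: Jeff Dasovich\nSubject: CPUC filing"

print(header_fields(engine_a))  # 'Date' is None: the 'Bate:' error hid it
print(header_fields(engine_b))
```

A None value flags a field one engine garbled badly enough that it no longer matches, which is often the fastest way to spot disagreements.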
For our dataset, re-OCR from any engine produces much better results than the existing text layer, because the underlying images are clean. The garbled text came from an earlier OCR pass, not from a degraded image.
In other datasets, the image itself may be the limiting factor. The quality ceiling is always the source image.
Speed comparison
In your notebook, run the speed comparison cell to benchmark all three approaches on a mixed sample.
The modular strategy (PyMuPDF + Tesseract) is fastest because get_text() is nearly instant on files with a text layer. Docling runs layout analysis on every file, even when simple text extraction would suffice.
In your notebook, run the next cell to test Docling with OCR disabled on clean files. Even in text-only mode, the layout analysis overhead means it can’t match PyMuPDF’s raw speed.
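A minimal version of such a benchmark is just a loop around time.perf_counter(). The extractor function here is a stand-in — in the notebook it would wrap PyMuPDF, the Tesseract CLI, or Docling:

```python
import time

def benchmark(fn, files, repeats=1):
    """Time fn over a list of files; returns elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for f in files:
            fn(f)
    return time.perf_counter() - start

# Stand-in extractor for illustration (e.g. PyMuPDF get_text())
def fast_extract(path):
    return f"text from {path}"

files = ["a.pdf", "b.pdf", "c.pdf"]
elapsed = benchmark(fast_extract, files)
print(f"{elapsed:.4f}s for {len(files)} files")
```

Running the same harness over each extractor with the same file list gives a like-for-like comparison.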
Modular vs combined: when to use each
Choose a combined package (Docling, Unstructured, etc.) when:

- You want a single tool for the entire extraction process — no tiered strategy
- Your documents have complex layouts (multi-column, tables, mixed content)
- Your dataset is small enough that speed doesn’t matter
- You want layout-aware extraction and are willing to pay the speed cost
Choose modular tools (PyMuPDF + Tesseract) when:

- You’re processing hundreds of thousands or millions of documents and speed matters
- Your documents are simple (single-column emails like ours)
- You need fine-grained control over each step
- You need cross-platform consistency
For this workshop, we’ll use PyMuPDF + Tesseract for the full corpus extraction — speed matters at 5,000 files. But if you’re working with your own PDFs and starting fresh, a combined package like Docling could be the better first choice: one tool, one code path, and layout awareness without needing to build a tiered strategy yourself.
Consider layout-aware tools for complex documents
Our email PDFs are simple single-column text, so modular tools win on speed. If your documents have tables, multi-column layouts, forms, or mixed content (e.g. invoices, contracts, medical records), a combined package like Docling may be worth the speed cost — layout analysis is precisely what those documents need. Try both on a sample and compare the output quality before deciding.
Check your understanding
Combined package tradeoff
Docling is ~100x slower than PyMuPDF for text extraction, even with OCR disabled. Why?
❏ Docling uses Python instead of C for text reading
❏ Docling downloads text from a cloud API
✓ Docling runs layout analysis (layout detection, reading order, structure) on every page, even when it doesn’t need to OCR
❏ Docling re-OCRs every page by default
Hint
What does Docling do that PyMuPDF doesn’t, even when both are just reading text?
Solution
Docling’s document understanding pipeline analyzes the visual layout of every page — detecting regions, determining reading order, classifying content types. This is what gives it layout awareness, but it runs even on pages where simple text extraction would suffice. GPU acceleration closes this gap significantly.
Summary
- Combined extraction packages handle text extraction, OCR, and layout analysis in a single pipeline — no tiered strategy needed
- Docling, Unstructured, Marker, and cloud APIs are popular options
- In default mode, Docling reads text layers or OCRs as needed — one tool, one code path
- Use force_full_page_ocr=True when you don’t trust existing text layers
- Combined packages add layout analysis on top of extraction, but are slower than modular tools — quality depends on the document and OCR engine
- The speed gap comes from layout analysis — these packages interpret every page, even when simple text extraction would suffice
- Best suited for smaller datasets, complex layouts, or when you want one simple code path
Next: We’ll look at vision models — the most capable extraction method, and the most expensive — to understand when they’re worth the cost.