You’ve now seen four approaches to extracting text from PDFs:
PyMuPDF — fast, reliable on files with text layers
Tesseract — OCRs image-only pages
Combined packages (Docling, Unstructured, etc.) — one tool for the whole pipeline
Vision models — highest quality, highest cost
Time to put it into practice. Open 1.5_full_extraction.ipynb in your notebook environment.
The strategy
The notebook processes all 4,911 PDFs using a per-page tiered strategy.
For each PDF:
Open it with PyMuPDF
Try get_text() on every page
If a page returns text, use it (fast)
If a page returns nothing, fall back to built-in OCR at 300 DPI (slower)
Write one .txt file per PDF to data/extracted_text/
As we saw in the OCR lesson, both Tesseract CLI and PyMuPDF’s built-in OCR produce comparable results on clean B&W scans. For bulk extraction, the built-in method avoids subprocess overhead across 795 files and keeps everything in a single dependency.
The extraction function
In your notebook, run the first four cells to install dependencies, configure paths, and define the extraction function.
```python
# Per-page tiered extraction
import pymupdf

def extract_text(pdf_path):
    doc = pymupdf.open(pdf_path)
    page_count = len(doc)
    pages = []
    ocr_pages = 0
    for page in doc:
        text = page.get_text().strip()
        if text:
            pages.append(text)  # (1)
        else:
            tp = page.get_textpage_ocr(
                dpi=300, language="eng")  # (2)
            text = page.get_text(textpage=tp).strip()
            pages.append(text)
            ocr_pages += 1
    doc.close()
    return "\n\n".join(pages), page_count, ocr_pages
```
1. Page has a text layer — use it directly (near-instant)
2. Image-only page — fall back to built-in OCR at 300 DPI
The function returns the text, page count, and how many pages needed OCR — useful for understanding the dataset split.
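The notebook's driver cell ties this together: loop over the PDFs, write one `.txt` per file, and keep the per-file counts for the results summary. A minimal sketch of that loop, assuming a `run_extraction` helper where `extract` is the `extract_text` function above (the helper name and stats format are illustrative, not the notebook's exact code):

```python
import pathlib

def run_extraction(pdf_paths, out_dir, extract):
    """Write one .txt file per PDF and collect per-file stats.

    `extract` returns (text, page_count, ocr_pages),
    matching the extraction function's signature.
    """
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stats = []
    for pdf in pdf_paths:
        pdf = pathlib.Path(pdf)
        text, page_count, ocr_pages = extract(pdf)
        # One output file per input PDF, same basename
        (out_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
        stats.append({"file": pdf.name,
                      "pages": page_count,
                      "ocr_pages": ocr_pages})
    return stats
```

Collecting `ocr_pages` per file is what lets the later summary cell report how many files needed OCR at all.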
Run the extraction
Pre-extracted text files are included in the repo. Run the next cell to confirm that all 4,911 files are present.
If you’d like to re-extract from scratch (~20 minutes), uncomment the two lines at the top of the cell that clear the output directory. This deletes the existing files and re-runs the full extraction.
Digital files are processed first (instant), then scanned files (may trigger OCR on image-only pages).
Understand the speed split
84% of files have text layers and extract instantly via get_text(). Only the 16% image-only files trigger OCR, which runs at ~1-3 pages/sec.
The results summary in the next cell shows the method split.
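The split itself is easy to compute from the per-file counts the extraction function returns. A sketch, assuming you kept a dict per file with `pages` and `ocr_pages` keys (this aggregation is illustrative, not the notebook's exact cell):

```python
def summarize(stats):
    """Aggregate per-file stats into a method split.

    A file counts as 'ocr' if any of its pages needed OCR,
    and 'text_layer' otherwise.
    """
    text_layer = sum(1 for s in stats if s["ocr_pages"] == 0)
    return {
        "text_layer": text_layer,
        "ocr": len(stats) - text_layer,
        "total_pages": sum(s["pages"] for s in stats),
    }
```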
Results summary
In your notebook, run the next cell to see the extraction statistics — how many files used each method, how many succeeded, and total characters extracted.
Then run the verify cell to spot-check a few extracted files and confirm they look reasonable.
Try your own data
The extraction function works on any PDF dataset, not just Enron emails. If you have your own PDFs:
Drop them into a folder
Point PDF_DIR at that folder
Run the notebook
The per-page strategy adapts automatically — pages with text layers get fast extraction, image-only pages get OCR. No thresholds to tune.
A note on real-world data
Our dataset has a 44/40/16 split between digital PDFs, scans with embedded text layers, and image-only scans. This is specific to our synthetic dataset. In a real document dump — like the Clinton FOIA releases — every file is typically a scanned image with an OCR text layer. There are no "digital" PDFs because the originals were physically printed and scanned.
If the OCR text layers in your documents are good enough for your purposes, you can skip re-OCR entirely and extract with get_text() alone — instant, no dependencies, no configuration. Run a few samples through your parsing pipeline first to check whether the existing OCR quality is sufficient.
Going forward
From this point on, we’ll work with the pre-extracted plain text files rather than re-running extraction each time. The extraction step is a one-time cost — once you have your .txt files, the rest of the pipeline operates on text.
In the next module, we’ll parse these text files into structured email records: From, To, Subject, date, and body.
Check your understanding
Per-page extraction strategy
The extraction function checks each page individually: if get_text() returns content, use it; if not, OCR it. Why per-page rather than per-file?
❏ Per-page is faster because it processes less data
❏ PyMuPDF can only open one page at a time
✓ A single PDF can have some pages with text layers and others without — per-page handling extracts text where it exists and OCRs only the pages that need it
❏ Per-file processing would require a different library
Hint
Think about a multi-page scanned document where the first page was OCR’d but the second wasn’t.
Solution
Mixed PDFs are common — some pages digital, others image-only. Per-page checking ensures each page gets the right treatment. A per-file heuristic (like file size) would either OCR pages that don’t need it (slow) or skip pages that do (missing content).
Summary
Run 1.5_full_extraction.ipynb to extract the full Enron dataset
The per-page tiered strategy tries get_text() first, falling back to built-in OCR at 300 DPI only on image-only pages
Built-in OCR for bulk extraction — comparable quality to Tesseract CLI on clean scans, simpler and faster at scale
84% of files extract instantly via text layers; only 16% need OCR
The same approach works on your own PDF datasets — no configuration needed
From here on, we work with the extracted plain text files — extraction is a one-time step