Parsing Approaches

From Text to Structure

You’ve decided to build a metadata graph — your task now is to extract sender, recipient, date, and subject from 5,000 text files.

Raw text flowing through different parsing approaches to produce structured records

What You’ll Learn

By the end of this lesson, you’ll be able to:

  • Describe five approaches to email parsing — including layout-aware extraction

  • Consider when each approach might suit your dataset

  • Assess the tradeoffs between speed, cost, training data, and maintenance

Approach 1: Parsing Libraries

Standard email formats are well-specified. Python’s email.parser handles RFC 5322 (the current email standard, which superseded RFC 2822) and MIME messages completely — headers, body, attachments, encoding — with no pattern writing required.

```python
# Standard library email parsing
import email
from email import policy

with open("message.eml") as f:  # placeholder path
    msg = email.message_from_file(f, policy=policy.default)

# Direct field access — no parsing, no patterns, no models.
# The structure is already there.
sender = msg["From"]
subject = msg["Subject"]
body = msg.get_body().get_content()
```

This works well when the format is standard. It faces limitations the moment the structure is gone — which happens as soon as you convert emails to PDFs.
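Both cases can be sketched in a few lines. The sample messages below are invented for illustration; the second one mimics text recovered from a PDF, where the `Header: value` structure has been lost:

```python
import email
from email import policy

# A well-formed RFC 5322 message parses cleanly
raw = "From: alice@example.com\nSubject: Q3 report\n\nPlease find attached."
msg = email.message_from_string(raw, policy=policy.default)
print(msg["From"])   # alice@example.com

# PDF-extracted text has no valid header block, so the parser
# records a defect and treats everything as body text
pdf_text = "Email extracted from PDF\nFrom alice@example.com\nSubject Q3 report"
msg2 = email.message_from_string(pdf_text, policy=policy.default)
print(msg2["From"])  # None: no From: header was found
```

The parser does not fail loudly on the second input; it simply returns `None` for every field, which is why the breakage can go unnoticed until you inspect the graph.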

Approach 2: Layout-Aware Extraction

When emails live in PDFs, the original structure (columns, tables, font sizes, bounding boxes) is encoded in the file but usually discarded during plain text extraction.

Layout-aware tools read both the text and the visual structure simultaneously — producing structured fields in a single pass rather than requiring a separate parsing step.

  • Docling → to extract sections, tables, and headers from PDFs using layout analysis, with no training required

  • LayoutLM / LayoutLMv3 → to classify each text token as a specific field using both text and bounding box position — requires fine-tuning on annotated examples

  • Cloud Document AI → to combine OCR and layout understanding in a managed API (Google, Azure, AWS)

For email PDFs with consistent table layouts, this approach can eliminate most of the parsing challenge before it begins.

Because these tools work from the PDF itself, layout-aware extraction happens at the same time as text extraction — not as a separate downstream parsing step.

Why we didn’t show you this earlier

The most important consideration in any graph project is the graph itself — specifically, its data model.

A common first approach with many of these processes is to:

  • Find a cool package that promises to extract data

  • Naively extract that data to a graph

  • Wonder why the graph isn’t working

  • Start again

It is important to understand that Docling and similar packages combine multiple jobs into a single process, which limits the control you have over each one. They are easier to manage when you already understand every job they are doing.

Approach 3: Rule-Based Parsing

When the structure isn’t standard — PDF-extracted text, exported archives, proprietary formats — you encode what you know as rules. This covers a wide spectrum of sophistication:

  • Simple patterns → to match a label and capture the value

  • Structural templates → to define the expected sequence of labels and values across a whole document

  • AI-assisted development → to use an LLM to help write and test patterns for edge cases

The more structure you encode, the more your parser handles. The cost is maintenance — every new layout pattern means more rules.

A spectrum from simple regex to sophisticated structural templates

This approach will only really work when your data contains predictable, repeating patterns. In such cases, you could run parsing and extraction on millions of files within a couple of hours for free.
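As an illustration of the simplest end of that spectrum, a label-and-capture approach needs only one pattern per field. The sample text and field names here are invented for illustration:

```python
import re

# Text as it might come out of a PDF: labelled values, but no RFC structure
text = """From: alice@example.com
To: bob@example.com
Date: 2001-05-14
Subject: Q3 forecast"""

# One pattern per field: match the label at the start of a line, capture the value
patterns = {
    "sender": r"^From:\s*(.+)$",
    "recipient": r"^To:\s*(.+)$",
    "date": r"^Date:\s*(.+)$",
    "subject": r"^Subject:\s*(.+)$",
}

record = {
    field: m.group(1).strip()
    for field, pattern in patterns.items()
    if (m := re.search(pattern, text, re.MULTILINE))
}
print(record)
```

A structural template extends this idea by also asserting the *order* of the labels, so a document that matches out of sequence is flagged rather than half-parsed.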

Approach 4: ML-Based Extraction

Trained models learn to extract fields from labelled examples, rather than matching patterns you write by hand. Two approaches are relevant here:

  • spaCy NER and spancat — train a model to classify which spans of text belong to which fields. Generalises well across layout variations once trained, but requires annotated examples.

  • GLiNER — zero-shot entity extraction using a pre-trained transformer encoder. Provide labels at inference time and it finds matching spans — no annotation required.

These sit between rule-based parsing and LLMs: free at inference time, deterministic, and capable of generalising to unseen layouts.

Approach 5: LLM-Based Parsing

LLM-based parsing sends the extracted text to a language model and asks it to return structured fields. The model reads the text and infers the structure from meaning rather than patterns.

  • Flexible → to handle layouts and noise that no template covers

  • No training data → to require only a well-written prompt

  • Expensive → at ~$1-5 per 1,000 emails compared to free alternatives

  • Non-deterministic → the same email may parse differently on different runs

LLMs add genuine value on the hard cases — badly garbled text, unusual layouts, novel formats.

On clean, well-structured text they add cost and time but little to no improvement over cheaper alternatives.
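The overall pattern is simple: build a prompt, call the model, validate the reply. A sketch with the API call stubbed out — the prompt wording and the simulated model reply are assumptions; a real pipeline would call a provider SDK at that point:

```python
import json

def build_prompt(text: str) -> str:
    # Ask the model to return the fields as a JSON object
    return (
        "Extract sender, recipient, date, and subject from the email below.\n"
        "Return only a JSON object with those four keys.\n\n" + text
    )

def parse_reply(reply: str) -> dict:
    # Validate the reply; LLM output is not guaranteed to be well-formed
    record = json.loads(reply)
    missing = {"sender", "recipient", "date", "subject"} - record.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return record

# Simulated model reply, standing in for the API response
reply = ('{"sender": "alice@example.com", "recipient": "bob@example.com", '
         '"date": "2001-05-14", "subject": "Q3 forecast"}')
record = parse_reply(reply)
print(record["sender"])
```

The validation step matters precisely because of the non-determinism noted above: the same email may come back with a missing key or malformed JSON on a different run.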

Comparing the Approaches

|                 | Libraries | Layout-aware       | Rule-based        | ML models        | LLM         | LLM (Batch API)    |
|-----------------|-----------|--------------------|-------------------|------------------|-------------|--------------------|
| Speed           | ~10K/sec  | ~10-100/sec        | ~1K/sec           | ~100-500/sec     | ~1-5/sec    | ~10-50/sec (async) |
| Cost            | Free      | Free-$$            | Free              | Free (inference) | ~$0.50-5/1K | ~$0.25-2.50/1K     |
| Training needed | No        | Sometimes          | No                | Yes (examples)   | No          | No                 |
| Determinism     | Yes       | Yes                | Yes               | Yes              | No          | No                 |
| Generalisation  | RFC only  | Layout-dependent   | Pattern-dependent | Learns from data | High        | High               |

Hybrid pipeline

In practice, you might opt for a hybrid pipeline that categorizes your data and routes each subset to the most suitable approach.

  • Libraries for well-formed RFC email

  • Layout-aware tools for complex documents (tables, charts, etc.)

  • Rules for predictable text layouts

  • ML models for less predictable patterns but predictable spans

  • LLMs for edge cases or near-recoverable corruption
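The heart of such a pipeline is a router: a cheap classifier that looks at each document and picks the approach. A minimal sketch — the detection heuristics here are illustrative assumptions, not a production classifier:

```python
import re

def route(text: str) -> str:
    """Pick a parsing approach for one document (illustrative heuristics)."""
    # Well-formed RFC headers -> standard library parser
    if re.match(r"^[\w-]+:\s", text) and "\n\n" in text:
        return "library"
    # Recognisable label/value layout -> rule-based templates
    if re.search(r"^(From|To|Subject)\b", text, re.MULTILINE):
        return "rules"
    # Everything else -> the expensive fallback
    return "llm"

print(route("From: alice@example.com\nSubject: hi\n\nBody"))  # library
print(route("From alice@example.com\nSubject hi"))            # rules
print(route("g@rbled OCR noise"))                             # llm
```

The ordering encodes the cost gradient: try the free, deterministic approaches first, and only send to an LLM what nothing cheaper can handle.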

What’s Next

Over the next lessons, you’ll work through each approach in turn:

  1. Layout-aware extraction — use Docling and understand LayoutLM for structured PDF extraction

  2. Parsing libraries — use email.parser on the Enron plaintext corpus and see where it breaks

  3. Rule-based parsing — build patterns and structural templates

  4. ML-based parsing — explore spaCy and GLiNER for email extraction

  5. LLM parsing — direct API calls for the cases that need them

Then you’ll combine them into a hybrid pipeline, prepare the records for import, and explore the graph.

Check your understanding

Parsing tradeoffs

You have a corpus of 500,000 scanned invoices with varying layouts. Which parsing approach is the best starting point?

  • ❏ Regex templates — they’re the fastest and cheapest

  • ❏ An LLM — it can handle any layout

  • ✓ It depends on how many distinct layouts there are — if most invoices share a few layouts, templates first; if every invoice is different, an LLM or NER approach

  • ❏ OCR is the only option for scanned documents

Hint

The right tool depends on the data. What do you need to know about the invoices before choosing?

Solution

There’s no universal best approach. Templates are fast and cheap but need a known layout for each variant. LLMs handle unknown layouts but are slow and expensive at 500K files. NER models generalize across layouts after training. The first step is always to investigate the data — how many layouts are there, how consistent are they?

Summary

  • Parsing libraries — well-suited to standard RFC email. Free, instant, complete. Faces limitations on PDF-extracted text.

  • Layout-aware extraction — reads visual structure alongside text. Eliminates the parsing step for consistent PDF layouts. Requires model or service setup.

  • Rule-based parsing — encodes what you know as patterns. Free, deterministic, maintainable. Performance depends on effort invested.

  • ML-based extraction — trained or zero-shot models that learn to find fields. Free at inference time, generalizes beyond your patterns.

  • LLM-based parsing — comprehension-based extraction. Handles anything, but costs money and introduces non-determinism.

  • A hybrid pipeline uses the right tool for each subset of your data

Next: We’ll explore layout-aware extraction with Docling and LayoutLM.
