You’ve built four parsing tools: regex templates, a finetuned NER model, an LLM prompt, and a zero-shot extractor. This notebook combines three of them — templates, NER, and the LLM — into a pipeline that produces CSV files ready for cleaning and import.
Open 2.7_hybrid_pipeline.ipynb in your Codespace to follow along.
What you’ll learn
By the end of this lesson, you’ll be able to:
Run the template parser on the full corpus and collect failures
Use NER to identify individual entities within template-extracted fields
Pair NER-detected names with their email addresses using document order
Merge LLM results for template failures into the same record format
Write node and relationship CSV files for graph import
The pipeline
Templates — identify which fields exist and where the header boundaries are. 98.5% of emails match.
NER — run on every template-matched header to identify individual people and email addresses within the raw field strings.
LLM — for the ~70 emails templates couldn’t handle (from notebook 2.6), where entities are already split into arrays.
Build records — assemble standardised records from NER entities and LLM results.
CSV output — node and relationship files ready for cleaning and import.
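The five steps above can be sketched as a single dispatch function. The helpers here are toy stand-ins for the notebook’s real parsers — only the branching structure is the point:

```python
# Toy stand-ins for the notebook's real parsers, just to show the dispatch.
def match_template(text):
    """Pretend template parser: matches any email whose header starts with From:."""
    return {"header": text} if text.startswith("From:") else None

def run_ner(header):
    """Pretend NER model: tags the whole header as one SENDER span."""
    return [{"label": "SENDER", "text": header.removeprefix("From: ")}]

def extract_with_llm(text):
    """Pretend LLM fallback for template failures."""
    return {"SENDER_NAME": "recovered by LLM"}

def process(text, doc_id):
    fields = match_template(text)                 # step 1: template pass
    if fields is not None:
        entities = run_ner(fields["header"])      # step 2: NER on matched headers
        return {"doc_id": doc_id, "method": "template+ner", "entities": entities}
    llm_output = extract_with_llm(text)           # step 3: LLM fallback
    return {"doc_id": doc_id, "method": "llm", **llm_output}
```

Steps 4 and 5 (record building and CSV output) then operate on the uniform dictionaries both branches return.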
Step 1: template pass
Run every email through the template parser. Templates identify which fields exist and where the header ends — but the field values are raw strings that may contain multiple people, email addresses, and noise.
In your notebook, run the template pass cell. ~98.5% match, ~70 fail.
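The template pass boils down to a loop that partitions the corpus into matches and failures. A minimal sketch, with a toy `match_template` standing in for the real parser and a two-email corpus standing in for the full one:

```python
def match_template(email_text):
    """Toy stand-in: 'match' any email whose header starts with From:."""
    if email_text.lstrip().startswith("From:"):
        header, _, body = email_text.partition("\n\n")
        return {"header": header, "body": body}
    return None  # template failure

corpus = [
    "From: Kenneth Lay <kenneth.lay@enron.com>\n\nSee attached.",
    "Frm: garbled OCR header\n\nBody text.",  # OCR-damaged -> failure
]

matched, failures = [], []
for doc_id, text in enumerate(corpus):
    result = match_template(text)
    if result is None:
        failures.append(doc_id)           # these go to the LLM
    else:
        matched.append((doc_id, result))  # these go to NER

match_rate = len(matched) / len(corpus)
```

On the real corpus this same loop yields the ~98.5% match rate and the ~70 failure doc IDs.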
Step 2: NER pass
The NER model runs on the header text of every template-matched email. It identifies individual entity spans — each person’s name, each email address, the date, and the subject — tagged with their role (SENDER, RECIPIENT, CC_RECIPIENT, etc.).
This is where multi-recipient strings like "Buy, Rick, mark.e.haedicke@enron.com, Rieker, Paula" get split into individual entities. The NER model understands that Buy, Rick is one person, not two.
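For that raw field string, the NER output looks roughly like the list below (the span format is illustrative, but the document-order labels match what the pipeline consumes):

```python
# Hypothetical NER output for the raw To field
# "Buy, Rick, mark.e.haedicke@enron.com, Rieker, Paula".
# Note the model tags "Buy, Rick" as ONE person span despite the internal comma.
entities = [
    {"label": "RECIPIENT",       "text": "Buy, Rick"},
    {"label": "RECIPIENT_EMAIL", "text": "mark.e.haedicke@enron.com"},
    {"label": "RECIPIENT",       "text": "Rieker, Paula"},
]

names = [e["text"] for e in entities if e["label"] == "RECIPIENT"]
```

A comma-split of the raw string would have produced five fragments; the NER spans produce two people and one address.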
Pairing names with emails
NER entities come out in document order: a RECIPIENT name is immediately followed by its RECIPIENT_EMAIL.
The one wrinkle is a bare email with no preceding name — susan.j.mara, for example, had no display name in the original header. The record builder walks this sequence and pairs each name with the email that follows it; bare emails without a name become email-only entries.
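A minimal sketch of that walk — the function name and the addresses in the example are illustrative, but the "pending name" logic mirrors what the quiz solution below describes:

```python
def pair_entities(entities):
    """Pair each name with the email that follows it in document order.

    A bare email with no preceding name becomes an email-only entry;
    a name never followed by an email becomes a name-only entry.
    """
    pairs, pending_name = [], None
    for ent in entities:
        if ent["label"].endswith("_EMAIL"):
            pairs.append({"name": pending_name, "email": ent["text"]})
            pending_name = None
        else:
            if pending_name is not None:   # previous name had no email
                pairs.append({"name": pending_name, "email": None})
            pending_name = ent["text"]
    if pending_name is not None:           # trailing name with no email
        pairs.append({"name": pending_name, "email": None})
    return pairs

# Illustrative spans: one name+email pair, then a bare address.
entities = [
    {"label": "RECIPIENT",       "text": "Buy, Rick"},
    {"label": "RECIPIENT_EMAIL", "text": "rick.buy@enron.com"},
    {"label": "RECIPIENT_EMAIL", "text": "bare.address@example.com"},
]
pairs = pair_entities(entities)
```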
Step 3: LLM results
The ~70 template failures were sent to the LLM in notebook 2.6. The LLM already returns individuals in arrays (RECIPIENT_NAMES: ["Alice", "Bob"]), so no further entity detection is needed. Load the results and build records from the same flat format.
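Because the LLM already splits entities, building recipient entries is a direct mapping. A sketch with illustrative field values (the key names follow the `RECIPIENT_NAMES` convention from notebook 2.6):

```python
# One LLM result for a template-failure email (values are illustrative).
llm_result = {
    "SENDER_NAME": "Kenneth Lay",
    "SENDER_EMAIL": "kenneth.lay@enron.com",
    "RECIPIENT_NAMES": ["Alice", "Bob"],
    "RECIPIENT_EMAILS": ["alice@enron.com", "bob@enron.com"],
}

# No NER pass needed: zip the parallel arrays into recipient entries.
recipients = [
    {"name": n, "email": e, "type": "TO"}
    for n, e in zip(llm_result["RECIPIENT_NAMES"],
                    llm_result["RECIPIENT_EMAILS"])
]
```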
Building records
One function builds a record for template+NER emails, another for LLM emails. Both produce the same output format:
doc_id, sender_name, sender_email, sender_domain
recipients — a list of {name, email, type} entries
date_raw, date_parsed, subject, method
The template’s sent field provides the raw date string. NER captures it too, but the template value is more reliable for date parsing.
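One way to sketch the shared record builder — the function name is illustrative, and date parsing here uses the standard library’s RFC 2822 parser as a stand-in for whatever the notebook uses:

```python
from email.utils import parsedate_to_datetime

def build_record(doc_id, sender_name, sender_email, recipients,
                 date_raw, subject, method):
    """Assemble the record format shared by both pipeline branches."""
    try:
        date_parsed = parsedate_to_datetime(date_raw).isoformat()
    except (TypeError, ValueError):
        date_parsed = None  # corrupted or missing date stays null
    return {
        "doc_id": doc_id,
        "sender_name": sender_name,
        "sender_email": sender_email,
        "sender_domain": sender_email.split("@")[-1] if sender_email else None,
        "recipients": recipients,   # list of {name, email, type}
        "date_raw": date_raw,
        "date_parsed": date_parsed,
        "subject": subject,
        "method": method,           # e.g. "template+ner" or "llm"
    }

record = build_record(
    doc_id=42,
    sender_name="Kenneth Lay",
    sender_email="kenneth.lay@enron.com",
    recipients=[{"name": "Buy, Rick", "email": "rick.buy@enron.com", "type": "TO"}],
    date_raw="Mon, 14 May 2001 16:39:00 -0700",
    subject="Re: Board meeting",
    method="template+ner",
)
```

The `method` field records which branch produced each record, which makes the validation step easy to slice by source.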
Validation
Before writing CSV, check for systematic issues. In your notebook, run the validation cell to see null rates and the top sender domains.
Missing sender names come from automated system emails. Missing sender emails come from OCR-garbled addresses. Missing dates come from severely corrupted scans. These are real gaps in the data, not parser bugs.
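The validation cell’s two checks are a few lines each. A sketch over a tiny hand-made record list (the real cell runs over the full output):

```python
from collections import Counter

# Three illustrative records with deliberate gaps.
records = [
    {"sender_name": "Kenneth Lay", "sender_domain": "enron.com", "date_parsed": "2001-05-14"},
    {"sender_name": None,          "sender_domain": "enron.com", "date_parsed": "2001-06-01"},
    {"sender_name": "Rick Buy",    "sender_domain": "aol.com",   "date_parsed": None},
]

# Null rate per field: fraction of records missing that value.
null_rates = {
    field: sum(r[field] is None for r in records) / len(records)
    for field in ("sender_name", "sender_domain", "date_parsed")
}

# Most common sender domains.
top_domains = Counter(
    r["sender_domain"] for r in records if r["sender_domain"]
).most_common(5)
```

A spike in one null rate, or an unexpected domain at the top of the list, is the signal to go back and look for a systematic parsing issue before writing CSV.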
CSV output
Each node type and relationship type gets its own file:
emails.csv — one row per email (doc_id, subject, date, method)
users.csv — one row per unique name
mailboxes.csv — one row per unique email address
domains.csv — one row per unique domain
SENT.csv, RECEIVED.csv, CC_ON.csv — who sent/received each email
USED.csv — which users use which mailboxes
HAS_MAILBOX.csv — which domains own which mailboxes
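Each file is a plain header-plus-rows CSV, so one small writer covers all of them. A sketch using the standard library’s `csv.DictWriter`, with a toy one-email dataset (the filenames match the list above; everything else is illustrative):

```python
import csv

# Toy data: one email node and one SENT relationship.
emails = [{"doc_id": 42, "subject": "Re: Board meeting",
           "date": "2001-05-14", "method": "template+ner"}]
sent   = [{"mailbox": "kenneth.lay@enron.com", "doc_id": 42}]

def write_csv(path, rows, fieldnames):
    """Write one node or relationship file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

write_csv("emails.csv", emails, ["doc_id", "subject", "date", "method"])
write_csv("SENT.csv", sent, ["mailbox", "doc_id"])
```

The same helper is called once per node type and once per relationship type, changing only the rows and field names.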
Open these in a spreadsheet to inspect the data before the next step.
Check your understanding
NER in the pipeline
Why does the pipeline run NER on template-matched emails instead of using the template’s raw field strings directly?
❏ NER is faster than reading template output
❏ Templates don’t extract field values
✓ Template values are raw strings like "Buy, Rick, mark.e.haedicke@enron.com" — NER identifies individual people and emails within them
❏ NER corrects OCR errors that templates miss
Hint
Think about what a template’s To field looks like when there are multiple recipients with Last, First names mixed with bare email addresses.
Solution
Templates identify field boundaries but return raw strings. A To field with multiple recipients is one long string with commas that could be inside names (Buy, Rick) or between names. NER identifies individual entity spans with their roles — it knows Buy, Rick is one RECIPIENT, not two.
Entity pairing
The pair_entities function walks NER entities in document order. Why does this allow pairing names with emails?
❏ NER always returns names and emails in alphabetical order
❏ The function uses regex to match names to email addresses
✓ In the header text, each person’s name is immediately followed by their email address — NER preserves this document order
❏ The function queries a lookup table of known name-email pairs
Hint
Look at the header: Kenneth Lay <kenneth.lay@enron.com>. In what order does NER tag these spans?
Solution
In the email header, Name <email> pairs appear in sequence. NER tags them in document order: RECIPIENT followed by RECIPIENT_EMAIL. The pairing function tracks a "pending name" and attaches the next email to it. A bare email with no preceding name becomes an email-only entry.
Summary
Templates identify field boundaries for 98.5% of emails
NER identifies individual entities within those fields — each person’s name and email as a separate span
Entity pairing walks NER output in document order, matching each name with the email that follows it
The LLM handles the ~70 emails with OCR-garbled headers, returning individuals in arrays
One record format serves both sources — template+NER and LLM produce identical output
CSV output provides a checkpoint before cleaning and import