The hybrid pipeline

Combining templates, NER, and LLM

You’ve built four parsing tools: regex templates, a fine-tuned NER model, an LLM prompt, and a zero-shot extractor. This notebook combines three of them (templates, NER, and the LLM) into a pipeline that produces CSV files ready for cleaning and import.

Open 2.7_hybrid_pipeline.ipynb in your Codespace to follow along.

What you’ll learn

By the end of this lesson, you’ll be able to:

  • Run the template parser on the full corpus and collect failures

  • Use NER to identify individual entities within template-extracted fields

  • Pair NER-detected names with their email addresses using document order

  • Merge LLM results for template failures into the same record format

  • Write node and relationship CSV files for graph import

The pipeline

  1. Templates — identify which fields exist and where the header boundaries are. 98.5% of emails match.

  2. NER — run on every template-matched header to identify individual people and email addresses within the raw field strings.

  3. LLM — for the ~70 emails templates couldn’t handle (from notebook 2.6), where entities are already split into arrays.

  4. Build records — assemble standardised records from NER entities and LLM results.

  5. CSV output — node and relationship files ready for cleaning and import.

Step 1: template pass

Run every email through the template parser. Templates identify which fields exist and where the header ends — but the field values are raw strings that may contain multiple people, email addresses, and noise.

In your notebook, run the template pass cell. About 98.5% of emails match a template; the remaining ~70 fail and are set aside for the LLM step.
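As a rough illustration of the template pass, the sketch below runs one regex over a tiny corpus and separates matches from failures. The `TEMPLATE` pattern, the `template_pass` function, and the sample documents are assumptions made for this example; the notebook's real templates are not reproduced here.

```python
import re

# Hypothetical template: a header with From, Sent, To, and Subject lines.
# The notebook's actual templates may differ.
TEMPLATE = re.compile(
    r"From:\s*(?P<from>.+)\n"
    r"Sent:\s*(?P<sent>.+)\n"
    r"To:\s*(?P<to>.+)\n"
    r"Subject:\s*(?P<subject>.+)"
)


def template_pass(corpus):
    """Split a {doc_id: text} corpus into template matches and failures."""
    matched, failed = {}, []
    for doc_id, text in corpus.items():
        m = TEMPLATE.search(text)
        if m:
            matched[doc_id] = m.groupdict()  # raw, unsplit field strings
        else:
            failed.append(doc_id)  # set aside for the LLM step
    return matched, failed


# Invented sample documents, for illustration only.
corpus = {
    "doc-001": (
        "From: Kenneth Lay <kenneth.lay@enron.com>\n"
        "Sent: Monday, May 14, 2001 9:00 AM\n"
        "To: Buy, Rick, mark.e.haedicke@enron.com, Rieker, Paula\n"
        "Subject: Q2 review\n\nBody text"
    ),
    "doc-002": "Fr0m; K3nneth L@y\nS3nt ???\nBody text",  # garbled header
}

matched, failed = template_pass(corpus)
print(f"{len(matched) / len(corpus):.1%} matched, {len(failed)} failed")
```

Note that the `to` value for `doc-001` is still a single string. Splitting it into people and addresses is the job of the NER pass.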

Step 2: NER pass

The NER model runs on the header text of every template-matched email. It identifies individual entity spans — each person’s name, each email address, the date, and the subject — tagged with their role (SENDER, RECIPIENT, CC_RECIPIENT, etc.).

This is where multi-recipient strings like "Buy, Rick, mark.e.haedicke@enron.com, Rieker, Paula" get split into individual entities. The NER model understands that Buy, Rick is one person, not two.
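To see why a plain string split is not enough, the short sketch below splits the example field on commas. The `expected_entities` list is written by hand to show the shape of result the NER pass is meant to produce; it is not real model output.

```python
raw_to = "Buy, Rick, mark.e.haedicke@enron.com, Rieker, Paula"

# Naive approach: treat every comma as a separator between recipients.
naive = [part.strip() for part in raw_to.split(",")]
print(naive)  # five fragments, although the field names only three recipients

# Hand-written illustration of the entity spans NER should return
# (labels taken from this lesson; not actual model output).
expected_entities = [
    ("RECIPIENT", "Buy, Rick"),
    ("RECIPIENT_EMAIL", "mark.e.haedicke@enron.com"),
    ("RECIPIENT", "Rieker, Paula"),
]
```

The comma inside "Buy, Rick" looks identical to the comma between recipients, so the split cannot tell them apart without a model of what a name looks like.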

Pairing names with emails

NER entities come out in document order, so when a recipient has both a display name and an address, the RECIPIENT name is immediately followed by its RECIPIENT_EMAIL.

```text
NER output (document order)
RECIPIENT        'Steven J Kean'
RECIPIENT_EMAIL  'steven.kean@enron.com'
RECIPIENT        'Richard Shapiro'
RECIPIENT_EMAIL  'richard.shapiro@enron.com'
RECIPIENT_EMAIL  'susan.j.mara@enron.com'  // (1)
RECIPIENT        'Jeff Dasovich'
RECIPIENT_EMAIL  'jeff.dasovich@enron.com'
```

(1) A bare email with no preceding name: susan.j.mara had no display name in the original header.

The record builder walks this sequence and pairs each name with the email that follows it. Bare emails without a name become email-only entries.
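The walk described above can be sketched as a small function that tracks a pending name. This is a minimal version under stated assumptions: entities arrive as `(label, text)` tuples, and a name that is never followed by an email is kept as a name-only entry. The notebook's own `pair_entities` may handle these cases differently.

```python
def pair_entities(entities, name_label="RECIPIENT", email_label="RECIPIENT_EMAIL"):
    """Pair each name with the email that follows it in document order."""
    pairs = []
    pending_name = None
    for label, text in entities:
        if label == name_label:
            if pending_name is not None:
                # Assumption: a name with no following email is kept name-only.
                pairs.append({"name": pending_name, "email": None})
            pending_name = text
        elif label == email_label:
            # Attach the email to the pending name, or emit it bare.
            pairs.append({"name": pending_name, "email": text})
            pending_name = None
    if pending_name is not None:
        pairs.append({"name": pending_name, "email": None})
    return pairs


entities = [
    ("RECIPIENT", "Steven J Kean"),
    ("RECIPIENT_EMAIL", "steven.kean@enron.com"),
    ("RECIPIENT", "Richard Shapiro"),
    ("RECIPIENT_EMAIL", "richard.shapiro@enron.com"),
    ("RECIPIENT_EMAIL", "susan.j.mara@enron.com"),
    ("RECIPIENT", "Jeff Dasovich"),
    ("RECIPIENT_EMAIL", "jeff.dasovich@enron.com"),
]

for pair in pair_entities(entities):
    print(pair)
```

Because the function relies only on order, it needs no lookup table and no string matching between names and addresses.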

Step 3: LLM results

The ~70 template failures were sent to the LLM in notebook 2.6. The LLM already returns individuals in arrays (RECIPIENT_NAMES: ["Alice", "Bob"]), so no further entity detection is needed. Load the results and build records from the same flat format.
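One way to turn the LLM's arrays into recipient entries is sketched below. Only `RECIPIENT_NAMES` appears in this lesson; the other keys (`RECIPIENT_EMAILS`, `CC_NAMES`, `CC_EMAILS`), the `"to"`/`"cc"` type values, and the assumption that names and emails line up by index are all hypothetical and should be checked against the actual output from notebook 2.6.

```python
from itertools import zip_longest


def recipients_from_llm(result):
    """Turn parallel name/email arrays into a list of recipient entries."""
    recipients = []
    # Hypothetical key names, apart from RECIPIENT_NAMES.
    for kind, names_key, emails_key in [
        ("to", "RECIPIENT_NAMES", "RECIPIENT_EMAILS"),
        ("cc", "CC_NAMES", "CC_EMAILS"),
    ]:
        names = result.get(names_key) or []
        emails = result.get(emails_key) or []
        # Assumption: the two arrays are aligned by position.
        # zip_longest pads the shorter array with None.
        for name, email in zip_longest(names, emails):
            recipients.append({"name": name, "email": email, "type": kind})
    return recipients


llm_result = {
    "RECIPIENT_NAMES": ["Alice", "Bob"],
    "RECIPIENT_EMAILS": ["alice@example.com"],
}
print(recipients_from_llm(llm_result))
```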

Building records

One function builds a record for template+NER emails, another for LLM emails. Both produce the same output format:

  • doc_id, sender_name, sender_email, sender_domain

  • recipients — a list of {name, email, type} entries

  • date_raw, date_parsed, subject, method

The template’s sent field provides the raw date string. NER captures it too, but the template value is more reliable for date parsing.
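A minimal sketch of a record builder with the fields listed above follows. The function names, the date formats in `DATE_FORMATS`, and the `method` value are assumptions for illustration; the notebook's builders and the date formats found in the corpus may differ.

```python
from datetime import datetime

# Assumed date formats; the real corpus may need more.
DATE_FORMATS = ["%A, %B %d, %Y %I:%M %p", "%m/%d/%Y %I:%M %p"]


def parse_date(raw):
    """Try each known format; return an ISO 8601 string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).isoformat()
        except (ValueError, AttributeError):  # bad format, or raw is None
            continue
    return None


def build_record(doc_id, sender, recipients, date_raw, subject, method):
    """Assemble one standardised record from already-extracted pieces."""
    email = sender.get("email")
    domain = email.split("@", 1)[1].lower() if email and "@" in email else None
    return {
        "doc_id": doc_id,
        "sender_name": sender.get("name"),
        "sender_email": email,
        "sender_domain": domain,
        "recipients": recipients,  # list of {name, email, type} entries
        "date_raw": date_raw,
        "date_parsed": parse_date(date_raw),
        "subject": subject,
        "method": method,  # hypothetical values, e.g. "template+ner" or "llm"
    }


record = build_record(
    doc_id="doc-001",
    sender={"name": "Kenneth Lay", "email": "kenneth.lay@enron.com"},
    recipients=[{"name": "Buy, Rick", "email": None, "type": "to"}],
    date_raw="Monday, May 14, 2001 9:00 AM",
    subject="Q2 review",
    method="template+ner",
)
print(record["sender_domain"], record["date_parsed"])
```

Keeping `date_raw` alongside `date_parsed` means an unparseable date is still available for inspection during cleaning.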

Validation

Before writing CSV, check for systematic issues. In your notebook, run the validation cell to see null rates and the top sender domains.

Missing sender names come from automated system emails. Missing sender emails come from OCR-garbled addresses. Missing dates come from severely corrupted scans. These are real gaps in the data, not parser bugs.
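The kind of check the validation cell performs might look like the sketch below, which computes null rates per field and counts sender domains. The function names and sample records are invented for this example, and empty strings are treated as missing, which is an assumption.

```python
from collections import Counter


def null_rates(records, fields):
    """Fraction of records where each field is missing or empty."""
    total = len(records)
    if total == 0:
        return {}
    return {
        field: sum(1 for r in records if not r.get(field)) / total
        for field in fields
    }


def top_domains(records, n=5):
    """Most common sender domains, ignoring records without one."""
    counts = Counter(r["sender_domain"] for r in records if r.get("sender_domain"))
    return counts.most_common(n)


# Invented sample records, for illustration only.
records = [
    {"sender_name": "Kenneth Lay", "sender_email": "kenneth.lay@enron.com",
     "sender_domain": "enron.com", "date_parsed": "2001-05-14T09:00:00"},
    {"sender_name": None, "sender_email": "alerts@enron.com",
     "sender_domain": "enron.com", "date_parsed": "2001-05-15T10:30:00"},
    {"sender_name": "Alice", "sender_email": None,
     "sender_domain": None, "date_parsed": None},
    {"sender_name": "Bob", "sender_email": "bob@example.org",
     "sender_domain": "example.org", "date_parsed": "2001-06-01T08:00:00"},
]

print(null_rates(records, ["sender_name", "sender_email", "date_parsed"]))
print(top_domains(records))
```

A null rate that is much higher for one `method` than the other would point to a parser problem rather than a gap in the data, so it can be worth grouping by `method` as well.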

CSV output

Each node type and relationship type gets its own file:

  • emails.csv — one row per email (doc_id, subject, date, method)

  • users.csv — one row per unique name

  • mailboxes.csv — one row per unique email address

  • domains.csv — one row per unique domain

  • SENT.csv, RECEIVED.csv, CC_ON.csv — who sent/received each email

  • USED.csv — which users use which mailboxes

  • HAS_MAILBOX.csv — which domains own which mailboxes

Open these in a spreadsheet to inspect the data before the next step.
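For orientation, here is a minimal sketch of writing three of these files with Python's `csv` module. The column names in `mailboxes.csv` and `SENT.csv`, the `export_csv` function, and the sample records are assumptions; the notebook writes the full set of files and may use different headers.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, fieldnames, rows):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells


def export_csv(records, out_dir):
    """Write emails.csv, mailboxes.csv, and SENT.csv (a subset of the files)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    emails = [
        {"doc_id": r["doc_id"], "subject": r["subject"],
         "date": r["date_parsed"], "method": r["method"]}
        for r in records
    ]
    # dict.fromkeys deduplicates while keeping first-seen order.
    addresses = dict.fromkeys(
        r["sender_email"] for r in records if r["sender_email"]
    )
    mailboxes = [{"email": address} for address in addresses]
    # Hypothetical relationship columns: sender address and email id.
    sent = [
        {"email": r["sender_email"], "doc_id": r["doc_id"]}
        for r in records if r["sender_email"]
    ]

    write_csv(out / "emails.csv", ["doc_id", "subject", "date", "method"], emails)
    write_csv(out / "mailboxes.csv", ["email"], mailboxes)
    write_csv(out / "SENT.csv", ["email", "doc_id"], sent)


# Invented sample records, for illustration only.
records = [
    {"doc_id": "doc-001", "subject": "Q2 review", "date_parsed": "2001-05-14T09:00:00",
     "method": "template+ner", "sender_email": "kenneth.lay@enron.com"},
    {"doc_id": "doc-002", "subject": "Re: Q2 review", "date_parsed": None,
     "method": "llm", "sender_email": "kenneth.lay@enron.com"},
]

out_dir = Path(tempfile.mkdtemp())  # temporary directory for the demo
export_csv(records, out_dir)
print(sorted(p.name for p in out_dir.iterdir()))
```

Passing `newline=""` when opening the file follows the `csv` module's guidance and avoids blank rows on Windows.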

Check your understanding

NER in the pipeline

Why does the pipeline run NER on template-matched emails instead of using the template’s raw field strings directly?

  • ❏ NER is faster than reading template output

  • ❏ Templates don’t extract field values

  • ✓ Template values are raw strings like "Buy, Rick, mark.e.haedicke@enron.com" — NER identifies individual people and emails within them

  • ❏ NER corrects OCR errors that templates miss

Hint

Think about what a template’s To field looks like when there are multiple recipients with Last, First names mixed with bare email addresses.

Solution

Templates identify field boundaries but return raw strings. A To field with multiple recipients is one long string with commas that could be inside names (Buy, Rick) or between names. NER identifies individual entity spans with their roles — it knows Buy, Rick is one RECIPIENT, not two.

Entity pairing

The pair_entities function walks NER entities in document order. Why does this allow pairing names with emails?

  • ❏ NER always returns names and emails in alphabetical order

  • ❏ The function uses regex to match names to email addresses

  • ✓ In the header text, each person’s name is immediately followed by their email address — NER preserves this document order

  • ❏ The function queries a lookup table of known name-email pairs

Hint

Look at the header: Kenneth Lay <kenneth.lay@enron.com>. In what order does NER tag these spans?

Solution

In the email header, Name <email> pairs appear in sequence. NER tags them in document order: RECIPIENT followed by RECIPIENT_EMAIL. The pairing function tracks a "pending name" and attaches the next email to it. A bare email with no preceding name becomes an email-only entry.

Summary

  • Templates identify field boundaries for 98.5% of emails

  • NER identifies individual entities within those fields — each person’s name and email as a separate span

  • Entity pairing walks NER output in document order, matching each name with the email that follows it

  • The LLM handles the ~70 emails with OCR-garbled headers, returning individuals in arrays

  • One record format serves both sources — template+NER and LLM produce identical output

  • CSV output provides a checkpoint before cleaning and import

Next: Cleaning and normalization.

Companion notebook: 2.7_hybrid_pipeline.ipynb
