email.parser returned empty records the moment it touched PDF-extracted text — because the RFC structure the library expects is gone.
Open 2.4_rule_based_parsing.ipynb in your Codespace to follow along.
What you’ll learn
By the end of this lesson, you’ll be able to:
Use regex anchors and quantifiers to match email header patterns
Strip boilerplate from PDF-extracted text using regex
Build a greedy extractor and see where it silently fails
Encode role sequences as templates that match exactly or reject cleanly
Use an AI assistant to generate templates for edge cases
Run the template system on the full corpus and measure coverage
Regex in 60 seconds
A regular expression (regex) is a pattern string that matches text. You don’t need to memorize regex syntax — LLMs generate patterns on the fly, and you’ll use that approach in the next lesson. For now, five patterns are enough.
```python
import re

re.match(r'^From', 'From: Matthew')                      # match  # (1)
re.match(r'^From', ' From: ...')                         # None
re.match(r'^From:\s*$', 'From:')                         # match  # (2)
re.match(r'^From:\s*$', 'From: Matt')                    # None
m = re.match(r'^From:\s+(.+)', 'From: Matthew Lenhart')  # (3)
m.group(1)                                               # 'Matthew Lenhart'
m = re.search(r'\S+@\S+', 'vapte@yahoo.com @ ENRON')     # (4)
m.group(0)                                               # 'vapte@yahoo.com'
re.match(r'^from:', 'FROM:', re.IGNORECASE)              # match  # (5)
```
^ anchors the pattern to the start of the line. 'From: Matthew' starts with From so it matches, but ' From: …' starts with spaces so it does not.
\s*$ means "zero or more whitespace characters followed by the end of the line." If the line is just From: with nothing after it, the pattern matches. If there is a value after the colon, it does not. This is how you detect a bare label line.
(.+) is a capture group. The parentheses tell the engine to extract whatever the .+ matched, and you retrieve it with m.group(1). Here it captures 'Matthew Lenhart'.
\S+ matches one or more non-whitespace characters. The pattern \S+@\S+ grabs everything on both sides of the @ until it hits a space, which is useful for pulling email addresses out of noisy text.
By default, regex is case-sensitive — ^from: would not match 'FROM:'. Passing re.IGNORECASE makes the match case-insensitive, so from:, From:, and FROM: all match the same pattern.
The Enron corpus has boilerplate
Every PDF-extracted text file in our corpus starts with government processing stamps that must be removed before field extraction begins.
Real-world boilerplate may be worth keeping
We added those stamps for the purposes of this course. Many real document dumps contain similar boilerplate, however, and in those cases you may want to keep it.
Stripping boilerplate with regex
In your notebook, run the next cell to define the boilerplate patterns and strip them from a sample file.
```python
# Boilerplate removal
BOILERPLATE = [  # (1)
    r"^'?CONFIDENTIAL.*$",  # (2)
    r"^Enron\s*Corp.*$",
    r"^Case\s*No[,\.].*$",  # (3)
    r"^Doc\s*No[,\.].*$",
    r"^Date:?\s*\d",  # (4)
    r"^ENRON\s*CORP.*$",
    r"^SUBJECT TO PROTECTIVE.*$",
    r"^RELEASE IN.*$",
    r"^ENRON[-=]\d+.*$",  # (5)
]

def strip_boilerplate(text):  # (6)
    lines = text.split("\n")
    cleaned = [
        line for line in lines
        if not any(
            re.match(p, line.strip(), re.IGNORECASE)
            for p in BOILERPLATE
        )
    ]
    return "\n".join(cleaned).strip()
```
BOILERPLATE is a list of regex patterns, one per boilerplate line type. Each pattern describes the shape of a line that should be removed.
^'?CONFIDENTIAL.*$ matches lines starting with CONFIDENTIAL, with an optional leading apostrophe ('?) that OCR sometimes introduces. .*$ matches the rest of the line, whatever it contains.
[,\.] matches either a comma or a period. OCR sometimes reads Case No. as Case No, — this character class handles both.
Date:?\s*\d matches the boilerplate date stamp (e.g. Date: 01/15/2003). The :? makes the colon optional, and \d requires that a digit follows — this prevents matching the email’s Date: header, which would have a day name after the colon, not a digit.
[-=] matches either a hyphen or an equals sign. OCR sometimes reads the document separator ENRON-819… as ENRON=819….
strip_boilerplate splits the text into lines, tests each line against every pattern in the list, and keeps only the lines that match none of them. re.IGNORECASE makes all matches case-insensitive.
Before and after
Before stripping:

```text
CONFIDENTIAL
Enron Corp.
Case No. EC-2002-01038
Doc No. E0048ADF3
Date: 01/15/2003
ENRON CORP. - PRODUCED ...
RELEASE IN PART
From:
EDIS Email Service <edismail@incident.com>
Sent:
Fri, 5 Apr 2002 00:07:00 -0800 (PST)
To:
Motley, Matt <matt.motley@enron.com>
Subject:
[EDIS] EQ 4 2 SAN BERNARDINO COUNTY
Please check the following...
```
After stripping:

```text
From:
EDIS Email Service <edismail@incident.com>
Sent:
Fri, 5 Apr 2002 00:07:00 -0800 (PST)
To:
Motley, Matt <matt.motley@enron.com>
Subject:
[EDIS] EQ 4 2 SAN BERNARDINO COUNTY
Please check the following...
```
The email header is now at a predictable position — no scanning for "where does the real content start?"
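To sanity-check the stripper outside the notebook, you can run the same logic on an inline sample. This sketch re-implements the function with just two of the patterns above so it stands alone:

```python
import re

# Subset of the BOILERPLATE list above, enough for a standalone demo
PATTERNS = [r"^'?CONFIDENTIAL.*$", r"^Case\s*No[,\.].*$"]

def strip_sample(text):
    """Drop any line matching a boilerplate pattern, then trim."""
    kept = [
        line for line in text.split("\n")
        if not any(re.match(p, line.strip(), re.IGNORECASE) for p in PATTERNS)
    ]
    return "\n".join(kept).strip()

sample = "CONFIDENTIAL\nCase No. EC-2002-01038\nFrom:\nEDIS Email Service"
print(strip_sample(sample))  # only the two header lines survive
```

The full strip_boilerplate in the notebook is identical in shape; it just checks all nine patterns.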
Retain boilerplate for real document dumps
If you were running this on real government documents, you would likely retain at least some of this boilerplate.
Header structure and field variations
In your notebook, run the next cell to inspect the header structure and count field combinations across the corpus.
After stripping boilerplate, the email headers follow a consistent alternating pattern: each label sits on its own line (From:, Sent:, To:, Subject:), with the value on the next line.
The variation is in which fields are present — most emails have From/Sent/To/Subject, some also include Cc:, and some have empty fields or missing Subject:.
Field combinations
In your notebook, the output shows the top field combinations across the corpus.
About 9% of emails include Cc recipients before the Subject line
Another 6% have Cc appearing after Subject — a variation in the PDF layout
A few have no Subject: line at all; the email body starts immediately after To:
The pattern is always alternating; the variation is in which fields appear and in what order.
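Counting those combinations is a one-liner with collections.Counter. This is a simplified sketch (the field_combo helper is illustrative, and it ignores label-like lines that may appear in the body):

```python
import re
from collections import Counter

LABEL_RE = re.compile(r'^(From|Sent|To|Cc|Subject):\s*$', re.IGNORECASE)

def field_combo(text):
    """Ordered tuple of bare header labels found in a stripped email."""
    return tuple(
        m.group(1).lower()
        for line in text.split("\n")
        if (m := LABEL_RE.match(line.strip()))
    )

# Two toy emails: one without Cc, one with
texts = [
    "From:\nA\nSent:\nB\nTo:\nC\nSubject:\nD",
    "From:\nA\nSent:\nB\nTo:\nC\nCc:\nE\nSubject:\nD",
]
print(Counter(field_combo(t) for t in texts).most_common())
```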
Greedy extraction
The alternating pattern is predictable enough to write a simple extractor: walk the lines top to bottom, and whenever you see a label you recognize (From:, Sent:, etc.), record the next non-empty line as that field’s value. Everything between one label and the next gets absorbed into the current field.
This approach is greedy — it always produces a result. It never says "I don’t know" or "this doesn’t look right." It just keeps absorbing lines until the next recognized label appears.
The greedy extractor in code
In your notebook, run the next cell to define the extractor and test it on three files.
```python
# Simple alternating extractor
def parse_alternating(text):
    fields = {}
    lines = text.split("\n")
    current_label = None
    in_body = False
    for line in lines:
        stripped = line.strip()
        if in_body:
            break
        m = re.match(
            r'^(From|Sent|To|Cc|Subject):\s*$',
            stripped, re.IGNORECASE
        )
        if m:
            current_label = m.group(1).lower()  # (1)
            fields[current_label] = ""
        elif current_label and stripped:
            fields[current_label] += (  # (2)
                (" " if fields[current_label] else "") + stripped
            )
            if current_label == "subject":
                in_body = True  # (3)
    return fields
```
A label-only line (e.g. From:) sets current_label
The next non-empty line is the value — it accumulates until the next label
After capturing the subject value, everything that follows is body text
On 50 digital emails, the extractor finds from, sent, to, and subject on nearly all of them.
Where it produces wrong data
In your notebook, run the next cell to see what happens when OCR corrupts a label. The greedy extractor silently absorbs the corrupted text into the wrong field — and there is no indication that anything went wrong.
OCR-corrupted header (Ce: for Cc:):

```text
From:
Hitchcock, Dorie <dorie.hitchcock@enron.com>
Sent:
Fri, 20 Jul 2001 11:13:22 -0700 (PDT)
To:
Kitchen, Louise <louise.kitchen@enron.com>,
Lavorato, John <john.lavorato@enron.com>
Ce: // (1)
Schoppe, Tammie <tammie.schoppe@enron.com>
Subject:
EA EVENT MASTER SCHEDULE
```
Ce: is OCR’s reading of Cc:. The greedy extractor doesn’t recognize it as a label, so it appends Ce: and the Cc recipients to the To field. The Cc field is missing entirely.
The problem isn’t that the extractor crashes — it’s that it returns a result that looks plausible but is wrong.
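The failure is easy to reproduce in isolation: the label regex used by parse_alternating simply doesn't match the OCR variant, so the corrupted line falls through to the value branch and gets appended to the current field.

```python
import re

# The label pattern the greedy extractor uses
LABEL_RE = re.compile(r'^(From|Sent|To|Cc|Subject):\s*$', re.IGNORECASE)

print(bool(LABEL_RE.match("Cc:")))  # True  -- recognized as a label
print(bool(LABEL_RE.match("Ce:")))  # False -- not a label, so the line
                                    #          is absorbed as a value
```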
So what can we rely on?
The greedy extractor fails because it’s permissive — it always produces a result, even when that result is wrong. To build something stricter, we first need to understand what’s actually consistent in the data.
In your notebook, run the next cell to see role sequences side by side. Every email with the same field combination produces the identical role sequence — not approximately the same, but identical.
```python
# Finding consistent patterns
HEADER_LABELS = {"from", "sent", "to", "cc", "subject", "attachments"}

def classify_line(line):
    """Assign a role to a single (non-empty) line."""
    stripped = line.strip()
    # Label-only line
    if re.match(
        r'^(From|Sent|To|Cc|Subject|Attachments?):\s*$',
        stripped, re.IGNORECASE
    ):
        label = re.match(r'^(\w+):', stripped).group(1).lower()
        return ("LABEL", label)
    # Same-line (label + value)
    m = re.match(
        r'^(From|Sent|To|Cc|Subject|Attachments?):\s+(\S.+)',
        stripped, re.IGNORECASE
    )
    if m:
        label = m.group(1).lower()
        return ("LABEL_VALUE", label)
    return ("VALUE", None)

def role_sequence(text, max_lines=12):
    """Return the sequence of roles for the first few non-empty header lines."""
    seq = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        role, label = classify_line(line)
        tag = f"{role}({label})" if label else role
        seq.append(tag)
        if len(seq) >= max_lines:
            break
        if label == "subject" and role == "LABEL_VALUE":
            break
    return seq
```
A different field combination, a different sequence
An email with Cc: produces a different role sequence — LABEL(cc) → VALUE is inserted between To and Subject. A template built for no-Cc will correctly reject this, and vice versa.
You can use a template to encode the exact role sequence of one field combination — refusing to extract from anything that deviates even slightly.
Rejection in action
In your notebook, run the next cell to see rejection in action. A no-Cc email is tried against the Cc template (rejected), then the correct no-Cc template (succeeds). A Cc email gets the opposite treatment.
Template rejection:

```text
# Email 1 — no Cc field
alt_FStCX (Cc template)    → None
alt_FStX  (no-Cc template) → {from, sent, to, subject}

# Email 2 — with Cc field
alt_FStX  (no-Cc template) → None
alt_FStCX (Cc template)    → {from, sent, to, cc, subject}
```
Templates encode repeating sequences
In your notebook, run the next two cells to define templates and the generic matcher.
```python
# Template definitions
from enum import Enum, auto

class Role(Enum):
    LABEL = auto()        # bare 'From:'
    VALUE = auto()        # data line
    LABEL_VALUE = auto()  # 'From: Matthew'
    BODY = auto()         # body starts here

L, V, LV, B = Role.LABEL, Role.VALUE, Role.LABEL_VALUE, Role.BODY

TEMPLATE_ALT_FSTX = {  # (1)
    "name": "alt_FStX",
    "structure": [
        (L, "from"), (V, "from"),
        (L, "sent"), (V, "sent"),
        (L, "to"), (V, "to"),
        (L, "subject"), (V, "subject"),
        (B, None),
    ],
}

TEMPLATE_ALT_FSTCX = {  # (2)
    "name": "alt_FStCX",
    "structure": [
        (L, "from"), (V, "from"),
        (L, "sent"), (V, "sent"),
        (L, "to"), (V, "to"),
        (L, "cc"), (V, "cc"),
        (L, "subject"), (V, "subject"),
        (B, None),
    ],
}
```
alt_FStX: alternating From/Sent/To/Subject — the most common field combination (~72% of parsed emails)
alt_FStCX: alternating From/Sent/To/Cc/Subject — the second most common (~25%)
A template is just data — the matching engine is a single generic function that walks any template against any text.
The generic matcher
```python
# Template matcher (core logic)
def match_template(template, text):
    lines = text.split("\n")
    start = _find_from_line(lines)
    if start is None:
        return None  # (1)
    extracted = {}
    i = start
    for role, field in template["structure"]:
        # Skip blank lines between structural steps
        while i < len(lines) and not lines[i].strip():
            i += 1
        if i >= len(lines):
            return None
        if role == Role.LABEL:
            m = _LABEL_RE.match(lines[i].strip())
            if not m or m.group(1).lower() != field:
                return None  # (2)
            i += 1
        elif role == Role.VALUE:
            if _ANY_LABEL_RE.match(lines[i].strip()):
                return None  # (3)
            extracted[field] = lines[i].strip()
            i += 1
        elif role == Role.BODY:
            extracted["_body_start_idx"] = i
            break
    extracted["_template"] = template["name"]
    return extracted
```
No From: line found in the first 30 lines — return None immediately, don’t guess
Wrong label or label has an inline value where this step expected a bare label — fail the whole template
A label line appeared where a value was expected — the structure doesn’t match, fail immediately
Every failure is explicit and immediate — the matcher never produces a partial result.
Three starter templates
helpers/enron_templates.py ships with three templates — the most common field combinations. They were built by the process you just saw: classify lines, encode the sequence, test it.
The file also contains additional template definitions (commented examples for reference), but only the three most common are active. In the next lesson, you’ll build your own templates to cover the ~70 edge cases these three miss.
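The exact internals of helpers/enron_templates.py aren't reproduced here, but the dispatch it performs, trying each active template in order and returning the first success, can be sketched generically (first_match is an illustrative name, not the module's actual API):

```python
def first_match(text, templates, matcher):
    """Try each template in order; return the first successful extraction,
    or None if every template rejects the text."""
    for template in templates:
        result = matcher(template, text)
        if result is not None:
            return result
    return None

# Stand-in matchers for demonstration (hypothetical):
reject = lambda tpl, txt: None
accept = lambda tpl, txt: {"_template": tpl["name"]}

print(first_match("...", [{"name": "a"}, {"name": "b"}], reject))  # None
```

Because each template either matches fully or returns None, the order only matters for speed: put the most common template first.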
Running the system
In your notebook, run the next cells to import and run the system on the full corpus.
```python
# Importing and running
from helpers.enron_templates import extract_enron_headers

parsed = []
failures = []
for f in txt_files:
    text = f.read_text(encoding="utf-8")
    result = extract_enron_headers(text)  # (1)
    if result is not None:
        parsed.append({
            "doc_id": f.stem,
            "from": result.get("from", ""),
            "sent": result.get("sent", ""),
            "to": result.get("to", ""),
            "subject": result.get("subject", ""),
            "_template": result["_template"],
        })
    else:
        failures.append(f.stem)
```
extract_enron_headers strips boilerplate internally, then tries each template in order — the caller sees a single function call
At ~2,500 emails/second, the full 5,000-file corpus runs in under two seconds.
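To measure throughput on your own machine, a small timing helper is enough. This is a generic sketch (emails_per_second is an illustrative name; in the notebook you would pass extract_enron_headers as parse_fn):

```python
import time

def emails_per_second(parse_fn, texts):
    """Time parse_fn over a list of email texts and return the rate."""
    t0 = time.perf_counter()
    for t in texts:
        parse_fn(t)
    elapsed = time.perf_counter() - t0
    return len(texts) / elapsed if elapsed > 0 else float("inf")
```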
Coverage: ~98.6%
In your notebook, run the next cells to see coverage on the full corpus.
Three templates, ~98.6% coverage. The ~70 failures are emails with layouts these templates reject: missing Subject lines, empty Cc fields, unusual field orderings, or severely garbled headers.
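Coverage is just the parsed fraction of the corpus. A small sketch, assuming the parsed and failures lists built in the previous cell (coverage_report is an illustrative helper):

```python
from collections import Counter

def coverage_report(parsed, failures):
    """Return (coverage fraction, per-template match counts)."""
    total = len(parsed) + len(failures)
    coverage = len(parsed) / total if total else 0.0
    by_template = Counter(p["_template"] for p in parsed)
    return coverage, by_template

# Toy example (not the real corpus numbers):
parsed = [{"_template": "alt_FStX"}] * 3 + [{"_template": "alt_FStCX"}]
cov, by_tpl = coverage_report(parsed, failures=["doc_9"])
print(f"{cov:.1%}", dict(by_tpl))  # 80.0% {'alt_FStX': 3, 'alt_FStCX': 1}
```

The per-template breakdown is worth printing too: if one template never fires, it is either redundant or buggy.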
Next steps
70 edge cases from a dataset of thousands. In the next lesson, you’ll work with an LLM to build additional templates that cover these failures — then rerun the same test to measure the improvement.
Build templates from your own data
The three templates here are built for the Enron corpus’s specific field combinations. For your own data, the methodology is the same but the templates will be different. Start by classifying lines in 20-30 sample documents to identify your most common role sequences, then encode each as a template. You’ll likely find that 2-4 templates cover the vast majority of your corpus too — the long tail of edge cases is where the next lesson’s LLM-assisted approach helps most.
Check your understanding
Greedy extraction vs templates
The greedy extractor silently absorbed Ce: (OCR for Cc:) into the To field. How does the template approach handle this differently?
❏ Templates also absorb Ce: into To, but flag it as low confidence
❏ Templates correct Ce: to Cc: automatically using spell checking
✓ The template for emails without Cc rejects the file because Ce: appears where Subject: is expected; the Cc template matches because it recognizes Ce: as an OCR variant of Cc:
❏ Templates skip any line they don’t recognize
Hint
Templates encode exact role sequences. What happens when the sequence doesn’t match — does the template guess, or does it refuse?
Solution
Templates are rejection tools. The no-Cc template (alt_FStX) expects Subject: after To: but finds Ce: — it returns None. The Cc template (alt_FStCX) expects a Cc label after To: and recognizes Ce: as an OCR variant — it matches correctly. The greedy extractor had no mechanism to reject, so it silently produced wrong data.
Template coverage
The three starter templates cover ~98.6% of the corpus. What happens to the remaining ~70 emails?
❏ They are discarded from the dataset
❏ The template parser guesses the closest match
✓ They are collected as failures for the LLM to handle in a later notebook
❏ The parser retries with different boilerplate stripping
Hint
Templates return None when no template matches. What does the pipeline do with None results?
Solution
When no template matches, the email is added to the failures list. These failures are sent to an LLM in notebook 2.6 via the Batch API. The hybrid pipeline in notebook 2.7 merges both template and LLM results into a single dataset.
Summary
Regex anchors (^, $) and quantifiers (\s*, \S+) are the building blocks for header matching in PDF-extracted text
Boilerplate is stripped with a list of regex patterns before field extraction begins
The corpus follows a consistent alternating layout (label on one line, value on the next) with variation in which fields are present
A greedy extractor works well on clean files but silently produces wrong data when OCR corrupts labels (Ce: for Cc:, Toa: for To:)
Emails with the same field combination share an exact role sequence — that precision is the foundation for templates
A template encodes the exact role sequence of one field combination and refuses to extract from anything that deviates — no partial matches, no guessing
Building a template: read the stripped text, assign a role to each line, write the sequence as (Role, field) tuples
Three starter templates cover ~98.6% of the corpus — the ~70 failures are edge cases for the next lessons
extract_enron_headers on ~5,000 emails runs at ~2,500 emails/second
Next: You’ll build your own templates by working with an LLM to classify email lines and generate template definitions.