Module 1 gave you 4,911 plain text files — one per email PDF. Before writing a single line of parsing code, it helps to understand what’s inside them. Every document corpus has its own structure, noise, and quirks.
Open 2.1_whats_in_your_documents.ipynb and follow along — you’ll investigate the data hands-on as we go.
What you’ll learn
By the end of this lesson, you’ll be able to:
Identify the structural layers in an extracted email document
Distinguish between metadata, content, and noise
Recognise patterns that a parser can exploit
Assess the quality of your extracted text before committing to a parsing strategy
Four layers
Every extracted email contains up to four layers. Learning to see them is the first step to parsing. In your notebook, run the first few cells to load the text files and inspect a few.
text
Extracted text from E0058489E.pdf
CONFIDENTIAL // (1)
Enron Corp.
Case No. EC-2002-01038
Doc No. E0058489E
ENRON CORP. - PRODUCED PURSUANT TO FERC SUBPOENA.
RELEASE IN PART
From: // (2)
Kay Mann <kay.mann@enron.com>
Sent:
Thu, 31 May 2001 02:37:00 -0700 (PDT)
To:
Kathleen Carnahan <kathleen.carnahan@enron.com>
Subject:
RE: important stuff
Thanks, but I'm throwing away // (3)
junk all weekend.
-----Original Message----- // (4)
From: Mann, Kay
Subject: important stuff
So where are you kayaking
this weekend?
Boilerplate — classification markings, case numbers, document IDs. Repeats on every page.
Email headers — structured fields: who sent it, when, to whom, about what.
Body text — the actual message content. Unstructured, variable length.
Embedded chain — a forwarded or replied-to message with its own headers and body.
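If you want to follow along outside the notebook, loading the corpus can be sketched like this. A minimal sketch; `data/extracted_text` is a placeholder for wherever Module 1 wrote the `.txt` files:

```python
from pathlib import Path

def load_corpus(corpus_dir: Path) -> dict[str, str]:
    """Read every .txt file into a {doc_id: text} mapping, keyed by filename stem."""
    return {
        p.stem: p.read_text(encoding="utf-8")
        for p in sorted(corpus_dir.glob("*.txt"))
    }

# Hypothetical location of the extracted text files -- adjust to your setup.
docs = load_corpus(Path("data/extracted_text"))
```

With the whole corpus in a dict, the layer-by-layer inspection below is just string work.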
Layer 1: boilerplate
Boilerplate repeats across documents but carries no per-email meaning. In your notebook, run section 3 to see how consistently these patterns appear across the corpus.
text
Boilerplate from E0048ADF3.pdf
CONFIDENTIAL // (1)
Enron Corp. // (2)
Case No. EC-2002-01038 // (3)
Doc No. E0048ADF3 // (4)
Date: 01/15/2003
ENRON CORP. - PRODUCED PURSUANT TO FERC SUBPOENA. // (5)
SUBJECT TO PROTECTIVE ORDER.
CONFIDENTIAL TREATMENT REQUESTED.
RELEASE IN FULL // (6)
Classification marking
Producing entity
Case identifier
Document number
Production stamp
Release status (FULL, PART, or withheld)
Consider whether your boilerplate is valuable
In our demo dataset, these boilerplate elements are synthetic and carry no per-email signal. In your own dataset, however, elements such as the release status or case identifier might constitute valuable information, so decide what to keep before stripping anything.
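A naive filter for this boilerplate could look like the following sketch. The patterns are tuned to the samples above and are illustrative, not exhaustive:

```python
import re

# Patterns matching the boilerplate layer shown above. These are assumptions
# based on this demo corpus -- your dataset's stamps will differ.
BOILERPLATE_PATTERNS = [
    re.compile(r"^CONFIDENTIAL\b"),
    re.compile(r"^Enron Corp\.$"),
    re.compile(r"^Case No\. EC-\d{4}-\d+"),
    re.compile(r"^Doc No\. E[0-9A-F]+"),
    re.compile(r"^ENRON CORP\. - PRODUCED PURSUANT TO FERC SUBPOENA\."),
    re.compile(r"^SUBJECT TO PROTECTIVE ORDER\."),
    re.compile(r"^CONFIDENTIAL TREATMENT REQUESTED\."),
    re.compile(r"^RELEASE IN (FULL|PART)\b"),
]

def strip_boilerplate(text: str) -> str:
    """Drop lines that match any known boilerplate pattern."""
    kept = [
        line for line in text.splitlines()
        if not any(p.match(line.strip()) for p in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept)
```

An explicit allow-or-drop list like this is easy to audit: every stripped pattern is visible, which matters if you later decide some boilerplate (say, the release status) is worth keeping.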
Layer 2: email headers
The structured metadata — who, when, and what. These are your primary parsing targets for a metadata graph.
text
Headers from E0048ADF3.pdf
From: // (1)
Nemec, Gerald <gerald.nemec@enron.com>
Sent: // (2)
Wed, 17 Oct 2001 13:11:22 -0700 (PDT)
To: // (3)
Lucci, Paul T. <t..lucci@enron.com>
Cc: // (4)
Tycholiz, Barry <barry.tycholiz@enron.com>,
Williams, Jason R <.williams@enron.com>
Subject: // (5)
Kennedy Amendment
From — the sender
Sent — date and time
To — primary recipients
Cc — copied recipients
Subject — the topic
Every field you extract becomes a node or property in your graph.
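The label-on-its-own-line layout suggests a simple state-machine parser. Below is a sketch under the assumption that each label sits alone on a line and the body starts right after the Subject value:

```python
# Header labels observed in this corpus (an assumption; extend as needed).
HEADER_LABELS = {"From:", "Sent:", "To:", "Cc:", "Subject:"}

def parse_headers(text: str) -> dict[str, str]:
    """Collect the value lines that follow each 'Label:' line.

    Wrapped values (e.g. long recipient lists) are joined with spaces.
    """
    headers: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in HEADER_LABELS:
            current = stripped.rstrip(":")
            headers[current] = ""
        elif current is not None:
            headers[current] = (headers[current] + " " + stripped).strip()
            if current == "Subject":
                break  # body starts after the subject value
    return headers
```

Stopping at the Subject value keeps the state machine from misreading chain headers (`From: Mann, Kay`) deep in the body, since those put the label and value on the same line.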
Finding metadata elsewhere
Not all metadata lives in the document text. PDF files have their own metadata fields — title, author, creation date — that may contain useful information.
In real document dumps, PDF metadata often includes the OCR software used, the scan date, and sometimes the document ID. Our synthetic PDFs have minimal metadata, but it’s always worth checking. In your notebook, run section 3b to inspect the PDF metadata.
Header patterns
In your notebook, run section 4 to extract the field combinations from every file. Each email has a From: label on its own line, followed by the sender on the next line, then Sent:, To:, optionally Cc:, and Subject:. Long recipient lists wrap across multiple lines.
The output shows the field combinations found in the corpus and how many recipients each To and Cc field contains. About 72% have From/Sent/To/Subject. About 28% also include Cc. A handful are missing Subject or To — edge cases the parser will need to handle.
The parsing complexity lives in the To and Cc fields: splitting multi-recipient lists that wrap across lines into individual names and email addresses.
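One way to handle those lists is to join the wrapped lines first and then match `Name <email>` pairs, since the names themselves contain commas (`Lucci, Paul T.`) and a plain comma split would shred them. A hedged sketch:

```python
import re

# One "Name <email>" pair. Splitting on the angle brackets rather than on
# commas survives "Last, First" names.
RECIPIENT_RE = re.compile(r"([^<>]*?)\s*<([^<>]+)>")

def split_recipients(field: str) -> list[tuple[str, str]]:
    """Split a possibly line-wrapped To/Cc field into (name, email) pairs."""
    joined = " ".join(line.strip() for line in field.splitlines())
    return [
        (name.strip(" ,"), email.strip())
        for name, email in RECIPIENT_RE.findall(joined)
    ]
```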
Field frequency
Not every email has all header fields. In your notebook, run section 5 to check which fields appear most often across the corpus — this tells you what you can reliably extract and what might be missing.
Layer 3: body text
The unstructured content — where entities, topics, and intent are buried. In your notebook, run section 6 to extract and examine body text across the corpus.
text
Body from E0000813E.pdf
Subject:
Remarks to EES Employees--December 1
[REDACTED] B6 // (1)
EES employee meeting next Friday, and
I put together the attached as a
potential framework. It provides an
Enron problem list of 1985/86 versus
today and a brief look at our
different visions. // (2)
If I can do something else, I'll be
back Monday morning and will be happy
to do so.
- Rob // (3)
Redaction marker — withheld content
The body text itself — unstructured, variable length
Sign-off
Bear in mind that many of these emails contain nested threads. For now, we are naively defining the 'body' as anything below the subject line.
In course 2, you’ll work on decomposing these threads and resolving duplicate thread components across the graph.
Layer 4: embedded chains
Many email PDFs contain entire conversation threads — replies and forwards nested within the body. In your notebook, run section 7 to count how many emails contain chains.
text
Chain from E0058489E.pdf
From: Kay Mann // (1)
Subject: RE: important stuff
Thanks, but I'm throwing away
junk all weekend. // (2)
-----Original Message----- // (3)
From: Mann, Kay // (4)
Sent: Thursday, May 31, 2001
Subject: important stuff
So where are you kayaking
this weekend? // (5)
Outer email’s headers
Outer email’s body
Chain marker
Embedded email’s headers
Embedded email’s body
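Counting chains can be as simple as matching the separator line. The pattern below assumes the `-----Original Message-----` style shown here; real dumps use other variants too (forwarded-message separators, quoted `>` blocks), so treat it as a starting point:

```python
import re

# Marker that introduces an embedded message in this corpus (an assumption;
# extend the pattern for other chain styles you find).
CHAIN_MARKER_RE = re.compile(r"^-+\s*Original Message\s*-+\s*$", re.MULTILINE)

def count_embedded_messages(text: str) -> int:
    """Count how many embedded replies/forwards a document contains."""
    return len(CHAIN_MARKER_RE.findall(text))
```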
Redacted content
Some emails have content that was redacted before release — names, amounts, or other details replaced with markers. In your notebook, run section 8b to check how many files contain redaction markers.
text
Redactions from E00CF8AE9.pdf
From:
Amr Ibrahim <amr.ibrahim@enron.com>
Sent:
Thu, 5 Jul 2001 01:02:00 -0700 (PDT)
To:
Richard Shapiro <richard.shapiro@enron.com>,
Lisa Yoho <lisa.yoho@enron.com>,
[REDACTED] Bé6 // (1)
Subject:
Fireball Coal Project - RAC Meeting June 29
While working in her group to
[REDACTED] B5 // (2)
measure to reflect the risk...
Redacted recipient — name withheld before release
Redaction in body text — details removed
Redactions are meaningful signals. A [REDACTED] marker tells you that a relationship existed even though the details were withheld.
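A sketch for collecting redaction markers and their exemption codes. The `[REDACTED] B<digit>` shape is an assumption based on the samples above; OCR-garbled variants like `Bé6` would need fuzzier matching:

```python
import re

# "[REDACTED] B6"-style markers; the trailing code names the withholding
# exemption and is sometimes absent.
REDACTION_RE = re.compile(r"\[REDACTED\]\s*(B\d)?")

def find_redactions(text: str) -> list[str]:
    """Return the exemption code for each redaction marker ('' if absent)."""
    return [code or "" for code in REDACTION_RE.findall(text)]
```

Keeping one entry per marker, rather than a bare count, preserves the signal: a redacted recipient (B6) and a redacted deliberative passage (B5) tell you different things.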
OCR noise in headers
About 3% of files have noticeable OCR errors from lower-quality scans. In your notebook, run section 9 to find these files and see the patterns.
The rest of the corpus (about 97%) has clean text. The parser needs to handle these variants, but they follow predictable patterns.
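Because the errors are predictable, a lookup table of known confusions goes a long way. A minimal sketch using the two patterns named in this lesson; a real table would grow as section 9 surfaces more variants:

```python
# Known OCR confusions in this corpus -- illustrative, not exhaustive.
OCR_FIXES = {
    "Bate:": "Date:",
    "ENRON CORE.": "ENRON CORP.",
}

def fix_ocr_noise(text: str) -> str:
    """Apply simple substitution fixes for known OCR error patterns."""
    for wrong, right in OCR_FIXES.items():
        text = text.replace(wrong, right)
    return text
```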
What you’ve found
You’ve now investigated the raw data and confirmed the challenges ahead: different field combinations, multi-line recipient lists, embedded chains, and OCR noise on a small subset. Every parsing decision in the next lessons flows from what you’ve just seen.
Check your understanding
Document layers
You open one of the extracted text files and see government stamps at the top, email headers in the middle, body text below, and a forwarded email chain at the bottom. Which of these layers contains information that varies per email?
❏ The government stamps (CONFIDENTIAL, Case No, Doc No)
✓ The email headers (From, Sent, To, Subject)
❏ The boilerplate stamps (the same content repeats on every file)
✓ The body text
Hint
Think about which layers carry information specific to one email vs information that is the same across every document in the dataset.
Solution
The government stamps (CONFIDENTIAL, Case No, Doc No, ENRON CORP) are boilerplate — the same on every document. The email headers and body text are unique to each email. Both carry per-email information, but the headers are structured (parseable fields) while the body is unstructured.
Field combinations
About 72% of emails in the dataset have the same header field combination. What is it?
❏ From / To / Subject
❏ From / Sent / To / Cc / Subject
✓ From / Sent / To / Subject
❏ From / Sent / To / Cc / Bcc / Subject
Hint
Most emails don’t have Cc recipients. What’s the simplest complete set of header fields?
Solution
From/Sent/To/Subject accounts for about 72% of the corpus. About 28% also include Cc. Only a handful are missing fields like Subject or To.
Summary
Investigate your documents before writing parsing code — what you find shapes every decision that follows
Email PDFs contain four layers: boilerplate, headers, body, and embedded chains
About 72% have From/Sent/To/Subject; about 28% also have Cc; a few are missing fields
To and Cc fields can contain many recipients across multiple lines — this is where parsing complexity lives
Redacted content ([REDACTED] markers) should be tracked as entities — they signal that information was withheld
About 3% of files have OCR noise in headers (Bate: for Date:, ENRON CORE. for ENRON CORP.) — predictable patterns the parser can handle
Understanding the structure drives your parsing strategy and modeling choices
Next: We’ll look at the different types of graph you can build from these documents — and how each answers different questions.