The CSV files from the pipeline contain raw extracted values. Kay Mann and kay mann would become two separate User nodes. kay.mann@enron.com and Kay.Mann@Enron.com would become two Mailbox nodes. OCR-garbled domains like enror.corm would never match enron.com.
Before importing, we need normalized values to merge on. The raw values stay — normalized values sit alongside them in _norm columns.
Open 2.8_cleaning_and_normalization.ipynb in your Codespace to follow along.
What you’ll learn
By the end of this lesson, you’ll be able to:
Inspect raw extracted data for systematic quality issues
Apply a standard NLP normalization pipeline: unicode decomposition, artifact stripping, lowercasing
Build a corpus-specific OCR domain correction table
Relocate misplaced email addresses rather than discarding them
Write normalized CSV files ready for Neo4j import
The approach
Create _norm columns alongside the raw values. Import merges on the normalized form; the raw value is preserved as a property for traceability.
This approach works for this dataset — few people share identical names among the few thousand who worked at Enron. For a larger or more ambiguous dataset (census data, for example), you would merge on the most unique identifier or constellation of identifiers.
If you accidentally merge two different people with the same name (you won’t), you can detach them later using the raw CSV as ground truth.
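The pattern can be sketched in a few lines of pandas, assuming a DataFrame with a raw name column (column and function names here are illustrative, not the lesson's exact code):

```python
import unicodedata

import pandas as pd

def normalize_name(raw: str) -> str:
    # NFKD decomposition splits accented characters into base + combining marks
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))  # strip accents
    # Lowercase and collapse runs of whitespace
    return " ".join(s.lower().split())

users = pd.DataFrame({"name": ["Kay Mann", "kay mann", "ALAN YUDKOWSKY"]})
users["name_norm"] = users["name"].map(normalize_name)
# The raw "name" column survives unchanged; import merges on "name_norm".
```

After this, `Kay Mann` and `kay mann` share the same `name_norm` value, so a `MERGE` on that column produces one node while both raw spellings remain available as properties.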
Inspecting the raw data
In your notebook, run the first section to load the CSV files and inspect the noise. The common issues:
Email addresses in name fields — the NER model occasionally captured an email as a name span
Semicolons — artifacts from the template’s multi-line value join
Angle brackets and quotes — fragments from OCR-garbled Name <email> pairs
Inconsistent casing — ALAN YUDKOWSKY, Kay Mann, kay mann
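Each of these issues can be surfaced with a quick filter in the notebook. A sketch, assuming the names live in a pandas Series (the sample values echo the patterns above):

```python
import pandas as pd

names = pd.Series([
    "Kay Mann",
    "KAY MANN",
    "kay.mann@enron.com",   # email captured as a name span
    "alan yudkowsky;",      # semicolon artifact from the multi-line join
    'Tana Jones <',         # fragment of an OCR-garbled Name <email> pair
])

print(names[names.str.contains("@")])            # emails in the name field
print(names[names.str.contains(r'[;<>"]')])      # join and OCR artifacts
lowered = names.str.lower()
print(names[lowered.duplicated(keep=False)])     # casing collisions
```

Running filters like these before normalizing tells you which cleanup steps your data actually needs, rather than applying a generic pipeline blind.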
The OCR domain correction table
The correction table maps OCR-garbled domains back to their real forms. It was built by listing all domains with three or fewer mailboxes, filtering for patterns that are obviously corrupted, and verifying each against the mailbox local parts.
For your own data, inspect the rare domains the same way — sort by frequency, and the OCR variants will cluster at the bottom.
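The frequency-sort workflow looks like this in outline (the domain list and the fix mappings are illustrative stand-ins for the real corpus counts):

```python
from collections import Counter

# Toy domain list; real counts come from the extracted mailbox CSV
domains = ["enron.com"] * 500 + ["aol.com"] * 40 + ["enror.corm", "enrom.com"]
counts = Counter(domains)

# OCR variants cluster at the bottom of the frequency sort
rare = sorted(d for d, n in counts.items() if n <= 3)

# After manual inspection, map each variant to its correct form
DOMAIN_FIXES = {"enror.corm": "enron.com", "enrom.com": "enron.com"}

def fix_domain(domain: str) -> str:
    # Unknown domains pass through untouched
    return DOMAIN_FIXES.get(domain, domain)
```

The lookup table is deliberately explicit rather than fuzzy: every correction was verified by a human, so a false merge can't slip in through an overeager edit-distance match.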
Build your own correction table
The OCR correction table and artifact patterns in this lesson are built from the Enron corpus. Your data will have its own systematic errors. To build your own correction table, sort extracted domains (or whatever your key identifiers are) by frequency — OCR variants and typos cluster at the bottom with very low counts. Inspect those manually, map each to its correct form, and add them to your lookup table. The same principle applies to any normalization: inspect your data’s actual noise patterns rather than guessing.
Relocating misplaced emails
When the name field contains an email address and the address field is empty, the normalization step moves the email to the address field rather than discarding it.
This preserves the entity reference — without relocation, those mailboxes would be lost entirely from the graph.
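The relocation rule is a conditional move, not a drop. A minimal sketch, assuming `name` and `email` columns (field names are illustrative):

```python
import pandas as pd

rows = pd.DataFrame({
    "name":  ["Kay Mann", "kay.mann@enron.com"],
    "email": ["kay.mann@enron.com", ""],
})

# Name field holds an email AND the address field is empty: relocate it
misplaced = rows["name"].str.contains("@") & (rows["email"] == "")
rows.loc[misplaced, "email"] = rows.loc[misplaced, "name"]
rows.loc[misplaced, "name"] = ""
```

Both conditions matter: moving the value only when the address field is empty avoids clobbering a row that already has a good address.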
In your notebook, run the apply, verify, and write sections to produce the final CSV files.
Check your understanding
Why normalize?
What happens if you import Kay Mann and kay mann without normalization?
❏ Neo4j automatically merges them into one node
✓ They become two separate User nodes — queries that should match one person return two
❏ The import fails with a constraint violation
❏ The second one overwrites the first
Hint
Think about what MERGE (u:User {name_norm: …}) does when two different strings are passed.
Solution
Without normalization, MERGE sees Kay Mann and kay mann as different values and creates two nodes. The _norm column lowercases both to kay mann, so MERGE on name_norm creates one node. The raw value is preserved as a property for traceability.
Leading dot preservation
The normalization pipeline preserves leading dots in email addresses like .williams@enron.com rather than stripping them. Why?
❏ Leading dots are valid in email addresses per RFC 5321
❏ Stripping the dot would create an invalid address
✓ .williams@enron.com is a truncated firstname.williams@enron.com — stripping the dot could falsely merge it with a different person’s williams@enron.com
❏ The OCR correction table doesn’t handle leading dots
Hint
Think about what happens if two different people have addresses t..williams@enron.com and williams@enron.com. What would stripping the dot do?
Solution
The leading dot indicates OCR truncation of the local part — t..williams became .williams. Stripping it to williams@enron.com could merge this mailbox with a completely different person named Williams. Keeping the dot preserves the distinction. Entity resolution in a later course can figure out whether they’re the same person.
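The distinction can be expressed as a guard in the email normalizer. A sketch of the idea, assuming a simple normalizer (the lesson's actual implementation may differ):

```python
def normalize_email(raw: str):
    """Lowercase and trim, but keep a leading dot in the local part:
    it marks OCR truncation, not stray punctuation to strip."""
    s = raw.strip().lower()
    if "@" not in s:
        return None  # reject garbage values that are not addresses
    return s

print(normalize_email(".Williams@Enron.com"))  # .williams@enron.com
```

Because the dot survives, `.williams@enron.com` and `williams@enron.com` stay distinct mailboxes, and the decision about whether they belong to the same person is deferred to entity resolution.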
Summary
Raw extracted values aren’t safe to merge on — casing, OCR noise, and artifacts create false duplicates
The normalization pipeline applies unicode decomposition, artifact stripping, lowercasing, and whitespace collapsing to names
Email normalization rejects garbage values and applies an OCR domain correction table built from corpus inspection
Leading dots in emails are preserved — they indicate truncation, not noise
Misplaced emails are relocated from name fields to address fields rather than discarded
Raw values are preserved in the CSV alongside _norm columns — import merges on normalized, stores both