In the course overview, you saw the graph you’re building — Domain, Mailbox, User, Email. In the previous lesson, you found layers of content inside your text files. The question now is: why is the graph shaped the way it is, and how does that shape determine what you need to parse?
What You’ll Learn
By the end of this lesson, you’ll understand:
Three types of graph you can build from email data
What questions each type can answer — and what it can’t
How your target graph type determines what you need to parse
Why the model you choose shapes every parsing decision ahead
The Metadata Graph
A metadata graph might capture the structured fields found in email headers: who sent what to whom, and when.
An entity graph captures people, organizations, locations, and topics mentioned in the email body — not just the header fields.
Questions it answers:
What topics connect different people?
Which organizations are mentioned most frequently?
What entities appear together across multiple emails?
The Hybrid Graph
If you continue with this course, bear in mind that we will be constructing a hybrid of both graphs. This will allow us to:
Resolve mentioned entities to users and mailboxes
Use successful resolutions in the metadata layer to resolve downstream entities
Cluster topics by email networks, and vice versa
Cluster redactions in the entity and metadata layers by the email networks they come from
You have options
Think of this course as a jumping off point. You can choose to continue with the model we demonstrate here — but you can also follow your own model.
Sometimes, the data you’re working with, and the questions you want to answer, suggest a different kind of model. That is perfectly fine.
For example, most use-cases will not require you to decompose email threads. We’re showing you some strategies to achieve that, just in case you need it.
Now let’s take a look at how you can develop the most appropriate model for your data.
Start With the Questions
Your target graph type determines what you need to extract — which determines how you parse.
Model 1: Just Mailboxes
The simplest model: one node per mailbox, everything stored as properties.
Even a flat mailbox model is a graph. Linking mailboxes with a simple set of :EMAILED relationships gives you a monopartite network of communication patterns. You could use:
PageRank → to find the most referenced or replied-to mailboxes in the corpus
Betweenness centrality → to identify bridging mailboxes that connect otherwise separate components
Community detection → To cluster communication groups
Model 1: Where It Stops
With everything flattened into single :Mailbox nodes, we lose the semantics contained within each email. We also lose the individual threads within emails.
"Which groups talk about the same kinds of things?" → have to look up elsewhere
"Who is mentioned most across the corpus?" → lost to us
"When did X become aware of Y?" → not answerable without the body content
Model 2: Mailbox + Email
Separating the mailbox into its own node turns email addresses into first-class entities.
With mailboxes as nodes, the graph becomes a communication network. Project mailbox-to-mailbox and you have a social graph. You could use:
Community detection → to discover teams and working groups from communication patterns
PageRank on mailboxes → to find the most central communicators
Shortest path → to trace how information flows between any two addresses
Domain-level analysis → to reveal which organizations communicate most, and through whom
Model 2: Where It Stops
A mailbox is not a person. Matthew Lenhart might use matthew.lenhart@enron.com at work and matt.lenhart@gmail.com at home. In this model, those are two unconnected nodes.
"How many emails did Matthew send in total?" → you’d need to know all his addresses and sum them manually
"Who are the most prolific senders?" → you get mailboxes, not people
"Who bridges two departments?" → you might count the same person twice
If your questions are about people rather than addresses, you need another layer.
Similarly, enron.comms@enron.com may be used by multiple individuals from the same marketing team. That team might change over many years. Separating Mailbox from :User allows you to track how many distinct people use the same ':Mailbox'.
Model 3: User + Mailbox + Email
Adding a User node links multiple mailboxes to the same individual.
With users as first-class nodes, every metric collapses across all of a person’s addresses. Centrality, community membership, and communication volume are now about individuals, not inboxes. You could perform:
Entity resolution → to match multiple addresses to the same person — or vice versa
Cross-domain analysis → to see a person’s communication across enron.com, gmail.com, and yahoo.com in one view
Cross-layer entity resolution → to link User nodes to entities extracted from body text
Choosing Your Depth
Each model adds a layer of separation — and a new class of analysis it unlocks.
Mailbox only → community detection, anomaly detection, centrality, node embeddings
User + Mailbox + Email → true social network, entity resolution, cross-domain analysis
The deeper your model, the more you need to parse, resolve, and maintain. If you only want to find the most 'influential' mailboxes in the network, you could happily skip the more detailed extraction layers.
This Module’s Target
Now you know what you’re parsing toward: a metadata graph with User, Mailbox, Domain, and Email as separate nodes. Every parsing decision in the lessons ahead — which fields to extract, how to handle OCR noise, when to use an LLM — serves this model. By the end of this module, you’ll import the parsed records and see the graph come to life.
Check your understanding
Why separate User and Mailbox?
The graph model uses separate User and Mailbox nodes rather than storing the email address directly on the User. Why?
❏ Email addresses are too long to store as node properties
❏ Neo4j can’t index string properties on User nodes
✓ A person may use multiple email addresses, and the same address may appear with different display names
❏ Mailbox nodes are required by the LOAD CSV import process
Hint
Think about what happens when the same person sends from two different email addresses, or when two different display names appear with the same address.
Solution
Separating User and Mailbox allows the graph to represent one person using multiple email addresses (connected by USED relationships). It also handles cases where the same address appears with different display names. This separation is what enables the "all emails from one person across mailboxes" query in the final lesson.
Relationship direction
In the metadata graph model, which direction does the SENT relationship go?
✓ From User/Mailbox to Email — the sender points to the email they sent
❏ From Email to User/Mailbox — the email points to who sent it
❏ Both directions — SENT is bidirectional
❏ Direction doesn’t matter in Neo4j
Hint
Think about how you’d read the pattern aloud: "Kay Mann SENT this email" or "this email was SENT BY Kay Mann." Which reads more naturally as a directed relationship?
Solution
SENT goes from the sender (User or Mailbox) to the Email node. This reads naturally: (u:User)-[:SENT]→(e:Email). The same convention applies to RECEIVED and CC_ON — they point from the recipient toward the email.
Summary
A metadata graph captures who, when, and what from header fields
An entity graph captures people, places, and topics from body text
A hybrid graph combines both — built across all three courses
Your questions determine your model: if you’ll query on a value, make it a node
The metadata graph is this module’s target — every parsing decision ahead serves it
Next: We’ll look at the different approaches you can use to parse email headers.