In this course, you’ll extract structured communication metadata from documents and build an entity network in Neo4j.
You’ll extract text, parse it into structured records, and import the results — producing a graph of people, mailboxes, domains, and emails ready for entity extraction in the next course.
The Three-Course Series
This course is the first of three:
Entity Communication Networks (this course): Extract and structure communication metadata into a graph
Entity Extraction: Thread decomposition, chunking, and entity extraction
The Codespace takes approximately 10 minutes to configure. While it sets up, continue through the next slides.
You will also need an AuraDB Free instance. Create one at Neo4j Aura if you don’t have one already.
The End-state graph
By the end of all three courses, your Neo4j instance will contain a metadata graph that captures who sent what to whom, and through which domain. Every extraction and parsing decision you make in this course serves this model.
Metadata graph
By the end of this course, you will have created the precursor to that end-state graph: a metadata graph composed of email metadata and coarse thread relationships.
Graph Model
The metadata graph separates people from their email addresses, and email addresses from their domains — each separation unlocks a different class of query.
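As a rough illustration of the mailbox/domain separation, the sketch below derives a domain from each address and groups mailboxes by domain. The helper name `split_mailbox` and the sample addresses are hypothetical and not taken from the course dataset; the course's own parsing code may handle this differently.

```python
from collections import defaultdict


def split_mailbox(address: str) -> tuple[str, str]:
    """Return (normalized address, domain) for a simple 'local@domain' string."""
    normalized = address.strip().lower()
    local, sep, domain = normalized.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a simple email address: {address!r}")
    return normalized, domain


# Hypothetical sample addresses, for illustration only.
addresses = ["Alice@Example.com", "sales@example.com", "bob@partner.org"]

# Group mailboxes under the domain they belong to.
mailboxes_by_domain: dict[str, list[str]] = defaultdict(list)
for raw in addresses:
    mailbox, domain = split_mailbox(raw)
    mailboxes_by_domain[domain].append(mailbox)
```

Keeping the domain as its own value is what makes a question like "which mailboxes belong to this organization?" a direct lookup rather than a string search over every address.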
You may wonder why we are using both (:User) and (:Mailbox) nodes.
In a company, several users may share one mailbox, and one user may use several mailboxes. A user and a mailbox can look interchangeable, but they are fundamentally different entities, so the graph models them as separate nodes.
The impact of this will become clearer as you continue the course.
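To make the many-to-many point concrete, here is a small in-memory sketch. The names, addresses, and the `links` list are hypothetical; each pair simply stands for one link between a user and a mailbox, without assuming how the course stores or directs that relationship in Neo4j.

```python
from collections import defaultdict

# Hypothetical (user, mailbox) links, for illustration only.
links = [
    ("Alice", "alice@example.com"),
    ("Alice", "sales@example.com"),  # Alice also uses a shared mailbox
    ("Bob", "sales@example.com"),    # Bob uses the same shared mailbox
]

mailboxes_of = defaultdict(set)  # user -> mailboxes they use
users_of = defaultdict(set)      # mailbox -> users who use it
for user, mailbox in links:
    mailboxes_of[user].add(mailbox)
    users_of[mailbox].add(user)

# Mailboxes shared by more than one user, and users with more than one mailbox.
shared = sorted(m for m, users in users_of.items() if len(users) > 1)
multi = sorted(u for u, boxes in mailboxes_of.items() if len(boxes) > 1)
```

If users and mailboxes were collapsed into a single node type, one of these two views would be lost: either the shared mailbox would look like one person, or one person would look like several.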
Check your understanding
Course goal
What is the end result of this course?
❏ A trained LLM that can read any PDF
❏ A collection of extracted text files
✓ A metadata graph in Neo4j with Email, User, Mailbox, and Domain nodes
❏ A spreadsheet of parsed email records
Hint
Think about where the data ends up, not just the intermediate steps.
Solution
The course builds a complete pipeline from raw PDFs to a Neo4j graph. The graph contains Email nodes (with subject, date, body), User nodes (senders and recipients), Mailbox nodes (email addresses), and Domain nodes — connected by SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships.
Summary
You’re building a communication network from raw documents — Domain, Mailbox, User, and Email nodes
This is part one of a three-course series
Every extraction and parsing decision serves the graph model
The dataset has real-world OCR and formatting challenges
Your Codespace should be setting up in the background