Course Overview

What We’re Building

In this course, you’ll extract structured communication metadata from documents and build an entity network in Neo4j.

You’ll extract text, parse it into structured records, and import the results — producing a graph of people, mailboxes, domains, and emails ready for entity extraction in the next course.
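The middle step, turning extracted text into a structured record, can be sketched as follows. This is a minimal illustration, not the course's actual parser: it assumes the extracted text already looks like clean `From:` / `To:` / `Subject:` headers followed by a body, and the `EmailRecord` class and `parse_record` function are hypothetical names introduced here. Real OCR output will usually need cleanup before a parser like this works.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import getaddresses
from typing import List


@dataclass
class EmailRecord:
    """Hypothetical structured record for one email (not the course's schema)."""
    sender: str
    recipients: List[str]
    subject: str
    body: str


def parse_record(text: str) -> EmailRecord:
    """Parse header-style text into a record using the standard library."""
    message = message_from_string(text)
    # getaddresses splits a comma-separated header into (name, address) pairs.
    recipients = [addr for _, addr in getaddresses([message.get("To", "")]) if addr]
    return EmailRecord(
        sender=message.get("From", ""),
        recipients=recipients,
        subject=message.get("Subject", ""),
        body=message.get_payload().strip(),
    )


# Assumed sample text; the real corpus is noisier than this.
sample = "\n".join([
    "From: Jane Doe <jane@example.com>",
    "To: Bob Roe <bob@example.org>, carol@example.org",
    "Subject: Q3 report",
    "",
    "Please see the attached figures.",
])
record = parse_record(sample)
print(record)
```

A record like this is the kind of intermediate result that an import step could later write to the graph.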

The Three-Course Series

This course is the first of three:

  1. Entity Communication Networks (this course): Extract and structure communication metadata into a graph

  2. Entity Extraction: Thread decomposition, chunking, and entity extraction

  3. Entity Resolution: Deduplicate and resolve entities

Each course can be taken independently. Courses 2 and 3 provide starter graphs if you haven’t completed the prior course.

The Dataset

We’ll work with a collection of email PDFs — real-world documents with the typical challenges of scanned and digitized correspondence:

  • OCR artifacts — misread characters, merged words, corrupted email addresses

  • Inconsistent formatting — different email clients produce different layouts

  • Embedded chains — forwarded messages and reply threads nested within a single document

  • Boilerplate — headers, footers, and classification markings mixed with content

An example email from the corpus.
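To make the first challenge concrete, here is a small sketch of how a corrupted email address might be normalized before it is used. It is an illustration under stated assumptions, not the course's cleaning logic: `clean_address` and `ADDRESS_RE` are hypothetical names, and the rules (drop stray whitespace, trim surrounding punctuation, lower-case, then check the shape) cover only a few simple OCR artifacts.

```python
import re
from typing import Optional

# Deliberately simple shape check (an assumption, not a full RFC 5322 validator).
ADDRESS_RE = re.compile(r"^[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$")


def clean_address(raw: str) -> Optional[str]:
    """Return a normalized address, or None if the text is not address-like."""
    candidate = re.sub(r"\s+", "", raw)              # OCR often inserts spaces
    candidate = candidate.strip("<>()[]\"',;:.")     # trim surrounding punctuation
    candidate = candidate.lower()
    return candidate if ADDRESS_RE.match(candidate) else None


print(clean_address(" Jane.Doe @ Example .com>"))
print(clean_address("jane.doeexample.com"))
```

Returning `None` for anything that does not look like an address keeps bad values out of the graph and leaves the decision about repair to a later step.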

Bring your own dataset

If you would prefer to use your own dataset, feel free to do so. Throughout the course, you will receive extra tips and guidance on how to work with your own data.

Setting Up Your Environment

Click the button below to open the workshop repository in a GitHub Codespace.

Open in GitHub Codespace

The Codespace takes approximately 10 minutes to configure. While it sets up, continue through the next slides.

You will also need an AuraDB Free instance. Create one at Neo4j Aura if you don’t have one already.

The End-State Graph

By the end of all three courses, your Neo4j instance will contain a metadata graph that captures who sent what to whom, and through which domain. Every extraction and parsing decision you make in this course serves this model.

A simplified version of the final data model

Metadata graph

By the end of this course, you will have built the precursor to that end-state graph: a metadata graph composed of email metadata and coarse thread relationships.

A graph of Users

Graph Model

The metadata graph separates people from their email addresses, and email addresses from their domains — each separation unlocks a different class of query.

Graph data model (Cypher patterns):
(:User)-[:SENT]->(:Email)
(:Mailbox)-[:SENT]->(:Email)
(:User)-[:USED]->(:Mailbox)
(:Mailbox)-[:RECEIVED]->(:Email)
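As a rough illustration of how one parsed email maps onto these four patterns, the sketch below turns a sender and a recipient list into (start, type, end) triples. The function name `relationships`, the email identifier, and the choice to identify a User by display name are all assumptions made for this example. The course's import step may work differently.

```python
from email.utils import parseaddr


def relationships(email_id, sender, recipients):
    """Return (start, type, end) triples for one email, following the patterns above."""
    sender_name, sender_address = parseaddr(sender)
    email = ("Email", email_id)
    triples = [
        (("User", sender_name), "SENT", email),
        (("Mailbox", sender_address), "SENT", email),
        (("User", sender_name), "USED", ("Mailbox", sender_address)),
    ]
    for recipient in recipients:
        _, address = parseaddr(recipient)
        triples.append((("Mailbox", address), "RECEIVED", email))
    return triples


triples = relationships(
    "email-001",                      # hypothetical identifier
    "Jane Doe <jane@example.com>",    # assumed sample sender
    ["bob@example.org"],
)
for start, rel_type, end in triples:
    print(start, rel_type, end)
```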

You may wonder why we are using both (:User) and (:Mailbox) nodes.

In a company, several users may share one mailbox, and one user may use several mailboxes. Although users and mailboxes are often treated as the same thing, they are fundamentally different entities, so we model them separately.

The impact of this will become clearer as you continue the course.
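The many-to-many relationship between users and mailboxes can be sketched in a few lines. The names and addresses below are invented for illustration, and `index_used` is a hypothetical helper. The point is that keeping the two entities separate makes both shared mailboxes and users with several mailboxes easy to find.

```python
from collections import defaultdict

# Hypothetical (user, mailbox) pairs, one per USED relationship.
used = [
    ("Alice", "alice@example.com"),
    ("Alice", "support@example.com"),
    ("Bob", "support@example.com"),
]


def index_used(pairs):
    """Index USED pairs in both directions."""
    mailboxes_by_user = defaultdict(set)
    users_by_mailbox = defaultdict(set)
    for user, mailbox in pairs:
        mailboxes_by_user[user].add(mailbox)
        users_by_mailbox[mailbox].add(user)
    return mailboxes_by_user, users_by_mailbox


mailboxes_by_user, users_by_mailbox = index_used(used)

# Mailboxes used by more than one person, and people using more than one mailbox.
shared_mailboxes = {m for m, users in users_by_mailbox.items() if len(users) > 1}
multi_mailbox_users = {u for u, boxes in mailboxes_by_user.items() if len(boxes) > 1}
print(shared_mailboxes, multi_mailbox_users)
```

If User and Mailbox were merged into a single node type, neither of these questions could be asked without re-splitting the data first.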

Check your understanding

Course goal

What is the end result of this course?

  • ❏ A trained LLM that can read any PDF

  • ❏ A collection of extracted text files

  • ✓ A metadata graph in Neo4j with Email, User, Mailbox, and Domain nodes

  • ❏ A spreadsheet of parsed email records

Hint

Think about where the data ends up, not just the intermediate steps.

Solution

The course builds a complete pipeline from raw PDFs to a Neo4j graph. The graph contains Email nodes (with subject, date, body), User nodes (senders and recipients), Mailbox nodes (email addresses), and Domain nodes — connected by SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships.

Summary

  • You’re building a communication network from raw documents — Domain, Mailbox, User, and Email nodes

  • This is part one of a three-course series

  • Every extraction and parsing decision serves the graph model

  • The dataset has real-world OCR and formatting challenges

  • Your Codespace should be setting up in the background
