Build a Graph

In the previous task, you used the Neo4jVector class to create Chunk nodes in the graph. Using Neo4jVector is an efficient and easy way to get started.

To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.

In this lesson, you will create a graph of the course content using the Neo4j Python driver and the OpenAI API.

Data Model

The data model you will create is a simplified version of the course content model you saw earlier.

Data model showing Course, Module, Lesson, and Paragraph nodes and the relationships between them

The graph will contain the following nodes, properties, and relationships:

  • Course, Module, and Lesson nodes with a name property

  • A url property on Lesson nodes will hold the GraphAcademy URL for the lesson

  • Paragraph nodes will have text and embedding properties

  • The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes

You can extract the name properties and url metadata from the directory structure of the lesson files. For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:

courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc

You can extract the following metadata from the path:

  • Course.name - llm-fundamentals

  • Module.name - 1-introduction

  • Lesson.name - 1-neo4j-and-genai

  • Lesson.url - graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
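
As an illustration, here is a minimal sketch of how splitting such a path yields these values. The index positions assume the directory layout above; the build script you will write below does the same thing with os.path.sep.

python
Path metadata sketch
import os

# Hypothetical lesson path following the directory layout above
path = os.path.join(
    "courses", "llm-fundamentals", "modules", "1-introduction",
    "lessons", "1-neo4j-and-genai", "lesson.adoc"
)

parts = path.split(os.path.sep)

course = parts[-6]   # 'llm-fundamentals'
module = parts[-4]   # '1-introduction'
lesson = parts[-2]   # '1-neo4j-and-genai'

url = f"https://graphacademy.neo4j.com/courses/{course}/{module}/{lesson}"
print(url)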

Extracting the data

Open the 1-knowledge-graphs-vectors\build_graph.py file in your code editor.

This starter code loads and chunks the course content.

python
Load and chunk the content
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

# Create a function to get the embedding

# Create a function to get the course data

# Create OpenAI object

# Connect to Neo4j

# Create a function to run the Cypher query

# Iterate through the chunks and create the graph

# Close the neo4j driver

For each chunk, you have to create an embedding of the text and extract the metadata.

Create a function to create and return an embedding using the OpenAI API:

python
Create embeddings
def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

Create a second function, which will extract the metadata and text from the chunk:

python
Get course data
def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

The get_course_data function:

  1. Splits the document source path to extract the course, module, and lesson names

  2. Constructs the url using the extracted names

  3. Extracts the text from the chunk

  4. Creates an embedding using the get_embedding function

  5. Returns a dictionary containing the extracted data

Create the graph

To create the graph, you will need to:

  1. Create an OpenAI object to generate the embeddings

  2. Connect to the Neo4j database

  3. Iterate through the chunks

  4. Extract the course data from each chunk

  5. Create the nodes and relationships in the graph

Create the OpenAI object:

python
llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

Connect to the Neo4j sandbox:

python
driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

Test the connection

You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
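
If you want a quick check before building the graph, here is a minimal sketch, assuming the llm and driver objects created above and the get_embedding function (the test string is arbitrary):

python
Connection check
# Quick sanity check: generate one embedding and verify the Neo4j connection
test_embedding = get_embedding(llm, "Test the OpenAI connection")
print(f"Embedding length: {len(test_embedding)}")  # text-embedding-ada-002 returns 1536 dimensions

driver.verify_connectivity()
print("Connected to Neo4j")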

To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it in a transaction.

python
Create chunk function
def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
        )

The create_chunk function will accept the data dictionary created by the get_course_data function.

You should be able to identify the $course, $module, $lesson, $url, $text, and $embedding parameters in the Cypher statement.

Iterate through the chunks and execute the create_chunk function:

python
for chunk in chunks:
    with driver.session(database="neo4j") as session:
        
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

A new session is created for each chunk. The execute_write method calls the create_chunk function, passing the data dictionary created by the get_course_data function.
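
Opening a new session per chunk keeps the example simple. If you prefer, you could open a single session and reuse it for every chunk; a minimal sketch of that variation:

python
Reuse one session
# Variation: reuse one session for all chunks instead of opening one per chunk
with driver.session(database="neo4j") as session:
    for chunk in chunks:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )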

Finally, close the driver.

python
driver.close()

Click to view the complete code

python
Complete code
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
        )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database="neo4j") as session:
        
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()

Run the code to create the graph. It will take a minute or two to complete as it creates the embeddings for each paragraph.

Explore the graph

View the graph by running the following Cypher:

cypher
MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *
Result of the Cypher query showing the course content graph

You will need to create a vector index to query the paragraph embeddings.

cypher
Create Vector Index
CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}}

You can use the vector index and the graph to find a lesson to help with specific questions. Remember to replace the sk-... token with your OpenAI API key:

cypher
Find a lesson
WITH genai.vector.encode(
    "How does RAG help ground an LLM?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
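
Alternatively, if you would rather not paste your API key into a Cypher statement, you could run an equivalent query from Python, reusing the get_embedding function and an open driver from the script above; a minimal sketch:

python
Find a lesson from Python
# Embed the question with OpenAI, then query the vector index through the driver
question_embedding = get_embedding(llm, "How does RAG help ground an LLM?")

records, summary, keys = driver.execute_query("""
    CALL db.index.vector.queryNodes('paragraphs', 6, $embedding)
    YIELD node, score
    MATCH (l:Lesson)-[:CONTAINS]->(node)
    RETURN l.name AS lesson, l.url AS url, score
    """,
    embedding=question_embedding,
    database_="neo4j",
)

for record in records:
    print(record["lesson"], record["url"], record["score"])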

Explore the graph and see how the relationships between the nodes can bring additional meaning to the unstructured data.

Continue

When you are ready, you can move on to the next task.

Summary

You created a graph of the course content using the Neo4j Python driver and the OpenAI API.