Create a graph

In the two previous lessons, you used the LangChain Neo4jVector and Neo4jGraph classes to create nodes in the graph. Using Neo4jVector and Neo4jGraph is an efficient and easy way to get started.

To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.

In this lesson, you will create a graph of the course content using the Neo4j Python driver and the OpenAI API.

Data Model

The data model you will create is a simplified version of the course content model you saw earlier in this module.

The graph will contain the following nodes, properties, and relationships:

Course, Module, and Lesson nodes with a name property
A url property on Lesson nodes will hold the GraphAcademy URL for the lesson
Paragraph nodes will have text and embedding properties
The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes

You can extract the name properties and url metadata from the directory structure of the lesson files. For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:

courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc

You can extract the following metadata from the path:

Course.name - llm-fundamentals
Module.name - 1-introduction
Lesson.name - 1-neo4j-and-genai
Lesson.url - https://graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
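
As a rough illustration of this mapping, the sketch below pulls the same values out of the example path using Python's pathlib. The helper name metadata_from_path is hypothetical and not part of the course code; PureWindowsPath is used only because the example path is written with backslashes.

```python
from pathlib import PureWindowsPath

def metadata_from_path(path):
    # parts: ('courses', course, 'modules', module, 'lessons', lesson, 'lesson.adoc')
    parts = PureWindowsPath(path).parts
    course, module, lesson = parts[-6], parts[-4], parts[-2]
    return {
        "course": course,
        "module": module,
        "lesson": lesson,
        "url": f"https://graphacademy.neo4j.com/courses/{course}/{module}/{lesson}",
    }

example = r"courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc"
print(metadata_from_path(example))
```

Counting from the end of the path keeps the lookup independent of wherever the courses directory sits on disk.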

Extracting the data

Open the llm-vectors-unstructured\build_graph.py file in your code editor.

This starter code loads and chunks the course content.

python

Load and chunk the content

import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

# Create a function to get the embedding

# Create a function to get the course data

# Create OpenAI object

# Connect to Neo4j

# Create a function to run the Cypher query

# Iterate through the chunks and create the graph

# Close the neo4j driver

For each chunk, you have to create an embedding of the text and extract the metadata.

Create a function to create and return an embedding using the OpenAI API:

python

Create embeddings

def get_embedding(llm, text):
    response = llm.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
    return response.data[0].embedding
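
The embedding returned by get_embedding is a list of floats. Embeddings are typically compared with a similarity measure such as cosine similarity, which is the function the vector index later in this lesson is configured to use. The following is a minimal pure-Python sketch using toy two-dimensional vectors; cosine_similarity is a hypothetical helper and not part of the course code.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```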

Create a second function, which will extract the data from the chunk:

python

Get course data

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

The get_course_data function:

Splits the document source path to extract the course, module, and lesson names
Constructs the url using the extracted names
Extracts the text from the chunk
Creates an embedding using the get_embedding function
Returns a dictionary containing the extracted data
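
If you want to try get_course_data without calling the OpenAI API, one option is to pass in stand-in objects that have the same shape as the real ones. The sketch below is an assumption-laden illustration: FakeEmbeddings, fake_llm, and fake_chunk are hypothetical names, and the vector values are made up rather than real embeddings.

```python
import os
from types import SimpleNamespace

class FakeEmbeddings:
    # Mimics the shape of llm.embeddings.create(...) as used by get_embedding
    def create(self, input, model):
        vector = [float(len(input)), 0.0, 1.0]  # made-up values, not a real embedding
        return SimpleNamespace(data=[SimpleNamespace(embedding=vector)])

# Stand-in for the OpenAI client
fake_llm = SimpleNamespace(embeddings=FakeEmbeddings())

# Stand-in for a LangChain document chunk
fake_chunk = SimpleNamespace(
    metadata={"source": os.path.join(
        "courses", "llm-fundamentals", "modules", "1-introduction",
        "lessons", "1-neo4j-and-genai", "lesson.adoc")},
    page_content="Example paragraph text",
)
```

With both functions from this lesson defined, calling get_course_data(fake_llm, fake_chunk) should return a dictionary whose course value is llm-fundamentals and whose embedding is the stub vector, without any network request.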

Create the graph

To create the graph, you will need to:

Create an OpenAI object to generate the embeddings
Connect to the Neo4j database
Iterate through the chunks
Extract the course data from each chunk
Create the nodes and relationships in the graph

Create the OpenAI object:

python

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

Connect to the Neo4j sandbox:

python

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

Test the connection

You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
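
Before running, it can help to confirm that the environment variables read by the code above are actually set. This is a small sketch, not part of the course code; missing_settings is a hypothetical helper, and the variable names are the ones used with os.getenv in this lesson.

```python
import os

REQUIRED = ["OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

def missing_settings(env=os.environ, names=REQUIRED):
    # Return the names that are unset or empty
    return [name for name in names if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```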

To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it in a transaction.

python

Create chunk function

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
        )

The create_chunk function will accept the data dictionary created by the get_course_data function.

You should be able to identify the $course, $module, $lesson, $url, $text, and $embedding parameters in the Cypher statement.
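
One way to double-check that the data dictionary supplies every parameter is to compare its keys against the $name placeholders in the statement. The sketch below does this with a regular expression; missing_parameters is a hypothetical helper, and the sample dictionary deliberately leaves out the embedding.

```python
import re

CYPHER = """
MERGE (c:Course {name: $course})
MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
WITH p
CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
"""

def missing_parameters(cypher, data):
    # Find every $name placeholder and report those absent from the dictionary
    names = set(re.findall(r"\$(\w+)", cypher))
    return sorted(names - set(data))

data = {"course": "c", "module": "m", "lesson": "l", "url": "u", "text": "t"}
print(missing_parameters(CYPHER, data))  # ['embedding']
```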

Iterate through the chunks and execute the create_chunk function:

python

for chunk in chunks:
    with driver.session(database=os.getenv('NEO4J_DATABASE', 'neo4j')) as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

A new session is created for each chunk. The execute_write method calls the create_chunk function, passing the data dictionary created by the get_course_data function.
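
Opening a session per chunk is simple and works. An alternative, shown here only as a sketch, is to open one session and reuse it for every chunk. The function load_chunks and its to_data and write_fn parameters are hypothetical names; in this lesson they would correspond to a wrapper around get_course_data and to create_chunk.

```python
def load_chunks(driver, chunks, to_data, write_fn, database="neo4j"):
    # One session for the whole run; each chunk still gets its own write transaction
    count = 0
    with driver.session(database=database) as session:
        for chunk in chunks:
            session.execute_write(write_fn, to_data(chunk))
            count += 1
    return count
```

Under those assumptions, a call might look like load_chunks(driver, chunks, lambda chunk: get_course_data(llm, chunk), create_chunk).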

Finally, close the driver.

python

driver.close()

The complete code is shown below.

python

import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
        )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database=os.getenv('NEO4J_DATABASE', 'neo4j')) as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()

Explore the graph

Run the code to create the graph. It will take a minute or two to complete as it creates the embeddings for each paragraph.

View the graph by running the following Cypher:

cypher

MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *

You will need to create a vector index to query the paragraph embeddings.

cypher

Create Vector Index

CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}}
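
The vector.dimensions value needs to match the length of the stored embeddings; the text-embedding-ada-002 model used in this lesson produces vectors with 1536 dimensions. A minimal sketch of a check you could run on an embedding before writing it, with has_expected_dimensions as a hypothetical helper:

```python
EXPECTED_DIMENSIONS = 1536  # same value as vector.dimensions in the index

def has_expected_dimensions(embedding, expected=EXPECTED_DIMENSIONS):
    # True when the vector length matches the index configuration
    return len(embedding) == expected

print(has_expected_dimensions([0.0] * 1536))  # True
print(has_expected_dimensions([0.0] * 3))     # False
```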

You can use the vector index and the graph to find a lesson to help with specific questions:

cypher

Find a lesson

WITH genai.vector.encode(
    "How does RAG help ground an LLM?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
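
The same lookup can also be sketched from Python by creating the question embedding with get_embedding and passing it as a parameter, instead of calling genai.vector.encode in Cypher. This assumes a recent 5.x release of the neo4j Python driver, where driver.execute_query is available; find_lessons and the FIND_LESSONS constant are hypothetical names, not part of the course code.

```python
FIND_LESSONS = """
CALL db.index.vector.queryNodes('paragraphs', $k, $embedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name AS name, l.url AS url, score
"""

def find_lessons(driver, embedding, k=6, database="neo4j"):
    # execute_query returns (records, summary, keys); only the records are used here
    records, _, _ = driver.execute_query(
        FIND_LESSONS, k=k, embedding=embedding, database_=database
    )
    return [record.data() for record in records]
```

Under those assumptions, a call might look like find_lessons(driver, get_embedding(llm, "How does RAG help ground an LLM?")).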

Explore the graph and see how the relationships between the nodes can bring additional meaning to the unstructured data.

Lesson Summary

In this lesson, you created a graph of course content.

In the next lesson, you will learn how to add topics to the graph.
