In the previous task, you used the Neo4jVector class to create Chunk nodes in the graph. Using Neo4jVector is an efficient and easy way to get started.
To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.
In this lesson, you will create a graph of the course content using the neo4j Python driver and OpenAI API.
Data Model
The data model you will create is a simplified version of the course content model you saw earlier.
The graph will contain the following nodes, properties, and relationships:
- Course, Module, and Lesson nodes with a name property
- A url property on Lesson nodes will hold the GraphAcademy URL for the lesson
- Paragraph nodes will have text and embedding properties
- The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes
You can extract the name properties and url metadata from the directory structure of the lesson files.
For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:
courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc
You can extract the following metadata from the path:
- Course.name: llm-fundamentals
- Module.name: 1-introduction
- Lesson.name: 1-neo4j-and-genai
- Lesson.url: graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
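As a quick illustration of the mapping above, the sketch below parses the example path into its metadata. The helper name parse_lesson_path is hypothetical, and PureWindowsPath is used only so that the backslash-separated example path parses the same way on any operating system.

```python
from pathlib import PureWindowsPath


def parse_lesson_path(source: str) -> dict:
    """Split a lesson file path into course, module, and lesson names."""
    # PureWindowsPath accepts both "\" and "/" separators on any OS
    parts = PureWindowsPath(source).parts
    # Layout assumed: courses/<course>/modules/<module>/lessons/<lesson>/lesson.adoc
    course, module, lesson = parts[-6], parts[-4], parts[-2]
    return {
        "course": course,
        "module": module,
        "lesson": lesson,
        "url": f"https://graphacademy.neo4j.com/courses/{course}/{module}/{lesson}",
    }


example = r"courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc"
print(parse_lesson_path(example))
```

Counting from the end of the path keeps the parsing independent of where the courses directory sits on disk.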
Extracting the data
Open the 1-knowledge-graphs-vectors\build_graph.py file in your code editor.
This starter code loads and chunks the course content.
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase
COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(docs)
# Create a function to get the embedding
# Create a function to get the course data
# Create OpenAI object
# Connect to Neo4j
# Create a function to run the Cypher query
# Iterate through the chunks and create the graph
# Close the neo4j driver
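To give a feel for what the splitter settings mean, here is a deliberately simplified illustration of separator-based chunking. It is not LangChain's implementation: the split_text helper is hypothetical, it ignores chunk_overlap, and it keeps any single piece longer than chunk_size whole.

```python
def split_text(text: str, separator: str = "\n\n", chunk_size: int = 1500) -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the current chunk (an oversized single piece is kept whole)
            current = candidate
        else:
            # The next piece would overflow, so close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
```

With a chunk_size of 10, the first two paragraphs fit together and the third starts a new chunk.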
For each chunk, you have to create an embedding of the text and extract the metadata.
Create a function to create and return an embedding using the OpenAI API:
def get_embedding(llm, text):
    # Embed the text passed to the function
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding
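The embeddings endpoint also accepts a list of inputs, so several texts can be embedded in one request. The sketch below is an optional variation and not part of the lesson's code: get_embeddings is a hypothetical helper, and it assumes the same OpenAI client object (llm) used elsewhere in this lesson.

```python
def get_embeddings(llm, texts, model="text-embedding-ada-002"):
    """Embed several texts in one request and return the vectors in input order."""
    response = llm.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of the input it belongs to
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Sorting by index is a defensive step, so the result lines up with the input list regardless of the order of the response items.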
Create a second function, which will extract the data from the chunk:
def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data
The get_course_data function:
- Splits the document source path to extract the course, module, and lesson names
- Constructs the url using the extracted names
- Extracts the text from the chunk
- Creates an embedding using the get_embedding function
- Returns a dictionary containing the extracted data
Create the graph
To create the graph, you will need to:
- Create an OpenAI object to generate the embeddings
- Connect to the Neo4j database
- Iterate through the chunks
- Extract the course data from each chunk
- Create the nodes and relationships in the graph
Create the OpenAI object:
llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
Connect to the Neo4j sandbox:
driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()
Test the connection
You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
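If the connection test fails, a missing environment variable is a common cause. The sketch below lists any of the four variables used in this lesson that are unset or empty; the missing_env_vars helper is hypothetical and only reads the environment.

```python
import os

# The variables read by the code in this lesson
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Run it after load_dotenv() so that values from the .env file are included in the check.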
To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it in a transaction.
def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )
The create_chunk function will accept the data dictionary created by the get_course_data function.
You should be able to identify the $course, $module, $lesson, $url, $text, and $embedding parameters in the Cypher statement.
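Every $parameter in the statement must have a matching key in the data dictionary. One way to check this before running a query is sketched below. Both helpers are hypothetical, and the regular expression is a rough match that would also pick up a $name appearing inside a string literal.

```python
import re


def cypher_parameters(query: str) -> set[str]:
    """Return the names of all $parameters referenced in a Cypher string."""
    return set(re.findall(r"\$(\w+)", query))


def missing_parameters(query: str, data: dict) -> set[str]:
    """Return the parameter names used in the query that have no key in data."""
    return cypher_parameters(query) - set(data)


query = "MERGE (c:Course {name: $course}) MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})"
print(missing_parameters(query, {"course": "llm-fundamentals"}))
```

An empty set means the dictionary covers every parameter the statement references.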
Iterate through the chunks and execute the create_chunk function:
for chunk in chunks:
    with driver.session(database="neo4j") as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )
A new session is created for each chunk. The execute_write method calls the create_chunk function, passing the data dictionary created by the get_course_data function.
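Opening a session per chunk works, but a single session can also be reused for the whole loop. The sketch below shows that variation. The load_chunks helper and its build_data and write_chunk arguments are hypothetical names, and it assumes a driver object like the one created earlier.

```python
def load_chunks(driver, chunks, build_data, write_chunk, database="neo4j"):
    """Write every chunk through one shared session; return the number written."""
    written = 0
    with driver.session(database=database) as session:
        for chunk in chunks:
            # Each execute_write call still runs in its own managed transaction
            session.execute_write(write_chunk, build_data(chunk))
            written += 1
    return written
```

With the functions from this lesson, a call might look like load_chunks(driver, chunks, lambda chunk: get_course_data(llm, chunk), create_chunk).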
Finally, close the driver.
driver.close()
Complete code
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

# Load the lesson files and split them into chunks
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    # Embed the text passed to the function
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

# Create the nodes and relationships for each chunk
for chunk in chunks:
    with driver.session(database="neo4j") as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()
Run the code to create the graph. It will take a minute or two to complete as it creates the embeddings for each paragraph.
Explore the graph
View the graph by running the following Cypher:
MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *
You will need to create a vector index to query the paragraph embeddings.
CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
}}
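The vector.dimensions value must match the length of the stored embeddings; text-embedding-ada-002 produces vectors of 1536 numbers. A small guard like the one sketched below could be called before writing an embedding to the graph. The check_embedding helper is hypothetical.

```python
EXPECTED_DIMENSIONS = 1536  # matches `vector.dimensions` in the index definition


def check_embedding(embedding, expected=EXPECTED_DIMENSIONS):
    """Raise ValueError if the embedding length differs from the index dimensions."""
    if len(embedding) != expected:
        raise ValueError(f"Expected {expected} dimensions, got {len(embedding)}")
    return embedding
```

If you switch to a different embedding model, the index dimensions and this constant would need to change together.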
You can use the vector index and the graph to find a lesson to help with specific questions:
WITH genai.vector.encode(
    "How does RAG help ground an LLM?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
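The score returned by the query reflects how similar each paragraph embedding is to the question embedding. As a rough sketch of the idea, the code below computes cosine similarity in plain Python. The normalized_score mapping to the range 0 to 1 is an assumption made for illustration, based on Neo4j documenting its cosine scores as normalized; the index itself does this work in the database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + cosine_similarity(a, b)) / 2


print(normalized_score([1.0, 0.0], [1.0, 0.0]))
```

Vectors pointing in the same direction score highest, and vectors pointing in opposite directions score lowest.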
Explore the graph and see how the relationships between the nodes can bring additional meaning to the unstructured data.
Continue
When you are ready, you can move on to the next task.
Summary
You created a graph of the course content using the neo4j Python driver and OpenAI API.