Creating a graph
In the previous task, you used the Neo4jVector class to create Chunk nodes in the graph. Using Neo4jVector is an efficient and easy way to get started.
To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.
In this lesson, you will create a graph of the course content.
Data Model
You will create a graph of the course content containing the following nodes, properties, and relationships:
- Course, Module, and Lesson nodes with a name property
- A url property on Lesson nodes will hold the GraphAcademy URL for the lesson
- Paragraph nodes will have id, text, and embedding properties
- The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes
You can extract the name properties and url metadata from the directory structure of the lesson files.
For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:
courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc
The following metadata is in the path:
- Course.name - llm-fundamentals
- Module.name - 1-introduction
- Lesson.name - 1-neo4j-and-genai
- Lesson.url - graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
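Applying this to the example path shows how each value falls out of the directory structure. This is a quick sketch using a forward-slash path; the starter code splits on os.path.sep so it works on any operating system:

```python
path = "courses/llm-fundamentals/modules/1-introduction/lessons/1-neo4j-and-genai/lesson.adoc"
parts = path.split("/")

# Negative indices locate the values regardless of any leading directories
course = parts[-6]   # 'llm-fundamentals'
module = parts[-4]   # '1-introduction'
lesson = parts[-2]   # '1-neo4j-and-genai'
url = f"graphacademy.neo4j.com/courses/{course}/{module}/{lesson}"
```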
Building the graph
Open the 1-knowledge-graphs-vectors\build_graph.py starter code in your code editor.
The starter code loads and chunks the course content.
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

# Create an OpenAI embedding provider

# Create a function to get the course data

# Connect to Neo4j

# Create a function to run the Cypher query

# Iterate through the chunks and create the graph
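The text splitter breaks each document on blank lines and packs paragraphs into chunks of roughly chunk_size characters. A simplified stdlib sketch of that packing step (ignoring chunk_overlap and start_index, which the real CharacterTextSplitter also handles):

```python
def split_text(text, chunk_size=1500, separator="\n\n"):
    # Greedily pack paragraphs into chunks no longer than chunk_size
    chunks, current = [], ""
    for paragraph in text.split(separator):
        candidate = current + separator + paragraph if current else paragraph
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

# Two 900-character paragraphs cannot share a 1500-character chunk,
# but the second paragraph and the 100-character one can
sample = ("A" * 900) + "\n\n" + ("B" * 900) + "\n\n" + ("C" * 100)
print([len(c) for c in split_text(sample)])  # [900, 1002]
```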
For each chunk, you will have to:

- Create an embedding of the text.
- Extract the metadata.
Extracting the data
Create an OpenAI embedding provider instance to generate the embeddings:
embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)
Create a function to extract the metadata from the chunk:
def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)

    return data
The get_course_data function:

- Splits the document source path to extract the course, module, and lesson names
- Constructs the url using the extracted names
- Creates a unique id for the paragraph from the file name and the chunk position
- Extracts the text from the chunk
- Creates an embedding using the embedding_provider instance
- Returns a dictionary containing the extracted data
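To sanity-check the extraction logic without calling the OpenAI API, you could run the function against stand-in objects. StubEmbeddings and StubChunk below are illustrative, not part of the starter code; the function body repeats the logic described above:

```python
import os

def get_course_data(embedding_provider, chunk):
    # Same logic as the function described above
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)
    return data

class StubEmbeddings:
    def embed_query(self, text):
        return [0.0] * 1536  # placeholder for a real 1536-dimension vector

class StubChunk:
    page_content = "Neo4j is a graph database."
    metadata = {
        "source": os.path.sep.join(
            ["courses", "llm-fundamentals", "modules", "1-introduction",
             "lessons", "1-neo4j-and-genai", "lesson.adoc"]),
        "start_index": 0,
    }

data = get_course_data(StubEmbeddings(), StubChunk())
print(data['course'], data['module'], data['lesson'])
```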
Creating the graph
To create the graph, you will need to:

- Connect to the Neo4j database
- Iterate through the chunks
- Extract the course data from each chunk
- Create the nodes and relationships in the graph
Connect
Connect to the Neo4j sandbox:
graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)
Test the connection
You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
Create data
To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it:
def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )
The create_chunk function accepts the data dictionary created by the get_course_data function.
You should be able to identify the following parameters in the Cypher statement:
- $course
- $module
- $lesson
- $url
- $id
- $text
- $embedding
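One way to check that every parameter reaches the query, without a live database, is to call create_chunk with a recording stand-in for Neo4jGraph. FakeGraph below is illustrative, not part of the starter code:

```python
class FakeGraph:
    # Records queries and parameters instead of running them
    def __init__(self):
        self.calls = []

    def query(self, cypher, params=None):
        self.calls.append((cypher, params))
        return []

def create_chunk(graph, data):
    # Same function as above
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

graph = FakeGraph()
create_chunk(graph, {
    "course": "llm-fundamentals", "module": "1-introduction",
    "lesson": "1-neo4j-and-genai", "url": "https://example.org",
    "id": "lesson.adoc.0", "text": "Example text", "embedding": [0.0] * 1536,
})
print(len(graph.calls))  # 1
```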
Create chunk
Iterate through the chunks and execute the create_chunk function:
for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])
For each chunk, the metadata is extracted and used to create the nodes and relationships in the graph.
The complete code:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)

    return data

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])
Run the code to create the graph.
Explore the graph
View the graph by running the following Cypher:
MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *
Create vector index
You will need to create a vector index to query the paragraph embeddings.
CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}
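The index is configured for 1536 dimensions (the size of text-embedding-ada-002 vectors) and cosine similarity, which compares vectors by the angle between them rather than their magnitude. A minimal pure-Python illustration of cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```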
Query the vector index
You can use the vector index and the graph to find a lesson to help with specific questions:
WITH genai.vector.encode(
"How does RAG help ground an LLM?",
"OpenAI",
{ token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
Continue
Explore the graph and see how the relationships between the nodes can bring additional meaning to the unstructured data.
When you are ready, you can move on to the next task.
Summary
You created a graph of the course content using Neo4j and LangChain.