Creating a graph
In the previous task, you used the Neo4jVector class to create Chunk nodes in the graph. Using Neo4jVector is an efficient and easy way to get started.
To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.
In this lesson, you will create a graph of the course content.
Data Model
You will create a graph of the course content containing the following nodes, properties, and relationships:
- Course, Module, and Lesson nodes with a name property
- A url property on Lesson nodes will hold the GraphAcademy URL for the lesson
- Paragraph nodes will have id, text, and embedding properties
- The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes
You can extract the name properties and url metadata from the directory structure of the lesson files.
For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:
courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc
The following metadata is in the path:
- Course.name - llm-fundamentals
- Module.name - 1-introduction
- Lesson.name - 1-neo4j-and-genai
- Lesson.url - graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
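Applying this to the example path shows how each value falls out of the directory structure. This is a quick sketch using a forward-slash path; the starter code splits on os.path.sep so it works on any operating system:

```python
path = "courses/llm-fundamentals/modules/1-introduction/lessons/1-neo4j-and-genai/lesson.adoc"
parts = path.split("/")

# Negative indices locate the values regardless of any leading directories
course = parts[-6]   # 'llm-fundamentals'
module = parts[-4]   # '1-introduction'
lesson = parts[-2]   # '1-neo4j-and-genai'
url = f"graphacademy.neo4j.com/courses/{course}/{module}/{lesson}"
```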
Building the graph
Open the 1-knowledge-graphs-vectors\build_graph.py starter code in your code editor.
The starter code loads and chunks the course content.
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

# Create an OpenAI embedding provider

# Create a function to get the course data

# Connect to Neo4j

# Create a function to run the Cypher query

# Iterate through the chunks and create the graph
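The text splitter breaks each document on blank lines and packs paragraphs into chunks of roughly chunk_size characters. A simplified stdlib sketch of that packing step (ignoring chunk_overlap and start_index, which the real CharacterTextSplitter also handles):

```python
def split_text(text, chunk_size=1500, separator="\n\n"):
    # Greedily pack paragraphs into chunks no longer than chunk_size
    chunks, current = [], ""
    for paragraph in text.split(separator):
        candidate = current + separator + paragraph if current else paragraph
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

# Two 900-character paragraphs cannot share a 1500-character chunk,
# but the second paragraph and the 100-character one can
sample = ("A" * 900) + "\n\n" + ("B" * 900) + "\n\n" + ("C" * 100)
print([len(c) for c in split_text(sample)])  # [900, 1002]
```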
For each chunk, you will have to:

- Create an embedding of the text.
- Extract the metadata.
Extracting the data
Create an OpenAI embedding provider instance to generate the embeddings:
embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)
Create a function to extract the metadata from the chunk:
def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)

    return data
The get_course_data function:

- Splits the document source path to extract the course, module, and lesson names
- Constructs the url using the extracted names
- Creates a unique id for the paragraph from the file name and the chunk position
- Extracts the text from the chunk
- Creates an embedding using the embedding_provider instance
- Returns a dictionary containing the extracted data
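To sanity-check the extraction logic without calling the OpenAI API, you could run the function against stand-in objects. StubEmbeddings and StubChunk below are illustrative, not part of the starter code; the function body repeats the logic described above:

```python
import os

def get_course_data(embedding_provider, chunk):
    # Same logic as the function described above
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)
    return data

class StubEmbeddings:
    def embed_query(self, text):
        return [0.0] * 1536  # placeholder for a real 1536-dimension vector

class StubChunk:
    page_content = "Neo4j is a graph database."
    metadata = {
        "source": os.path.sep.join(
            ["courses", "llm-fundamentals", "modules", "1-introduction",
             "lessons", "1-neo4j-and-genai", "lesson.adoc"]),
        "start_index": 0,
    }

data = get_course_data(StubEmbeddings(), StubChunk())
print(data['course'], data['module'], data['lesson'])
```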
Creating the graph
To create the graph, you will need to:

- Connect to the Neo4j database
- Iterate through the chunks
- Extract the course data from each chunk
- Create the nodes and relationships in the graph
Connect
Connect to the Neo4j sandbox:
graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)
Test the connection
You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
Create data
To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it:
def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )
The create_chunk function accepts the data dictionary created by the get_course_data function.
You should be able to identify the following parameters in the Cypher statement:
- $course
- $module
- $lesson
- $url
- $id
- $text
- $embedding
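One way to check that every parameter reaches the query, without a live database, is to call create_chunk with a recording stand-in for Neo4jGraph. FakeGraph below is illustrative, not part of the starter code:

```python
class FakeGraph:
    # Records queries and parameters instead of running them
    def __init__(self):
        self.calls = []

    def query(self, cypher, params=None):
        self.calls.append((cypher, params))
        return []

def create_chunk(graph, data):
    # Same function as above
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

graph = FakeGraph()
create_chunk(graph, {
    "course": "llm-fundamentals", "module": "1-introduction",
    "lesson": "1-neo4j-and-genai", "url": "https://example.org",
    "id": "lesson.adoc.0", "text": "Example text", "embedding": [0.0] * 1536,
})
print(len(graph.calls))  # 1
```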
Create chunk
Iterate through the chunks and execute the create_chunk function:
for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])
For each chunk, the metadata is extracted and used to create the nodes and relationships in the graph.
The complete code:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)

    return data

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph {id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])
Run the code to create the graph.
Explore the graph
View the graph by running the following Cypher:
MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *
Create vector index
You will need to create a vector index to query the paragraph embeddings.
CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}
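The index is configured for 1536 dimensions (the size of text-embedding-ada-002 vectors) and cosine similarity, which compares vectors by the angle between them rather than their magnitude. A minimal pure-Python illustration of cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```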
Query the vector index
You can use the vector index and the graph to find a lesson to help with specific questions:
WITH genai.vector.encode(
"How does RAG help ground an LLM?",
"OpenAI",
{ token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
Continue
Explore the graph and see how the relationships between the nodes can bring additional meaning to the unstructured data.
When you are ready, you can move on to the next task.
Summary
You created a graph of the course content using Neo4j and LangChain.