Import Unstructured Data

You will use Python and LangChain to chunk up course content and create embeddings for each chunk. You will then load the chunks into a Neo4j graph database.

The course content

You will load the content from the course Neo4j & LLM Fundamentals.

The workshop repository you cloned contains the course data.

Open the 1-knowledge-graphs-vectors/data directory in your code editor.

You should note the following structure:

- asciidoc - contains the course content in AsciiDoc format
  - courses - the course content
    - llm-fundamentals - the course name
      - modules - contains numbered directories for each module
        - 01-name - the module name
          - lessons - contains numbered directories for each lesson
            - 01-name - the lesson name
              - lesson.adoc - the lesson content
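To get a feel for this layout before loading anything, you could list the lesson files with the standard library. This is only a sketch: find_lessons is a hypothetical helper (it is not part of the workshop repository), and it assumes every lesson sits at modules/<module>/lessons/<lesson>/lesson.adoc as described above.

```python
from pathlib import Path


def find_lessons(root):
    """Return (module, lesson, path) tuples for every lesson.adoc under root."""
    lessons = []
    for path in sorted(Path(root).glob("**/lesson.adoc")):
        # Assumed layout: .../modules/<module>/lessons/<lesson>/lesson.adoc
        lesson_name = path.parent.name                 # e.g. 01-name (lesson)
        module_name = path.parent.parent.parent.name   # e.g. 01-name (module)
        lessons.append((module_name, lesson_name, path))
    return lessons
```

Calling find_lessons("1-knowledge-graphs-vectors/data/asciidoc") from the repository root should return one tuple per lesson, which is the same set of files the "**/lesson.adoc" glob pattern selects later in this lesson.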

Load the content and chunk it

You will load the content and chunk it using Python and LangChain.

Your code will split the lesson content into chunks of text, around 1500 characters long, each containing one or more paragraphs. Paragraphs in the content are separated by two newline characters (\n\n).
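The idea of paragraph-based chunking can be sketched in plain Python. The chunk_paragraphs function below is a hypothetical illustration, not LangChain's implementation: it greedily packs whole paragraphs into chunks of at most chunk_size characters and does not model chunk overlap.

```python
def chunk_paragraphs(text, chunk_size=1500, separator="\n\n"):
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for paragraph in text.split(separator):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs caused by extra blank lines
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            # The paragraph fits, or it is a single oversized paragraph: keep it whole.
            current = candidate
        else:
            # Adding the paragraph would exceed the limit: start a new chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are never cut in the middle, a chunk can be shorter than the limit, and a single paragraph longer than chunk_size still ends up as one oversized chunk. That is why the lesson describes the chunks as "around" 1500 characters long.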

Open the 1-knowledge-graphs-vectors/create_vector.py file and review the program:

python

import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

# Load lesson documents
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

# Create a text splitter
# text_splitter =

# Split documents into chunks
# chunks =

# Create a Neo4j vector store
# neo4j_db =

The program uses the DirectoryLoader class to load the content from the data/asciidoc directory.

Your task is to add the code to:

Create a CharacterTextSplitter object that splits the content on paragraph breaks (\n\n) with a chunk size of 1500 characters.
Use the split_documents method of that object to split the loaded documents into chunks of text.

Create the CharacterTextSplitter object to split the content into paragraphs (\n\n).

python

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

Split the documents into chunks of text.

python

chunks = text_splitter.split_documents(docs)

print(chunks)

You can run your code now to see the chunks of text.

Create vector index

Once you have chunked the content, you can use the LangChain Neo4jVector class to create an embedding for each chunk, store the chunks in a Neo4j graph database, and create a vector index over them.

You will need to modify your Python program:

Connect to the Neo4j database.

python

from langchain_neo4j import Neo4jGraph

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD'),
)
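The connection settings are read with os.getenv, which returns None when a variable is missing, so a typo in the .env file may only surface later as a connection error. A small check like the following could catch that earlier. It is a sketch: missing_settings and REQUIRED_VARS are hypothetical names, and the list of variables is simply the ones this lesson's code reads.

```python
import os

# Environment variables read by the code in this lesson.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

You might call missing_settings() straight after load_dotenv() and stop the program with a clear message if the returned list is not empty.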

Create the nodes and vector index.

python

from langchain_neo4j import Neo4jVector
from langchain_openai import OpenAIEmbeddings

neo4j_vector = Neo4jVector.from_documents(
    chunks,
    OpenAIEmbeddings(openai_api_key=os.getenv('OPENAI_API_KEY')),
    graph=graph,
    index_name="chunkVector",
    node_label="Chunk", 
    text_node_property="text", 
    embedding_node_property="embedding",  
)

The code will create 'Chunk' nodes with text and embedding properties and a vector index called chunkVector. You should be able to identify where you pass the Chunk, text, embedding, and chunkVector parameters.

View the complete code

import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_neo4j import Neo4jGraph
from langchain_neo4j import Neo4jVector
from langchain_openai import OpenAIEmbeddings

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

print(chunks)

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD'),
)

neo4j_vector = Neo4jVector.from_documents(
    chunks,
    OpenAIEmbeddings(openai_api_key=os.getenv('OPENAI_API_KEY')),
    graph=graph,
    index_name="chunkVector",
    node_label="Chunk", 
    text_node_property="text", 
    embedding_node_property="embedding",  
)

Run the program to create the chunk nodes and vector index. It may take a minute or two to complete.

View chunks in the sandbox

You can now view the chunks in the Neo4j sandbox by running the following query:

cypher

MATCH (c:Chunk) RETURN c LIMIT 25

You can also query the vector index to find similar chunks. For example, you can find lesson chunks relating to a specific question, "What does Hallucination mean?":

cypher

WITH genai.vector.encode(
    "What does Hallucination mean?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('chunkVector', 6, userEmbedding)
YIELD node, score
RETURN node.text, score

Remember to replace sk-... with your OpenAI API key.

Experiment with different questions and see how the vector index can find similar chunks.
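To build intuition for what the index query does, the toy example below ranks a few made-up vectors against a question vector. It assumes the index compares embeddings with cosine similarity. The three-dimensional vectors and their labels are invented for illustration (real OpenAI embeddings have far more dimensions), and the scores Neo4j reports may be scaled differently from the raw cosine values computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Invented toy "embeddings" standing in for chunk vectors.
toy_chunks = {
    "hallucination": [0.9, 0.1, 0.0],
    "vector indexes": [0.1, 0.9, 0.2],
    "graph basics": [0.0, 0.2, 0.9],
}
question = [1.0, 0.2, 0.0]  # invented vector standing in for the encoded question

print(top_k(question, toy_chunks))
```

The chunk whose vector points in nearly the same direction as the question vector ranks first, which mirrors how the query above returns the nodes with the highest score.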

Continue

When you are ready, you can move on to the next task.

Summary

You learned to use Python and LangChain to load, chunk, and vectorize unstructured data into a Neo4j graph database.
