The LLM Graph Builder follows the process you learned earlier in the course:
- Gather the data
- Chunk the data
- Vectorize the data
- Pass the data to an LLM to extract nodes and relationships
- Use the output to generate the graph
Python and LangChain are predominantly used to build the knowledge graph.
In this lesson, you will review Python code snippets that complete the above steps.
Gather the data
The application uses a document loader to load the PDFs from a directory.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
DOCS_PATH = "llm-knowledge-graph/data/course/pdfs"
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()
The glob parameter specifies the file pattern used to find the PDFs within the directory.
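The PyPDFLoader loads each page of a PDF as an individual document. A quick sanity check like the following is illustrative only, not part of the application:

print(f"Loaded {len(docs)} documents")
print(docs[0].metadata)  # includes the source file path and page number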
Chunk the data
The application splits the documents using a text splitter.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)
This code splits the text into paragraphs (wherever there is a double newline, \n\n) and combines them into chunks of up to roughly 1500 characters, with a 200-character overlap between consecutive chunks.
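You can verify the result by inspecting the chunks; this snippet is illustrative, not part of the application:

print(f"Created {len(chunks)} chunks")
for chunk in chunks[:5]:
    print(len(chunk.page_content), chunk.metadata)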
Vectorize the data
The application creates an embedding for each chunk of text and adds them to the graph.
import os

from langchain_openai import OpenAIEmbeddings
from langchain_community.graphs import Neo4jGraph

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

for chunk in chunks:
    # Extract the filename
    filename = os.path.basename(chunk.metadata["source"])

    # Create a unique identifier for the chunk
    chunk_id = f"{filename}.{chunk.metadata['page']}"

    # Embed the chunk
    chunk_embedding = embedding_provider.embed_query(chunk.page_content)

    # Add the Document and Chunk nodes to the graph
    properties = {
        "filename": filename,
        "chunk_id": chunk_id,
        "text": chunk.page_content,
        "embedding": chunk_embedding
    }
    graph.query("""
        MERGE (d:Document {id: $filename})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $text
        MERGE (d)<-[:PART_OF]-(c)
        WITH c
        CALL db.create.setNodeVectorProperty(c, 'textEmbedding', $embedding)
        """,
        properties
    )

# Create the vector index
graph.query("""
    CREATE VECTOR INDEX `vector`
    FOR (c:Chunk) ON (c.textEmbedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }};""")
OpenAI creates the embeddings in this example, but you could use any embedding model.
The code uses the Neo4jGraph.query method to create the Document and Chunk nodes and store the text and embedding data.
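Once the chunks and index are in place, you can find chunks similar to a question by embedding the question and querying the index. The following sketch is illustrative, not part of the application; it assumes the index named 'vector' created above:

question_embedding = embedding_provider.embed_query("What is a knowledge graph?")

results = graph.query("""
    CALL db.index.vector.queryNodes('vector', 3, $embedding)
    YIELD node, score
    RETURN node.id, node.text, score
    """,
    {"embedding": question_embedding}
)

for row in results:
    print(row)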
Extract nodes and relationships
The application uses the LangChain LLMGraphTransformer, contributed by Neo4j, to extract the nodes and relationships.
The LLMGraphTransformer requires an llm. This example uses OpenAI's gpt-3.5-turbo, but you could use any LLM.
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model_name="gpt-3.5-turbo"
)

doc_transformer = LLMGraphTransformer(
    llm=llm,
)
for chunk in chunks:
    # Generate the entities and relationships from the chunk
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])
The LLMGraphTransformer.convert_to_graph_documents method uses the llm to generate a list of graph documents, each containing the nodes and relationships extracted from a chunk.
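You can inspect what the transformer produced by printing the nodes and relationships of each graph document; this check is illustrative, not part of the application:

for graph_doc in graph_docs:
    print(graph_doc.nodes)
    print(graph_doc.relationships)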
Use the output to generate the graph
Finally, the application uses the generated graph documents to create the graph.
The graph documents consist of a set of entity nodes. A Node representing the Chunk and a HAS_ENTITY relationship are added to each graph document, linking the generated entities to the chunk they were extracted from.
from langchain_community.graphs.graph_document import Node, Relationship

for chunk in chunks:
    filename = os.path.basename(chunk.metadata["source"])
    chunk_id = f"{filename}.{chunk.metadata['page']}"

    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

    # Map the entities in the graph documents to the chunk node
    for graph_doc in graph_docs:
        chunk_node = Node(
            id=chunk_id,
            type="Chunk"
        )

        for node in graph_doc.nodes:
            graph_doc.relationships.append(
                Relationship(
                    source=chunk_node,
                    target=node,
                    type="HAS_ENTITY"
                )
            )

    # add the graph documents to the graph
    graph.add_graph_documents(graph_docs)
The last step passes the graph documents to the Neo4jGraph.add_graph_documents method to create the nodes and relationships in Neo4j.
This process creates a simplified version of the data model you saw in the previous lesson.
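You can explore the resulting structure directly in Neo4j. For example, this illustrative query (not part of the application) returns documents, their chunks, and the entities extracted from each chunk:

results = graph.query("""
    MATCH (d:Document)<-[:PART_OF]-(c:Chunk)-[:HAS_ENTITY]->(e)
    RETURN d.id AS document, c.id AS chunk, e.id AS entity
    LIMIT 10
    """)

for row in results:
    print(row)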
When you are ready, move on to the next lesson.
Summary
In this lesson, you explored the code to build a knowledge graph using an LLM.
In the next lesson, you will run and adapt a Python program to create a knowledge graph.