In the previous lesson, you reviewed code snippets required to implement the knowledge graph build process.
In this lesson, you will explore and modify the complete Python code to build a knowledge graph using LangChain.
Open the llm-knowledge-graph/create_kg.py file.
import os
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs.graph_document import Node, Relationship

from dotenv import load_dotenv
load_dotenv()

DOCS_PATH = "llm-knowledge-graph/data/course/pdfs"

llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model_name="gpt-3.5-turbo"
)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

doc_transformer = LLMGraphTransformer(
    llm=llm,
)

# Load and split the documents
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

docs = loader.load()
chunks = text_splitter.split_documents(docs)

for chunk in chunks:

    filename = os.path.basename(chunk.metadata["source"])
    chunk_id = f"{filename}.{chunk.metadata['page']}"
    print("Processing -", chunk_id)

    # Embed the chunk
    chunk_embedding = embedding_provider.embed_query(chunk.page_content)

    # Add the Document and Chunk nodes to the graph
    properties = {
        "filename": filename,
        "chunk_id": chunk_id,
        "text": chunk.page_content,
        "embedding": chunk_embedding
    }

    graph.query("""
        MERGE (d:Document {id: $filename})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $text
        MERGE (d)<-[:PART_OF]-(c)
        WITH c
        CALL db.create.setNodeVectorProperty(c, 'textEmbedding', $embedding)
        """,
        properties
    )

    # Generate the entities and relationships from the chunk
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

    # Map the entities in the graph documents to the chunk node
    for graph_doc in graph_docs:
        chunk_node = Node(
            id=chunk_id,
            type="Chunk"
        )

        for node in graph_doc.nodes:
            graph_doc.relationships.append(
                Relationship(
                    source=chunk_node,
                    target=node,
                    type="HAS_ENTITY"
                )
            )

    # Add the graph documents to the graph
    graph.add_graph_documents(graph_docs)

# Create the vector index
graph.query("""
    CREATE VECTOR INDEX `chunkVector`
    IF NOT EXISTS
    FOR (c: Chunk) ON (c.textEmbedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }};""")
Review the code. You should be able to identify the sections of the code that:
- Gather the data
- Chunk the data
- Vectorize the data
- Pass the data to an LLM to extract nodes and relationships
- Use the output to generate the graph
This is a standard process to build a knowledge graph and can be adapted to suit your use case.
Documents
The code loads a set of PDF documents in a directory.
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)
Depending on how your documents are stored, you may need to modify the loader to load the documents.
LangChain includes integration for different file types and storage.
For example, you can load data from a CSV file using the CSVLoader.
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="path/to/csv_file.csv")
You can find more information in the LangChain Document loaders how-to guide.
Allowed nodes and relationships
You can modify the code to define a set schema for the knowledge graph by specifying the allowed nodes and relationships.
When using the LLM Graph Builder, you modified the schema to include only the following node labels:
- Technology
- Concept
- Skill
- Event
- Person
- Object
To achieve the same thing, you need to include the list of labels as allowed_nodes when creating the LLMGraphTransformer instance.
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
)
You can also restrict the relationships by specifying the allowed_relationships parameter.
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    allowed_relationships=["USES", "HAS", "IS", "AT", "KNOWS"],
)
Properties
Currently, the LLM will only extract the nodes and relationships from the text.
You can also instruct it to include properties for the nodes by specifying the node_properties parameter.
Specifying properties will result in nodes and relationships with additional metadata. The properties will only be present if the LLM can generate them from the text provided.
In this example, a name and description property will be added if the values can be determined from the text.
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    node_properties=["name", "description"],
)
Defining properties allows you to increase the granularity of the knowledge graph at the cost of the build process taking longer.
Structured data
When generating the knowledge graph, you can also include structured data about the documents.
In this example, the documents are part of a GraphAcademy course, and you could extend the graph to include Course, Module, and Lesson nodes.
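As a rough sketch of how you might do this (the file naming scheme, the course_metadata helper, and the Cypher query below are illustrative assumptions, not part of the course code), you could derive the course structure from each file name and MERGE the extra nodes in the same loop that processes the chunks:

```python
import os

def course_metadata(source_path):
    # Hypothetical helper: assumes course PDFs follow a
    # "<course>_<module>_<lesson>.pdf" naming scheme. Adjust the parsing
    # to match how your documents are actually named.
    stem, _ = os.path.splitext(os.path.basename(source_path))
    course, module, lesson = stem.split("_", 2)
    return {"course": course, "module": module, "lesson": lesson}

# The metadata could then be merged into the graph alongside each chunk,
# for example by passing a query like this to graph.query() with the
# dictionary above (plus the chunk_id) as parameters:
LINK_CHUNK_QUERY = """
MERGE (course:Course {name: $course})
MERGE (course)-[:HAS_MODULE]->(module:Module {name: $module})
MERGE (module)-[:HAS_LESSON]->(lesson:Lesson {name: $lesson})
MERGE (c:Chunk {id: $chunk_id})
MERGE (lesson)-[:CONTAINS]->(c)
"""

print(course_metadata("llm-fundamentals_1-introduction_2-what-is-an-llm.pdf"))
# {'course': 'llm-fundamentals', 'module': '1-introduction', 'lesson': '2-what-is-an-llm'}
```

Because the chunk loop already knows the source file for each chunk, this structured data can be added without a second pass over the documents.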
Generate the graph
When you are ready, run the create_kg.py script to generate the knowledge graph.
This query will match the documents and return the first 50 nodes and corresponding relationships:
MATCH (d:Document)-[*]-(n)
RETURN d,n
LIMIT 50
In the next module, you will explore methods of querying the knowledge graph.
Experiment with the allowed_nodes, allowed_relationships, and node_properties parameters to see how they affect the knowledge graph.
If you want to reset the sandbox and start again, you can delete all the nodes and relationships in the graph by running the following Cypher:
MATCH (n) DETACH DELETE n
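If you prefer to run the reset from Python, a minimal sketch (the reset_graph helper is illustrative, not part of the course code) can reuse the same connection as create_kg.py:

```python
RESET_QUERY = "MATCH (n) DETACH DELETE n"

def reset_graph(graph):
    # Works with any object exposing a query() method, such as the
    # Neo4jGraph instance created in create_kg.py.
    # This is destructive: it removes every node and relationship.
    return graph.query(RESET_QUERY)
```

After constructing the Neo4jGraph instance as in create_kg.py, calling reset_graph(graph) clears the database so the build script can be run again from scratch.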
When you are ready, move on to the next lesson.
Check Your Understanding
1. Allowed Nodes
What are the implications of specifying allowed nodes and relationships in the LLM Graph Transformer?
Select all that apply.
doc_transformer = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Customer", "Product", "Price", "Sale"],
)
- ✓ The graph may contain less information.
- ❏ The graph may contain other nodes.
- ✓ The graph will only contain the nodes Customer, Product, Price, Sale.
- ❏ The graph will contain no relationships as none are specified.
Hint
Specifying allowed_nodes will result in a more concise knowledge graph.
The allowed_relationships parameter can be used to restrict the relationships.
Solution
The correct answers are:
- The graph may contain less information.
- The graph will only contain the nodes Customer, Product, Price, Sale.

Relationships of any type will be included unless you specify the allowed_relationships parameter.
Lesson Summary
In this lesson, you learned how to build a knowledge graph using Python and LangChain.
In the next optional challenge, you can upload your own documents and build a knowledge graph from them.