Build a KG with Python

In the previous lesson, you reviewed the code snippets required to implement the knowledge graph build process.

In this lesson, you will explore and modify the complete Python code to build a knowledge graph using LangChain.

Open the llm-knowledge-graph/create_kg.py file.


python

import os

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs.graph_document import Node, Relationship

from dotenv import load_dotenv
load_dotenv()

DOCS_PATH = "llm-knowledge-graph/data/course/pdfs"

llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'), 
    model_name="gpt-3.5-turbo"
)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
    )

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

doc_transformer = LLMGraphTransformer(
    llm=llm,
    )

# Load and split the documents
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

docs = loader.load()
chunks = text_splitter.split_documents(docs)

for chunk in chunks:

    filename = os.path.basename(chunk.metadata["source"])
    chunk_id = f"{filename}.{chunk.metadata['page']}"
    print("Processing -", chunk_id)

    # Embed the chunk
    chunk_embedding = embedding_provider.embed_query(chunk.page_content)

    # Add the Document and Chunk nodes to the graph
    properties = {
        "filename": filename,
        "chunk_id": chunk_id,
        "text": chunk.page_content,
        "embedding": chunk_embedding
    }
    
    graph.query("""
        MERGE (d:Document {id: $filename})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $text
        MERGE (d)<-[:PART_OF]-(c)
        WITH c
        CALL db.create.setNodeVectorProperty(c, 'textEmbedding', $embedding)
        """, 
        properties
    )

    # Generate the entities and relationships from the chunk
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

    # Map the entities in the graph documents to the chunk node
    for graph_doc in graph_docs:
        chunk_node = Node(
            id=chunk_id,
            type="Chunk"
        )

        for node in graph_doc.nodes:

            graph_doc.relationships.append(
                Relationship(
                    source=chunk_node,
                    target=node, 
                    type="HAS_ENTITY"
                    )
                )

    # add the graph documents to the graph
    graph.add_graph_documents(graph_docs)

# Create the vector index
graph.query("""
    CREATE VECTOR INDEX `chunkVector`
    IF NOT EXISTS
    FOR (c: Chunk) ON (c.textEmbedding)
    OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
    }};""")

Review the code. You should be able to identify the sections that:

Gather the data
Chunk the data
Vectorize the data
Pass the data to an LLM to extract nodes and relationships
Use the output to generate the graph

This is a standard process to build a knowledge graph and can be adapted to suit your use case.
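One detail worth checking when you adapt the script: chunk_id is built from the filename and the page number only. If the text splitter produces more than one chunk from a single page, those chunks share an ID, and the MERGE on the Chunk node may collapse them into one node whose text is overwritten. The following is a minimal sketch of one way to avoid this, assuming metadata shaped like chunk.metadata in the script. The unique_chunk_ids helper and the example metadata are hypothetical, not part of the course code.

```python
import os
from collections import defaultdict


def unique_chunk_ids(metadatas):
    """Return one ID per chunk in the form '<filename>.<page>.<n>'.

    n counts the chunks already seen for the same file and page,
    so two chunks from one page never share an ID.
    """
    seen = defaultdict(int)  # (filename, page) -> number of chunks seen so far
    ids = []
    for meta in metadatas:
        filename = os.path.basename(meta["source"])
        key = (filename, meta["page"])
        ids.append(f"{filename}.{meta['page']}.{seen[key]}")
        seen[key] += 1
    return ids


# Hypothetical metadata: two chunks from page 0, one chunk from page 1
example_metadata = [
    {"source": "pdfs/intro.pdf", "page": 0},
    {"source": "pdfs/intro.pdf", "page": 0},
    {"source": "pdfs/intro.pdf", "page": 1},
]
chunk_ids = unique_chunk_ids(example_metadata)
print(chunk_ids)
```

In the script, you could call this once with [chunk.metadata for chunk in chunks] and pair the result with the chunks using zip.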

Documents

The code loads a set of PDF documents in a directory.

python

loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

Depending on how your documents are stored, you may need to use a different loader.

LangChain includes integrations for a range of file types and storage systems.

For example, you can load data from a CSV file using the CSVLoader.

python

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="path/to/csv_file.csv")

You can find more information in the LangChain Document loaders how-to guide.
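Before you change the loader or its glob pattern, it can help to preview which files a pattern such as **/*.pdf selects. This sketch uses pathlib from the Python standard library, which accepts the same pattern syntax. It assumes DirectoryLoader resolves the pattern in a comparable way, so treat it as a quick check rather than a guarantee. The preview_matches helper is a hypothetical name, not a LangChain function.

```python
from pathlib import Path


def preview_matches(docs_path, pattern="**/*.pdf"):
    """Return the files under docs_path selected by the glob pattern.

    Paths are relative to docs_path, use forward slashes, and are sorted.
    """
    root = Path(docs_path)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )
```

For example, preview_matches(DOCS_PATH) lists the PDF files in the course data directory and its subdirectories, and preview_matches(DOCS_PATH, "*.csv") lists only CSV files at the top level.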

Allowed nodes and relationships

You can modify the code to define a set schema for the knowledge graph by specifying the allowed nodes and relationships.

When using the LLM Graph Builder, you modified the schema to include only the following node labels:

Technology
Concept
Skill
Event
Person
Object

To achieve the same result, pass the list of labels as allowed_nodes when creating the LLMGraphTransformer instance.

python

doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    )

You can also restrict the relationships by specifying the allowed_relationships parameter.

python

doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    allowed_relationships=["USES", "HAS", "IS", "AT", "KNOWS"],
)

Restricting the nodes and relationships results in a more concise knowledge graph. A more concise graph may help you answer specific questions, but it may also be missing information, because the model only generates nodes and relationships that are allowed.
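To see the effect of a schema restriction, you can count the node labels the transformer produced and flag any that fall outside your list. This is a minimal sketch that relies only on each node having a type attribute, as the nodes in a graph document do. SimpleNode is a hypothetical stand-in used so the example runs without calling an LLM, and both helper functions are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SimpleNode:
    """Hypothetical stand-in for a graph document node; only `type` is used."""
    id: str
    type: str


def label_counts(nodes):
    """Count how many nodes carry each label."""
    return Counter(node.type for node in nodes)


def unexpected_labels(nodes, allowed_nodes):
    """Count nodes whose label is not in the allowed list."""
    allowed = set(allowed_nodes)
    return Counter(node.type for node in nodes if node.type not in allowed)


# Hypothetical extraction result
nodes = [
    SimpleNode("Neo4j", "Technology"),
    SimpleNode("Cypher", "Technology"),
    SimpleNode("Graph", "Concept"),
    SimpleNode("Acme", "Company"),
]
allowed = ["Technology", "Concept", "Skill", "Event", "Person", "Object"]

counts = label_counts(nodes)
outside = unexpected_labels(nodes, allowed)
print(counts)
print(outside)
```

With real output, you could build the node list with [node for graph_doc in graph_docs for node in graph_doc.nodes].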

Properties

Currently, the LLM will only extract the nodes and relationships from the text. You can also instruct it to include properties for the nodes and relationships by specifying the node_properties and relationship_properties parameters.

Specifying properties will result in nodes and relationships with additional metadata. The properties will only be present if the LLM can generate them from the text provided.

In this example, a name and description property will be added if the values can be determined from the text.

python

doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    node_properties=["name", "description"],
)

Defining properties allows you to increase the granularity of the knowledge graph at the cost of the build process taking longer.
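Because properties are only present when the LLM can determine them, it can be useful to measure how often each requested property was actually filled in. The sketch below assumes each node exposes a properties dictionary, as LangChain graph document nodes do. The property_coverage helper and the sample nodes are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def property_coverage(nodes, requested):
    """For each requested property, count the nodes with a non-empty value."""
    coverage = {name: 0 for name in requested}
    for node in nodes:
        # Treat a missing or None properties attribute as an empty dictionary
        props = getattr(node, "properties", None) or {}
        for name in requested:
            if props.get(name):
                coverage[name] += 1
    return coverage


# Hypothetical nodes: the second one has no description
sample_nodes = [
    SimpleNamespace(
        id="Neo4j",
        type="Technology",
        properties={"name": "Neo4j", "description": "A graph database"},
    ),
    SimpleNamespace(id="Cypher", type="Technology", properties={"name": "Cypher"}),
]

coverage = property_coverage(sample_nodes, ["name", "description"])
print(coverage)
```

A low count for a property suggests the source text rarely contains that information, which may help you decide whether the longer build time is worthwhile.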

Structured data

When generating the knowledge graph, you can also include structured data about the documents.

In this example, the documents are part of a GraphAcademy course and you could extend the graph to include Course, Module, and Lesson nodes.

You can learn more about importing data from unstructured data sources in the GraphAcademy course Introduction to Vector Indexes and Unstructured Data.
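As an illustration of adding structured data, the sketch below derives course, module, and lesson names from a document filename and prepares parameters for a Cypher statement. It assumes a hypothetical naming convention of course_module_lesson.pdf, which may not match your files. The relationship types HAS_MODULE, HAS_LESSON, and HAS_DOCUMENT and the lesson_params helper are also assumptions made for this example.

```python
import os

# Hypothetical Cypher: links a Document to Course, Module, and Lesson nodes
LESSON_CYPHER = """
MERGE (c:Course {name: $course})
MERGE (c)-[:HAS_MODULE]->(m:Module {name: $module})
MERGE (m)-[:HAS_LESSON]->(l:Lesson {name: $lesson})
MERGE (d:Document {id: $filename})
MERGE (l)-[:HAS_DOCUMENT]->(d)
"""


def lesson_params(path):
    """Split '<course>_<module>_<lesson>.pdf' into query parameters.

    Returns None when the filename does not follow the assumed convention.
    """
    filename = os.path.basename(path)
    stem = os.path.splitext(filename)[0]
    parts = stem.split("_")
    if len(parts) != 3:
        return None
    course, module, lesson = parts
    return {
        "filename": filename,
        "course": course,
        "module": module,
        "lesson": lesson,
    }


params = lesson_params("pdfs/llm-fundamentals_1-introduction_2-hallucination.pdf")
print(params)
```

Inside the chunk loop, you could then run graph.query(LESSON_CYPHER, params) whenever params is not None.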

Generate the graph

When you are ready, run the create_kg.py script to generate the knowledge graph.

Once the graph is built, this query will match the documents and return up to 50 connected nodes and the corresponding relationships:

cypher

MATCH (d:Document)-[*]-(n)
RETURN d,n
LIMIT 50
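You can also summarise the generated graph from Python by counting nodes per label. This sketch takes the query function as an argument so it can be tried without a database. It assumes the function returns a list of dictionaries, as graph.query does in the script. The label_summary helper and the fake_query stub are hypothetical.

```python
# Counts nodes for every label in the graph
LABEL_COUNT_CYPHER = """
MATCH (n)
UNWIND labels(n) AS label
RETURN label, count(*) AS total
ORDER BY total DESC
"""


def label_summary(run_query):
    """Run the label count query and return a {label: total} dictionary."""
    rows = run_query(LABEL_COUNT_CYPHER)
    return {row["label"]: row["total"] for row in rows}


def fake_query(cypher):
    """Hypothetical stub that mimics the shape of the query result."""
    return [
        {"label": "Chunk", "total": 3},
        {"label": "Document", "total": 1},
    ]


summary = label_summary(fake_query)
print(summary)
```

Against the real graph, you could call label_summary(graph.query) after the script has finished.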

In the next module, you will explore methods of querying the knowledge graph.

Experiment with the allowed_nodes, allowed_relationships, and node_properties parameters to see how they affect the knowledge graph.

If you want to reset the sandbox and start again, you can delete all the nodes and relationships in the graph by running the following Cypher:

cypher

// Delete all nodes and relationships
MATCH (n) DETACH DELETE n

When you are ready, move on to the next lesson.

Check Your Understanding

1. Allowed Nodes

What are the implications of specifying allowed nodes and relationships in the LLM Graph Transformer?

Select all that apply.

python

doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Customer", "Product", "Price", "Sale"],
)

✓ The graph may contain less information.
❏ The graph may contain other nodes.
✓ The graph will only contain the nodes Customer, Product, Price, Sale.
❏ The graph will contain no relationships as none are specified.

Hint

Specifying allowed_nodes will result in a more concise knowledge graph. The allowed_relationships parameter can be used to restrict the relationships.

Solution

The correct answers are:

The graph may contain less information.
The graph will only contain the nodes Customer, Product, Price, Sale.

Relationships of any type will be included unless you specify the allowed_relationships parameter.

Lesson Summary

In this lesson, you learned how to build a knowledge graph using Python and LangChain.

In the next optional challenge, you can upload your own documents and build a knowledge graph from them.
