Build a KG with Python

In the previous lesson, you reviewed code snippets required to implement the knowledge graph build process.

In this lesson, you will explore and modify the complete Python code to build a knowledge graph using LangChain.

Open the llm-knowledge-graph/create_kg.py file.

View create_kg.py
python
import os

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs.graph_document import Node, Relationship

from dotenv import load_dotenv
load_dotenv()

DOCS_PATH = "llm-knowledge-graph/data/course/pdfs"

llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'), 
    model_name="gpt-3.5-turbo"
)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
    )

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

doc_transformer = LLMGraphTransformer(
    llm=llm,
    )

# Load and split the documents
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

docs = loader.load()
chunks = text_splitter.split_documents(docs)

for chunk in chunks:

    filename = os.path.basename(chunk.metadata["source"])
    chunk_id = f"{filename}.{chunk.metadata['page']}"
    print("Processing -", chunk_id)

    # Embed the chunk
    chunk_embedding = embedding_provider.embed_query(chunk.page_content)

    # Add the Document and Chunk nodes to the graph
    properties = {
        "filename": filename,
        "chunk_id": chunk_id,
        "text": chunk.page_content,
        "embedding": chunk_embedding
    }
    
    graph.query("""
        MERGE (d:Document {id: $filename})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $text
        MERGE (d)<-[:PART_OF]-(c)
        WITH c
        CALL db.create.setNodeVectorProperty(c, 'textEmbedding', $embedding)
        """, 
        properties
    )

    # Generate the entities and relationships from the chunk
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

    # Map the entities in the graph documents to the chunk node
    for graph_doc in graph_docs:
        chunk_node = Node(
            id=chunk_id,
            type="Chunk"
        )

        for node in graph_doc.nodes:

            graph_doc.relationships.append(
                Relationship(
                    source=chunk_node,
                    target=node, 
                    type="HAS_ENTITY"
                    )
                )

    # add the graph documents to the graph
    graph.add_graph_documents(graph_docs)

# Create the vector index
graph.query("""
    CREATE VECTOR INDEX `chunkVector`
    IF NOT EXISTS
    FOR (c: Chunk) ON (c.textEmbedding)
    OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
    }};""")

Review the code; you should be able to identify the sections of the code that:

  1. Gather the data

  2. Chunk the data

  3. Vectorize the data

  4. Pass the data to an LLM to extract nodes and relationships

  5. Use the output to generate the graph

This is a standard process to build a knowledge graph and can be adapted to suit your use case.
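To make the chunking step concrete, here is a minimal, self-contained sketch of fixed-size splitting with overlap. This is an illustration only, not LangChain's actual implementation - CharacterTextSplitter also respects separators such as "\n\n" - but it shows why consecutive chunks share text:

```python
def split_text(text, chunk_size=1500, chunk_overlap=200):
    """Naive fixed-size splitter with overlap (illustration only)."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 3000-character sample produces 3 chunks; each pair of consecutive
# chunks shares chunk_overlap characters so no context is lost at a boundary.
sample = "".join(str(i % 10) for i in range(3000))
chunks = split_text(sample)
```

The overlap means a sentence cut off at the end of one chunk is repeated at the start of the next, which helps the LLM and the embeddings retain context across chunk boundaries.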

Documents

The code loads a set of PDF documents in a directory.

python
loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

Depending on how your documents are stored, you may need to modify the loader to load the documents.

LangChain includes integrations for many different file types and storage services.

For example, you can load data from a CSV file using the CSVLoader.

python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="path/to/csv_file.csv")

You can find more information in the LangChain Document loaders how-to guide.
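Similarly, to load a directory of plain-text files you could combine DirectoryLoader with TextLoader. This is a sketch - the path here is hypothetical, and you would adjust the glob pattern to match your files:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file under a (hypothetical) docs directory
loader = DirectoryLoader("path/to/docs", glob="**/*.txt", loader_cls=TextLoader)
```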

Allowed nodes and relationships

You can modify the code to define a set schema for the knowledge graph by specifying the allowed nodes and relationships.

When using the LLM Graph Builder you modified the schema to only include the following node labels:

  • Technology

  • Concept

  • Skill

  • Event

  • Person

  • Object

To achieve the same thing you need to include the list of labels as allowed_nodes when creating the LLMGraphTransformer instance.

python
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    )

You can also restrict the relationships by specifying the allowed_relationships parameter.

python
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    allowed_relationships=["USES", "HAS", "IS", "AT", "KNOWS"],
)

Restricting the nodes and relationships will result in a more concise knowledge graph. A more concise graph may help you answer specific questions, but it could also be missing information, because the model will only generate nodes and relationships that are allowed.

Properties

Currently, the LLM will only extract the nodes and relationships from the text. You can also instruct it to include properties for the nodes and relationships by specifying the properties parameter.

Specifying properties will result in nodes and relationships with additional metadata. The properties will only be present if the LLM can generate them from the text provided.

In this example, a name and description property will be added if the values can be determined from the text.

python
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    node_properties=["name", "description"],
)

Defining properties allows you to increase the granularity of the knowledge graph at the cost of the build process taking longer.
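The same idea applies to relationships: the transformer also accepts a relationship_properties parameter. A sketch, with an illustrative property name:

```python
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    node_properties=["name", "description"],
    # Ask the LLM to also describe each extracted relationship
    relationship_properties=["description"],
)
```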

Structured data

When generating the knowledge graph, you can also include structured data about the documents.

In this example, the documents are part of a GraphAcademy course and you could extend the graph to include Course, Module, and Lesson nodes.

Graph data model showing Course
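As a sketch of how that extension might look, the following Cypher links each Document node to hypothetical Course, Module, and Lesson nodes. The labels come from the data model above; the property names, values, and relationship types here are illustrative:

```cypher
// Create the (hypothetical) course structure
MERGE (course:Course {name: 'Knowledge Graphs'})
MERGE (module:Module {name: 'Building Knowledge Graphs'})
MERGE (lesson:Lesson {name: 'Build a KG with Python'})
MERGE (course)-[:HAS_MODULE]->(module)
MERGE (module)-[:HAS_LESSON]->(lesson)
// Attach the existing documents to the lesson
WITH lesson
MATCH (d:Document)
MERGE (lesson)-[:HAS_DOCUMENT]->(d)
```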

You can learn more about importing data from unstructured data sources in the GraphAcademy course Introduction to Vector Indexes and Unstructured Data.

Generate the graph

When you are ready, run the create_kg.py script to generate the knowledge graph.

This query will match the documents and return the first 50 nodes and corresponding relationships:

cypher
MATCH (d:Document)-[*]-(n)
RETURN d,n
LIMIT 50
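As a quick sanity check that the embeddings were stored correctly, you can also query the chunkVector index directly. The $embedding parameter would be a 1536-dimension vector produced by the same embedding model; querying the graph is explored in depth in the next module:

```cypher
// Find the 5 chunks most similar to a query embedding
CALL db.index.vector.queryNodes('chunkVector', 5, $embedding)
YIELD node, score
RETURN node.id, score
```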

In the next module, you will explore methods of querying the knowledge graph.

Experiment with the allowed_nodes, allowed_relationships, and properties parameters to see how they affect the knowledge graph.

If you want to reset the sandbox and start again, you can delete all the nodes and relationships in the graph by running the following Cypher:

cypher
// Delete all nodes and relationships
MATCH (n) DETACH DELETE n

When you are ready, move on to the next lesson.

Check Your Understanding

1. Allowed Nodes

What are the implications of specifying allowed nodes and relationships in the LLM Graph Transformer?

Select all that apply.

python
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Customer", "Product", "Price", "Sale"],
)
  • ✓ The graph may contain less information.

  • ❏ The graph may contain other nodes.

  • ✓ The graph will only contain the nodes Customer, Product, Price, Sale.

  • ❏ The graph will contain no relationships as none are specified.

Hint

Specifying allowed_nodes will result in a more concise knowledge graph. The allowed_relationships parameter can be used to restrict the relationships.

Solution

The correct answers are:

  • The graph may contain less information.

  • The graph will only contain the nodes Customer, Product, Price, Sale.

Relationships of any type will be included unless you specify the allowed_relationships parameter.

Lesson Summary

In this lesson, you learned how to build a knowledge graph using Python and LangChain.

In the next optional challenge, you can upload your own documents and build a knowledge graph from them.