Define a schema

The knowledge graph you created is unconstrained, meaning that any entity or relationship can be created based on the data extracted from the text. This can lead to graphs that are non-specific and may be difficult to analyze and query.

In this lesson, you will modify the SimpleKGPipeline to use a custom schema for the knowledge graph.

Schema

When you provide a schema to the SimpleKGPipeline, it will pass this information to the LLM instructing it to only identify those nodes and relationships. This allows you to create a more structured and meaningful knowledge graph.

You define a schema by expressing the desired nodes, relationships, or patterns you want to extract from the text.

For example, you might want to extract the following information:

  • nodes - Person, Organization, Location

  • relationships - WORKS_AT, LOCATED_IN

  • patterns - (Person)-[WORKS_AT]→(Organization), (Organization)-[LOCATED_IN]→(Location)
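As a sketch, the example above maps onto the plain Python lists and tuples that the SimpleKGPipeline schema dictionary expects (the same shapes used later in this lesson):

```python
# Sketch: the Person/Organization/Location example expressed as the
# data structures passed to SimpleKGPipeline's schema parameter.
NODE_TYPES = ["Person", "Organization", "Location"]
RELATIONSHIP_TYPES = ["WORKS_AT", "LOCATED_IN"]
PATTERNS = [
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "LOCATED_IN", "Location"),
]

schema = {
    "node_types": NODE_TYPES,
    "relationship_types": RELATIONSHIP_TYPES,
    "patterns": PATTERNS,
}
```

Each pattern is a `(source_label, relationship_type, target_label)` tuple describing one allowed connection.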

Iterate your schema

You don’t have to define nodes, relationships, and patterns all at once. You can start with just nodes or just relationships and expand your schema as needed.

For example, if you only define nodes, the LLM will find any relationships between those nodes based on the text.

This approach can help you iteratively build and refine your knowledge graph schema.

Nodes

Open genai-graphrag-python/kg_builder_schema.py and review the code:

python
kg_builder_schema.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

You define the node types as a list of node labels and pass the list to the SimpleKGPipeline when creating the pipeline instance.

python
NODES
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

Define relevant nodes

You should define the node labels that are relevant to your domain and the information you want to extract from the text.

You can also provide a description for each node label and associated properties to help guide the LLM when extracting entities.

python
Node descriptions and properties
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

Run the program to create the knowledge graph with the defined nodes.

Remember to delete the existing graph before re-running the pipeline.

cypher
Delete the existing graph
MATCH (n) DETACH DELETE n
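If you prefer to clear the graph from your Python script rather than the Neo4j Browser, a small helper can run the same statement. This is a sketch, not part of the course code, and it assumes the Neo4j Python driver 5.x `execute_query` method:

```python
# Sketch: clear the graph before re-running the pipeline.
# Assumes the Neo4j Python driver 5.x, which provides execute_query.
def clear_graph(driver, database=None):
    """Delete all nodes and relationships so the pipeline starts fresh."""
    driver.execute_query("MATCH (n) DETACH DELETE n", database_=database)
```

For example, `clear_graph(neo4j_driver, os.getenv("NEO4J_DATABASE"))` before calling `kg_builder.run_async`.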

The graph created will be constrained to only include the defined node labels.

cypher
View the entities extracted from each chunk
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

Relationships

You specify the allowed relationship types by providing a list of them to the SimpleKGPipeline.

python
RELATIONSHIP_TYPES
RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

You can also provide patterns that define how node types are connected by relationships.

python
PATTERNS
PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

Nodes, relationships, and patterns are all passed to the SimpleKGPipeline as the schema when creating the pipeline:

python
schema
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

Review the data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf PDF document and experiment by creating a set of nodes, relationships, and patterns relevant to the data.

Process all the documents

When you are happy with the schema, you can modify the program to process all the PDF documents from the GraphAcademy Neo4j and Generative AI Fundamentals course:

python
All PDFs
data_path = "./genai-graphrag-python/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

You can run the program to create a knowledge graph based on all the documents using the defined schema.
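Note that `os.listdir` returns files in an arbitrary order. If you want the documents processed in a stable order, a pathlib-based variant (a sketch, not part of the original script) sorts the PDF paths before iterating:

```python
from pathlib import Path

# Sketch: a deterministic alternative to os.listdir for collecting the PDFs.
def list_pdfs(data_path):
    """Return the PDF file paths under data_path in sorted order."""
    return [str(p) for p in sorted(Path(data_path).glob("*.pdf"))]
```

You could then use `for pdf_file in list_pdfs("./genai-graphrag-python/data/"):` in place of the `os.listdir` list comprehension.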

Reveal the complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

data_path = "./genai-graphrag-python/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

Review the knowledge graph and observe how the defined schema has influenced the structure of the graph.

cypher
Documents, Chunks, and Entity counts
RETURN
  count{ (:Document) } AS documents,
  count{ (:Chunk) } AS chunks,
  count{ (:__Entity__) } AS entities

Check your understanding

Why would you define a schema when using the SimpleKGPipeline?

  • ❏ To improve the performance and speed of data processing

  • ❏ To reduce the computational resources required by the LLM

  • ✓ To create a more structured and meaningful knowledge graph by constraining entities and relationships

  • ❏ To ensure that all possible entities and relationships are extracted from the text

Hint

Think about what happens when you don’t provide a schema - the knowledge graph becomes unconstrained. What problems might this cause?

Solution

Defining a schema allows you to create a more structured and meaningful knowledge graph by constraining the entities and relationships that are extracted. Without a schema, the knowledge graph is unconstrained, meaning any entity or relationship can be created, which can lead to graphs that are non-specific and difficult to analyze and query.

Lesson Summary

In this lesson, you learned how to define a custom schema for the knowledge graph.

In the next lesson, you will learn how to add structured data to the knowledge graph.
