Define a schema

The knowledge graph you created is unconstrained, meaning that any entity or relationship can be created based on the data extracted from the text. This can lead to graphs that are non-specific and may be difficult to analyze and query.

In this lesson, you will modify the SimpleKGPipeline to use a custom schema for the knowledge graph.

Schema

When you provide a schema to the SimpleKGPipeline, it will pass this information to the LLM instructing it to only identify those nodes and relationships. This allows you to create a more structured and meaningful knowledge graph.

You define a schema by expressing the desired nodes, relationships, or patterns you want to extract from the text.

For example, you might want to extract the following information:

  • nodes - Person, Organization, Location

  • relationships - WORKS_AT, LOCATED_IN

  • patterns - (Person)-[WORKS_AT]→(Organization), (Organization)-[LOCATED_IN]→(Location)
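As a sketch, the example above maps onto the plain Python lists and tuples that the SimpleKGPipeline schema dictionary expects (the same shapes used later in this lesson):

```python
# Sketch: the Person/Organization/Location example expressed as the
# data structures passed to SimpleKGPipeline's schema parameter.
NODE_TYPES = ["Person", "Organization", "Location"]
RELATIONSHIP_TYPES = ["WORKS_AT", "LOCATED_IN"]
PATTERNS = [
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "LOCATED_IN", "Location"),
]

schema = {
    "node_types": NODE_TYPES,
    "relationship_types": RELATIONSHIP_TYPES,
    "patterns": PATTERNS,
}
```

Each pattern is a `(source_label, relationship_type, target_label)` tuple describing one allowed connection.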

Iterate your schema

You don’t have to define nodes, relationships, and patterns all at once. You can start with just nodes or just relationships and expand your schema as needed.

For example, if you only define nodes, the LLM will find any relationships between those nodes based on the text.

This approach can help you iteratively build and refine your knowledge graph schema.

Nodes

Open genai-graphrag-python/kg_builder_schema.py and review the code:

python
kg_builder_schema.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

You define the node types as a list of node labels and pass the list to the SimpleKGPipeline when creating the pipeline instance.

python
NODES
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

Define relevant nodes

You should define the node labels that are relevant to your domain and the information you want to extract from the text.

You can also provide a description for each node label and associated properties to help guide the LLM when extracting entities.

python
Node descriptions and properties
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

Run the program to create the knowledge graph with the defined nodes.

Remember to delete the existing graph before re-running the pipeline.

cypher
Delete the existing graph
MATCH (n) DETACH DELETE n
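If you prefer to clear the graph from your Python script rather than the Neo4j Browser, a small helper can run the same statement. This is a sketch, not part of the course code, and it assumes the Neo4j Python driver 5.x `execute_query` method:

```python
# Sketch: clear the graph before re-running the pipeline.
# Assumes the Neo4j Python driver 5.x, which provides execute_query.
def clear_graph(driver, database=None):
    """Delete all nodes and relationships so the pipeline starts fresh."""
    driver.execute_query("MATCH (n) DETACH DELETE n", database_=database)
```

For example, `clear_graph(neo4j_driver, os.getenv("NEO4J_DATABASE"))` before calling `kg_builder.run_async`.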

The graph created will be constrained to only include the defined node labels.

cypher
View the entities extracted from each chunk
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

Relationships

You specify the allowed relationship types by providing a list of them to the SimpleKGPipeline.

python
RELATIONSHIP_TYPES
RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

You can also provide patterns that define how node types are connected by relationships.

python
PATTERNS
PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

Nodes, relationships, and patterns are all passed to the SimpleKGPipeline as the schema when creating the pipeline:

python
schema
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

Review the data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf PDF document and experiment by creating a set of nodes, relationships, and patterns relevant to the data.

Process all the documents

When you are happy with the schema, you can modify the program to process all the PDF documents from the GraphAcademy Neo4j and Generative AI Fundamentals course:

python
All PDFs
data_path = "./genai-graphrag-python/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

You can run the program to create a knowledge graph based on all the documents using the defined schema.
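Note that `os.listdir` returns files in an arbitrary order. If you want the documents processed in a stable order, a pathlib-based variant (a sketch, not part of the original script) sorts the PDF paths before iterating:

```python
from pathlib import Path

# Sketch: a deterministic alternative to os.listdir for collecting the PDFs.
def list_pdfs(data_path):
    """Return the PDF file paths under data_path in sorted order."""
    return [str(p) for p in sorted(Path(data_path).glob("*.pdf"))]
```

You could then use `for pdf_file in list_pdfs("./genai-graphrag-python/data/"):` in place of the `os.listdir` list comprehension.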

Reveal the complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

data_path = "./genai-graphrag-python/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

Review the knowledge graph and observe how the defined schema has influenced the structure of the graph.

cypher
Documents, Chunks, and Entity counts
RETURN
  count{ (:Document) } AS documents,
  count{ (:Chunk) } AS chunks,
  count{ (:__Entity__) } AS entities

Check your understanding

Why would you define a schema when using the SimpleKGPipeline?

  • ❏ To improve the performance and speed of data processing

  • ❏ To reduce the computational resources required by the LLM

  • ✓ To create a more structured and meaningful knowledge graph by constraining entities and relationships

  • ❏ To ensure that all possible entities and relationships are extracted from the text

Hint

Think about what happens when you don’t provide a schema - the knowledge graph becomes unconstrained. What problems might this cause?

Solution

Defining a schema allows you to create a more structured and meaningful knowledge graph by constraining the entities and relationships that are extracted. Without a schema, the knowledge graph is unconstrained, meaning any entity or relationship can be created, which can lead to graphs that are non-specific and difficult to analyze and query.

Lesson Summary

In this lesson, you learned how to define a custom schema for the knowledge graph.

In the next lesson, you will learn how to add structured data to the knowledge graph.
