Define a schema

Overview

The knowledge graph you created is unconstrained, meaning that any entity or relationship can be created based on the data extracted from the text.

This can lead to unfocused graphs that are difficult to analyze and query.

In this lesson, you will modify the SimpleKGPipeline to use a custom schema for the knowledge graph.

Schema

When you provide a schema to the SimpleKGPipeline, it passes this information to the LLM, instructing it to identify only the specified nodes and relationships.

This allows you to create a more structured and meaningful knowledge graph.

You define a schema by expressing the desired nodes, relationships, or patterns you want to extract from the text.

For example, you might want to extract the following information:

  • nodes - Person, Organization, Location

  • relationships - WORKS_AT, LOCATED_IN

  • patterns - (Person)-[WORKS_AT]->(Organization), (Organization)-[LOCATED_IN]->(Location)
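
The example above maps directly onto the schema dictionary that the SimpleKGPipeline accepts, where each pattern is a (source, relationship, target) tuple. A minimal sketch (the Person/Organization labels are illustrative, not part of this lesson's dataset):

```python
# Illustrative schema for the Person/Organization example above.
# The keys match those used later in this lesson.
schema = {
    "node_types": ["Person", "Organization", "Location"],
    "relationship_types": ["WORKS_AT", "LOCATED_IN"],
    "patterns": [
        ("Person", "WORKS_AT", "Organization"),
        ("Organization", "LOCATED_IN", "Location"),
    ],
}
```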

Iterate your schema

You don’t have to define nodes, relationships, and patterns all at once. You can start with just nodes or just relationships and expand your schema as needed.

For example, if you only define nodes, the LLM will find any relationships between those nodes based on the text.
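
A first iteration might constrain only the node labels, with relationship types and patterns added in a later pass once you have reviewed the extracted graph. A sketch of the idea (labels are illustrative):

```python
# First pass: constrain node labels only; the LLM infers any relationships.
schema_v1 = {
    "node_types": ["Person", "Organization"],
}

# Later pass: add the relationship types observed in the first graph.
schema_v2 = {
    **schema_v1,
    "relationship_types": ["WORKS_AT"],
    "patterns": [("Person", "WORKS_AT", "Organization")],
}
```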

This approach can help you iteratively build and refine your knowledge graph schema.

Continue with the lesson to define the schema.

Nodes

Open workshop-genai/kg_builder_schema.py and review the code:

python
kg_builder_schema.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

You define the node types as a list of node labels (NODE_TYPES) and pass the list to the SimpleKGPipeline in the schema when creating the pipeline instance.

python
NODE_TYPES
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

Define relevant nodes

You should define the node labels that are relevant to your domain and the information you want to extract from the text.

You can also provide a description for each node label and associated properties to help guide the LLM when extracting entities.

python
Node descriptions and properties
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

Recreate the knowledge graph with the defined nodes:

  1. Delete any existing nodes and relationships.

    cypher
    Delete the existing graph
    MATCH (n) DETACH DELETE n
  2. Run the program.

    The graph will be constrained to only include the defined node labels.

View the entities and chunks in the graph using the following Cypher query:

cypher
Entities and Chunks
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

Relationships

You can restrict the relationship types by providing a list to the SimpleKGPipeline.

python
RELATIONSHIP_TYPES
RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

You can also describe patterns that define how nodes are connected by relationships.

python
PATTERNS
PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

Nodes, relationships and patterns are all passed to the SimpleKGPipeline as the schema when creating the pipeline:

python
schema
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)
Reveal the complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)


NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Review the data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf PDF document and experiment by creating your own NODE_TYPES, RELATIONSHIP_TYPES, and PATTERNS relevant to the data.

Recreate the knowledge graph:

  1. Delete any existing nodes and relationships.

  2. Run the program.

Process all the documents?

In the next lesson, you will add structured data to the knowledge graph, and process all of the documents.

Optionally, you could modify the program now to process the documents from the data directory without the structured data:

python
All PDFs
data_path = "./workshop-genai/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

Explore

Review the knowledge graph and observe how the defined schema has influenced the structure of the graph:

cypher
Entities and Chunks
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

View the counts of documents, chunks and entities in the graph:

cypher
Documents, Chunks, and Entity counts
RETURN
  count{ (:Document) } as documents,
  count{ (:Chunk) } as chunks,
  count{ (:__Entity__) } as entities

Lesson Summary

In this lesson, you learned how to define a custom schema for the knowledge graph.

In the next lesson, you will learn how to add structured data to the knowledge graph.
