Define a schema

Overview

The knowledge graph you created is unconstrained, meaning that any entity or relationship can be created based on the data extracted from the text.

This can lead to unfocused graphs that are difficult to analyze and query.

In this lesson, you will modify the SimpleKGPipeline to use a custom schema for the knowledge graph.

Schema

When you provide a schema to the SimpleKGPipeline, it passes this information to the LLM, instructing it to identify only the specified nodes and relationships.

This allows you to create a more structured and meaningful knowledge graph.

You define a schema by expressing the desired nodes, relationships, or patterns you want to extract from the text.

For example, you might want to extract the following information:

  • nodes - Person, Organization, Location

  • relationships - WORKS_AT, LOCATED_IN

  • patterns - (Person)-[WORKS_AT]->(Organization), (Organization)-[LOCATED_IN]->(Location)
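
The example above maps directly onto the schema dictionary that the SimpleKGPipeline accepts, where each pattern is a (source, relationship, target) tuple. A minimal sketch (the Person/Organization labels are illustrative, not part of this lesson's dataset):

```python
# Illustrative schema for the Person/Organization example above.
# The keys match those used later in this lesson.
schema = {
    "node_types": ["Person", "Organization", "Location"],
    "relationship_types": ["WORKS_AT", "LOCATED_IN"],
    "patterns": [
        ("Person", "WORKS_AT", "Organization"),
        ("Organization", "LOCATED_IN", "Location"),
    ],
}
```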

Iterate your schema

You don’t have to define nodes, relationships, and patterns all at once. You can start with just nodes or just relationships and expand your schema as needed.

For example, if you only define nodes, the LLM will find any relationships between those nodes based on the text.
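
A first iteration might constrain only the node labels, with relationship types and patterns added in a later pass once you have reviewed the extracted graph. A sketch of the idea (labels are illustrative):

```python
# First pass: constrain node labels only; the LLM infers any relationships.
schema_v1 = {
    "node_types": ["Person", "Organization"],
}

# Later pass: add the relationship types observed in the first graph.
schema_v2 = {
    **schema_v1,
    "relationship_types": ["WORKS_AT"],
    "patterns": [("Person", "WORKS_AT", "Organization")],
}
```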

This approach can help you iteratively build and refine your knowledge graph schema.

Continue with the lesson to define the schema.

Nodes

Open workshop-genai/kg_builder_schema.py and review the code:

python
kg_builder_schema.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

You define the node types as a list of node labels (NODE_TYPES) and pass the list to the SimpleKGPipeline in the schema when creating the pipeline instance.

python
NODE_TYPES
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
    },
)

Define relevant nodes

You should define the node labels that are relevant to your domain and the information you want to extract from the text.

You can also provide a description for each node label and associated properties to help guide the LLM when extracting entities.

python
Node descriptions and properties
NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

Recreate the knowledge graph with the defined nodes:

  1. Delete any existing nodes and relationships.

    cypher
    Delete the existing graph
    MATCH (n) DETACH DELETE n
  2. Run the program.

    The graph will be constrained to only include the defined node labels.

View the entities and chunks in the graph using the following Cypher query:

cypher
Entities and Chunks
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

Relationships

You can restrict the relationship types by providing a list to the SimpleKGPipeline.

python
RELATIONSHIP_TYPES
RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

You can also describe patterns that define how nodes are connected by relationships.

python
PATTERNS
PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

Nodes, relationships and patterns are all passed to the SimpleKGPipeline as the schema when creating the pipeline:

python
schema
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)
Reveal the complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)


NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Review the data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf PDF document and experiment by creating your own NODE_TYPES, RELATIONSHIP_TYPES, and PATTERNS relevant to the data.

Recreate the knowledge graph:

  1. Delete any existing nodes and relationships.

  2. Run the program.

Process all the documents?

In the next lesson, you will add structured data to the knowledge graph, and process all of the documents.

Optionally, you could modify the program now to process the documents from the data directory without the structured data:

python
All PDFs
data_path = "./workshop-genai/data/"
pdf_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.pdf')]

for pdf_file in pdf_files:
    print(f"Processing {pdf_file}")
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

Explore

Review the knowledge graph and observe how the defined schema has influenced the structure of the graph:

cypher
Entities and Chunks
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p

View the counts of documents, chunks and entities in the graph:

cypher
Documents, Chunks, and Entity counts
RETURN
  count{ (:Document) } as documents,
  count{ (:Chunk) } as chunks,
  count{ (:__Entity__) } as entities

Lesson Summary

In this lesson, you learned how to define a custom schema for the knowledge graph.

In the next lesson, you will learn how to add structured data to the knowledge graph.
