Add structured data to the knowledge graph

Overview

The knowledge graph you created is based solely on unstructured data extracted from documents.

You may have access to structured data sources such as databases, CSV files, or APIs that contain valuable information relevant to your domain.

Combining the structured and unstructured data can enhance the knowledge graph’s richness and usefulness.

Lexical and Domain Graphs

The unstructured part of your graph is known as the Lexical Graph, while the structured part is known as the Domain Graph.

Structured data source

The repository includes a sample CSV file, workshop-genai/data/docs.csv, containing metadata about the lessons the documents were created from.

csv
Sample docs.csv
filename,course,module,lesson,url
genai-fundamentals_1-generative-ai_1-what-is-genai.pdf,genai-fundamentals,1-generative-ai,1-what-is-genai,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/1-what-is-genai
genai-fundamentals_1-generative-ai_2-considerations.pdf,genai-fundamentals,1-generative-ai,2-considerations,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/2-considerations
...

You can use the CSV file as a structured data source when creating the knowledge graph.

Continue with the lesson to load the structured data.

Load from CSV file

Open workshop-genai/kg_structured_builder.py and review the code.

python
kg_structured_builder.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio
import csv

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

data_path = "./workshop-genai/data/"

docs_csv = csv.DictReader(
    open(os.path.join(data_path, "docs.csv"), encoding="utf8", newline='')
)

cypher = """
MATCH (d:Document {path: $pdf_path})
MERGE (l:Lesson {url: $url})
SET l.name = $lesson,
    l.module = $module,
    l.course = $course
MERGE (d)-[:PDF_OF]->(l)
"""

for doc in docs_csv:

    # Create the complete PDF path
    doc["pdf_path"] = os.path.join(data_path, doc["filename"])
    print(f"Processing document: {doc['pdf_path']}")

    # Entity extraction and KG population
    result = asyncio.run(
        kg_builder.run_async(
            file_path=doc["pdf_path"]
        )
    )

    # Create structured graph
    records, summary, keys = neo4j_driver.execute_query(
        cypher,
        parameters_=doc,
        database_=os.getenv("NEO4J_DATABASE")
    )
    print(result, summary.counters)

The key differences are:

  1. The docs.csv file is loaded using csv.DictReader to read each row as a dictionary:

    python
    Load docs.csv
    data_path = "./workshop-genai/data/"
    
    docs_csv = csv.DictReader(
        open(os.path.join(data_path, "docs.csv"), encoding="utf8", newline='')
    )
  2. The path of the PDF document is constructed using the filename field from the CSV:

    python
    PDF path
        # Create the complete PDF path
        doc["pdf_path"] = os.path.join(data_path, doc["filename"])
        print(f"Processing document: {doc['pdf_path']}")
  3. A Cypher statement is defined to create Lesson nodes with properties from the CSV data:

    python
    Cypher statement
    cypher = """
    MATCH (d:Document {path: $pdf_path})
    MERGE (l:Lesson {url: $url})
    SET l.name = $lesson,
        l.module = $module,
        l.course = $course
    MERGE (d)-[:PDF_OF]->(l)
    """

    The pdf_path property is used as the key to match the Document nodes created from the PDF files.

  4. A Lesson node is created for each document using the Cypher statement and the CSV data:

    python
    Lesson nodes
        # Create structured graph
        records, summary, keys = neo4j_driver.execute_query(
            cypher,
            parameters_=doc,
            database_=os.getenv("NEO4J_DATABASE")
        )
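The CSV handling in steps 1 and 2 can be tried in isolation. The sketch below uses an in-memory sample rather than the real docs.csv, so the rows and paths are illustrative only:

```python
import csv
import io
import os

# Illustrative stand-in for docs.csv (not the real file contents)
sample = """filename,course,module,lesson,url
doc-1.pdf,genai-fundamentals,1-generative-ai,1-what-is-genai,https://example.com/1
doc-2.pdf,genai-fundamentals,1-generative-ai,2-considerations,https://example.com/2
"""

data_path = "./workshop-genai/data/"

# csv.DictReader yields one dictionary per row, keyed by the header fields
for doc in csv.DictReader(io.StringIO(sample)):
    # Construct the complete PDF path from the filename field
    doc["pdf_path"] = os.path.join(data_path, doc["filename"])
    print(doc["course"], "->", doc["pdf_path"])
```

Because each row is a plain dictionary, it can be passed straight to the driver as query parameters, which is what the builder script does.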

The resulting knowledge graph will now contain Lesson nodes connected to the Document nodes created from the PDF files:

A data model showing Lesson nodes connected to Document nodes using a PDF_OF relationship.

Run the program to create the knowledge graph with the structured data.

Remember to delete the existing graph before re-running the pipeline:

cypher
Delete the existing graph
MATCH (n) DETACH DELETE n

OpenAI Rate Limiting?

When using a free OpenAI API key, you may encounter rate limiting issues when processing multiple documents. You can add a sleep between document processing to mitigate this.
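One way to add that pause is a small helper that sleeps between iterations. The function name and the 20-second default below are illustrative, not part of the workshop code:

```python
import time

def process_with_pause(docs, process, pause_seconds=20):
    """Apply process() to each document, pausing between calls
    to stay under the OpenAI requests-per-minute limit."""
    results = []
    for i, doc in enumerate(docs):
        if i > 0:
            time.sleep(pause_seconds)  # wait before every document after the first
        results.append(process(doc))
    return results
```

In the builder script, process would wrap the kg_builder.run_async call and the Cypher query for a single CSV row.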

Explore the structured data

The structured data allows you to query the knowledge graph in new ways.

You can find all lessons that cover a specific technology or concept:

cypher
Find lessons about Knowledge Graphs
MATCH (kg:Technology)
MATCH (kg)-[:FROM_CHUNK]->(c)-[:FROM_DOCUMENT]-(d)-[:PDF_OF]-(l)
WHERE toLower(kg.name) CONTAINS "knowledge graph"
RETURN DISTINCT toLower(kg.name), l.name, l.url

Explore the knowledge graph

The knowledge graph allows you to summarize the content of each lesson by specific categories such as technologies and concepts:

cypher
Summarize lesson content
MATCH (lesson:Lesson)<-[:PDF_OF]-(:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
RETURN
  lesson.name,
  lesson.url,
  [ (c)<-[:FROM_CHUNK]-(tech:Technology) | tech.name ] AS technologies,
  [ (c)<-[:FROM_CHUNK]-(concept:Concept) | concept.name ] AS concepts

Spend some time exploring the knowledge graph and experiment with adding additional data.
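As one experiment, the module property already stored on each Lesson node could be promoted to its own node type. This sketch assumes a Module label and an IN_MODULE relationship, which are not part of the workshop schema:

```cypher
// Create a Module node per distinct module value and link lessons to it
MATCH (l:Lesson)
MERGE (m:Module {name: l.module})
MERGE (l)-[:IN_MODULE]->(m)
```

MERGE keeps the statement idempotent, so re-running it will not create duplicate Module nodes.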

Lesson Summary

In this lesson, you learned:

  • About benefits of adding structured data to a knowledge graph.

  • How to load structured data from a CSV file.

  • How to create nodes from structured data and connect them to unstructured data nodes.

In the next module, you will create retrievers to query the knowledge graph.
