Add Structured Data to the Knowledge Graph

The knowledge graph you created is based solely on unstructured data extracted from documents. You may also have access to structured data sources, such as databases, CSV files, or APIs, that contain valuable information relevant to your domain.

Combining the structured and unstructured data can enhance the knowledge graph’s richness and usefulness.

The unstructured part of your graph is known as the Lexical Graph, while the structured part is known as the Domain Graph.

Load from CSV file

The repository contains a sample CSV file, genai-graphrag-python/data/docs.csv, which holds metadata about the lessons the documents were created from.

csv
Sample docs.csv
filename,course,module,lesson,url
genai-fundamentals_1-generative-ai_1-what-is-genai.pdf,genai-fundamentals,1-generative-ai,1-what-is-genai,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/1-what-is-genai
genai-fundamentals_1-generative-ai_2-considerations.pdf,genai-fundamentals,1-generative-ai,2-considerations,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/2-considerations
...

You can use the CSV file as an input and as a structured data source when creating the knowledge graph.
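As a minimal sketch of what reading this file looks like, the snippet below parses the two sample rows shown above from an in-memory string using csv.DictReader. The SAMPLE_CSV constant and the read_rows helper are illustrative names and are not part of the repository.

```python
import csv
import io

# Two rows copied from the sample docs.csv above, held in memory for illustration.
SAMPLE_CSV = """filename,course,module,lesson,url
genai-fundamentals_1-generative-ai_1-what-is-genai.pdf,genai-fundamentals,1-generative-ai,1-what-is-genai,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/1-what-is-genai
genai-fundamentals_1-generative-ai_2-considerations.pdf,genai-fundamentals,1-generative-ai,2-considerations,https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/2-considerations
"""


def read_rows(text):
    """Hypothetical helper: parse CSV text into a list of dictionaries."""
    # DictReader uses the header row as the keys of every row dictionary.
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row["lesson"], "->", row["url"])
```

Each row becomes a dictionary keyed by the column names, so the values can later be passed directly as query parameters.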

Open genai-graphrag-python/kg_structured_builder.py and review the code.

python
kg_structured_builder.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio
import csv

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

NODE_TYPES = [
    "Technology",
    "Concept",
    "Example",
    "Process",
    "Challenge",
    {"label": "Benefit", "description": "A benefit or advantage of using a technology or approach."},
    {
        "label": "Resource",
        "description": "A related learning resource such as a book, article, video, or course.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True}, 
            {"name": "type", "type": "STRING"}
        ]
    },
]

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "HAS_CHALLENGE",
    "CITES"
]

PATTERNS = [
    ("Technology", "RELATED_TO", "Technology"),
    ("Concept", "RELATED_TO", "Technology"),
    ("Example", "USED_IN", "Technology"),
    ("Process", "PART_OF", "Technology"),
    ("Technology", "HAS_CHALLENGE", "Challenge"),
    ("Concept", "HAS_CHALLENGE", "Challenge"),
    ("Technology", "LEADS_TO", "Benefit"),
    ("Process", "LEADS_TO", "Benefit"),
    ("Resource", "CITES", "Technology"),
]

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

data_path = "./genai-graphrag-python/data/"

docs_csv = csv.DictReader(
    open(os.path.join(data_path, "docs.csv"), encoding="utf8", newline='')
)

cypher = """
MATCH (d:Document {path: $pdf_path})
MERGE (l:Lesson {url: $url})
SET l.name = $lesson,
    l.module = $module,
    l.course = $course
MERGE (d)-[:PDF_OF]->(l)
"""

for doc in docs_csv:

    # Create the complete PDF path
    doc["pdf_path"] = os.path.join(data_path, doc["filename"])
    print(f"Processing document: {doc['pdf_path']}")

    # Entity extraction and KG population
    result = asyncio.run(
        kg_builder.run_async(
            file_path=doc["pdf_path"]
        )
    )

    # Create structured graph
    records, summary, keys = neo4j_driver.execute_query(
        cypher,
        parameters_=doc,
        database_=os.getenv("NEO4J_DATABASE")
    )
    print(result, summary.counters)

The key differences are:

  1. The docs.csv file is loaded using csv.DictReader to read each row as a dictionary:

    python
    Load docs.csv
    data_path = "./genai-graphrag-python/data/"
    
    docs_csv = csv.DictReader(
        open(os.path.join(data_path, "docs.csv"), encoding="utf8", newline='')
    )
  2. The path of the PDF document is constructed using the filename field from the CSV:

    python
    PDF path
        # Create the complete PDF path
        doc["pdf_path"] = os.path.join(data_path, doc["filename"])
        print(f"Processing document: {doc['pdf_path']}")
  3. A Cypher statement is defined to create Lesson nodes with properties from the CSV data:

    python
    Cypher statement
    cypher = """
    MATCH (d:Document {path: $pdf_path})
    MERGE (l:Lesson {url: $url})
    SET l.name = $lesson,
        l.module = $module,
        l.course = $course
    MERGE (d)-[:PDF_OF]->(l)
    """

    The pdf_path is used as the key to match the Document nodes created from the PDF files.

  4. A Lesson node is created for each document using the Cypher statement and the CSV data:

    python
    Lesson nodes
        # Create structured graph
        records, summary, keys = neo4j_driver.execute_query(
            cypher,
            parameters_=doc,
            database_=os.getenv("NEO4J_DATABASE")
        )
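
The parameter preparation in the steps above can be sketched without a database connection. The build_lesson_params helper and the REQUIRED_COLUMNS constant are hypothetical names introduced for illustration; unlike the program above, this sketch copies the row instead of modifying it and raises an error when a column the Cypher statement relies on is empty or missing.

```python
import os

# Columns from docs.csv that the Cypher statement uses, directly or via pdf_path.
REQUIRED_COLUMNS = ("filename", "course", "module", "lesson", "url")


def build_lesson_params(row, data_path):
    """Hypothetical helper: return query parameters for one CSV row."""
    missing = [column for column in REQUIRED_COLUMNS if not row.get(column)]
    if missing:
        raise ValueError(f"Row is missing required columns: {missing}")

    # Copy the row so the caller's dictionary is left unchanged.
    params = dict(row)
    # pdf_path must equal the path stored on the Document node for MATCH to succeed.
    params["pdf_path"] = os.path.join(data_path, row["filename"])
    return params


example_row = {
    "filename": "genai-fundamentals_1-generative-ai_1-what-is-genai.pdf",
    "course": "genai-fundamentals",
    "module": "1-generative-ai",
    "lesson": "1-what-is-genai",
    "url": "https://graphacademy.neo4j.com/courses/genai-fundamentals/1-generative-ai/1-what-is-genai",
}
params = build_lesson_params(example_row, "./genai-graphrag-python/data/")
print(params["pdf_path"])
```

Because the Cypher statement starts with MATCH, a pdf_path that does not match any Document node results in no Lesson node being created, so checking the parameters early can make such cases easier to spot.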

The resulting knowledge graph will now contain Lesson nodes connected to the Document nodes created from the PDF files:

A data model showing Lesson nodes connected to Document nodes using a PDF_OF relationship.

Run the program to create the knowledge graph with the structured data.

Clear the graph before importing

Remember to clear the database before running the program to avoid inconsistent data.

cypher
Delete all
MATCH (n) DETACH DELETE n
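
If you prefer to clear the graph from Python, a minimal sketch could reuse the driver's execute_query method, which the program above already calls. The clear_graph helper is a hypothetical name, and it assumes you pass in a connected driver such as neo4j_driver.

```python
def clear_graph(driver, database=None):
    """Hypothetical helper: delete every node and relationship in the database.

    Returns the number of nodes deleted, as reported by the query summary.
    """
    records, summary, keys = driver.execute_query(
        "MATCH (n) DETACH DELETE n",
        database_=database,
    )
    return summary.counters.nodes_deleted
```

For example, calling clear_graph(neo4j_driver, os.getenv("NEO4J_DATABASE")) before the import loop would remove the existing graph before the documents are processed again.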

Explore the structured data

The structured data allows you to query the knowledge graph in new ways.

You can find all lessons that cover a specific technology or concept:

cypher
Find lessons about Knowledge Graphs
MATCH (kg:Technology)
MATCH (kg)-[:FROM_CHUNK]->(c)-[:FROM_DOCUMENT]-(d)-[:PDF_OF]-(l)
WHERE toLower(kg.name) CONTAINS "knowledge graph"
RETURN DISTINCT toLower(kg.name), l.name, l.url
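
To run this kind of search from Python with the search term supplied at runtime, a sketch along the following lines could work. The find_lessons helper, the FIND_LESSONS constant, the $term parameter, and the column aliases are illustrative names; the query is the one above with the literal replaced by a parameter.

```python
FIND_LESSONS = """
MATCH (kg:Technology)-[:FROM_CHUNK]->(c)-[:FROM_DOCUMENT]-(d)-[:PDF_OF]-(l)
WHERE toLower(kg.name) CONTAINS $term
RETURN DISTINCT toLower(kg.name) AS technology, l.name AS lesson, l.url AS url
"""


def find_lessons(driver, term, database=None):
    """Hypothetical helper: list lessons mentioning a technology matching term."""
    records, summary, keys = driver.execute_query(
        FIND_LESSONS,
        # Lowercase the term so it compares against toLower(kg.name).
        parameters_={"term": term.lower()},
        database_=database,
    )
    # Convert each record to a plain dictionary keyed by the RETURN aliases.
    return [record.data() for record in records]
```

Passing the term as a parameter rather than building the query string by hand keeps the query text constant and avoids quoting problems.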

The knowledge graph allows you to summarize the content of each lesson by specific categories such as technologies and concepts:

cypher
Summarize lesson content
MATCH (lesson:Lesson)<-[:PDF_OF]-(:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
OPTIONAL MATCH (c)<-[:FROM_CHUNK]-(tech:Technology)
OPTIONAL MATCH (c)<-[:FROM_CHUNK]-(concept:Concept)
RETURN
  lesson.name,
  collect(DISTINCT tech.name) as technologies,
  collect(DISTINCT concept.name) as concepts

Spend some time exploring the knowledge graph and experiment with adding additional data.

Check your understanding

What are the benefits of adding structured data to a knowledge graph? (Select all that apply)

  • ✓ It enhances the knowledge graph’s richness and usefulness by combining structured and unstructured data

  • ✓ It allows you to query the knowledge graph in new ways, such as finding lessons about specific technologies

  • ❏ It automatically improves the accuracy of entity extraction from unstructured text

  • ✓ It enables you to summarize content by specific categories like technologies and concepts

Hint

Think about what new capabilities structured data adds to the knowledge graph - what can you do with the combination that you couldn’t do with unstructured data alone?

Solution

Adding structured data to a knowledge graph provides several benefits:

  1. Enhanced richness and usefulness - Combining structured data (like CSV metadata) with unstructured data (extracted from documents) creates a more comprehensive knowledge graph

  2. New query capabilities - You can find relationships between structured metadata and extracted entities, such as finding all lessons that cover specific technologies

  3. Content summarization - You can group and summarize content by categories, connecting lesson metadata with extracted concepts and technologies

The incorrect option suggests that structured data improves entity extraction accuracy, but structured data doesn’t directly affect the LLM’s ability to extract entities from text - it provides additional context and relationships.

Lesson Summary

In this lesson, you learned:

  • About the benefits of adding structured data to a knowledge graph.

  • How to load structured data from a CSV file.

  • How to create nodes from structured data and connect them to unstructured data nodes.

In the next module, you will create retrievers to query the knowledge graph.
