Create a Graph

In this lesson, you will learn how to create a knowledge graph from unstructured data using the SimpleKGPipeline class.

The SimpleKGPipeline class provides a pipeline which implements a series of steps to create a knowledge graph from unstructured data:

  1. Load the text

  2. Split the text into chunks

  3. Create embeddings for each chunk

  4. Extract entities from the chunks

  5. Write the data to a Neo4j database

Pipeline showing these steps

Typical default values are used for each step. Throughout the course you will learn how to customize each step to suit your requirements.
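To make the shape of the pipeline concrete, here is a conceptual sketch of the five steps above. This is not the library's actual internals; the function and the dummy components are purely illustrative.

```python
# A conceptual sketch of the SimpleKGPipeline's steps; not the library's
# actual internals, just the shape of the work it does.
def build_kg(text, split, embed, extract, write):
    chunks = split(text)                       # 2. split the text into chunks
    embeddings = [embed(c) for c in chunks]    # 3. create an embedding per chunk
    entities = [extract(c) for c in chunks]    # 4. extract entities per chunk
    write(chunks, embeddings, entities)        # 5. write everything to Neo4j

# Dummy components to show the flow:
store = {}
build_kg(
    "GenAI models generate text.",
    split=lambda t: [t],                       # one chunk
    embed=lambda c: [0.0] * 3,                 # placeholder vector
    extract=lambda c: ["GenAI"],               # placeholder entity list
    write=lambda ch, em, en: store.update(chunks=ch, entities=en),
)
print(store["entities"])  # [['GenAI']]
```

In the real pipeline, each of these steps is a configurable component with a sensible default.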

Create the knowledge graph

Open genai-graphrag-python/kg_builder.py and review the code.

python
kg_builder.py
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

The code loads a single PDF file, data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf, and runs the pipeline to create a knowledge graph in Neo4j.

The PDF document contains the content from the Neo4j & Generative AI Fundamentals course What is Generative AI? lesson.

Breaking down the code, you can see the following steps:

  1. Create a connection to Neo4j:

    python
    Neo4j connection
    neo4j_driver = GraphDatabase.driver(
        os.getenv("NEO4J_URI"),
        auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
    )
    neo4j_driver.verify_connectivity()
  2. Instantiate an LLM model:

    python
    LLM
    llm = OpenAILLM(
        model_name="gpt-4o",
        model_params={
            "temperature": 0,
            "response_format": {"type": "json_object"},
        }
    )

    The model parameters, model_params, lower the temperature to make the model's output more deterministic and set the response format to JSON.

  3. Create an embedding model:

    python
    Embedding model
    embedder = OpenAIEmbeddings(
        model="text-embedding-ada-002"
    )
  4. Set up the SimpleKGPipeline:

    python
    kg_builder
    kg_builder = SimpleKGPipeline(
        llm=llm,
        driver=neo4j_driver, 
        neo4j_database=os.getenv("NEO4J_DATABASE"), 
        embedder=embedder, 
        from_pdf=True,
    )
  5. Run the pipeline to create the graph from a single PDF file:

    python
    Run the pipeline
    pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(result.result)

When you run the program, the pipeline will process the PDF document and create the graph in Neo4j.

A summary of the results will be returned, for example:

{'resolver': {'number_of_nodes_to_resolve': 12, 'number_of_created_nodes': 10}}
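The resolver statistics report how many candidate entity nodes were considered and how many remained after duplicates were merged. As a sketch, assuming the summary has the dictionary shape shown above (the exact structure may vary between library versions), you could inspect it like this:

```python
# Inspect the resolver statistics from the pipeline's result summary.
# The dict mirrors the example output above; the real shape may vary
# between library versions.
summary = {"resolver": {"number_of_nodes_to_resolve": 12,
                        "number_of_created_nodes": 10}}

stats = summary["resolver"]
merged = stats["number_of_nodes_to_resolve"] - stats["number_of_created_nodes"]
print(f"Entity resolution merged {merged} duplicate nodes")  # merged == 2
```

Here, 12 extracted entities were resolved down to 10 nodes, meaning 2 duplicates were merged.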

Explore the Knowledge Graph

The SimpleKGPipeline creates the following default graph model:

a graph model showing (Document)<-[:FROM_DOCUMENT]-(Chunk)<-[:FROM_CHUNK]-(Entity)

The Entity nodes represent the entities extracted from the text chunks. Relevant properties are extracted from the chunk and associated with the entity nodes.

You can view the documents and chunks created in the graph using the following Cypher query:

cypher
View the documents and chunks
MATCH (d:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
RETURN d.path, c.text

Chunk size

The default chunk size is greater than the length of the document, so only a single chunk is created.
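As a rough illustration of why, here is how the chunk count relates to document length, chunk size, and overlap for a fixed-size splitter. The numbers below are illustrative only; the actual default chunk size depends on the library version.

```python
def num_chunks(doc_length, chunk_size, overlap=0):
    """Chunks a fixed-size splitter with overlap would produce (illustrative)."""
    if doc_length <= chunk_size:
        return 1  # the whole document fits in a single chunk
    step = chunk_size - overlap
    # ceiling division for the remaining characters after the first chunk
    return 1 + -(-(doc_length - chunk_size) // step)

print(num_chunks(3000, 4000))        # short document, one chunk -> 1
print(num_chunks(10000, 4000, 200))  # longer document -> 3
```

Because this lesson's PDF is shorter than the chunk size, the query above returns a single chunk per document.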

The extracted entities and the relationships between them can be found using a variable length path query:

cypher
View the entities extracted from each chunk
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p
A graph showing entities extracted from a chunk

Check your understanding

What is the primary role of the SimpleKGPipeline class?

  • ❏ To provide a simple interface for querying existing knowledge graphs

  • ✓ To implement a series of steps that create a knowledge graph from unstructured data

  • ❏ To visualize knowledge graphs

  • ❏ To convert structured data from databases into unstructured text

Hint

Think about what the SimpleKGPipeline does when you run it: it takes unstructured data and transforms it into a structured knowledge graph.

Solution

The SimpleKGPipeline class provides a pipeline which implements a series of steps to create a knowledge graph from unstructured data. These steps include loading text, splitting it into chunks, creating embeddings, extracting entities, and writing the data to Neo4j.

Lesson Summary

In this lesson, you:

  • Learned how to use the SimpleKGPipeline class.

  • Explored the graph model created by the pipeline.

In the next lesson, you will modify the chunk size used when splitting the text and define a custom schema for the knowledge graph.
