Chunk size

Overview

The graph created by the SimpleKGPipeline is based on chunks of text extracted from the documents.

By default, the chunk size is large, so the pipeline creates fewer, larger chunks.

The larger the chunk size, the more context the LLM has when extracting entities and relationships, but larger chunks can also produce less granular data.

In this lesson, you will modify the SimpleKGPipeline to use a different chunk size.

Delete the existing graph

You will be re-importing the data and modifying the existing graph. To ensure a clean state, you can delete the graph at any time using:

cypher
// Delete the existing graph
MATCH (n) DETACH DELETE n
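
If you prefer to reset the database from Python, here is a minimal sketch that uses the same connection settings as kg_builder.py (it assumes your credentials are in the .env file):

python
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()

driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)

# Remove every node and relationship to start from a clean state
driver.execute_query(
    "MATCH (n) DETACH DELETE n",
    database_=os.getenv("NEO4J_DATABASE"),
)
driver.close()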

Continue with the lesson to modify the chunk size.

Text Splitter Chunk Size

To modify the chunk size, create a FixedSizeSplitter object and pass it to the SimpleKGPipeline when creating the pipeline instance:

  1. Modify the workshop-genai/kg_builder.py file to import the FixedSizeSplitter class and create an instance with a chunk size of 500 characters:

    python
    from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
    
    text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

    Chunk size and overlap

    The chunk_size parameter defines the maximum number of characters in each text chunk. The chunk_overlap parameter ensures that consecutive chunks share some text, which helps maintain context across chunk boundaries (see the sketch after this list).
  2. Update the SimpleKGPipeline instantiation to use the custom text splitter:

    python
    kg_builder = SimpleKGPipeline(
        llm=llm,
        driver=neo4j_driver, 
        neo4j_database=os.getenv("NEO4J_DATABASE"), 
        embedder=embedder, 
        from_pdf=True,
        text_splitter=text_splitter,
    )
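
To see how chunk_size and chunk_overlap interact, here is a minimal hand-rolled sketch of fixed-size splitting with overlap (an illustration of the idea, not the library's actual implementation):

python
def split_fixed_size(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each new chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed_size("a" * 1000, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])  # [500, 500, 200] - a new chunk starts every 400 characters
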
Complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    text_splitter=text_splitter,
)

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Run the modified pipeline to recreate the knowledge graph with the new chunk size.
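
To verify that the new chunk size took effect, you can check the number of chunks and the longest chunk (with chunk_size=500, maxChars should not exceed 500):

cypher
// Check the chunk count and the longest chunk
MATCH (c:Chunk)
RETURN count(c) AS chunks, max(size(c.text)) AS maxChars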

Explore

You can view the documents and their associated chunks using the following Cypher query:

cypher
// View the documents and chunks
MATCH (d:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
RETURN d.path, c.index, c.text, size(c.text)
ORDER BY d.path, c.index
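
You can also aggregate the chunks to compare counts and average lengths per document:

cypher
// Summarise chunk counts and average lengths per document
MATCH (d:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
RETURN d.path, count(c) AS chunks, avg(size(c.text)) AS avgChars
ORDER BY d.path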

View the entities extracted from each chunk using the following Cypher query:

cypher
// View the entities extracted from each chunk
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p
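
To quantify this, you can count the entities extracted from each chunk. This assumes the FROM_CHUNK relationship that the pipeline creates by default between entities and their source chunks:

cypher
// Count the entities extracted from each chunk
MATCH (c:Chunk)<-[:FROM_CHUNK]-(e:__Entity__)
RETURN c.index, count(e) AS entities
ORDER BY c.index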

Chunk size

You can experiment with different chunk sizes to see how they affect the entities extracted and the structure of the knowledge graph. The sketch below shows one way to automate the comparison.
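
For example, the following sketch rebuilds the graph with several chunk sizes. It reuses the llm, embedder, and neo4j_driver objects from the complete code above; the sizes and overlap values are illustrative, not recommendations:

python
import asyncio

pdf_file = "./workshop-genai/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"

for chunk_size in [200, 500, 1000]:
    # Delete the previous graph so each run starts from a clean state
    neo4j_driver.execute_query(
        "MATCH (n) DETACH DELETE n",
        database_=os.getenv("NEO4J_DATABASE"),
    )
    text_splitter = FixedSizeSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 5)
    kg_builder = SimpleKGPipeline(
        llm=llm,
        driver=neo4j_driver,
        neo4j_database=os.getenv("NEO4J_DATABASE"),
        embedder=embedder,
        from_pdf=True,
        text_splitter=text_splitter,
    )
    result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
    print(chunk_size, result.result)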

Lesson Summary

In this lesson, you:

  • Learned about the impact of chunk size on entity extraction

  • Modified the SimpleKGPipeline to use a custom chunk size with the FixedSizeSplitter

In the next lesson, you will define a custom schema for the knowledge graph.
