The graph created by the SimpleKGPipeline is based on chunks of text extracted from the documents. By default, the chunk size is quite large, which may result in fewer, larger chunks. The larger the chunk size, the more context the LLM has when extracting entities and relationships, but it may also lead to less granular data.
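To build intuition for how fixed-size chunking with overlap behaves, here is a minimal sketch. Note this is a simplified illustration, not the actual `FixedSizeSplitter` implementation from `neo4j_graphrag`:

```python
# Simplified illustration of fixed-size chunking with overlap.
# NOT the neo4j_graphrag FixedSizeSplitter source - just the core idea.
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each new chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so consecutive chunks share some text.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1000-character document with chunk_size=500 and chunk_overlap=100
# yields 3 chunks; halving the chunk size would yield more, smaller chunks.
chunks = split_fixed("a" * 1000, chunk_size=500, chunk_overlap=100)
print(len(chunks))  # → 3
```

Smaller chunks give the LLM less context per extraction call but produce more granular data; larger chunks do the reverse.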
In this lesson, you will modify the SimpleKGPipeline to use a different chunk size.
Delete the existing graph
You will be re-importing the data and modifying the existing graph. To ensure a clean state, you can delete the graph at any time using:
```cypher
MATCH (n) DETACH DELETE n
```
Text Splitter Chunk Size
To modify the chunk size you will need to create a FixedSizeSplitter object and pass it to the SimpleKGPipeline when creating the pipeline instance:
- Modify the `genai-graphrag-python/kg_builder.py` file to import the `FixedSizeSplitter` class and create an instance with a chunk size of 500 characters:

  ```python
  from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

  text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)
  ```

  The `chunk_size` parameter defines the maximum number of characters in each text chunk. The `chunk_overlap` parameter ensures that there is some overlap between consecutive chunks, which helps maintain context across chunk boundaries.
- Update the `SimpleKGPipeline` instantiation to use the custom text splitter:

  ```python
  kg_builder = SimpleKGPipeline(
      llm=llm,
      driver=neo4j_driver,
      neo4j_database=os.getenv("NEO4J_DATABASE"),
      embedder=embedder,
      from_pdf=True,
      text_splitter=text_splitter,
  )
  ```
Reveal the complete code
```python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=text_splitter,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"

result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)
```

Run the modified pipeline to recreate the knowledge graph with the new chunk size.
You can review the chunks created from each document with the following query:

```cypher
MATCH (d:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
RETURN d.path, c.index, c.text
ORDER BY d.path, c.index
```

You can experiment with different chunk sizes to see how they affect the entities extracted and the structure of the knowledge graph.
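Before re-running the pipeline, it can help to estimate how many chunks a given configuration will produce. Assuming a simple fixed-stride split (a simplification of the real splitter's behavior), each chunk after the first advances by `chunk_size - chunk_overlap` characters:

```python
import math

def estimated_chunk_count(doc_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    # Each chunk advances by (chunk_size - chunk_overlap) characters,
    # so the chunk count grows as the chunk size shrinks.
    step = chunk_size - chunk_overlap
    return math.ceil(doc_chars / step)

# For a hypothetical 8,000-character document:
print(estimated_chunk_count(8000, 4000, 200))  # → 3 large chunks
print(estimated_chunk_count(8000, 1000, 100))  # → 9 medium chunks
print(estimated_chunk_count(8000, 500, 100))   # → 20 small chunks
```

More chunks means more LLM extraction calls (and cost), but each call works on a more focused piece of text.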
```cypher
MATCH p = (c:Chunk)-[*..3]-(e:__Entity__)
RETURN p
```

Check your understanding
What is the primary trade-off when increasing the chunk size in the SimpleKGPipeline?
- ❏ Larger chunks process faster but use more memory
- ✓ Larger chunks provide more context for entity extraction but result in less granular data
- ❏ Larger chunks create more entities but fewer relationships
- ❏ Larger chunks improve accuracy but require more computational power
Hint
Consider what happens to the level of detail and context when you make text chunks bigger or smaller.
Solution
The larger the chunk size, the more context the LLM has when extracting entities and relationships, but it may also lead to less granular data. This is the key trade-off: more context versus granularity of the extracted information.
Lesson Summary
In this lesson, you:
- Learned about the impact of chunk size on entity extraction
- Modified the `SimpleKGPipeline` to use a custom chunk size with the `FixedSizeSplitter`
In the next lesson, you will define a custom schema for the knowledge graph.