You can modify how the SimpleKGPipeline splits text into chunks.
The strategy you use to split text affects both how the knowledge graph is constructed and how you search it.
In this lesson you will learn how to create a custom text splitter or integrate with an existing LangChain text splitter.
Custom Text Splitter
A custom text splitter allows you to define how the text is divided into chunks.
The documents you have been using in this course contain section headings, for example `= Heading` and `== Subheading`, that can be used to split the text into logically meaningful chunks.
= What is Generative AI
== GenAI
Generative AI (or GenAI) refers to artificial intelligence systems ...
== Large Language Models (LLMs)
LLMs are a type of generative AI model designed to understand and generate human-like text ...
You can create a custom text splitter by extending the TextSplitter class:
from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter
from neo4j_graphrag.experimental.components.types import TextChunk, TextChunks
class SectionSplitter(TextSplitter):
    def __init__(self, section_heading: str = "== ") -> None:
        self.section_heading = section_heading

    async def run(self, text: str) -> TextChunks:
        index = 0
        chunks = []
        current_section = ""
        for line in text.split('\n'):
            # Does the line start with the section heading?
            if line.startswith(self.section_heading):
                chunks.append(
                    TextChunk(text=current_section, index=index)
                )
                current_section = ""
                index += 1
            current_section += line + "\n"
        # Add the last section
        chunks.append(
            TextChunk(text=current_section, index=index)
        )
        return TextChunks(chunks=chunks)
splitter = SectionSplitter()
The SectionSplitter class splits the text based on a section heading (`== `), creates a TextChunk object for each section, and returns a TextChunks object containing all the chunks.
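The splitting logic itself is plain string handling. As an illustration, independent of the neo4j_graphrag classes, the same algorithm can be written as an ordinary function (the function name and sample text below are invented for this sketch):

```python
def split_sections(text: str, section_heading: str = "== ") -> list[str]:
    """Split text into sections, starting a new section at each heading line."""
    sections = []
    current = ""
    for line in text.split("\n"):
        # A heading line closes the current section and starts a new one
        if line.startswith(section_heading):
            sections.append(current)
            current = ""
        current += line + "\n"
    sections.append(current)  # don't forget the final section
    return sections

sample = "= Title\nIntro text\n== A\nFirst section\n== B\nSecond section\n"
parts = split_sections(sample)
print(len(parts))  # 3 - the intro, then one section per "== " heading
```

Note that, like the class above, everything before the first `== ` heading becomes its own chunk.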
You can run the text splitter directly with test data to verify that it is working:
text = """
= Heading 1
This is the main section
== Sub-heading
This is some text.
== Sub-heading 2
This is some more text.
"""
import asyncio

chunks = asyncio.run(splitter.run(text))
print(chunks)

Your text splitter can then be used in the SimpleKGPipeline to split text into chunks by setting the text_splitter parameter:
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)
This example code shows how to create and use the SectionSplitter in a SimpleKGPipeline:
import os
from dotenv import load_dotenv
load_dotenv()
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.pdf_loader import DataLoader, PdfDocument, DocumentInfo
from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter
from neo4j_graphrag.experimental.components.types import TextChunk, TextChunks
from pathlib import Path

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

class SectionSplitter(TextSplitter):
    def __init__(self, section_heading: str = "== ") -> None:
        self.section_heading = section_heading

    async def run(self, text: str) -> TextChunks:
        index = 0
        chunks = []
        current_section = ""
        for line in text.split('\n'):
            # Does the line start with the section heading?
            if line.startswith(self.section_heading):
                chunks.append(
                    TextChunk(text=current_section, index=index)
                )
                current_section = ""
                index += 1
            current_section += line + "\n"
        # Add the last section
        chunks.append(
            TextChunk(text=current_section, index=index)
        )
        return TextChunks(chunks=chunks)

splitter = SectionSplitter()

class TextLoader(DataLoader):
    async def run(self, filepath: Path) -> PdfDocument:
        # Process the file
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        # Return a PdfDocument
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(
                path=str(filepath),
                metadata={}
            )
        )

data_loader = TextLoader()

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    pdf_loader=data_loader,
    text_splitter=splitter,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.txt"

doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Integrate with LangChain Text Splitters
LangChain provides a variety of text splitters that you can use to implement different chunking strategies.
The neo4j_graphrag package includes a LangChainTextSplitterAdapter class that allows you to use LangChain text splitters with the SimpleKGPipeline.
You can use the LangChain CharacterTextSplitter to split text into paragraphs.
# You will need to install langchain: pip install langchain
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter
from langchain.text_splitter import CharacterTextSplitter

splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(
        separator="\n\n",
        chunk_size=500,
        chunk_overlap=100,
    )
)
The LangChainTextSplitterAdapter wraps the LangChain CharacterTextSplitter to create text chunks.
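To build intuition for what a separator-based splitter does, the sketch below greedily merges paragraphs into chunks no longer than chunk_size. This is a simplified illustration only, not LangChain's actual algorithm (which, among other things, also carries chunk_overlap characters between consecutive chunks); the function name is invented for this example:

```python
def chunk_paragraphs(text: str, separator: str = "\n\n", chunk_size: int = 500) -> list[str]:
    """Greedily merge separator-delimited paragraphs into chunks of at most chunk_size."""
    chunks = []
    current = ""
    for para in text.split(separator):
        candidate = (current + separator + para) if current else para
        if len(candidate) > chunk_size and current:
            chunks.append(current)  # current chunk is full; start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Six 130-character paragraphs merge into chunks of roughly two paragraphs each
text = "\n\n".join("x" * 130 for _ in range(6))
print([len(c) for c in chunk_paragraphs(text, chunk_size=300)])
```

The key idea is that chunk_size is an upper bound: paragraphs are kept whole and packed together until adding another would exceed the limit.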
The LangChain text splitter can then be used in the SimpleKGPipeline to split text into chunks by setting the text_splitter parameter:
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)
This example code shows how to create and use the LangChainTextSplitterAdapter in a SimpleKGPipeline:
import os
from dotenv import load_dotenv
load_dotenv()
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
# You will need to install langchain: pip install langchain
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter
from langchain.text_splitter import CharacterTextSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(
        separator="\n\n",
        chunk_size=500,
        chunk_overlap=100,
    )
)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)
Lesson Summary
In this lesson, you learned how to create custom text splitters and integrate with LangChain text splitters.
In the next lesson, you will learn how to configure the lexical (unstructured) graph data model.