Splitting Text into Chunks

You can modify how the SimpleKGPipeline splits text into chunks. Your splitting strategy affects both how the knowledge graph is constructed and how effectively you can search it.

In this lesson you will learn how to create a custom text splitter or integrate with an existing LangChain text splitter.

Custom Text Splitter

A custom text splitter allows you to define how the text is divided into chunks.

The PDF documents you have been using in this course contain sections, for example = Heading and == Subheading, that can be used to logically split the text into meaningful chunks.

text
Sectioned document
= What is Generative AI

== GenAI

Generative AI (or GenAI) refers to artificial intelligence systems ...

== Large Language Models (LLMs)

LLMs are a type of generative AI model designed to understand and generate human-like text ...
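Before wiring this into the pipeline, the core splitting logic can be sketched in plain Python, independent of the neo4j-graphrag classes: walk the lines and start a new section whenever a line begins with the heading marker. The `split_sections` helper below is illustrative only, not part of the library.

```python
def split_sections(text: str, heading: str = "== ") -> list[str]:
    """Split text into sections, starting a new one at each heading line."""
    sections: list[str] = []
    current = ""
    for line in text.split("\n"):
        # A heading line closes the current section and opens a new one
        if line.startswith(heading) and current.strip():
            sections.append(current)
            current = ""
        current += line + "\n"
    if current.strip():
        sections.append(current)
    return sections

doc = """= What is Generative AI

== GenAI

Generative AI (or GenAI) refers to artificial intelligence systems ...

== Large Language Models (LLMs)

LLMs are a type of generative AI model ...
"""

sections = split_sections(doc)
print(len(sections))  # → 3: the "=" intro plus the two "==" sections
```

Everything before the first `== ` heading becomes its own section, so the document title and any preamble are kept rather than discarded.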

You can create a custom text splitter by extending the TextSplitter class:

python
from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter
from neo4j_graphrag.experimental.components.types import TextChunk, TextChunks

class SectionSplitter(TextSplitter):
    def __init__(self, section_heading: str = "== ") -> None:
        self.section_heading = section_heading

    async def run(self, text: str) -> TextChunks:
        index = 0
        chunks = []
        current_section = ""

        for line in text.split('\n'):
            # A heading line closes the current chunk and starts a new one
            if line.startswith(self.section_heading) and current_section.strip():
                chunks.append(
                    TextChunk(text=current_section, index=index)
                )
                current_section = ""
                index += 1

            current_section += line + "\n"

        # Add the last section (skip if empty)
        if current_section.strip():
            chunks.append(
                TextChunk(text=current_section, index=index)
            )
        
        return TextChunks(chunks=chunks)

splitter = SectionSplitter()

The SectionSplitter class splits the text on lines that start with the section heading (`== ` by default), creates a TextChunk object for each section, and returns a TextChunks object containing all the chunks.

You can run the text splitter directly with test data to verify that it is working:

python
import asyncio

text = """
= Heading 1
This is the main section

== Sub-heading
This is some text.

== Sub-heading 2
This is some more text.
"""

chunks = asyncio.run(splitter.run(text))
print(chunks)

Your text splitter can then be used in the SimpleKGPipeline to split text into chunks by setting the text_splitter parameter:

python
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)

This example code shows how to create and use the SectionSplitter in a SimpleKGPipeline:

python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

from neo4j_graphrag.experimental.components.pdf_loader import DataLoader, PdfDocument, DocumentInfo
from pathlib import Path

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

class TextLoader(DataLoader):
    async def run(self, filepath: Path) -> PdfDocument:

        # Process the file
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        
        # Return a PdfDocument
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(
                path=str(filepath),
                metadata={}
            )
        )
    
data_loader = TextLoader()

from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter
from neo4j_graphrag.experimental.components.types import TextChunk, TextChunks

class SectionSplitter(TextSplitter):
    def __init__(self, section_heading: str = "== ") -> None:
        self.section_heading = section_heading

    async def run(self, text: str) -> TextChunks:
        index = 0
        chunks = []
        current_section = ""

        for line in text.split('\n'):
            # A heading line closes the current chunk and starts a new one
            if line.startswith(self.section_heading) and current_section.strip():
                chunks.append(
                    TextChunk(text=current_section, index=index)
                )
                current_section = ""
                index += 1

            current_section += line + "\n"

        # Add the last section (skip if empty)
        if current_section.strip():
            chunks.append(
                TextChunk(text=current_section, index=index)
            )

        return TextChunks(chunks=chunks)

splitter = SectionSplitter()

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    pdf_loader=data_loader,
    text_splitter=splitter,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.txt"
doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Integrate with LangChain Text Splitters

LangChain provides a variety of text splitters that you can use to implement different chunking strategies.

The neo4j_graphrag package includes a LangChainTextSplitterAdapter class that allows you to integrate LangChain text splitters with the SimpleKGPipeline.

You can use the LangChain CharacterTextSplitter to split text into paragraphs.

python
# You will need to install langchain: pip install langchain
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter
from langchain.text_splitter import CharacterTextSplitter

splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(
        separator="\n\n",
        chunk_size=500,
        chunk_overlap=100,
    )
)

The LangChainTextSplitterAdapter wraps the LangChain CharacterTextSplitter to create text chunks.
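To see roughly what the chunk_size and chunk_overlap parameters do, here is a plain-Python approximation: split on the separator, greedily merge paragraphs into chunks of at most chunk_size characters, and carry the last chunk_overlap characters of each chunk into the next. The `chunk_paragraphs` helper is an illustrative sketch, not LangChain's actual algorithm.

```python
def chunk_paragraphs(text: str, separator: str = "\n\n",
                     chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Greedily merge paragraphs into size-bounded chunks with overlap."""
    paragraphs = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk if adding this paragraph would exceed chunk_size
        if current and len(current) + len(separator) + len(para) > chunk_size:
            chunks.append(current)
            # Carry the tail of the previous chunk forward for context
            current = current[-chunk_overlap:]
        current = current + separator + para if current else para
    if current:
        chunks.append(current)
    return chunks

text = "\n\n".join(f"Paragraph {i}: " + "lorem ipsum " * 10 for i in range(6))
chunks = chunk_paragraphs(text)
print(len(chunks))  # → 3
```

The overlap means adjacent chunks share some text, which helps the entity extraction step keep context that would otherwise be cut off at a chunk boundary.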

The LangChain text splitter can then be used in the SimpleKGPipeline to split text into chunks by setting the text_splitter parameter:

python
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)

This example code shows how to create and use the LangChainTextSplitterAdapter in a SimpleKGPipeline:

python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

# You will need to install langchain: pip install langchain
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter
from langchain.text_splitter import CharacterTextSplitter

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(
        separator="\n\n",
        chunk_size=500,
        chunk_overlap=100,
    )
)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder,
    from_pdf=True,
    text_splitter=splitter,
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)


Lesson Summary

In this lesson, you learned how to create custom text splitters and integrate with LangChain text splitters.

In the next lesson, you will learn how to configure the lexical (unstructured) graph data model.
