Loading documents

You can customize how the SimpleKGPipeline loads documents to suit your own use case.

You may want to:

  • Load documents from a custom source, such as a database or an API.

  • Pre-process documents before they are loaded, such as removing unwanted text or formatting.

  • Implement custom parsing logic for specific document types.

In this lesson you will explore examples of how to create custom document loaders.

Custom PDF Loader

You can extend the existing PDF loader to fit your own use case, such as pre-processing the text before it is loaded into the graph.

The PDF documents you have been using in this course contain AsciiDoc attribute lines (such as :order: and :type:) that are not relevant to the knowledge graph and may introduce noise.

text
Unwanted attributes
= What is Generative AI
:order: 1
:type: lesson
:slides: true

...

You can strip the attributes from the text before it is loaded into the graph by extending the PdfLoader class to create a CustomPDFLoader:

python
from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader, PdfDocument

import re
from fsspec import AbstractFileSystem
from typing import Dict, Optional, Union
from pathlib import Path

class CustomPDFLoader(PdfLoader):
    async def run(
        self,
        filepath: Union[str, Path],
        metadata: Optional[Dict[str, str]] = None,
        fs: Optional[Union[AbstractFileSystem, str]] = None,
    ) -> PdfDocument:
        pdf_document = await super().run(filepath, metadata, fs)

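        # Process the PDF document:
        # remove AsciiDoc attribute lines such as ":order: 1".
        # The pattern is anchored to the start of a line so that colons
        # inside ordinary sentences are left untouched.
        pdf_document.text = re.sub(r'^:[\w-]+:.*\n?', '', pdf_document.text, flags=re.MULTILINE)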

        return pdf_document

data_loader = CustomPDFLoader()

The CustomPDFLoader overrides the run method to pre-process the text and uses regex to remove unwanted attributes.
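
To see what this kind of pre-processing does without running the loader, the following is a minimal sketch that applies an attribute-stripping pattern to a sample string. It assumes each attribute sits on its own line in the form :name: value. The function name strip_attributes and the constant ATTRIBUTE_LINE are illustrative and not part of neo4j_graphrag.

```python
import re

# Assumed pattern: a line that starts with ":name:" (an AsciiDoc attribute entry).
# re.MULTILINE makes "^" match at the start of every line, not just the string.
ATTRIBUTE_LINE = re.compile(r"^:[\w-]+:.*\n?", flags=re.MULTILINE)


def strip_attributes(text: str) -> str:
    """Remove whole attribute lines and leave all other text unchanged."""
    return ATTRIBUTE_LINE.sub("", text)


sample = (
    "= What is Generative AI\n"
    ":order: 1\n"
    ":type: lesson\n"
    ":slides: true\n"
    "\n"
    "Note: generative AI creates new content.\n"
)

cleaned = strip_attributes(sample)
print(cleaned)
```

Because the pattern only matches at the start of a line, the colon in the sentence beginning "Note:" is preserved while the three attribute lines are removed.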

You can run the custom loader directly to verify that it is working:

python
import asyncio

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

The custom loader can then be used in the SimpleKGPipeline to load and process PDF documents by setting the pdf_loader parameter:

python
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    pdf_loader=data_loader
)

This example code shows how to create and use the CustomPDFLoader in a SimpleKGPipeline:

python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader, PdfDocument

import re
from fsspec import AbstractFileSystem
from typing import Dict, Optional, Union
from pathlib import Path

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

class CustomPDFLoader(PdfLoader):
    async def run(
        self,
        filepath: Union[str, Path],
        metadata: Optional[Dict[str, str]] = None,
        fs: Optional[Union[AbstractFileSystem, str]] = None,
    ) -> PdfDocument:
        pdf_document = await super().run(filepath, metadata, fs)

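        # Process the PDF document:
        # remove AsciiDoc attribute lines such as ":order: 1".
        # The pattern is anchored to the start of a line so that colons
        # inside ordinary sentences are left untouched.
        pdf_document.text = re.sub(r'^:[\w-]+:.*\n?', '', pdf_document.text, flags=re.MULTILINE)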

        return pdf_document

data_loader = CustomPDFLoader()

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    pdf_loader=data_loader
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.pdf"
doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

Custom Data Loader

The SimpleKGPipeline can load documents from different sources by implementing a custom data loader.

You can extend the DataLoader class to load text data from any source.

You can load data from a text file by implementing a TextLoader class:

python
from neo4j_graphrag.experimental.components.pdf_loader import DataLoader, PdfDocument, DocumentInfo
from pathlib import Path

class TextLoader(DataLoader):
    async def run(self, filepath: Path) -> PdfDocument:

        # Process the file
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        
        # Return a PdfDocument
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(
                path=str(filepath),
                metadata={}
            )
        )
    
data_loader = TextLoader()

The SimpleKGPipeline expects the loader to return a PdfDocument object that contains the raw text and a DocumentInfo object with a path and optional metadata.

metadata

The metadata can include any information you want to associate with the document, such as author, date, or source.
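
As a rough illustration of what such metadata might look like, the sketch below builds a dictionary with string keys and string values. The helper name build_metadata, the keys author, source, filename, and loaded_on, and the sample path and values are all hypothetical choices for this example, not names required by the library. The resulting dictionary could be passed as the metadata argument of DocumentInfo in place of the empty dictionary.

```python
from datetime import date
from pathlib import Path
from typing import Dict, Union


def build_metadata(filepath: Union[str, Path], author: str, source: str) -> Dict[str, str]:
    """Hypothetical helper: describe a document using string-only metadata."""
    return {
        "author": author,
        "source": source,
        "filename": Path(filepath).name,
        # ISO date (YYYY-MM-DD) on which the document was loaded
        "loaded_on": date.today().isoformat(),
    }


metadata = build_metadata(
    "./data/what-is-genai.txt",  # illustrative path, not a file from the course
    author="example author",     # placeholder value
    source="course material",    # placeholder value
)
print(metadata)
```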

You can run the text loader directly to verify that it is working:

python
import asyncio

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.txt"
doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

This example code shows how to create and use the TextLoader in a SimpleKGPipeline:

python
import os
from dotenv import load_dotenv
load_dotenv()

import asyncio

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

from neo4j_graphrag.experimental.components.pdf_loader import DataLoader, PdfDocument, DocumentInfo
from pathlib import Path

neo4j_driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
)
neo4j_driver.verify_connectivity()

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

class TextLoader(DataLoader):
    async def run(self, filepath: Path) -> PdfDocument:

        # Process the file
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        
        # Return a PdfDocument
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(
                path=str(filepath),
                metadata={}
            )
        )
    
data_loader = TextLoader()

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver, 
    neo4j_database=os.getenv("NEO4J_DATABASE"), 
    embedder=embedder, 
    from_pdf=True,
    pdf_loader=data_loader
)

pdf_file = "./genai-graphrag-python/data/genai-fundamentals_1-generative-ai_1-what-is-genai.txt"
doc = asyncio.run(data_loader.run(pdf_file))
print(doc.text)

print(f"Processing {pdf_file}")
result = asyncio.run(kg_builder.run_async(file_path=pdf_file))
print(result.result)

When you’re ready, you can continue.

Lesson Summary

In this lesson, you learned how to create custom data loaders.

In the next lesson, you will learn how to integrate custom chunking and text splitting strategies into the knowledge graph pipeline.
