How the LLM Graph Builder works

The LLM Graph Builder follows the process you learned earlier in the course:

  1. Gather the data

  2. Chunk the data

  3. Vectorize the data

  4. Pass the data to an LLM to extract nodes and relationships

  5. Use the output to generate the graph

Python and LangChain are predominantly used to build the knowledge graph.

In this lesson, you will review Python code snippets that complete the above steps.

Gather the data

The application uses a document loader to load the PDFs from a directory.

python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

DOCS_PATH = "llm-knowledge-graph/data/course/pdfs"

loader = DirectoryLoader(DOCS_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

docs = loader.load()

The glob parameter specifies the file pattern to match: **/*.pdf finds any PDF file in the directory or any of its subdirectories.

Chunk the data

The application splits the documents using a text splitter.

python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

This code splits the text into paragraphs (wherever there is a double newline, \n\n) and combines them into chunks of up to 1500 characters, with an overlap of 200 characters between consecutive chunks.
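To make the splitting strategy concrete, here is a simplified, dependency-free sketch of the same idea: split on blank lines, then greedily pack paragraphs into chunks of roughly chunk_size characters. (The function name is hypothetical, and the real CharacterTextSplitter also handles overlap and oversized paragraphs; this is only an approximation.)

```python
def split_paragraphs(text: str, chunk_size: int = 1500) -> list[str]:
    """Split on double newlines, then pack paragraphs into chunks."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed chunk_size
        if current and len(current) + 2 + len(para) > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Three paragraphs: the first two together exceed 1500 characters,
# so they end up in separate chunks
text = ("A" * 800) + "\n\n" + ("B" * 800) + "\n\n" + ("C" * 100)
chunks_demo = split_paragraphs(text, chunk_size=1500)
print([len(c) for c in chunks_demo])
```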

Vectorize the data

The application creates an embedding for each chunk of text and adds them to the graph.

python
import os

from langchain_openai import OpenAIEmbeddings
from langchain_community.graphs import Neo4jGraph

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

for chunk in chunks:

    # Extract the filename
    filename = os.path.basename(chunk.metadata["source"])

    # Create a unique identifier for the chunk
    chunk_id = f"{filename}.{chunk.metadata['page']}"

    # Embed the chunk
    chunk_embedding = embedding_provider.embed_query(chunk.page_content)

    # Add the Document and Chunk nodes to the graph
    properties = {
        "filename": filename,
        "chunk_id": chunk_id,
        "text": chunk.page_content,
        "textEmbedding": chunk_embedding
    }

    graph.query("""
        MERGE (d:Document {id: $filename})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $text
        MERGE (d)<-[:PART_OF]-(c)
        WITH c
        CALL db.create.setNodeVectorProperty(c, 'textEmbedding', $textEmbedding)
        """, 
        properties
    )

# Create the vector index
graph.query("""
    CREATE VECTOR INDEX `vector`
    FOR (c:Chunk) ON (c.textEmbedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }};""")

OpenAI creates the embeddings in this example, but you could use any embedding model.

The code uses the Neo4jGraph.query method to create Document and Chunk nodes and store the text and embedding data.
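The cosine similarity function configured on the index can be illustrated in a few lines of plain Python (a hypothetical 3-dimensional example; real text-embedding-ada-002 vectors have 1536 dimensions, matching the index configuration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]   # identical direction
v3 = [0.0, 1.0, 0.0]   # orthogonal direction

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

Chunks whose embeddings point in similar directions score close to 1.0, which is how the vector index ranks results at query time.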

Extract nodes and relationships

The application uses the LangChain LLMGraphTransformer, contributed by Neo4j, to extract the nodes and relationships.

The LLMGraphTransformer requires an llm. This example uses OpenAI's gpt-3.5-turbo, but you could use any LLM.

python
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'), 
    model_name="gpt-3.5-turbo"
)

doc_transformer = LLMGraphTransformer(
    llm=llm,
)

for chunk in chunks:
    # Generate the entities and relationships from the chunk
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

The LLMGraphTransformer.convert_to_graph_documents method uses the LLM to generate a set of graph documents containing the extracted nodes and relationships.

Use the output to generate the graph

Finally, the application uses the generated graph documents to create the graph.

The graph documents consist of a set of entity nodes. A node representing the chunk and a HAS_ENTITY relationship are added to each graph document, linking the generated entities back to the source chunk.

python
from langchain_community.graphs.graph_document import Node, Relationship

for chunk in chunks:
    
    filename = os.path.basename(chunk.metadata["source"])
    chunk_id = f"{filename}.{chunk.metadata['page']}"

    graph_docs = doc_transformer.convert_to_graph_documents([chunk])

    # Map the entities in the graph documents to the chunk node
    for graph_doc in graph_docs:
        chunk_node = Node(
            id=chunk_id,
            type="Chunk"
        )

        for node in graph_doc.nodes:

            graph_doc.relationships.append(
                Relationship(
                    source=chunk_node,
                    target=node, 
                    type="HAS_ENTITY"
                    )
                )

    # Add the graph documents to the graph
    graph.add_graph_documents(graph_docs)

The last step passes the graph documents to the Neo4jGraph.add_graph_documents method to create the nodes and relationships in Neo4j.

This process creates a simplified version of the data model you saw in the previous lesson:

The graph data model showing the relationship between the Chunk and Entity nodes
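Once the build completes, you could explore the result directly in Neo4j with a Cypher query along these lines (assuming the labels and relationship types created above):

```cypher
// Trace a document to its chunks and the entities extracted from them
MATCH (d:Document)<-[:PART_OF]-(c:Chunk)-[:HAS_ENTITY]->(e)
RETURN d.id AS document, c.id AS chunk, labels(e) AS entityLabels, e.id AS entity
LIMIT 25
```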

When you are ready, move on to the next lesson.

Summary

In this lesson, you explored the code to build a knowledge graph using an LLM.

In the next lesson, you will run and adapt a Python program to create a knowledge graph.