Constructing Knowledge Graphs with LLMs

Unstructured data and graphs

Creating knowledge graphs from unstructured data can be complex, typically involving multiple steps to query, cleanse, and transform the data.

You can use the text analysis capabilities of Large Language Models (LLMs) to automate the extraction of entities and relationships from your unstructured text.

An LLM generated this knowledge graph of Technologies, Concepts, and Skills from a lesson on grounding LLMs.

A knowledge graph showing the relationships between Technology Concepts and Skills

Extend your graph

In this challenge, you will use an LLM to extend your graph with new entities and relationships found in the unstructured text data.

Open the 1-knowledge-graphs-vectors/llm_build_graph.py starter code that creates the graph of lesson content.

llm_build_graph.py
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs.graph_document import Node, Relationship

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
    )

def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)
    return data

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
    )

# Create an OpenAI LLM instance
# llm = 

# Create an LLMGraphTransformer instance
# doc_transformer =

for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)

    # Generate the graph docs
    # graph_docs =
    
    # Map the entities in the graph documents to the paragraph node
    # for graph_doc in graph_docs:
            
    # Add the graph documents to the graph
    # graph.
    
    print("Processed chunk", data['id'])

You will need to:

  1. Create an LLM instance

  2. Create a transformer to extract entities and relationships

  3. Extract entities and relationships from the text

  4. Map the entities to the paragraphs

  5. Add the graph documents to the database

Create an LLM

You need an LLM instance to extract the entities and relationships:

python
Create the llm
# Create an OpenAI LLM instance
llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'), 
    model_name="gpt-3.5-turbo"
)

The model_name parameter defines which OpenAI model will be used. gpt-3.5-turbo is a good choice for this task given its balance of accuracy, speed, and cost.

Graph Transformer

To extract the entities and relationships, you will use a graph transformer. The graph transformer takes unstructured text data, passes it to the LLM, and returns the entities and relationships.

python
Create the transformer
# Create an LLMGraphTransformer instance
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    )

The optional allowed_nodes and allowed_relationships parameters allow you to define the types of nodes and relationships you want to extract from the text.

In this example, the nodes are restricted to entities relevant to the content. The relationships are not restricted, allowing the LLM to find any relationships between the entities.

Restricting the nodes and relationships will result in a more concise knowledge graph. A more concise graph may help you answer specific questions, but it may also omit information.
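
For example, you could also restrict the relationship types. This is an optional sketch; the relationship names below are illustrative and not part of the lesson:

python
Restrict the relationships (optional)
# Optionally restrict relationships as well - these types are examples only
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
    allowed_relationships=["USES", "RELATES_TO", "REQUIRES"],
)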

Extract entities and relationships

For each chunk of text, you will use the transformer to convert the text into a graph. The transformer returns a set of graph documents that represent the entities and relationships in the text.

python
Call the transformer
for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)

    # Generate the graph docs
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])
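
If you want to see what the transformer extracts, you could print the contents of each graph document. This is an optional debugging sketch, not part of the starter code:

python
Inspect the graph documents (optional)
    # Optional: print the extracted nodes and relationships for each chunk
    for graph_doc in graph_docs:
        for node in graph_doc.nodes:
            print("Node:", node.id, node.type)
        for rel in graph_doc.relationships:
            print("Relationship:", rel.source.id, rel.type, rel.target.id)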

Map extracted entities to the paragraphs

The graph documents contain the extracted nodes and relationships, but they are not linked to the original paragraphs.

To understand which entities are related to which paragraphs, you will map the extracted nodes to the paragraphs.

You will create a data model with a HAS_ENTITY relationship between the paragraphs and the entities.

A data model showing a HAS_ENTITY relationship between the Paragraph and entity nodes


This code creates a Node representing the Paragraph and appends a HAS_ENTITY relationship between it and each extracted entity to the graph document.

python
Map the entities to the paragraphs
    # Map the entities in the graph documents to the paragraph node
    for graph_doc in graph_docs:
        paragraph_node = Node(
            id=data["id"],
            type="Paragraph",
        )

        for node in graph_doc.nodes:
            graph_doc.relationships.append(
                Relationship(
                    source=paragraph_node,
                    target=node,
                    type="HAS_ENTITY"
                )
            )

Add the graph documents

Finally, you need to add the new graph documents to the Neo4j graph database.

python
Add the graph documents
    # Add the graph documents to the graph
    graph.add_graph_documents(graph_docs)
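
Depending on your LangChain version, add_graph_documents also accepts optional parameters. For example, baseEntityLabel adds a shared label to every extracted node and include_source links each entity to a node representing the source text. This is an optional variation, not required for the challenge:

python
Optional parameters (version dependent)
    # Optional: baseEntityLabel adds an __Entity__ label to extracted nodes,
    # include_source links the entities to a node for the source chunk
    graph.add_graph_documents(
        graph_docs,
        baseEntityLabel=True,
        include_source=True
    )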

When you are ready, run the program to extend your graph.

Calls to the LLM are relatively slow, so the program will take a few minutes to run.
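
While you are testing, you could process just the first few chunks to keep the run short. This is a hypothetical tweak, not part of the starter code:

python
Limit the chunks while testing (optional)
# Optional: only send the first few chunks to the LLM while testing
chunks = chunks[:5]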

Querying the knowledge graph

You can view the generated entities using the following Cypher query:

cypher
MATCH (p:Paragraph)-[:HAS_ENTITY]-(e)
RETURN p, e

Entities

The entities in the graph allow you to understand the context of the text.

You can find the most mentioned topics in the graph by counting the number of times a node label (or entity) appears in the graph:

cypher
MATCH ()-[:HAS_ENTITY]->(e)
RETURN labels(e) as labels, count(e) as nodes
ORDER BY nodes DESC

Entities

You can drill down into the entity id to gain insights into the content. For example, you can find the most mentioned Technology.

cypher
MATCH ()-[r:HAS_ENTITY]->(e:Technology)
RETURN e.id AS entityId, count(r) AS mentions
ORDER BY mentions DESC
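
You can also run these queries from Python using the Neo4jGraph instance created in the build script. A minimal sketch, reusing the graph object from earlier:

python
Run a query from Python (optional)
# Optional: run the same query with the existing Neo4jGraph instance
results = graph.query("""
    MATCH ()-[r:HAS_ENTITY]->(e:Technology)
    RETURN e.id AS entityId, count(r) AS mentions
    ORDER BY mentions DESC
""")

for row in results:
    print(row["entityId"], row["mentions"])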

The knowledge graph can also show you the connections within the content, for example, which lessons relate to each other.

This Cypher query matches one specific document and uses the entities to find related documents:

cypher
MATCH (l:Lesson {
    name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)

MATCH (p)-[:HAS_ENTITY]->(entity)<-[:HAS_ENTITY]-(otherParagraph)
MATCH (otherParagraph)<-[:CONTAINS]-(otherLesson)
RETURN DISTINCT entity.id, otherLesson.name

Lesson entities

The knowledge graph contains the relationships between entities in all the documents.

This Cypher query restricts the output to the entities within a single lesson:

cypher
MATCH (l:Lesson {
    name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)
MATCH (p)-[:HAS_ENTITY]->(e)

MATCH path = (e)-[r]-(e2)
WHERE (p)-[:HAS_ENTITY]->(e2)
RETURN path

A path is returned representing the knowledge graph for the document.

The graph output from the previous Cypher query

Labels, ids, and relationships

You can return the node labels, ids, and relationship types by unwinding the path's relationships:

cypher
MATCH (l:Lesson {
    name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)
MATCH (p)-[:HAS_ENTITY]->(e)

MATCH path = (e)-[r]-(e2)
WHERE (p)-[:HAS_ENTITY]->(e2)

UNWIND relationships(path) as rels
RETURN
    labels(startNode(rels))[0] as eLabel,
    startNode(rels).id as eId,
    type(rels) as relType,
    labels(endNode(rels))[0] as e2Label,
    endNode(rels).id as e2Id

Explore the graph

Take some time to explore the knowledge graph to find relationships between entities and lessons.

Summary

You used an LLM to create a knowledge graph from unstructured text.