Extract Topics

In the last lesson, you built a graph using metadata to understand the course content and the relationships between the content and lessons.

In this lesson, you will add topics from the unstructured lesson content to the graph.

Topics

Topics are a way to categorize and organize content. You can use topics to help users find relevant content, recommend related content, and understand the relationships between different pieces of content. For example, you can find similar lessons based on their topics.

There are many ways to extract topics from unstructured text. You could use an LLM and ask it to summarize the topics from the text. A more straightforward approach is to identify all the nouns in the text and use them as topics.

To hold the topic data, you should extend the data model to include a new node type, Topic, and a new relationship, MENTIONS.

Data model showing the `Topic` node connected to the `Paragraph` node via the `MENTIONS` relationship
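
Because the import will MERGE a Topic node for every noun phrase it finds, you may want to add a uniqueness constraint on the topic name before running it. The constraint backs each MERGE with an index and guarantees one node per topic. A minimal sketch (the constraint name topic_name is illustrative):

cypher
CREATE CONSTRAINT topic_name IF NOT EXISTS
FOR (t:Topic) REQUIRE t.name IS UNIQUE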

Extract nouns

The Python natural language processing (NLP) library textblob can extract noun phrases from text. You will use it to extract the topics from the lesson content.

You can extract the topics using the TextBlob.noun_phrases property.

Open and run the llm-vectors-unstructured/extract_topics.py program:

python
from textblob import TextBlob

phrase = "You can extract topics from phrases using TextBlob"

topics = TextBlob(phrase).noun_phrases

print(topics)

The program prints the list of noun phrases found in the input, in this case 'extract topics' and 'textblob'.

You may find that changing the default noun phrase extractor used by TextBlob improves the results for your data.
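
For example, you can pass TextBlob's ConllExtractor, which is trained on the CoNLL-2000 chunking corpus, in place of the default FastNPExtractor. A minimal sketch (the extractor's corpora must be downloaded first, for example with python -m textblob.download_corpora):

python
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

# Swap the default FastNPExtractor for the CoNLL-2000-trained extractor
extractor = ConllExtractor()

phrase = "You can extract topics from phrases using TextBlob"
topics = TextBlob(phrase, np_extractor=extractor).noun_phrases

print(topics)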

Update the Graph

Your task is to update the build_graph.py program you created in the last lesson to:

  1. Extract topics from the lesson content.

  2. Add Topic nodes and MENTIONS relationships to the graph.

View the code from the last lesson
python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
        )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database="neo4j") as session:
        
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()

First, update the get_course_data function to extract topics from the lesson content. Add the topics to the data dictionary using the TextBlob.noun_phrases property:

python
from textblob import TextBlob

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data

Next, update the create_chunk function to add the topics to the graph:

python
def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
           
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """, 
        data
        )

The TextBlob.noun_phrases property returns the topics as a list. The FOREACH clause iterates over the list, merging a Topic node and creating a MENTIONS relationship between the Paragraph and Topic nodes.
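
The noun_phrases list can contain the same phrase several times within a chunk. MERGE makes the repeated writes harmless, but you could deduplicate the list first to avoid redundant work. A minimal sketch (the clean_topics helper is illustrative, not part of the lesson code):

python
def clean_topics(noun_phrases):
    # Trim and deduplicate the phrases so each distinct topic
    # is written to the graph only once per chunk
    return sorted({phrase.strip() for phrase in noun_phrases})

# In get_course_data:
# data['topics'] = clean_topics(TextBlob(data['text']).noun_phrases)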

View the complete code
python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase
from textblob import TextBlob

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
           
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """, 
        data
        )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database="neo4j") as session:

        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()

Run the program to update the graph with the topics extracted from the lesson content.
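
Once the program has finished, you could check that the import worked by counting the Topic nodes, for example:

cypher
MATCH (t:Topic)
RETURN count(t) AS topics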

Query topics

You can use the topics to find related lessons. For example, this query finds all the lessons that mention the topic "semantic search":

cypher
MATCH (t:Topic{name:"semantic search"})<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN DISTINCT l.name, l.url

You can list the topics and the number of lessons that mention them to understand the most popular topics:

cypher
MATCH (t:Topic)<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN t.name, COUNT(DISTINCT l) as lessons
ORDER BY lessons DESC

By adding topics to the graph, you can find related content.
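
For example, you could rank the lessons most similar to a given lesson by counting the topics they share. A sketch of such a query (the lesson name "1-introduction" is illustrative):

cypher
MATCH (l:Lesson {name: "1-introduction"})-[:CONTAINS]->(:Paragraph)
      -[:MENTIONS]->(t:Topic)<-[:MENTIONS]-(:Paragraph)
      <-[:CONTAINS]-(other:Lesson)
WHERE other <> l
RETURN other.name AS lesson, count(DISTINCT t) AS sharedTopics
ORDER BY sharedTopics DESC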

Topics are also universal, so you can use them to relate content from different sources. For example, if you added technical documentation to this graph, you could use the topics to find related lessons and documentation.
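
A hypothetical query, assuming the documentation were loaded as Document nodes that MENTION the same Topic nodes, might look like this (Document and its title property are assumptions, not part of this lesson's data model):

cypher
MATCH (l:Lesson)-[:CONTAINS]->(:Paragraph)-[:MENTIONS]->(t:Topic)
      <-[:MENTIONS]-(d:Document)
RETURN l.name AS lesson, d.title AS document,
       collect(DISTINCT t.name) AS sharedTopics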

Combining data from different sources and understanding their relationships is the starting point for creating a knowledge graph.

When you have added topics to the graph, click Complete to finish this lesson.

Lesson Summary

In this lesson, you learned how to extract topics from unstructured text and add them to a graph.

In the next optional challenge, you can add more data to the graph.