In the last lesson, you built a graph using metadata to understand the course content and the relationships between the content and lessons.
In this lesson, you will add topics from the unstructured lesson content to the graph.
Topics
Topics are a way to categorize and organize content. You can use topics to help users find relevant content, recommend related content, and understand the relationships between different pieces of content. For example, you can find similar lessons based on their topics.
There are many ways to extract topics from unstructured text. You could use an LLM and ask it to summarize the topics from the text. A more straightforward approach is to identify all the nouns in the text and use them as topics.
To hold the topic data, you should extend the data model to include a new node type, Topic, and a new relationship, MENTIONS.
Extract nouns
The Python NLP (natural language processing) library, textblob, can extract noun phrases from text. You will use it to extract the topics from the lesson content.
You can extract the topics using the TextBlob.noun_phrases property.
Open and run the llm-vectors-unstructured/extract_topics.py program:
from textblob import TextBlob
phrase = "You can extract topics from phrases using TextBlob"
topics = TextBlob(phrase).noun_phrases
print(topics)
The program prints a list of the topics (noun phrases) found in the input. In this case, the topics are 'extract topics' and 'textblob'.
You may find that changing the default noun phrase extractor used by TextBlob improves results for your data.
Update the Graph
Your task is to update the build_graph.py program you created in the last lesson to:

- Extract topics from the lesson content.
- Add topics and relationships to the graph.
View the code from the last lesson
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database="neo4j") as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()
First, update the get_course_data function to extract topics from the lesson content. Add the topics to the data dictionary using the TextBlob.noun_phrases property:
from textblob import TextBlob

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data
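One thing to watch for: noun_phrases can return the same phrase more than once, sometimes with varying capitalisation, which produces near-duplicate Topic nodes. A small helper like the one below (an optional, illustrative addition, not part of the course code) normalises the list before it is stored:

```python
def normalise_topics(phrases):
    """Lowercase, strip, and de-duplicate noun phrases, preserving order."""
    seen = set()
    topics = []
    for phrase in phrases:
        name = phrase.lower().strip()
        if name and name not in seen:
            seen.add(name)
            topics.append(name)
    return topics

# Duplicates and empty strings are dropped; case is folded
print(normalise_topics(["Semantic Search", "semantic search", "Neo4j", ""]))
# ['semantic search', 'neo4j']
```

If you use it, wrap the extraction call: data['topics'] = normalise_topics(TextBlob(data['text']).noun_phrases).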
Next, update the create_chunk function to add the topics to the graph:
def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """,
        data
    )
The topics are returned as a list by the noun_phrases property. The FOREACH clause iterates over the list, creating a Topic node and a MENTIONS relationship between the Paragraph and Topic nodes.
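Because MERGE matches an existing pattern before creating one, topics mentioned by many paragraphs map to a single Topic node, and re-running the program does not create duplicates. A rough Python analogy (this is only an illustration of MERGE semantics, not how Neo4j is implemented) is inserting into a set:

```python
# Topics for one paragraph, with a repeated phrase
paragraph_topics = ["semantic search", "neo4j", "semantic search"]

topic_nodes = set()   # stands in for MERGE (t:Topic {name: topic})
mentions = set()      # stands in for MERGE (p)-[:MENTIONS]->(t)

for topic in paragraph_topics:
    topic_nodes.add(topic)        # second "semantic search" matches, not creates
    mentions.add(("p1", topic))   # relationship is also merged, not duplicated

print(sorted(topic_nodes))  # ['neo4j', 'semantic search']
print(len(mentions))        # 2
```

The same matching happens across paragraphs: every paragraph that mentions "neo4j" gets a MENTIONS relationship to the one shared Topic node.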
View the complete code
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase
from textblob import TextBlob

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """,
        data
    )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

for chunk in chunks:
    with driver.session(database="neo4j") as session:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()
Run the program to update the graph with the topics extracted from the lesson content.
Query topics
You can use the topics to find related lessons. For example, you can find all the lessons that mention the topic "semantic search":
MATCH (t:Topic{name:"semantic search"})<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN DISTINCT l.name, l.url
You can list the topics and the number of lessons that mention them to understand the most popular topics:
MATCH (t:Topic)<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN t.name, COUNT(DISTINCT l) AS lessons
ORDER BY lessons DESC
By adding topics to the graph, you can use them to find related content.
Topics are also universal: you can use them to find related content across different sources. For example, if you added technical documentation to this graph, you could use the topics to find related lessons and documentation.
Combining data from different sources and understanding their relationships is the starting point for creating a knowledge graph.
When you have added topics to the graph, click Complete to finish this lesson.
Lesson Summary
In this lesson, you learned how to extract topics from unstructured text and add them to a graph.
In the next optional challenge, you can add more data to the graph.