Structured and Unstructured Data

As you have learned in previous lessons, a key challenge in data science is making sense of unstructured data. In this lesson, you will explore a strategy for storing unstructured data in a graph.

Vector indexes and embeddings go some way to allow you to search and query unstructured data, but they are not a complete solution. You can use the metadata surrounding the unstructured data to help make sense of it.

Imagine the following use case. You want to analyze customer emails to:

Understand the customer sentiment (are they happy or unhappy?)
Identify any products or services

You could represent this data in a graph of Email, Customer, and Product nodes.

An import for this process would have to:

Extract the email metadata (date, sender, recipient, subject)
Embed the email text
Extract the customer sentiment using a vector index
Search for references to products or services in the email text

By importing the unstructured data into a graph, you can use the known relationships between the data to help make sense of it.

For example, you could use the graph to answer questions like:

What products are customers talking about positively in their emails?
Are there times in the year when customers are more likely to complain?
What are customers saying about a particular product?

Course data

During this module, you will use Python and LangChain to import the text of a GraphAcademy course into Neo4j.

GraphAcademy represents courses as a graph of Course, Module, and Lesson nodes. A course has modules, and a module has lessons.

A simplistic view of the graph would look like this:

The GraphAcademy course content is in a public GitHub repository. We write courses in plain text AsciiDoc that is parsed and displayed on the GraphAcademy website.

The course content is unstructured, but you can make sense of it by using the metadata (the course structure), embeddings, and vector indexes

View this lesson’s content on GitHub and note the following:

The lesson content is written in plain text and is unstructured.
The file name is lesson.adoc.
All lessons have the same file name.
The directory structure denotes the course (llm-vectors-unstructured), module (3-unstructured-data), and lesson (1-structured-unstructured).

You will use these files and directory structure to create the graph of the course content.

Check your understanding

Metadata

True or False - combining associated metadata with unstructured data improves the usefulness of the data over and above the use of embeddings and vectors.

✓ True
❏ False

Hint

Metadata can provide context to unstructured data.

Solution

The answer is True. You can use the metadata surrounding the unstructured data to help make sense of it.

Lesson Summary

In this lesson, you learned how unstructured data can be enhanced using associated metadata. You explored a strategy for storing unstructured data in a graph and how GraphAcademy stores the course content.

In the next lesson, you will learn about how to deal with large amounts of unstructured content.

Introduction to Vector Indexes and Unstructured Data

Introduction

Vector indexes

Importing unstructured data

Structured and Unstructured Data

Course data

Check your understanding

Metadata

Lesson Summary