Understanding the Source Data

CSV file structure

A CSV file represents rows of data that could be exported from relational and other databases, Web APIs, and other sources. When you are given CSV files, you must determine:

  • Whether the CSV file will have header information, describing the names of the fields.

  • What the delimiter will be for the fields in each row.

Including headers in the CSV file reduces syncing issues and is a recommended Neo4j best practice.

Example CSV files

Here are examples of CSV files with and without headers:

With and without headers

In these examples, the comma (,) is the field terminator. This is the default that Cypher uses. If the source CSV files use a different field terminator, you must specify the FIELDTERMINATOR in your Cypher LOAD CSV clause.

Normalized data

Data normalization is common in relational models. This enables you to have CSV files that correspond to a relational table where an ID is used to identify the relationships.

Here is an example where we have normalized data for people, movies, and roles:

Normalized data

Notice that the people.csv file has a unique ID for every person and the movies1.csv file has a unique ID for every movie. The roles.csv file is used to relate a person to a movie and provide the characters. This is the roles data that could be used to create the :ACTED_IN relationship that you have seen in the Movie graph.

De-normalized data

Here is an example where we have de-normalized data for the same dataset:

Denormalized data

With de-normalized data, the data is represented by multiple rows corresponding to the same entity, which will be loaded as a node. The difference, however, is that de-normalized data typically represents data from multiple tables in the RDBMS. For example, the movie and person data (including the ID) is repeated in multiple rows in the file, but a row represents a particular actor’s role in a particular movie. That is, a Movie and Person data will be represented in multiple rows, but an actor’s role will be represented by a single row.

IDs must be unique

When you load data from CSV files, you rely heavily upon the IDs specified in the file. A Neo4j best practice is to use an ID as a unique property value for each node. If the IDs in your CSV file are not unique for the same entity (node), you will have problems when you load the data and try to create relationships between existing nodes.

Unique Ids

Check your understanding

1. Delimiting fields in CSV files

Suppose your CSV file uses the ";" character to separate the fields in the file. What keyword do you use to tell LOAD CSV to use this delimiter?

  • WITH SEMI-COLON

  • USE SEMI-COLON

  • FIELDTERMINATOR

  • ; is the default so you need not specify anything additional in your LOAD CSV clause.

Hint

There is only one default delimiter accepted by LOAD CSV, which is the "," character.

Solution

Use FIELDTERMINATOR to specify a delimeter that is not the default of ",".

2. De-normalized CSV data

What are the features of a de-normalized CSV file?

  • ❏ There is no duplication of data in the CSV file.

  • ✓ There is duplication of data in the CSV file.

  • ✓ IDs representing an entity that will be loaded as a Node must be unique.

  • ✓ Typically represents data from multiple tables in the RDBMS.

Hint

Normalized data typically corresponds to a single RDBMS table. Three of these features are true for de-normalized CSV files in Neo4j.

Solution

The features of a de-normalized CSV file are:

  1. There is duplication of data in the CSV file.

  2. IDs representing an entity that will be loaded as a Node must be unique.

  3. Typically represents data from multiple tables in the RDBMS.

Summary

In this lesson, you learned how to understand the organization of CSV data you will import and that you must have unique IDs for all nodes that will be created. In the next lesson, you will learn how you must ensure that the data you are importing is clean.