CSV files

CSV files store data in a structure where a special character, usually a comma, separates each value in a row, and each record is a new row in the file.

csv
personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939

Viewing CSV files

CSV files are simple text files - you can open them with any text editor (e.g. Notepad or TextEdit). You can also open them with a spreadsheet application (e.g. Excel or Google Sheets).

Field Terminator

In the example above, a comma (,) separates each field, but other characters can be used, such as a tab (\t) or a pipe (|).

This is the same data as above but separated by a pipe character:

csv
personId|name|birthYear
23945|Gerard Pires|1942
553509|Helen Reddy|1941
113934|Susan Flannery|1939

The character that separates the fields is called the Field Terminator or Delimiter.

Headers

Typically, the first row in a CSV file is a header row, which contains the names of the columns.

csv
personId|name|birthYear

Headers are not mandatory; when working with CSV files with no header row, you will need to know the order of the columns and refer to them by index.

csv
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939

Adding headers to CSV files is a good practice. It makes the data easier to understand and work with.

Quotes

If the data contains the field terminator character, the data must be in quotes. e.g. if the names in the example were in the format last, first the data would need to be formatted as "last, first".

csv
personId,name,birthYear
23945,"Pires, Gerard",1942
553509,"Reddy, Helen",1941
113934,"Flannery, Susan",1939

Most data applications will automatically quote data if it contains the field terminator character.

Normalized data

If the source data is normalized (e.g. when exported from a relational data model) there will typically be multiple CSV files.

Each CSV file will represent a table in the relational data model, and the files will be related to each other by unique IDs.

In this normalized data example, there are three files for people, movies, and roles:

person.csv

csv
personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939

movies.csv

csv
movieId,title,avgVote,releaseYear,genres
189,Sin City,8.000000,2005,Crime|Thriller
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi
11969,Tombstone,7.800000,1993,Action|Romance|Western

The genres field is a list of genres separated by a pipe (|) character.

roles.csv

csv
personId,movieId,character
2295,189,Marv
56731,189,Nancy
16851,189,Dwight

Notice that the person.csv file has a unique ID for every person, and the movies.csv file has a unique ID for every movie.

The roles.csv file relates a person to a movie and provides the characters.

De-normalized data

If the source data is de-normalized, there will typically be a single CSV file. The single file will contain all the data, often duplicated where there are relationships between entities.

Here is an example of de-normalized data for the person, movie, and role data:

movies-n.csv

csv
movieId,title,avgVote,releaseYear,genres,personType,name,birthYear,character
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Bruce Willis,1955,Korben Dallas
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Gary Oldman,1958,Jean-Baptiste Emanuel Zorg
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Ian Holm,1931,Father Vito Cornelius
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Kurt Russell,1951,Wyatt Earp
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Val Kilmer,1959,Doc Holliday
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Sam Elliott,1944,Virgil Earp

De-normalized data typically represents data from multiple tables.

For example, the movie and person data (including the ID) is repeated in multiple rows in the file. A row represents a particular actor’s role in a movie. That is, for each role an actor played, the movie and person data will be duplicated.

How your data is structured will determine how you need to process it.

Check Your Understanding

CSV data format

Which of the following statements are true about CSV files?

  • ❏ A comma has to be used as the field terminator.

  • ✓ Headers are optional.

  • ✓ CSV files can represent normalized and de-normalized data.

Hint

The CSV file format is flexible in format and the data it can store.

Solution

The CSV file format is flexible. The field terminator can be a comma, tab, pipe, or any other character. Headers are optional. CSV files can represent normalized and de-normalized data.

Summary

In this lesson, you learned about the structure of CSV files and typical normalized and de-normalized data representation.

In the next lesson, you will learn how to load a CSV file using Cypher.

Chatbot

Hi, I am an Educational Learning Assistant for Intelligent Network Exploration. You can call me E.L.A.I.N.E.

How can I help you today?