CSV files store data in a structure where a special character, usually a comma, separates each value in a row, and each record is a new row in the file.
personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939
Viewing CSV files
CSV files are simple text files - you can open them with any text editor (e.g. Notepad or TextEdit). You can also open them with a spreadsheet application (e.g. Excel or Google Sheets).Field Terminator
In the example above, a comma (,
) separates each field, but other characters can be used, such as a tab (\t
) or a pipe (|
).
This is the same data as above but separated by a pipe character:
personId|name|birthYear
23945|Gerard Pires|1942
553509|Helen Reddy|1941
113934|Susan Flannery|1939
The character that separates the fields is called the Field Terminator or Delimiter.
Headers
Typically, the first row in a CSV file is a header row, which contains the names of the columns.
personId|name|birthYear
Headers are not mandatory; when working with CSV files with no header row, you will need to know the order of the columns and refer to them by index.
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939
Quotes
If the data contains the field terminator character, the data must be in quotes. e.g. if the names in the example were in the format last, first
the data would need to be formatted as "last, first"
.
personId,name,birthYear
23945,"Pires, Gerard",1942
553509,"Reddy, Helen",1941
113934,"Flannery, Susan",1939
Normalized data
If the source data is normalized (e.g. when exported from a relational data model) there will typically be multiple CSV files.
Each CSV file will represent a table in the relational data model, and the files will be related to each other by unique IDs.
In this normalized data example, there are three files for people, movies, and roles:
person.csv
personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939
movies.csv
movieId,title,avgVote,releaseYear,genres
189,Sin City,8.000000,2005,Crime|Thriller
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi
11969,Tombstone,7.800000,1993,Action|Romance|Western
genres
field is a list of genres separated by a pipe (|) character.roles.csv
personId,movieId,character
2295,189,Marv
56731,189,Nancy
16851,189,Dwight
Notice that the person.csv file has a unique ID for every person, and the movies.csv file has a unique ID for every movie.
The roles.csv file relates a person to a movie and provides the characters.
De-normalized data
If the source data is de-normalized, there will typically be a single CSV file. The single file will contain all the data, often duplicated where there are relationships between entities.
Here is an example of de-normalized data for the person, movie, and role data:
movies-n.csv
movieId,title,avgVote,releaseYear,genres,personType,name,birthYear,character
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Bruce Willis,1955,Korben Dallas
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Gary Oldman,1958,Jean-Baptiste Emanuel Zorg
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Ian Holm,1931,Father Vito Cornelius
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Kurt Russell,1951,Wyatt Earp
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Val Kilmer,1959,Doc Holliday
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Sam Elliott,1944,Virgil Earp
De-normalized data typically represents data from multiple tables.
For example, the movie and person data (including the ID) is repeated in multiple rows in the file. A row represents a particular actor’s role in a movie. That is, for each role an actor played, the movie and person data will be duplicated.
How your data is structured will determine how you need to process it.
Check Your Understanding
CSV data format
Which of the following statements are true about CSV files?
-
❏ A comma has to be used as the field terminator.
-
✓ Headers are optional.
-
✓ CSV files can represent normalized and de-normalized data.
Hint
The CSV file format is flexible in format and the data it can store.
Solution
The CSV file format is flexible. The field terminator can be a comma, tab, pipe, or any other character. Headers are optional. CSV files can represent normalized and de-normalized data.
Summary
In this lesson, you learned about the structure of CSV files and typical normalized and de-normalized data representation.
In the next lesson, you will learn how to load a CSV file using Cypher.