Inspecting the Data for Import

Inspecting the CSV data

Before you start working with the source CSV data, you must understand how delimiters, quotes, and special characters are used for each row.

If the headers do not correspond to the data representing the fields, you cannot load the data. You must also know whether you can assume the use of the default delimiter ",", otherwise, you will need to use the FIELDTERMINATOR keyword along with LOAD CSV when you use Cypher to import the data.

You should have a local copy of the CSV files so you can inspect the data in them. In fact, when using the Neo4j Data Importer you will need a local copy of the CSV files.

Step 1: Acquire or download the CSV

If the CSV file is a URL, you can simply download it in a Web browser and save it locally.

Step 2: Determine the delimiter

You should view the contents (at least the beginning rows) of the file to determine the delimiter.

For example, here is what the movies.csv file looks like in an editor:

Movie CSV in editor

We can see that the CSV file indeed has a header row and the delimiter is a comma. It also looks like fields do not have quotes around values that are strings.

Step 3: Determine if headers match fields

Depending on the length of each row, it may be hard to determine if the values for fields look consistent. With a CSV file, you can open it in a spreadsheet to understand the data a little better.

Here is what the movies.csv file looks like in a spreadsheet:

Movie CSV in spreadsheet

Here we see that the data in the rows corresponds to the header row.

Important: By default all of these fields in each row will be read in as string types.

Notice also that for this CSV file, a multi-value field such as countries or languages has values delimited by the "|" character.

In a spreadsheet, it may be a little easier to inspect the data.

Step 4: Determine if all data is readable

You must make sure that all records can be read from the CSV file without error.

Here is the Cypher code that will read all data in a CSV file that contains headers and is specified as a URL:

Cypher
Unresolved directive in lesson.adoc - include::https://raw.githubusercontent.com/neo4j-graphacademy/llm-vectors-unstructured/main/modules/1-preparing/lessons/3-inspect-clean-data/load-csv.cypher[]

It will read every row from the CSV file and will return the number of rows successfully read. If an error occurs during the reading of the CSV file, an error will be raised as shown here when we attempt to read the test.csv file.

Error reading CSV file

In this case, you would need to investigate why the error occurred and fix the CSV file.

Step 5: Is the data clean?

Here are some additional things that you will check before you begin working with the data, depending on the data:

  • Are quotes used correctly?

  • If an element has no value will an empty string be used?

  • Are UTF-8 prefixes used (for example \uc)?

  • Do some fields have trailing spaces?

  • Do the fields contain binary zeros?

  • Understand how lists are formed (default is to use colon(:) as the separator.

  • Any obvious typos?

For this course, you will be working with data that has already been cleaned, but if you receive a CSV file from another source, you should ensure that it is clean.

Check your understanding

Before you start working with data

Before you start working with a CSV file and inspecting the data, what must you do?

  • ❏ Make sure only double-quotes are used for strings.

  • ✓ Make sure the headers match the data in the file.

  • ✓ Make sure every row is readable.

  • ✓ Confirm the delimiter used in the file.

Hint

These three checks are necessary before you start working with the CSV file.

Solution

Before you start working with a CSV file and inspecting the data, you must:

  1. Make sure the headers match the data in the file.

  2. Make sure every row is readable.

  3. Confirm the delimiter used in the file.

Summary

In this lesson, you learned about some of the things you should check for in any data you intend to import. In the next challenge, you will inspect some CSV files.