Before you start importing data, you should take time to understand the data you are working with, including:
- The data format and structure
- The frequency of updates
- Data quality
- Uniquely identifying data
These factors will influence the approach you take to importing the data into Neo4j.
Data Structure and Format
How the data is formatted may influence how you import the data into Neo4j.
For example, if the source data is in a proprietary data format, you may need to write a custom script to extract the data before importing it. Do you need to use a specific tool or technology? Alternatively, the data may be in a format you can import directly, such as CSV or JSON.
The data structure will influence the complexity of the import process.
The data may be normalized or de-normalized. If de-normalized, you must understand how to transform the data before importing it. Do you have to split the data into multiple nodes or relationships? If normalized, are the entities correct for the graph data model you are implementing?
Frequency
How often the source data is updated will affect your import strategy and process.
Do you need to import the data once, or will you need to import data regularly? Are updates incremental, or must you import the entire dataset each time?
Are updates real-time or batched? Would an event-driven architecture, where you stream updates from the source, be appropriate? Or would a simple batch import process fulfill the requirement?
Data Quality
Whether you need to clean the data before importing will depend on the quality of the data in the source, the data model you are implementing, and your requirements.
You should assess the source data for accuracy, validity, completeness, reliability, and consistency:
- Accuracy - Verify that the data is correct and error-free by cross-referencing it with reliable sources.
- Validity - Ensure the data is applicable and suitable for the intended use or context.
- Completeness - Check that all necessary data is included and there are no missing elements.
- Reliability - Ascertain that the data comes from a credible and dependable source.
- Consistency - Confirm that the data does not show discrepancies when compared over time or with similar datasets.
Some common issues with the data format you should also check for include (a basic cleaning sketch follows this list):
- Are quotes used correctly?
- Are entities and values of the correct data type?
- Are UTF-8 prefixes used (for example \uc)?
- Do some fields have trailing spaces?
- Do the fields contain binary zeros?
- Are lists formed correctly?
- Are there any obvious typos?
Uniquely Identifying Data
A Neo4j best practice is to use an ID as a unique property value for each node. Do all the entities in the source data have a unique identifier?
For example, if you are importing sales data into Customer and Product nodes, is there a unique identifier (ID) for each customer and product?
If the IDs in your source data are not unique for the same entity (node), you will have problems loading the data and creating relationships between existing nodes.
Check Your Understanding
What aspect of data quality ensures that data is suitable and applicable for its intended use or context? (select all that apply)
- ✓ Accuracy
- ✓ Validity
- ✓ Completeness
- ✓ Reliability
- ✓ Consistency
Hint
It is important you assess data quality against multiple factors.
Solution
Accuracy, validity, completeness, reliability, and consistency are all important aspects of data quality.
Summary
In this lesson, you learned the importance of understanding the source data before importing it into Neo4j.
In the next lesson, you will explore how the graph data model you are implementing affects the import.