Introduction
In this lesson we will cover the motivation for graph native machine learning in GDS, how GDS manages models and pipelines, and some resources for pre-processing, hyperparameter tuning, and training methods.
Why GDS Has Machine Learning Capabilities?
To start, it may be helpful to discuss why GDS has machine learning capabilities in the first place. It always has, and remains to be true, that you can generate graph features in Neo4j and export them into another environment to do machine learning, such as those in Python, Apache Spark, etc. These external frameworks have a lot of flexibility to customize and tune machine learning models. However, there are still multiple reasons why you may want to use graph based machine learning tools in GDS:
-
Managing Complex Model Design: Graph data, by nature of being highly interconnected, introduces complications into machine learning workflows that can be tough to catch and work through for those less familiar with graph. Gone unaccounted for, these complications can compromise the validity, computational performance, and predictive performance of ML models. The GDS pipelines include methodology to account for these complications that would otherwise be difficult to develop and maintain in a general purpose, non-graph specific, ML framework. Prime examples being proper data splitting design, handling severe class imbalance, and avoiding data leakage in feature engineering.
-
Fast Path to Production with Strong Database Coupling: Because GDS exists within Neo4j, it is easy to directly apply GDS ML to the Neo4j database. Once you train a pipeline in GDS, the model is effectively automatically saved and deployed - ready to make predictions on data from the Neo4j database with a simple
predict
command. Enterprise users have the ability to persist these models for reuse as well as publish them to share across teams. -
Development & Experimentation: Even for seasoned practitioners at enterprises with mature and robust MLOps workflows, the native ML pipelines remove a lot of initial friction often associated with graph machine learning. This allows you to experiment and test model approaches to get off the ground quickly.
ML Pipelines and Models in GDS
GDS focuses on offering managed pipelines for end-to-end ML workflows. Data selection, feature engineering, data splitting, hyperparameter configuration, and training steps are coupled together within the pipeline object to track the end-to-end steps needed.
There are currently two supported types of ML pipelines:
-
Node Classification Pipelines: Supervised binary and multi-class classification for nodes
-
Link Prediction Pipelines: Supervised prediction for whether a relationship or "link" should exist between pairs of nodes
These pipelines have a train
procedure that, once run, produces a trained model object. These trained model objects, in turn, have a predict
procedure that can be used to produce predictions on the data.
You can have multiple pipelines and model objects at once in GDS. Both pipelines and models have a catalog that allows you to manage them by name, similar to graphs in the graph catalog.
Pre-Processing and Other Resources
To help with building ML models, our documentation includes additional guides for pre-processing, training methods, and hyperparameter configuration.
Check your understanding
1. GDS and Machine Learning
Which of the following statements are true regarding ML in GDS? Select all that apply.
-
✓ GDS focuses on an ML pipeline pattern
-
❏ GDS only offers graph feature engineering and connectors, no native ML
-
✓ GDS has a pipeline for binary and multi-class node classification
-
✓ GDS has a pipeline for supervised prediction for whether a relationship should exist between node pairs otherwise known as “Link Prediction”
-
✓ GDS algorithms can be used to generate features for machine learning
Hint
GDS does so much more than offering graph feature engineering and connectors.
Solution
The following statements are true regarding ML in GDS:
GDS focuses on an ML pipeline pattern.
GDS has a pipeline for binary and multi-class node classification.
GDS has a pipeline for supervised prediction for whether a relationship should exist between node pairs (Link Prediction).
GDS algorithms can be used to generate features for machine learning.
Summary
In this lesson you learned about some of the challenges with graph native machine learning and why GDS offers a solution. You also learned about the ML pipeline and model pattern in GDS and how training and prediction works at a high level.
In the next two lessons we will cover node classification and link prediction pipelines in more depth.