Feature Store Motivation
In this series of blog posts, we will showcase an end-to-end hybrid cloud ML workflow using the Iguazio MLOps Platform & Feature Store combined with Azure ML. This first post is an overview of the solution and the types of problems it solves, while the following parts are technical deep dives into each step of the process:
- Part 1: Feature Store Motivation
- Part 2: Data Ingestion + Transformation into Iguazio's Feature Store
- Part 3: Model Training via Azure ML leveraging Iguazio
- Part 4: Hybrid Cloud + On-Premise Model Serving + Model Monitoring with Iguazio
The Gaps: Challenges When Operationalizing Data Science
Regardless of the environment, one of the main challenges when operationalizing data science is fostering collaboration between teams and eliminating tech silos. This is both a technological and an organizational challenge, requiring the right processes and the right tools to support them. In a typical data science project, three different roles are usually involved at different points in the pipeline:
- Data Engineer: Ingest and transform raw data from various sources
- Data Scientist: Utilize transformed data to train a model
- MLOps Engineer: Containerize and deploy the model at scale with monitoring, drift detection, and re-training capabilities
While the pipeline itself looks straightforward, there are more than a few places where things can go wrong, mostly at the handoff points between teams. What happens when a Data Scientist needs additional or different features? What happens when the MLOps Engineer cannot anticipate a model issue the Data Scientist knew about, or a data issue the Data Engineer knew about?
These organizational challenges can be solved, in part, with central artifact/job management and a feature store.
Filling the Gaps with Iguazio's Feature Store
The feature store is a relatively recent concept in the world of MLOps. The purpose and function of a feature store are slightly different depending on who is asked. However, most can agree on the main definition: a central place where teams can store and access features, and share them across projects. You can read more about the necessity of feature stores for scaling data science here.
For Iguazio's feature store, storing and retrieving features is the bare minimum. Also available out of the box are custom batch/real-time pipelines to transform data upon ingestion, dual storage formats to facilitate batch and real-time workloads, and integration with model monitoring and serving.
This allows the feature store to be used as a central communication plane across projects, jobs, models, artifacts, transformation pipelines, etc.
The Iguazio feature store also functions as a data transformation service, enabling complex feature calculations such as a customer's mean purchases in the last 24 hours, financial transactions in the last 12 hours, or other sliding window aggregations.
Not only does this standardize the overall ML workflow, but it also benefits each team in the pipeline in its own way:
- Data Engineer: Build batch and real-time data ingestion and transformation pipelines
- Data Scientist: Easily access existing features, reduce duplicated work, and use complex real-time features in predictions
- MLOps Engineer: Access features in real time for model serving and monitoring
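To make the sliding window aggregations mentioned above more concrete, here is a minimal sketch in plain Python. It is not the Iguazio feature store API, which later parts of this series cover. The function name `sliding_window_means` and the sample events are hypothetical and exist only for illustration. The sketch computes, for each incoming event, a customer's mean purchase amount over a trailing time window.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def sliding_window_means(events, window):
    """Return (customer_id, timestamp, mean_amount) for every event.

    `events` is an iterable of (customer_id, timestamp, amount) tuples.
    The mean covers the current event plus that customer's earlier events
    that are less than `window` older than the current one.
    """
    recent = defaultdict(deque)  # customer_id -> (timestamp, amount) pairs in the window
    totals = defaultdict(float)  # customer_id -> running sum of amounts in the window
    results = []
    # Process events in time order, as a streaming pipeline would see them.
    for customer_id, ts, amount in sorted(events, key=lambda e: e[1]):
        bucket = recent[customer_id]
        bucket.append((ts, amount))
        totals[customer_id] += amount
        # Evict events that have fallen out of the trailing window.
        while bucket[0][0] <= ts - window:
            _, old_amount = bucket.popleft()
            totals[customer_id] -= old_amount
        results.append((customer_id, ts, totals[customer_id] / len(bucket)))
    return results


# Hypothetical purchase events: (customer_id, timestamp, amount).
events = [
    ("a", datetime(2021, 1, 1, 0, 0), 10.0),
    ("a", datetime(2021, 1, 1, 10, 0), 30.0),
    ("b", datetime(2021, 1, 1, 12, 0), 5.0),
    ("a", datetime(2021, 1, 2, 6, 0), 50.0),
]

for row in sliding_window_means(events, timedelta(hours=24)):
    print(row)
```

Writing and maintaining this kind of logic separately for batch training and real-time serving is exactly the duplicated work a feature store is meant to remove. In a feature store, the aggregation is defined once and shared by every team.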
Next Steps: Iguazio + Azure
Despite the number of fantastic services available in the Azure ecosystem, a feature store as described above is not one of them. Additionally, cloud services by their nature typically cannot be used for on-premise workloads or serving. This is where the Iguazio MLOps Platform can fill in some gaps.
In the next three blog posts, we will build out a full end-to-end hybrid cloud ML workflow using capabilities from both Azure and Iguazio:
E2E Hybrid Cloud ML Part 2: Data Ingestion + Transformation via Iguazio's Feature Store
- Detailed overview of Iguazio feature store functionality
- Ingest and transform dataset into feature store
- Retrieve features in batch and real-time
E2E Hybrid Cloud ML Part 3: Model Training via Azure AutoML
- Upload/register features from the Iguazio feature store into Azure ML
- Orchestrate Azure AutoML training job from the Iguazio platform
- Download trained model(s) + metadata from Azure back into the Iguazio platform
E2E Hybrid Cloud ML Part 4: On-Premise Model Serving + Model Monitoring
- Deploy models to a real-time HTTP endpoint
- Combine multiple models into a voting ensemble
- Integrate model serving with real-time feature retrieval
- Integrate model serving with model monitoring and drift detection