
Building a Single Pipeline for Data Integration and ML with Azure Synapse Analytics and Iguazio

Alexandra Quinn | June 22, 2021

Across organizations large and small, ML teams are still faced with data silos that slow down or halt innovation. Read on to learn how enterprises are tackling these challenges by integrating data of any type into a single end-to-end pipeline and rapidly running AI/ML with Azure Synapse Analytics and Iguazio.

Data Challenges in ML Pipelines

Data integration is an important requirement across the entire ML lifecycle. It affects tasks such as:

  • Gathering raw data from different sources, structured or unstructured
  • Preparing it at scale
  • Feeding the data into both training and production environments 
  • Gathering additional data from production to feed back into the inferencing layer
  • Model monitoring
  • Model explainability
  • Governance and compliance

Yet, data integration is also one of the biggest challenges data engineers and scientists have today. It is a siloed, cumbersome process that is full of friction.

Data handling today is divided into three different pipelines:

1. The Research Pipeline

In the research pipeline, all data sets are loaded into the data lake through ETL processes. Batch transformations are run by the data engineering team, and additional transformations are run to train models. Quite often, the data is not kept up-to-date.

2. The Serving Pipeline

The serving pipeline gets data from operational databases. This includes up-to-date information, streaming events, and more. Results are stored in an interactive database or key-value store, from which the model is served.

3. The Governance Pipeline

The governance pipeline collects data from the production environment, and then runs anomaly detection, accuracy analyses, and more. Data is stored for explainability, governance, and compliance. 

Maintaining these three pipelines and ensuring they communicate requires collaboration across teams. It is a long and onerous process that runs the risk of losing data accuracy and value, and it is very resource-intensive.

Why Feature Engineering Requires a Single Pipeline

ML models are fed with features that are relevant to the business and originate from the actual operational databases or warehouses. But due to siloed data management processes, the data used for training often differs from the data used for inference. This makes the live models inaccurate and can affect business outcomes.

Introducing Azure Synapse Analytics & Iguazio

By implementing analytics tools like Azure Synapse Analytics along with Iguazio’s feature store, we can build a single pipeline for data integration and machine learning. This pipeline collects the data from different sources, feeds it to the feature store where it can be transformed, makes it available for both training and serving and then feeds it to the rest of the ML pipeline.

Together, these two solutions allow ML teams to:

  1. Connect to any type of data source
  2. Run transformations on the data
  3. Feed the data to the feature store
  4. Catalog the data with its metadata, statistics, lineage, etc.
  5. Run any additional ML transformations
  6. Index the data for both random (real-time) and batch access, so it can be used for training, serving and governance
  7. Feed the data into the rest of the ML pipeline (see the sketch after this list)
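As a rough illustration of steps 2–5, the snippet below sketches how transformed data might be registered and ingested into the Iguazio feature store via MLRun's feature-store API. The feature set name, entity key and DataFrame are hypothetical, and the exact API surface may differ between MLRun versions.

```python
import pandas as pd
import mlrun.feature_store as fstore
from mlrun.feature_store import Entity

# Hypothetical transformed data produced by an upstream Synapse/Spark step
transactions_df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_order_value": [42.5, 17.0],
    "orders_last_30d": [3, 1],
})

# Define a feature set keyed by customer_id (names are illustrative)
transactions_set = fstore.FeatureSet(
    "transactions",
    entities=[Entity("customer_id")],
    description="Aggregated customer transaction features",
)

# Ingest the DataFrame; schema, statistics and metadata are cataloged automatically
fstore.ingest(transactions_set, transactions_df)
```

Once ingested, the feature set's schema, statistics and lineage appear in the feature store catalog, so other teams and projects can discover and reuse the same features.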

To learn more about governing AI over production data, watch this on-demand MLOps Live session with Microsoft here.

Getting Started with a Single ML Pipeline

The first step is setting up Azure Synapse Analytics to unify and explore data from multiple sources. Azure Synapse has more than 95 data ingestion connectors and integrates with Spark and SQL engines, which are the most commonly used analytics runtimes in the industry.

The platform supports any type of data, structured or unstructured, as well as different types of storage. Users can explore and access the data without having to move it or integrate with each data source manually. Data can be accessed from databases and tables, or from data lake storage.
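For example, in a Synapse Spark notebook (where a `spark` session is already available) you could explore raw files sitting in data lake storage without moving them first. The storage account, container and path below are placeholders, not part of the joint solution itself.

```python
# Read raw Parquet files directly from Azure Data Lake Storage Gen2
# (the abfss path is a placeholder for your own account/container).
raw_df = spark.read.load(
    "abfss://raw@<storage-account>.dfs.core.windows.net/transactions/",
    format="parquet",
)

# Lightweight exploration before handing the data to the feature store
raw_df.printSchema()
raw_df.groupBy("country").count().show()
```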

After Azure Synapse unifies the data from all the different sources, the Iguazio Data Science Platform runs analytics and pushes the transformed data to the feature store. With the feature store, data scientists can share and use features without additional engineering code. All the feature engineering is abstracted under the hood, which simplifies and accelerates training, serving and monitoring.
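To illustrate how the same features can then feed both training and inference, here is a hedged sketch using MLRun's feature-vector APIs. The vector name and feature references are hypothetical, and the calls may vary slightly across MLRun versions.

```python
import mlrun.feature_store as fstore

# Define a feature vector that joins features from one or more feature sets
vector = fstore.FeatureVector(
    "customer-model-features",
    ["transactions.*"],  # feature references are illustrative
    description="Features for the customer model",
)
vector.save()

# Offline (batch) retrieval for training
train_df = fstore.get_offline_features(vector).to_dataframe()

# Online (real-time) retrieval for serving
svc = fstore.get_online_feature_service(vector)
features = svc.get([{"customer_id": "c1"}])
svc.close()
```

Because training and serving both read from the same feature definitions, the training/serving skew described above is avoided by design.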

You can see a demo of Azure Synapse and the Iguazio Data Science Platform in this webinar. In addition to the full demo of the joint solution, you’ll learn how to create an Azure ML job for training and how to run a Spark session to explore the data.

Conclusion

Data handling is perhaps the biggest problem for ML teams, who must contend with data silos, different data types, complex glued-together solutions and rising costs. By creating a single operational ML pipeline with Azure Synapse and Iguazio, ML teams can remove friction and use operational data for accurate model training based on the most up-to-date features.

To learn more about the Iguazio Data Science Platform, or to find out how we can help you bring your data science to life, contact our experts.