Git-based CI/CD for Machine Learning & MLOps

Yaron Haviv | June 22, 2020

For years, machine learning engineers have struggled to manage and automate ML pipelines in order to speed up model deployment in real business applications.

Similar to how software developers leverage DevOps to increase efficiency and speed up release velocity, MLOps streamlines the ML development lifecycle by delivering automation, enabling collaboration across ML teams and improving the quality of ML models in production while addressing business requirements. Essentially, it’s a way to automate, manage, and speed up the very long process of bringing data science to production.

Right now, data scientists are adopting a paradigm that centers on building “ML factories”, i.e., automated pipelines that take data, pre-process it, then train, generate, deploy, and monitor models.
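The "ML factory" idea can be pictured as a chain of stages, each consuming the output of the previous one. The following is a minimal sketch only: the stage functions and the `run_pipeline` helper are hypothetical stand-ins, not part of any framework mentioned in this post.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, List[str]]:
    """Run stages in order, feeding each stage's output to the next."""
    completed: List[str] = []
    for name, stage in stages:
        payload = stage(payload)
        completed.append(name)  # simple record of what has run
    return payload, completed


# Toy stand-ins for real stages (all hypothetical).
def preprocess(rows: List[Optional[float]]) -> List[float]:
    # Drop missing values.
    return [r for r in rows if r is not None]


def train(rows: List[float]) -> Dict[str, Any]:
    # The "model" here is just the mean of the cleaned data.
    return {"mean": sum(rows) / len(rows)}


def deploy(model: Dict[str, Any]) -> Dict[str, Any]:
    # Mark the model as deployed instead of shipping it anywhere.
    return {**model, "deployed": True}


if __name__ == "__main__":
    stages = [("preprocess", preprocess), ("train", train), ("deploy", deploy)]
    result, log = run_pipeline(stages, [1.0, None, 3.0])
    print(result, log)
```

Because every stage has the same shape (one input, one output), rerunning the whole chain after a data or code change is a single call, which is the property the rest of this post builds on.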

But as is the case with all models deployed in real-world scenarios, the code and data change, causing drift and compromising the accuracy of models. ML engineers often have to run most, if not all, of the pipeline again to generate new models and productionize them. And they have to do this each time the data or codebase changes. This is the major problem with MLOps: it incurs significant overhead, because data scientists spend most of their time on data preparation and wrangling, configuring infrastructure, and managing software packages and frameworks.

In DevOps, the twin development practices of Continuous Integration and Continuous Deployment (CI/CD) enable developers to continuously integrate new features and bug fixes, initiate code builds, run automated tests, and deploy to production, thus automating the software development lifecycle and facilitating fast product iterations.

Implementing CI/CD in DevOps environments is fairly simple: you code, build, test, and release. However, applying these practices to ML pipelines is much more complicated and presents several unique challenges. This is due to the additional aspects of ML development, which include data, parameter, and configuration versioning, and which require powerful resources (data processing engines, GPUs, computation clusters, etc.) to execute.

Due to this inherent complexity in creating, running, and tracking ML pipelines, data scientists and ML engineers are now looking to automate MLOps the CI/CD way.

However, MLOps is challenging for the following reasons:

Tight coupling between the data and the model
Managing data, code, and model versioning
Silos - Data engineers, data scientists, and the engineers responsible for delivery operate in silos, which creates friction between teams
Skills - Data scientists are often not trained engineers and thus do not always follow good DevOps practices
No easy way to identify model drift and trigger a pipeline for retraining the model
Too many manual steps and no automation
Difficulty in migrating ML workloads from local environments to the cloud
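To make the drift item above concrete, here is a minimal sketch of one common way to score drift on a single numeric feature, the Population Stability Index (PSI), followed by a retraining decision. This is an illustration under stated assumptions, not the method used by any tool named in this post: the function names, the bin count, and the 0.2 threshold (a widely quoted rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in an edge bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bucket_shares(reference), bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(psi: float, threshold: float = 0.2) -> bool:
    """Hypothetical policy: retrain once the drift score crosses the threshold."""
    return psi > threshold


if __name__ == "__main__":
    training_sample = [i / 100 for i in range(1000)]      # values in [0, 10)
    live_sample = [5 + i / 200 for i in range(1000)]      # shifted to [5, 10)
    score = population_stability_index(training_sample, live_sample)
    print(score, should_retrain(score))
```

In an automated setup, a check like `should_retrain` would run on a schedule against fresh production data, and a positive result would trigger the retraining pipeline instead of waiting for someone to notice degraded accuracy.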

Solving these challenges will require ML engineers to leverage a robust platform capable of incorporating CI/CD principles into the ML lifecycle, thus achieving true MLOps.

CI/CD helps to accelerate and improve the efficiency of workflows while shortening the time it takes data scientists to experiment, develop, and deploy models into production for real business applications.

By definition, a well-implemented MLOps process should achieve continuous integration and delivery (CI/CD) for data- and ML-intensive applications. An effective CI/CD system is therefore vital to this process. Not only should it understand ML elements natively, but it must also stay in sync with any changes to underlying data or code, irrespective of the platform on which the model runs.

ML engineers looking to truly automate ML pipelines need a way to natively enable continuous integration of machine learning models to production.

CI/CD for ML & MLOps Using GitHub Actions

The wide variety of platforms for implementing CI/CD and automating builds in software development environments provides developers with a great deal of flexibility in how they build DevOps pipelines.

Data scientists, on the other hand, are severely limited in this area due to the dearth of interoperable tools for properly versioning, tracking, and productionizing ML models.

While there are a number of services that effectively incorporate CI/CD into ML pipelines, they place data scientists into a black-box silo situation in which they must build, train, track, and deploy in a closed technology stack.

Existing open-source systems that offer such functionality may not always interoperate cleanly with the platforms and tools data scientists prefer, thus forcing them to build customized deployment tools or leave their comfort zone (and go through a steep learning curve) to work with unfamiliar tools.

At Iguazio, we believe that ML teams should be able to achieve MLOps by using their preferred frameworks, platforms, and languages to experiment, build and train their models, and quickly deploy to production.

To facilitate this, we’ve provided CI/CD workflows in GitHub Actions, which you can easily configure (for more complicated workflows) to enable continuous deployment of models on Iguazio’s data science platform or within your on-prem cluster or private-cloud environment.

It’s a simple way to train models directly from GitHub (via GitHub Actions) and perform the kind of sophisticated data analysis required by production-ready models deployed in real-world scenarios.

Using GitHub Actions together with the open-source projects MLRun and Kubeflow, we’ve provided ML teams with an automated mechanism for launching ML pipelines and for tracking and administering the execution of the entire process, from data ingestion all the way through to production.

Data scientists can now iterate on their models using Jupyter or other preferred tools and platforms on their workstations and fold that work very quickly into the overall process of development and operations. Any changes they make become visible on GitHub, enabling them to tie each deployment back to the actual commit.

Essentially, this gives data scientists the same control over models that software developers have over their code through GitHub. It’s basically adding source control on top of your infrastructure and laying a CI/CD system on top of that: not just a generic CI/CD system, but one that’s built to address all the hassles specific to the ML domain.

The CI/CD workflow uses an open-source framework, MLRun, to serve as an end-to-end orchestration tool on top of Kubeflow Pipelines, presenting data scientists with a more holistic way to automate entire ML pipelines, from code and model development and data engineering all the way to production. MLRun executes, tracks, and versions projects, orchestrates pipelines, and automates real-time data processing functions and model deployment.

GitHub provides source control, and GitHub Actions adds a rich CI/CD ecosystem that enables you to collaborate with ML teams distributed globally while automating and securing your workflows and ensuring they remain compliant. Together they enable data scientists to take advantage of the collaboration, versioning, and knowledge-sharing abilities of GitHub (already enjoyed by software developers) while folding in the automated testing, model training, and deployment required of any ML-aware CI/CD system.

Any changes to the underlying code, data, or parameters kick off the process. Pushing newly added code into the repository triggers the workflow, which then takes over: it builds, runs pipelines, performs automated testing, and executes everything else automatically.
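In the workflow described here the trigger is a Git push or pull request. The underlying idea, rerunning only when code, data, or parameters have actually changed, can be sketched with a combined fingerprint. The sketch below is purely illustrative: `fingerprint` and `needs_rerun` are hypothetical helpers, and a real system would hash files or dataset versions rather than in-memory bytes.

```python
import hashlib
import json
from typing import Mapping, Optional


def fingerprint(code: bytes, data: bytes, params: Mapping[str, object]) -> str:
    """Hash the three inputs the text names: code, data, and parameters."""
    digest = hashlib.sha256()
    # sort_keys makes the hash independent of parameter ordering.
    encoded_params = json.dumps(params, sort_keys=True).encode()
    for part in (code, data, encoded_params):
        # Hash each part separately so boundaries between parts stay unambiguous.
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()


def needs_rerun(previous: Optional[str], current: str) -> bool:
    """Rerun on the first run or whenever any input changed."""
    return previous != current


if __name__ == "__main__":
    before = fingerprint(b"def train(): ...", b"1,2,3\n", {"lr": 0.01})
    after = fingerprint(b"def train(): ...", b"1,2,3\n", {"lr": 0.02})
    print(needs_rerun(before, after))
```

Storing the last fingerprint alongside the model makes it possible to tell, for any deployed model, exactly which combination of code, data, and parameters produced it.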

As such, ML engineers can develop within Jupyter while the system scales and distributes workloads across multiple containers, builds images, dynamically assigns GPUs, data volume mounts, and other resources, and deploys on top of a Kubernetes cluster. And just like that, you have a full-blown machine learning pipeline that drives everything to production.

This enables data scientists to stay within their comfort zone and abstract some of the functionalities not available in their local environment.

Now, by opening a pull request or checking in your code, you can create and execute an entire machine learning pipeline, track and record all the process information, and update models from the actions.
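One small piece of "tracking and recording all the process information" is stamping each run with the commit that triggered it. The sketch below assumes the job runs inside GitHub Actions, which sets the `GITHUB_SHA` and `GITHUB_REF` environment variables on its runners; the `build_run_record` helper and the record layout are hypothetical and are not MLRun's tracking format.

```python
import json
import os
from typing import Dict, Mapping


def build_run_record(
    metrics: Mapping[str, float], env: Mapping[str, str] = os.environ
) -> Dict[str, object]:
    """Attach the triggering commit to a training run's results."""
    return {
        # Set by GitHub Actions; outside a workflow these fall back to "unknown".
        "commit": env.get("GITHUB_SHA", "unknown"),
        "ref": env.get("GITHUB_REF", "unknown"),
        "metrics": dict(metrics),
    }


if __name__ == "__main__":
    record = build_run_record({"accuracy": 0.93})
    # Serialize so the record can be stored next to the model artifact.
    print(json.dumps(record, indent=2))
```

With the commit stored next to the metrics, any model in production can be traced back to the exact code revision that produced it.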

We’ll continue to help both data scientists and end-users access models and move into production by abstracting most of what happens under the hood, ultimately easing the process of bringing your data science initiatives to life.

To learn more about CI/CD for ML, watch the webinar with Microsoft, GitHub, and Iguazio.
