
What Is Machine Learning Model Retraining?

Machine learning (ML) model retraining, or continuous training, is the MLOps capability to automatically and continuously retrain a machine learning model, either on a schedule or in response to an event-driven trigger. It involves designing and implementing processes that automate model retraining over time.

Retraining is fundamental to ensure that a machine learning model is constantly providing the most up-to-date predictions, while minimizing manual interventions and optimizing for monitoring and reliability.

In this post, we will outline the key considerations for designing and implementing ML model retraining, briefly introduce some relevant tools, and discuss the benefits of getting it right.

How to Design and Implement Machine Learning Model Retraining

In the design phase of machine learning model retraining, you should outline a strategy that answers the following five questions. 

Why? 

The aim of retraining a model is to ensure that it consistently provides the most correct output. For a successful retraining pipeline, it’s key to define what makes an output the most correct for your business use case and how to measure this correctness. 

Typically, the data scientists who developed the model are expected to perform a thorough observability and explainability analysis that informs offline metrics, technical and business online metrics, baseline behavior, expected performance, and the impact of degradation. Model blessing and A/B testing scenarios also depend on this analysis to ensure that only the most correct model(s) is pushed and kept in production. Model monitoring is then set up in line with this analysis to guarantee that performance is reliably tracked and that, when degradation occurs, an automated retraining process is triggered.

When? 

As just mentioned, performance degradation is the main reason to perform an automated retraining process. Retraining can be started by a trigger (e.g., the click-through rate has dropped below 1.91%) or by a schedule (e.g., every Monday at 2 a.m.). 

Model performance is always expected to be optimal with the most recent data, but the more often retraining happens, the higher the cost. You can define the ideal schedule by running an offline experiment to derive the expected time it takes for data drift and concept drift to push the model performance below a baseline threshold. Data and model changes, as well as code updates, are another reason to kickstart a model retraining, as part of the continuous integration (CI) pipeline.
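As a rough illustration of such an offline experiment, the sketch below replays a frozen model over successive weekly slices of held-out data to estimate how quickly drift pushes it below the baseline. The column names, feature list, and threshold are all hypothetical; any classifier with a predict_proba method would do.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.80            # assumed business-approved performance floor
FEATURES = ["f1", "f2", "f3"]  # hypothetical feature columns

def weeks_until_degradation(model, df: pd.DataFrame) -> int:
    """Replay a frozen model over weekly slices of held-out data and count
    how many weeks pass before performance drops below the baseline."""
    weeks = 0
    for _, window in df.groupby(pd.Grouper(key="timestamp", freq="W")):
        if window.empty:
            continue
        auc = roc_auc_score(window["label"],
                            model.predict_proba(window[FEATURES])[:, 1])
        if auc < BASELINE_AUC:
            break  # drift has pushed the model below the threshold
        weeks += 1
    return weeks  # a data-driven upper bound on the retraining interval
```

The resulting horizon gives you an evidence-based starting point for the retraining cadence, which you can then trade off against compute cost.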

What? 

Model retraining involves lifting and shifting the batch training code defined at development time into an automated workflow. You should abstract feature selection, model parameters, and other configurable pipeline parameters as input variables of the retraining pipeline. This will allow for maximum flexibility and for the code to be refactored for optimal logical separation.
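A minimal sketch of this parameterization might look like the following; the dataclass, column names, and model choice are illustrative rather than tied to any particular framework:

```python
from dataclasses import dataclass

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

@dataclass
class RetrainConfig:
    features: list[str]              # feature selection as a pipeline input
    model_params: dict               # hyperparameters as pipeline inputs
    training_window_days: int = 365  # how much history to retrain on

def retrain(df: pd.DataFrame, config: RetrainConfig) -> GradientBoostingClassifier:
    """Refit the model on the most recent window of data, driven entirely
    by the supplied configuration rather than hard-coded values."""
    cutoff = df["timestamp"].max() - pd.Timedelta(days=config.training_window_days)
    recent = df[df["timestamp"] >= cutoff]
    model = GradientBoostingClassifier(**config.model_params)
    model.fit(recent[config.features], recent["label"])
    return model
```

Because every knob is an input, the same pipeline can be rerun on a schedule, from a trigger, or as part of CI with different settings and no code changes.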

Also, it is fundamental to decide how much data to retrain on. Depending on the data strategy, you can refer to:

  • Offline learning: This is the most typical approach. Each retraining uses either all available data or the most recent window of data with a fixed, empirically chosen length (e.g., a year of records or 100 images). 
  • Online learning: This is a natural fit for applications that work on real-time streaming data. The system is retrained by passing only the new data instances sequentially, rather than retraining on already seen samples.

Many practitioners default to offline learning because retraining from scratch on a curated dataset is more robust and easier to validate. Online learning is the more cost-effective option for streaming data, but a run of bad or drifted input can quickly degrade a model that never revisits past samples. The sketch below contrasts the two approaches.
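As a rough, self-contained illustration using scikit-learn's SGDClassifier on synthetic data, offline retraining refits on the full dataset, while online learning updates incrementally via partial_fit:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Offline learning: refit from scratch on the full (or windowed) dataset.
offline_model = SGDClassifier(random_state=0).fit(X, y)

# Online learning: update incrementally on new batches only, never
# revisiting already-seen samples.
online_model = SGDClassifier(random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set up front
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    online_model.partial_fit(X_batch, y_batch, classes=classes)
```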

Who? 

Implementing and maintaining the automated training pipeline is a machine learning engineer’s job. Still, as a best practice, monitoring retrained models via dashboards, alerts, and reports should be a shared team effort, involving machine learning engineers, data scientists, and business stakeholders.


How? 

As we’ve established, model retraining involves automating the steps manually run by data scientists during the development phase. For more mature MLOps systems, consider functionalities beyond the automation task itself, such as monitoring, metadata tracking, artifact storage, and model registry.

We’ll discuss how to implement these capabilities in the Tools for Model Retraining section, below.

The Benefits of Model Retraining

Machine learning model retraining is fundamental to ensuring that the model consistently produces the most correct output: an automated pipeline can respond promptly and correctly to the changes that degrade its performance, such as data and concept drift.

You need a well-defined monitoring system in order to notice that these drifts have occurred and, thus, trigger a model retrain while alerting the correct stakeholders.
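As a minimal sketch of such a drift check, the monitoring system might compare a feature's live distribution against its training-time reference using a two-sample Kolmogorov–Smirnov test from SciPy; the significance level here is an assumed default, not a universal recommendation:

```python
from scipy.stats import ks_2samp

def feature_drifted(reference, live, alpha: float = 0.01) -> bool:
    """Flag data drift when a feature's live distribution differs
    significantly from its training-time reference distribution."""
    return ks_2samp(reference, live).pvalue < alpha
```

In a monitoring job, a positive result would both alert the relevant stakeholders and kick off the automated retraining pipeline.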

Additionally, setting an automated model retraining pipeline provides the following benefits:

  • A quicker way to production for similar machine learning pipelines
  • A forcing function for developers to set up logic, observability, and explainability tests, since the pipeline needs to run reliably without human intervention. This makes the model more trustworthy, both internally and externally.

All in all, a model retraining process that is not designed and implemented correctly can lead to a disastrous loss of profit and customer trust. On the other hand, a correctly defined model retraining process will lead to more revenue and customer satisfaction. It will also give data scientists and machine learning engineers more time to spend on improving existing use cases and building new ones, rather than on unnecessary maintenance.

Tools for Model Retraining

Model retraining is, at its core, an automation task. Tools such as Airflow and Prefect are the most common choices for workflow orchestration. While Airflow refers to a workflow as a directed acyclic graph (DAG) and Prefect calls it a flow, both let you automate tasks, schedule runs, and set up custom callbacks. We recommend Airflow when robustness and maturity matter most to your product, and Prefect for a newer, more ML-focused perspective. 
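For instance, a minimal Airflow sketch of the "every Monday at 2 a.m." schedule mentioned earlier might look like this; the task bodies are placeholders for your own pipeline steps, and the syntax assumes the Airflow 2.x TaskFlow API:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="0 2 * * 1",              # every Monday at 2 a.m.
     start_date=datetime(2024, 1, 1),
     catchup=False)
def weekly_retrain():
    @task
    def extract_data():
        ...  # pull the training window from your data store

    @task
    def train(data):
        ...  # refit the model with the current pipeline config

    @task
    def evaluate_and_register(model):
        ...  # bless the candidate and push it to the registry

    evaluate_and_register(train(extract_data()))

weekly_retrain()
```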

Still, remember that retraining requires more than just pipeline automation. You need tools for supporting activities like metadata tracking, artifact management, model registry, and monitoring. While there are many valid open-source and third-party providers that specialize in each of these capabilities, we recommend an all-around MLOps solution like Iguazio, which will cater to increasingly mature MLOps needs.
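As one illustration of these supporting capabilities, the sketch below uses MLflow, a common open-source choice, to track parameters and metrics and register the retrained model; the data is synthetic so the example is self-contained, and the model name is illustrative. An end-to-end platform would expose comparable primitives.

```python
import mlflow
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

params = {"n_estimators": 200, "learning_rate": 0.05}

with mlflow.start_run(run_name="weekly-retrain"):
    mlflow.log_params(params)                       # metadata tracking
    model = GradientBoostingClassifier(**params).fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    mlflow.log_metric("valid_auc", auc)             # monitoring input
    mlflow.sklearn.log_model(                       # artifact storage plus
        model, "model",                             # model registry
        registered_model_name="ctr-model")          # (name is hypothetical)
```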