While the terms machine learning and artificial intelligence are often used interchangeably, machine learning is actually a specialized subfield of AI: traditional AI systems operate on encoded domain knowledge, whereas ML algorithms learn to make predictions by extracting that knowledge directly from data.
ML can be applied with various learning techniques, the most common being supervised learning. In supervised learning, an ML algorithm learns during a training phase in which the model adjusts its trainable parameters to fit the patterns that map features to labels; this adjustment is performed progressively by splitting the training data into multiple batches and iterating through them over many consecutive epochs.
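As a minimal sketch of this training loop, the pure-Python example below fits a single trainable parameter by mini-batch gradient descent; the toy dataset (y = 2x) and all function names are illustrative, not from the article:

```python
import random

random.seed(0)  # for reproducibility of the shuffles

def train(data, epochs=50, batch_size=4, lr=0.01):
    """Fit y ~ w * x by mini-batch gradient descent on squared error."""
    w = 0.0  # the single trainable parameter, adjusted during training
    for _ in range(epochs):  # one epoch = one full pass over the data
        random.shuffle(data)
        for i in range(0, len(data), batch_size):  # split into batches
            batch = data[i:i + batch_size]
            # gradient of the mean squared error w.r.t. w on this batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad  # progressive adjustment of the parameter
    return w

# Hypothetical toy dataset generated from y = 2 * x
data = [(x, 2.0 * x) for x in range(1, 9)]
w = train(data)
print(round(w, 2))  # converges close to 2.0
```

Note that `epochs`, `batch_size`, and `lr` are exactly the kind of knobs the next sections call hyperparameters: they control how `w` is learned, but are not themselves learned from the data.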
Crucially, all ML techniques, from supervised to reinforcement learning, rely on adjusting trainable parameters to enable learning. Each ML algorithm has a set of hyperparameters that define how this adjustment is performed, and how these hyperparameters are set dictates how well the algorithm will learn, i.e., how accurate the model will be. Setting hyperparameters is the remit of model fine-tuning, or model tuning for short.
Below, we’ll explore in detail what hyperparameters and model tuning are, explain why model tuning is important, and walk through all the steps necessary to successfully tune your machine learning models.
Tuning a machine learning model is the process of configuring the implementation-specific parameters that act as control knobs to guide its learning, covering both the model structure itself and its training regime.
Specifically, hyperparameters guide how the model learns its trainable parameters.
To understand model tuning, we need to clarify the difference between two types of parameters:
- Trainable parameters, which are learned from the data during training (e.g., the weights of a neural network).
- Hyperparameters, which are set before training and control how the trainable parameters are learned (e.g., the learning rate).
While model training focuses on learning optimal trainable parameters, model tuning focuses on learning optimal hyper parameters.
It’s particularly important to understand the difference between these two, since practitioners commonly refer to either simply as “parameters,” leaving it to context to identify the exact type, which can lead to confusion and misunderstandings.
Each algorithm (and sometimes each implementation of an algorithm) has its own set of hyperparameters, but it’s common for the same class of algorithms to share at least a small subset of them. When developing a pipeline for model training, it’s fundamental to always refer to the algorithm’s implementation for details about its hyperparameters. We recommend reviewing the official documentation for XGBoost and LightGBM, two of the most widely used and successful implementations of tree-based algorithms, for in-depth examples.
While all hyperparameters affect the model’s learning capability, some are more influential than others, and it’s typical to tune only these for time and computational efficiency. For a neural network in TensorFlow Keras, we may want to tune:
- the learning rate of the optimizer,
- the number of hidden layers and of units per layer,
- the activation functions,
- the dropout rate,
- the batch size and number of epochs.
Moving beyond the algorithmic perspective, most practitioners nowadays refer to any parameter that affects model performance and can take multiple values as a hyperparameter. This also includes data processing choices, e.g., which transformations are performed or which features are used as input.
As feature engineering is the process that transforms data into its best form for learning, model tuning is the process that assigns the best settings to an algorithm for learning.
All implementations of machine learning algorithms come with a default set of hyperparameters that have been proven to typically perform well. Relying on the defaults for a real-world application is too high a risk to take, as it is unlikely, if not impossible, that the default hyperparameter configuration will provide optimal performance for every use case.
In fact, it is well known that the performance of ML algorithms is highly sensitive to hyperparameter selection. Each model and dataset combination requires its own tuning, which is particularly relevant to keep in mind for automated re-training.
After a data scientist selects the most appropriate algorithm for a given use case and performs the relevant feature engineering, they must determine the optimal hyperparameters for training. Even with extensive prior experience, determining them by intuition alone is not feasible.
While it’s a good idea to try a couple of hyperparameter selections thought to be relevant, to ensure the use case is feasible and can achieve the expected offline performance, performing extensive hyperparameter tuning by hand is inefficient, error-prone, and difficult to reproduce.
Instead, hyperparameter tuning should be automated; this is what is typically referred to as hyperparameter “optimization.”
During experimentation, automated tuning refers to defining the optimal hyperparameter configuration via a reproducible tuning approach. There are three steps to model fine-tuning and optimization, covered below.
The more hyperparameters are selected and the wider their ranges are defined, the more candidate combinations the search space contains.
For example, if we define batch size as an integer with six possible values in [32, 64, 128, 256, 512, 1024], and another five hyperparameters each also have six possible values, then 6^6 = 46,656 combinations exist.
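This multiplicative growth is easy to verify with the standard library; the hyperparameter names and value ranges below are illustrative placeholders, not prescriptions from the article:

```python
from itertools import product

# Six hyperparameters, each with six candidate values (illustrative ranges).
search_space = {
    "batch_size": [32, 64, 128, 256, 512, 1024],
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "units": [16, 32, 64, 128, 256, 512],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "optimizer": ["sgd", "adam", "rmsprop", "adagrad", "adadelta", "nadam"],
}

# Every element of the Cartesian product is one candidate configuration,
# i.e., one potential training trial.
combinations = list(product(*search_space.values()))
print(len(combinations))  # 6 ** 6 = 46656
```

Adding a single extra value to any one hyperparameter multiplies the total by another factor, which is why exhaustive ranges quickly become impractical.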
Selecting all hyperparameters with exhaustive ranges is often infeasible, so an educated compromise between efficiency and completeness of the search space is always made.
The most common tuning approaches are:
- Grid search, which exhaustively evaluates every combination in the search space.
- Random search, which samples combinations at random for a fixed number of trials.
- Bayesian optimization, which selects each new trial sequentially based on the results of previous ones.
Each tuning approach comes with its own set of parameters to specify, including:
- the objective metric to optimize,
- the maximum number of trials,
- the maximum number of parallel trials.
The maximum number of parallel trials can be set to a large value for tuning via independent trials, such as grid and random search; on the other hand, it should be set to a small value for sequential tuning approaches such as Bayesian optimization.
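The difference between the two independent-trial approaches can be sketched in pure Python; here `score` is a hypothetical stand-in for a full training-and-validation run, and the search space is illustrative:

```python
import random
from itertools import product

def score(config):
    """Hypothetical stand-in for training a model and returning a
    validation score; in practice each call would be a full training run."""
    lr, units = config["learning_rate"], config["units"]
    return -abs(lr - 0.01) * 100 - abs(units - 64) / 64

search_space = {
    "learning_rate": [0.1, 0.03, 0.01, 0.003, 0.001],
    "units": [16, 32, 64, 128, 256],
}

def grid_search(space):
    """Exhaustively evaluate every combination (independent trials)."""
    keys = list(space)
    configs = [dict(zip(keys, values)) for values in product(*space.values())]
    return max(configs, key=score)

def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations (independent trials)."""
    rng = random.Random(seed)
    configs = [{k: rng.choice(v) for k, v in space.items()}
               for _ in range(n_trials)]
    return max(configs, key=score)

best_grid = grid_search(search_space)                   # 25 trials
best_random = random_search(search_space, n_trials=10)  # 10 trials
print(best_grid)
```

Because every trial is independent of the others, both approaches parallelize trivially; a sequential approach like Bayesian optimization cannot, since each new trial depends on the scores of earlier ones.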
The tuning run will be a series of parallel or sequential trainings, each with a specific hyperparameter selection within the allowable ranges, as defined by the configured tuning approach.
It is fundamental to keep track of all of the runs, metadata, and artifacts collaboratively via a robust experimentation framework.
Ideally, data scientists and machine learning engineers should collaborate to define what a productionizable tuning approach is before experimentation. Sometimes, this is not the case, and the selection of the tuning approach and hyper parameters may be updated for efficiency during productionization as considerations around re-training the same model or tuning multiple models become prioritized.
During productionization, automated tuning refers to setting up tuning as part of the automated re-training pipeline, often as a conditional flow alongside standard training with the last optimal hyperparameter configuration. The default flow should be to tune at each re-training run, as the data will have changed over time.
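The conditional flow can be sketched as follows; `run_tuning` and `run_training` are hypothetical placeholders for the tuning and training jobs a real pipeline would launch on its ML platform:

```python
def run_tuning(data):
    """Placeholder: search hyperparameters, return the best configuration."""
    return {"learning_rate": 0.01, "batch_size": 64}

def run_training(data, config):
    """Placeholder: train a model with a fixed hyperparameter configuration."""
    return {"model": "trained", "config": config}

def retrain(data, last_best_config, tune=True):
    # Default flow: re-tune on every re-training run, since the data
    # distribution may have drifted since the last run.
    if tune or last_best_config is None:
        config = run_tuning(data)
    else:
        # Conditional fallback: standard training with the last
        # optimal hyperparameter configuration.
        config = last_best_config
    return run_training(data, config)

model = retrain(data=[], last_best_config=None)
print(model["config"])  # the freshly tuned configuration
```

Skipping the tuning branch (`tune=False`) trades some potential accuracy for a much cheaper re-training run, which can be a deliberate choice when many models share the pipeline.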
Many tuning solutions are available, from self-managed ones like Hyperopt and skopt to managed tools like AWS SageMaker and Google Cloud’s Vizier. These solutions focus on the experimentation phase with varying degrees of traceability and ease of collaboration.
Iguazio provides a state-of-the-art tuning solution via MLRun, which is seamlessly incorporated within a unique platform that handles both experimentation and productionization following MLOps best practices with simplicity, flexibility, and scalability.