
What is Cross-Validation?

Reliably evaluating machine learning models offline before pushing them into production is fundamental for the success of any data science initiative.

A robust offline evaluation phase ensures that the production model will offer value and quality to end users without unexpected behaviors and bias. Offline evaluation also provides the basis for making educated forecasts on the expected return on investment (ROI).

The most common failure point when performing offline evaluation is testing the model on historical data that is not representative of the online data the model will see at serving time. Cross-validation is a technique that aims to minimize this risk while also providing a simple and accurate process for evaluating model performance. This is especially valuable when using small datasets.

This article presents an introduction to cross-validation, an overview of its benefits, and a walkthrough of when and how to use this technique in machine learning.

What Is Cross-Validation?

Cross-validation is a statistical technique used by data scientists for training and evaluating a machine learning model, with a focus on ensuring reliable model performance.

To understand how cross-validation supports model development, we first need to understand how data is used to train and evaluate models.

Splitting the Data Set

After performing data processing and feature engineering, it is typical to divide the available historical data set into three different splits: 

  • The training data set is the set of data samples from which the model learns.
  • The validation data set is the set of data samples that are used to evaluate the model’s performance as training progresses and, optionally, when performing hyperparameter fine-tuning.
  • The test data set—also called the holdout data set—is the set of data samples used to evaluate the trained model after training is completed.

Each data sample should appear in only one of the sets to avoid information leakage, and each data set should be large enough to be representative of the underlying data. Most commonly, a 70/10/20 split (training/validation/test) is performed for large datasets, with the majority of data allocated to the training set.
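As a minimal sketch of the 70/10/20 split described above, the following uses only the Python standard library. The function name `three_way_split` and its parameters are illustrative assumptions, not part of any library; in practice a library helper would typically be used instead.

```python
import random

def three_way_split(samples, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (20% here)
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Because the three lists are slices of a single shuffled copy, no sample can land in more than one set, which is the leakage guarantee the text asks for.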

The Holdout Approach

When not performing tuning, practitioners often prefer not to use a validation set so that they can use all available data for training and evaluation. In this approach, called the holdout approach, only training and test sets are used with data samples split randomly between the two.

The Cross-Validation Process

The cross-validation process builds upon the standard data splitting approaches outlined above. It involves feeding data for the training and evaluation of the machine learning algorithm using a new two-step approach:

  1. Split the data into smaller subsets, called folds, with some folds used for training and some for evaluation.
  2. Rotate the combination of folds used for training and evaluation so that either a chosen subset of the possible training/evaluation splits (“non-exhaustive cross-validation”) or every possible split (“exhaustive cross-validation”) is run.

The results are then averaged across these multiple training and evaluation runs to determine the model’s overall performance.
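The split, rotate, and average steps can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_squared_error`) are hypothetical, and the "model" is a deliberately trivial one that predicts the mean of the training targets.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score):
    """Use each fold for evaluation once; return the per-fold scores."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, eval_idx in enumerate(folds):
        # Train on every fold except the one held out for evaluation.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in eval_idx], [ys[j] for j in eval_idx]))
    return scores

# Toy stand-ins for a real learner: predict the training-set mean of y.
def fit_mean(train_x, train_y):
    return mean(train_y)

def mean_squared_error(model, eval_x, eval_y):
    return mean((y - model) ** 2 for y in eval_y)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_squared_error)
print("per-fold MSE:", fold_scores)
print("averaged MSE:", mean(fold_scores))
```

Passing `fit` and `score` as callables keeps the rotation logic independent of any particular model or metric.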

Thus, cross-validation reduces the risk that is inherent in the holdout approach of having a particularly lucky or unlucky split of data for training and testing. This is less likely in cross-validation since it provides more than one data split.

Also Known As…

When referring to the cross-validation technique, it is common to also use the terms out-of-sample testing or rotation estimation, both of which allude to its underlying process.

While some also use k-fold cross-validation as a synonym, this is actually a specific implementation of the cross-validation method where k is equal to the number of folds.


Why Is Cross-Validation Important?

Many data scientists would agree that cross-validation is one of the most helpful techniques to use when developing machine learning models, as it gives a more reliable estimate of a model’s accuracy than a single data split does. Cross-validation has the additional benefit of being easy to understand and implement.

The advantages of using cross-validation in machine learning stem from the fact that the technique provides multiple training and test data splits. This allows data scientists to:

  • Detect and guard against overfitting and underfitting, i.e., respectively learning too much or not enough from the training set, either of which leads to poor generalization on the test set
  • Better evaluate how well a model generalizes to different data
  • Obtain less biased performance estimates and model choices, although cross-validation does not remove bias that is present in the dataset itself

These benefits are particularly relevant for small datasets as all available data is used both for training and evaluation.

Perhaps the most important downside worth noting is that this technique is necessarily expensive in both time and compute resources, because it runs training and evaluation multiple times.

When Is Cross-Validation an Appropriate Approach?

While cross-validation in machine learning can be applied to a variety of learning tasks, it is most commonly used for supervised learning.

As long as sufficient time and compute resources are available (which is often the case when doing cross-validation for small datasets) and data can be randomly split into folds, you can use this technique when prototyping any machine learning model to:

  • Evaluate a model’s performance.
  • Estimate the generalization performance of a process for building a model.
  • Compare the performance of multiple models, and determine the most suitable for the use case.

Also, specific implementations of cross-validation techniques exist in order to:

  • Perform feature selection via cross-validated recursive feature elimination. In this instance, 0 to N-1 features are removed from the dataset using recursive feature elimination, and the best feature subset is selected based on the cross-validation score.
  • Perform hyperparameter tuning via nested cross-validation, where an inner cross-validation loop selects hyperparameters on validation folds and an outer loop evaluates the tuned model on held-out test folds, thus resulting in two cross-validation steps.
  • Support use cases where an order must be preserved in the dataset, such as with time-series data, via rolling cross-validation. Here, the data splitting is performed such that every training fold precedes its test fold in time, so the model is never evaluated on data older than the data it was trained on.
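To illustrate the rolling cross-validation case, here is a small sketch of one common variant, an expanding training window followed by a fixed-size test window. The function `rolling_splits` and its parameters are hypothetical names chosen for this example; scikit-learn offers a comparable splitter in `TimeSeriesSplit`.

```python
def rolling_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) where training always precedes testing in time."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))  # everything observed so far
        test_idx = list(range(test_start, test_start + test_size))  # the next window
        yield train_idx, test_idx

# Samples are assumed to be already sorted chronologically.
for train_idx, test_idx in rolling_splits(n_samples=10, n_splits=3, test_size=2):
    print(train_idx, "->", test_idx)
```

Unlike k-fold, no shuffling takes place: the indices are assumed to follow chronological order, and each training window ends exactly where its test window begins.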

How to Perform Cross-Validation

A large variety of machine learning validation techniques and implementations are available to allow practitioners to begin performing cross-validation immediately.

Cross-validation techniques can be divided into:

  • Non-exhaustive cross-validation techniques, which evaluate only a subset of the possible ways to divide the original data set into training and test sets. The most common are:
    • K-fold cross-validation: Data is randomly split into k folds; each fold is used once for testing while the other k-1 folds are used for training, for a total of k iterations.
    • Stratified k-fold cross-validation: Data is randomly split into k folds so that all folds preserve the overall label distribution; each fold is used once for testing while the other k-1 folds are used for training, for a total of k iterations.
  • Exhaustive cross-validation techniques, which train and evaluate on every possible way to divide the original data into training and test sets of the chosen sizes. The most common are:
    • Leave-1-out cross-validation: N-1 data samples are used for training and 1 for testing, repeated so that each of the N samples is the test sample once.
    • Leave-p-out cross-validation: N-p data samples are used for training and p for testing, repeated for every possible choice of p test samples.

Exhaustive approaches should be used for small datasets only, as it can be extremely expensive to run model training and evaluation for every possible split of the data.
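To make the cost concrete, the sketch below enumerates leave-p-out splits with the standard library and counts how many training runs they require. The function name `leave_p_out_splits` is an illustrative assumption; the count itself is the binomial coefficient "N choose p".

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield every (train_indices, test_indices) pair with exactly p test samples."""
    all_idx = set(range(n_samples))
    for test_idx in combinations(range(n_samples), p):
        train_idx = sorted(all_idx - set(test_idx))
        yield train_idx, list(test_idx)

splits = list(leave_p_out_splits(n_samples=5, p=2))
print(len(splits))  # 10 splits, i.e. 10 separate training runs

# Number of training runs needed as the dataset grows (p = 2).
for n in (10, 100, 1000):
    print(n, comb(n, 2))  # 45, 4950, 499500
```

With p = 1 the same function yields exactly N splits (leave-1-out), so the cost grows linearly rather than combinatorially, but it still means one training run per sample.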

The most commonly used cross-validation implementations for these techniques (and more besides) are provided by the scikit-learn (sklearn) library as well as popular frameworks such as xgboost and lightgbm; most data science libraries have built-in capabilities for model validation.
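As an example of what this looks like with scikit-learn, the following sketch runs stratified 5-fold cross-validation. The dataset (Iris) and the model (logistic regression) are arbitrary choices made for illustration and are not prescribed by this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified 5-fold: each fold keeps roughly the same class balance as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Swapping `StratifiedKFold` for `KFold`, `LeaveOneOut`, or `TimeSeriesSplit` (all in `sklearn.model_selection`) changes the splitting strategy without touching the rest of the code.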

Track Your Cross-Validation Experiments with Iguazio

When implementing and running cross-validation as well as more generally prototyping machine learning models, it is fundamental to have processes set up for experiment tracking in order to avoid repetition, reduce human error, and provide full traceability.

MLRun is a powerful tool for experiment tracking with any framework, library, and use case—including cross-validation—which can be deployed as part of Iguazio’s MLOps platform for end-to-end data science.