What Is the Holdout Dataset in Machine Learning?

Data is at the heart of machine learning (ML), because data is used both to train ML models and to test their performance. Evaluating machine learning models comprehensively and reliably on test data before they go to production is fundamental to ensuring that they provide value to the user by performing as intended, i.e., without unexpected behaviors. This process, called “model evaluation,” should test the model’s performance on both technical and business metrics relevant to the specific use case.

When labels are present (i.e., in supervised learning), the simplest and most commonly used approach for model evaluation is a holdout dataset. This is a specific type of test dataset that is unseen during training and used solely to gather the final model evaluation before production.

This article defines what the holdout set is, discusses its desirable characteristics, walks through the evaluation it can provide, and finally discusses when to use the holdout dataset.

What Is the Holdout Dataset?

The first step of any ML modeling is to define a dataset with predictive features for the use case. This dataset is divided into a training set, a test set, and a holdout set. The test set and holdout set are used for model evaluation. Keeping the test datasets separate from the training phase is fundamental in order to ensure that evaluation metrics reliably represent performance on new data when in production.

Specifically, the test set is used to evaluate the model’s performance during experimentation and when comparing across multiple models. The holdout set is reserved solely for the final model evaluation before moving to production. This “holdout method” minimizes evaluation bias, since the final metrics come from data that never influenced training or model selection. In practice, the test set and holdout set often coincide, mainly due to limited data availability.

Often, a validation set is also created, specifically for model evaluation during training or for tuning the model’s hyperparameters.
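For illustration, here is a minimal sketch of creating these splits with scikit-learn; the file name and the 70/15/15 ratio are assumptions for illustration, not a universal recommendation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed example dataset: a tabular file with predictive features and a label column.
df = pd.read_csv("data.csv")

# First split off 30% of the rows for evaluation, then divide that portion equally
# into a test set (used during experimentation) and a holdout set (reserved for
# the final evaluation before production), giving a 70/15/15 split overall.
train_df, eval_df = train_test_split(df, test_size=0.30, random_state=42)
test_df, holdout_df = train_test_split(eval_df, test_size=0.50, random_state=42)

print(len(train_df), len(test_df), len(holdout_df))
```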

What Are the Characteristics of a Good Holdout Set?

A good holdout dataset should be curated to ensure that it provides a reliable representation of the variety of data that the model will see in production, and generates a fair evaluation of the trained model.

To evaluate the quality of a holdout set, data analysis can be performed in order to check for the following desirable characteristics:

  • Uniformity: The holdout set has the same distribution as the other dataset splits and real-world data.
  • Uniqueness: The holdout set contains entirely unique data, i.e., no row of data is present more than once in the holdout set or in any other data splits.
  • Timeliness: The holdout set does not overlap logically with the other splits, i.e., there is no data leakage. This is particularly relevant when a random split is not possible, such as for time-series data.
  • Comprehensiveness: The holdout set encompasses all common and edge cases, and no irrelevant data is present.
  • Cleanliness: The holdout set’s features and labels are correct.
  • Validity: The holdout set is composed of features that are fully available at prediction time.
  • PII compliance: The holdout set does not contain data that can be used to identify a specific person.

It is recommended to perform the relevant checks from this list on every dataset split.
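A few of these checks can be automated. The sketch below assumes pandas DataFrames named train_df and holdout_df (as in the split example above) and a label column called “label”:

```python
import pandas as pd

def check_holdout_quality(train_df: pd.DataFrame, holdout_df: pd.DataFrame, label_col: str = "label") -> None:
    # Uniqueness: no row appears more than once inside the holdout set.
    assert not holdout_df.duplicated().any(), "duplicate rows in holdout set"

    # No overlap with the training split (a simple duplicate-based leakage check).
    overlap = pd.merge(train_df, holdout_df, how="inner")
    assert overlap.empty, f"{len(overlap)} rows appear in both train and holdout"

    # Uniformity: print label distributions side by side for manual review.
    print("train labels:\n", train_df[label_col].value_counts(normalize=True))
    print("holdout labels:\n", holdout_df[label_col].value_counts(normalize=True))
```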

Besides quality, the other dimension that defines a “good” holdout set is quantity. Unfortunately, there is no single standard that fits all use cases with respect to data size. The dataset split ratio between training, validation, and test sets depends on the total size of the available data and the chosen algorithm.

Ultimately, the challenge is to ensure that the test set is of statistically significant size so that it can safely be assumed that the derived evaluation is representative of wider behavior.

What Model Evaluation Can Be Performed with the Holdout Method?

The evaluation performed on the holdout set is the same as the evaluation performed on any test set. The algorithm or, more generally, its type (classification vs. regression) determines which types of evaluation are available and which are most valuable for the use case.

Multiple evaluation metrics are available and typically used in supervised learning, such as accuracy for classification or mean squared error for regression. Business metrics should also be computed on top of technical metrics to provide insights around the direct value of the model for the specific use case, such as increased revenue or saved costs.
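As an example, the sketch below computes a technical metric with scikit-learn alongside an illustrative business metric; the per-prediction dollar values are purely assumed and would need to be replaced with figures from the actual use case:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_holdout(y_true, y_pred, value_per_hit=50.0, cost_per_miss=200.0):
    # Technical metric: plain classification accuracy on the holdout set.
    accuracy = accuracy_score(y_true, y_pred)

    # Illustrative business metric for a binary classifier (values are assumptions):
    # each true positive saves value_per_hit, each false negative costs cost_per_miss.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    business_value = tp * value_per_hit - fn * cost_per_miss

    return {"accuracy": accuracy, "business_value": business_value}
```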

Technical and business metrics should be computed on the complete holdout set, as well as on relevant data slices, to provide a comprehensive understanding of the model’s performance. Often, it is recommended to define a templated metrics report to share with multiple business stakeholders.
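A slice-level report can be sketched in a few lines; the slicing column “region” and the other column names are assumptions for illustration:

```python
import pandas as pd

def slice_report(holdout_df: pd.DataFrame, y_pred, slice_col="region", label_col="label") -> pd.Series:
    # Attach predictions, then compute accuracy per slice to spot segments where
    # the model underperforms relative to the overall holdout score.
    df = holdout_df.copy()
    df["prediction"] = y_pred
    df["correct"] = df[label_col] == df["prediction"]
    return df.groupby(slice_col)["correct"].mean().rename("accuracy")
```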

When to Use the Holdout Method

The holdout method is a special implementation of cross-validation; in fact, its complete name is “holdout cross-validation.” While the holdout method creates a single train-test split, general cross-validation creates multiple train-test splits to train and evaluate the model, averaging the results across validation rounds. This provides extra robustness and accuracy: the chances of landing on a particularly good or a particularly bad split are minimized, and the averaged metrics converge towards a more general estimate of performance.

Still, there are good reasons to select holdout validation over general cross-validation, such as when the dataset is very large, in the initial steps of model exploration, or when limited time or resources are available.
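As a sketch of both options with scikit-learn, assuming a feature matrix X and labels y are already loaded and using a placeholder model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# X and y are assumed to be preloaded features and labels.
model = LogisticRegression(max_iter=1000)

# Holdout validation: one train/test split and one score. Fast, but sensitive to
# which rows happen to land in the evaluation split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: five splits, scores averaged. More robust, but roughly
# five times the training cost.
cv_scores = cross_val_score(model, X, y, cv=5)
print(holdout_score, cv_scores.mean())
```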

When running data science workloads, it is fundamental to use experiment tracking tools such as MLRun to track datasets and other relevant model metadata for reproducibility and efficiency. Deploy it today as part of Iguazio’s MLOps platform for end-to-end data science.