A machine learning (ML) model is an algorithm trained on a data set to perform a specific predictive task. Model evaluation aims to define how well the model performs its task.
The model’s performance can vary both across use cases and within a single use case, e.g., with different algorithm parameters or data selections. Accordingly, we need to evaluate the model’s accuracy at each training run. Typically, more than one model is experimented with and productionized for the same use case, which means we also need to compare multiple models’ performance during evaluation.
Finally, data and concept drifts over time mean that models should be further evaluated on a regular basis when in production. The offline estimate and online measurement of the models’ performance determine whether a model should be in production, and its return on investment (ROI).
This article defines model evaluation, discusses its importance, and introduces best practices on how to report on and perform the evaluation.
Model evaluation in machine learning is the process of determining a model’s performance via a metrics-driven analysis. It can be performed in two ways: offline, during experimentation on historical data, and online, on live data once the model is in production.
The metrics selection for the analysis varies depending on the data, algorithm, and use case.
For supervised learning, the metrics are categorized with respect to classification and regression. Classification metrics are based on the confusion matrix, such as accuracy, precision, recall, and F1-score; regression metrics are based on errors, such as mean absolute error (MAE) and root mean squared error (RMSE).
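To make both metric families concrete, here is a minimal, library-free sketch (the toy labels and values are invented for illustration) computing the confusion-matrix-based classification metrics alongside MAE and RMSE:

```python
import math

# Toy binary-classification labels (illustrative values only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Toy regression targets (illustrative values only)
r_true = [3.0, 5.0, 2.0]
r_pred = [2.5, 5.0, 3.0]

mae = sum(abs(t - p) for t, p in zip(r_true, r_pred)) / len(r_true)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(r_true, r_pred)) / len(r_true))
```

In practice these come from a metrics library such as scikit-learn; the hand-rolled versions are shown only to make the definitions explicit.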
For unsupervised learning, the metrics aim to define the cohesion, separation, confidence, and error in the output. For example, the silhouette measure is used for clustering in order to measure how similar a data point is to its own cluster relative to its similarity to other clusters.
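As a sketch of the silhouette idea, here is a hand-rolled version on a toy 1-D clustering (the cluster values are invented for illustration; libraries such as scikit-learn provide this out of the box):

```python
# Toy 1-D clustering: two tight, well-separated clusters (illustrative values only)
clusters = {"A": [1.0, 1.2, 0.8], "B": [8.0, 8.4]}

def silhouette(point, own, others):
    """Silhouette coefficient s = (b - a) / max(a, b) for a single point."""
    # a: mean distance to the other members of the point's own cluster (cohesion)
    a = sum(abs(point - q) for q in own if q != point) / (len(own) - 1)
    # b: mean distance to the nearest other cluster (separation)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in others.values())
    return (b - a) / max(a, b)

scores = []
for name, members in clusters.items():
    others = {n: m for n, m in clusters.items() if n != name}
    for p in members:
        scores.append(silhouette(p, members, others))

# Scores lie in [-1, 1]; values near 1 indicate tight, well-separated clusters
mean_silhouette = sum(scores) / len(scores)
```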
For both learning approaches, and especially for the latter, model evaluation metrics are extended during experimentation with visualizations and manual analysis of (groups of) data points. Domain experts are often required to support this evaluation.
Beyond technical metrics and analysis, business metrics such as incremental revenue and reduced costs should also be defined and reported. This allows an understanding of the impact of putting the model into production.
ML model evaluation ensures that production models’ performance is:
An incorrect or incomplete model evaluation can be disastrous for both user experience and a business’ income, especially for critical applications with real-time inference. Internally, good model evaluation ensures that all stakeholders are aware of the potential of the use case and supportive of it, thus streamlining development and management processes.
The AI community has yet to agree upon a templated format for reporting on ML model evaluation. This is mainly due to the high variability of designs of use cases in machine learning.
Typically, it is recommended to create a one-page report; this format ensures a balance of conciseness and comprehensiveness. The report should clearly explain the model’s performance via technical and business metrics, and summarize relevant technical model aspects with a focus on engineering requirements. Ideally, the underlying template of the report can be re-used across all ML use cases for the team. The model evaluation report should be created and shared before putting the model(s) into production. This process should continue on a regular basis during the production phase via continuous evaluation.
Model evaluation is performed both during experimentation and in production. Below, we outline general model evaluation methods for these two phases separately, with the acknowledgment that nuances will exist across different use cases.
During experimentation, a test set and a holdout set are removed from the historical dataset to enable model evaluation. These sets—which often coincide—contain data unseen at training time.
The test set is used to evaluate the performance of models during training. Technical metrics, such as accuracy or RMSE, are employed in order to compare the same model across different training runs, as well as to compare different models to one another. This phase of model evaluation is typically manual and iterative, often involving visualizations. Thus, it is essential to keep track of metrics and artifacts for all experiments. This is most easily and efficiently achieved via an experiment-tracking tool such as MLRun.
The holdout set is used to comprehensively evaluate the performance of the chosen model(s) before production. This analysis, which should be reproducible and recorded in a report, measures technical and business metrics both overall and by relevant data slices, i.e., subsets of data defined by feature columns. Data slicing aims to discover systematic mistakes and improve both fairness and performance.
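A minimal sketch of slice-based evaluation (the field names and records are invented for illustration): group holdout records by one feature column and compute the metric per slice as well as overall:

```python
# Toy holdout records: one feature column ("region") plus labels and predictions
# (all values invented for illustration)
records = [
    {"region": "north", "y_true": 1, "y_pred": 1},
    {"region": "north", "y_true": 0, "y_pred": 0},
    {"region": "north", "y_true": 1, "y_pred": 1},
    {"region": "south", "y_true": 1, "y_pred": 0},
    {"region": "south", "y_true": 0, "y_pred": 1},
    {"region": "south", "y_true": 1, "y_pred": 1},
]

def accuracy_by_slice(rows, column):
    """Overall accuracy plus per-slice accuracy for one feature column."""
    slices = {}
    for r in rows:
        slices.setdefault(r[column], []).append(r["y_true"] == r["y_pred"])
    overall = sum(r["y_true"] == r["y_pred"] for r in rows) / len(rows)
    per_slice = {k: sum(v) / len(v) for k, v in slices.items()}
    return overall, per_slice

overall, per_slice = accuracy_by_slice(records, "region")
# A large gap between slices flags a systematic mistake worth investigating.
```

Here the overall accuracy looks acceptable while one slice performs much worse, which is exactly the kind of systematic error that slicing is meant to surface.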
Continuous model evaluation is an automated routine in model monitoring that takes recent production data and model predictions—with ground-truth labels, if and when available—and evaluates the ML model performance in the same way as offline evaluation is performed.
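One way such a routine might look (a hand-rolled sketch; the window contents, threshold, and handling of missing labels are illustrative assumptions, and monitoring platforms provide this as a managed capability):

```python
def evaluate_window(predictions, labels, threshold=0.8):
    """Evaluate one window of recent production traffic and flag degradation."""
    # Score only predictions whose ground-truth label has already arrived
    scored = [(p, y) for p, y in zip(predictions, labels) if y is not None]
    if not scored:
        return None  # no labels yet; fall back to proxy-based evaluation
    accuracy = sum(p == y for p, y in scored) / len(scored)
    return {"accuracy": accuracy, "degraded": accuracy < threshold}

# Toy window: some labels not yet available (illustrative values only)
preds = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, None, 0, 1]
report = evaluate_window(preds, labels)
```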
When the feedback loop is particularly long or entirely unavailable, an estimated evaluation can be performed via proxies such as label distributions and user feedback. This evaluation relies on a production system with comprehensive model monitoring and dashboarding, such as that offered by Iguazio, which should aim to be consistent across multiple use cases.
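As one concrete proxy, here is a sketch of the population stability index (PSI) applied to the predicted-label distribution; the bin proportions and the commonly cited 0.1/0.25 interpretation thresholds are illustrative conventions, not from this article:

```python
import math

def psi(expected, actual):
    """Population stability index between two distributions over the same bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Predicted-label proportions at training time vs. a recent production window
# (illustrative values only)
train_dist = [0.7, 0.3]
prod_dist = [0.5, 0.5]

shift = psi(train_dist, prod_dist)
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```

A rising PSI on the label distribution does not prove the model is wrong, but it is a cheap early-warning signal when ground-truth labels are delayed or unavailable.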