Noise is a term first applied to digital and analog systems in signal processing. To understand what noise in machine learning is, we first need to understand this original meaning.
In signal processing, noise refers to unwanted modifications introduced to a source signal during the capture, storage, transmission, or processing of its information. These distortions are most commonly measured via the signal-to-noise ratio, a metric that compares the power of the desired signal to the power of the background noise. The higher the noise, the lower both the quality of the signal and its signal-to-noise ratio.
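As a rough illustration of the signal-to-noise ratio, the sketch below computes it in decibels as 10 * log10(P_signal / P_noise), where P is the mean squared amplitude. The 5 Hz sine wave, the noise levels, and the helper name `snr_db` are assumptions chosen for the example, not values from any particular system.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(np.square(signal))  # average power of the signal
    p_noise = np.mean(np.square(noise))    # average power of the noise
    return 10.0 * np.log10(p_signal / p_noise)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)            # hypothetical 5 Hz source signal

quiet_noise = rng.normal(0.0, 0.1, t.size)   # low-amplitude background noise
loud_noise = rng.normal(0.0, 1.0, t.size)    # high-amplitude background noise

quiet_db = snr_db(clean, quiet_noise)
loud_db = snr_db(clean, loud_noise)
print(f"SNR with quiet noise: {quiet_db:.1f} dB")
print(f"SNR with loud noise:  {loud_db:.1f} dB")
```

The louder noise yields the lower ratio, which is the relationship described above.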
In machine learning, noise similarly refers to unwanted behaviors within the data that lower its signal-to-noise ratio. Essentially, data = signal + noise.
While some of the noise in data is irreducible, much of it can be prevented or reduced by understanding its causes and correcting them. These causes are numerous and varied, which also explains why the term has so many different interpretations within the data science community.
We’ll provide an overview of the most common interpretations by deep-diving into general characteristics, causes, and remedies to noise in data science below.
We define noise as unwanted behavior within data.
Data can come in different formats in machine learning and is typically categorized as structured (e.g., CSV), semi-structured (e.g., JSON), or unstructured (e.g., JPG). We will focus here on noise that can be generally present in any data format, but it’s worth noting that it will take slightly different forms for each.
It is also important to distinguish between noise in what we can generally refer to as a feature set, i.e., an input signal to learn from, and noise in a label, i.e., a signal to predict.
Before diving into each separately, we should note that irreducible noise, or randomness, also exists. This is the natural variability inherent in complex systems. It cannot be removed by better data or better models, and it contributes to the irreducible error of any model rather than to its bias or variance.
Noise in the feature set can be caused by various factors, including incorrect collection by humans, instruments, or both. Examples of the resulting noise are missing values, outliers, and wrong or inconsistent formats. The relationship does not hold in both directions, though. While these effects can be caused by incorrect collection and thus be treated as noise, they could equally represent real behavior, most likely exceptions, and thus be treated as signal.
When performing exploratory data analysis, it is fundamental to keep this in mind and not rush to accept either of the two interpretations, as it could cause major issues in the model’s effectiveness.
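One way to keep both interpretations open during exploratory analysis is to flag suspicious values instead of deleting them. The sketch below marks missing values and outliers using the common 1.5 × IQR rule. The helper name `flag_suspects`, the threshold, and the sensor readings are illustrative assumptions.

```python
import numpy as np

def flag_suspects(values, k=1.5):
    """Return boolean masks for missing values and IQR-based outliers."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    # Quartiles computed on the observed (non-missing) values only.
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    with np.errstate(invalid="ignore"):  # comparisons against NaN are False
        outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    outlier = ~missing & outside
    return missing, outlier

# Hypothetical temperature readings with one gap and one extreme value.
readings = np.array([20.1, 19.8, np.nan, 20.4, 85.0, 20.0, 19.9])
missing, outlier = flag_suspects(readings)
print("missing at index:", np.flatnonzero(missing))
print("outlier at index:", np.flatnonzero(outlier))
```

The flags only say where to look. Whether the extreme reading is a faulty sensor (noise) or a genuine event (signal) still has to be decided with domain knowledge.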
Incorrect processing also contributes to feature set noise. This can either be too much filtering, where the real data distribution is altered, or not enough filtering, where redundant values could confuse model learning by providing irrelevant signals.
Finally, attacks are another cause of noise. These are perpetrated by ill-intentioned actors who add intentional noise to data to manipulate what the model learns and, ultimately, skew predictions in their favor.
Noise in the label can result from mislabelled examples, which are particularly common when the label’s definition is ambiguous or hard to outline.
Incomplete feature sets are an additional factor; in fact, feature sets are almost necessarily incomplete in complex systems. This is especially true for real-world applications where unknown or unavailable features cannot be modeled, e.g., stock price prediction. As a consequence, the label cannot be fully determined.
Incomplete feature sets are also a contributor to irreducible noise in the system.
Noise in data is usually, but not always, harmful.
Even though remedies exist for the varied causes of reducible noise introduced above, it is easy and common to mistake noise for signal and vice versa. Both mistakes lead to misinterpretations during data analysis and model training. When noise is treated as signal, the algorithm can start generalizing from it; when signal is treated as noise and discarded, the algorithm is missing information it needs to correctly determine the label.
Still, sometimes it is actually desirable to add small amounts of input noise—called jitter—to boost generalization and reduce overfitting in model learning. Augmenting data to mimic different real-world scenarios is typical for computer vision applications where Gaussian and non-Gaussian (e.g., salt-and-pepper) noise is commonly added to images at training time to strengthen the algorithm against different camera settings and capabilities.
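A minimal sketch of this kind of augmentation is shown below, assuming grayscale images stored as float arrays with values in [0, 1]. The function names, the noise strength `sigma`, and the corruption fraction `amount` are hypothetical choices for illustration.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the valid [0, 1] range."""
    noisy = image + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper(image, amount, rng):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    noisy = image.copy()
    u = rng.random(image.shape)
    noisy[u < amount / 2] = 0.0        # pepper: black pixels
    noisy[u > 1 - amount / 2] = 1.0    # salt: white pixels
    return noisy

rng = np.random.default_rng(0)
image = np.full((32, 32), 0.5)         # stand-in for a real training image

gaussian_version = add_gaussian_noise(image, sigma=0.05, rng=rng)
salt_pepper_version = add_salt_and_pepper(image, amount=0.1, rng=rng)
```

In practice the noise would typically be re-sampled each time an image is drawn during training, so the model rarely sees exactly the same corrupted input twice.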
Remedies exist to treat noise in data. The process of discovering and handling noise is referred to as noise detection. When performing machine learning noise detection, we can aim to either remove noise or compensate for it.
These are the most common approaches for removing noisy data.
A signal can be viewed as a function that maps a point in real space, such as time or position, to a real number. The Fourier transform approach translates the signal into the frequency domain as a combination of sines and cosines with specific amplitudes. Noise is typically spread across all frequencies, whereas the signal is often band-limited, so it can be extracted by keeping only the relevant frequencies.
Note that this approach is borrowed from signal processing.
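The frequency-domain idea can be sketched with NumPy's FFT routines: transform the signal, zero out every frequency above a cutoff, and transform back. The 5 Hz signal, the 10 Hz cutoff, and the helper name `lowpass_fft` are assumptions for the example. In a real application the relevant band has to be known or estimated.

```python
import numpy as np

def lowpass_fft(x, sample_rate, cutoff_hz):
    """Keep only frequency components at or below cutoff_hz."""
    spectrum = np.fft.rfft(x)                             # to frequency domain
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate)  # frequency of each bin
    spectrum[freqs > cutoff_hz] = 0.0                     # drop out-of-band content
    return np.fft.irfft(spectrum, n=x.size)               # back to time domain

rng = np.random.default_rng(0)
sample_rate = 1000                                   # samples per second
t = np.arange(sample_rate) / sample_rate             # one second of data
clean = np.sin(2 * np.pi * 5 * t)                    # band-limited 5 Hz signal
noisy = clean + rng.normal(0.0, 0.5, t.size)         # broadband Gaussian noise

denoised = lowpass_fft(noisy, sample_rate, cutoff_hz=10.0)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before filtering: {mse_noisy:.4f}")
print(f"MSE after filtering:  {mse_denoised:.4f}")
```

Noise that falls inside the kept band survives the filter, which is one reason this approach reduces noise but does not remove it entirely.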
Signal and noise behave differently. This approach exploits that difference by training a machine learning model to separate clean and noisy data into different groups.
An autoencoder is a two-part machine learning model that first embeds (or “encodes”) the data into a lower dimension, and then reconstructs (or “decodes”) the original data from the lower-dimensional representation. This approach can be applied to de-noising by training the model to reconstruct the clean signal from a degraded input.
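To make the encode/decode idea concrete, here is a deliberately tiny sketch: a linear autoencoder with a two-unit bottleneck, trained by plain gradient descent to map noisy inputs back to their clean versions. Real denoising autoencoders are usually nonlinear networks built with a deep learning framework. The synthetic data, layer sizes, learning rate, and step count here are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden = 400, 8, 2

# Synthetic data: clean samples lie on a 2-D subspace of an 8-D space.
latent = rng.normal(size=(n_samples, n_hidden))
mixing = rng.normal(size=(n_hidden, n_features)) / np.sqrt(n_hidden)
clean = latent @ mixing
noisy = clean + rng.normal(0.0, 0.3, clean.shape)

# Encoder and decoder weights, initialised with small random values.
W_enc = rng.normal(0.0, 0.1, (n_features, n_hidden))
W_dec = rng.normal(0.0, 0.1, (n_hidden, n_features))

learning_rate = 0.02
losses = []
for step in range(3000):
    code = noisy @ W_enc            # encode: 8-D input -> 2-D code
    recon = code @ W_dec            # decode: 2-D code -> 8-D reconstruction
    error = recon - clean           # compare against the clean target
    losses.append(np.mean(error ** 2))

    # Gradients of the mean squared error with respect to both weight matrices.
    grad_recon = 2.0 * error / error.size
    grad_dec = code.T @ grad_recon
    grad_enc = noisy.T @ (grad_recon @ W_dec.T)

    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

print(f"reconstruction MSE at start: {losses[0]:.4f}")
print(f"reconstruction MSE at end:   {losses[-1]:.4f}")
```

The key design choice is the training target: the input is the noisy sample, but the loss is measured against the clean one, so the bottleneck is pushed to keep the signal and discard the noise. This requires clean examples, or a way to synthesize noise on top of them.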
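Principal component analysis (PCA) transforms data by projecting it onto the N orthogonal directions of highest variance. Under the assumption that the signal accounts for most of the variance, these directions retain the most informative part of the data, and discarding the remaining ones reduces noise.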
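The projection can be sketched with a singular value decomposition: center the data, keep the top N right singular vectors, project onto them, and map back to the original space. The helper name `pca_denoise` and the synthetic data (a 2-D signal embedded in 10 dimensions) are assumptions for illustration.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Project X onto its top principal components and reconstruct it."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T       # coordinates in the subspace
    return projected @ components + mean        # back to the original space

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))              # true 2-D signal
mixing = rng.normal(size=(2, 10))               # embeds it in 10 dimensions
clean = latent @ mixing
noisy = clean + rng.normal(0.0, 0.1, clean.shape)

denoised = pca_denoise(noisy, n_components=2)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before PCA: {mse_noisy:.5f}")
print(f"MSE after PCA:  {mse_denoised:.5f}")
```

This works here because the signal dominates the variance by construction. If the noise has higher variance than the signal along some direction, PCA will keep the noise instead.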
The most common approaches for compensating for noisy data are the following.
Cross-validation is a resampling technique that trains and evaluates a machine learning model over multiple iterations, each on a different split of the data. Averaging the results across splits minimizes the impact of a particularly noisy subset of data and helps detect overfitting.
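A minimal k-fold sketch is shown below, using an ordinary least-squares linear model so that it only depends on NumPy. The helper name `k_fold_mse`, the choice of five folds, and the synthetic regression data are assumptions. Libraries such as scikit-learn provide ready-made equivalents.

```python
import numpy as np

def k_fold_mse(X, y, k, rng):
    """Return the held-out mean squared error of a linear model for each fold."""
    indices = rng.permutation(len(X))           # shuffle before splitting
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds (a column of ones adds an intercept).
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the fold that was held out.
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        scores.append(np.mean((A_test @ coef - y[test_idx]) ** 2))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 0.3, 200)  # noisy labels

scores = k_fold_mse(X, y, k=5, rng=rng)
print("per-fold MSE:", np.round(scores, 3))
print("mean MSE:    ", round(float(scores.mean()), 3))
```

A fold whose score is far from the others is a hint that the corresponding subset may be unusually noisy and worth inspecting.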
Ensemble learning is a technique that obtains better predictive performance by combining the predictions of multiple machine learning models. This minimizes the impact of a particularly noisy model.
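One simple form of ensembling is bagging: fit several models on bootstrap resamples of the training data and average their predictions. The sketch below uses polynomial fits as the base models. The helper name `bagged_predictions`, the polynomial degree, the number of models, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def bagged_predictions(x_train, y_train, x_test, n_models, degree, rng):
    """Fit one polynomial per bootstrap resample; return all test predictions."""
    preds = []
    for _ in range(n_models):
        # Bootstrap: sample training points with replacement.
        idx = rng.integers(0, len(x_train), size=len(x_train))
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.array(preds)                      # shape: (n_models, n_test)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 80)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.3, x_train.size)
x_test = np.linspace(-0.9, 0.9, 50)
y_true = np.sin(3.0 * x_test)

member_preds = bagged_predictions(x_train, y_train, x_test,
                                  n_models=25, degree=5, rng=rng)
ensemble_pred = member_preds.mean(axis=0)       # average over the models

member_mse = np.mean((member_preds - y_true) ** 2, axis=1)
ensemble_mse = np.mean((ensemble_pred - y_true) ** 2)
print(f"average single-model MSE: {member_mse.mean():.4f}")
print(f"ensemble MSE:             {ensemble_mse:.4f}")
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual models, so one unusually noisy member is diluted by the rest.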
It’s important to note that none of the approaches discussed above should be expected to perfectly treat noise. If possible, it is worth considering collecting more or new data to boost the signal-to-noise ratio at the source.
“Garbage in, garbage out” is a well-known saying in data science. The earlier in the pipeline noise is fixed, the better the outcome will be.