Artificial intelligence and machine learning models are incredibly powerful and versatile tools, but they rely heavily on the quality of their input data. Incorrect, irrelevant, or messy data results in suboptimal models that don’t deliver the return on investment that they otherwise would.
Part of cleaning up input data is outlier detection, the removal of data points that differ immensely from the others. Outlier detection is an essential step in developing machine learning models, but what are statistical outliers and what can AI developers do to minimize their impact?
What Are Outliers?
An outlier is a data point that strays significantly from the others. These extreme values lie far from the average. For instance, in a bell curve distribution, the outliers are the points at the far left and right tails of the curve.
Anomalies like these can be the result of either measurement errors or genuine divergences from the norm. Either way, they are problematic for statistical analysis and machine learning development, since they skew the final results. Identifying and resolving outliers is therefore a mandatory step.
What Causes Outliers?
Outliers can arise from various sources. Some of the most common examples include:
- Human error, such as data entry or measurement mistakes
- Sampling errors, including data extracted from an incorrect source
- Unexpected variable distribution
- Genuine deviations in the data, which don’t actually constitute errors but still indicate anomalies
Sometimes companies even insert dummy outliers into the input data to test their outlier detection methods.
What Are the Types of Outliers?
Outliers can be either univariate or multivariate. The former involves data points with an extreme value on a single variable; the latter involves data points whose combination of values across multiple variables is anomalous, even when each individual value looks normal.
Another way to categorize outliers follows:
- Point Outliers – Individual outlying data points.
- Contextual Outliers – Data points that are only anomalous in a specific context. For example, a speech recognition algorithm would consider background noise a contextual outlier.
- Collective Outliers – A group of data points that deviates as a whole and could point toward a new phenomenon. Collective outliers are always worth investigating to tweak the final model.
It’s worth noting that not all outliers are the result of errors. Genuine data points that deviate from the norm are known as novelties, and data scientists study them carefully to improve accuracy and decision-making.
What Is Outlier Detection?
As its name suggests, outlier detection is the practice of detecting and removing outliers from input data to improve analytics and algorithm development. It directly contributes to data cleaning, which aims to make data higher quality, more reliable, and more accurate.
Why Should You Spend Time Identifying Outliers?
Companies working with AI, ML, and data science in general must analyze outliers so that they don’t hamper model building. The more input data you collect, the more anomalies show up in proportion.
For instance, businesses in the financial sector might build models to decide whether to give a client additional credit. Failing to account for outliers can result in giving credit to a high-risk individual.
An undetected outlier can also botch a business decision, such as investing in a project with little potential for a return on investment.
What Are the Typical Challenges of Outlier Detection?
Creating a reliable outlier detector is no trivial task, especially for businesses with large amounts of input data. For example:
- It can be difficult to distinguish outliers from valid data.
- The nature of outlying data can change over time, so models that worked previously may no longer be applicable in the future. Constant reassessment is necessary when building new models.
- Understanding where anomalies come from sometimes requires expertise in the field from which the input data originates. Explaining the outlying data points in a housing market chart, for instance, requires knowledge of how real estate works in that particular area.
Data scientists must be diligent to resolve outliers without disturbing genuine data and novelties.
What Are the Popular Outlier Detection Methods?
Many effective approaches and techniques exist for identifying outliers and categorizing them as either errors or novelties.
Interquartile Range
One of the easiest approaches is finding numeric outliers with the interquartile range. Given a dataset that you can illustrate in a one-dimensional graph, sort the data and split it into four equal parts at the three quartile points.
In statistics, the interquartile range (IQR) is the difference between the first and third quartiles, which sit at the 25% and 75% markers, respectively. Set range limits, commonly 1.5 × IQR below the first quartile and above the third, and remove any data point that lands outside of them.
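To make the rule concrete, here’s a minimal sketch in Python using NumPy, assuming the conventional 1.5 × IQR fences as the range limits:

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Flag points outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (data < lower) | (data > upper)

data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 48])  # 48 strays far from the rest
print(data[iqr_outliers(data)])  # -> [48]
```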
Standard Deviations
Another valid approach is using standard deviations. Data scientists use heuristics to determine an acceptable threshold for how far data points may stray from the average, commonly two or three standard deviations. Any points within this threshold are acceptable, while the ones beyond it are outliers.
This approach, while versatile, requires parametric data that follows a normal bell curve distribution. It also becomes less accurate with significantly large datasets.
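As a rough sketch of the idea, assuming a three-standard-deviation threshold (a common heuristic rather than a fixed rule):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.0, size=500)
data[0] = 25.0  # inject an artificial extreme value

# Flag anything more than three standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
print(data[np.abs(z_scores) > 3.0])  # includes 25.0 and any natural 3-sigma excursions
```

One caveat worth knowing: extreme values inflate the standard deviation itself, which can mask outliers in small samples. That is one reason the IQR fences above are often preferred.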
DBSCAN
Standing for Density-Based Spatial Clustering of Applications with Noise, DBSCAN is an outlier detection technique that groups data points into related clusters by density, making it easy to illustrate where the data is dense and where it is sparse. Each point falls into one of three categories:
- Core – Points that sit in dense regions and form the main body of a cluster.
- Border – Points that diverge slightly from the primary trend but have enough nearby density to be valid.
- Outliers – Points that fail to join any group and are not worth considering in the final model.
DBSCAN works for data in three or more dimensions and visualizes the input intuitively. However, data scientists often must scale the data to make it compatible with this approach, and choosing the right parameters is another task entirely.
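Here’s a minimal sketch using scikit-learn’s DBSCAN implementation; the eps and min_samples values are illustrative and would need tuning for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two dense clusters plus a few scattered points
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
scattered = rng.uniform(low=-2.0, high=8.0, size=(10, 2))
X = np.vstack([cluster_a, cluster_b, scattered])

# Scale first: eps is a raw distance, so features on different scales distort it
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
print("outliers:\n", X[labels == -1])  # DBSCAN labels noise points as -1
```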
Isolation Forest
The isolation forest technique uses randomly generated binary decision trees to calculate a score between 0 and 1 for each data point. Scores close to 0 indicate valid points, while scores approaching 1 heavily suggest outliers.
Isolation forest can be a time-consuming procedure. However, this technique also involves few parameters, so it’s easy to optimize and effective when you don’t know the distribution of the input data.
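A minimal sketch using scikit-learn’s IsolationForest; note that scikit-learn’s fit_predict returns -1/1 labels rather than the raw 0-to-1 scores described above, and the contamination value here is an illustrative guess at the outlier fraction:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X[:5] = rng.uniform(low=6.0, high=8.0, size=(5, 2))  # inject five extreme points

# contamination: the expected fraction of outliers in the data (a guess here)
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier

print("flagged", (labels == -1).sum(), "points")  # the injected extremes should be among them
```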