In a perfect world, every data set would contain everything required to train a machine-learning model. Unfortunately, we don’t live in a perfect world. Actual data sets might contain extensive missing or under-represented values.
At best, this can severely impact a model’s quality and efficiency. At worst, it can introduce such considerable bias that the model’s output no longer accurately represents reality. This is especially true if you’re using an algorithm that assumes that all values within a data set are both numerical and relevant.
Data imputation seeks to address this by filling in those missing values.
What is Data Imputation?
Data imputation is an umbrella term for a school of data processing techniques in which missing values in a data set are replaced with non-missing values. Imputation represents a far more practical alternative to ignoring or discarding incomplete or missing samples and one that is significantly less likely to impair a model’s analytical capabilities. It also allows data scientists to work with a broader selection of tools and data sets.
Machine-learning models are often incapable of handling missing data on their own. Because many real-world data sets contain missing entries, this imposes considerable limitations on the training process. For example, without imputation, most model implementations in popular Python machine-learning libraries will simply refuse to train on data that contains missing values.
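As a minimal sketch (assuming the data lives in a pandas DataFrame and the model comes from scikit-learn, neither of which the original data set necessarily uses), this is how a missing value typically surfaces and why it blocks training:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# A toy data set with one missing feature value (NaN is the usual marker).
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
print(df.isna().sum())  # count the missing entries per column

# Most scikit-learn estimators reject NaNs outright, so training fails
# until the gap is imputed or the row is dropped.
try:
    LinearRegression().fit(df[["x"]], df["y"])
except ValueError as err:
    print(err)
```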
What Kind of Missing Data Does Imputation Address?
Imputation mitigates all three major categories of missing data:
- Missing Completely At Random (MCAR): The probability that a value is missing is unrelated to any variable in the data set, observed or not. This is exceptionally rare, as there is almost always some explanation for missing data.
- Missing At Random (MAR): Although the missing data may at first appear randomly distributed, its presence can be accounted for by other, observed variables in the data set.
- Missing Not At Random (MNAR): The missingness itself carries information, because it depends on the unobserved value or on factors directly related to the data set. For instance, a subject that lacks a ‘date of death’ value in a population census is most likely still alive.
Comparing the Most Common Data Imputation Techniques
Not all data imputation methods are created equal. Rather than improving representativeness and diversity, a data imputation technique that’s too simplistic can actually introduce more bias and distortion into the data set. Exercise caution when applying any imputation technique, and only apply data imputation to data sets that have been fully pre-processed.
The imputation techniques below may be divided into two broad categories:
- Single-value imputation, which looks at each missing value as its own entity and replaces it with a single other value.
- Multiple imputation, which generates several plausible replacement values for each missing entry (and therefore several completed data sets), typically through some form of regression technique, and then pools the results.
Next or Previous Value
When performing data imputation for a time series or similarly ordered data set, it is often reasonable to assume that the values closest to a missing value are at least comparable to it. As such, Next or Previous Value fills in missing samples by substituting the value that occurs either immediately before or immediately after the gap.
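A minimal sketch with pandas, assuming the series is already sorted in time order:

```python
import pandas as pd

# An ordered series with gaps; the index is assumed to reflect time order.
s = pd.Series([10.0, None, None, 13.0, None, 15.0])

previous_value = s.ffill()  # 10, 10, 10, 13, 13, 15
next_value = s.bfill()      # 10, 13, 13, 13, 15, 15
```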
K Nearest Neighbors
Similar to next or previous value imputation, K Nearest Neighbors draws on a configurable number of the most similar nearby records. Rather than copying a single neighbor, it replaces the missing value with an aggregate of the neighbors’ values: typically the mean for numerical data, or the value that occurs most frequently in the group for categorical data.
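As a sketch of the numerical variant, scikit-learn’s KNNImputer fills each gap with the mean of the corresponding feature across the k most similar complete rows (k = 2 in this toy example):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# The missing entry is replaced by the mean of the second feature
# across the two rows most similar to the incomplete one.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```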
Minimum or Maximum Value
Minimum or Maximum value looks at the full range covered by the data set and replaces missing values with either the highest or the lowest value, depending on context.
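A minimal pandas sketch, assuming a single numeric column:

```python
import pandas as pd

s = pd.Series([4.0, None, 9.0, 1.0, None])

filled_with_min = s.fillna(s.min())  # every gap becomes 1.0
filled_with_max = s.fillna(s.max())  # every gap becomes 9.0
```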
Start or End of Distribution
Similar to minimum or maximum value imputation, start/end of distribution imputation replaces missing values with a value drawn from one of the tails of the distribution of non-missing values, for example a value several standard deviations below or above the mean.
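One common convention, sketched below with pandas, places the fill value three standard deviations above the mean of the observed values; the exact cut-off is an assumption and varies between implementations:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, None, 11.0, 13.0, None])

# Fill with a value from the far (upper) end of the observed distribution.
end_of_distribution = s.mean() + 3 * s.std()
filled = s.fillna(end_of_distribution)
```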
Most Frequent Value
Most Frequent Value is essentially K Nearest Neighbors applied to the entire data set. Missing values are replaced with whichever value occurs most frequently throughout.
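scikit-learn’s SimpleImputer covers this case directly; the sketch below assumes a single categorical column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["red"], ["red"]], dtype=object)

# Every missing entry becomes the most frequent value in the column ("red").
X_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
```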
Average or Linear Interpolation
Average or Linear Interpolation is a slightly more complex take on Next or Previous Value. Instead of copying one neighboring value, it uses the values on either side of the gap to estimate what the missing value is likely to be, for instance by averaging them or drawing a straight line between them.
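With pandas, linear interpolation over an ordered series is a one-liner; the sketch assumes evenly spaced observations:

```python
import pandas as pd

s = pd.Series([10.0, None, 14.0, 16.0])

# The gap is estimated from the values on either side: (10 + 14) / 2 = 12.
interpolated = s.interpolate(method="linear")
```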
Mean, Median, or Rounded Mean Imputation
As the name suggests, this school of techniques replaces missing values with the mean (average), median (middle value), or rounded mean of the non-missing entries in the data set.
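A minimal pandas sketch, applied to a single numeric column:

```python
import pandas as pd

s = pd.Series([2.0, 3.0, None, 7.0, None])

mean_filled = s.fillna(s.mean())            # average of the observed values
median_filled = s.fillna(s.median())        # middle observed value
rounded_filled = s.fillna(round(s.mean()))  # rounded average
```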
Fixed Value Imputation
This technique replaces all missing data with a fixed value. It’s most useful for MNAR data: for instance, missing entries in a survey can simply be replaced with a value that signifies ‘unanswered’ or ‘unfilled’. This technique is also known as arbitrary value imputation.
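A minimal sketch with pandas, assuming survey answers stored as strings:

```python
import pandas as pd

answers = pd.Series(["yes", None, "no", None])

# Every gap receives the same fixed marker value.
filled = answers.fillna("unanswered")
```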
Missing Value Prediction
Missing value prediction is a slightly more sophisticated single imputation method that uses a machine-learning model to analyze and predict missing values. It typically combines a specialized algorithm with one of the basic imputation methods described above.
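One possible sketch: a numeric column (“income”) with gaps is predicted from a fully observed column (“age”) using a random forest. The column names and the choice of model are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30_000.0, 42_000.0, np.nan, 88_000.0, np.nan, 35_000.0],
})

known = df[df["income"].notna()]
missing_rows = df[df["income"].isna()]

# Fit a model on the rows where the target column is observed...
model = RandomForestRegressor(random_state=0)
model.fit(known[["age"]], known["income"])

# ...then predict the missing entries from the remaining features.
df.loc[df["income"].isna(), "income"] = model.predict(missing_rows[["age"]])
```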
Model-Based Imputation
Model-based imputation, also known as data augmentation, is an advanced technique that leverages statistical or machine-learning models to generate new data from existing data. The primary difference between data augmentation and other forms of data imputation is that augmentation isn’t strictly used to fill out missing data. Instead, because data augmentation generates new data based on existing data, it’s more frequently used to add diversity to a data set in order to prevent overfitting.
Although often more effective than simpler imputation techniques, data augmentation has weaknesses of its own, chief among them that it can end up amplifying any biases a data set already contains.
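As one concrete sketch of the model-based family, scikit-learn’s (still experimental) IterativeImputer regresses each incomplete feature on the remaining features and cycles through them for several rounds; this is one possible implementation, not the only one:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each feature with gaps is modelled as a function of the other features,
# and the round-robin fitting is repeated until the estimates settle.
X_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```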
Categorical Missing Data Imputation
Categorical Missing Data Imputation is a specialized, cloud-based software tool that replaces missing categorical values in a data set based on the patterns the data set contains. Developed by multinational IT and consulting company Mphasis, it’s marketed as effective for data with up to 25% missing values.
Which Data Imputation Technique Should You Use?
Generally, there’s no such thing as a ‘perfect’ approach to data imputation. Instead, you should choose your techniques based on your specific use case. With that said, a machine learning-based approach is usually – but not always – the best option.