In today’s highly digital world, data hygiene is essential. It not only keeps your business operating smoothly by making essential data readily available to those who require it but also plays a major role in cybersecurity. Depending on your location, it may even be required by law.
Data hygiene is also important when processing datasets for machine learning. Poor-quality data can have a significant negative effect on a model’s accuracy. One of the most common mistakes in that regard involves redundant data.
What is Data Redundancy?
Data redundancy occurs when the same piece of data can be found in two or more locations within an organization. These locations could be anything from multiple servers to multiple entries within a database. Unintentional data redundancy can occur for many reasons, ranging from unnecessarily complex processes to inefficient coding practices.
It’s important to note that data redundancy in a business context isn’t always a bad thing, nor is it always accidental. A common best practice, for instance, is to maintain multiple redundant backups of critical files and systems. This intentional data duplication is a critical part of business continuity and disaster recovery.
In the context of machine learning, the definition of data redundancy is slightly broader and has more to do with a lack of variance and diversity in training data. Essentially, it classifies as redundant any data samples that, due to their similarity to other samples, fail to add value to a data set. If represented by vectors, these samples would have minimal distance separating them. This form of redundancy is known as semantic redundancy.
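To make the idea of "minimal distance" concrete, here is a minimal sketch that compares samples by the cosine distance between their vectors. The three vectors and the 0.01 threshold are made-up values for illustration only; in practice the vectors would come from an embedding model and the threshold would need tuning for each data set.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings: two almost identical samples and one unrelated one.
frame_a = [0.90, 0.10, 0.30]
frame_b = [0.88, 0.12, 0.31]
other = [0.10, 0.95, 0.20]

REDUNDANCY_THRESHOLD = 0.01  # assumed cut-off, not a recommended value

def is_redundant(a, b, threshold=REDUNDANCY_THRESHOLD):
    """Treat two samples as semantically redundant if their vectors nearly coincide."""
    return cosine_distance(a, b) < threshold

print(is_redundant(frame_a, frame_b))  # the two similar samples
print(is_redundant(frame_a, other))    # the unrelated sample
```

Cosine distance is only one possible choice here. Euclidean distance or another metric may suit a given embedding better.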
Issues with semantic redundancy most frequently surface in machine learning tasks related to computer vision – essentially, an algorithm’s capacity to recognize and identify visual objects. Video is especially challenging for engineers in this regard. Photos are generally deliberate, and there’s a great deal of variance between them.
Video, on the other hand, presents an uninterrupted stream of images, many of which are nearly identical to one another. This makes it quite difficult to effectively train a computer vision model on video alone without significant pre-processing of the training data. Video data sets also tend to be far larger than data sets consisting primarily of still images.
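One simple form of pre-processing is to drop frames that barely differ from the last frame that was kept. The sketch below is a toy illustration under stated assumptions: each frame is a flat list of grayscale pixel values, and the function names and the `min_diff` threshold are hypothetical. Raw pixel difference is also only a crude stand-in for semantic similarity.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel absolute difference between two same-sized frames."""
    return sum(abs(p - q) for p, q in zip(frame_a, frame_b)) / len(frame_a)

def drop_near_identical_frames(frames, min_diff=10.0):
    """Keep a frame only if it differs enough from the last frame that was kept."""
    kept = []
    for frame in frames:
        if not kept or mean_abs_diff(kept[-1], frame) >= min_diff:
            kept.append(frame)
    return kept

# Toy "video": each frame is four grayscale pixel values in the range 0-255.
video = [
    [10, 10, 10, 10],
    [11, 10, 10, 9],       # almost identical to the first frame
    [12, 11, 10, 10],      # still almost identical
    [200, 190, 180, 210],  # scene change
    [201, 190, 181, 210],  # almost identical to the scene change
]

kept_frames = drop_near_identical_frames(video)
print(len(video), "->", len(kept_frames))  # prints "5 -> 2"
```

Comparing each frame against the last kept frame, rather than the immediately preceding one, stops slow drift from slipping through one small step at a time.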
This is made all the more challenging by the fact that semantic redundancy can occur in a number of different areas, not all of which are obvious at a glance. These include, but are not limited to:
- Weather conditions.
- Positioning of objects in frame.
- Positioning of subjects in frame.
- Number of subjects.
- Location – i.e., outdoors at the park versus indoors in an office.
- Color palette.
- Size.
- Lighting.
Data Duplication vs. Data Redundancy
Where general data management is concerned, data duplication and data redundancy ultimately mean the same thing. Both terms refer to the presence of identical data within a system. Machine learning is the only area in which the two terms can be said to differ from one another, owing largely to the field’s greater concern with semantic redundancy as opposed to outright duplicate data.
Semantic redundancies in computer vision are also referred to as near-duplicates.
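The difference matters in practice because the usual way of catching outright duplicates does not catch near-duplicates. As a rough sketch, the snippet below groups samples by a SHA-256 digest of their raw bytes. The file names and byte values are invented for illustration, and `find_exact_duplicates` is a hypothetical helper rather than part of any library.

```python
import hashlib

def find_exact_duplicates(samples):
    """Group sample names whose raw bytes are identical, using SHA-256 digests."""
    by_digest = {}
    for name, payload in samples.items():
        digest = hashlib.sha256(payload).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    return [names for names in by_digest.values() if len(names) > 1]

# Hypothetical samples: short byte strings standing in for image files.
samples = {
    "cat_001.png": bytes([10, 20, 30, 40]),
    "cat_001_copy.png": bytes([10, 20, 30, 40]),  # exact duplicate
    "cat_002.png": bytes([10, 20, 30, 41]),       # near-duplicate: one value differs
}

duplicates = find_exact_duplicates(samples)
print(duplicates)
```

The exact copy is caught, but the sample that differs by a single value gets a completely different digest and goes unnoticed. That is why semantic redundancy calls for distance-based comparison instead.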
Why is Data Redundancy in Machine Learning a Bad Thing?
The most obvious repercussion of redundant training data is that it wastes both time and resources. If an organization is engaged in supervised machine learning, for instance, this means that its data scientists must spend their time labeling unnecessary samples. The machine learning model, meanwhile, wastes computing and processing resources parsing the redundant samples.
While the impact of minor redundancies may not be noticeable, too many redundancies result in an unnecessarily large data set, which could, in turn, represent an efficiency bottleneck.
Efficiency is far from the only concern when it comes to redundant data, which can also impact machine learning models in several ways.
Data Redundancy and Performance
A model fed with a large quantity of redundant data often performs worse than one trained on a varied dataset. This is because the latter model has the context to understand and respond to a larger number of situations and scenarios, making it more flexible and versatile. In extreme cases, training on heavily redundant data can result in something known as overfitting.
Essentially, this occurs when a machine learning model gives accurate predictions for its training data but is incapable of parsing or adapting to new data.
Consequently, removing redundancy from training data can improve a model, at least up to a point. The 10% You Don’t Need does not prescribe a single optimal ratio. Rather, it reports that at least 10 percent of the samples in the data sets it examined could be removed without a significant loss in accuracy. That is to say that a model trained on the remaining 90 percent of its data set can perform on par with one trained on all of it.
Through testing, the same paper establishes that both CIFAR-10 and ImageNet – two of the most commonly used computer vision data sets – are at least ten percent redundant. This means that if your organization intends to use either data set to train its algorithms, it’s advisable to filter that data first.
Data Redundancy and Anomalies
In database design, data redundancy frequently results in data inconsistency. This occurs when copies of the same piece of data exist in different formats or hold different values. Once the copies disagree, there is no reliable way to tell which one is correct, which can ultimately render the data useless.
Where machine learning is concerned, the concept is a little different. Due to data redundancy, inconsistencies can occur in a model’s performance, but rarely does this result in anomalous output. That isn’t to say the two are entirely unrelated.
The presence of redundant samples in a model’s training may amplify other issues in the data pipeline. Data inconsistencies that would otherwise be disregarded may instead be learned by the model. This results in a machine learning algorithm that delivers inaccurate, inconsistent output.
How to Perform a Data Redundancy Check
There are multiple ways to reduce data redundancy in both databases and file systems:
- Consolidating all data into a single source of truth.
- Normalizing databases.
- Using database management tools.
- Standardizing data entry.
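As an illustration of normalization, one of the techniques listed above, the following sketch uses Python's built-in sqlite3 module. The table and column names are hypothetical. The point is that each customer's email is stored exactly once and referenced by ID, so it cannot fall out of sync across orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each customer is stored once; orders refer to the customer by id.
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'ana@example.com')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "keyboard"), (1, "monitor")],
)

# Updating the email touches one row, so the two orders cannot disagree about it.
conn.execute("UPDATE customers SET email = 'ana@example.org' WHERE id = 1")

rows = conn.execute("""
    SELECT o.item, c.email
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(rows)
```

Had the email been copied into every order row, the same update would have needed to touch each copy, and any copy that was missed would become an inconsistency.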
Unfortunately, none of these techniques are applicable to machine learning and computer vision. Instead, when addressing the issues of redundant training samples, you’re left with three options. First and foremost, you can try sifting through the data yourself to manually remove semantically similar samples – a painstaking process for what may ultimately be minimal returns.
Alternatively, you can try to find a pre-filtered version of the data set you want to use. However, there is no guarantee that the filtered data will be relevant to your specific training use case. Finally, you can follow the processes recommended in The 10% You Don’t Need to automatically filter out redundant or low-value samples. This involves a combination of embedding and agglomerative clustering. Towards Data Science outlines the two-step approach as follows:
- Use a specialized, pre-trained model, fine-tuned on the data set through self-supervision, to generate an embedding for each sample.
- Use agglomerative clustering or a combination of destructive and constructive algorithms to automatically filter out redundancies.
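The clustering step can be sketched in a few lines. What follows is a naive, pure-Python illustration and not the implementation used in the paper or by any vendor. The embeddings are made-up 2-D points, and Euclidean distance, average linkage, the `max_distance` threshold, and keeping the lowest-index sample from each cluster are all assumptions made for the example. Libraries such as scikit-learn and SciPy provide optimized agglomerative clustering for real workloads.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative_clusters(embeddings, max_distance):
    """Naive average-linkage agglomerative clustering.

    Starts with one cluster per sample and repeatedly merges the two closest
    clusters until no pair is closer than max_distance. Returns lists of indices.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(euclidean(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break  # remaining clusters are too far apart to count as redundant
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def select_representatives(embeddings, max_distance):
    """Keep one sample per cluster (the lowest index) and drop the rest."""
    clusters = agglomerative_clusters(embeddings, max_distance)
    return sorted(min(cluster) for cluster in clusters)

# Hypothetical 2-D embeddings; real ones would come from the embedding model.
embeddings = [
    (0.00, 0.00), (0.05, 0.00), (0.00, 0.04),  # three near-identical samples
    (1.00, 1.00), (1.02, 0.97),                # two near-identical samples
    (3.00, 0.50),                              # a unique sample
]

kept = select_representatives(embeddings, max_distance=0.2)
print(kept)  # prints [0, 3, 5]
```

The pairwise search is recomputed on every merge, so this version only suits small experiments. The threshold controls how aggressively samples are discarded, and a sensible value depends on the embedding space.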
The above can be completed manually or using a purpose-made data filtering solution. It ultimately comes down to how much time and effort you have available to spend tweaking your training data. As a general rule, however, most organizations would be better off seeking out vendors rather than attempting to manage the entire process on their own.