Our unique Coreset Tree Structure (also referred to as Streaming Tree) allows you to update labels or delete data instances and immediately see how the changes affect your model without requiring re-computation of the model on the full dataset (to learn more about Streaming Trees, visit Introduction to Coresets).
The below graphs demonstrate the effect on balanced accuracy of identifying and fixing labeling errors using the Importance property, relative to random.
Data Cleaning
The DataHeroes framework uses a Coreset property referred to as Importance (or Sensitivity) to systematically identify potential errors and anomalies in your data.
When computing a Coreset, every instance in the data is assigned an Importance value, which indicates how important it is to the final machine learning model. Instances that receive a high
Importance value in the Coreset computation require attention as they usually indicate a labeling error, anomaly, out-of-distribution problem or other data-related issue (to learn more about
Coreset properties, visit Introduction to Coresets).
Reviewing the instances with the highest importance will uncover many errors. The below histogram illustrates how clean and noisy data can be separated by using the Importance value.
Finding Incorrect Labels in ImageNet
The ImageNet project is a large visual database designed for use in visual object recognition software research. There are various subsets of the ImageNet dataset used in various context. One of the most highly used subset of ImageNet is referred to in the research literature as ImageNet-1K or ILSVRC2017.
ImageNet-1K contains 1,281,167 training images, 50,000 validation images and 100,000 test images. In this blog post, we show how we used our library to easily identify many labeling errors in ImageNet-1K. Below are a few examples of labeling errors from the βMissileβ class (images incorrectly labeled as Missiles).