Data Cleaning

The DataHeroes framework uses a Coreset property referred to as Importance (or Sensitivity) to systematically identify potential errors and anomalies in your data.

When a Coreset is computed, every instance in the data is assigned an Importance value, which indicates how important that instance is to the final machine learning model. Instances that receive a high Importance value require attention, as they usually indicate a labeling error, an anomaly, an out-of-distribution problem or another data-related issue. To learn more about Coreset properties, visit Introduction to Coresets.
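
As a rough illustration of the review step, the sketch below ranks instances by Importance and returns the ones to inspect first. It assumes the Importance values have already been obtained and are available as a plain array; the function name `top_importance_indices` and the sample values are hypothetical and are not part of the DataHeroes API.

```python
import numpy as np


def top_importance_indices(importance, k):
    """Return the indices of the k largest Importance values, largest first.

    `importance` is assumed to be a 1-D sequence of per-instance values
    that were computed elsewhere (hypothetical input, not a library call).
    """
    importance = np.asarray(importance, dtype=float)
    k = min(k, importance.size)
    # Sort ascending, then reverse so the highest values come first.
    order = np.argsort(importance, kind="stable")[::-1]
    return order[:k]


# Made-up Importance values for five instances.
sample_importance = [0.1, 0.9, 0.3, 2.5, 0.2]
to_review = top_importance_indices(sample_importance, k=2)
print(to_review.tolist())
```

The returned indices can then be used to pull the matching rows and labels for manual inspection, starting with the highest-ranked instance.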

Reviewing the instances with the highest Importance values will uncover many errors. The histogram below illustrates how clean and noisy data can be separated by using the Importance value.


Ready to Get Started?