Build a Better ML Model. 10x Faster.
Reduce your dataset to a small subset that maintains the statistical properties and corner cases of your full dataset.
Use standard libraries to explore, clean, label, train, and tune your models on the smaller subset and build a higher-quality model, faster.
```bash
pip install dataheroes
```
Reduce Dataset Size without Losing Accuracy
Using algorithms from computational geometry known as Coresets, the library computes the importance of each instance in your dataset and builds a weighted subset that is orders of magnitude smaller than the original dataset, yet maintains its statistical properties and corner cases (this subset is referred to as a Coreset). Perform all data science operations on the Coreset to save significant time and compute resources without losing accuracy.
| Training Dataset | Size | % of full dataset | AUC |
|---|---|---|---|
| Full dataset | 580,440 | 100% | 0.780 |
| Random Sample 1% | 5,804 | 1% | 0.725 |
| Random Sample 15% | 87,066 | 15% | 0.778 |
| DataHeroes Coreset 1% | 5,804 | 1% | 0.780 |
```python
from dataheroes import CoresetTreeServiceLG

# Build a Coreset tree optimized for training on a 580,440-instance dataset
service_obj = CoresetTreeServiceLG(optimized_for='training', n_instances=580440)
service_obj.build(X, y)
```
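Once the tree is built, you can retrieve the Coreset and train a standard model on it. The snippet below is a minimal sketch: `get_coreset()` and the `'data'`/`'w'` keys are the same calls used in the training example further down this page, while the scikit-learn LogisticRegression estimator is illustrative and not part of the library.

```python
from sklearn.linear_model import LogisticRegression

# Retrieve the weighted Coreset built by the service
coreset = service_obj.get_coreset()
indices_coreset, X_coreset, y_coreset = coreset['data']

# Train on the Coreset, passing the Coreset weights as per-sample weights
model = LogisticRegression(max_iter=1000)
model.fit(X_coreset, y_coreset, sample_weight=coreset['w'])
```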
Build a Better Model (by Systematically Finding and Fixing Errors in Your Data)
Every model is only as good as the quality of the data used to train it, but finding errors in a large dataset is like finding a needle in a haystack. The DataHeroes framework uses the Coreset attributes to systematically identify potential errors and anomalies and flag them for review. Fix the errors and see your model update in real time as it is re-trained on the Coreset.
```python
# Flag the 50 most important 'non_defect' samples for review
indices, importance = service_obj.get_important_samples(class_size={'non_defect': 50})

# Relabel the reviewed samples as 'defect' and update the Coreset
service_obj.update_targets(indices, y=['defect'] * len(indices))
```
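After the flagged labels are corrected, re-training on the updated Coreset takes only a few lines of standard code. This is a minimal sketch; the LogisticRegression estimator is illustrative, and any estimator that accepts `sample_weight` can be used.

```python
from sklearn.linear_model import LogisticRegression

# Retrieve the Coreset again after the label fixes were applied
coreset = service_obj.get_coreset()
_, X_coreset, y_coreset = coreset['data']

# Re-train on the corrected Coreset and see the model update immediately
model = LogisticRegression(max_iter=1000)
model.fit(X_coreset, y_coreset, sample_weight=coreset['w'])
```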
Save Time, Money & CO2 (by Training Your Model on the Coreset)
Use scikit-learn, PyTorch, or other standard libraries to train your model orders of magnitude faster on the Coreset, or run many more hyperparameter-tuning iterations to improve model quality without requiring excessive compute resources.

|  | Full Dataset | DataHeroes Coreset |
|---|---|---|
| Size | 1,800,000 | 90,000 |
| # of Iterations | 75 | 75 |
| Run Time | 4,155 secs | 189 secs |
| CO2 Emissions | 35 grams | 1.5 grams |
| Accuracy | 0.860 | 0.859 |
```python
from sklearn.model_selection import GridSearchCV

# Retrieve the Coreset (indices, samples, labels and weights) from the service
coreset = service_obj.get_coreset()
indices_coreset, X_coreset, y_coreset = coreset['data']

# Run the hyperparameter search on the weighted Coreset instead of the full dataset
gs = GridSearchCV(model, grid_params, scoring='roc_auc_ovo')
gs.fit(X_coreset, y_coreset, sample_weight=coreset['w'])
```
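Once the search finishes, the selected estimator can be validated on held-out data from the full dataset. A brief sketch, where `X_valid` and `y_valid` are placeholders for your own validation split and are not part of the library:

```python
# Inspect the best hyperparameters found on the Coreset
print("Best params:", gs.best_params_)

# Score the selected estimator on held-out data from the full dataset,
# using the same 'roc_auc_ovo' scorer configured above
# (X_valid / y_valid are placeholders for your own split)
print("Validation AUC:", gs.score(X_valid, y_valid))
```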

Avoid Data Drift (by Continuously Updating Your Model on the Coreset)
Data drift is a common issue when moving models to production, yet keeping your model up to date by continuously re-training it as new data is collected is expensive and time-consuming. Our unique Coreset tree structure allows you to add new data and update the Coreset on the go, and re-train your model on the Coreset in near real time.
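A rough sketch of that flow, reusing the service object built earlier: new batches are appended to the Coreset tree and the model is re-fit on the refreshed Coreset. The incremental `partial_build` call and the `X_new`/`y_new` batch names are assumptions for illustration; check the library documentation for the exact method in your version.

```python
from sklearn.linear_model import LogisticRegression

# Append a newly collected batch to the existing Coreset tree
# (partial_build and X_new / y_new are assumed names for illustration)
service_obj.partial_build(X_new, y_new)

# Re-train on the refreshed Coreset in near real time
coreset = service_obj.get_coreset()
_, X_coreset, y_coreset = coreset['data']
model = LogisticRegression(max_iter=1000)
model.fit(X_coreset, y_coreset, sample_weight=coreset['w'])
```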
Ready to Get Started?