Training and Hyperparameter Tuning
Training a model on a large dataset or running hyperparameter tuning can consume significant compute resources and energy, and can take a long time and incur substantial cost. Distributed solutions such as Spark (where applicable) can reduce compute time, but they still consume significant energy and remain expensive.
Using our much smaller Coreset structure, you can train or tune your model orders of magnitude faster while consuming far fewer compute resources and far less energy, without impacting model accuracy. Just use DataHeroes to build the Coreset, then train the model with any standard library such as scikit-learn or PyTorch, or use the DataHeroes library's built-in training mechanism.
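Below is a minimal sketch of this build-then-train workflow. The DataHeroes names used here (the `CoresetTreeServiceLG` class, its `optimized_for` argument, the `build` and `get_coreset` methods, and the returned keys) are assumptions based on the library's documented logistic-regression service and may differ from your installed version; the scikit-learn part uses the standard `sample_weight` mechanism.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from dataheroes import CoresetTreeServiceLG  # assumed import path

# Illustrative data; replace with your own full training set.
X = np.random.rand(100_000, 20)
y = np.random.randint(0, 2, size=100_000)

# 1. Build the Coreset once on the full dataset.
service = CoresetTreeServiceLG(optimized_for='training')  # assumed constructor argument
service.build(X, y)

# 2. Extract the much smaller weighted Coreset.
coreset = service.get_coreset()
X_c, y_c, w_c = coreset['X'], coreset['y'], coreset['w']  # assumed return keys

# 3. Train any standard estimator on the Coreset, passing the Coreset weights.
model = LogisticRegression(max_iter=1000)
model.fit(X_c, y_c, sample_weight=w_c)
```

Because the Coreset carries per-sample weights, any estimator that accepts `sample_weight` can train on it directly, with no other changes to your training code.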
| Training Dataset | Size | % of Full Dataset | AUC |
|---|---|---|---|
| Full dataset | 580,440 | 100% | 0.780 |
| Random Sample 1% | 5,804 | 1% | 0.725 |
| Random Sample 15% | 87,066 | 15% | 0.778 |
| DataHeroes Coreset 1% | 5,804 | 1% | 0.780 |

| Training Dataset | Size | % of Full Dataset | R² | MSE |
|---|---|---|---|---|
| Full Dataset | 8,000,000 | 100% | 0.999 | 0.00153 |
| Uniform Sample 20% | 1,600,000 | 20% | 0.999 | 0.00159 |
| DH Coreset 0.001% | 80 | 0.001% | 0.999 | 0.00155 |

| | Full Dataset | DataHeroes Coreset |
|---|---|---|
| Size | 1,800,000 | 90,000 |
| # of Iterations | 75 | 75 |
| Run Time | 4,155 secs | 189 secs |
| CO₂ Emissions | 35 grams | 1.5 grams |
| Accuracy | 0.860 | 0.859 |
If you are still developing your model, you can iterate quickly on data cleaning and training with the Coreset and see your model improve in real time.
If you have finished development and want to tune hyperparameters, run the hyperparameter search on the Coreset: you can test many more configurations in a fraction of the time and find the optimal hyperparameters for your model, as sketched below.
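As a hedged sketch, here is what a hyperparameter search over the Coreset might look like, reusing the assumed `X_c`, `y_c`, and `w_c` arrays from the earlier example and an illustrative scikit-learn grid (the grid values are placeholders, not recommendations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical search space; adjust to your model.
param_grid = {'C': [0.01, 0.1, 1.0, 10.0], 'solver': ['lbfgs', 'liblinear']}

# Every candidate is fit on the small weighted Coreset, so the whole search
# finishes in a fraction of the time it would take on the full dataset.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring='roc_auc')
search.fit(X_c, y_c, sample_weight=w_c)  # sample_weight is forwarded to each fit

print(search.best_params_, search.best_score_)
```

Because each candidate fit sees only the weighted Coreset, you can afford a much larger grid or more cross-validation folds within the same compute budget.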