Training and Hyperparameter Tuning

Training your model on big data or running hyperparameter tuning can consume significant compute resources and energy, and incur significant time and expense. Using distributed solutions such as Spark (when possible) can speed up compute time, but will still consume significant energy and incur significant cost.

Using our much smaller Coreset structure, you can train or tune your model orders of magnitude faster and consume significantly less compute resources and energy, without impacting model accuracy. Simply use DataHeroes to build the Coreset, then train the model with any standard library such as scikit-learn or PyTorch, or use the DataHeroes library's built-in training mechanism.
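
As an illustration, here is a minimal sketch of that workflow in Python. The dataheroes calls (CoresetTreeServiceLG, build, get_coreset, and the structure of its return value) follow the pattern shown in the DataHeroes quick-start and should be treated as assumptions here; refer to the library documentation for the exact API. The scikit-learn side is standard: the Coreset weights are simply passed as sample_weight.

    from dataheroes import CoresetTreeServiceLG          # assumed import path
    from sklearn.linear_model import LogisticRegression

    # X, y: the full training data (e.g. NumPy arrays).
    service = CoresetTreeServiceLG(optimized_for='training')   # assumed constructor argument
    service.build(X, y)

    # Retrieve the Coreset: a small, weighted subset of the original samples.
    coreset = service.get_coreset()
    indices, X_coreset, y_coreset = coreset['data']      # assumed return structure
    weights = coreset['w']                               # importance weights (assumed key)

    # Train with any standard library; scikit-learn accepts the weights directly.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_coreset, y_coreset, sample_weight=weights)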

Below are examples of model quality when training on the full dataset, compared to training on a DataHeroes Coreset and on a uniform (random) sample.
Training Dataset         Size       % of Full Dataset   AUC
Full Dataset             580,440    100%                0.780
Random Sample 1%         5,804      1%                  0.725
Random Sample 15%        87,066     15%                 0.778
DataHeroes Coreset 1%    5,804      1%                  0.780
Publicly available Pokerhand dataset with 580K samples for training, 249K samples for testing, and 83 features. Using logistic regression to predict the final poker hand based on the initial cards drawn.
Training Dataset         Size        % of Full Dataset   R^2     MSE
Full Dataset             8,000,000   100%                0.999   0.00153
Uniform Sample 20%       1,600,000   20%                 0.999   0.00159
DH Coreset 0.001%        80          0.001%              0.999   0.00155
NYC Yellow Taxi dataset with 8 million rows for training, 2 million rows for testing, and 15 features. Using linear regression to predict the tip amount.

If you are still developing your model, you can run quick iterations of data cleaning and training on the Coreset and see your model improve in real time.
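
For example, one quick and purely illustrative iteration might look like the sketch below, which continues from the snippet above: it flags the Coreset samples the current model finds most surprising (a hypothetical stand-in for your own cleaning logic), retrains on the cleaned Coreset, and compares scores on an assumed held-out validation set X_val, y_val.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Score each Coreset sample by the probability the current model assigns
    # to its recorded label; very low values often point to label noise.
    proba = model.predict_proba(X_coreset)
    label_col = np.searchsorted(model.classes_, y_coreset)
    proba_for_label = proba[np.arange(len(y_coreset)), label_col]
    suspect = proba_for_label < 0.05                 # hypothetical review threshold

    # Drop (or relabel, after review) the flagged samples and retrain on the Coreset.
    model_clean = LogisticRegression(max_iter=1000)
    model_clean.fit(X_coreset[~suspect], y_coreset[~suspect],
                    sample_weight=weights[~suspect])

    # Compare on the assumed held-out validation set.
    print('accuracy before cleaning:', model.score(X_val, y_val))
    print('accuracy after cleaning: ', model_clean.score(X_val, y_val))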

If you've finished development and want to tune hyperparameters, run your hyperparameter search on the Coreset: you can test many more configurations in a fraction of the time to find the optimal hyperparameters for your model.
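
As a sketch, the loop below tunes the regularization strength of a logistic regression on the Coreset, reusing X_coreset, y_coreset, and weights from the first snippet and the assumed held-out set X_val, y_val; the grid of C values is hypothetical. Because each candidate fits on a small fraction of the data, you can afford a far larger grid for the same compute budget.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    best_score, best_C = -np.inf, None
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:          # hypothetical search grid
        candidate = LogisticRegression(C=C, max_iter=1000)
        candidate.fit(X_coreset, y_coreset, sample_weight=weights)
        score = candidate.score(X_val, y_val)        # validation accuracy; swap in your own metric
        if score > best_score:
            best_score, best_C = score, C

    print(f'best C = {best_C} (validation accuracy {best_score:.3f})')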