Training and Hyperparameter Tuning

Training your model on big data or running hyperparameter tuning can consume significant compute resources and energy, and incur significant time and expense. Using distributed solutions such as Spark (when possible) can speed up compute time, but will still consume significant energy and incur significant cost.

Using our much smaller Coreset structure, you can train or tune your model orders of magnitude faster and consume significantly less compute resources and energy, without impacting model accuracy. Simply use DataHeroes to build the Coreset, then train the model with any standard library such as scikit-learn or PyTorch, or use the DataHeroes library's built-in training mechanism.
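
As an illustration, here is a minimal sketch of that workflow in Python. The dataheroes calls (CoresetTreeServiceLG, build, get_coreset, and the structure of its return value) follow the pattern shown in the DataHeroes quick-start and should be treated as assumptions here; refer to the library documentation for the exact API. The scikit-learn side is standard: the Coreset weights are simply passed as sample_weight.

    from dataheroes import CoresetTreeServiceLG          # assumed import path
    from sklearn.linear_model import LogisticRegression

    # X, y: the full training data (e.g. NumPy arrays).
    service = CoresetTreeServiceLG(optimized_for='training')   # assumed constructor argument
    service.build(X, y)

    # Retrieve the Coreset: a small, weighted subset of the original samples.
    coreset = service.get_coreset()
    indices, X_coreset, y_coreset = coreset['data']      # assumed return structure
    weights = coreset['w']                               # importance weights (assumed key)

    # Train with any standard library; scikit-learn accepts the weights directly.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_coreset, y_coreset, sample_weight=weights)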

Below are examples of model quality when training on the full dataset, compared to training on a DataHeroes Coreset and on a uniform (random) sample.
Training Dataset         Size       % of Full Dataset   AUC
Full Dataset             580,440    100%                0.780
Random Sample 1%         5,804      1%                  0.725
Random Sample 15%        87,066     15%                 0.778
DataHeroes Coreset 1%    5,804      1%                  0.780
Publicly available Pokerhand dataset with 580K samples for training, 249K samples for testing, and 83 features. Using logistic regression to predict the final poker hand based on the initial cards drawn.
Training Dataset         Size        % of Full Dataset   R^2     MSE
Full Dataset             8,000,000   100%                0.999   0.00153
Uniform Sample 20%       1,600,000   20%                 0.999   0.00159
DH Coreset 0.001%        80          0.001%              0.999   0.00155
NYC Yellow Taxi dataset with 8 million rows for training, 2 million rows for testing, and 15 features. Using linear regression to predict the tip amount.

If you are still developing your model, you can run quick iterations of data cleaning and training on the Coreset and see your model improve in real time.
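
For example, one quick and purely illustrative iteration might look like the sketch below, which continues from the snippet above: it flags the Coreset samples the current model finds most surprising (a hypothetical stand-in for your own cleaning logic), retrains on the cleaned Coreset, and compares scores on an assumed held-out validation set X_val, y_val.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Score each Coreset sample by the probability the current model assigns
    # to its recorded label; very low values often point to label noise.
    proba = model.predict_proba(X_coreset)
    label_col = np.searchsorted(model.classes_, y_coreset)
    proba_for_label = proba[np.arange(len(y_coreset)), label_col]
    suspect = proba_for_label < 0.05                 # hypothetical review threshold

    # Drop (or relabel, after review) the flagged samples and retrain on the Coreset.
    model_clean = LogisticRegression(max_iter=1000)
    model_clean.fit(X_coreset[~suspect], y_coreset[~suspect],
                    sample_weight=weights[~suspect])

    # Compare on the assumed held-out validation set.
    print('accuracy before cleaning:', model.score(X_val, y_val))
    print('accuracy after cleaning: ', model_clean.score(X_val, y_val))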

If you've finished development and want to tune hyperparameters, run your hyperparameter search on the Coreset: you can test many more configurations in a fraction of the time to find the optimal hyperparameters for your model.
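
As a sketch, the loop below tunes the regularization strength of a logistic regression on the Coreset, reusing X_coreset, y_coreset, and weights from the first snippet and the assumed held-out set X_val, y_val; the grid of C values is hypothetical. Because each candidate fits on a small fraction of the data, you can afford a far larger grid for the same compute budget.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    best_score, best_C = -np.inf, None
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:          # hypothetical search grid
        candidate = LogisticRegression(C=C, max_iter=1000)
        candidate.fit(X_coreset, y_coreset, sample_weight=weights)
        score = candidate.score(X_val, y_val)        # validation accuracy; swap in your own metric
        if score > best_score:
            best_score, best_C = score, C

    print(f'best C = {best_C} (validation accuracy {best_score:.3f})')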