Reduce Model Training Time By 10x with Coresets

In general, businesses often desire a large number of data points when building machine learning models, but they can easily find themselves burdened by the time and cost associated with operations that require processing high volumes of data.

Fortunately, there is a promising subject being explored in academia that may be capable of resolving this issue: coresets.

Coresets originated in computational geometry and have potential applications in machine learning. Although research on coresets is still in its infancy and has had little exposure in industry, it has the prospect of transforming how teams approach model training and deployment. The topic is therefore worth following.

Here, we provide a brief overview of coresets, explain why they are an asset for handling big data, and demonstrate how they can be used in Python.

The Challenge of Big Data

Given the value of data, businesses have opted to collect as much data as possible to enhance their operations. However, this endeavor has led to many businesses ending up with big data, which refers to data too large to be processed with traditional means.

If you treat big data as a heavy load that your workers have to carry, a simple way to deal with it is by using more workers.

Organizations have adopted similar strategies when tackling high volumes of data, leveraging powerful GPUs and distributed computing (e.g., Spark). However, those approaches are not always viable in practice due to time and computing limitations. Training complex models from scratch incurs costs that most businesses are unable to afford. For instance, training a model like GPT-4 costs millions of dollars, and even a team that can afford such training may be hard-pressed to deploy the result on devices with limited hardware (e.g., smartphones). Moreover, training complex models leads to significant carbon emissions and will become increasingly unpopular among businesses seeking more environmentally friendly solutions.

Thus, a more accommodating solution for handling big data is in demand.


A coreset is a small, weighted subset of data that serves as a summary of the original training dataset. Models trained with coresets can achieve the same performance (or near the same performance) as models trained with the entire dataset.
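To make the "small, weighted subset" idea concrete, here is a minimal sketch. It uses plain uniform sampling as a stand-in for a real coreset construction, which is an assumption made purely for illustration: actual coreset algorithms choose points and weights more carefully. Each sampled point gets weight n / m, so a weighted sum over the subset estimates the same sum over the full data. The data and the cost function are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for a large training set (hypothetical values).
n, m = 100_000, 2_000
X = rng.normal(loc=3.0, scale=2.0, size=(n, 2))

# Uniform sample of m points; each one "represents" n / m original points.
# (Assumption: a real coreset would not sample uniformly.)
idx = rng.choice(n, size=m, replace=False)
subset = X[idx]
weights = np.full(m, n / m)

def cost(points, center, w=None):
    """Sum of (optionally weighted) squared distances to a center."""
    d2 = ((points - center) ** 2).sum(axis=1)
    return d2.sum() if w is None else (w * d2).sum()

center = np.array([2.5, 3.5])
full_cost = cost(X, center)
approx_cost = cost(subset, center, weights)
relative_error = abs(full_cost - approx_cost) / full_cost
print(f"relative error: {relative_error:.4f}")
```

The weighted subset is 2% of the data, yet its weighted cost should land close to the full cost. Coreset constructions aim for this kind of guarantee across all candidate models, not just one center.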

If we stick to the previous analogy, coresets can help businesses manage big data by decreasing the workload instead of adding more workers. More specifically, they reduce workload by identifying and focusing on the most important tasks while omitting tasks that do not add sufficient value.


More complex problems may call for using a coreset tree, a structure composed of multiple coresets.
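One common way to organize multiple coresets is the merge-and-reduce pattern: summarize each chunk of data, then repeatedly merge pairs of summaries and shrink the result. The sketch below assumes that pattern and uses weight-proportional resampling as a placeholder for the reduce step. It is not the Dataheroes implementation; the function names and the convention that level 0 is the root are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_step(points, weights, size):
    """Shrink a weighted set to `size` points while preserving total weight.

    Placeholder reduce step: sample with replacement, proportionally to weight,
    and give every sampled point an equal share of the total weight.
    """
    if len(points) <= size:
        return points, weights
    total = weights.sum()
    idx = rng.choice(len(points), size=size, replace=True, p=weights / total)
    return points[idx], np.full(size, total / size)

def build_tree(X, chunk_size, coreset_size):
    """Return a list of levels; levels[0] holds the single root summary."""
    # Leaves: one small weighted summary per chunk of the data.
    nodes = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        nodes.append(reduce_step(chunk, np.ones(len(chunk)), coreset_size))
    levels = [nodes]
    # Merge neighbouring nodes and reduce until one root remains.
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 2):
            pts = np.concatenate([p for p, _ in nodes[i:i + 2]])
            wts = np.concatenate([w for _, w in nodes[i:i + 2]])
            merged.append(reduce_step(pts, wts, coreset_size))
        nodes = merged
        levels.append(nodes)
    return levels[::-1]

X = rng.normal(size=(8_000, 3))
levels = build_tree(X, chunk_size=1_000, coreset_size=100)
root_points, root_weights = levels[0][0]
print(len(levels), root_points.shape, root_weights.sum())
```

Deeper levels hold more nodes and therefore more points in total, which is the trade-off between summary size and fidelity that a tree structure exposes.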


There are a number of advantages that come from using coresets prior to building any machine learning models.

  • Reduced training time
    Training a model with a smaller training set makes the training and tuning procedure take much less time. Furthermore, models in production can be retrained with coresets at a much faster rate. This makes coresets a valuable tool for businesses that need to retrain often to keep up with data drift.
  • Reduced compute
    There are already established means of reducing the time needed to train models, like distributed computing. However, such solutions do not address the high computational demand of training on big data, nor do they reduce the cost of such operations. This is a key area where coresets stand out. By using a small subset of the training data, the computational demand and cost of any subsequent modeling in the project drop accordingly.
  • Improved data quality
    Coresets also have the added benefit of improving data quality by helping to identify anomalies in the training data. Anomalies are not necessarily harmful to a machine learning model, but they can sometimes be a detriment. A good example is inaccurately labeled data, which is bound to hamper model performance. In other words, coresets give users the means to improve data quality, which directly contributes to improved model quality.

Creating Coresets in Python

Recently, there have been efforts to make coreset methods available to practitioners for their machine learning projects.

One such effort is Dataheroes, a free Python library that provides the means to leverage coresets to train models with less time and compute.

Dataheroes provides tools that can generate coresets for machine learning use cases including, but not limited to:

  • Regression with Linear Regression
  • Classification with Tree-based Models
  • Clustering with K-means

Users can access this library by simply installing it into their system with pip:

pip install dataheroes

First-time users will also have to activate their account with the following snippet:

from dataheroes.utils import activate_account


Case Study

To demonstrate the effectiveness of coresets, let’s conduct a case study in which three XGBoost models are trained with the same underlying data.

The first model will be trained with the entire training data. The second model will be trained with the data derived using coresets. The third model will be trained with a random sample of the training data of the same size as the coresets.

Each model will be evaluated in terms of:

  • The number of data points
  • The balanced accuracy score (i.e., the average recall for all classes)
  • The training time
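The balanced accuracy score listed above is the mean of the per-class recalls, which keeps a dominant class from masking poor performance on rare ones. The short sketch below checks that definition against scikit-learn's balanced_accuracy_score on a small made-up set of labels.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical labels and predictions for a 3-class problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 0])

# Recall per class: class 0 -> 3/4, class 1 -> 1/2, class 2 -> 3/4
per_class_recall = recall_score(y_true, y_pred, average=None)
manual_balanced = per_class_recall.mean()
library_balanced = balanced_accuracy_score(y_true, y_pred)
print(manual_balanced, library_balanced)
```

Both values come out to 2/3 here, whereas plain accuracy would be 7/10.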

The case study will be carried out with the covtype dataset from OpenML.

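from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

# load the data
X, y = fetch_covtype(return_X_y=True)

# split the data into train and test sets (stratified, so class proportions are preserved)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Number of data points in the training data: {X_train.shape[0]}')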

As shown in the output, the training data comprises 464,809 data points. The features are scaled before any modeling.
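The article does not show the scaling step, so the following is only a sketch of one common way to do it, assuming scikit-learn's StandardScaler. The arrays are synthetic stand-ins for X_train and X_test. The scaler is fitted on the training split alone and then reused on the test split, so no test statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train and X_test (hypothetical shapes and values).
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(1_000, 4))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```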

Training a Model with the Entire Dataset

First, we’ll train an XGBoost model with the entire training data (i.e., 464,809 data points).

import xgboost as xgb

# train the model with the entire training data
full_dataset_model = xgb.XGBClassifier(random_state=42).fit(X_train, y_train)
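One practical caveat, stated here as an assumption about the reader's environment rather than something the article covers: the covtype labels returned by scikit-learn run from 1 to 7, while recent XGBoost releases expect class labels that start at 0. If fit complains about invalid classes, remapping the labels with scikit-learn's LabelEncoder is one way around it. The arrays below are small stand-ins for y_train and y_test.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small stand-ins for y_train / y_test with covtype-style labels 1..7.
y_train_demo = np.array([1, 2, 2, 7, 5, 3, 4, 6, 1, 2])
y_test_demo = np.array([2, 1, 7, 3])

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_demo)  # maps 1..7 -> 0..6
y_test_enc = encoder.transform(y_test_demo)        # same mapping for the test labels

# inverse_transform recovers the original labels for reporting
restored = encoder.inverse_transform(y_test_enc)
print(y_train_enc, y_test_enc, restored)
```

If the labels are remapped, the same encoded labels should be used for every model in the comparison so the scores stay comparable.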

Then, we’ll time the training process with the %%timeit cell magic.

%%timeit
full_dataset_model = xgb.XGBClassifier(random_state=42).fit(X_train, y_train)
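Outside a notebook, the same measurement can be approximated with the standard library. The helper below is a hypothetical sketch (time_call is not part of any library used in this article) that runs a callable a few times and reports the best wall-clock time. It also returns the result, which is handy because names assigned inside a %%timeit cell are generally not available afterwards.

```python
import time

def time_call(fn, repeats=3):
    """Run fn() `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Stand-in workload; in the case study this would be something like
# lambda: xgb.XGBClassifier(random_state=42).fit(X_train, y_train)
seconds, total = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {seconds:.4f} s")
```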


Next, the model is evaluated with the balanced accuracy score metric.

from sklearn.metrics import balanced_accuracy_score

# generate predictions

y_pred_full = full_dataset_model.predict(X_test)

# evaluate the model

full_balanced = balanced_accuracy_score(y_test, y_pred_full)

print(f'Balanced Accuracy Score: {full_balanced}')


Training a Model with Coresets

This time, we can train an XGBoost model with a coreset.

  1. Create the coreset object
    The coreset can be built using the dataheroes library’s CoresetTreeServiceDTC class, which is used for decision tree classification problems. Once the key parameters have been provided, the coreset for a given dataset is created with the build method.

    from dataheroes import CoresetTreeServiceDTC
    # Build the coreset tree
    service_obj = CoresetTreeServiceDTC(
        ...  # optimized_for, chunk_size, coreset_size (described below)
    )
    service_obj.build(X_train, y_train)

    Creating a coreset for an XGBoost model requires the optimized_for parameter, which specifies its purpose. The chunk_size and coreset_size parameters help determine the size of the derived coreset (i.e., the number of samples).

    Users can also let the library determine the optimal size of the coreset by defining only the n_instances parameter.

  2. Obtain the data for the coreset
    Next, the coreset’s data can be retrieved using the get_coreset method. Users can configure the depth of the coreset tree by providing a value for the level parameter. When experimenting, it is best to start at level 0, then increment the level by 1 and see how the coreset performs.

    # Get the coreset
    coreset = service_obj.get_coreset(level=0)
    indices, X_train_coreset, y_train_coreset = coreset['data']
    w = coreset['w']
    # Train an XGBoost model on the coreset.
    coreset_model = xgb.XGBClassifier(random_state=42).fit(X_train_coreset, y_train_coreset, sample_weight=w)
    y_pred_coreset = coreset_model.predict(X_test)
    n_samples_coreset = y_train_coreset.shape[0]
    print(f'Number of samples in the coreset: {n_samples_coreset}')

    The created coreset only has 25,741 data points! This is a small fraction (≈5.5%) compared to the 464,809 data points in the original training set. Furthermore, this result was achieved with limited experimentation. With some additional tuning, it is possible to achieve similar results with an even smaller subset!

    Let’s see how a model trained with this coreset performs.

  3. Train the model
    Using the retrieved data, we can train the XGBoost model.

    # Train an XGBoost model on the coreset.
    coreset_model = xgb.XGBClassifier(random_state=42).fit(X_train_coreset, y_train_coreset, sample_weight=w)
    y_pred_coreset = coreset_model.predict(X_test)

    The training process can be timed with the %%timeit cell magic.

    %%timeit
    # time the training process with the coreset
    coreset_model = xgb.XGBClassifier(random_state=42).fit(X_train_coreset, y_train_coreset, sample_weight=w)

  4. Evaluate the model
    Finally, the model can be evaluated with the balanced accuracy score metric.

    # Evaluate model
    coreset_score = balanced_accuracy_score(y_test, y_pred_coreset) # target: 0.8296036929211656
    print(f"Balanced score: {coreset_score}")
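The sample_weight=w argument in the steps above is what lets a small subset stand in for the full data: each coreset point counts as if it were several original points. As a minimal illustration with made-up integer weights (real coreset weights are generally not integers), a weighted statistic matches the statistic computed on the data with each point physically repeated.

```python
import numpy as np

values = np.array([2.0, 5.0, 9.0])
weights = np.array([3, 1, 2])  # point i stands for weights[i] original points

# Weighted mean over three points ...
weighted_mean = np.average(values, weights=weights)
# ... equals the plain mean over the six points they represent.
expanded_mean = np.repeat(values, weights).mean()
print(weighted_mean, expanded_mean)
```

Dropping the weights would treat every coreset point as equally important, which is the situation of the random-sample baseline.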

Training a Model with a Random Sample

To check whether the samples selected for the coreset are more informative than randomly chosen ones, we can train the model with a random sample of the same size as the coreset.

  1. Get a random sample with the same size as the coreset
    import random

    # size of coreset
    sample_length = 25741
    # Create a list of all row indices in the training data
    all_indices = list(range(X_train.shape[0]))
    # Get a random sample of indices (without replacement)
    random_indices = random.sample(all_indices, sample_length)
    # Retrieve the matching rows from both arrays using the random indices
    X_train_sample = X_train[random_indices]
    y_train_sample = y_train[random_indices]
  2. Train the model with the sample
    Using the sample from the training data, we can train the XGBoost model.

    # train the model with the sample
    sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)

    The training process can be timed with the %%timeit cell magic.

    %%timeit
    # time the training with the random sample
    sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)

  3. Evaluate the model
    Finally, the model can be evaluated with the balanced accuracy score metric.

    # evaluate the model
    sample_balanced = balanced_accuracy_score(y_test, sample_model.predict(X_test))
    print(f"Balanced score: {sample_balanced}")
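The random baseline above depends on Python's unseeded random module, so each run draws a different sample. A reproducible variant, sketched here with NumPy's Generator on synthetic stand-in arrays (the shapes and the sample size are hypothetical), draws the indices without replacement from a seeded generator.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for X_train / y_train (hypothetical shapes).
X_demo = rng.normal(size=(10_000, 5))
y_demo = rng.integers(0, 7, size=10_000)

sample_length = 500  # in the case study this would match the coreset size
random_indices = rng.choice(X_demo.shape[0], size=sample_length, replace=False)

# Index both arrays with the same indices so rows and labels stay aligned.
X_sample = X_demo[random_indices]
y_sample = y_demo[random_indices]
print(X_sample.shape, y_sample.shape)
```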

Comparing All Approaches

We can summarize the results of the case study by comparing the models built with and without the coreset using the metrics mentioned above:


The table shows that the coreset comprises a small fraction (≈5.5%) of the samples in the original training data. Despite this, the model trained on it achieves an even greater balanced accuracy score! Training with the coreset also took only a fraction of the time needed to train with the full dataset. Finally, the model trained with the coreset outperforms the model trained with the random sample.
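The size figures quoted in this case study can be checked with a line of arithmetic. The two counts below are the ones reported in the article; the variable names are just for illustration.

```python
n_full = 464_809     # training rows reported in the article
n_coreset = 25_741   # coreset rows reported in the article

fraction = n_coreset / n_full
reduction_factor = n_full / n_coreset
print(f"{fraction:.1%} of the data, about {reduction_factor:.0f}x fewer rows")
# prints: 5.5% of the data, about 18x fewer rows
```

Fewer rows does not translate one-to-one into training time, which is why the case study measures the time directly.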

To access the code for this case study, visit the GitHub repository:


So far, we’ve delved into an exciting subject in computational geometry that can potentially have far-reaching applications in machine learning. However, much like any tool or technique, using coresets comes with its own disadvantages.

  1. Incompatibility with certain datasets
    Firstly, coresets are not compatible with every dataset. They are primarily unsuited for datasets that are too small or too homogeneous. Thus, teams considering coresets would first have to perform thorough exploratory data analysis on their data.
  2. Difficulty in configuration
    The Dataheroes library contains a number of parameters that can be tuned for deriving the optimal coreset or coreset tree. However, the ideal set of parameters will vary from case to case, meaning that creating coresets will inevitably require experimentation, which could be time-consuming. That being said, such shortcomings are likely to be addressed as new versions of the library are released to the public.


The machine learning space is ever-changing, with new tools and technologies emerging to replace the old. While it is too soon to praise coresets as the next household tool for combating big data, it is worth keeping tabs on a methodology that has already shown much promise.

While this article has not extensively covered every facet of coresets, it has hopefully sparked your interest in this subject. For more information on the various features of coresets or the math behind building coresets, you can visit the Dataheroes website, which covers this subject in a concise and digestible manner.

Thank you for reading!

