Businesses often want as many data points as possible when building machine learning models, but they can easily find themselves burdened by the time and cost of processing such high volumes of data.
Fortunately, there is a promising subject being explored in academia that may be capable of resolving this issue: coresets.
Coresets are a concept originating in computational geometry with potential applications in machine learning. While their exploration is still in its infancy and they have not received much exposure in industry, they have the potential to transform how teams approach model training and deployment. Thus, the topic is certainly worth keeping abreast of.
Here, we provide a brief overview of coresets, explain why they are an asset for handling big data, and demonstrate how they can be used in Python.
The Challenge of Big Data
Given the value of data, businesses have opted to collect as much data as possible to enhance their operations. However, this endeavor has led to many businesses ending up with big data, which refers to data too large to be processed with traditional means.
If you treat big data as a heavy load that your workers have to carry, a simple way to deal with it is by using more workers.
Organizations have adopted similar strategies when tackling high volumes of data, leveraging powerful GPUs and distributed computing (e.g., Spark). However, those approaches are not always viable in practice due to time and computing limitations. Training complex models from scratch incurs costs that most businesses cannot afford; training a model like GPT-4 costs millions of dollars, and even teams that can afford it may be hard-pressed to deploy such models on devices with limited hardware (e.g., smartphones). Moreover, training complex models produces significant carbon emissions and will become an increasingly unpopular approach for businesses seeking more environmentally friendly solutions.
Thus, a more accommodating solution for handling big data is in demand.
Coreset
A coreset is a small, weighted subset of data that serves as a summary of the original training dataset. Models trained with coresets can achieve the same performance (or near the same performance) as models trained with the entire dataset.
If we stick to the previous analogy, coresets can help businesses manage big data by decreasing the workload instead of adding more workers. More specifically, they reduce workload by identifying and focusing on the most important tasks while omitting tasks that do not add sufficient value.
More complex problems may call for using a coreset tree, a structure composed of multiple coresets.
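As a purely illustrative toy example (not how any coreset library actually selects points), the idea that a small weighted subset can stand in for a large dataset can be sketched in a few lines of Python. Here the subset is chosen uniformly at random and weighted so that it preserves a simple summary statistic:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000_000, 3))  # a "big" dataset

# Toy stand-in for a coreset: a uniform sample whose weights sum to the
# size of the full dataset. Real coreset constructions choose points and
# weights far more carefully, with guarantees on a model's loss function.
idx = rng.choice(len(X), size=1_000, replace=False)
X_core = X[idx]
w = np.full(len(X_core), len(X) / len(X_core))

print(X.sum(axis=0))                       # statistic on the full data
print((w[:, None] * X_core).sum(axis=0))   # approximation from the weighted subset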
Benefits
There are a number of advantages that come from using coresets prior to building any machine learning models.
- Reduced training time
Training a model with a smaller training set results in a training and tuning procedure that takes much less time. Furthermore, models put into production can be retrained with coresets at a much faster rate. This makes coresets a valuable tool for businesses looking to avoid data drift.
- Reduced compute
There are already established means for reducing the time needed to train models, like using distributed compute. However, such solutions still do not address the high computational demand that comes with training models on big data, nor do they reduce the cost of carrying out such operations. This is a key area where coresets stand out. By using a small subset of the training data, the computational demand and cost of any subsequent modeling in the project automatically drop.
- Improved data quality
Coresets also have the added benefit of improving data quality by identifying anomalies in the training data. Anomalies are not necessarily harmful to a machine learning model but can sometimes be a detriment. A good example of this is inaccurately labeled data, which is bound to hamper model performance. In other words, coresets give users the means to improve data quality, which directly contributes to improved model quality.
Creating Coresets in Python
Recently, there have been efforts to make the coreset methodology accessible for machine learning projects.
One such effort is Dataheroes, a free Python library that provides the means to leverage coresets to train models with less time and compute.
Dataheroes provides tools that can generate coresets for machine learning use cases including, but not limited to:
- Regression with Linear Regression
- Classification with Tree-based Models
- Clustering with K-means
Users can access this library by simply installing it into their system with pip:
pip install dataheroes
First-time users will also have to activate their account with the following snippet:
from dataheroes.utils import activate_account
activate_account("first_last@gmail.com")
Case Study
To demonstrate the effectiveness of coresets, let’s conduct a case study in which three XGBoost models are trained with the same underlying data.
The first model will be trained with the entire training data. The second will be trained with a coreset derived from that data. The third will be trained with a random sample of the training data of the same size as the coreset.
Each model will be evaluated in terms of:
- The number of data points
- The balanced accuracy score (i.e., the average recall for all classes)
- The training time
The case study will be carried out with the covtype dataset from OpenML.
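The snippets below assume a small amount of setup. The exact imports are not shown in the original article, so the following is our best guess at what is needed (the aliases are conventional choices):

import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_covtype
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split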
# load the data
X, y = fetch_covtype(return_X_y=True)

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Number of data points in the training data: {X_train.shape[0]}')
The output shows that the training data comprises 464,809 data points. The features are scaled prior to any modeling.
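The scaling step itself is not shown in the article; a minimal sketch using scikit-learn's StandardScaler (one reasonable choice, not necessarily the author's) could look like this:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)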
Training a Model with the Entire Dataset
First, we’ll train an XGBoost model with the entire training data (i.e., 464,809 data points).
# train the model with the entire training data
full_dataset_model = xgb.XGBClassifier(random_state=42)
full_dataset_model.fit(X_train, y_train)
Then, we’ll time the training process with the %%timeit magic command.
%%timeit
full_dataset_model = xgb.XGBClassifier(random_state=42)
full_dataset_model.fit(X_train, y_train)
Next, the model is evaluated with the balanced accuracy score metric.
# generate predictions
y_pred_full = full_dataset_model.predict(X_test)

# evaluate the model
full_balanced = balanced_accuracy_score(y_test, y_pred_full)
print(f'Balanced Accuracy Score: {full_balanced}')
Training a Model with Coresets
This time, we can train an XGBoost model with a coreset.
- Create the coreset object
The coreset can be built using the dataheroes library’s CoresetTreeServiceDTC subclass, which is used for decision tree classification problems. Once the key parameters have been provided, the coreset for a given dataset is created with the build method.

from dataheroes import CoresetTreeServiceDTC

# Build the coreset tree
service_obj = CoresetTreeServiceDTC(
    optimized_for='training',
    n_classes=7,
    chunk_size=40_000,
    coreset_size=15_000
)
service_obj.build(X_train, y_train)
Creating a coreset for an XGBoost model requires the optimized_for parameter, which specifies what the coreset is optimized for (here, 'training'). The chunk_size and coreset_size parameters help determine the size of the derived coreset (i.e., its number of samples).
Users can also let the library determine the optimal size of the coreset by defining only the n_instances parameter.
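For instance, a sketch of that alternative, assuming n_instances simply replaces the explicit chunk_size and coreset_size arguments (our reading of the library's documentation rather than something shown in this article):

# Let the library choose the chunk and coreset sizes automatically
service_obj = CoresetTreeServiceDTC(
    optimized_for='training',
    n_classes=7,
    n_instances=X_train.shape[0]
)
service_obj.build(X_train, y_train)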
- Obtain the data for the coreset
Next, the coreset’s data can be retrieved using the get_coreset method. Users can configure the depth of the coreset tree by providing a value for the level parameter. When experimenting, it is best to start at level 0 and then increment the level by 1 and see how the coreset performs.
# Get the coreset
coreset = service_obj.get_coreset(level=0)
indices, X_train_coreset, y_train_coreset = coreset['data']
w = coreset['w']

n_samples_coreset = y_train_coreset.shape[0]
print(f'Number of samples in the coreset: {n_samples_coreset}')
The created coreset only has 25,741 data points! This is a small fraction (≈5.5%) compared to the 464,809 data points in the original training set. Furthermore, this result was achieved with limited experimentation. With some additional tuning, it is possible to achieve similar results with an even smaller subset!
Let’s see how a model trained with this coreset performs.
- Train the model
Using the retrieved data, we can train the XGBoost model.

# Train an XGBoost model on the coreset
coreset_model = xgb.XGBClassifier(random_state=42).fit(
    X_train_coreset, y_train_coreset, sample_weight=w
)
y_pred_coreset = coreset_model.predict(X_test)
The training process can be timed with the %%timeit magic command.
%%timeit
# time the training process with the coreset
coreset_model = xgb.XGBClassifier(random_state=42).fit(
    X_train_coreset, y_train_coreset, sample_weight=w
)
- Evaluate the model
Finally, the model can be evaluated with the balanced accuracy score metric.

# Evaluate the model
coreset_score = balanced_accuracy_score(y_test, y_pred_coreset)  # target: 0.8296036929211656
print(f"Balanced score: {coreset_score}")
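As mentioned above, the level parameter is worth experimenting with. A rough sketch of such a sweep, reusing only the calls already shown (the range of levels is an assumption and depends on how deep the built tree actually is):

# Compare coresets taken from different levels of the tree
for level in range(3):  # assumed range; deeper trees allow higher levels
    coreset_lvl = service_obj.get_coreset(level=level)
    _, X_lvl, y_lvl = coreset_lvl['data']
    w_lvl = coreset_lvl['w']
    model_lvl = xgb.XGBClassifier(random_state=42).fit(X_lvl, y_lvl, sample_weight=w_lvl)
    score_lvl = balanced_accuracy_score(y_test, model_lvl.predict(X_test))
    print(f'level={level}: {X_lvl.shape[0]} samples, balanced accuracy={score_lvl:.4f}')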
Training a Model with a Random Sample
To show that the samples selected for the coreset are more informative than a random sample, we can also train a model with a random sample of the same size as the coreset.
- Get a random sample with the same size as the coreset
import random

random.seed(42)

# size of the coreset
sample_length = 25741

# create a list of indices
indices = list(range(X_train.shape[0]))

# get a random sample of indices
random_indices = random.sample(indices, sample_length)

# retrieve elements from both arrays using the random indices
X_train_sample = np.array([X_train[i] for i in random_indices])
y_train_sample = np.array([y_train[i] for i in random_indices])
- Train the model with the sample
Using the sample from the training data, we can train the XGBoost model.

# train the model with the sample
sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)
The training process can be timed with the %%timeit operator.
%%timeit
# time the training with the random sample
sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)
- Evaluate the model
Finally, the model can be evaluated with the balanced accuracy score metric.

# evaluate the model
sample_balanced = balanced_accuracy_score(y_test, sample_model.predict(X_test))
print(f"Balanced score: {sample_balanced}")
Comparing All Approaches
We can summarize the results of the case study by comparing the models built with and without the coreset using the metrics mentioned above:
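A summary of this kind can be assembled from the quantities computed above; the pandas dependency is our own choice, and the training times would need to be filled in by hand from the %%timeit output:

import pandas as pd

summary = pd.DataFrame({
    'Model': ['Full dataset', 'Coreset', 'Random sample'],
    'Data points': [X_train.shape[0], n_samples_coreset, sample_length],
    'Balanced accuracy': [full_balanced, coreset_score, sample_balanced],
})
print(summary)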
The table shows that the coreset comprises a small fraction (≈5.5%) of the samples in the original training data. Despite this, the model trained on it achieves an even higher balanced accuracy score! Furthermore, training with the coreset took only a fraction of the time needed to train the model with the full dataset. The model trained with the coreset also outperforms the model trained with the random sample.
To access the code for this case study, visit the GitHub repository:
https://github.com/anair123/Training-Classification-Models-with-Coresets
Limitations
So far, we’ve delved into an exciting subject in computational geometry that can potentially have far-reaching applications in machine learning. However, much like any tool or technique, using coresets comes with its own disadvantages.
- Incompatibility with certain datasets
Firstly, coresets are not compatible with every dataset. They are primarily unsuited for datasets that are too small or too homogeneous. Thus, teams considering coresets should first perform thorough exploratory data analysis on their data.
- Difficulty in configuration
The Dataheroes library contains a number of parameters that can be tuned to derive the optimal coreset or coreset tree. However, the ideal set of parameters will vary from case to case, meaning that creating coresets will inevitably require experimentation, which can be time-consuming. That being said, such shortcomings are likely to be addressed as new versions of the library are released to the public.
Conclusion
The machine learning space is ever-changing, with new tools and technologies emerging to replace the old. While it is too soon to hail coresets as the next go-to tool for combating big data, it is worth keeping tabs on a methodology that has already shown much promise.
While this article has not extensively covered every facet of coresets, it has hopefully sparked your interest in this subject. For more information on the various features of coresets or the math behind building coresets, you can visit the Dataheroes website, which covers this subject in a concise and digestible manner.
Thank you for reading!