The Problem
It is well known that the quality of a Machine Learning (ML) model is highly dependent on the quality of the dataset it is trained on. Recent analyses suggest that training datasets often contain labeling or target errors, which can negatively impact the resulting model. Additionally, it is commonly accepted that data curation and cleaning can amount to as much as half the effort a data scientist invests in developing and optimizing a machine learning model, partly because there is no well-established, consistent method for finding labeling errors in datasets.
The Solution: Coresets
Coresets are a sampling methodology originating in computational geometry, used to approximate optimization problems. They are based on selecting a subset of the original dataset that maintains the entire dataset’s statistical properties and corner cases, so that training a model on the Coreset gives approximately the same result as training it on the full dataset.
When computing a Coreset, every data sample is quantitatively evaluated with respect to its importance. This importance value, associated with each sample, indicates how relevant the sample is within the entire dataset, from the ML model’s perspective. Samples that receive a high importance value in the Coreset computation require the attention of the data scientist, as they could indicate a labeling error, an out-of-distribution sample, or other data-related issues. Therefore, leveraging Coresets, one can easily assign an importance score to each sample. Instances with high importance scores have a high probability of being mislabeled, so one can select instances of interest by looking at the top percentile of importance.
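As a minimal illustration of this idea (not part of the DataHeroes API), selecting the top percentile of a per-sample importance array could look like the sketch below; the array name and the 99th-percentile threshold are only assumptions:

import numpy as np

# Stand-in for the per-sample importance values produced by a Coreset computation
importance = np.random.rand(10_000)

# Keep only the samples in the top 1% of importance values for manual review
threshold = np.percentile(importance, 99)
candidate_indices = np.where(importance >= threshold)[0]
print(f"{len(candidate_indices)} samples selected for review")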
The Goal of the Post
In this blog post, we’ll use the DataHeroes Coreset library to find and fix incorrect labels in the ImageNet-1K dataset. Following good practice, we first select a class of interest with potentially incorrect labels, which is subsequently cleaned. Using the DataHeroes library, we present a novel sampling method to select instances with a high probability of being mislabeled.
As a result, instead of randomly selecting samples of a class in the hope of finding mislabeled instances, the cleaning process becomes systematic, guided by the Coreset’s sampling method.
The objective of this guide is to demonstrate how the Coreset service can be used to accelerate dataset cleaning. In particular, the intent is to find and correct labeling errors within the ImageNet-1K dataset.
Dataset Cleaning Task Overview:
- Load dataset features
- Select a class to clean
- Clean the labels using the DataHeroes Coreset library
- Visualize final results
Using Python, the first step is to import DataHeroes’ Logistic Regression Coreset Service:
# Import the logistic regression coreset tree service, which we will use
# to find important samples within the dataset.
from dataheroes import CoresetTreeServiceLG
Important: A free account needs to be created first, which can be done here.
Load ImageNet-1K dataset features
Computing the ImageNet-1K dataset features
The ImageNet-1K features used in this guide are computed (using a separate, in-house Python script) from the original ImageNet-1K dataset by means of a ResNet18 classifier, which processes every sample in both the training and validation subsets. The final classification layer of the ResNet18 model is bypassed so that, instead of a class prediction, the model outputs an embedding: an array of 512 values.
Note: Because extracting the above-mentioned embeddings takes a long time (on the order of a few hours), the code that computes them was externalized to a separate Python script. The extraction of the ImageNet-1K embeddings can be accelerated on a GPU, given the additional required dependencies.
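As a rough illustration only, the externalized extraction script could look like the sketch below, assuming torchvision’s pretrained ResNet18 and an ImageNet dataloader that is not shown here (the actual in-house script may differ):

import numpy as np
import torch
import torchvision

# Load a pretrained ResNet18 and bypass its final classification layer so the
# model outputs 512-dimensional embeddings instead of class predictions.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

embeddings, labels = [], []
with torch.no_grad():
    for images, targets in dataloader:       # assumed ImageNet DataLoader
        features = model(images.to(device))  # shape: (batch_size, 512)
        embeddings.append(features.cpu().numpy())
        labels.append(targets.numpy())

np.save("x_train.npy", np.concatenate(embeddings))
np.save("y_train.npy", np.concatenate(labels))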
Using the numpy.load() function, we first load the precomputed embeddings of the ImageNet-1K dataset:
import numpy as np

# Load the train data
x_train = np.load(str(x_train_file))
y_train = np.load(str(y_train_file))

# Load the test data
x_test = np.load(str(x_test_file))
y_test = np.load(str(y_test_file))
We also defined a helper function for rendering sequences of images from the ImageNet-1K dataset, which we can use to visualize original dataset samples.
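The helper itself is not shown in this post; a minimal sketch of what such a function might look like, assuming the images are available as PIL-readable file paths, is:

import matplotlib.pyplot as plt
from PIL import Image

def render_images(image_paths, n_cols=5, title=None):
    """Display a sequence of dataset images in a grid."""
    n_rows = (len(image_paths) + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows), squeeze=False)
    for i, ax in enumerate(axes.ravel()):
        if i < len(image_paths):
            ax.imshow(Image.open(image_paths[i]))
        # Hide axes for both rendered images and unused grid cells
        ax.axis("off")
    if title:
        fig.suptitle(title)
    plt.show()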
Below is an example of rendered images from class “pug, pug-dog”.
Pick a Class to Clean
We will select a single class of interest for which to find and clean incorrectly labeled data. To do that, we fit a Logistic Regression classifier on the training dataset and compute the confusion matrix on the test dataset. The class of interest will be the worst-performing class.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train)
Pick the Class Based on the Confusion Matrix
The following didactic example illustrates how the confusion matrix is used to determine the class that generates the most false positives.
Suppose the normalized confusion matrix (with values between 0 and 1) is calculated for five classes: A, B, C, D, and E. Consider that the highest value in the confusion matrix, ignoring the main diagonal, corresponds to the true class A and the predicted class C. This means class A was confused with class C the most within our test dataset, because a large number of samples from class A were predicted as belonging to class C. In this example, class C is the class of interest, since the high confusion between classes A and C indicates there might be issues with the labels of class C.
from sklearn.metrics import confusion_matrix

y_predicted = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predicted, normalize="true")

# To extract the highest confusion value, ignore the main diagonal
np.fill_diagonal(cm, 0)
true_classid, predicted_classid = np.where(cm == cm.max())
true_classid, predicted_classid = true_classid[0], predicted_classid[0]
We found that among the top 3 classes of interest (in descending order of confusion matrix value) is the class “missile”, with a total of 1300 samples in the training dataset. The class it is most often confused with is “projectile, missile”, with 1300 samples.
Note: To speed up training, we only kept samples from the above two classes, “missile” and “projectile, missile” (referred to in the code fragments as the “small” dataset). Ultimately, we will use the Coreset service to sample only “missile” instances, so this decision won’t affect the cleaning process, but it will speed up experimentation. After keeping only these two classes, the training data size was reduced from 1,281,167 samples to 2,600 samples.
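A minimal sketch of this filtering step, assuming the integer class ids true_classid and predicted_classid obtained from the confusion matrix above, could look like this:

# Keep only the samples belonging to the two classes of interest
keep_mask = np.isin(y_train, [true_classid, predicted_classid])
small_x_train = x_train[keep_mask]
small_y_train = y_train[keep_mask]
print(small_x_train.shape)  # roughly (2600, 512) for the two selected classes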
Clean the Labels Using the DataHeroes library
We will use the DataHeroes Coreset library as a novel sampling method to review samples with potentially incorrect labels.
The CoresetTreeServiceLG computes an importance score for every sample during its build phase. As previously stated, samples with high importance have a high potential to be mislabeled. One approach for selecting the samples to inspect for incorrect labels is therefore to guide the selection by the samples’ importance values, namely, inspect the samples with the top importance values, as calculated by the Coreset. Another approach, in the absence of any other information or hint regarding possible mislabeling, is to randomly select samples from the class of interest. When the data is small, random sampling can work reasonably well, but as the data gets larger (which is the case in real-world scenarios), randomly sampling instances is neither practical nor a guarantee of consistent cleaning results. The Coreset thus provides a clear methodology for finding incorrect labels.
The code fragments below present the construction of the Coreset object and the retrieval of the top N samples with the highest importance values, using a set of features X and labels y for a given class of interest.
# Build the coreset tree service object
self.coreset_service = CoresetTreeServiceLG(optimized_for='cleaning')
self.coreset_service.build(self.small_x_train, self.small_y_train)
The code below determines the top num_images_to_view samples from the class of interest, having the highest importance values.
# Get the most important samples
result = self.coreset_service.get_important_samples(
    class_size={self.predicted_classid: self.num_images_to_view}
)
important_sample_indices = result["idx"]
Once a sample is identified as being incorrectly labeled as “missile”, the user has two options: assign the sample to the other class (the true class, in the didactic example above) if it clearly belongs to that class, or delete the sample from the training set if it doesn’t. (A mislabeled image could in principle belong to yet another class, but since we are using a reduced dataset of only two classes for this guide, assigning the sample to a class other than these two is effectively equivalent to deleting it from the dataset.)

The deletion is handled through the sample_weights that sklearn’s Logistic Regression classifier operates with. More precisely, all samples initially have their sample_weight set to 1, and samples to be eliminated have their corresponding sample weight set to 0.

Now, we will define one last component, which updates the Coreset service with the cleaned data. Using this approach, we build the Coreset only once, at the beginning, and after every cleaning iteration we simply update it with the newly cleaned data. This method reflects the two possible cleaning actions: change the label or remove the sample. Finally, the effect of the cleaning iteration on the dataset is evaluated with the AUROC score obtained by fitting sklearn’s cross-validation-enabled LogisticRegressionCV classifier.
# Remove the samples marked for deletion in this iteration from the Coreset
self.coreset_service.remove_samples(self.deleted_idxes[self.iteration])

# Zero the sample weights of the deleted samples so they are excluded from training
self.sample_weights[small_idxes_to_delete] = 0

# Re-evaluate the AUROC score for the cleaned dataset
new_score = self.eval_lg_cv(sample_weights=self.sample_weights)
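The eval_lg_cv helper is an in-house function that is not shown in this post. A minimal sketch of the idea, assuming the “small” feature and label arrays and using a plain K-fold loop with LogisticRegression rather than the LogisticRegressionCV variant actually used, could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def eval_lg_cv(x, y, sample_weights, n_splits=5):
    """Cross-validated AUROC; samples whose weight is 0 are effectively excluded from training."""
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(x, y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(x[train_idx], y[train_idx], sample_weight=sample_weights[train_idx])
        # Probability of the second class; binarize the validation labels accordingly
        probs = clf.predict_proba(x[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx] == clf.classes_[1], probs))
    return float(np.mean(scores))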
The Cleaning Process
The cleaning process is split into multiple iterations. For each iteration, the following operations are performed:
- Determine the samples having the highest importance (in our use case select 50 samples with top importance)
- Manually inspect and correct any label issues for the selected samples (using an in-house dedicated tool)
- Correct the labels of the selected samples
- Evaluate the AUROC score for the current iteration’s corrected dataset, using K-fold cross-validation
- Repeat steps 1-4 until either the AUROC score improvement or the number of corrected samples per iteration slows down (a sketch of this loop is shown after the list)
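A minimal sketch of this loop, assuming the coreset_service, sample_weights, and eval_lg_cv pieces introduced above, plus placeholder names (class_of_interest, num_iterations) and a hypothetical manually_review() function standing in for the manual inspection step, could look like this:

num_images_to_view = 50
auroc_scores = []

for iteration in range(num_iterations):
    # Step 1: get the most important samples of the class of interest
    result = coreset_service.get_important_samples(
        class_size={class_of_interest: num_images_to_view}
    )
    candidate_indices = result["idx"]

    # Steps 2-3: manual review (placeholder) followed by removal of bad samples
    indices_to_delete = manually_review(candidate_indices)
    coreset_service.remove_samples(indices_to_delete)
    sample_weights[indices_to_delete] = 0

    # Step 4: re-evaluate the cross-validated AUROC on the cleaned dataset
    auroc_scores.append(eval_lg_cv(small_x_train, small_y_train, sample_weights))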
Important observation: Before starting the cleaning process, it is recommended to examine multiple samples in order to define a cleaning strategy. For the “missile” class, an initial inspection of the dataset indicated that samples with incorrect labels should be removed rather than assigned to the other class, precisely because of the inherent overlap between the class of interest (“missile”) and the other class (“projectile, missile”). Such class overlap, which is the case for the two classes used in this guide, has long been regarded as one of the toughest pervasive problems in classification.
Example of performing one cleaning iteration
Step 1: Obtain the indices of the 50 samples with the highest importance. The code fragment presented above is used to call the Coreset’s get_important_samples with the class_size argument, indicating that only samples of the class of interest are to be retrieved and that only num_images_to_view samples (a constant set to 50) should be returned.
Step 2: The selected samples are manually reviewed and the appropriate label-correction action is taken. At the end of this step, a list of samples to be deleted from the training set is returned.
Steps 3 and 4: Update the Coreset object by removing the samples returned in the previous step and zero their associated sample weights (which effectively eliminates them from the K-fold cross-validated Logistic Regression).
We repeated the cleaning process for 6 iterations, looking at 300 images out of a total of 1300 samples. The cleaning process was stopped after 6 iterations because the number of samples selected for deletion decreased considerably and the AUROC score reached a plateau.
Visualize the Final Results
AUC Score Across Iterations
The plot below indicates the evolution of the AUROC score after each cleaning iteration. In this case, by inspecting 300 out of 1300 samples, we improved the AUC score by 0.48%.
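For reference, a minimal sketch for reproducing such a plot, assuming the auroc_scores list collected in the loop above, could be:

import matplotlib.pyplot as plt

iterations = range(1, len(auroc_scores) + 1)
plt.plot(iterations, auroc_scores, marker="o")
plt.xlabel("Cleaning iteration")
plt.ylabel("AUROC")
plt.title("AUROC score across cleaning iterations")
plt.grid(True)
plt.show()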
It is expected that this cleaning methodology would have a significantly better effect in a real-world system, where a considerably higher percentage of the data is mislabeled.
Fixed Samples Across Iterations
Using this strategy, we cleaned 8.08% of the class’s samples (105 out of the 1300 samples of the class).
Examples of mislabeled samples identified during the cleaning iterations
The samples below are among those manually found to be either incorrectly labeled or not a pertinent example of the “missile” class. Samples 1, 2, 3 and 5 are clear label errors. No additional explanation is required for the mislabeling of images 1 and 3 (although, at least for image 1, some form of “resemblance” can be assumed), while image 2 depicts the 5P85CM self-propelled launcher and picture 5 depicts the M61 Vulcan Gatling-style rotary cannon. Image 4, on the other hand, although possibly depicting part of a missile, is best left out of the training set due to its unusual coloring and the fact that it shows only a partial portion of a missile. Similarly, image 6, although possibly depicting a missile, is framed such that its classification depends less on the actual structure of a missile and more on the environment.
Similarly, some of the samples selected for deletion during the second iteration are illustrated below. The three images either depict objects of class “missile” in an incomplete manner, like image 1, or contain different objects altogether (a statue and an airplane engine, respectively).
Among the samples cleaned in iteration number 3 are the ones below. Image 1, although containing objects of class “missile”, could negatively affect the classifier training task due to the large number of such objects, some of which overlap. Image 2, although semantically linked to the class of interest, depicts a plastic construction toy, whereas the third image depicts a submarine.
Some additional samples cleaned in the 3 subsequent iterations are aggregated in the figure below, including partially occluded objects of class “missile”, multiple such objects, or foreground objects of other classes.
Conclusion
In conclusion, using the DataHeroes Coreset library, we cleaned the labels of the “missile” class from the ImageNet-1K dataset. The impact of the cleaning process is visible in the increasing trend of the AUC metric from iteration to iteration. We stopped after six iterations because the AUROC metric plateaued.
Using the DataHeroes Coreset library, we sped up the cleaning process by repeatedly reviewing the top 50 samples based on Coreset importance instead of randomly picking samples in the hope of finding samples of interest. Using the Coreset importance is thus a more effective way of finding bugs and errors in your dataset.