Metric	Full Dataset	DataHeroes Coreset
Size	1,800,000	90,000
# of Iterations	75	75
Run Time	4,155 secs	189 secs
CO2 Emissions	35 grams	1.5 grams
Accuracy	0.860	0.859
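
As a quick check on the table above, the ratios between the two columns can be computed directly. This is only arithmetic on the reported figures; the variable names are illustrative and not part of any DataHeroes API.

```python
# Figures copied from the comparison table above.
full = {"size": 1_800_000, "run_time_s": 4155, "co2_g": 35.0, "accuracy": 0.860}
coreset = {"size": 90_000, "run_time_s": 189, "co2_g": 1.5, "accuracy": 0.859}

# How many times smaller, faster, and lower-emission the Coreset run is.
size_ratio = full["size"] / coreset["size"]
speedup = full["run_time_s"] / coreset["run_time_s"]
co2_ratio = full["co2_g"] / coreset["co2_g"]

# Absolute accuracy difference between the two runs.
accuracy_gap = full["accuracy"] - coreset["accuracy"]

print(f"size ratio:   {size_ratio:.1f}x")
print(f"speedup:      {speedup:.1f}x")
print(f"CO2 ratio:    {co2_ratio:.1f}x")
print(f"accuracy gap: {accuracy_gap:.3f}")
```

In other words, the Coreset is 20 times smaller than the full dataset, trains roughly 22 times faster, and gives up about 0.001 in accuracy.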

Model Maintenance

Data drift is a common issue once models move to production, yet keeping a model up to date by continuously re-training it as new data is collected is expensive and time-consuming.

Our Coreset Framework uses a unique Coreset Tree structure (also referred to as a Streaming Tree). The Coreset Tree structure allows you to add new data to the Coreset, update it on the go, and re-train the model on the updated Coreset in near real time (to learn more, visit Introduction to Coresets).
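
To make the idea of an updatable tree more concrete, here is a minimal sketch of a generic merge-and-reduce streaming tree. It is not the DataHeroes implementation: the class name `StreamingCoresetTree`, its parameters, and the use of weighted uniform sampling as the reduction step are all assumptions made for illustration. A real Coreset construction would select and weight samples far more carefully.

```python
import random


class StreamingCoresetTree:
    """Toy merge-and-reduce tree (hypothetical, for illustration only).

    Uniform sampling stands in for a real coreset construction.
    """

    def __init__(self, leaf_size=100, coreset_size=50, seed=0):
        self.leaf_size = leaf_size        # raw points per leaf
        self.coreset_size = coreset_size  # max points kept per tree node
        self.levels = []                  # levels[i]: summary at height i, or None
        self.buffer = []                  # raw points not yet summarized
        self._rng = random.Random(seed)

    def _reduce(self, weighted):
        """Shrink a list of (point, weight) pairs, preserving total weight."""
        if len(weighted) <= self.coreset_size:
            return list(weighted)
        total = sum(w for _, w in weighted)
        sample = self._rng.sample(weighted, self.coreset_size)
        scale = total / sum(w for _, w in sample)
        return [(p, w * scale) for p, w in sample]

    def add(self, point):
        """Add one new point, merging summaries upward when a leaf fills."""
        self.buffer.append(point)
        if len(self.buffer) < self.leaf_size:
            return
        carry = self._reduce([(p, 1.0) for p in self.buffer])
        self.buffer = []
        level = 0
        # Like incrementing a binary counter: merge equal-height summaries.
        while level < len(self.levels) and self.levels[level] is not None:
            carry = self._reduce(self.levels[level] + carry)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(carry)
        else:
            self.levels[level] = carry

    def coreset(self):
        """Return the current weighted summary of everything seen so far."""
        out = [(p, 1.0) for p in self.buffer]
        for node in self.levels:
            if node is not None:
                out.extend(node)
        return out
```

In this sketch, calling `add` for each new sample and then re-training on `coreset()` touches only the stored summaries rather than the full history. The number of tree levels grows with the logarithm of the number of leaves, which is why the summary stays small as data accumulates.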

The graphs below show how data drift can be mitigated with more frequent updates. In the first graph, the red line shows the accuracy of a model in production that is updated less frequently (it is re-trained after every 200,000 new samples, on the old data plus the new data). The second graph shows the training time. As the red line in the second graph shows, training time on the full dataset grows linearly with the size of the data. As more data is collected, training the model becomes more expensive and time-consuming, so updates become less frequent and data drift becomes more significant.

The blue line in the first graph shows more frequent updates (the model is re-trained after every 10,000 new samples) using the Coreset Tree structure. Because the Coreset Tree structure uses just a fraction of the data, the training time is significantly lower, as the blue line in the second graph shows. Furthermore, as more data is collected, the training time on the Coreset remains almost constant: the Coreset does not grow linearly with the data but is updated incrementally and grows only logarithmically.
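
The difference between linear and logarithmic growth can be illustrated with a rough size model. This is a sketch under stated assumptions, not a measurement: it assumes a binary merge-and-reduce tree with one summary of at most `coreset_size` points per level, and the values of `leaf_size` and `coreset_size` are hypothetical.

```python
def full_retrain_size(total_samples):
    # Re-training on old plus new data touches every sample collected so far.
    return total_samples


def coreset_retrain_size(total_samples, leaf_size=10_000, coreset_size=1_000):
    # Upper bound for a hypothetical binary merge-and-reduce tree:
    # at most one summary per level, and floor(log2(leaves)) + 1 levels.
    leaves = max(1, total_samples // leaf_size)
    levels = leaves.bit_length()
    return min(total_samples, levels * coreset_size)


for n in (200_000, 400_000, 800_000, 1_600_000):
    print(n, full_retrain_size(n), coreset_retrain_size(n))
```

Under these assumptions, each doubling of the collected data doubles the full re-training set but adds only one more summary of `coreset_size` points to the Coreset side.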

Our Coreset framework allows you to easily and cost-effectively go from updating your model every few weeks or months to updating it daily, or at any frequency you choose, to avoid unnecessary data drift.