Overfitting

Overfitting represents one of the most common ways a machine-learning model can fail. Consequently, it’s also one of the most common challenges machine-learning engineers must overcome. But what causes it, exactly?

And more importantly, what can be done to prevent it?

What is Overfitting?

Overfitting in machine learning occurs when a model becomes too attuned to a particular data set. This impedes its ability to adapt – or, to use the more common parlance, generalize – to new data. Although it displays a high level of accuracy when applied to training data, an overfitted model tends to underperform everywhere else.

To better understand overfitting, it may be helpful to use an analogy. Imagine that you’re trying to teach a child about dogs, but instead of explaining the traits shared between all canines, you teach them exclusively about Scottish Terriers. When they later see a Great Dane, they might become confused, assuming it’s simply an overly large Scottish Terrier with considerably less fur.

What Causes Overfitting?

Overfitting can happen for many different reasons. That said, every cause can ultimately be boiled down to a problem with the machine-learning model, a problem with the data, or both. Common causes include:

  • Attempting to apply a complex machine-learning model to a relatively simple task.
  • The training data does not contain enough samples.
  • The training data does not accurately represent all possible edge cases and corner cases.
  • The training data contains too much irrelevant data, referred to in this context as noise.
  • The model trains too long on a single set of data.

Why is it Important to Detect Model Overfitting?

An overfitted machine-learning model cannot adequately perform the task for which it was originally deployed. Unfortunately, without some mechanism in place to detect and address overfitting, engineers are left unaware of this fact. They see that their model displays a high degree of accuracy on training data and assume that this accuracy also applies to novel data.

They then rely on the predictions or calculations of that machine-learning model to make decisions.

For instance, let’s say a brokerage deploys a machine-learning algorithm designed to produce economic forecasts for investing purposes. They believe that the model is 99% accurate when, in actuality, it’s closer to 50%. The model produces a forecast indicating that it’s an ideal time to purchase stock in a company, and the brokerage spends considerable capital to do so.

One month later, the value of that stock plummets, potentially costing the brokerage and its clients millions.

Generally speaking, machine-learning engineers should always be cognizant of overfitting. However, it’s especially important to check for overfitting when:

  • Your machine-learning model is highly complex.
  • You’re working with a relatively small volume of training data.
  • The model will be making decisions or predictions in extremely high-stakes scenarios, such as medical diagnosis.

How to Avoid Overfitting

Detection and prevention of overfitting are processes that your team needs to plan for at the outset of your model’s training. While it’s still possible to address overfitting once a model is fully trained, it’s far more difficult to do so. An ounce of prevention is worth a pound of cure, as the saying goes.

Detecting and Preventing Overfitting

You have a few options when it comes to dealing with overfitting in your model.

Cross-Validation

Also known as k-fold cross-validation, this is among the most common methods for detecting overfitting. It involves splitting the training data into multiple equally sized segments called folds. The model is then trained and evaluated across the folds through the following process:

  1. Set one fold aside. This will serve as your validation data.
  2. Train the model on the remaining folds.
  3. Apply the validation fold to the model and observe its performance.
  4. Score the model based on the quality of its output.

This process repeats until each fold has served once as the validation set, at which point the scores are averaged to provide an assessment of the model's predictive capabilities.
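As a concrete illustration, here is a minimal sketch of k-fold cross-validation using scikit-learn; the bundled breast-cancer data set and the logistic-regression model are arbitrary stand-ins for your own data and model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data set and model -- substitute your own.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# cv=5 splits the data into five folds; each fold serves once as the
# validation set while the model trains on the remaining four.
scores = cross_val_score(model, X, y, cv=5)

print("Per-fold accuracy:", scores)
print("Averaged score:   ", scores.mean())
```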

Validation Set

A simpler form of cross-validation, this method splits a model’s training data into two parts. One part will be used for training, the other for validation. If the model performs well on its training data but poorly on its validation data, this indicates the presence of overfitting.

The most common ratio for this method is to use 80% of the data for training while retaining 20% for validation.
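For instance, here is a minimal sketch of an 80/20 split using scikit-learn's train_test_split; the data set and the decision-tree model are arbitrary stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# Hold out 20% of the data for validation; train on the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier().fit(X_train, y_train)

# A large gap between these two scores suggests overfitting.
print("Training accuracy:  ", model.score(X_train, y_train))
print("Validation accuracy:", model.score(X_val, y_val))
```

An unconstrained decision tree will typically score near-perfectly on its training data, so any substantial drop on the validation set is the overfitting signal you are looking for.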

You may also apply learning curves to this method, plotting the model's performance on its training and validation data as training progresses or as the training set grows; divergence between the two curves is a likely indicator of overfitting.
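A sketch of that idea, assuming scikit-learn's learning_curve helper and matplotlib for plotting:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# Score the model on training and validation folds at a range of
# increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(), X, y, cv=5
)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("Training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()  # a persistent gap between the curves suggests overfitting
```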

Diversified Data

Typically, the best way to prevent overfitting is through the diversification and scaling of your data set. If collecting more data or diversifying your existing data is not an option, you could also consider data augmentation or data synthesis. Essentially, these two closely related processes allow you to artificially generate new data points – the former does so based on your existing training data, while the latter produces new data samples algorithmically.

For example, if your machine-learning model is being trained on a data set consisting entirely of images, data augmentation would involve making minor transformations to produce new image variants, such as flipping, scaling, shifting, re-shading, or rotating.
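A minimal sketch of such a pipeline, assuming the torchvision library; the specific transforms and their parameters are illustrative choices:

```python
from torchvision import transforms

# Each transform below produces a plausible new variant of an
# existing training image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # flipping
    transforms.RandomAffine(
        degrees=15,                          # rotating
        translate=(0.1, 0.1),                # shifting
        scale=(0.9, 1.1),                    # scaling
    ),
    transforms.ColorJitter(brightness=0.2),  # re-shading
])

# augmented = augment(image)  # apply to a PIL image during training
```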

Simplifying Your Model

If, in spite of all your other efforts, your model continues to return inaccurate results, the problem might not be with your training data. The issue could lie with the model itself. You might consider restarting your training process with a simpler machine-learning model, adding complexity only if absolutely necessary.

Alternatively, you could selectively prune redundant or unnecessary features and parameters from your model to reduce the chance that it will overfit.
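One way to prune features is to let a regularized model decide which ones matter; here is a minimal sketch using scikit-learn's SelectFromModel, with the bundled data set as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# L1 regularization drives the coefficients of uninformative features
# to zero; SelectFromModel keeps only the features that survive.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear")
).fit(X, y)

X_reduced = selector.transform(X)
print(X.shape[1], "features pruned down to", X_reduced.shape[1])
```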

Data Processing

Data redundancy represents another common cause of overfitting. While augmentation and synthesis can help a great deal in increasing variance, it may be worthwhile to prune redundancy as well. For example, some widely used data sets have been found to contain 10% or more duplicate samples.

Remove these redundancies as a starting point.
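With tabular data, removing exact duplicates can be a one-liner; a minimal sketch using pandas on a hypothetical data frame:

```python
import pandas as pd

# Hypothetical training data containing one exact duplicate row.
df = pd.DataFrame({
    "price": [1.0, 2.0, 2.0, 3.0],
    "label": [0, 1, 1, 0],
})

deduped = df.drop_duplicates()
print(f"Removed {len(df) - len(deduped)} redundant row(s)")
```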

Ensembling

Ensembling is a technique that combines the predictions of multiple models in an effort to improve their collective predictive power. The idea is that weak learners, models whose individual predictions are only modestly better than chance, can augment one another for increased accuracy. Two ensembling techniques are commonly used to prevent overfitting, both sketched below:

  • Bagging, which trains multiple models in parallel on random resamples of the data and averages their predictions.
  • Boosting, which trains them one after another, with each model correcting the errors of its predecessors, eventually producing a final, consolidated result.
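A minimal sketch of both approaches using scikit-learn's built-in ensembles; the data set and estimator counts are arbitrary stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# Bagging: many models trained in parallel on random resamples,
# with their predictions averaged.
bagging = BaggingClassifier(n_estimators=50)

# Boosting: models trained one after another, each focusing on the
# errors of its predecessors.
boosting = GradientBoostingClassifier(n_estimators=50)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```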

Regularization

Regularization is a technique for addressing overfitting that penalizes complexity during training, reducing the influence of features that contribute little to the model's output. Going back to the earlier example of the brokerage's economic model, regularization might preserve the influence of gold prices while shrinking the influence of cryptocurrency sales.

The former has commonly been used as a solid predictor of global economic health, while the market for the latter is widely regarded as incredibly volatile.
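A minimal sketch of the effect, using scikit-learn's Lasso (L1 regularization) on a bundled regression data set that stands in for real economic features:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression

X, y = load_diabetes(return_X_y=True)  # stand-in data set

plain = LinearRegression().fit(X, y)
regularized = Lasso(alpha=1.0).fit(X, y)

# The L1 penalty shrinks the coefficients of less informative
# features toward zero, dropping some from the model entirely.
print("Non-zero coefficients without regularization:",
      int(np.sum(plain.coef_ != 0)))
print("Non-zero coefficients with L1 regularization:",
      int(np.sum(regularized.coef_ != 0)))
```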

Early Stopping

This method is exactly what it sounds like – instead of fully training the model on a data set, you stop the training early in an effort to prevent it from learning any noise that may be present. This must be done with extreme caution. Stopping too soon to prevent overfitting can result in the opposite problem, known as underfitting.
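Many libraries support this directly; here is a minimal sketch using scikit-learn's gradient boosting, which can hold out part of the training data and stop when the validation score stalls (the parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# Reserve 20% of the training data internally and stop adding boosting
# stages once the validation score has not improved for 10 rounds.
model = GradientBoostingClassifier(
    n_estimators=1000,
    validation_fraction=0.2,
    n_iter_no_change=10,
).fit(X, y)

print("Stopped after", model.n_estimators_, "of 1000 possible stages")
```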

What’s the Difference Between Overfitting and Underfitting?

Overfitting and underfitting are similar in that both occur due to problems in the training stage of a machine-learning model. However, while overfitting represents over-specialization, underfitting is the opposite. It typically occurs for one of two reasons:

  • Attempting to apply a simple machine-learning model to a complex task.
  • Not training a machine-learning model for a sufficient amount of time.

An underfitted model performs poorly on both training data and novel data. Interestingly, underfitting frequently occurs when engineers overcorrect for overfitting, for example by stopping training too early. The good news is that underfitting is a great deal easier to both detect and correct than overfitting.

When you’ve determined that your model is underfitted, you have a few options:

  • Continue training the machine-learning model.
  • Decrease regularization to give the model more flexibility, as sketched below.
  • Add more informative features to increase the complexity and capabilities of the model.
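As an illustration of the second option, here is a minimal sketch comparing a heavily regularized ridge regression with a lightly regularized one; the data set and alpha values are arbitrary stand-ins:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)  # stand-in data set

# An overly strong penalty (large alpha) can underfit; relaxing it
# gives the model more flexibility to fit the data.
for alpha in [1000.0, 1.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha}: mean R^2 = {score:.3f}")
```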