Ensemble Learning

Turns out humans aren’t the only ones who tend to learn more effectively through collaboration.

Some machines do, as well. Combining multiple learning models allows an algorithm to provide more accurate predictions by reducing noise, variance, and bias. This technique, known as ensemble learning, essentially enables the creation of a predictive model that’s greater than the sum of its parts.

Ensemble learning has gained considerable popularity in the machine learning space owing to both its ease of implementation and its capacity to address a variety of predictive modeling problems.

The Three Core Ensemble Learning Methods

There are countless ways to apply ensemble learning in machine learning, and each technique comes with its own family of algorithms. The good news is that the majority of these strategies are variations on one of the following core methods.

Bagging (Bootstrap AGGregatING)

Bagging is an ensemble learning technique that’s primarily used to reduce variance errors in decision tree models. This is achieved by drawing randomized subsets of the training data (with replacement) and training each model in the ensemble on a different subset. The ensemble members are nearly always unpruned decision trees.

Once all ensemble members have made their predictions, those values are combined by averaging or voting. This yields predictions that are far more robust than those of any single decision tree.

To train a machine learning model with bagging:

  • Start by creating random representative samples from your training data, a process known as bootstrap sampling. Samples are drawn with replacement, so a data point that has been selected is returned to the data set and can be selected again.
  • Each subset will be used to train a predictive model, generally an unpruned decision tree.
  • Once each ensemble member has been fit, combine their predictions by averaging (for regression) or voting (for classification).

Some popular bagging algorithms include:

  • Bagged Decision Trees
  • Extra Trees
  • Random Forest
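
As a rough sketch, here’s one way the workflow above can look with scikit-learn. The synthetic dataset and hyperparameter values are placeholders for illustration, not recommendations:

```python
# A minimal bagging sketch: many decision trees, each trained on a bootstrap
# sample of the data, with their predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Placeholder synthetic data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# BaggingClassifier's default base estimator is a decision tree, and
# bootstrap=True (the default) samples the training data with replacement.
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

# Predictions from the individual trees are combined by voting internally.
print("Bagging accuracy:", bagging.score(X_test, y_test))
```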

Boosting

Boosting is an ensemble learning technique that specifically seeks to address and correct prediction errors. It does this by fitting and adding new models to an ensemble sequentially, with each new model attempting to correct the errors of the one before it. The models used in boosting are typically shallow decision trees that, on their own, perform only slightly better than chance. In machine learning, these are referred to as weak learners.

As more models are added to the chain, their predictions are combined through either voting or averaging, with each model’s contribution weighted in proportion to its performance. The end goal is to develop a strong learner: an ensemble classification model with high accuracy, precision, and recall.

Some of the most popular boosting algorithms include:

  • Gradient Boosting Machines
  • AdaBoost
  • Stochastic Gradient Boosting
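
As a rough sketch, gradient boosting might be run with scikit-learn along these lines (again, the dataset and hyperparameters are illustrative placeholders):

```python
# A minimal boosting sketch: shallow trees (weak learners) added sequentially,
# each one fit to the mistakes of the ensemble built so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder synthetic data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

boosting = GradientBoostingClassifier(
    n_estimators=200,   # number of sequential weak learners
    learning_rate=0.1,  # how much each new tree contributes
    max_depth=3,        # keep each individual tree shallow (weak)
    random_state=42,
)
boosting.fit(X_train, y_train)
print("Boosting accuracy:", boosting.score(X_test, y_test))
```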

Stacked Generalization

Stacked Generalization, or stacking for short, takes a hierarchical approach to ensemble learning in which one or more higher-level models train on the output of multiple lower-level models. Each ensemble member typically uses a different learning algorithm, which helps data scientists identify the most (and least) effective classifiers. Unlike in bagging and boosting, the data set in stacking remains unchanged and is fed in full to each of the models at the bottom of the hierarchy, known as level-0 models.

The predictions output by these models are then fed into a level-1 model, known in this context as a meta-learner. The meta-learner is typically a simple linear model: linear regression for regression tasks and logistic regression for binary classification. Because it trains on the predictions output by its sub-models, it generally produces more accurate predictions than any one of them.

It’s important to note that although a two-level hierarchy is the most common approach to stacking, it’s far from the only one. You could feasibly use as many layers as you deem necessary. For instance, instead of a single level-1 model, you might have five level-1 models, three level-2 models, and one level-3 model.

One key characteristic of stacking is that the lower-level models are typically more complex than the higher-level models. Because a higher-level model works with its sub-models’ predictions rather than the raw features, it tends to make more accurate predictions than any individual lower-level model, and it can do so with a relatively simple, efficient algorithm.

To train a machine learning model with stacking:

  • Determine what kind of hierarchy you want to use and how many models should be at each level.
  • Choose your learning models. It’s generally best to have an ensemble of models that are constructed or trained in very different ways from one another.
  • Ensure all models within the ensemble access the same data set.
  • Feed the training data into your level-0 models.
  • Train each higher-level model on the predictions output by the level below it, then assess the outputs at each level of the hierarchy.

Some popular stacking algorithms include:

  • Canonical stacking
  • Blending
  • Super Ensemble
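
As a rough sketch, a two-level stack might be put together with scikit-learn as follows; the choice of level-0 models here is arbitrary and purely illustrative:

```python
# A minimal stacking sketch: diverse level-0 models plus a simple
# level-1 meta-learner trained on their predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder synthetic data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Level-0 models: built in very different ways, all seeing the same data.
level0 = [
    ("tree", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
]

# Level-1 meta-learner: logistic regression trained on the level-0 outputs.
stacking = StackingClassifier(estimators=level0, final_estimator=LogisticRegression())
stacking.fit(X_train, y_train)
print("Stacking accuracy:", stacking.score(X_test, y_test))
```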

Ensemble Learning Calculations

We mentioned above that most ensemble learning algorithms combine the outputs of their members to improve accuracy. Let’s take a brief moment to talk about how they do so. The most common ways to calculate or assess an ensemble’s predictions include:

  • Mode: Also known as voting, this approach treats the value that appears most often in an ensemble’s output as the correct prediction.
  • Mean: This approach calculates the average of all predictions made by an ensemble, treating that value as the correct prediction.
  • Weighted Average: Similar to Mean, except that each model in the ensemble is assigned a different weight, so the outputs of more heavily weighted models contribute more to the final prediction.
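
In code, these three combination rules boil down to a few lines; the prediction values below are made up purely for illustration:

```python
# Combining ensemble predictions: mode (voting), mean, and weighted average.
import numpy as np

# Mode / voting: each classifier votes for a class label.
votes = np.array([1, 0, 1, 1, 0])
mode_prediction = np.bincount(votes).argmax()  # most frequent label -> 1

# Mean: average the numeric predictions of the ensemble members.
predictions = np.array([2.4, 2.9, 3.1, 2.7])
mean_prediction = predictions.mean()

# Weighted average: better-performing models are given larger weights.
weights = np.array([0.4, 0.3, 0.2, 0.1])
weighted_prediction = np.average(predictions, weights=weights)

print(mode_prediction, mean_prediction, weighted_prediction)
```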

Best Practices for Ensemble Learning

  • Ensure the models in your ensemble are diverse enough to avoid overfitting. They should not all rely on the same algorithms or features.
  • Use a technique such as cross-validation to assess the performance of your ensemble and help you identify potential problems with the learning process.
  • Understand that ensemble learning is not a silver bullet and may not be suitable for every use case. It not only tends to be expensive and time-consuming but can also be difficult to interpret and explain.
  • If you intend to leverage ensemble learning, it’s advisable to start with something simple first, like bagging. You should also make sure you have a full understanding of whatever model or algorithm you decide to use.