You may have wondered why there are so many different algorithms for machine learning. After all, wouldn't a single algorithm suffice? A famous result, the no free lunch theorem, tells us that no one algorithm will be optimal for every problem.

Intuitively, the reason for this stems from how machine learning algorithms learn. Ultimately, machine learning boils down to minimizing some loss function in a given environment. Once we have decided what an algorithm should learn, we have implicitly also constrained the type of patterns it can learn.

This variety in learning means that for a given problem, different algorithms will capture different aspects of the signal. Sometimes they capture more or less the same dynamics, but more often than not their predictive powers complement each other: where one learning algorithm fails to find a good signal, another succeeds. Ensembles let us harness this diversity.

Ensembles learn the optimal combination of base model predictions.

Suppose we have two models. We could construct an ensemble by averaging their predictions. Better yet, we could fit a linear regression to learn the best (linear) combination of the two predictions. We call the original estimators, fitted on the input data, base learners, and the algorithm used to find the best combination of their predictions the meta learner. Of course, nothing prevents us from finding non-linear combinations, and such ensembles frequently have much more predictive power.
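To make this concrete, here is a minimal scikit-learn sketch on a toy regression problem (the data and model choices are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Toy regression problem, standing in for your own data.
X, y = make_regression(n_samples=600, noise=10.0, random_state=0)
X_base, X_held, y_base, y_held = train_test_split(X, y, random_state=0)

# Two base learners, fitted on the input data.
l1 = Lasso(alpha=0.1).fit(X_base, y_base)
l2 = KNeighborsRegressor().fit(X_base, y_base)

# Their predictions on data they did not see.
p1, p2 = l1.predict(X_held), l2.predict(X_held)

# Simplest combination: take an average of the two predictions.
p_avg = (p1 + p2) / 2

# Better: a linear regression learns the best linear combination.
meta = LinearRegression().fit(np.column_stack([p1, p2]), y_held)
```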

To fix ideas, suppose we have a prediction problem and have trained a Lasso, a Support Vector Machine, a K-Nearest Neighbors estimator and a Random Forest. They all do decently, but we suspect we can get more out of them. After some diagnostics, we find that each model predicts some part of the data poorly, parts that other models seem to capture quite well. This is a good indicator that an ensemble can provide a significant improvement.

Figure 1: an illustration of the ensemble architecture

Once we have settled on a meta learner, say another SVM, we could fit our base learners on half the training data and have them predict the remainder. We then fit the meta learner on their predictions. At test time, we first generate predictions from the base learners and use these as input to the meta learner, as in figure 1. This is the simplest type of ensemble, often called a blend. As we will see later, how to fit an ensemble is not a trivial question, and solving it gives rise to a whole host of different types of ensembles. But before we get bogged down, let’s build some intuition for why ensembles work with a very simple example.

A simple example

To illustrate, at its most basic, how an ensemble improves predictive performance, suppose we have a classification problem and have fitted two models, $l_1$ and $l_2$. Both are already fitted, and we are concerned only with a test set, whose ground truth is all positives:

$$\boldsymbol{y} = (1, 1, 1, 1).$$

For this test set, our models predict 0s and 1s in alternating sequences, as in the table below.

|                   |   |   |   |   | Acc    |
| ----------------- | - | - | - | - | ------ |
| Ground truth      | 1 | 1 | 1 | 1 |        |
| $l_1$ predictions | 1 | 0 | 1 | 0 | $50\%$ |
| $l_2$ predictions | 0 | 1 | 0 | 1 | $50\%$ |

Table 1: example of model predictions

Both models achieve $50\%$ accuracy. This may seem like an odd prediction pattern, but recall that our test set contains merely four observations, and it is perfectly possible to get a test set with only one label. The training set, of course, could contain many more labels that we don’t care about here. Now, suppose we fit a decision tree $t$ to the first two observations. It is easy to see that the best rule this tree can learn is to predict a positive outcome whenever either base learner predicts one:

$$t(p_1, p_2) = \begin{cases} 1 & \text{if } p_1 = 1 \text{ or } p_2 = 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $p_1, p_2$ denote the predictions of $l_1, l_2$.

This ensemble would now achieve a perfect score on the two remaining observations in the test set:

|              |   |   | Acc     |
| ------------ | - | - | ------- |
| Ground truth | 1 | 1 |         |
| $l_1$        | 1 | 0 | $50\%$  |
| $l_2$        | 0 | 1 | $50\%$  |
| ensemble     | 1 | 1 | $100\%$ |

Table 2: example of an ensemble
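The toy example is easy to verify in code; a few lines of numpy reproduce the accuracies in tables 1 and 2:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1])  # ground truth: all positives
p1 = np.array([1, 0, 1, 0])      # l1 predictions
p2 = np.array([0, 1, 0, 1])      # l2 predictions

# The rule the tree arrives at: predict positive if either learner does.
ensemble = np.maximum(p1, p2)

print((p1 == y_true).mean())        # 0.5
print((p2 == y_true).mean())        # 0.5
print((ensemble == y_true).mean())  # 1.0
```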

What’s going on here is that the decision tree is able to correct for the errors made by one model by considering what another model is doing instead. In this way, an ensemble can be thought of as an error correction mechanism.[1]

Ensembles learn to correct for base learner prediction errors. Hence, they can be thought of as error correction models.

A crucial insight is that for ensembles to be useful, the base learners’ errors must be relatively uncorrelated. If $l_1$ and $l_2$ make the same errors, there is no information the meta learner can use to learn how to correct for them. But claiming all this from such a simple example is a bit of an overreach, so next we’ll develop these concepts in a slightly more rigorous fashion.

What does an ensemble learn?

Let $X = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(n)}\}$ define a set of $n$ observations of some input data with associated outputs $\boldsymbol{y} = \{y^{(1)}, y^{(2)}, \dots, y^{(n)}\}$. For each observation of $m$ features, $\boldsymbol{x}^{(i)} = (x_1^{(i)}, x_2^{(i)}, \dots, x_m^{(i)})$, we want to predict the (expected) output $y^{(i)}$.

Suppose we have fitted a set of models $l_1, l_2, \dots, l_k$ to the data. To ensemble these models, we put them in a library of base learners $\mathcal{L} = \{l_1, l_2, \dots, l_k\}$. Given an observation $\boldsymbol{x}^{(i)}$, this library can be used to generate a set of predictions: each base learner $l_j \in \mathcal{L}$ outputs a prediction $p^{(i)}_{j} = l_j(\boldsymbol{x}^{(i)})$, and we can stack these predictions into a vector,

$$\boldsymbol{p}^{(i)} = \left( p^{(i)}_1, p^{(i)}_2, \dots, p^{(i)}_k \right).$$

Note that $\boldsymbol{p}^{(i)}$ has a similar form to $\boldsymbol{x}^{(i)}$: both represent a set of features associated with an output $y^{(i)}$. The difference is that the features in $\boldsymbol{p}^{(i)}$ are the predictions of the base learners. It is now straightforward to train a meta learner $g$ on these prediction vectors instead of the original data.
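In code, building these prediction vectors for a whole dataset amounts to stacking prediction columns. A small sketch, assuming `library` is a list of already-fitted scikit-learn style estimators:

```python
import numpy as np

def predict_matrix(library, X):
    """Stack base learner predictions into a matrix.

    Row i is the prediction vector p^(i); column j holds the
    predictions of base learner l_j over all observations.
    """
    return np.column_stack([l.predict(X) for l in library])

# The matrix then plays the role of the input data for the meta learner:
# g.fit(predict_matrix(library, X), y)
```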

To build some intuition as to what $g$ is learning, suppose that $g$ is a linear regression model, $g(\boldsymbol{p}) = \boldsymbol{w}^\intercal \boldsymbol{p}$. To fit the meta learner, we need to find the coefficients $\boldsymbol{w}$ that minimize the sum of squared errors over the training set $\mathcal{T}$,

$$\boldsymbol{w}^\ast = \operatorname*{arg\,min}_{\boldsymbol{w}} \sum_{i \in \mathcal{T}} \left( y^{(i)} - \boldsymbol{w}^\intercal \boldsymbol{p}^{(i)} \right)^2. \label{ols}\tag{1}$$

For our purposes, we can impose a mild restriction, $\sum_j w_j = 1$, in which case the ensemble acts as a model averaging mechanism. In practice this constraint is not recommended ($g$ can always learn it), but we’ll make use of it here to flesh out the learning dynamics. With it, the cost term for any given observation $i$ in $\eqref{ols}$ can be written as

$$\left( y^{(i)} - \sum_j w_j p^{(i)}_j \right)^2 = \left( \sum_j w_j \left( y^{(i)} - p^{(i)}_j \right) \right)^2. \label{err}\tag{2}$$

Now, $y^{(i)} - p^{(i)}_j$ is the prediction error made by the $j$-th base learner, so $g$ is learning to correct for the errors made by each learner by choosing a linearly optimal combination of predictions, as we saw earlier.

We can also see why the errors need to be relatively uncorrelated. Consider an extreme case where all errors are perfectly correlated and $p^{(i)}_j > y^{(i)}$ for all $j$ and all $i$. In this case, there is one base learner $l_s \in \mathcal{L}$ with minimum prediction error for every $i$, and from $\eqref{err}$ we can immediately see that $\boldsymbol{w}^\ast$ is given by $w_s = 1, w_{j \neq s} = 0$. But this ensemble cannot improve upon the best base learner.
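To see both cases numerically, here is a small simulation (the noise scales are made up) comparing a plain average of two learners under perfectly correlated and under independent errors:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)

def mse(p):
    return np.mean((y - p) ** 2)

# Perfectly correlated errors: both learners share the same noise term.
e = rng.normal(scale=0.5, size=y.size)
p1_corr, p2_corr = y + e, y + 1.5 * e

# Uncorrelated errors: independent noise terms of the same scale.
p1_ind = y + rng.normal(scale=0.5, size=y.size)
p2_ind = y + rng.normal(scale=0.5, size=y.size)

# Correlated: the average is worse than the better base learner.
print(mse(p1_corr), mse((p1_corr + p2_corr) / 2))  # ~0.25 vs ~0.39

# Uncorrelated: the average halves the error variance.
print(mse(p1_ind), mse((p1_ind + p2_ind) / 2))     # ~0.25 vs ~0.125
```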

For an ensemble to improve upon base learners, prediction errors must be relatively uncorrelated.

More generally, suppose $g$ is a non-linear model. In this case, $g$ learns to correct for errors locally: in each neighborhood of the prediction space, it learns the optimal combination of predictions. This can be very powerful, since such an ensemble doesn’t require any one model to have predictive power throughout the feature space. Instead, one model is free to capture one set of signals while another captures a different set. The meta learner’s task is then to infer in which neighborhood it should trust which set of models.

Another important insight from $\eqref{ols}$ is that $g$ learns to correct for the specific type of prediction error encoded in the training set $\mathcal{T}$. In particular, suppose we use the same data first to fit the base learners and then to fit the meta learner. In this case, $g$ learns to correct for training errors, but at test time it must correct for test errors. Models typically do better on the training set than on the test set, and the ensemble will compound this gap by falsely believing the base learners are more accurate than they are. Even worse, it will learn to rely on the base learner that overfits the most, as opposed to the base learner that generalizes best. For this reason, it is vital that the meta learner is trained on predictions that reflect the generalization capacity of each base learner.

Training the meta learner

Using as much training data as possible to fit both the base learners and the meta learner, while avoiding re-using the same data for both, is not a trivial problem. In fact, ensembles are typically differentiated by how they solve it.

The simple approach: blending

In the simplest case, we just split the training data into two sets: one for training the base learners, and one for training the meta learner. Formally, let $X^{(\mathrm{train})}_{\mathcal{L}}$ denote the training set for the base learners and $X^{(\mathrm{train})}_{g}$ the training set for the meta learner. Once fitted, the base learners predict $X^{(\mathrm{train})}_{g}$, and these predictions are used to train the meta learner. Hence, we require these sets to be disjoint: $X^{(\mathrm{train})}_{\mathcal{L}} \cap X^{(\mathrm{train})}_{g} = \varnothing$. Typically, we would also refit the base learners on all training data and use those models at test time.
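A minimal sketch of this scheme (`fit_blend` and `predict_blend` are hypothetical helpers, not a library API):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def fit_blend(library, meta, X, y):
    """Fit base learners on one half of the data, the meta learner on the other."""
    X_L, X_g, y_L, y_g = train_test_split(X, y, test_size=0.5, random_state=0)

    # Base learners fitted on the first half predict the second half.
    P = np.column_stack([l.fit(X_L, y_L).predict(X_g) for l in library])
    meta.fit(P, y_g)

    # Refit the base learners on all training data for use at test time.
    for l in library:
        l.fit(X, y)
    return library, meta

def predict_blend(library, meta, X):
    P = np.column_stack([l.predict(X) for l in library])
    return meta.predict(P)
```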

It is vital that the meta learner is trained on predictions that reflect the generalization capacity of each base learner.

This method is often referred to as blending. While simple and relatively fast, it is also a bit unsatisfying: the predictions used to train the meta learner come from base learners that saw only a portion of the training data, and we only have predictions for a subset of the training samples. Unless we have a very large dataset, odds are these predictions are not consistent with how the base learners behave at test time.

The generalized approach: stacking

An immediate way to at least partially overcome this dilemma is to first generate predictions as above, then swap the datasets and generate a new set of predictions, this time for $X^{(\mathrm{train})}_{\mathcal{L}}$. We then fit the meta learner on both sets of predictions, so it is trained on predictions for every sample in the training set. The remaining drawback is that these predictions are generated by base learners fitted on only half of the training data, while at test time predictions are generated by base learners fitted on all of it. For small and medium datasets, this can make a large difference. Luckily, there are methods to overcome this to an arbitrary degree.

The key is to note that the above approach is essentially a 2-fold cross-validation strategy: we fit the models on one fold and predict the other, then repeat the process with the roles of the folds exchanged. We can immediately generalize this notion to K-fold cross-validation. In the limit, if we use a leave-one-out strategy, the predictions used to train the meta learner will exactly match the base learners’ behavior on the test set. This method is referred to as (generalized) stacking, and the ensemble itself is known as the Super Learner. The name stems from the fact that, with K-fold cross-validation as the training method, the Super Learner is guaranteed to do at least as well as the best base learner, and as long as errors are at least partially uncorrelated, it will outperform it. Moreover, in theory it asymptotically approximates the oracle estimator.[2]
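A minimal sketch of this procedure, using scikit-learn’s cross_val_predict to generate the out-of-fold predictions (`fit_stack` is a hypothetical helper, not a library API):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def fit_stack(library, meta, X, y, cv=10):
    """Generalized stacking via K-fold cross-validated predictions."""
    # Out-of-fold predictions: each sample is predicted by models
    # that never saw it during fitting.
    P = np.column_stack([cross_val_predict(l, X, y, cv=cv) for l in library])
    meta.fit(P, y)

    # Refit each base learner on the full training set for test time.
    for l in library:
        l.fit(X, y)
    return library, meta
```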

Stacking is especially valuable for small or medium-sized datasets with grouped or hierarchical features, since blending risks omitting entire groups of the hierarchy from the meta learner’s training set. For instance, in the example below, blending could completely omit the data relating to Paris from the training set of the meta learner.

| City   | Year | Population (M) |
| ------ | ---- | -------------- |
| Paris  | 2008 | 2.211          |
|        | 2010 | 2.244          |
| Madrid | 2008 | 3.256          |
|        | 2010 | 3.273          |

Table 3: example of grouped data

The main drawback is computational. If we use 10 folds, each base learner is fitted 11 times: once on the full data, and 10 times on 90% of the data. This quickly adds up to a significant amount of training time, and if hyperparameter tuning is necessary, stacking can be prohibitively costly.

Stacking also poses some serious implementation challenges. First, we must ensure there is no information leakage when constructing the predictions from the base learners. Since all training data is used for fitting at some point, it is easy to make a slip that causes base learners to predict data they were trained on. For instance, if the training data is a time series, we must ensure that the folds are constructed such that the base learners always predict future outcomes, as sketched below. The second challenge is speed: making the ensemble fast requires parallel processing, which brings a whole host of issues around safely running estimations in parallel and avoiding excessive memory consumption. By far the best option is to use a maintained package for your ensemble adventures.
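For the time-series case, scikit-learn’s TimeSeriesSplit produces folds whose training indices always precede the test indices. A sketch of generating time-ordered, leakage-free predictions for one base learner (the helper name is made up; note the first block receives no predictions):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit

def time_ordered_predictions(learner, X, y, n_splits=5):
    """Out-of-sample predictions that respect temporal order.

    Each fold is fitted only on past data and predicts the block
    that follows it, so we never predict the past from the future.
    """
    idx, preds = [], []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = clone(learner).fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X[test_idx]))
        idx.append(test_idx)
    return np.concatenate(idx), np.concatenate(preds)
```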

Subsembles

A final type of ensemble that can be very powerful is the Subsemble. Subsembles also use stacking, but with a twist: they partition the training data into $J$ partitions and fit the base learners using K-fold cross-validation within each partition. Each partition acts as its own mini-ensemble, and the meta learner is fitted on the predictions made in every partition. Hence, with $k$ base learners and $J$ partitions, the meta learner sees $k \times J$ features.
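As a rough illustration, here is a simplified sketch of building the $k \times J$ meta-learner input (`subsemble_features` is a hypothetical helper; for observations outside a given partition it substitutes a full-partition fit for the paper’s complete cross-validation scheme):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

def subsemble_features(library, X, y, J=3, cv=5, seed=0):
    """Build the n x (k * J) prediction matrix of a subsemble (simplified)."""
    parts = np.array_split(np.random.default_rng(seed).permutation(len(y)), J)
    k, n = len(library), len(y)
    P = np.empty((n, k * J))

    for j, rows in enumerate(parts):
        others = np.setdiff1d(np.arange(n), rows)
        for s, l in enumerate(library):
            col = j * k + s
            # Out-of-fold predictions inside partition j.
            P[rows, col] = cross_val_predict(clone(l), X[rows], y[rows], cv=cv)
            # A full-partition fit predicts everything outside partition j.
            P[others, col] = clone(l).fit(X[rows], y[rows]).predict(X[others])
    return P
```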

Subsembles have a few distinct benefits. First, by partitioning the dataset, they scale better than the Super Learner. Second, by creating subset-specific base learners, they can harness local information better, possibly leading to a richer prediction input set for the meta learner.

Their drawback is mainly noticeable on smaller datasets, where the data is not large enough to make the Super Learner slow. Since the Subsemble requires $J$ times as many fits as the Super Learner, it can be comparatively slow in these cases. Secondly, with noisy data, the learners in each partition can generalize poorly, leading to noisy predictions that obscure the signal for the meta learner. Whether Subsembles offer an improvement over Super Learners is ultimately an empirical question: you’ll simply have to try and see which does best.

Where to go from here

If you would like to deploy ensembles, your best bet is to use unit-tested and maintained software. Several packages exist that are optimized for speed and memory and carefully unit tested for information leakage. For R users, the SuperLearner and Subsemble packages are a good starting point. For very large datasets, H2O offers distributed computation. Python users should check out ML-Ensemble, but then again I’m biased :)


Footnotes

[1] For the interested reader: http://web.engr.oregonstate.edu/~tgd/publications/mcs-ensembles.pdf

[2] van der Laan, Mark J.; Polley, Eric C.; and Hubbard, Alan E., “Super Learner” (July 2007). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 222. http://biostats.bepress.com/ucbbiostat/paper222