You may have wondered why there are so many different algorithms for machine
learning - after all, would a single algorithm not suffice? A famous
result is the *no free lunch theorem*,
which tells us that no one algorithm will be optimal for every problem.

Intuitively, the reason for this stems from how machine learning algorithms
learn. Ultimately, machine learning boils down to minimizing some loss function
in a given environment. Once we have decided what an algorithm should learn,
we have implicitly also constrained the type of patterns it *can* learn.

This variety in learning means that for a given problem, different algorithms will be able to capture different aspects of the signal. Sometimes, they capture more or less the same dynamics of the signal, but more often than not their predictive power complement each other. That is, where one learning algorithm fails to find a good signal, another succeeds. Ensembles let us harness this diversity.

> Ensembles learn the optimal combination of base model predictions.

Suppose we have two models. We could construct an ensemble by averaging
their predictions. Better yet, we could fit a linear
regression to *learn* the best (linear) combination of the two predictions.
We call our original estimators, fitted on the input data, **base learners**,
and the algorithm used to find the best combination the **meta learner**. Of
course, nothing prevents us from finding *non-linear* combinations, and
quite frequently such ensembles have much more predictive power.
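To make this concrete, here is a small sketch in pure NumPy, with synthetic noisy predictions standing in for the outputs of two fitted models, comparing a plain average to a learned linear combination:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a target y and two noisy predictions of it,
# standing in for the outputs of two fitted models.
y = rng.normal(size=200)
p1 = y + rng.normal(scale=0.5, size=200)  # model 1's predictions
p2 = y + rng.normal(scale=0.8, size=200)  # model 2's predictions

# Naive ensemble: a plain average of the two predictions.
avg = (p1 + p2) / 2

# Learned ensemble: least-squares fit of y on the prediction columns,
# i.e. a linear-regression meta learner (no intercept, for simplicity).
P = np.column_stack([p1, p2])
w, *_ = np.linalg.lstsq(P, y, rcond=None)
learned = P @ w

mse = lambda p: np.mean((p - y) ** 2)
print(mse(p1), mse(p2), mse(avg), mse(learned))
```

Since the average is itself one particular linear combination, the learned weights can only do better (in-sample) than the plain average.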

To fix ideas, suppose we have a prediction problem and have trained a Lasso, a Support Vector Machine, a K-Nearest Neighbors estimator and a Random Forest. They all do decently, but we suspect we can get more out of them. After some diagnostics, we find that each model predicts some part of the data poorly, while other models capture that part quite well. This is a good indicator that an ensemble can provide a significant improvement.

*Figure 1: an illustration of the ensemble architecture*

Once settled upon some meta learner, say another SVM, we could fit our base
learners on half the training data, and have them predict the remainder. We
then fit the meta learner on their predictions. At test time, we first generate
predictions from the base learners and use these as input for
the meta learner, as in figure 1. This amounts to the simplest type of
ensemble, often called a *blend*. As we will see later, how to fit an ensemble
is not trivial and solving that issue gives rise to a whole host of different
types of ensembles. But before we get bogged down, let’s build some intuition
on why ensembles work with a very simple example.

## A simple example

To illustrate, at its most basic, how an ensemble improves predictive performance, suppose we have a classification problem and two already-fitted models, $l_1$ and $l_2$, which we evaluate on a test set. The ground truth in this test set is all positives, i.e. $\boldsymbol{y} = (1, 1, 1, 1)$.

For this test set, our models predict `0`'s and `1`'s in alternating
sequences, as in the table below.

Ground Truth: | 1 1 1 1 | Acc |
---|---|---|
$l_1$ predictions: | 1 0 1 0 | $50\%$ |
$l_2$ predictions: | 0 1 0 1 | $50\%$ |

*Table 1: example of model predictions*

Both models achieve $50\%$ accuracy. This may seem like an odd prediction pattern, but recall that our test set contains merely four observations, so it is perfectly possible to get a test set with only one label; the training set, of course, could contain many more labels that we don't care about here. Now, suppose we fit a decision tree $t$ to the first two observations. It is easy to see that the best this tree could learn is to predict a positive outcome whenever either base learner predicts a positive outcome:

$$
t\big(p^{(i)}_1, p^{(i)}_2\big) =
\begin{cases}
1 & \text{if } p^{(i)}_1 = 1 \text{ or } p^{(i)}_2 = 1, \\
0 & \text{otherwise.}
\end{cases}
$$

This ensemble would now achieve a perfect score on the two remaining observations in the test set:

Ground Truth: | 1 1 | Acc |
---|---|---|
$l_1$: | 1 0 | $50\%$ |
$l_2$: | 0 1 | $50\%$ |
ensemble: | 1 1 | $100\%$ |

*Table 2: example of an ensemble*

What’s going on here is that the decision tree is able to correct for the errors
made by one model, by considering what another model is doing instead. In
this way, an ensemble can be thought of as an error correction mechanism.^{1}
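The toy example can be reproduced in a few lines of NumPy, with the OR rule standing in for the fitted decision tree:

```python
import numpy as np

# The toy example from the tables above: ground truth is all positives,
# and each base learner is right on a different half of the test set.
y = np.array([1, 1, 1, 1])
l1 = np.array([1, 0, 1, 0])
l2 = np.array([0, 1, 0, 1])

# The rule the decision tree learns: predict positive if either base
# learner predicts positive (a logical OR on {0, 1} predictions).
ensemble = np.maximum(l1, l2)

acc = lambda p: np.mean(p == y)
print(acc(l1), acc(l2), acc(ensemble))  # 0.5 0.5 1.0
```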

> Ensembles learn to correct for base learner prediction errors. Hence, they can be thought of as error correction models.

A crucial insight is that for ensembles to be useful,
the base learners’ errors must be relatively *uncorrelated*. If $l_1$ and
$l_2$ make the same errors, there is no information the meta learner can use
to learn how to correct for them. But claiming all this from such a
simple example is a bit of an overreach. Next, we’ll develop these concepts in
a slightly more rigorous fashion.

## What does an ensemble learn?

Let $X = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(n)}\}$ define a set of $n$ observations of some input data with associated outputs $\boldsymbol{y} = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$. For each observation of $m$ features, $\boldsymbol{x}^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_m^{(i)})$, we want to predict the (expected) output $y^{(i)}$.

Suppose you have fitted a set of models $l_1, l_2, \ldots, l_k$ to the data. To
ensemble these models, we put them in a *library* of base learners
$\mathcal{L} = \{l_1, l_2, \ldots, l_k\}$. Given an observation $\boldsymbol{x}^{(i)}$,
this library can be used to generate a *set* of predictions. Each base learner
$l_j \in \mathcal{L}$ outputs a prediction
$p^{(i)}_{j} = l_j(\boldsymbol{x}^{(i)})$, and we can stack these predictions into
a vector of predictions:

$$
\boldsymbol{p}^{(i)} = \big(p^{(i)}_1, p^{(i)}_2, \ldots, p^{(i)}_k\big).
$$

Note that $\boldsymbol{p}^{(i)}$ has a similar form to $\boldsymbol{x}^{(i)}$: both
represent a set of *features* associated with an output $y^{(i)}$. The difference
is that the features in $\boldsymbol{p}^{(i)}$ are the predictions of each base learner.
It is now straightforward to train a meta learner $g$
on the set of predictions as opposed to the original data:

$$
\hat{y}^{(i)} = g\big(\boldsymbol{p}^{(i)}\big).
$$

To build some intuition as to what $g$ is learning, suppose that $g$ is a linear regression model, $g(\boldsymbol{p}) = \boldsymbol{w}^\intercal \boldsymbol{p}$. To fit the meta learner, we need to find the coefficients that minimize the sum of squared errors over the training set $\mathcal{T}$,

$$
\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} \sum_{i \in \mathcal{T}} \big( y^{(i)} - \boldsymbol{w}^\intercal \boldsymbol{p}^{(i)} \big)^2. \label{ols}\tag{1}
$$

For our purposes, we can make a mild restriction, $\sum_j w_j = 1$, in which case the ensemble acts as a model averaging mechanism. While this constraint is not recommended in practice ($g$ can always learn it on its own), we'll use it here to flesh out the learning dynamics. Under this constraint, the cost term for any given observation $i$ in $\eqref{ols}$ can be written as

$$
\big( y^{(i)} - \boldsymbol{w}^\intercal \boldsymbol{p}^{(i)} \big)^2 = \bigg( \sum_{j} w_j \big( y^{(i)} - p^{(i)}_j \big) \bigg)^2. \label{err}\tag{2}
$$

Now, $y ^ {(i)} - p^{(i)} _ j$ is the prediction error made by the $j$-th
base learner, so $g$ is learning to **correct for the errors** made by
each learner by choosing a **linearly optimal combination** of predictions, as
we saw earlier.

We can also see that we need *errors* to be *uncorrelated*. Why? Consider an
extreme case where all errors are perfectly correlated and
$p^{(i)}_j > y^{(i)}$ for all $j$ and all $i$. In this case, there is one
base learner $l_s \in \mathcal{L}$ with minimum prediction error for any $i$.
From $\eqref{err}$ we can immediately see that $\boldsymbol{w}^*$ is
given by $w_s = 1, \ w_{j \neq s} = 0$. But this ensemble cannot improve upon
the best base learner.

> For an ensemble to improve upon base learners, prediction errors must be relatively uncorrelated.

More generally, suppose $g$ is a non-linear model. In this case, $g$ learns to
correct for errors *locally*: in each neighborhood of the prediction
space, it learns the optimal combination of predictions. This can be very
powerful, since such an ensemble doesn’t require one or two models to have
predictive power throughout the feature space. Instead, one model is free to
capture one set of signals, while another model is able to capture another
type of signals. The meta learner’s task is then to infer in which neighborhood
it should trust which set of models.
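As a sketch of this local behavior, consider two hypothetical base learners that are each accurate only in one half of the feature space, with a decision tree as the meta learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = np.sin(3 * x)

# Hypothetical base learners: each is accurate in one half of the
# feature space and predicts roughly zero elsewhere.
p1 = np.where(x < 0, y, 0.0) + rng.normal(scale=0.05, size=500)
p2 = np.where(x >= 0, y, 0.0) + rng.normal(scale=0.05, size=500)

# A decision tree meta learner can infer, from the predictions alone,
# which model to trust in which neighborhood.
P = np.column_stack([p1, p2])
meta = DecisionTreeRegressor(max_depth=6, random_state=0).fit(P, y)
ens = meta.predict(P)

mse = lambda p: np.mean((p - y) ** 2)
print(mse(p1), mse(p2), mse(ens))
```

Neither base learner has predictive power everywhere, yet the non-linear meta learner stitches their strong regions together.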

Another important insight from $\eqref{ols}$ is that $g$ will learn to correct for
the *specific* type of prediction error encoded in the training set $\mathcal{T}$.
In particular, suppose we use the same data to first fit the base learners and
then the meta learner. In this case, $g$ learns to correct for *training errors*,
but at test time, it must correct for *test errors*. Models typically
do better on the training set than on the test set; the ensemble
will compound this gulf by falsely believing the base learners are more accurate
than they are. Even worse, it will learn to rely on the base learner that overfits
the most, as opposed to the base learner that generalizes best. For this
reason, it is vital that the meta learner is trained on predictions that reflect
the generalization capacity of each base learner.

## Training the meta learner

Using as much training data as possible to fit both the base learners and the meta learner, while avoiding re-use of the same training data, is not a trivial problem. In fact, ensembles are typically differentiated by how they solve this problem.

#### The simple approach: blending

In the simplest case, we just split the training data into two sets. One set for training the base learners, and one for training the meta learner. Formally, let $ X^{(\mathrm{train})} _ {\mathcal{L}} $ denote the training set for the base learners and $ X^{\mathrm{(train)}} _ {g} $ denote the training set used for the meta learner. Once fitted, the base learners predict $X^{\mathrm{(train)}} _ {g}$, and these predictions are used to train the meta learner. Hence, we require these sets to be disjoint: $ X^{(\mathrm{train})} _ {\mathcal{L}} \cap X^{\mathrm{(train)}} _ {g} = \varnothing $. Typically, we would also fit the base learners on all training data and use these models at test time.
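A minimal blend can be sketched with scikit-learn as below; the dataset and model choices are illustrative, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for a real problem.
X, y = make_friedman1(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Disjoint halves: one to fit the base learners, one for the meta learner.
X_L, X_g, y_L, y_g = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

base_learners = [Lasso(alpha=0.01), KNeighborsRegressor()]
for learner in base_learners:
    learner.fit(X_L, y_L)

# Base learner predictions on the held-out half become meta features.
P_g = np.column_stack([l.predict(X_g) for l in base_learners])
meta_learner = LinearRegression().fit(P_g, y_g)

# Test time: base predictions first, then the meta learner on top.
P_test = np.column_stack([l.predict(X_test) for l in base_learners])
y_pred = meta_learner.predict(P_test)
```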

> It is vital that the meta learner is trained on predictions that reflect the generalization capacity of each base learner.

This method is often referred to as *blending*. While simple and relatively fast,
it is also a bit unsatisfying. The reason is that the predictions used to train
the meta learner are generated by base learners that only saw a portion of
the training data, and we only have predictions for a subset of the training data.
Unless we have a very large data set, odds are the predictions used to train
the meta learner are not consistent with how the base learners behave at test time.

#### The generalized approach: stacking

An immediate way to at least partially overcome this dilemma is to first generate predictions as above, and then swap the datasets and generate a new set of predictions, but now for $ X^{(\mathrm{train})} _ {\mathcal{L}} $. We then fit the meta learner on both sets of predictions, which covers all our training samples. This way, the meta learner is trained on predictions for every sample in the training set. The only drawback of this method is that these predictions are generated by base learners fitted on only a subset of the training data, while at test time predictions are generated by base learners fitted on all the training data. For small and medium datasets, this can make a large difference. Luckily, there are methods to overcome this to an arbitrary degree.

The key observation is that the above approach essentially uses a 2-fold
cross-validation strategy: we first fit models on one fold and predict the
other, then repeat the process with the roles of the folds exchanged. We can
immediately generalize this notion to K-fold cross-validation. In the limit, if
we use a leave-one-out strategy, the set of predictions used to train the meta
learner will match the base learners' behavior at test time. This method is referred
to as (generalized) *stacking*. The ensemble itself is known as the
*Super Learner*. This name stems from the fact that, with K-fold cross-validation
as the training method, the Super Learner is guaranteed to do at least as well
as the best base learner, and as long as errors are at least partially uncorrelated,
it will outperform the best base learner. Moreover, in theory it
asymptotically approximates the true Oracle estimator.^{2}
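Generalized stacking can be sketched with scikit-learn's `cross_val_predict`, which produces exactly these out-of-fold predictions (dataset and model choices again illustrative):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

X, y = make_friedman1(n_samples=600, random_state=0)
base_learners = [Lasso(alpha=0.01), KNeighborsRegressor()]

# K-fold out-of-fold predictions: every training sample is predicted
# by a model that never saw it during fitting.
P = np.column_stack(
    [cross_val_predict(l, X, y, cv=10) for l in base_learners]
)
meta_learner = LinearRegression().fit(P, y)

# The base learners themselves are refitted on all data for test time.
for learner in base_learners:
    learner.fit(X, y)
```

Unlike blending, the meta learner here sees a prediction for every training sample.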

Stacking is especially valuable for small or medium sized datasets with hierarchical features, since blending risks omitting entire sub-groups of the hierarchy from the training set. For instance, in the example below, blending could completely omit data relating to Paris from the training set of the meta learner.

City | Year | Population (M) |
---|---|---|
Paris | 2008 | 2.211 |
 | 2010 | 2.244 |
Madrid | 2008 | 3.256 |
 | 2010 | 3.273 |

*Table 3: example of grouped data*

The main drawback is computational. If we use 10 folds, each base learner will be fitted 11 times: once on the full data and 10 times on 90% of the data. This quickly leads to a significant amount of training time, and if hyperparameter tuning is necessary, stacking can be prohibitively costly.

Stacking also poses some serious implementation challenges. First, we must ensure there is no information leakage when constructing the predictions from the base learners. Since all training data is used for fitting at some point, it is easy to make a slip that causes base learners to predict data they were trained on. For instance, if the training data is a time series, we must ensure that the folds are constructed so that the base learners always predict future outcomes. The second challenge is to speed the ensemble up as much as possible, which requires parallel processing, and with it a whole host of issues related to safely running estimations in parallel and avoiding excessive memory consumption. By far the best option is to use a maintained package for your ensemble adventures.
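For the time-series case, a minimal sketch of leakage-safe fold construction with scikit-learn's `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in for time-ordered features

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, predict_idx in tscv.split(X):
    # Every training index precedes every prediction index, so the
    # base learners only ever predict future outcomes.
    assert train_idx.max() < predict_idx.min()
```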

### Subsembles

A final type of ensemble that can be very powerful is the Subsemble. Subsembles
also use stacking, but with a twist. These ensembles *partition* the training data into
$J$ partitions and fit the base learners using K-fold cross-validation on each
partition. Each partition acts as its own mini-ensemble, and the meta learner
is fitted on the predictions made in *each* partition. Hence, with $k$ base
learners and $J$ partitions, the number of features for the meta learner
is $k \times J$.
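One reading of this scheme is sketched below with scikit-learn (an assumption-laden illustration, not the canonical Subsemble algorithm): in-partition samples get out-of-fold predictions, while each partition's fitted models predict the samples outside their partition directly, yielding $k \times J$ meta features per sample.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

X, y = make_friedman1(n_samples=600, random_state=0)
n, J = len(y), 3  # J partitions of the training data
partitions = np.array_split(np.arange(n), J)

columns = []
for idx in partitions:
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    for learner in [Lasso(alpha=0.01), KNeighborsRegressor()]:
        col = np.empty(n)
        # In-partition samples: out-of-fold (cross-validated) predictions.
        col[mask] = cross_val_predict(learner, X[mask], y[mask], cv=5)
        # Out-of-partition samples were never seen by this partition's
        # model, so it can predict them directly.
        col[~mask] = learner.fit(X[mask], y[mask]).predict(X[~mask])
        columns.append(col)

P = np.column_stack(columns)  # (n, k * J) meta features
meta_learner = LinearRegression().fit(P, y)
```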

Subsembles have a few distinct benefits. First, by partitioning the dataset, they scale better than the Super Learner. Second, by creating subset-specific base learners, they can harness local information better, possibly leading to a richer prediction input set for the meta learner.

Their drawback is mainly noticeable for smaller datasets, where the scale of the data is not sufficient to make the Super Learner slow. Since the Subsemble requires $J$ times as many fits as the Super Learner, it can be comparatively slow in these cases. Secondly, with noisy data, each partition can generalize poorly, leading to noisy predictions that obscure the signal for the meta learner. Whether Subsembles offer an improvement over Super Learners is ultimately an empirical question: you’ll simply have to try and see which does best.

## Where to go from here

If you would like to deploy ensembles, your best bet is to use unit-tested and
maintained software. Several packages exist that are optimized for speed and memory
and carefully unit tested against information leakage. For R users,
**SuperLearner**
and **Subsemble**
are a good starting point. With very large datasets,
**H2O** offers distributed computation. For Python
users, you should check out **ML-Ensemble**,
but then again I’m biased :)

### Footnotes

[1] For the interested reader: http://web.engr.oregonstate.edu/~tgd/publications/mcs-ensembles.pdf

[2] van der Laan, Mark J.; Polley, Eric C.; and Hubbard, Alan E., “Super Learner” (July 2007). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 222. http://biostats.bepress.com/ucbbiostat/paper222