Deep neural networks are very powerful models, in theory able to approximate *any* function. In practice things are a little different. Oddly, a neural network tends to generalize better the larger it is, often to the point of having more parameters than there are data points in the training set.

In a recent paper I argue that one reason neural networks need to be so large is that they are biased towards linear behavior. In particular, neural networks are at the mercy of the activation function, since it’s the only source of non-linearity in the model. The closer to linear the activation function is, the larger the network must be to learn complex patterns. We call this monopolization of non-linearity the *activation function bottleneck*.

## The Activation Function Bottleneck

How much of a problem is this bottleneck? Common choices of activation function, such as the Sigmoid, Tanh, and ReLU, all behave close to linearly over large ranges of their domain. More importantly, for networks to be stable, weights must be small. As a consequence, inputs to the activation function lie in a small neighborhood, which tends to reinforce the activation function’s tendency towards linear behavior. To make sure we’re all on the same page, here’s what popular choices look like (the Sigmoid is stretched and zero-centered for comparison):

The bottleneck arises because inputs to the activation function cluster around $0$, where these functions behave largely linearly. The ReLU is less of a problem, but it is instead exactly linear on either side of $0$, which has its own set of drawbacks. The gist of the activation function bottleneck is that the activation function is largely linear over $p(a)$, an artifact of initializing weights to small values around $0$ (which is necessary for the network to be stable and the gradient well behaved). Because of this, the probability of drawing an $a$ such that $\phi(a)$ is significantly different (up to a linear transformation) from $a$ is small.

To measure the activation function bottleneck, we need the *pre-activation* layer, $\boldsymbol{a}= W\boldsymbol{x} + \boldsymbol{b}$, for some parameter matrix $W$ and bias $\boldsymbol{b}$, and the feed-forward layer $\boldsymbol{y} = \phi(\boldsymbol{a})$. Because $\phi$ is applied element-wise, we can construct a random pre-activation variable $a$ by aggregating each $a_i \in \boldsymbol{a}$,

Assuming $p(a)$ is centered around $0$ (which it is), we measure the effective degree of non-linearity in $\phi$ as the difference between $\phi(a)$ and a first-order linear approximation, $\phi'(0) a + \phi(0)$. This difference is a measure of the operative non-linearity in $\phi$ over $p(a)$. We’ll call this amount the *activation effect*. To illustrate the activation function bottleneck, we train a two-layer feed-forward network on MNIST (based on TensorFlow’s starter code) using the Sigmoid function as non-linearity,

The linear approximation of the Sigmoid is given by $\phi(a) \approx 0.25 a + 0.5$. Our trained network reaches 95% accuracy, which is quite bad for MNIST. We’ll fix that later. Now, first, let’s consider the distribution of the pre-activation variable $a$:

As claimed, $a$ is zero-centered, and has a standard deviation around 0.7-0.8. Because the Sigmoid is largely linear on $[-1,1]$, most pre-activation units are largely unaffected by the activation function. In fact, the distribution of activation effects is problematic:

The distribution is sharply peaked around $0$, with few values outside $0.05$. The immediate consequence is that the network must be extremely sensitive to changes in the input to pick up on these minute activation effects, which makes it that much harder to train.
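The measurement can be reproduced in miniature. The sketch below is not the MNIST experiment: it assumes $p(a)$ is a zero-mean Gaussian with standard deviation 0.75 (a stand-in for the empirical distribution, picked from the range quoted above), and computes the activation effect of the Sigmoid against its linear approximation $0.25a + 0.5$.

```python
import math
import random


def sigmoid(a: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def activation_effect(a: float) -> float:
    """Gap between the Sigmoid and its first-order expansion around 0.

    phi(0) = 0.5 and phi'(0) = 0.25, so the linear approximation
    is 0.25 * a + 0.5.
    """
    return sigmoid(a) - (0.25 * a + 0.5)


# Assumption: a zero-mean Gaussian with standard deviation 0.75 stands in
# for the empirical distribution of pre-activations.
rng = random.Random(0)
samples = [rng.gauss(0.0, 0.75) for _ in range(100_000)]
effects = [activation_effect(a) for a in samples]

# Share of pre-activations whose activation effect is below 0.05 in magnitude.
share_small = sum(abs(e) < 0.05 for e in effects) / len(effects)
print(f"share with |effect| < 0.05: {share_small:.3f}")
```

Under this Gaussian assumption, more than nine in ten sampled pre-activations have an activation effect smaller than 0.05 in magnitude, which matches the sharply peaked shape described above.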

To home in on this point, let’s sample pre-activation values from the empirical distribution of $a$ and plot the activation function effect against each sample. As the animation below shows, only outlying pre-activation values enjoy a non-linear effect; most pre-activation units are practically unaffected by the activation function.

The activation bottleneck makes standard neural networks overly biased towards linear behavior and therefore both statistically and computationally inefficient. One way to improve things is to devise better activation functions with more non-linearity around neighborhoods where pre-activation units cluster. ReLU is one such example: switching from Sigmoid to ReLU activations, our feed-forward network reaches 97% accuracy. Still, the ReLU has its own set of drawbacks, making it unsuitable for a host of tasks. In general, when the only source of non-linearity is an element-wise mapping, we are held hostage by the properties of that function. A more general approach is to simply remove the bottleneck entirely.

## Adaptive Parameterization

Our approach relies on the observation that for any given pair of input and output $(\boldsymbol{x}, \boldsymbol{o})$, there is a linear map $T$ such that $\boldsymbol{o} = T(\boldsymbol{x})$. Learning this map exactly is not possible, since for any finite training set we’re bound to overfit. But a good approach is to learn a point in parameter space from which we can approximate $T$. In fact, this is what the standard feed-forward network does.
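To make the observation concrete: for a non-zero $\boldsymbol{x}$, the rank-one matrix $T = \boldsymbol{o}\boldsymbol{x}^\top / (\boldsymbol{x}^\top \boldsymbol{x})$ maps $\boldsymbol{x}$ exactly onto $\boldsymbol{o}$. This particular construction is chosen here for illustration only (it is one of many such maps and is not taken from the paper); the helper names are hypothetical.

```python
def exact_linear_map(x, o):
    """Return a matrix T (as a list of rows) with T x = o, for non-zero x."""
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        raise ValueError("x must be non-zero")
    # Rank-one construction: T = o x^T / (x^T x).
    return [[oi * xj / norm_sq for xj in x] for oi in o]


def apply_map(T, x):
    """Matrix-vector product T x."""
    return [sum(t_ij * xj for t_ij, xj in zip(row, x)) for row in T]


x = [1.0, 2.0, -1.0]
o = [0.5, -3.0]
T = exact_linear_map(x, o)
o_hat = apply_map(T, x)  # recovers o up to floating-point error
```

Note that $T$ is built from the very pair it is supposed to predict, which is why fitting such a map per example amounts to memorizing the training set.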

### The Adaptive Feed-Forward Layer

To motivate the adaptive feed-forward layer, consider first how the standard feed-forward layer essentially learns an *adaptive composition of linear maps*. Define by $A: \boldsymbol{x} \mapsto W\boldsymbol{x} + \boldsymbol{b}$ the pre-activation mapping, and let $\boldsymbol{g} = \phi(\boldsymbol{a}) / \boldsymbol{a}$ denote the activation function effect (division being element-wise; we use division to be consistent with the paper), with $G = \operatorname{diag}(\boldsymbol{g})$ being an adaptive (diagonal) matrix. We can now rewrite the feed-forward layer as a composite of linear maps:

Note that the feed-forward layer learns a point, a “prior”, $A$, around which it adapts to the input through $G$. More generally, with $L$ layers, we have an adaptive composition of linear maps of the form

The adaptive mechanism comes through the activation effects $G^{(1)}, \ldots, G^{(L)}$. This mechanism is weak: if $\phi$ is close to linear over the distribution of $a$, as is often the case, little adaptation can occur. To remedy the situation, we parameterize each $G^{(l)}$ in adaptation matrices

where $\pi$ is a parameterized *adaptation policy* that we learn jointly with the static parameters of the network. For instance, $\pi$ could be a linear projection or a feed-forward network. In the simplest case, we make the feed-forward layer adaptive by replacing $G$ with $D$. However, we can have any number of such adaptations in a feed-forward layer. For instance, in the paper I explore Singular Value Adaptation (SVA)

and IO-adaptation (IOA)

SVA essentially learns an eigenbasis in which to adapt the eigenvalues of the linear transformation (see paper for details), whereas IOA learns a policy for adapting the mean and variance of sub-matrices in $W$. Here’s an illustration of how these policies adapt the weights in some static matrix $W$:

We call the two policies to the left output- and input-adaptation, respectively, as they can be seen as adapting either the output or the input of the linear transformation. These are partial adaptation policies, in that they operate only on either the rows or the columns of $W$. Center right is an example of IOA and to the right one of SVA.
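A minimal NumPy sketch of the two policies may help fix ideas. It assumes the forms $\phi(D^{(2)} W D^{(1)} \boldsymbol{x} + \boldsymbol{b})$ for IOA and $\phi(W^{(2)} D W^{(1)} \boldsymbol{x} + \boldsymbol{b})$ for SVA, with each $D = \operatorname{diag}(\pi(\boldsymbol{x}))$; these forms are inferred from the descriptions in this post, so check them against the paper. All function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def linear_policy(P):
    """Adaptation policy pi(x) = P x; returns the diagonal of D as a vector."""
    return lambda x: P @ x


def static_layer(x, W, b):
    """Standard feed-forward layer: phi(W x + b)."""
    return sigmoid(W @ x + b)


def ioa_layer(x, W, b, pi_in, pi_out):
    """Assumed IOA form: phi(D_out W D_in x + b)."""
    d_in = pi_in(x)    # rescales the columns of W (input adaptation)
    d_out = pi_out(x)  # rescales the rows of W (output adaptation)
    return sigmoid(d_out * (W @ (d_in * x)) + b)


def sva_layer(x, W1, W2, b, pi):
    """Assumed SVA form: phi(W2 D W1 x + b)."""
    return sigmoid(W2 @ (pi(x) * (W1 @ x)) + b)


def effective_weights(x, W, pi_in, pi_out):
    """The input-dependent matrix D_out W D_in that IOA applies to x."""
    return np.diag(pi_out(x)) @ W @ np.diag(pi_in(x))


rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_out, n_in))
b = np.zeros(n_out)
pi_in = linear_policy(rng.normal(size=(n_in, n_in)))
pi_out = linear_policy(rng.normal(size=(n_out, n_in)))

# Two different inputs see two different effective weight matrices.
x_a, x_b = rng.normal(size=n_in), rng.normal(size=n_in)
W_a = effective_weights(x_a, W, pi_in, pi_out)
W_b = effective_weights(x_b, W, pi_in, pi_out)
y_a = ioa_layer(x_a, W, b, pi_in, pi_out)
```

The adaptation vectors are applied with element-wise products rather than by building the diagonal matrices, which is equivalent and cheaper; `effective_weights` builds them explicitly only to show that the weights seen by each input differ.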

### MNIST revisited

To test if the adaptive feed-forward layer breaks the activation function bottleneck, we replace the feed-forward layers in our original network with an adaptation policy; here, we’ll use SVA with $\pi(\boldsymbol{x}) = P\boldsymbol{x}$. With SVA layers, the model improves accuracy, from 95% with standard layers to 97%. If we remove the activation function entirely, we get about one additional percentage point in accuracy. But more interesting is to compare the distribution of activation effects. With adaptive feed-forward layers, we need to be a bit careful since we now have non-linearity in $\boldsymbol{a}$, the pre-activation layer. Hence, to measure the activation effect (or perhaps a better name would be the adaptation effect) we compare the output of the SVA layer with the linear transformation we get if we remove all non-linear components: with Sigmoid non-linearity and SVA layers we have $0.25 (W^{(2)}W^{(1)}\boldsymbol{x} + \boldsymbol{b}) + 0.5$. The activation effect now looks more healthy:

Note that not only is the distribution smoother, its dispersion is almost two orders of magnitude larger. Neat. Now, to really unpack the adaptive feed-forward layer, let’s look at how it behaves on a simple regression problem where a two-layer feed-forward network fails.

### Multi-modal learning

The problem is, given $x \in [0, 1]$, to predict $\sin(10x)$. This is a deterministic 1-dimensional regression problem, so not particularly hard. What’s challenging about it is that we sample $x$ from a multi-modal distribution, $p(x) = 0.5 \, \mathcal{N}(0.2, 0.7) + 0.5 \, \mathcal{N}(0.8, 0.4)$. The figure below illustrates the sine curve and the data distribution (blue bars).

The model will therefore see plenty of data from the two “hills”, but very little from the “valley”. So to solve the problem, the model must be very flexible and able to quickly learn a representation space that is invariant to this multi-modality. The network we test on this problem is a two-layer feed-forward network:

We compare a static baseline to a model where the first layer is an adaptive feed-forward layer with an SVA policy (eq. \ref{eq:sva}). Because we are interested in statistical efficiency, we use as small a model as possible, so the hidden layer has only two units (the static baseline fails even with 10 units). Training these models with the Adam optimizer, the static benchmark model fails completely, whereas the SVA model does a pretty good job. The below animation shows how the two models perform over the course of training (blue being the SVA and green the static baseline). The figure to the right shows the learned representation space of the SVA.

Once the SVA learns a representation space that is invariant to the multi-modality of the input distribution, learning takes off pretty quickly. Of course, this is a rather simple problem, so if the static baseline model is made sufficiently large it too can solve this problem. That is precisely the point; it is inefficient.

## Dynamic Adaptation and the Adaptive LSTM

So far we’ve looked at simple problems. To truly test the adaptive feed-forward layer, we need a more challenging benchmark. Feed-forward layers are prominent in recurrent neural networks (RNNs), so RNNs pose a natural test-bed for adaptive feed-forward layers. The basis for any RNN is a recurrent linear transformation:

The vanilla RNN for instance defines the state update rule $\boldsymbol{h} _ t = \phi( \boldsymbol{u} _ t )$. The popular LSTM model uses a recurrent gating mechanism:

where $\sigma$ is the sigmoid and $\tau$ the tanh non-linearity and $\odot$ element-wise multiplication. If the LSTM is new to you, ignore these gating mechanisms (they aren’t important to us). To make any RNN adaptive, we replace recurrent linear transformations (i.e. eq. \ref{eq:linear}) with an adaptive recurrent transformation. Here, we’ll implement IOA (eq. \ref{eq:ioa}) by re-defining eq. \ref{eq:linear} to be

The difference between the adaptive recurrent transformation and the static version is that the former uses different weights at each time step by adapting to the state of the system and the given input. This allows it to quickly respond to changes in the dynamics of the system, giving it a higher capacity to model rapid or large changes.
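The contrast between the two transformations can be sketched as follows. The code assumes the static transformation is $\boldsymbol{u}_t = W\boldsymbol{v}_t + \boldsymbol{b}$ with $\boldsymbol{v}_t = [\boldsymbol{x}_t; \boldsymbol{h}_{t-1}]$, and that the IOA version rescales the rows and columns of $W$ with diagonal matrices computed from $\boldsymbol{v}_t$. The tanh-squashed linear policies and all names are illustrative assumptions, not the paper’s exact parameterization.

```python
import numpy as np


def static_transform(v, W, b):
    """Static recurrent transformation: u_t = W v_t + b."""
    return W @ v + b


def adaptive_transform(v, W, b, pi_in, pi_out):
    """Assumed IOA form: u_t = D_out(v_t) W D_in(v_t) v_t + b."""
    return pi_out(v) * (W @ (pi_in(v) * v)) + b


def run_rnn(xs, h0, transform):
    """Vanilla RNN h_t = tanh(u_t), with v_t = [x_t; h_{t-1}]."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(transform(np.concatenate([x, h])))
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
n_x, n_h, steps = 3, 5, 6
W = rng.normal(scale=0.5, size=(n_h, n_x + n_h))
b = np.zeros(n_h)
P_in = rng.normal(scale=0.5, size=(n_x + n_h, n_x + n_h))
P_out = rng.normal(scale=0.5, size=(n_h, n_x + n_h))

# Hypothetical policies: a linear projection squashed by tanh.
pi_in = lambda v: np.tanh(P_in @ v)
pi_out = lambda v: np.tanh(P_out @ v)

xs = rng.normal(size=(steps, n_x))
h0 = np.zeros(n_h)
static_states = run_rnn(xs, h0, lambda v: static_transform(v, W, b))
adaptive_states = run_rnn(
    xs, h0, lambda v: adaptive_transform(v, W, b, pi_in, pi_out)
)
```

Both runs share the same $W$; only the adaptive one rescales it at every step based on the current input and state.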

By simply substituting in eq. \ref{eq:adaptive-linear}, we can make any RNN, such as vanilla RNN, LSTM, or GRU, adaptive. Let’s focus on the LSTM. Because the LSTM uses four recurrent transformations, if we model each adaptive matrix as directly dependent on $\boldsymbol{x}$ and $\boldsymbol{h}$, the model would become very large. To prevent parameter explosion, we use a latent adaptation variable $\boldsymbol{z} _ t$ that is a function of the state of the system and the input,

This latent variable allows us to aggregate local, context-dependent information into a shared representation that we project into adaptation matrices:

If $f$ is a recurrent model, such as an LSTM, the adaptive LSTM generalizes the HyperNetwork (Ha et al., 2017) to a full adaptation policy. $f$ could also be something simpler, for instance a linear projection or a feed-forward network, or something more complicated: typically, the aLSTM does somewhat better when $f$ is itself a recurrent network. The full aLSTM is slightly more involved than what we have here, but let’s not complicate things.
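Putting the pieces together, a single step of a simplified adaptive LSTM might look like the sketch below. It makes several assumptions that are not from the paper: $f$ is a one-layer tanh projection, the four recurrent transformations are stacked into one matrix that shares a single pair of adaptation vectors, and the gating is the standard LSTM update. All parameter names are hypothetical. The last function compares policy sizes with and without the latent variable, assuming linear policies in both cases.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_params(n_x, n_h, n_z, rng):
    """Random parameters for the sketch; names are hypothetical."""
    n_v = n_x + n_h
    return {
        "F": rng.normal(scale=0.3, size=(n_z, n_v)),          # v_t -> z_t
        "P_in": rng.normal(scale=0.3, size=(n_v, n_z)),       # z_t -> diag of D_in
        "P_out": rng.normal(scale=0.3, size=(4 * n_h, n_z)),  # z_t -> diag of D_out
        "W": rng.normal(scale=0.3, size=(4 * n_h, n_v)),      # four stacked transforms
        "b": np.zeros(4 * n_h),
    }


def alstm_step(x, h, c, p):
    """One step: latent z_t, adaptive transformation, then LSTM gating."""
    v = np.concatenate([x, h])
    z = np.tanh(p["F"] @ v)             # latent adaptation variable
    d_in = np.tanh(p["P_in"] @ z)       # projected adaptation vectors
    d_out = np.tanh(p["P_out"] @ z)
    u = d_out * (p["W"] @ (d_in * v)) + p["b"]
    i, f, o, g = np.split(u, 4)         # input, forget, output, candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def policy_param_counts(n_x, n_h, n_z):
    """Policy sizes: direct linear maps from v_t versus going through z_t."""
    n_v = n_x + n_h
    direct = n_v * n_v + 4 * n_h * n_v
    latent = n_z * n_v + n_v * n_z + 4 * n_h * n_z
    return direct, latent


rng = np.random.default_rng(0)
n_x, n_h, n_z = 3, 5, 2
params = init_params(n_x, n_h, n_z, rng)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(10, n_x)):
    h, c = alstm_step(x, h, c, params)
direct, latent = policy_param_counts(n_x, n_h, n_z)
```

The latent route grows with $n_z$ rather than with the full width of $\boldsymbol{v}_t$, which is the saving the latent variable is meant to buy.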

To test the aLSTM, we turn to the popular Penn Treebank word modeling benchmark (code release). This benchmark has recently received quite a bit of attention, making it a challenging one. Nonetheless, the aLSTM improves upon the previous state-of-the-art performance by a comfortable margin. Moreover, it outperforms the previous SOTA holder, the AWD-LSTM, in 144 epochs, compared to the 500 required by the latter. In fact, the aLSTM is actually faster to train in wall-clock time despite the latter using the CuDNN implementation of the LSTM. Here’s what the validation loss curve looks like:

The aLSTM is also very stable; you can train it without gradient clipping while backpropagating through thousands of time steps, without having to fiddle with the learning rate (that might not give you the best performance though). The reason for this is that IOA introduces diagonal matrices that behave as gating mechanisms for the gradient during backprop, effectively preventing gradient explosion (the behavior is similar to that of the multiplicative-integration RNN, which was developed to facilitate gradient flow; Wu et al., 2016).

Finally, the aLSTM is also relatively robust to hyper-parameter choices. Let’s quantify this by sampling hyper-parameters randomly in some neighborhood, say $\pm 0.1$, of the hyper-parameters that give the best performance (let’s focus on dropout rates for simplicity), and training each sample for a few epochs to see how tight or dispersed the distribution is. Doing so, we get the figure below after about 100 samples of each model.

The aLSTM is rather robust, with no model performing particularly badly. The LSTM is another story: half the models fail to converge and the distribution among those that do features a heavy upward tail towards poor performance.
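The sampling step of this robustness check can be sketched as follows. The post does not say which distribution was used within the $\pm 0.1$ neighborhood, so uniform sampling is an assumption here, as is clipping to $[0, 1]$; the dropout names and their "best" values are placeholders rather than the settings used in the experiments.

```python
import random


def sample_dropout_rates(best, radius=0.1, rng=random):
    """Perturb each rate uniformly within +/- radius, clipped to [0, 1]."""
    return {
        name: min(max(rate + rng.uniform(-radius, radius), 0.0), 1.0)
        for name, rate in best.items()
    }


# Hypothetical best-performing dropout rates; not taken from the paper.
best_rates = {"input": 0.4, "hidden": 0.25, "output": 0.5}

rng = random.Random(0)
configs = [sample_dropout_rates(best_rates, rng=rng) for _ in range(100)]
```

Each sampled configuration would then be trained for a few epochs, and the resulting validation losses collected into one distribution per model.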

## Conclusion

We’ve seen that though deep neural networks are powerful models, limiting the non-linearity to the activation function has the undesirable side-effect of forcing us to use needlessly large models. This makes the model both harder to train and costlier to evaluate and store. We’ve shown one way to remove this bottleneck through adaptive parameterization and, applying it to the aLSTM, found that it does indeed yield better performance along a variety of dimensions.