[code] - [paper]

The current wave of machine learning is the epitome of division of labor: a given model performs a narrowly defined task, and performs it to perfection. While these tasks can be quite complex, each requires a distinct model trained to solve that task and only that task. In contrast, a truly general artificial intelligence should be able to adapt to changes in its environment and to new tasks.

When humans adapt to new tasks, we draw heavily on previous knowledge and experience. This is a crucial ability that lets us learn new things with very few demonstrations. For instance, most people would be able to use Netflix without any demonstrations because they can extrapolate from general website design patterns.

Replicating this type of behaviour in a machine learning algorithm is very challenging. A common approach is to simplify the problem: take a model trained on some other task as our starting point and fine-tune it on the task we now want to solve. While simple, fine-tuning is a highly effective method in Computer Vision [1, 2, 3] and Natural Language Processing [4, 5, 6, 7] when the tasks to transfer between are similar.
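For concreteness, here is what the fine-tuning recipe typically looks like in practice. This is a minimal PyTorch sketch; the backbone, the optimiser settings, and the 10-class target task are illustrative assumptions, not anything from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a model pre-trained on some other task (here: ImageNet classification).
model = models.resnet18(pretrained=True)

# Replace the task-specific head with one for the new task.
num_new_classes = 10  # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

optimiser = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def finetune_step(x, y):
    """One gradient step on the new task, starting from pre-trained weights."""
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()
    return loss.item()
```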

When tasks are not so similar, or when data is severely scarce, fine-tuning fails. Fundamentally, the problem with fine-tuning and similar transfer learning methods is that they ignore the process of learning. This can cause a catastrophic loss of information: the pre-trained model discards information that is not useful for the task it is being trained on, but that would have been useful for the task we actually care about. When you learn to navigate the Netflix website, you use abstract knowledge about how websites work, not the details of how to navigate some other specific website.

Figure: Fine-tuning ignores the process of learning when transferring knowledge. To consistently transfer knowledge, we have to transfer it across the learning processes themselves.

Towards an adaptable artificial intelligence

What we need is a more principled approach to sharing information between tasks; in particular, a way of sharing information between two (or more) learning processes. One such approach is given by meta-learning (also known as learning to learn).

Figure: Transferring knowledge across learning processes can be seen as a learning problem in its own right, so-called learning to learn, or meta-learning. A meta-learner learns how to transfer knowledge when learning a new task.

In meta-learning, we treat information transfer as a learning problem in its own right and optimise for maximal transfer. Because meta-learning learns to transfer knowledge between a set of tasks, it can transfer knowledge to new, unseen tasks that resemble those it was trained on. As such, meta-learning has the potential to produce a truly general artificial intelligence that can smoothly adapt to new tasks with very few demonstrations [8, 9, 10].

Figure: Meta-learning takes a set of tasks, each with a training and a test set, and learns how to adapt a task learner to any one of them. The goal is to generalise to unseen tasks, in the sense that the task learner can adapt to them. Credit: Ravi & Larochelle [11].

Current meta-learning focuses exclusively on the so-called few-shot learning problem. In this setting, we have access to a relatively rich distribution of tasks but only a few samples from each task, and thus can only take a handful of gradient steps before we overfit. For this to work, tasks must be very similar, or a handful of gradient steps will not be sufficient to learn a task. While this is an important case of meta-learning, it is by no means all that meta-learning has to offer.
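To make the few-shot setting concrete, here is a runnable toy episode. The tasks (1-D linear regressions with varying slope), the helper names, and all hyperparameters are hypothetical, chosen only to show the structure: every task is learned from a shared initialisation with just a handful of gradient steps, then evaluated on the task's own test set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task: a hidden slope `a`, with small train and test sets y = a * x."""
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_te = rng.normal(size=10), rng.normal(size=10)
    return (x_tr, a * x_tr), (x_te, a * x_te)

def adapt(theta, data, steps=5, lr=0.1):
    """Inner loop: only a handful of gradient steps on the task's train set."""
    x, y = data
    for _ in range(steps):
        grad = 2.0 * np.mean((theta * x - y) * x)  # d/d(theta) of the MSE
        theta = theta - lr * grad
    return theta

def test_loss(theta, data):
    x, y = data
    return np.mean((theta * x - y) ** 2)

theta0 = 0.0  # the shared initialisation a meta-learner would tune
train_set, test_set = sample_task()
print(test_loss(adapt(theta0, train_set), test_set))
```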

We argue that meta-learning can contribute even more when tasks are complex and require many training steps. In fact, precisely because such tasks require many training steps, meta-learning can have a dramatic impact on the rate of convergence and, ultimately, on the final performance. Currently, no meta-learning algorithms are designed for large-scale meta-learning, and most are computationally infeasible beyond the few-shot setting.

To remedy the situation, we propose Leap, a lightweight meta-learning algorithm designed specifically for long training processes, even those spanning millions of gradient descent steps.

Leap was presented at ICLR (starts at 1:33:00) on May 9th. We’ve also released code for Leap that you can play around with.

Designing a Scalable Meta-Learner

To design such a meta-learner, we face several tough challenges. The first is computational: meta-learning requires learning from learning processes, so it inherently scales poorly. Most few-shot algorithms require some form of backpropagation through the learning process, and as such are infeasible at scale. Hence, our first constraint is that we cannot backpropagate through the learning process. The second challenge is maintaining a consistent transfer learning objective over very long learning processes. For instance, looking at the final loss won’t do: if there are a million steps between initialisation and convergence, almost all information about the learning process lies in between. We need to derive a novel meta-objective from first principles.

When we say a task was “easy” to learn, we usually mean that it didn’t take us too long and that the process was relatively smooth. From a machine learning perspective, this implies rapid convergence. It also implies parameter updates should improve performance monotonically (well, in expectation at least). Oscillating back and forth is equivalent to not knowing what to do.

Both these notions revolve around how we travel from our initialisation to our final destination on the model’s loss surface. The ideal is going straight downhill to the parameterisation with the smallest loss for the given task. The worst case is taking a long detour with lots of back-and-forth. In Leap, we leverage the following insight:

Transferring knowledge therefore implies influencing the parameter trajectory such that it converges as rapidly and smoothly as possible.

Figure: Transferring knowledge across learning processes means that learning a new task becomes easier, in the sense that we enjoy a shorter and smoother parameter trajectory.

Scaling meta-learning: Leap

To make this intuition crisp, we need a formal framework to work in. While we won’t get into the details here (for that, read the paper), we’ll flesh out the high-level ideas. You can safely skip the math if you prefer.

A learning process starts with some initial guess of the model parameters $\theta^0$ and updates that guess via some update rule $u$ that depends on the learning objective $L$ and some observational input $x$ with target output $y$,

$$\theta^{k+1} = u\left(\theta^k; L, x, y\right), \qquad k = 0, 1, \ldots, K-1.$$

This sequence eventually converges, at which point our model hopefully has learned to solve the task. The length of this process can be formally described by the distance it traversed from the initial guess to the final parameterisation, say $\theta^K$. This distance is given by summing up the length of each update:

$$d\left(\theta^0, \ldots, \theta^K\right) = \sum_{k=0}^{K-1} \left\lVert \theta^{k+1} - \theta^k \right\rVert.$$

Assuming that we converge to a good minimum on this task, the distance $d$ of this process tells us whether the task was easy or hard. If $d$ is small, we didn’t have to travel far, so our initial guess was good. If it is large, our initial guess was poor, as we had to travel a long way to get there.
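In code, computing this distance from a recorded trajectory is straightforward. Here is a minimal NumPy sketch of the formula above:

```python
import numpy as np

def trajectory_length(snapshots):
    """d = sum_k ||theta^{k+1} - theta^k|| for parameter snapshots
    theta^0, ..., theta^K recorded during training."""
    return sum(
        np.linalg.norm(after - before)
        for before, after in zip(snapshots[:-1], snapshots[1:])
    )

# A short, straight trajectory is "easier" than a long, winding one.
straight = [np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([1.0, 0.0])]
winding = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])]
print(trajectory_length(straight))  # 1.0
print(trajectory_length(winding))   # ~2.41
```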

Consequently, we can learn to transfer knowledge across learning processes by learning an initialisation such that the expected distance we have to travel when learning a similar task is as short as possible.

This is the overall objective of Leap. Given a distribution of tasks that we can sample from, Leap learns an initialisation $\theta^0$ such that the distance of a learning process sampled from that task distribution is as short as possible in expectation. Thus, Leap extracts information across a variety of learning processes during meta-training and condenses it into a good initial guess that makes learning a new task as easy as possible. Importantly, this initial guess has nothing to do with the details of the final parameterisation on any task; it is meta-learned to facilitate the process of learning those parameters, whatever they might be.
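To illustrate the mechanics, here is a deliberately simplified sketch of a Leap-style meta-training loop on the toy slope-regression tasks from before. It minimises only the Euclidean path length and treats each next point on the trajectory as a frozen target when differentiating; the actual algorithm also measures distance along the loss surface and uses a stabilised pull-forward approximation, so treat this as intuition, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def inner_train(theta, slope, steps=100, lr=0.1):
    """Plain SGD on one task (y = slope * x), recording the trajectory."""
    snapshots = [theta.copy()]
    for _ in range(steps):
        x = rng.normal(size=16)
        grad = 2.0 * np.mean((theta * x - slope * x) * x)  # MSE gradient
        theta = theta - lr * grad
        snapshots.append(theta.copy())
    return snapshots

def meta_gradient(snapshots, eps=1e-8):
    """Gradient of the Euclidean path length, treating each next point as a
    frozen target; contributions from every step are accumulated into one
    vector that is applied to the initialisation."""
    g = np.zeros_like(snapshots[0])
    for before, after in zip(snapshots[:-1], snapshots[1:]):
        g += (before - after) / (np.linalg.norm(after - before) + eps)
    return g

theta0 = np.array([3.0])  # shared initialisation, deliberately poor
for _ in range(200):      # meta-training loop
    slope = rng.uniform(0.5, 2.0)  # sample a task from the distribution
    trajectory = inner_train(theta0.copy(), slope)
    theta0 = theta0 - 0.05 * meta_gradient(trajectory)
print(theta0)  # should have moved towards the task distribution
```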

Figure: Leap learns an initialisation that induces faster learning on tasks from the given task distribution. By minimising the distance we need to travel, we make tasks as ‘easy’ as possible to learn.

Taking a leap

We’re now ready to see if meta-learning can leap beyond the few-shot setting. To test Leap, we took a standard benchmark, Omniglot [12], and turned it into a much harder problem that cannot be solved with a few gradient steps. Omniglot is a dataset of 50 distinct alphabets, where each alphabet consists of 20 hand-drawn samples for each of its characters. To solve a task, the model draws data from the alphabet’s dataset and tries to predict which character each sample depicts. We allowed the model to take 100 gradient steps.
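Sketched in PyTorch, the per-task protocol might look as follows; the data pipeline is elided (the `alphabet_loader` argument stands in for a hypothetical per-alphabet data loader), and the learning rate is an assumption.

```python
import torch
import torch.nn as nn

def run_task(model, alphabet_loader, steps=100, lr=0.1):
    """Train on a single alphabet for a fixed budget of gradient steps."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    batches = iter(alphabet_loader)
    for _ in range(steps):
        try:
            x, y = next(batches)
        except StopIteration:  # recycle the loader if it runs out
            batches = iter(alphabet_loader)
            x, y = next(batches)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model
```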

During meta-training, we allowed the meta-learner to see a subset of alphabets, ranging from 1 to 25, and held out 10 alphabets for final evaluation. To see whether Leap offers any benefits, we tested it against no pre-training, multi-headed fine-tuning (which is cheating a bit, because it has more parameters and uses transduction, but let’s ignore that), the popular MAML [10] meta-learner, its first-order approximation (FOMAML), and Reptile [13], a meta-learner that iteratively moves the initialisation in the direction of the final task parameterisation. So, how did we do?

Figure: Left: mean learning curve on held-out tasks after meta-training on 25 tasks. Right: mean rate of convergence (area under the error curve during task training) as a function of the number of tasks used for meta-training.

When we don’t have an accurate representation of the task distribution, i.e. fewer than 5 tasks, standard fine-tuning is as good as it gets. That is expected, since we cannot generalise if our task distribution is degenerate. As the task distribution grows richer, Leap converges much faster than fine-tuning. Importantly, we find that the initialisation itself performs much better, which means it would do much better on tasks where we face data scarcity. Strikingly, the few-shot learning algorithms (MAML, FOMAML) fall short of fine-tuning.

We also tested Leap in a more demanding scenario where each task is a distinct computer vision dataset (see the paper for details). Here, learning a task required thousands of parameter updates. We found that Leap improved not only the rate of convergence but also the final performance, as faster convergence protects against overfitting. Finally, we went a little overboard and tested Leap on a very difficult transfer learning problem in Reinforcement Learning: the Atari suite. Here, learning a task requires many millions of parameter updates.

While Leap has no overhead during task training, we still need to collect full training trajectories, which makes meta-training costly. Even so, we found that meta-training for a hundred steps yielded an initialisation with better properties than starting from scratch on each game. This improvement was not due to faster convergence, but to more consistent exploration across seeds. While not definitive, we take these as encouraging results that meta-learning can solve extremely complex problems; very exciting!

Figure: Example of how Leap (orange) improved performance over a random initialisation (blue) on Atari games. Lines are averages over 10 seeds; shading gives the standard deviation. Vertical axes show normalised cumulative scores; the horizontal axis shows the number of parameter updates, in tens of millions!

Leap is a first step towards a general meta-learner that can tackle any level of complexity. It is simple and lightweight (constant memory, negligible compute overhead, linear complexity), and it can be integrated with other meta-learning algorithms that tackle other challenges, such as probabilistic reasoning or embedded task inference. We hope you’ve found some inspiration for your next idea, and we look forward to seeing what you come up with. Don’t hesitate to reach out!

References

[1] He, Kaiming, et al. “Mask R-CNN.” ICCV. 2017.

[2] Zhao, Hengshuang, et al. “Pyramid scene parsing network.” CVPR. 2017.

[3] Papandreou, George, et al. “Towards accurate multi-person pose estimation in the wild.” CVPR. 2017.

[4] Peters, Matthew E., et al. “Deep contextualized word representations.” NAACL-HLT. 2018.

[5] Howard, Jeremy, and Sebastian Ruder. “Universal Language Model Fine-tuning for Text Classification.” ACL. 2018.

[6] Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” 2018.

[7] Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805

[8] Koch, Gregory, et al. “Siamese neural networks for one-shot image recognition.” ICML Deep Learning Workshop. 2015.

[9] Vinyals, Oriol, et al. “Matching Networks for One Shot Learning.” NeurIPS. 2016.

[10] Finn, Chelsea, et al., “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” ICML. 2017.

[11] Ravi, Sachin, and Hugo Larochelle. “Optimization as a Model for Few-Shot Learning.” ICLR. 2017.

[12] Lake, Brenden, et al. “One shot learning of simple visual concepts.” CogSci. 2011.

[13] Nichol, Alex, et al. “On First-Order Meta-Learning Algorithms.” arXiv:1803.02999. 2018.