When you peek under the hood of a deep learning library, things will look pretty complex. That’s because these libraries use a lot of abstractions to handle a wide variety of cases. But the core code of a deep learning library is actually not that complex at all. In fact, you’d be surprised by just how simple it can be. And because it’s simple, it’s very instructive: in particular, it allows you to build some intuition about how deep learning software operates. That’s what we’ll do here. In part I, we develop a **computational graph** framework, a version of which powers every deep learning library. In part II we implement the core building blocks. Part III puts the pieces together and trains a classifier on MNIST.

## Part I

#### The computational graph

A deep learning library is, somewhat simplified, a specialized library for building and executing computational graphs, which in turn describe how to carry out a computation. To make matters concrete, suppose we wanted to compute the sigmoid function

$$\sigma(\boldsymbol{x}) = \frac{1}{1 + \exp\big(-(W\boldsymbol{x} + \boldsymbol{b})\big)}$$

for some parameter matrix $W$ and parameter vector $\boldsymbol{b}$. Now, how do you evaluate $\sigma(\boldsymbol{x})$? Well, to evaluate the division, we first need to know the value of the denominator. To this end, we must evaluate the $\exp(\cdot)$ term, which in turn requires us to first evaluate the linear transformation $W\boldsymbol{x} + \boldsymbol{b}$. So to evaluate a complex function, we need to decompose it into a chain of simple operations like addition and multiplication. Let $f _ 1$ denote matrix multiplication, $f _ 2$ addition, $f _ 3$ (negative) exponentiation, and $f _ 4$ the multiplicative inverse. With these four simple functions, we can write $\sigma$ as

$$\sigma(\boldsymbol{x}) = f _ 4\Big(f _ 2\big(f _ 3\big(f _ 2(f _ 1(\boldsymbol{x}, W), \boldsymbol{b})\big),\ 1\big)\Big).$$

Following our above recipe, to evaluate $\sigma(\boldsymbol{x})$, we begin by evaluating

$$\boldsymbol{z} _ 1 = f _ 1(\boldsymbol{x}, W) = W\boldsymbol{x}.$$

Here, we have introduced the *intermediate* variable $\boldsymbol{z} _ 1$. Its only purpose is to temporarily contain the result of $f _ 1 (\boldsymbol{x}, W)$. In a similar manner, once we have $\boldsymbol{z} _ 1$, we can evaluate

$$\boldsymbol{z} _ 2 = f _ 2(\boldsymbol{z} _ 1, \boldsymbol{b}), \qquad \boldsymbol{z} _ 3 = f _ 3(\boldsymbol{z} _ 2), \qquad \boldsymbol{z} _ 4 = f _ 2(\boldsymbol{z} _ 3, 1), \qquad \boldsymbol{z} _ 5 = f _ 4(\boldsymbol{z} _ 4).$$

The final intermediate variable $\boldsymbol{z} _ 5$ now represents the value of $\sigma(\boldsymbol{x})$ for this *particular* input vector $\boldsymbol{x}$. With more complicated functions, listing out all the operations gets messy. A more compact way of representing the steps taken to evaluate $\sigma(\boldsymbol{x})$ is a directed graph $\mathcal{G}$ that charts the computational steps: each node is a variable, and edges represent the input or output dependencies of the node.

Once we start building our own computational graphs, it will be important to separate three types of nodes: (i) inputs, (ii) parameters, and (iii) intermediate variables. Occasionally, we might also have constants, like in the graph above, but those won’t interest us much.

#### Implementing the sigmoid function as a computational graph

To code this graph, we need to structure the nodes. First, let’s formalize a graph $\mathcal{G} = (V, E)$ as a set of nodes (vertices) $V$ and a set of edges $E$ connecting the nodes. For each node $v_i \in V$ we need to know its *parents* $\operatorname{Pa}(v_i) \subset V$ and its *children* $\operatorname{Ch}(v_i) \subset V$. The parents are the originating nodes of any incoming edges, and the children are the nodes hit by any outgoing edges. In our above graph, the node $\boldsymbol{z}_1$

has $\operatorname{Pa}(\boldsymbol{z} _ 1) = \{ \boldsymbol{x}, \ W \}$ and $\operatorname{Ch}(\boldsymbol{z} _ 1) = \{ \boldsymbol{z} _ 2 \}$. These concepts are central in ensuring that we execute the computations in the graph in the correct order. After all, to correctly evaluate $\boldsymbol{z} _ 2$, we need to have evaluated its parent $\boldsymbol{z} _ 1 \in \operatorname{Pa}(\boldsymbol{z} _ 2)$. Finally, we also need to know whether any operation ($f$) is associated with the node. Once we know these three components, we can compute the state of the node given its inputs. Hence, to construct a node, we’ll want to know any `parents` and any `children`, as well as any `operation` to perform. These desiderata give us the following `Node` class:

```
class Node(object):
    """Node."""

    def __init__(self, operation=None):
        self.operation = operation
        self.state = None
        self.inputs = list()
        self.outputs = list()

    def compute(self):
        """Compute state of node."""
        self.state = self.operation(self.get_input())

    def set_input(self, inputs):
        """Set input nodes."""
        if not isinstance(inputs, list):
            inputs = [inputs]
        self.inputs.extend(inputs)

    def get_input(self):
        """Get state of input nodes."""
        return [i.state for i in self.inputs]
```
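As a quick sanity check of the pattern, here’s how a couple of these nodes might be wired up by hand. The `square_sum` operation is a hypothetical one made up for this example, and the condensed `Node` is repeated so the snippet runs standalone:

```python
# Condensed copy of the Node class above, so this example is self-contained
class Node(object):
    def __init__(self, operation=None):
        self.operation = operation
        self.state = None
        self.inputs = list()

    def set_input(self, inputs):
        if not isinstance(inputs, list):
            inputs = [inputs]
        self.inputs.extend(inputs)

    def get_input(self):
        return [i.state for i in self.inputs]

    def compute(self):
        self.state = self.operation(self.get_input())


# A hypothetical operation: square the sum of all input states
square_sum = Node(lambda states: sum(states) ** 2)
a, b = Node(), Node()
square_sum.set_input([a, b])

a.state, b.state = 2.0, 3.0  # Fix the inputs
square_sum.compute()
print(square_sum.state)  # 25.0
```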

Here, the `state` of the node is `None` as long as no computation has taken place. In this way, we can construct a symbolic graph of linked nodes with `None` states that we populate once we have fixed an input. This leads us to an important implementation choice. On the one hand, we can build a *dynamic* graph, which is constructed on-the-fly for a given input. For instance, in pseudo-code:

```
def graph(x):
    h = x ** 2
    z = h + 2
    return z

out = graph(array([1]))
```

The `graph` function only gives us the blueprint for the graph. Once we pass an input to the function, a unique graph is created, one line at a time. The benefit of this approach is that we can handle variable computations fluently, like recursion or conditional branches. The drawback is that we don’t know in advance what the final graph will look like. This prevents us from optimizing the graph and requires us to cache the entire graph.
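To see why this flexibility matters, here’s a small sketch (using NumPy, with toy numbers chosen for this example) of a data-dependent branch, something a dynamic graph handles naturally because the graph is rebuilt on every call:

```python
import numpy as np

def graph(x):
    # The branch taken depends on the input itself, so the
    # resulting graph can differ from call to call
    h = x ** 2
    if h.sum() > 1:
        z = h + 2
    else:
        z = h - 2
    return z

out_a = graph(np.array([3.0]))   # takes the first branch
out_b = graph(np.array([0.5]))   # takes the second branch
```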

Alternatively, we can go with a *static* approach, where we set up a symbolic graph beforehand as a distinct object. The graph can then be populated by passing a given input to it. For instance, in pseudo-code:

```
x = Variable()
h = x ** 2
z = h + 2
run(z, inputs={x: array([1])})
```

Here, `h` and `z` are created as nominal variables with their own symbolic graphs. These graphs exist in memory and cannot be changed; for that we need to set up a new graph. We evaluate the graph by passing a specific input to run through it. The benefit of this approach is that we can implement a set of tricks to optimize the computations, like pruning the graph of redundant edges, or clearing intermediate variables we don’t need anymore. More practically, it’s also easier to implement, and for that reason we will implement a static computational graph. To compute the sigmoid function, then, we first create the graph by instantiating our nodes with their respective operations:

```
import numpy as np

def matmul(inputs):
    x, W = inputs
    return np.dot(W, x)

def matadd(inputs):
    x, b = inputs
    return x + b

def exp(inputs):
    x, = inputs
    return np.exp(-x)

def inv(inputs):
    x, = inputs
    return 1 / x

x, W, b, c = Node(), Node(), Node(), Node()
z1, z2, z3, z4, z5 = Node(matmul), Node(matadd), Node(exp), Node(matadd), Node(inv)

z1.set_input([x, W])
z2.set_input([z1, b])
z3.set_input(z2)
z4.set_input([z3, c])
z5.set_input(z4)
```

We now have a symbolic representation of the sigmoid function. But we can’t make any computations yet, since the parameters are vacuous and there’s no input. Once we’ve fixed a set of parameters, we can execute the graph for any given (acceptable) input. As an example, let’s initialize our parameters to $W = [3, \ 1]$ and $\boldsymbol{b} = 0$ by setting `W.state = np.array([3, 1])` and `b.state = 0`, along with `c.state = 1` for the constant node. To then actually compute $\sigma(\boldsymbol{x})$ for some input, we fix our input node in a similar manner, and then iterate over our intermediate variables sequentially. Doing this for an input grid $\boldsymbol{x} \in [-10, 10]^2$ gives us the characteristic S-curve.

```
W.state = np.array([3, 1])
b.state = 0
c.state = 1  # Constant node

for i in np.linspace(-10, 10, 25):
    for j in np.linspace(-10, 10, 25):
        x.state = np.array([i, j])  # Set input node
        # Compute graph
        for z in [z1, z2, z3, z4, z5]:
            z.compute()
```
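As a sanity check, we can verify by hand that the chain of operations reproduces the closed-form sigmoid for a single input. This is a standalone NumPy sketch, independent of the `Node` class, with an arbitrary example input:

```python
import numpy as np

W = np.array([3, 1])
b, c = 0, 1
x = np.array([0.5, -2.0])

z1 = np.dot(W, x)   # f1: matrix multiplication
z2 = z1 + b         # f2: addition
z3 = np.exp(-z2)    # f3: negative exponentiation
z4 = z3 + c         # f2 again: add the constant 1
z5 = 1 / z4         # f4: multiplicative inverse

# Compare against the closed-form expression
direct = 1 / (1 + np.exp(-(np.dot(W, x) + b)))
assert np.isclose(z5, direct)
```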

Now, notice something: for every possible input, we can compute the exact value of each intermediate variable $\boldsymbol{z} _ i$. In the graphs above, to the left we see how the linear transformation changes across the input space; the middle graph visualizes the space of the intermediate variable $\boldsymbol{z} _ 3$; and to the right we see how $\sigma$ varies across the input space. Because we can represent each intermediate variable in this way, we can study how $\sigma$ varies with respect to *any* variable. In other words, we can study any partial derivative of $\sigma$. But rather than running over an entire grid of possible values, wouldn’t it be better to structure our graph to give us the gradient *automatically*?

#### Going in reverse

Consider a continuous and at least once differentiable function $f: \mathbb{R}^{n} \mapsto \mathbb{R}$ that we compose as $f(\boldsymbol{x}) = ( f _ 3 \circ f _ 2 \circ f _ 1 ) (\boldsymbol{x})$. Writing $\boldsymbol{h} _ 1 = f _ 1(\boldsymbol{x})$ and $\boldsymbol{h} _ 2 = f _ 2(\boldsymbol{h} _ 1)$ for the intermediate variables and applying the chain rule, we can find the *gradient* as

$$\nabla f(\boldsymbol{x}) = \nabla f _ 3 (\boldsymbol{h} _ 2) \, \nabla f _ 2 (\boldsymbol{h} _ 1) \, \nabla f _ 1 (\boldsymbol{x}).$$

Now notice something really nice: the derivative of $f$ is itself a product of composite functions *of the derivatives of the composites*. Make sure you got that. It’s important because it means that if we have a computational graph of $f$, we can reverse engineer a computational graph for $\nabla f$. The one caveat is that we need to know the derivatives of the composite functions. But the entire idea, of course, is that the composite functions are simple enough that we have access to their derivatives. Building the computational graph for $f$ as

we can construct a graph for the gradient $\nabla f$:

Note how we go in reverse order: we first define the intermediate variable $\boldsymbol{d} _ 1$ by differentiating the last composite function ($f _ 3$) with respect to its input ($\boldsymbol{h} _ 2$). We then move down a node and differentiate the second composite function ($f _ 2$) with respect to its input ($\boldsymbol{h} _ 1$). Multiplying these terms gives the third intermediate variable of the derivative ($\boldsymbol{d} _ 3$). Repeating the process, we ultimately reach the intermediate variable $\boldsymbol{d} _ 5$, which gives us $\nabla f(\boldsymbol{x})$. To hammer this point in, we’ll visualize the graph below. Here, orange highlights nodes that represent derivatives with respect to the objective function $f$, and grey highlights derivatives of composite functions.
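This reverse-order recipe can be checked numerically with three toy composites (all chosen for this example; none come from the sigmoid graph above):

```python
# Toy composites and their derivatives
f1, df1 = lambda x: 3 * x,  lambda x: 3.0
f2, df2 = lambda h: h + 2,  lambda h: 1.0
f3, df3 = lambda h: h ** 2, lambda h: 2 * h

x = 1.5
h1 = f1(x)   # Forward pass: store the intermediate variables
h2 = f2(h1)

# Reverse order: differentiate the last composite first,
# then multiply local derivatives as we move down the graph
grad = df3(h2) * df2(h1) * df1(x)

# Compare against a central finite-difference approximation
eps = 1e-6
numeric = (f3(f2(f1(x + eps))) - f3(f2(f1(x - eps)))) / (2 * eps)
assert abs(grad - numeric) < 1e-4
```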

What stands out here is that each variable node is associated with its own derivative. Let’s zoom in on $\boldsymbol{h} _ 1$:

To compute $\partial f(\boldsymbol{x}) / \partial \boldsymbol{h} _ 1$, we need to know the derivative(s) of the function(s) used by $\operatorname{Ch}(\boldsymbol{h} _ 1)$, in our case $\nabla f _ 2$, as well as the derivatives of these nodes ($\partial f(\boldsymbol{x}) / \partial \boldsymbol{h} _ 2$). The upshot of this observation is that what were children of a node during the forward pass become parents during the backward pass. This is a crucial insight, because it means we need a way to order the computations in the graph depending on whether we go forward or backward. This is known as **topological sorting**.
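A standard way to produce such an ordering is Kahn’s algorithm: repeatedly emit a node whose parents have all been emitted. Here’s a sketch on the sigmoid graph from before, using plain strings for node names (this is independent of the classes we build below):

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm: emit nodes whose parents have all been emitted."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        indegree[child] += 1
        children[parent].append(child)

    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for child in children[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

nodes = ["x", "W", "b", "c", "z1", "z2", "z3", "z4", "z5"]
edges = [("x", "z1"), ("W", "z1"), ("z1", "z2"), ("b", "z2"),
         ("z2", "z3"), ("z3", "z4"), ("c", "z4"), ("z4", "z5")]

forward_order = topological_sort(nodes, edges)
backward_order = forward_order[::-1]  # Reversing gives the backprop order
```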

#### Implementing a general-purpose computational graph

In terms of implementation, there’s a slight inconvenience with the above graphical representation: for a given node, to compute its state we need one function, but to compute the derivative we need another (or several others). It would be much easier if we could simply “invert” the nodes we already have by running whatever function they’re associated with either forward or backward. For this reason, the computational graph we’re going to implement will look slightly different. Each node will either have no function attached, in which case it’s an input or parameter node, or a function $f _ i$ attached that we can run either `forward` to compute the `state` of the node, or `backward` to compute the `grad` of the node. Note here that the gradient we are after is with respect to the full function, so in order to compute it we need to take into account

- the state of the node,
- the state of the node’s parents,
- the gradients of the node’s children.

Let’s build the graph for $f(\boldsymbol{x}) = (f _ j \circ f _ i ) (\boldsymbol{x})$. We need three nodes: one for $f _ i$ and one for its child node $f _ j$, as well as an input node. These are connected as follows:

Here, $f_{ji} = f_j \circ f_i$ for brevity. We represent each node in the graph with a square. The input node is special in that it is not associated with any function; it merely contains the input variable. The square nodes are the ones we instantiate. Each node has a `state` that is computed in the forward pass of the graph (solid directed edges). As before, magenta denotes intermediate variables computed during the forward pass. The dashed directed edges give us the dependencies of the backward pass. The `grad` gives the gradient of $f$ down the function associated with that node, and for the input node this will represent the full function.

Before we can build our new `Node` class, we need to deal with two cases. First, suppose a node has more than one parent (i.e. input dependencies). To accommodate this scenario, we can extend the `grad` to be a list or dictionary. If instead we have several output dependencies (children), we need to figure out how to deal with several incoming `grad`s. For this, consider $f(g(x), h(x))$. The node for $x$ has two children, $g(x)$ and $h(x)$, and so would have two incoming `grad`s. Since

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial x} + \frac{\partial f}{\partial h}\frac{\partial h}{\partial x},$$

we can conveniently just *sum* the two incoming gradients. Going forward, we’ll use slightly different terminology. Since running the graph backwards inverts the parent–child relationship, it’s better to speak of a node’s **inputs** and **outputs**, where these refer to the *forward pass*. Hence the inputs are always the nodes required to compute the `state`, and the outputs are the nodes that take the `state` as their input. With that in mind, here’s the base class for a `Node`:

```
class Node(object):
    """Node base class."""

    def __init__(self):
        self.grad = None    # Will be a dict
        self.state = None   # Will be a numpy array
        self.inputs = list()
        self.outputs = list()

    def set_input(self, inputs):
        """Set input nodes."""
        if not isinstance(inputs, list):
            inputs = [inputs]
        for i in inputs:
            if i not in self.inputs:
                self.inputs.append(i)

    def set_output(self, output):
        """Set output nodes."""
        if not isinstance(output, list):
            output = [output]
        for out in output:
            if out not in self.outputs:
                self.outputs.append(out)

    def get_output(self):
        """Get output nodes."""
        return self.outputs

    def get_input(self):
        """Get state of input nodes."""
        return [i.state for i in self.inputs]

    def get_grad(self):
        """Get gradient of output nodes."""
        out = [o.grad[self]
               for o in self.outputs
               if o.grad[self] is not None]
        if len(out) == 0:
            # This happens for cost function nodes
            return None
        if len(out) == 1:
            # This is the default for gates
            return out[0]
        # Can happen for gates or variables with several consumers
        return sum(out)
```
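The summation in `get_grad` is exactly the multivariate chain rule from before. A quick numeric check, with toy functions $g(x) = x^2$ and $h(x) = 3x$ chosen for this example:

```python
# f(g(x), h(x)) with f(u, v) = u * v, so df/dx = h(x) * g'(x) + g(x) * h'(x)
g = lambda x: x ** 2
h = lambda x: 3 * x
f = lambda x: g(x) * h(x)

x = 2.0
grad_via_g = h(x) * (2 * x)  # Incoming grad through the g(x) child
grad_via_h = g(x) * 3        # Incoming grad through the h(x) child
total = grad_via_g + grad_via_h

# A central finite difference agrees with the summed gradients
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(total - numeric) < 1e-3
```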

The `get_grad` method handles a few exceptions we can ignore for now. The `Node` class is our base class; what we now need are distinct classes for inputs, parameters, and nodes with functions. We’ll also have an output node that we will use for training a model, but for now you can ignore that node. The value of separating these node types is that we can easily distinguish between them once we implement our graph. The input node, then, is implemented as follows:

```
class Input(Node):
    """Input node.

    Input nodes contain the input data.

    Args
        X (array, None): input data
    """

    def __init__(self, X=None):
        super(Input, self).__init__()
        self.state = X
        self.__cls__ = self.__class__

    def clear(self):
        """Clear state."""
        self.state = None
```

You’ll notice that there’s no gradient here. The reason is that, for our purposes, we won’t need it. However, extending the input node to pass on a gradient is an easy exercise. Now, the parameter node is similar to an input node, but differs when it comes to the backward pass:

```
class Variable(Node):
    """Variable node.

    Variable nodes contain parameters to be backpropagated.

    Args
        X (array, int, float): parameter
    """

    def __init__(self, X=None):
        super(Variable, self).__init__()
        self.state = X
        self.__cls__ = self.__class__

    def backprop(self):
        """Backpropagate through the node."""
        self.grad = self.get_grad()
```

Here, we *don’t* have a `clear` method, since we don’t want to wipe the parameters. Instead, we have a `backprop` method, and if you compare with our computational graph above, notice that for a parameter we simply collect the gradients without computing anything. All the heavy lifting occurs in the `Gate` nodes, each associated with their own function.

```
class Gate(Node):
    """Node for performing an operation on inputs.

    Args
        operation (obj): instance of an op class
    """

    def __init__(self, operation):
        super(Gate, self).__init__()
        self.operation = operation
        self.__cls__ = self.operation.__class__

    def forward(self, train=True):
        """Compute the state of the node in a forward pass."""
        inputs = self.get_input()
        if hasattr(self.operation, "train"):
            self.operation.train = train
        self.state = self.operation.forward(inputs)

    def backprop(self):
        """Backpropagate through the node."""
        inputs = self.get_input()
        state = self.state
        grad = self.get_grad()
        grad = self.operation.backprop(inputs, state, grad)
        self.grad = dict(zip(self.inputs, grad))

    def clear(self):
        """Clear state and gradient cache."""
        self.state = self.grad = None
        try:
            self.operation.clear()
        except AttributeError:
            pass
```

There’s quite a bit going on here. First, we now have a `forward` method. This collects the input and performs the `forward` pass of the node’s `operation`. During `backprop`, we collect gradients from output nodes as well as the inputs and the node’s state, and compute the `backprop` of the `operation`. Hence `grad` is a dictionary with the input nodes as keys and their respective gradients as values. A minor detail here is that we’re doing *backpropagation*; that is, at each point we’re calculating the full derivative with respect to the final function (compare the orange nodes with the grey nodes). Hence, the operation itself does the multiplication with the incoming gradient. The final node we’ll need is the output node, and its only purpose is to hold any labels during training. We’ll post it here for reference, but you can ignore it for now.

```
class Output(Node):
    """Output node.

    The output node contains the labels for the data at hand.

    Args
        X (array, None): labels
    """

    def __init__(self, X=None):
        super(Output, self).__init__()
        self.state = X
        self.__cls__ = self.__class__

    def clear(self):
        """Clear state."""
        self.state = None
```
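To make concrete the interface a `Gate`’s `operation` must expose, here’s a hypothetical op class (made up for this sketch; part II will develop the real ones) implementing the sigmoid with `forward` and `backprop` methods matching the calls above:

```python
import numpy as np

class Sigmoid(object):
    """Hypothetical op: forward computes sigma(x); backprop returns the
    local derivative sigma(x)(1 - sigma(x)) times the incoming gradient."""

    def forward(self, inputs):
        x, = inputs
        return 1 / (1 + np.exp(-x))

    def backprop(self, inputs, state, grad):
        if grad is None:
            # Terminal node: seed the backward pass with ones
            grad = np.ones_like(state)
        # One gradient per input node; here there is just one input
        return [state * (1 - state) * grad]

op = Sigmoid()
state = op.forward([np.array([0.0, 2.0])])
grads = op.backprop([np.array([0.0, 2.0])], state, np.array([1.0, 1.0]))
```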

Now, rather than manually keeping track of all the nodes we’ve created, it will be very helpful to have a `Graph` object to wrap around our nodes. Since we’re using a static approach, this class will have all nodes at its disposal. Its main task is to make sure that we execute each node in the correct order for the `forward` and `backward` passes, the **topological sorting** mentioned earlier. In the interest of space, we’ll make a very crude sorting here by leveraging the fact that we know all nodes in the graph.

```
class ComputationalGraph(object):
    """Computational Graph.

    Class for hosting a computational graph and performing
    forward and backward passes.
    """

    def __init__(self):
        self.flag = 0
        self.nodes = list()
        self._input_nodes = list()
        self._output_nodes = list()

    def add_node(self, node, parent_nodes=None):
        """Add node to graph.

        Args
            node (Node): instance of the Node class
            parent_nodes (list, None): list of input nodes, or None
        """
        # Register if node requires external input
        if isinstance(node, Input):
            self._input_nodes.append(node)
        if isinstance(node, Output):
            self._output_nodes.append(node)

        # Set inputs and outputs
        if parent_nodes is not None:
            if not isinstance(parent_nodes, list):
                parent_nodes = [parent_nodes]
            for inp in parent_nodes:
                inp.set_output(node)
                node.set_input(inp)

        # Add node to graph
        self.nodes.append(node)

    def sorted_topologically(self, reverse=False):
        """Generate nodes according to forward dependencies.

        Args
            reverse (bool): whether to generate nodes in reverse.
        """
        T = self.sort_nodes()
        N = self.n_nodes()
        for i in range(N):
            if reverse:
                i = N - i - 1
            yield T[i]

    def sort_nodes(self, idx=None):
        """Return list of nodes sorted topologically for forward pass."""
        T = list(self._input_nodes)  # Copy so we don't mutate the register
        N = self.n_nodes()
        for i in range(N):
            if i >= len(T):
                break
            n = T[i]
            outputs = n.get_output()

            # Add all inputs to the outputs (i.e. params)
            inputs = [inp for out in outputs if out is not None
                      for inp in out.inputs]
            for inp in inputs:
                if inp not in T:
                    T.append(inp)

            # Add all output nodes
            for out in outputs:
                if out is not None and out not in T:
                    T.append(out)

        if idx is None:
            return T
        return T[idx]

    def forward(self, X, y=None, train=True):
        """Forward pass through graph (topologically).

        Args
            X (array): input array.
            y (array): label array.
            train (bool): whether the forward pass is for training.
        """
        self._initialize(X, y)
        for node in self.sorted_topologically():
            if isinstance(node, Gate):
                node.forward(train=train)

        # Set flag to 1 (forward pass completed)
        self.flag = 1

    def backprop(self):
        """Backward calculation.

        Runs backpropagation on a forward pass of the graph.
        """
        if not self.flag == 1:
            raise ValueError("Need a forward pass to do backprop.")
        for node in self.sorted_topologically(reverse=True):
            if isinstance(node, (Output, Input)):
                continue
            node.backprop()

    def _initialize(self, X, y):
        """Initialize graph on X and y."""
        for n in self._input_nodes:
            n.state = X
        if y is not None:
            for n in self._output_nodes:
                n.state = y

    def n_nodes(self):
        """Number of nodes in graph."""
        return len(self.nodes)

    def clear(self):
        """Clear cache."""
        self.flag = 0
        for node in self.nodes:
            if not isinstance(node, Variable):
                # Do not clear the parameters
                node.clear()
```

## Wrapping up

So, we now have our computational graph! To compute a function, we now need a set of `op` classes that implement a `forward` and `backward` pass for a given function. That we’ll leave for the coming part II. In the meantime, feel free to play around and see if you can implement an `op` class yourself, or take a peek at the full library at the GitHub repo.