$$ \definecolor{input}{RGB}{66, 133, 244} \definecolor{output}{RGB}{219, 68, 55} \definecolor{dinput}{RGB}{244, 180, 0} \definecolor{doutput}{RGB}{15, 157, 88} \definecolor{dweight}{RGB}{102, 0, 255} $$

Backpropagation algorithm

The backpropagation algorithm is essential for training large neural networks quickly. This article explains how the algorithm works.

Please scroll down...

Simple neural network

On the right, you see a neural network with one input, one output node and two hidden layers of two nodes each.

Nodes in neighboring layers are connected with weights \(w_{ij}\), which are the network parameters.

Activation function

Each node has a total input \(\color{input}x\), an activation function \(f(\color{input}x\color{black})\) and an output \(\color{output}y\color{black}=f(\color{input}x\color{black})\). \(f(\color{input}x\color{black})\) has to be a non-linear function, otherwise the neural network will only be able to learn linear models.

A commonly used activation function is the Sigmoid function: \(f(\color{input}x\color{black}) = \frac{1}{1+e^{-\color{input}x}}\).

Error function

The goal is to learn the weights of the network automatically from data such that the predicted output \(\color{output}y_{output}\) is close to the target \(\color{output}y_{target}\) for all inputs \(\color{input}x_{input}\).

To measure how far we are from the goal, we use an error function \(E\). A commonly used error functon is \(E(\color{output}y_{output}\color{black},\color{output}y_{target}\color{black}) = \frac{1}{2}(\color{output}y_{output}\color{black} - \color{output}y_{target}\color{black})^2 \).

Forward propagation

We begin by taking an input example \((\color{input}x_{input}\color{black},\color{output}y_{target}\color{black})\) and updating the input layer of the network.

For consistency, we consider the input to be like any other node but without an activation function so its output is equal to its input, i.e. \( \color{output}y_1 \color{black} = \color{input} x_{input} \).

Forward propagation

Now, we update the first hidden layer. We take the output \(\color{output}y\) of the nodes in the previous layer and use the weights to compute the input \(\color{input}x\) of the nodes in the next layer.
$$ \color{input} x_j \color{black} = $$$$ \sum_{i\in in(j)} w_{ij}\color{output} y_i\color{black} +b_j$$

Forward propagation

Then we update the output of the nodes in the first hidden layer. For this we use the activation function, \( f(x) \).
$$ \color{output} y \color{black} = f(\color{input} x \color{black})$$

Forward propagation

Using these 2 formulas we propagate for the rest of the network and get the final output of the network.
$$ \color{output} y \color{black} = f(\color{input} x \color{black})$$
$$ \color{input} x_j \color{black} = $$$$ \sum_{i\in in(j)} w_{ij}\color{output} y_i \color{black} + b_j$$

Error derivative

The backpropagation algorithm decides how much to update each weight of the network after comparing the predicted output with the desired output for a particular example. For this, we need to compute how the error changes with respect to each weight \(\color{dweight}\frac{dE}{dw_{ij}}\).
Once we have the error derivatives, we can update the weights using a simple update rule:
$$w_{ij} = w_{ij} - \alpha \color{dweight}\frac{dE}{dw_{ij}}$$
where \(\alpha\) is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.

[Note] The update rule is very simple: if the error goes down when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}}\color{black} < 0\)), then increase the weight, otherwise if the error goes up when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}} \color{black} > 0\)), then decrease the weight.

Additional derivatives

To help compute \(\color{dweight}\frac{dE}{dw_{ij}}\), we additionally store for each node two more derivatives: how the error changes with:
  • the total input of the node \(\color{dinput}\frac{dE}{dx}\) and
  • the output of the node \(\color{doutput}\frac{dE}{dy}\).

Back propagation

Let's begin backpropagating the error derivatives. Since we have the predicted output of this particular input example, we can compute how the error changes with that output. Given our error function \(E = \frac{1}{2}(\color{output}y_{output}\color{black} - \color{output}y_{target}\color{black})^2\) we have:
$$ \color{doutput} \frac{\partial E}{\partial y_{output}} \color{black} = \color{output} y_{output} \color{black} - \color{output} y_{target}$$

Back propagation

Now that we have \(\color{doutput} \frac{dE}{dy}\) we can get \(\color{dinput}\frac{dE}{dx}\) using the chain rule.
$$\color{dinput} \frac{\partial E}{\partial x} \color{black} = \frac{dy}{dx}\color{doutput}\frac{\partial E}{\partial y} \color{black} = \frac{d}{dx}f(\color{input}x\color{black})\color{doutput}\frac{\partial E}{\partial y}$$
where \(\frac{d}{dx}f(\color{input}x\color{black}) = f(\color{input}x\color{black})(1 - f(\color{input}x\color{black}))\) when \(f(\color{input}x\color{black})\) is the Sigmoid activation function.

Back propagation

As soon as we have the error derivative with respect to the total input of a node, we can get the error derivative with respect to the weights coming into that node.
$$\color{dweight} \frac{\partial E}{\partial w_{ij}} \color{black} = \frac{\partial x_j}{\partial w_{ij}} \color{dinput}\frac{\partial E}{\partial x_j} \color{black} = \color{output}y_i \color{dinput} \frac{\partial E}{\partial x_j}$$

Back propagation

And using the chain rule, we can also get \(\frac{dE}{dy}\) from the previous layer. We have made a full circle.
$$ \color{doutput} \frac{\partial E}{\partial y_i} \color{black} = \sum_{j\in out(i)} \frac{\partial x_j}{\partial y_i} \color{dinput} \frac{\partial E}{\partial x_j} \color{black} = \sum_{j\in out(i)} w_{ij} \color{dinput} \frac{\partial E}{\partial x_j}$$

Back propagation

All that is left to do is repeat the previous three formulas until we have computed all the error derivatives.

The end.