Backpropagation algorithm

Backpropagation is a technique used to teach a neural network that has at least one hidden layer.

This is part 2 of a series of github repos on neural networks

part 1 - simplest network
part 2 - backpropagation (you are here)
part 3 - backpropagation-continued

Theory
Code example
References

Theory

Introducing the perceptron

A perceptron is a processing unit that takes an input , transforms it using an activation function and outputs the result .

Within a neural network, its input is the sum of the previous layer node outputs times their corresponding weight, plus the previous layer bias:

If we treat the bias as an additional node in a layer with a constant value of , then we can simplify the equation:

Activation functions

Why do we need an activation function? Without it the output of every node will be linear, making the neural network output a linear function of the inputs. Since the combination of two linear functions is also a linear function, you can't compute more interesting functions without non-linear ones. This means that the network will only be able to solve problems that can be solved with linear regression.

If then typical activation functions are:

Sigmoid
ReLU or rectified linear unit
tanh

Backpropagation

The backpropagation algorithm is used to train artificial neural networks, more specifically those with more than two layers.

It's using a forward pass to compute the outputs of the network, calculates the error and then goes backwards towards the input layer to update each weight based on the error gradient.

Notation

, are inputs to a node for layers respectively.
, are the outputs from a node for layers respectively.
is the expected output of a node of the output layer.
are weights of node connections from layer to and from layer to respectively.
is the current association out of associations.

We will assign the following activation functions to each layer nodes for all following examples:

input layer -> identity function
hidden layer -> sigmoid function
output layer -> identity function

The forward pass

During the forward pass, we feed the inputs to the input layer and get the results in the output layer.

The input to each node in the hidden layer is the sum of the output from all nodes of the input layer times their corresponding weight:

Since the hidden layer's activation function for each node is the sigmoid, then their output will be:

In the same manner, the input to the output layer nodes are

and their output is the same since we assigned them the identity activation function.

Once the inputs have been propagated through the network, we can calculate the error. If we have multiple associations, we simply sum the error of each association.

The backward pass

Now that we have the error, we can use it to update each weight of the network by going backwards layer by layer.

We know from part 1 of this series that the change of a weight is the negative of that weight's component in the error gradient times the learning rate. For a weight between the last hidden layer and the output layer, we then have

We can find the error gradient by using the chain rule

Therefore the change in weight is

For multiple associations, then the change in weight is the sum of each association

Similarly, for a weight between hidden layers, in our case between the input layer and our first hidden layer, we have

Here the calculations are slightly more complex. Let's analyze the delta term and understand how we got there. We start by calculating the partial derivative of in respect to the error by using the chain rule

Remember that our activation function is the sigmoid function and that its derivative is

Again, the change in weight for all associations is the sum of each association

Algorithm summary

First, initialize network weights to a small random value.

Repeat the steps below until the error is about 0

for each association, propagate the network forward and get the outputs
- calculate the term for each output layer node ()
- accumulate the gradient for each output weight ()
- calculate the term for each hidden layer node ()
- accumulate the gradient for each hidden layer weight ()
update all weights and reset accumulated gradients ()

Visualizing backpropagation

In this example, we'll use actual numbers to follow each step of the network. We'll feed our 2x2x1 network with inputs and we will expect an output of . To make matters simpler, we'll initialize all of our weights with the same value of . However, keep in mind that normally weights are initialized using random numbers. We will also design the network with a sigmoid activation function for the hidden layer and the identity function for the input and output layers and we'll use as our learning rate.

Forward pass

We start by setting all of the nodes of the input layer with the input values; .

Since the input layer nodes have the identity activation function, then .

We then propagate the network forward by setting the layer node inputs () with the sum of all of the previous layer node outputs times their corresponding weights:

We then activate the layer nodes by passing it's inputs to the sigmoid function

And we propagate those results to the final layer

Since the activation function of our output nodes is the identity, then

Backward pass

On the way back, we first calculate the term for the output node,

And using the term we calculate the gradient for each weight between and layer nodes:

We then do the same thing for each hidden layer (just the one in our case):

And calculate the gradient for each weight between and layer nodes:

The last step is to update all of our weights using the calculated gradients. Note that if we had more than one association, then we would first accumulate the gradients for each association and then update the weights.

As you can see the weights changed by a very little amount, but if we were run a forward pass again using the updated weights, we should normally get a smaller error than before. Let's check...

We had on our first iteration and we get after the weight changes.

We had and we get after the weight changes.

We successfully reduced the error! Although these numbers are very small, they are much more representative of a real scenario. Running the algorithm many times over would normally reduce the error down to almost 0 and we'd have completed training the network.

Code example

The example teaches a 2x2x1 network the XOR operator.

Where is the sigmoid function for the hidden layer nodes.

Note that the XOR operation could not be solved with the linear network used in part 1 because the dataset is distributed non-linearly. Meaning you could not pass a straight line between the four XOR inputs to divide them into the correct two categories. If we replaced the hidden layer node activation functions from sigmoid to identity, this network wouldn't be able to solve the XOR problem as well.

Feel free to try it out yourself and experiment with different activation functions, learning rates and network topologies.

References

Artificial intelligence engines by James V Stone (2019)
Complete guide on deep learning: http://neuralnetworksanddeeplearning.com/chap2.html
Flow of backpropagation visualized: https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
Activation functions: https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

gokadin / ai-backpropagation

Programming Languages

Labels

Projects that are alternatives of or similar to ai-backpropagation