Backpropagation is the backbone of Deep Learning. Some time back, when I was trying to understand what ‘learning’ is, I realized that I didn’t really know much about how backpropagation is actually implemented. I just knew how to do a simple loss.backward(). I soon found out that there is much more to backpropagation than the chain rule. ‘Automatic Differentiation’ and the ‘Message passing view of Backpropagation’ especially amazed me. When I set out to learn in depth about backpropagation, I couldn’t find everything necessary in one place. Some articles only cover the basics of backprop; some lightly touch on autodiff without discussing its basics. I hope this article serves as a good resource for people trying to understand backpropagation and how it’s implemented in modern libraries. Let me know in the comments if I am missing any aspect.
Note:
a) The article may seem math-heavy. To help build intuition, I have supplemented the maths with visual elements.
b) The article is not intended as a quick read. However, to keep the main body short, I have used Appendices for supplementary concepts.
Intro to backprop:
Central to the concept of backprop is the idea of a cost function. Cost functions let us choose what we want to achieve. In most cases, either we want to maximize rewards or we want to minimize errors. A cost function helps formalize the uncertainty or randomness of our models [check Appendix A]. Once we can formalize randomness, we can use optimization algorithms to reduce it.
Let’s say our model comprises parameters $\theta$ and our cost function is $C(\theta)$. So, with gradients $\frac{\partial C}{\partial \theta}$ we can update the parameters using gradient descent to change $C$ [for more on gradients and gradient descent check Appendix B].
A forward pass in a simple Multilayer Perceptron looks like this:
The above model has weights such as $w_1$ as its trainable parameters. A partial derivative of the cost function with respect to a parameter like $w_1$ ($\frac{\partial C}{\partial w_1}$) tells how minute changes in $w_1$ change the value of $C$. But $w_1$ doesn’t directly affect the value of the cost function. The value of $w_1$ affects the output of Node 1. This output value later affects the outputs of Nodes 3 and 4. Nodes 3 and 4 affect the Output Node. The Output Node directly affects $C$. So you can see that the connection between $w_1$ and $C$ is kind of a chain with multiple intermediary nodes. So the question becomes: how do we find the value of $\frac{\partial C}{\partial w_1}$?
All the paths that connect a parameter to the cost function contribute to the total derivative of that parameter. In the above case, since there are 2 paths that connect $w_1$ with $C$, $\frac{\partial C}{\partial w_1}$ has two terms:
$\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial o_5}\frac{\partial o_5}{\partial o_3}\frac{\partial o_3}{\partial o_1}\frac{\partial o_1}{\partial w_1} + \frac{\partial C}{\partial o_5}\frac{\partial o_5}{\partial o_4}\frac{\partial o_4}{\partial o_1}\frac{\partial o_1}{\partial w_1}$
where $o_i$ is the output of the $i$th node, considering the Output Node ‘O’ as Node 5. We got the above expression using the Chain Rule [check Appendix C].
If you look at the gradients of our model parameters, you will notice two fundamental things:
1. To update the value of $w_1$ we have to save 6 Jacobians. For a much deeper network, the number of Jacobians we have to save will keep increasing. This is a big problem because Jacobians can be very large [check Appendix D].
2. Repetition of Jacobians: if you check the next gif, you will notice how many terms are repeated and need to be calculated again and again.
Automatic Differentiation:
Practically, we calculate gradients using Automatic Differentiation (also called autodiff or autograd). We use it in almost all libraries that support neural networks because it circumvents the aforementioned problems. Autodiff achieves this by leveraging the computational graph (CG) and converting all operations into primitive operations.
Converting Computational graph into primitive operations:
This allows us to reformulate the entire CG in terms of simple functions and basic operations like add, multiply, exp, log, etc. Converting the CG into primitive operations is done behind the scenes in PyTorch and TensorFlow. The reason it’s done will soon become clear.
We can rewrite the operations in our Node 1 as:
Similarly, for all the nodes, we can rewrite the CG as:
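Here is a minimal sketch of what this decomposition looks like, assuming for illustration that Node 1 computes $\sigma(w_1 x_1 + w_2 x_2)$ (the exact operation in the figure isn’t important):

```python
import numpy as np

# Hypothetical example: suppose Node 1 computes sigmoid(w1*x1 + w2*x2).
# Broken into primitive operations, the same computation becomes a chain
# of simple steps, each with an easy, known derivative.
x1, x2 = 0.5, -1.0               # inputs
w1, w2 = 0.3, 0.8                # trainable parameters

a = w1 * x1                      # primitive: multiply
b = w2 * x2                      # primitive: multiply
c = a + b                        # primitive: add
o1 = 1.0 / (1.0 + np.exp(-c))    # primitive: sigmoid (itself exp, add, divide)

print(o1)
```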
Traversing the CG in reverse order
First off, we use a different notation in autodiff. We represent $\frac{\partial C}{\partial v_i}$ (the gradient of the cost with respect to the value of node $i$) as $\bar{v}_i$. The autodiff algorithm can be written as:
- Find the value of all primitive nodes in the forward pass.
- In the backward pass: assign the last node a gradient of 1. Iterate through the CG in reverse order, assigning each node a gradient such that:
$\bar{v}_i = \sum_{j \in \text{children}(i)} \bar{v}_j\,\frac{\partial v_j}{\partial v_i}$
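Here is a minimal sketch of this backward sweep on a tiny, made-up graph (the values and operations are illustrative, not the ones from the figure):

```python
import math

# Tiny computational graph: C = exp(a*b) + a*b, built from primitives mul, exp, add.
a, b = 0.5, 2.0

# Forward pass: compute and store the value of every primitive node.
v1 = a * b          # mul
v2 = math.exp(v1)   # exp
v3 = v2 + v1        # add  -> this is the "cost" C

# Backward pass: the last node gets gradient 1, then we walk the graph in
# reverse, giving each node v_bar = sum over its children of child_bar * d(child)/d(node).
v3_bar = 1.0
v2_bar = v3_bar * 1.0                            # d(v2 + v1)/dv2 = 1
v1_bar = v3_bar * 1.0 + v2_bar * math.exp(v1)    # v1 has two children: v3 and v2
a_bar = v1_bar * b                               # d(a*b)/da = b
b_bar = v1_bar * a                               # d(a*b)/db = a

print(a_bar, b_bar)   # matches b*(exp(a*b) + 1) and a*(exp(a*b) + 1)
```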
If you want to check out the gradients of all parameters computed in the above manner, check Appendix E.
Key properties:
- In autodiff, we iteratively apply the chain rule on each node as opposed to finding the chain rule ‘expression’ for each node.
- Since it passes gradients down the graph, there is no repetition of terms.
- If you notice, all gradients are of the form $\bar{v}_j\,\frac{\partial v_j}{\partial v_i}$, i.e. a vector times a Jacobian. This avoids the costly multiplication of multiple Jacobians, which, as we discussed, tend to be huge.
- The overall cost of doing backpropagation depends linearly on the time taken to do the forward pass.
Writing Jacobian products for primitive operations:
We still haven’t answered why we reformulated our CG down to primitive operations. There are 2 major reasons:
- We can bypass the construction of Jacobians:
As you noticed, in autodiff the gradients are of the form $\bar{v}_j\,\frac{\partial v_j}{\partial v_i}$. Multiplication of a vector with a Jacobian can still be costly. If we break down our CG into only user-defined primitive functions like add, exp and log, then we can directly return the gradients of nodes. To do so, we use functions called VJPs (Vector-Jacobian Products), or sometimes JVPs (Jacobian-Vector Products). They directly compute the gradients for a particular primitive node. E.g., let’s take a simple primitive node which does addition:
$c = a + b$, where $c$ is the child node and $a, b$ are the parent nodes. $\bar{c}$ is the gradient coming from the child, and we want to find $\bar{a}$ and $\bar{b}$. We know from the autodiff algorithm:
$\bar{a} = \bar{c}\,\frac{\partial c}{\partial a} = \bar{c} \cdot 1 = \bar{c}, \qquad \bar{b} = \bar{c}\,\frac{\partial c}{\partial b} = \bar{c} \cdot 1 = \bar{c}$
A primitive node that does exp(): $c = \exp(a)$
$\bar{c}$ is the gradient coming from the child. Then its VJP is:
$\bar{a} = \bar{c}\,\frac{\partial c}{\partial a} = \bar{c}\,\exp(a) = \bar{c}\,c$
A primitive node that does log(): $c = \log(a)$
$\bar{c}$ is the gradient coming from the child. Then its VJP is:
$\bar{a} = \bar{c}\,\frac{\partial c}{\partial a} = \frac{\bar{c}}{a}$
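As a sketch, the VJPs for these three primitives could be written like this; the function names and the dispatch table are my own illustration, not the API of PyTorch or any other library:

```python
import numpy as np

# Each VJP takes the gradient flowing in from the child (c_bar) and the inputs
# of the primitive, and returns the gradients to pass to the parents,
# without ever materialising a full Jacobian.
def add_vjp(c_bar, a, b):
    return c_bar, c_bar                # d(a+b)/da = d(a+b)/db = 1

def exp_vjp(c_bar, a):
    return (c_bar * np.exp(a),)        # d(exp(a))/da = exp(a)

def log_vjp(c_bar, a):
    return (c_bar / a,)                # d(log(a))/da = 1/a

vjp_table = {"add": add_vjp, "exp": exp_vjp, "log": log_vjp}

a = np.array([1.0, 2.0])
c_bar = np.array([0.1, 0.2])           # gradient arriving from the child node
print(vjp_table["exp"](c_bar, a))      # elementwise c_bar * exp(a)
```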
I hope you now understand how these user-defined VJPs of primitive operations circumvent the costly operations of Jacobian construction and multiplication. A cool thing is that even if we write a complex, composite function, our autodiff library will break it into the corresponding primitive functions and take care of backpropagation using VJPs.
- If we can break down a CG into its primitive operations and define the corresponding VJPs, we have automated backpropagation. Yaaayyyyy!!!!! Well, not so quick; there are many more things we have yet to take care of. First, creating structures that enable the identification of primitive operations. Second, in real-world applications everything is done using matrices, so we will have to take care of matrix calculus as well. But the ability to only worry about the forward pass and let your library take care of the rest is a tremendous boost to productivity.
Message passing view of backprop
In this section, we will see how to leverage the recursive nature of backpropagation. Due to this property, we can design architectures containing many modules arranged in any shape, as long as these modules follow certain conditions. The backpropagation signal passed between parent and child modules here is similar to the node gradients $\bar{v}$ we saw in autodiff.
A modular view of architectures:
We can plan our architectures as interconnected modules. Each module has 4 types of signals passing through it:
- Input ($x$)
- Output ($y$): a module processes the Input and transforms it to give an Output
- Incoming Gradients ($\frac{\partial C}{\partial y}$): during the backward pass, these are the gradients coming from the children of this module.
- Outgoing Gradients ($\frac{\partial C}{\partial x}$): during the backward pass, these are the gradients that a module sends to its parents.
A module should have at least these 3 functions:
- forward(): converts the inputs of a module to its outputs
- backward(): finds the gradients of the outputs wrt the inputs ($\frac{\partial y}{\partial x}$) of the module
- parameter_gradients(): finds the gradients of the outputs wrt the trainable parameters of the module.
And that’s it. Now you can rearrange these modules in different shapes to form any architecture that uses backprop. However, we still have not discussed how to formulate backward() and parameter_gradients() in the case of multiple children, and guess what we will use? The good old Chain Rule.
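Before getting to multiple children, here is a minimal sketch of such a module, using a plain linear layer as the example; the class and method names follow the description above but are my own, not any library’s API:

```python
import numpy as np

# Minimal sketch of the modular interface described above.
# The module computes y = x @ W, with W as its only trainable parameter.
class LinearModule:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01  # trainable parameter

    def forward(self, x):
        self.x = x                    # cache the input for the backward pass
        return x @ self.W             # output, sent on to child modules

    def backward(self, incoming_grad):
        # outgoing gradient dC/dx = dC/dy combined with dy/dx, sent to parents
        return incoming_grad @ self.W.T

    def parameter_gradients(self, incoming_grad):
        # dC/dW = dy/dW combined with dC/dy, used to update W
        return self.x.T @ incoming_grad
```

During training, each module’s backward() output becomes the incoming gradient of its parents, which is exactly the message passing described in this section.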
Gradient flow between different modules:
The end expression is the same with single or multiple children. But to build intuition, let’s start with a single child.
- Single child:
Let our current module be $M_i$, with input $x_i$ and output $y_i$, and let its child be $M_j$. Here, during the backward pass, we have only 1 incoming gradient, $\frac{\partial C}{\partial y_i}$. Then for the functions:
- backward():
Remember, backward() finds the gradients of outputs wrt inputs, i.e. $\frac{\partial y_i}{\partial x_i}$. So from the chain rule:
$\frac{\partial C}{\partial x_i} = \frac{\partial C}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}$
The gradient we send back from backward() is the gradient of the module’s output wrt its inputs times the gradient the module received.
- parameter_gradients():
Let’s say our module has $\theta_i$ as a trainable parameter. Then the gradient of the cost function wrt $\theta_i$ is:
$\frac{\partial C}{\partial \theta_i} = \frac{\partial C}{\partial y_i}\,\frac{\partial y_i}{\partial \theta_i}$
The gradients we use to update our parameters are the gradients of the module’s output wrt its parameters times the gradient the module received.
E.g.: this is similar to the gradients we found in the initial MLP example. To find the gradient for $w_1$, we multiplied the incoming gradient with $\frac{\partial o_1}{\partial w_1}$.
- Multiple children:
In the case of multiple children, we have to accommodate the gradients received from all the children. The terms we used for a single child become a summation over all the children. Let’s say module $M_i$ has $N$ children; then the functions are:
- backward():
$\frac{\partial C}{\partial x_i} = \sum_{j=1}^{N} \frac{\partial C}{\partial x_j}\,\frac{\partial x_j}{\partial x_i}$
i.e. the sum of ((gradient coming from child $j$) times (derivative of (the input to that child) wrt (the input of module $i$))). Note: we say the derivative of the input to that child because a module might have multiple outputs, with different outputs going to different child modules.
- parameter_gradients():
$\frac{\partial C}{\partial \theta_i} = \sum_{j=1}^{N} \frac{\partial C}{\partial x_j}\,\frac{\partial x_j}{\partial \theta_i}$
i.e. the sum of ((gradient coming from child $j$) times (derivative of (the input to that child) wrt (the parameters of module $i$))).
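As a rough sketch in code (the module objects and attribute names here are hypothetical), the accumulation over children amounts to summing one contribution per child:

```python
# Rough sketch (names hypothetical): the gradient a module receives for its
# output is the sum of the gradients its children send back via backward(),
# one term per child.
def gradient_from_children(children):
    grads = [child.backward(child.incoming_grad) for child in children]
    return sum(grads)   # elementwise sum over all N children
```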
Why did we discuss the message passing view of backprop? The reason is that it enables creating very different architectures. I hope the next image sparks your imagination.
Appendix A: Cost Functions
To understand the need for a cost function, we need to understand what learning is. In the context of machine learning and deep learning algorithms, learning means the ability to predict. In our models, we want to reduce the uncertainty or randomness involved in our predictions of the data. We want our predictions to be spot on, not to vary with the same inputs. To measure uncertainty in our models we use the classical measure of randomness, entropy.
Entropy by definition is:
$H(p) = -\sum_x p(x)\,\log p(x)$
where $p(x)$, in the context of machine learning, is the probability the model gives to an input $x$.
Let’s see how cost functions reduce the entropy of predictions for different tasks.
For classification tasks:
Let’s take a classification task of $n$ classes using softmax.
So, the probability that a particular sample is of class $n$ is:
$p_n = \frac{e^{z_n}}{\sum_k e^{z_k}}$
where $z_n$ is the model’s logit for class $n$.
So, our likelihood for one data sample is:
$\prod_n p_n^{\,y_n}$, where $y_n$ is 1 for the label class and 0 for every other class.
So the entropy of classification for one data sample becomes:
$C = -\sum_n y_n \log p_n$
The above expression may seem daunting, but notice that $y_n$ is 0 for all non-label classes; for the label class, where $y_n$ is 1, the expression reaches its lowest value (0) when the model gives a probability of 1 to the label class, which would be a spot-on prediction for the label class.
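To make this concrete, here is a small numeric illustration with made-up logits (a sketch, not tied to any particular library):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))     # -sum_n y_n * log(p_n)

y = np.array([0.0, 1.0, 0.0])         # one-hot label: the true class is class 2

confident = softmax(np.array([0.0, 5.0, 0.0]))   # puts ~0.99 on the true class
unsure    = softmax(np.array([1.0, 1.0, 1.0]))   # uniform ~0.33 everywhere

print(cross_entropy(confident, y))    # close to 0
print(cross_entropy(unsure, y))       # about 1.10, i.e. log(3)
```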
For regression tasks:
The first go-to option for regression tasks is Mean Squared Error (MSE). But in MSE there is no notion of a probability. Another way of setting up a regression task is to learn a probability distribution. This way we can formulate $p(y \mid x)$ and use the negative log-likelihood as our cost function. A go-to choice is to learn a normal distribution around our logits. We assume that our logits $\hat{y}$ are the mean of that distribution, and we find the probability of the label using the probability density function (PDF) of the distribution. The idea is then to maximize this probability, which happens when our logits are equal to the labels, i.e. $\hat{y} = y$.
Using the PDF of a normal distribution,
$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(y-\hat{y})^2}{2\sigma^2}\right)$
So our entropy becomes,
$-\log p(y \mid x) = \frac{(y-\hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)$
Again a daunting expression, but it is minimized when $\hat{y} = y$.
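A quick numeric check of this, with a made-up label and $\sigma$ fixed to 1:

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma=1.0):
    # -log of the normal PDF with mean y_hat, evaluated at the label y
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat)**2 / (2 * sigma**2)

y = 3.0
for y_hat in [1.0, 2.5, 3.0, 4.0]:
    print(y_hat, gaussian_nll(y, y_hat))   # smallest when y_hat == y
```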
So, as we saw in both cases, the point of a cost function is to help us to formalize the uncertainty of the model and by improving our predictions we reduce this uncertainty.
Appendix B: Gradients & Gradient Descent
A gradient tells how a function is changing at that particular point. Key points regarding gradients:
1. A gradient shows the direction in which the function is increasing.
2. When gradients are zero, it means our function has reached a peak or a trough.
If we take a small step in the direction given by the gradient, we move to a point with a function value greater than at the previous position. If we take such steps again and again, we reach a ‘maxima’ point of that function. This is the basis of many optimization algorithms: find the gradients of the cost function wrt the parameters and keep updating the parameters in the direction of their gradients.
Gradient Descent:
Gradient Descent is one such optimization algorithm. It has a very simple update rule:
$\theta \leftarrow \theta + \alpha\,\frac{\partial C}{\partial \theta}$
where $\alpha$ is the step size, which regulates the amount of movement in the direction of the gradient.
We use the above update rule when we have to maximize our cost function. To minimize our cost function, we have to move in the opposite direction of our gradients, so the update rule becomes:
$\theta \leftarrow \theta - \alpha\,\frac{\partial C}{\partial \theta}$
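As a minimal sketch in code, with $\theta$ and the gradient as NumPy arrays:

```python
import numpy as np

def gradient_descent_step(theta, grad, alpha=0.01):
    # Move against the gradient to minimise the cost
    # (use `+ alpha * grad` instead to maximise).
    return theta - alpha * grad
```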
Stochastic Gradient Descent:
As simple as Gradient Descent seems, implementing it for huge datasets has its own problem. Why? The problem lies in the gradient $\frac{\partial C}{\partial \theta}$. Let me explain: for most applications we define the cost function something like this:
$C = \sum_{i=1}^{N} C_i$
where $C_i$ is the cost on the $i$th instance of a dataset with $N$ instances.
If we had a dataset of length 1000, we would find $C_i$ for each instance and then sum these to get $C$. We would then use $C$ to find $\frac{\partial C}{\partial \theta}$ and perform one optimization step for the entire dataset. Here lies the problem: Gradient Descent performs one optimization step for the entire length of the dataset. If we had a dataset of length 1 billion, we would have to iterate over those 1 billion instances before we could move once in the direction of the gradients.
Stochastic Gradient Descent comes to the rescue. SGD says that we can estimate the true gradient using the gradient at each instance, and if we average such estimates over a large number of instances we come pretty close to the true gradient. This allows working with batches when we have huge datasets.
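Here is a rough sketch of the resulting training loop; `grad_fn`, `X` and `Y` are placeholders I am assuming for illustration (NumPy arrays, with `grad_fn` returning the gradient of the cost on a batch):

```python
import numpy as np

def sgd(theta, grad_fn, X, Y, batch_size=32, alpha=0.01, epochs=1):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # shuffle the dataset each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad_fn(theta, X[batch], Y[batch])  # gradient estimate from one batch
            theta = theta - alpha * g               # one step per batch, not per epoch
    return theta
```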
Note: Gradient Descent will take you directly towards the maximum or minimum. SGD will wander here and there, but if repeated correctly, over time it will take you to the same place, or at least somewhere near it.
Appendix C: Chain Rule
You can actually think of the cost function as a function of functions. Using the notation $o_i$ for the output of node $i$ and considering the output node as Node 5, we can say:
$C = C(o_5)$, where $o_5$ is itself a function of $o_3$ and $o_4$, and so on down the graph to the inputs and parameters.
The Chain Rule helps in finding gradients of composite functions such as this. The multivariate chain rule says:
$\frac{d}{dt}\,f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \qquad (1)$
Now we can use the chain rule to find $\frac{\partial C}{\partial w_1}$.
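As a quick sanity check of equation (1), here is a small numeric example with arbitrarily chosen functions, comparing the chain-rule value against a finite-difference approximation:

```python
import math

# f(x, y) with x(t) and y(t); chain rule: df/dt = df/dx * dx/dt + df/dy * dy/dt
f = lambda x, y: x * y + math.sin(x)
x = lambda t: t ** 2
y = lambda t: 3 * t

t = 1.3
# Chain-rule value: df/dx = y + cos(x), df/dy = x, dx/dt = 2t, dy/dt = 3
analytic = (y(t) + math.cos(x(t))) * 2 * t + x(t) * 3
# Finite-difference approximation of d/dt f(x(t), y(t))
eps = 1e-6
numeric = (f(x(t + eps), y(t + eps)) - f(x(t - eps), y(t - eps))) / (2 * eps)
print(analytic, numeric)   # the two values agree closely
```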
Appendix D: Jacobians and the problem with their size
A Jacobian is a matrix which contains the first-order partial derivatives between 2 vectors (even multidimensional ones). Let’s say you have 2 vectors $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$; then the Jacobian $J = \frac{\partial y}{\partial x}$ is an $m \times n$ matrix, specifically:
$J_{ij} = \frac{\partial y_i}{\partial x_j} \qquad (2)$
It tells how each element of $x$ affects each element of $y$.
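For a concrete (and deliberately simple) example, take the linear map $y = Wx$: its Jacobian is just $W$ itself, and its size grows with both dimensions (the sizes below are arbitrary):

```python
import numpy as np

m, n = 8, 3
W = np.random.randn(m, n)
x = np.random.randn(n)
y = W @ x                      # y has m elements, x has n elements

# Jacobian dy/dx: entry (i, j) = dy_i / dx_j. For y = W x this is simply W,
# an m x n matrix, i.e. m*n scalar values for a single linear layer.
J = W
print(J.shape)                 # (8, 3)
```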
To show how Jacobians can be huge, let’s take the MLP we defined in the first gif.
So the sizes of the different Jacobians we would use to compute these gradients are:
We would have to find 3710 scalar values in order to find the gradients of 10×5 = 50 scalar values. So you can see that finding gradients by constructing full Jacobians is not a feasible approach, especially when we increase the number of layers or increase the dimensions of our parameters.
Appendix E: Gradients in Autodiff
Especially notice $\bar{o}_1$, since Node 1 has 2 children.