Backpropagation: From Intuition to FLOPs

This post dives into what happens during the backward pass of a neural network. We’ll start with a simple scalar example and build up to full matrix-based gradient derivations — showing not only how gradients are computed, but also how to count FLOPs along the way.

Introduction

In the previous post I covered how to construct tensors and pass them through operations - this is known as the “forward pass”.

The next step in deep learning is computing gradients and taking an optimizer step - the backward pass.

Let’s look at a simple example: $\text{loss}=0.5 \, (x \cdot w-5)^2$, where $x$ and $w$ are vectors and $x \cdot w$ is their dot product.
Forward pass: compute loss

import torch

x = torch.tensor([1.0, 2.0, 3.0])                      # input: we don't need gradients
w = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)  # weights: we need gradients
pred_y = x @ w                    # dot product: 1 + 2 + 3 = 6
loss = 0.5 * (pred_y - 5).pow(2)  # 0.5 * (6 - 5)^2 = 0.5

Backward pass: compute gradients

loss.backward()
assert loss.grad is None    # non-leaf tensor: its gradient is not retained
assert pred_y.grad is None  # non-leaf tensor: its gradient is not retained
assert x.grad is None       # x was created without requires_grad=True
assert torch.equal(w.grad, torch.tensor([1.0, 2.0, 3.0]))
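
Why is w.grad equal to [1.0, 2.0, 3.0]? It follows from differentiating the scalar loss with respect to $w$:

\[\frac{\text{dloss}}{dw} = (x \cdot w - 5)\,x = (6 - 5) \cdot \begin{bmatrix}1.0 & 2.0 & 3.0\end{bmatrix} = \begin{bmatrix}1.0 & 2.0 & 3.0\end{bmatrix}\]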

Matrix Gradients by Hand and Backward FLOPs

Let’s consider a more realistic setup. We have

  • an input matrix $X$ with dimensions $B \times D$
  • first-layer weights $W_1$ with dimensions $D \times D$
  • second-layer weights $W_2$ with dimensions $D \times C$ (producing $C$ output classes)

The model is

  • $H_1 = XW_1$ (the dimensions: \((B \times D) \ @ \ (D \times D) \to B \times D\))
  • $H_2 = H_1W_2$ (the dimensions: \((B \times D) \ @ \ (D \times C) \to B \times C\))
  • $\text{loss} = \text{mean}(H_2^2)$
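
Here is a minimal PyTorch sketch of this model. The sizes B=4, D=8, C=3 are not from the derivation - they are just small illustrative values to make the example concrete:

import torch

B, D, C = 4, 8, 3                           # small example sizes, chosen for illustration
X = torch.randn(B, D)                       # input batch
W1 = torch.randn(D, D, requires_grad=True)  # first layer weights
W2 = torch.randn(D, C, requires_grad=True)  # second layer weights

H1 = X @ W1              # (B, D) @ (D, D) -> (B, D)
H2 = H1 @ W2             # (B, D) @ (D, C) -> (B, C)
loss = H2.pow(2).mean()  # mean of squared outputs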

To perform a backward step, we will need gradients with respect to all weights: $\frac{\text{dloss}}{dW_1}$ and $\frac{\text{dloss}}{dW_2}$.

Let’s visualise the matrices; this helps build intuition for how the gradients are calculated step by step.

\[X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \cdots & \cdots & \cdots & \cdots \\ x_{B1} & x_{B2} & \cdots & x_{BD} \\ \end{bmatrix} \qquad W_1 = \begin{bmatrix} w^{(1)}_{11} & w^{(1)}_{12} & \cdots & w^{(1)}_{1D} \\ w^{(1)}_{21} & w^{(1)}_{22} & \cdots & w^{(1)}_{2D} \\ \cdots & \cdots & \cdots & \cdots \\ w^{(1)}_{D1} & w^{(1)}_{D2} & \cdots & w^{(1)}_{DD} \\ \end{bmatrix} \qquad W_2 = \begin{bmatrix} w^{(2)}_{11} & w^{(2)}_{12} & \cdots & w^{(2)}_{1C} \\ w^{(2)}_{21} & w^{(2)}_{22} & \cdots & w^{(2)}_{2C} \\ \cdots & \cdots & \cdots & \cdots \\ w^{(2)}_{D1} & w^{(2)}_{D2} & \cdots & w^{(2)}_{DC} \\ \end{bmatrix}\]

Forward pass

We’ll walk through each layer’s output to build a concrete understanding of how activations are computed.

Step 1: Compute H₁

\[H_1 = XW_1 = \begin{bmatrix} \sum_{k=1}^D x_{1k}w^{(1)}_{k1} & \sum_{k=1}^D x_{1k}w^{(1)}_{k2} & \cdots & \sum_{k=1}^D x_{1k}w^{(1)}_{kD} \\ \sum_{k=1}^D x_{2k}w^{(1)}_{k1} & \sum_{k=1}^D x_{2k}w^{(1)}_{k2} & \cdots & \sum_{k=1}^D x_{2k}w^{(1)}_{kD} \\ \cdots & \cdots & \cdots & \cdots \\ \sum_{k=1}^D x_{Bk}w^{(1)}_{k1} & \sum_{k=1}^D x_{Bk}w^{(1)}_{k2} & \cdots & \sum_{k=1}^D x_{Bk}w^{(1)}_{kD} \end{bmatrix}\]

Step 2: Compute H₂

\[H_2 = H_1W_2 = \begin{bmatrix} \sum_{k=1}^D h^{(1)}_{1k}w^{(2)}_{k1} & \sum_{k=1}^D h^{(1)}_{1k}w^{(2)}_{k2} & \cdots & \sum_{k=1}^D h^{(1)}_{1k}w^{(2)}_{kC} \\ \sum_{k=1}^D h^{(1)}_{2k}w^{(2)}_{k1} & \sum_{k=1}^D h^{(1)}_{2k}w^{(2)}_{k2} & \cdots & \sum_{k=1}^D h^{(1)}_{2k}w^{(2)}_{kC} \\ \cdots & \cdots & \cdots & \cdots \\ \sum_{k=1}^D h^{(1)}_{Bk}w^{(2)}_{k1} & \sum_{k=1}^D h^{(1)}_{Bk}w^{(2)}_{k2} & \cdots & \sum_{k=1}^D h^{(1)}_{Bk}w^{(2)}_{kC} \end{bmatrix}\]

More generally, element $h^{(2)}_{ij} = \sum _{k=1}^D h^{(1)} _{ik} w^{(2)} _{kj}$.

Step 3: Compute the final loss

\[\text{loss} = \frac{1}{BC}\sum_{i=1}^B \sum_{j=1}^C (h^{(2)}_{ij})^2\]

Each output element of an \((m \times k)\,@\,(k \times n)\) matrix product takes \(k\) multiplications and roughly \(k\) additions, so a matmul costs about \(2mkn\) FLOPs. Forward FLOPs (excluding the loss): num_forward_flops = 2 * B * D * D + 2 * B * D * C.
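
As a quick sanity check, here is a tiny helper (hypothetical, written just for this post) that applies this counting rule to our two matmuls, with the same illustrative sizes as the sketch above:

def matmul_flops(m, k, n):
    # an (m x k) @ (k x n) product: m*n output elements, each needing ~k multiplies and ~k adds
    return 2 * m * k * n

B, D, C = 4, 8, 3                                                  # illustrative sizes, as above
num_forward_flops = matmul_flops(B, D, D) + matmul_flops(B, D, C)  # X @ W1, then H1 @ W2
print(num_forward_flops)                                           # 704 = 2*4*8*8 + 2*4*8*3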

Now that we’ve computed the forward activations, it’s time to trace gradients backward using the chain rule.

Backward pass

To compute gradients, we use the chain rule - “a formula that expresses the derivative of the composition of two differentiable functions” (wiki). In Deep Learning it is often more intuitive to write it using Leibniz notation: if $z$ depends on $y$ and $y$ depends on $x$, then $z$ depends on $x$ and we can write it as

\[\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}\]
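
As a concrete scalar illustration (not part of the derivation below): if $z = y^2$ and $y = 3x$, then $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} = 2y \cdot 3 = 18x$, which matches differentiating $z = 9x^2$ directly.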

Gradient with respect to H₂

Now that we’ve defined the loss, the first step in the backward pass is to compute the gradient of the loss with respect to the model’s output, H₂:

\[\frac{\text{dloss}}{dh^{(2)}_{ij}}=\frac{1}{BC}\cdot 2 \cdot h^{(2)}_{ij} \quad \text{(Only one term of the sum depends on } h^{(2)}_{ij} \text{)}\]

Or in matrix form:

\[\frac{\text{dloss}}{dH_2}=\frac{2}{BC}H_2\]
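
Here is a minimal sketch of how this formula can be checked against autograd, again with the assumed illustrative sizes. Since \(H_2\) is not a leaf tensor, autograd does not keep its gradient by default, so we call retain_grad first:

import torch

B, D, C = 4, 8, 3                           # illustrative sizes
X = torch.randn(B, D)
W1 = torch.randn(D, D, requires_grad=True)
W2 = torch.randn(D, C, requires_grad=True)

H1 = X @ W1
H2 = H1 @ W2
H2.retain_grad()                            # keep the gradient of this non-leaf tensor
loss = H2.pow(2).mean()
loss.backward()

# analytic formula: dloss/dH2 = (2 / (B*C)) * H2
assert torch.allclose(H2.grad, 2.0 / (B * C) * H2.detach())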

Gradient with respect to W₂

With \(\frac{\text{dloss}}{dH_2}\) computed, we can calculate the gradient w.r.t. the second layer weights, W₂, using the chain rule.
We start with:

\[\frac{\text{dloss}}{dw^{(2)}_{ij}} = \sum_{i^{'}=1}^B \sum_{j^{'}=1}^C \frac{\text{dloss}}{dh^{(2)}_{i^{'}j^{'}}} \cdot \frac{dh^{(2)}_{i^{'}j^{'}}}{dw^{(2)}_{ij}}\]

The key insight is that the sum includes many terms that are zero - \(h^{(2)}_{i^{'}j^{'}}=\sum_{k=1}^{D} h^{(1)}_{i^{'}k}w^{(2)}_{kj^{'}}\) only depends on \(w^{(2)}_{ij}\) when \(j^{'}=j\).

Let’s derive it more formally:

\[\frac{dh^{(2)}_{i^{'}j^{'}}}{dw^{(2)}_{ij}} = \begin{cases} h^{(1)}_{i^{'}i} & \text{if } j^{'} = j \\ 0 & \text{if } j^{'} \neq j \end{cases} \quad \Rightarrow \quad \frac{\text{dloss}}{dw^{(2)}_{ij}} = \sum_{k=1}^{B} \frac{\text{dloss}}{dh^{(2)}_{kj}} \cdot h^{(1)}_{ki} = \sum_{k=1}^{B} (h^{(1)})^T_{ik} \cdot \frac{\text{dloss}}{dh^{(2)}_{kj}}\]

In matrix form:

\[\frac{\text{dloss}}{dW_2} = H_1^T \frac{\text{dloss}}{dH_2} \text{ (dimensions: } D \times C\text{)}\]

This is a matrix multiplication between a \(D \times B\) and a \(B \times C\) matrix - resulting in a \(D \times C\) output.
The number of backward FLOPs so far: num_backward_flops = 2 * B * D * C.

Gradient with respect to H₁

Next, we need to backpropagate through W₂ into the hidden activations H₁. This is needed to continue the chain toward W₁. Following the same idea, we obtain:

\[\frac{\text{dloss}}{dH_1} = \frac{\text{dloss}}{dH_2}W_2^T \text{ (dimensions: } B \times D\text{)}\]
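
Element-wise, this is because only row \(i\) of \(H_2\) depends on \(h^{(1)}_{ij}\), and \(\frac{dh^{(2)}_{ik}}{dh^{(1)}_{ij}} = w^{(2)}_{jk}\), so:

\[\frac{\text{dloss}}{dh^{(1)}_{ij}} = \sum_{k=1}^{C} \frac{\text{dloss}}{dh^{(2)}_{ik}} \cdot w^{(2)}_{jk} = \sum_{k=1}^{C} \frac{\text{dloss}}{dh^{(2)}_{ik}} \cdot (w^{(2)})^T_{kj}\]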

The number of backward FLOPs so far: num_backward_flops = 4 * B * D * C = 2 * B * D * C + 2 * B * D * C.

Gradient with respect to W₁

Now we compute gradients for the first layer weights, \(W_1\) - this completes the parameter gradient path for training.

\[\frac{\text{dloss}}{dw^{(1)}_{ij}} = \sum_{i^{'}=1}^B \sum_{j^{'}=1}^D \frac{\text{dloss}}{dh^{(1)}_{i^{'}j^{'}}} \cdot \frac{dh^{(1)}_{i^{'}j^{'}}}{dw^{(1)}_{ij}} = \sum_{k=1}^B \frac{\text{dloss}}{dh^{(1)}_{kj}} \cdot \frac{dh^{(1)}_{kj}}{dw^{(1)}_{ij}} = \sum_{k=1}^B \frac{\text{dloss}}{dh^{(1)}_{kj}} \cdot x_{ki}\]

So, in matrix form:

\[\frac{\text{dloss}}{dW_1} = X^T \frac{\text{dloss}}{dH_1} \text{ (the dimensions: } D \times D\text{)}\]

The number of backward FLOPs so far: num_backward_flops = 4 * B * D * C + 2 * B * D * D.

Gradient with respect to X

We can also compute gradients w.r.t. the input $X$ — although this isn’t needed during training, it’s useful for things like input sensitivity or adversarial attacks. Following the same chain rule logic:

\[\frac{\text{dloss}}{dX} = \frac{\text{dloss}}{dH_1}W_1^T \text{ (the dimensions: } B \times D\text{)}\]

The number of backward FLOPs: num_backward_flops = 4 * B * D * C + 4 * B * D * D. This is exactly $2\times$ the number of FLOPs compared to the forward pass.
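
To tie everything together, here is a minimal end-to-end sketch (same assumed illustrative sizes as before) that checks each matrix formula derived above against PyTorch autograd and confirms the 2x FLOP ratio:

import torch

B, D, C = 4, 8, 3                           # small example sizes, chosen for illustration
X = torch.randn(B, D, requires_grad=True)   # requires_grad so autograd also gives us dloss/dX
W1 = torch.randn(D, D, requires_grad=True)
W2 = torch.randn(D, C, requires_grad=True)

H1 = X @ W1
H2 = H1 @ W2
H1.retain_grad()                            # keep gradients of the non-leaf activations
H2.retain_grad()
loss = H2.pow(2).mean()
loss.backward()

dH2 = 2.0 / (B * C) * H2.detach()           # dloss/dH2 = (2 / BC) * H2
dW2 = H1.detach().T @ dH2                   # dloss/dW2 = H1^T @ dloss/dH2
dH1 = dH2 @ W2.detach().T                   # dloss/dH1 = dloss/dH2 @ W2^T
dW1 = X.detach().T @ dH1                    # dloss/dW1 = X^T @ dloss/dH1
dX  = dH1 @ W1.detach().T                   # dloss/dX  = dloss/dH1 @ W1^T

assert torch.allclose(W2.grad, dW2, atol=1e-6)
assert torch.allclose(H1.grad, dH1, atol=1e-6)
assert torch.allclose(W1.grad, dW1, atol=1e-6)
assert torch.allclose(X.grad,  dX,  atol=1e-6)

num_forward_flops  = 2 * B * D * D + 2 * B * D * C
num_backward_flops = 4 * B * D * C + 4 * B * D * D
assert num_backward_flops == 2 * num_forward_flops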


Conclusion

  • Understanding the math behind backpropagation helps with debugging, optimizing, and designing new architectures.
  • The backward pass is built on the chain rule, propagated layer by layer.
  • The backward pass requires roughly 2x the FLOPs of the forward pass (counting the gradient with respect to the input as well).


