Ok so if you read my last post - first of all, thank you 🙏 and second of all, you now know what Q is. You know what L is. You know that images are just matrices of numbers and that training an AI is just making the average mistake smaller and smaller.

But here's the question I kept asking myself after I understood all that: ok but HOW does it actually get smaller? Like who tells the model to adjust? What's the mechanism? What's actually happening inside?

And the answer is this thing called gradient descent.

I'll be honest - when I first heard the name I thought it sounded like something you'd order at a fancy coffee shop. "Yes I'll have a gradient descent with oat milk please." 😄

But it's actually one of the most elegant ideas in all of mathematics. And once you see it - really see it - you'll understand something that most people who use AI every single day have absolutely no idea about.

So let's go. As always - no jargon, no PhD required, just the real thing explained like a human being talking to another human being. ☕

"Gradient descent is not a complicated idea. It's just a ball. Rolling downhill. Looking for the lowest point."

Imagine You're Lost on a Mountain. In the Dark.

Seriously - close your eyes and picture this. You're standing on a huge hilly landscape. It's pitch black. You can't see anything. Your only goal is to get to the lowest point in the valley.

What do you do? You feel the ground under your feet. You figure out which direction is sloping downward. And you take a step that way. Then you do it again. And again. Each time feeling the slope, each time stepping downward. Eventually - after enough steps - you reach the bottom.

That is gradient descent. Exactly that. The hill is your loss function Q. The lowest point is where Q is as small as possible - where your model is as good as it can be. The ball (you) is the model. The slope is the gradient. And each step is one training update.

๐Ÿ”๏ธ The Mountain Analogy

You're lost in the dark on a hill. You feel which way is downhill and take a step. Repeat until you reach the valley.

🤖 The AI Version

The model calculates which direction reduces Q. It takes a small step that way. Repeats millions of times until Q is minimised.

The "gradient" part is just the mathematical name for the slope - it tells you which direction is uphill and how steep it is. The "descent" part means going down. Put them together and you have an algorithm that follows the slope downward until it can't go any lower.

Simple. Elegant. Powerful enough to train every major AI system on the planet.
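If code clicks better for you than hills, here's the whole idea in a few lines of Python. Everything here is made up for illustration - a toy loss Q(w) = (w − 3)² and a toy starting point - and the slope is "felt" numerically, exactly like the hiker feeling the ground under their feet:

```python
# Gradient descent on a toy loss Q(w) = (w - 3)^2, whose lowest point is w = 3.
# The loss and starting point are invented for illustration.

def Q(w):
    return (w - 3) ** 2

def slope(w, eps=1e-6):
    # "Feel the ground": compare Q slightly to the right and slightly
    # to the left of where you're standing to estimate the slope.
    return (Q(w + eps) - Q(w - eps)) / (2 * eps)

w = -2.0        # start somewhere on the hillside
alpha = 0.1     # learning rate: how big each step is
for step in range(100):
    w = w - alpha * slope(w)   # step in the downhill direction

print(round(w, 3))  # close to 3.0, the bottom of the valley
```

One hundred tiny steps in the dark, and the ball finds the bottom without ever "seeing" the whole curve - it only ever felt the local slope.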

Watch the Ball Roll Downhill - Live

This is the part that made everything click for me. Below is your loss function Q drawn as a curve. The orange ball is your model. Press Start and watch gradient descent find the minimum - step by step.

Then try changing the learning rate - this controls how big each step is. Too small and it crawls forever. Too big and it bounces past the bottom and never settles. Finding the right learning rate is one of the key skills in training AI models.

Gradient Descent - Live Simulation

The curve is Q. The ball is your model. The bottom of the curve is the goal.

[Interactive demo: controls for learning rate (α) and starting position; live readouts of steps taken, current Q, and position (w).]

Did you notice what happens with a very high learning rate? The ball overshoots the bottom, bounces to the other side, and either oscillates forever or diverges completely. This is one of the most common problems when training real AI models - and it's exactly why choosing the right learning rate matters so much.
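You can reproduce that overshoot in a few lines of Python. This is a sketch on a made-up parabola Q(w) = w² (gradient 2w), with three illustrative learning rates:

```python
# Same curve Q(w) = w^2, same start, three learning rates.
# All values are invented to show convergence vs. divergence.

def run(alpha, steps=20, w=1.0):
    for _ in range(steps):
        w = w - alpha * 2 * w   # update rule with gradient Q'(w) = 2w
    return w

print(abs(run(0.05)))   # small alpha: still crawling toward 0
print(abs(run(0.40)))   # good alpha: essentially at the bottom
print(abs(run(1.10)))   # too big: |w| grows every step - divergence
```

With α = 1.10 each step multiplies w by −1.2, so the ball lands farther from the bottom every single time. That's the "bounces past and never settles" failure in its purest form.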

The Update Rule - Three Symbols That Change Everything

Remember in the last post we had Q(a, X) - the average loss? Now I'll show you how the model uses that to update itself. This is the gradient descent update rule - the most important line of math in modern AI:

UPDATE RULE
w ← w − α · ∇Q(w)

w (weights) - the internal settings of your model, the numbers that get adjusted during training
← (assignment) - "replace w with this new value"; this runs once per training step
α (learning rate) - how big each step is; too small = slow, too big = unstable
∇Q(w) (gradient) - which direction is uphill and how steep; the model goes the opposite direction

The minus sign is the key insight. The gradient tells you which way Q goes up. You subtract it - so you go down. That's descent. Every single training step in every AI model is just running this one line of math on millions of weights simultaneously.

"The minus sign is the whole trick. Know which way is uphill - go the opposite direction. That's all gradient descent is."
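Here's the minus sign doing its job, on a made-up Q(w) = w². Flip it to a plus and the exact same rule climbs instead of descends:

```python
# The minus sign in action. Toy loss Q(w) = w^2 with gradient 2w,
# starting point chosen arbitrarily for illustration.

def Q(w):
    return w ** 2

def grad(w):
    return 2 * w   # derivative of w^2

w, alpha = 2.0, 0.1
downhill = w - alpha * grad(w)   # the real update rule: subtract the gradient
uphill   = w + alpha * grad(w)   # flip the sign and you climb instead

print(Q(downhill) < Q(w) < Q(uphill))  # True: minus descends, plus ascends
```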

Why the Learning Rate Makes or Breaks Your Model

This is something I find fascinating - one tiny number can be the difference between a model that learns beautifully and one that completely falls apart. Watch how the same algorithm behaves completely differently depending on the learning rate ฮฑ.

Same Algorithm. Three Learning Rates. Very Different Results.

All three start at the same position on the same curve:

α = 0.02 (too small) - ⚠ takes forever
α = 0.15 (just right) - ✓ converges nicely
α = 0.85 (too big) - ✗ explodes

This is why AI researchers spend so much time tuning the learning rate. Too conservative and you're wasting compute running for days when you could converge in hours. Too aggressive and the model becomes unstable and never learns anything useful. The sweet spot is different for every model and every dataset - finding it is part art, part science.

Modern training systems use something called learning rate scheduling - they start with a higher rate to make fast progress, then slow it down as the model gets closer to the minimum. Like a car that drives fast on the highway but slows down as it approaches the destination.
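A schedule can be sketched in a few lines. This one uses simple exponential decay - the starting rate, decay factor, and toy loss Q(w) = w² are all made up for illustration (real systems often use fancier schedules like warmup plus cosine decay):

```python
# A toy learning-rate schedule: bigger steps early, smaller steps later.
# All numbers here are invented for illustration.

def scheduled_alpha(step, alpha0=0.3, decay=0.95):
    # Exponential decay: each step the rate is 95% of the previous one.
    return alpha0 * (decay ** step)

w = 10.0
for step in range(50):
    grad = 2 * w                           # gradient of Q(w) = w^2
    w = w - scheduled_alpha(step) * grad   # same update rule, shrinking alpha

print(abs(w) < 0.01)  # True: fast progress early, gentle steps near the end
```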

The Problem Nobody Talks About - Local Minima

Ok so here's where it gets interesting. The simple curve I showed you before was smooth and had one obvious lowest point. Real loss functions? They look nothing like that.

Real Q functions have bumps, valleys, plateaus, and multiple low points. The algorithm might roll down into a valley - but it might not be the deepest valley. It might get stuck in a little dip thinking it found the bottom when the real bottom is somewhere else entirely.

This is called a local minimum - a point that looks like the bottom from nearby, but isn't the global minimum.

Real Loss Landscape - Multiple Valleys

[Interactive demo: click anywhere on the curve to drop the ball from that position and see which valley it ends up in.]

Notice how depending on where you drop the ball, it ends up in completely different valleys? This is a real challenge in training large AI models - and it's why techniques like random restarts, momentum, and the Adam optimiser were invented. They all try to help the model escape local minima and find the true bottom.

๐Ÿ”๏ธ Local Minimum

A valley that looks like the bottom from nearby - but the real lowest point is somewhere else. The model gets stuck here.

๐ŸŒ Global Minimum

The actual lowest point of the entire loss function. This is where your model performs best. The goal of training.
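You can play the same drop-the-ball game in plain Python. The bumpy function below is made up for illustration: Q(w) = (w² − 1)² + 0.3w has a deep valley near w = −1 and a shallower one near w = +1, and plain gradient descent simply rolls into whichever valley is downhill from where it starts:

```python
# A bumpy loss with two valleys: Q(w) = (w^2 - 1)^2 + 0.3*w.
# Deep (global) valley near w = -1, shallow (local) valley near w = +1.
# The function and the starting points are invented for illustration.

def grad(w):
    return 4 * w * (w ** 2 - 1) + 0.3   # derivative of Q

def descend(w, alpha=0.01, steps=2000):
    for _ in range(steps):
        w = w - alpha * grad(w)         # ordinary gradient descent
    return w

print(round(descend(-2.0), 1))  # lands in the deep valley near -1
print(round(descend(+2.0), 1))  # stuck in the shallow valley near +1
```

Same algorithm, same curve, opposite answers - the only thing that changed was the starting position. That's the local-minimum problem in miniature.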

Questions I Had When I First Learned This

How does the model calculate the gradient - does it just guess?
No guessing at all - it uses calculus (derivatives) to calculate the exact slope at the current position. The algorithm called backpropagation does this automatically for every weight in the model. We'll cover backpropagation in the next post - it's the mechanism that makes gradient descent practical for models with billions of parameters.
How many steps does gradient descent take to train a real model?
Millions of steps for the largest models. GPT-3, for example, was trained on roughly 300 billion tokens - but those are processed in large batches, so the number of actual gradient updates works out to hundreds of thousands to millions of steps. Each step processes a batch of training examples, calculates the gradient, and updates every one of the model's weights (175 billion of them, in GPT-3's case) simultaneously. This is why training large AI models requires enormous compute - it's just this simple algorithm run an almost incomprehensible number of times.
What is stochastic gradient descent (SGD)?
Instead of calculating the gradient on your entire dataset (which would be incredibly slow), SGD picks a small random batch of examples - maybe 32 or 64 - and calculates the gradient on just those. It's a noisier estimate but it's much faster and actually helps escape local minima. Almost all modern AI training uses some form of mini-batch stochastic gradient descent.
Can I run gradient descent on my own machine?
Absolutely - and with a machine like an Apple M4 Max with 128GB of RAM, you can run gradient descent on surprisingly large models locally. PyTorch and MLX both implement gradient descent and all its variants out of the box. You just define your model and your loss function, pick an optimiser (Adam is the most popular), and call loss.backward() followed by optimizer.step() - that's one gradient descent step right there.
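And if you want to see the mechanics with no framework at all, here's a minimal mini-batch SGD sketch in pure Python - one weight, and a made-up dataset where the right answer is w ≈ 2 by construction:

```python
# Mini-batch SGD sketch: a one-weight model y = w * x on invented data.
# Each step uses a small random batch instead of the whole dataset.
import random

random.seed(0)
# Synthetic dataset: y = 2x plus a little noise, so the best w is near 2.
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in range(1, 101)]

w, alpha = 0.0, 0.0001
for step in range(1000):
    batch = random.sample(data, 32)       # small random batch, not all 100 points
    # Gradient of the mean squared error on just this batch, w.r.t. w
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w = w - alpha * g                     # one stochastic update

print(round(w, 1))  # close to 2.0
```

Each batch gives a slightly different, noisier gradient than the full dataset would - and yet the weight still settles right where it should. That noise is the "stochastic" in SGD, and it's a feature, not a bug.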

What You Now Know

You came here not knowing what gradient descent was. You're leaving knowing the algorithm that powers every AI system on the planet. That's not nothing - that's actually a lot.

The Full Picture So Far
1. Your dataset X contains all your training examples (last post)
2. L measures how wrong the model was on one example (last post)
3. Q averages all the L values - your total score (last post)
4. Gradient descent calculates the slope of Q and takes a step downhill (this post) ← you are here
5. Backpropagation efficiently computes the gradient for every weight - next post

Next up - backpropagation. The algorithm that makes gradient descent actually work at scale. Without it, updating the weights of a modern AI model would take longer than the age of the universe. With it, it takes milliseconds. See you in the next one. 🚀