Ok so if you read my last post - first of all, thank you, and second of all, you now know what Q is. You know what L is. You know that images are just matrices of numbers and that training an AI is just making the average mistake smaller and smaller.
But here's the question I kept asking myself after I understood all that: ok but HOW does it actually get smaller? Like who tells the model to adjust? What's the mechanism? What's actually happening inside?
And the answer is this thing called gradient descent.
I'll be honest - when I first heard the name I thought it sounded like something you'd order at a fancy coffee shop. "Yes I'll have a gradient descent with oat milk please."
But it's actually one of the most elegant ideas in all of mathematics. And once you see it - really see it - you'll understand something that most people who use AI every single day have absolutely no idea about.
So let's go. As always - no jargon, no PhD required, just the real thing explained like a human being talking to another human being.
"Gradient descent is not a complicated idea. It's just a ball. Rolling downhill. Looking for the lowest point."
Imagine You're Lost on a Mountain. In the Dark.
Seriously - close your eyes and picture this. You're standing on a huge hilly landscape. It's pitch black. You can't see anything. Your only goal is to get to the lowest point in the valley.
What do you do? You feel the ground under your feet. You figure out which direction is sloping downward. And you take a step that way. Then you do it again. And again. Each time feeling the slope, each time stepping downward. Eventually - after enough steps - you reach the bottom.
That is gradient descent. Exactly that. The hill is your loss function Q. The lowest point is where Q is as small as possible - where your model is as good as it can be. The ball (you) is the model. The slope is the gradient. And each step is one training update.
The Mountain Analogy
You're lost in the dark on a hill. You feel which way is downhill and take a step. Repeat until you reach the valley.
The AI Version
The model calculates which direction reduces Q. It takes a small step that way. Repeats millions of times until Q is minimised.
The "gradient" part is just the mathematical name for the slope - it tells you which direction is uphill and how steep it is. The "descent" part means going down. Put them together and you have an algorithm that follows the slope downward until it can't go any lower.
Simple. Elegant. Powerful enough to train every major AI system on the planet.
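If you like seeing things in code, here's the entire idea in a few lines of Python. This is a minimal sketch, not a real training loop - the toy loss `Q(w) = (w - 3)**2` and its gradient are invented purely for illustration (its lowest point is at w = 3):

```python
# Toy loss: Q(w) = (w - 3)**2, lowest (zero) exactly at w = 3.
def Q(w):
    return (w - 3) ** 2

def grad_Q(w):
    return 2 * (w - 3)           # slope of Q at w: positive means "uphill to the right"

w = 10.0                          # start somewhere on the hill, in the dark
alpha = 0.1                       # learning rate: how big each step is

for step in range(100):
    w = w - alpha * grad_Q(w)    # feel the slope, step the opposite way

print(round(w, 4))               # w has rolled down very close to 3
```

That loop is genuinely the whole algorithm: feel the slope, step downhill, repeat.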
Watch the Ball Roll Downhill - Live
This is the part that made everything click for me. Below is your loss function Q drawn as a curve. The orange ball is your model. Press Start and watch gradient descent find the minimum - step by step.
Then try changing the learning rate - this controls how big each step is. Too small and it crawls forever. Too big and it bounces past the bottom and never settles. Finding the right learning rate is one of the key skills in training AI models.
Gradient Descent - Live Simulation
Did you notice what happens with a very high learning rate? The ball overshoots the bottom, bounces to the other side, and either oscillates forever or diverges completely. This is one of the most common problems when training real AI models - and it's exactly why choosing the right learning rate matters so much.
The Update Rule - Three Symbols That Change Everything
Remember in the last post we had Q(a, X) - the average loss? Now I'll show you how the model uses that to update itself. This is the gradient descent update rule - the most important line of math in modern AI:

w ← w − α∇Q(w)
| Symbol | Name | What it means in plain English |
|---|---|---|
| w | Weights | The internal settings of your model - the numbers that get adjusted during training |
| ← | Assignment | "Replace w with this new value" - this runs once per training step |
| α | Learning rate | How big each step is - too small = slow, too big = unstable |
| ∇Q(w) | Gradient | Which direction is uphill and how steep - the model goes the opposite direction |
The minus sign is the key insight. The gradient tells you which way Q goes up. You subtract it - so you go down. That's descent. Every single training step in every AI model is just running this one line of math on millions of weights simultaneously.
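And because the gradient gives one slope per weight, the exact same line updates every weight at the same time. Here's a hedged sketch using NumPy - the three-weight toy loss and its `target` values are made up for illustration:

```python
# Toy loss over several weights at once: Q(w) = sum((w - target)**2).
import numpy as np

target = np.array([1.0, -2.0, 0.5])   # invented "ideal" values for each weight

def grad_Q(w):
    return 2 * (w - target)            # the gradient: one slope per weight

w = np.zeros(3)                        # all weights start at zero
alpha = 0.1                            # learning rate

for _ in range(200):
    w = w - alpha * grad_Q(w)          # the update rule, applied to every weight at once

print(w)                               # each weight has descended to its own target
```

Real models do exactly this, just with millions of weights in `w` instead of three.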
"The minus sign is the whole trick. Know which way is uphill - go the opposite direction. That's all gradient descent is."
Why the Learning Rate Makes or Breaks Your Model
This is something I find fascinating - one tiny number can be the difference between a model that learns beautifully and one that completely falls apart. Watch how the same algorithm behaves completely differently depending on the learning rate ฮฑ.
Same Algorithm. Three Learning Rates. Very Different Results.
This is why AI researchers spend so much time tuning the learning rate. Too conservative and you're wasting compute running for days when you could converge in hours. Too aggressive and the model becomes unstable and never learns anything useful. The sweet spot is different for every model and every dataset - finding it is part art, part science.
Modern training systems use something called learning rate scheduling - they start with a higher rate to make fast progress, then slow it down as the model gets closer to the minimum. Like a car that drives fast on the highway but slows down as it approaches the destination.
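Here's a minimal sketch of that idea, assuming the simplest possible schedule - exponential decay. The starting rate 0.4 and the decay factor 0.95 are arbitrary illustration values, and the loss is a toy `Q(w) = w**2`:

```python
# Toy loss: Q(w) = w**2, minimum at w = 0.
def grad_Q(w):
    return 2 * w

w = 8.0
alpha = 0.4                       # big steps at first - the car on the highway
for step in range(50):
    w = w - alpha * grad_Q(w)     # one gradient descent update
    alpha *= 0.95                 # shrink the step as we approach the destination

print(round(w, 6), round(alpha, 4))
```

By the end, the steps are tiny - which is exactly what you want when you're close to the bottom and don't want to overshoot it.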
The Problem Nobody Talks About - Local Minima
Ok so here's where it gets interesting. The simple curve I showed you before was smooth and had one obvious lowest point. Real loss functions? They look nothing like that.
Real Q functions have bumps, valleys, plateaus, and multiple low points. The algorithm might roll down into a valley - but it might not be the deepest valley. It might get stuck in a little dip thinking it found the bottom when the real bottom is somewhere else entirely.
This is called a local minimum - a point that looks like the bottom from nearby, but isn't the global minimum.
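You can watch this happen in code. Below is a hedged sketch with a made-up two-valley loss - the function, the tilt term `0.2*w`, and the step counts are all invented for illustration:

```python
# Toy loss with two valleys: Q(w) = (w**2 - 1)**2 + 0.2*w.
# The small tilt 0.2*w makes the left valley slightly deeper, so the global
# minimum is near w ≈ -1.02 and a local minimum sits near w ≈ +0.97.

def grad_Q(w):
    return 4 * w * (w**2 - 1) + 0.2   # derivative of Q

def descend(w, alpha=0.01, steps=2000):
    for _ in range(steps):
        w = w - alpha * grad_Q(w)      # plain gradient descent
    return w

print(descend(-2.0))   # dropped on the left: finds the global minimum, near -1.02
print(descend(2.0))    # dropped on the right: stuck in the local minimum, near +0.97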
Real Loss Landscape - Multiple Valleys
Click anywhere on the curve to start the ball from that position.
Notice how depending on where you drop the ball, it ends up in completely different valleys? This is a real challenge in training large AI models - and it's why techniques like random restarts, momentum, and the Adam optimiser were invented. They all try to help the model escape local minima and find the true bottom.
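To make momentum concrete, here's a minimal sketch. The two-valley loss and every hyperparameter below are invented for illustration - the point is just that a ball with velocity can coast through a shallow dip that a plain ball gets stuck in:

```python
# Toy loss with two valleys: Q(w) = (w**2 - 1)**2 + 0.2*w.
# Left valley (near -1.02) is the global minimum; right valley (near +0.97) is local.
def grad_Q(w):
    return 4 * w * (w**2 - 1) + 0.2

def plain(w, alpha=0.01, steps=2000):
    for _ in range(steps):
        w = w - alpha * grad_Q(w)             # ordinary gradient descent
    return w

def momentum(w, alpha=0.01, beta=0.9, steps=2000):
    v = 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad_Q(w)      # keep most of the old velocity,
        w = w + v                              # nudged by the new slope
    return w

print(plain(2.0))      # gets stuck in the shallow right valley, near +0.97
print(momentum(2.0))   # coasts over the hump and settles near -1.02
```

Same starting point, same loss - but the remembered velocity carries the momentum version through the local minimum and into the deeper valley.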
Local Minimum
A valley that looks like the bottom from nearby - but the real lowest point is somewhere else. The model gets stuck here.
Global Minimum
The actual lowest point of the entire loss function. This is where your model performs best. The goal of training.
Questions I Had When I First Learned This
What You Now Know
You came here not knowing what gradient descent was. You're leaving knowing the algorithm that powers every AI system on the planet. That's not nothing - that's actually a lot.
Next up - backpropagation. The algorithm that makes gradient descent actually work at scale. Without it, updating the weights of a modern AI model would take longer than the age of the universe. With it, it takes milliseconds. See you in the next one.