Many numerical problems involve some sort of optimisation.
Given a function
$$f(x_1, x_2, \dots, x_N)$$
of $N$ parameters, how do I find its minimum or maximum?
Gradient descent finds the minimum of a function by taking successive steps downhill until a convergence criterion is met.
Algorithm:
$$ x_{i+1} = x_i - \eta \nabla f(x_i) $$
Here $\eta$ is the step size, and it has to be chosen carefully.
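A minimal sketch of this update in NumPy (the function name and default values here are illustrative choices, not a standard API):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Apply the update x_{i+1} = x_i - eta * grad f(x_i) for a fixed number of steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)  # the gradient descent update rule
    return x

# Example: minimise f(x) = x^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, x0=[5.0], eta=0.1)
```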
If the step size is too small, convergence is too slow.
If the step size is too large, we can lose convergence altogether!
If the step size is about right, we converge quickly to the minimum.
Let’s try a smaller $\eta$.
Looks better…
Let’s zoom in more…
Still fine…
Let’s zoom in even more…
We find oscillations again!
At a small enough scale we will always end up oscillating at the bottom of the well, so we need a stopping criterion.
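For instance (a sketch; stopping on the size of the last step is one common criterion, stopping on the gradient norm is another):

```python
import numpy as np

def gradient_descent_tol(grad_f, x0, eta=0.1, tol=1e-8, max_steps=10_000):
    """Stop when the distance between successive iterates falls below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < tol:  # stopping criterion
            return x_new
        x = x_new
    return x  # max_steps reached before meeting the tolerance
```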
It is also possible to use an adaptive step size: start with a larger step and reduce it when oscillations appear.
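A simple version of this idea might look as follows (a sketch: here an increase in $f$ is taken as the sign of overshooting, which is one possible choice among several):

```python
import numpy as np

def gradient_descent_adaptive(f, grad_f, x0, eta=1.0, tol=1e-8, max_steps=10_000):
    """Halve the step size whenever a step would increase f (a sign of overshooting)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        x_new = x - eta * grad_f(x)
        if f(x_new) > f(x):  # we overshot the minimum: reduce the step size
            eta *= 0.5
            continue
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```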
Rosenbrock’s Banana Function
$$ f(x, y) = (1 - x)^2 + 100\left(y - x^2\right)^2 $$
A challenging test case for finding minima.
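In code, with the gradient worked out by hand from the formula above:

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    # df/dx = -2(1 - x) - 400 x (y - x^2),  df/dy = 200 (y - x^2)
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])
```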
This is a difficult problem: the step size needs to be very small at the beginning, otherwise we diverge, but such a small step means we end up crawling along the floor of the valley.
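We can see this with the plain gradient_descent sketched above (the values of $\eta$ and the starting point are illustrative):

```python
# From this starting point, eta around 0.01 overshoots the steep valley
# walls and diverges; eta = 0.001 is stable but crawls along the valley.
x_min = gradient_descent(rosenbrock_grad, x0=[-1.0, 1.0],
                         eta=0.001, n_steps=20_000)
print(x_min)  # slowly approaching the true minimum at (1, 1)
```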
There are many adaptive methods that change the step size as the algorithm progresses.
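One classic example is backtracking (Armijo) line search, which starts each iteration with a large trial step and shrinks it until a sufficient-decrease condition holds. A sketch (the constants 0.5 and $10^{-4}$ are conventional choices, not mandated values):

```python
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, shrink=0.5, c=1e-4):
    """One Armijo step: shrink eta until the move gives a sufficient decrease in f."""
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * np.dot(g, g):
        eta *= shrink
    return x - eta * g
```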
Gradient descent converges to the global minimum only when the function has a single minimum (for example, when it is convex).
Otherwise it can get stuck in local minima.
There are global optimisation methods that can find the global minimum even in the presence of local minima;
they are based on stochastic techniques.
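The simplest such strategy is random restarts: run gradient descent from many random starting points and keep the best result. A sketch on a 1D function with two minima (the function, bounds, and counts are illustrative):

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 + x      # two minima; only one is global

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

rng = np.random.default_rng(0)
best_x, best_f = None, np.inf
for _ in range(20):                  # 20 random restarts
    x = rng.uniform(-3, 3)           # random starting point
    for _ in range(1_000):           # plain gradient descent
        x -= 0.01 * grad_f(x)
    if f(x) < best_f:
        best_x, best_f = x, f(x)
print(best_x)  # close to the global minimum near x ≈ -1.3
```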