Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):
I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the
convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).
We assume that
is
-smooth convex function, and we take
in the gradient step. The momentum term
will be set to a very particular value, which comes out naturally in the proof.
The two basic inequalities
Let us denote
(note that
). Now let us write our favorite inequalities (using
and
):
and
On the way to a telescopic sum
Recall now that
, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient
on the second inequality, so let us do
times the first inequality plus
times the second inequality. We obtain an inequality whose right hand side is given by
times
Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose
and
. What we would like to have is:
Observe that (since
) the right hand side can be written as
, and thus we see that we simply need to have:
Setting the parameters and concluding the proof
Writing
we now obtain as a result of the combination of our two starting inequalities:
It only remains to select
such that
(i.e., roughly
is of order
) so that by summing the previous inequality one obtains
which is exactly the
rate we were looking for.