Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis

This blog post extends the first part of my notes on the Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap, which generalize and strengthen the primal guarantees obtained there.

Although FW is one of the oldest methods in nonlinear constrained optimization, the development of primal-dual guarantees is relatively recent. In 2010, Clarkson gave primal-dual convergence rates in the particular case in which the domain is the simplex. These results were later extended by Jaggi to arbitrary domains.

These results, however, gave convergence guarantees on the minimum of the gap over all iterates, and not on the last iterate, as is customary for primal suboptimality. It was not until 2017, with the work of Nesterov, that primal-dual convergence rates on the last iterate were obtained. These primal-dual guarantees were recently extended by myself and coauthors to other FW variants such as Away-steps FW, Pairwise FW and FW with backtracking line search.

Problem and notation

As in the first part of these notes, I will be discussing the FW method applied to the following general optimization problem

\begin{equation}\label{eq:fw_objective}

\minimize_{\xx \in \mathcal{D}} f(\boldsymbol{x})~,

\end{equation}

where $f$ is differentiable with $L$-Lipschitz gradient and the domain $\mathcal{D}$ is a convex and compact subset of $\RR^d$. In this post, and unlike in the first part, we will also assume that $f$ is convex.

From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converge towards the solution to \eqref{eq:fw_objective}. Below is the pseudo-code for this algorithm, as presented in the first part of these notes:

\begin{align}
&\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\
&\textbf{For }t=0, 1, \ldots \textbf{ do } \nonumber\\
&\quad\ss_t = \argmax_{\ss \in \mathcal{D}} \langle -\nabla f(\xx_t), \ss\rangle\label{eq:lmo}\\
&\quad \dd_t = \ss_t - \xx_t\nonumber\\
&\quad g_t = -\langle \nabla f(\xx_t), \dd_t \rangle\nonumber\\
&\quad \textbf{If } g_t < \delta: \quad\text{// exit if gap is below tolerance}\nonumber\\
&\quad\qquad\textbf{return } \xx_t\nonumber\\
&\quad \textbf{Variant 1}: \text{set step size as}\nonumber\\
&\quad\qquad\gamma_t = \min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\
&\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\
&\quad\qquad\gamma_t = \argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \dd_t)\label{eq:line_search}\\
&\quad\xx_{t+1} = \xx_t + \gamma_t \dd_t~.\label{eq:update_rule}\\
&\textbf{end For loop}\nonumber\\
&\textbf{return } \xx_t\nonumber
\end{align}
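For concreteness, here is a minimal Python sketch of Variant 1 (short step size). The interface is hypothetical: `grad` is assumed to return $\nabla f$, `lmo` to solve the linear subproblem \eqref{eq:lmo}, and `L` to be the Lipschitz constant of the gradient. It is meant as an illustration of the pseudo-code above, not a reference implementation.

```python
import numpy as np

def frank_wolfe(x0, grad, lmo, L, tol=1e-6, max_iter=1000):
    """Sketch of Variant 1 (short step size) of the algorithm above.

    Assumed interface: grad(x) returns the gradient of f at x, and
    lmo(v) solves the linear subproblem argmax_{s in D} <-v, s>.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g_x = grad(x)
        s = lmo(g_x)                        # Frank-Wolfe vertex s_t
        d = s - x                           # update direction d_t
        gap = -np.dot(g_x, d)               # FW gap g_t
        if gap < tol:                       # certified stopping criterion
            break
        step = min(gap / (L * np.dot(d, d)), 1.0)   # short step size
        x = x + step * d
    return x
```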

Duality and Frank-Wolfe

In convex optimization, every problem can be paired with another, said to be dual to it. On the basis of this, close connections between otherwise disparate properties arise. The FW algorithm has deep and unexpected links with duality, which we will now explore.

The dual problem is a concave maximization problem that we can associate with our original convex minimization problem. Let $\imath_{\mathcal{D}}$ be the indicator function of $\mathcal{D}$, that is, the function that returns $0$ if its argument belongs to $\mathcal{D}$ and $+\infty$ otherwise. It is sometimes more convenient to specify the constraints of the problem through this indicator function: our original optimization problem $\min_{\xx \in \mathcal{D}}f(\xx)$ can be equivalently written as the unconstrained problem $\min_{\xx \in \RR^d} f(\xx) + \imath_{\mathcal{D}}(\xx)$. We will also need the notion of convex conjugate.

Definition (convex conjugate). Let $f: \RR^d \to [-\infty, \infty]$ be an extended real-valued function. Its convex conjugate, denoted $f^*$, is the function $f^*: \RR^d \to [-\infty, \infty]$ defined by

$$
f^{*}(\yy) \defas \sup_{\xx \in \RR^d} \left\{ \langle \yy , \xx \rangle - f \left( \xx \right) \right\}\,,\qquad \yy \in \RR^d\,.
$$

With these definitions, we have the following sequence of equivalences, starting from the original (primal) problem:
\begin{align}
\min_{\xx \in \mathcal{D}} f(\xx) &= \min_{\xx \in \RR^d} f(\xx) + \imath_{\mathcal{D}}(\xx)\\
&= \min_{\xx \in \RR^d} \max_{\uu \in \RR^d}\Big\{\langle \xx, \uu\rangle - f^*(\uu)\Big\} + \imath_{\mathcal{D}}(\xx)\\
&= \max_{\uu \in \RR^d}\min_{\xx \in \RR^d}\Big\{ \imath_{\mathcal{D}}(\xx) + \langle \xx,\uu\rangle\Big\} - f^*(\uu)\\
&= \max_{\uu \in \RR^d}\underbrace{-\imath_{\mathcal{D}}^*(-\uu) - f^*(\uu)}_{\defas \psi(\uu)}~,\label{eq:pd_relationship}
\end{align}
where in the second identity we have used the fact that $f = f^{**}$ (valid since $f$ is convex and continuous), and in the third identity Sion's minimax theorem to exchange the $\max$ and $\min$. Note also that we can use $\max$ instead of the usual $\sup$ in these expressions because the smoothness of $f$ implies strong convexity of $f^*$, and hence the supremum is attained. We will call \eqref{eq:pd_relationship}, which is a concave maximization problem, the *dual problem*. By extension, we will refer to $\psi$ as the dual objective or dual function.
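As a concrete illustration (this example is not needed for what follows): the conjugate of the indicator, $\imath_{\mathcal{D}}^*$, is the support function of $\mathcal{D}$, so for specific domains the dual objective takes an explicit form. For instance, when $\mathcal{D}$ is the unit $\ell_2$ ball,

$$
\imath_{\mathcal{D}}^*(\yy) = \sup_{\xx \in \mathcal{D}} \langle \yy, \xx\rangle = \|\yy\|_2\,,\qquad\text{and so}\qquad \psi(\uu) = -\|\uu\|_2 - f^*(\uu)\,.
$$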

The dual objective lower bounds the primal optimum, and so the difference between primal and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be any solution to the original primal problem \eqref{eq:fw_objective}. Since $\psi(\uu) \leq f(\xx^\star)$ for any $\uu$ (weak duality), we have \begin{equation} f(\xx) - f(\xx^\star) \leq f(\xx) - \psi(\uu)~, \end{equation} for any $\xx \in \mathcal{D}$ and any $\uu \in \dom(\psi)$. This is quite remarkable, as it gives a meaningful bound on the distance to optimum without needing to know $\xx^\star$, which is of course unknown. It can then be used, for instance, as a stopping criterion in optimization algorithms. The quantity $f(\xx) - \psi(\uu)$ is often referred to as the primal-dual gap.
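For completeness, here is the short argument behind the weak duality inequality $\psi(\uu) \leq f(\xx^\star)$, which is not spelled out above. Writing $-\imath_{\mathcal{D}}^*(-\uu)$ as a minimum, as in the derivation of \eqref{eq:pd_relationship}, and using the Fenchel-Young inequality $\langle \xx^\star, \uu\rangle - f^*(\uu) \leq f(\xx^\star)$ gives
\begin{align}
\psi(\uu) &= \min_{\xx \in \RR^d}\Big\{\imath_{\mathcal{D}}(\xx) + \langle \xx, \uu\rangle\Big\} - f^*(\uu)\nonumber\\
&\leq \imath_{\mathcal{D}}(\xx^\star) + \langle \xx^\star, \uu\rangle - f^*(\uu) \leq f(\xx^\star)~,\nonumber
\end{align}
where the last step also uses $\imath_{\mathcal{D}}(\xx^\star) = 0$, since $\xx^\star \in \mathcal{D}$.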

However, computing the primal-dual gap involves evaluating the dual objective, which in the general case can be as costly as solving the original problem. What is special in the case of FW is that the primal-dual gap is given as a byproduct of the method. This is quite unique among optimization methods.

The next lemma shows that the primal-dual gap (for a specific choice of primal and dual variables) is exactly equal to the FW gap $g_t$.

Lemma 2. The FW gap $g_t$ is the primal-dual gap evaluated at the pair $\xx = \xx_t$, $\uu = \nabla f(\xx_t)$. More precisely, we have \begin{equation} g_t = f(\xx_t) - \psi(\nabla f(\xx_t))~. \end{equation}

We have the following sequence of identities:
\begin{align}
g_t &= \max_{\ss \in \mathcal{D}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\
&= \langle \nabla f(\xx_t), \xx_t\rangle + \max_{\ss \in \RR^d}\Big\{ \langle -\nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{D}}(\ss)\Big\}\\
&= \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{D}}(-\nabla f(\xx_t))\\
&= f(\xx_t) + \underbrace{\imath^*_{\mathcal{D}}(-\nabla f(\xx_t)) + f^*(\nabla f(\xx_t))}_{-\psi(\nabla f(\xx_t))}~,
\end{align}
where the first identity uses the definition of $\ss_t$, the third one the definition of convex conjugate, and the last one is a consequence of the Fenchel-Young identity (see for example Proposition 16.10 of Bauschke and Combettes).
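To make the lemma more tangible, here is a small numerical check (my own addition) on a toy instance where both sides can be evaluated in closed form: $f(\xx) = \tfrac{1}{2}\|\xx - \boldsymbol{b}\|^2$ over the probability simplex, for which $f^*(\uu) = \tfrac{1}{2}\|\uu\|^2 + \langle \uu, \boldsymbol{b}\rangle$ and $\imath_{\mathcal{D}}^*(\yy) = \max_i y_i$.

```python
import numpy as np

# Sanity check of Lemma 2 on a toy instance: f(x) = 0.5 * ||x - b||^2
# over the probability simplex. Both the conjugate f* and the support
# function of the simplex have closed forms, so psi(u) = -i_D*(-u) - f*(u)
# can be evaluated directly.
rng = np.random.default_rng(0)
d = 5
b = rng.normal(size=d)
x = rng.random(d)
x /= x.sum()                              # a feasible point in the simplex

grad = x - b                              # gradient of f at x
f_x = 0.5 * np.dot(x - b, x - b)

# FW gap: the LMO over the simplex selects the coordinate with the
# most negative gradient, hence g = <grad, x> - min_i grad_i
gap = np.dot(grad, x) - grad.min()

# dual objective: i_D*(y) = max_i y_i and f*(u) = 0.5*||u||^2 + <u, b>
def psi(u):
    return -np.max(-u) - (0.5 * np.dot(u, u) + np.dot(u, b))

print(np.isclose(gap, f_x - psi(grad)))   # prints True
```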