Splitting methods decompose a complicated problem into a sequence of simpler subproblems. An idea as fundamental as this one has found applications in many areas, such as linear algebra, the integration of ordinary differential equations, and the Monte-Carlo estimation of expectations, to name a few.
Splitting methods have also made their way into mathematical optimization. From solving non-smooth problems to parallel computing, some of the most effective methods are based on splitting. A method that has recently caught my eye is the Davis-Yin three operator splitting, which I will discuss in this blog post.
Outline:
- Introduction
- Iteration Complexity Analysis
- Convergence Theory
- Code
- Open Questions
- References
Introduction
The Davis-Yin three operator splitting is an algorithm, proposed by Damek Davis and Wotao Yin, that can solve optimization problems composed of a sum of three terms. The method can be seen as a slight generalization of previous methods like the Forward-Douglas-Rachford and the Generalized Forward-Backward, a fact that, according to some, was not appropriately acknowledged. Despite this controversy, the method remains in my opinion one of the most exciting recent developments in optimization because of its excellent empirical performance, elegant formulation, and ease of use.
The three operator splitting can solve optimization problems of the form
\begin{equation}\label{eq:opt_objective}\tag{OPT} \minimize_{\boldsymbol{x} \in \RR^d} f(\xx) + g(\xx) + h(\xx) ~, \end{equation}
where we have access to the gradient of $f$ and the proximal operators of $g$ and $h$. The proximal operator is a crucial element of many splitting methods. For a function $\varphi: \RR^d \to \RR$ and step-size $\gamma > 0$, it is denoted $\prox_{\gamma \varphi}$ and defined as
\begin{equation}
\prox_{\gamma \varphi}(\xx) \defas \argmin_{\zz \in \RR^d}\Big\{\varphi(\zz) + \frac{1}{2\gamma}\|\zz - \xx\|^2\Big\}~.
\end{equation}
Despite its simplicity, the proximal operator is at the core of many optimization methods. The following monograph provides a gentle introduction to the topic: Parikh, Neal, and Stephen Boyd. "Proximal algorithms." Foundations and Trends in Optimization, 2014. This formulation makes it possible to express a broad range of problems arising in machine learning and signal processing: $f$ can be any smooth loss function such as the logistic or least squares loss, while the two proximal terms can be generalized to an arbitrary number of them and include many important penalties like the group lasso with overlap, total variation, $\ell_1$ trend filtering, etc. Furthermore, the penalties can be extended-valued, thus allowing optimization over an intersection of convex constraints through the use of the indicator function.
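For readers who prefer code, here is a minimal Python sketch (my own illustration, with helper names of my choosing) of two proximal operators that appear frequently in practice: the prox of $\gamma \|\cdot\|_1$, which is the soft-thresholding operator, and the prox of the indicator function of a box, which is simply the projection onto it.

```python
import numpy as np

def prox_l1(x, gamma):
    """Proximal operator of gamma * ||x||_1: the soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_box(x, gamma, lo=-1.0, hi=1.0):
    """Prox of the indicator of the box [lo, hi]^d: the Euclidean projection.
    The step-size gamma plays no role for indicator functions."""
    return np.clip(x, lo, hi)
```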
In its basic form, the algorithm takes as input an initial guess $\yy_0 \in \RR^d$ and generates a sequence of iterates $\yy_1, \yy_2, \ldots$ through the formula
\begin{align}
\zz_t &= \prox_{\gamma h}(\yy_t)~\label{eq:z_update}\\
\xx_t &= \prox_{\gamma g}(2 \zz_t - \yy_t - \gamma \nabla f(\zz_t))\label{eq:x_update}\\
\yy_{t+1} &= \yy_t - \zz_t + \xx_t~.\label{eq:y_update}
\end{align}
Each iteration requires evaluating the proximal operators of $h$ and $g$, as well as the gradient of $f$. It depends on a single step-size parameter $\gamma$, which is a practical advantage over similar methods like Condat-Vu that depend on two step-size parameters.Considering the three operator splitting and the Condat-Vu algorithm in the same bag is not entirely fair, as the latter can minimize a larger class of objectives of the form \begin{equation} f(\xx) + g(\xx) + h(\boldsymbol{K}\xx)\,, \end{equation} where $\boldsymbol{K}$ is an arbitrary linear operator. Compared to \eqref{eq:opt_objective}, this allows for an extra linear operator inside $h$, which is set to the identity in the three operator splitting. A recently proposed variant of the three operator splitting (“A New Primal–Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator”, by Ming Yan) allows, as the Condat-Vu algorithm does, a linear operator in one of the proximal terms.
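As an illustration of how little machinery is needed, below is a bare-bones Python sketch of iterations \eqref{eq:z_update}–\eqref{eq:y_update}. The callables `grad_f`, `prox_g` and `prox_h` are placeholders to be supplied by the user (for instance, the prox functions sketched in the introduction); this is a literal translation of the update equations, not a full-featured implementation.

```python
import numpy as np

def three_operator_splitting(grad_f, prox_g, prox_h, y0, gamma, max_iter=1000):
    """Minimal sketch of the Davis-Yin three operator splitting.

    grad_f : callable returning the gradient of the smooth term f.
    prox_g, prox_h : callables with signature prox(x, gamma).
    y0 : initial guess.
    gamma : step-size.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        z = prox_h(y, gamma)                              # z update
        x = prox_g(2 * z - y - gamma * grad_f(z), gamma)  # x update
        y = y - z + x                                     # y update
    return x
```

For the step-size $\gamma$ one can for instance use the value $1/L$ discussed in the analysis below.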
The three operator splitting is strongly related to other splitting methods. For example, when $h$ is constant, it defaults to the proximal gradient method. This is easy to verify: in this case the first step becomes $\zz_t = \prox_{\gamma h}(\yy_t) = \yy_t$, and substituting into the next lines gives $\yy_{t+1} = \prox_{\gamma g}(\yy_t - \gamma \nabla f(\yy_t))$, which corresponds to an iteration of the proximal-gradient method. Similarly, whenever $f$ is constant it defaults to the Douglas-Rachford method, as the short computation below shows. A more detailed comparison with related methods can be found in (Raguet 2018).
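For completeness, here is the computation behind the second claim: when $f$ is constant its gradient vanishes, and iterations \eqref{eq:z_update}–\eqref{eq:y_update} reduce to
\begin{equation}
\begin{aligned}
\zz_t &= \prox_{\gamma h}(\yy_t)\\
\xx_t &= \prox_{\gamma g}(2 \zz_t - \yy_t)\\
\yy_{t+1} &= \yy_t - \zz_t + \xx_t~,
\end{aligned}
\end{equation}
which is exactly the Douglas-Rachford iteration applied to $g + h$.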
Iteration complexity analysis
In this section I will give a simplified proof of the iteration complexity of the method for convex functions.
Assumptions. I assume that $f$, $g$ and $h$ are convex, proper and lower semicontinuous functions. In addition, $f$ is differentiable with Lipschitz gradient, where $L$ denotes its Lipschitz constant. In other words, $f$ verifies
\begin{equation}
\|\nabla f(\xx) - \nabla f(\yy)\| \leq L \|\xx - \yy\|
\end{equation}
for all $\xx, \yy$ in the domain. The class of functions that verify this property is sometimes called $L$-smooth functions.
It is common to analyze optimization algorithms by proving that the objective function decreases at each iteration. However, this turns out to be problematic for the three operator splitting, as the value of the objective function can be infinite at the iterates. This happens, for example, when both $g$ and $h$ are indicator functions over sets.
Davis and Yin overcame this difficulty by assuming Lipschitz continuity of one of the proximal terms (Davis and Yin 2017, Theorem 3.1 and 3.2). However, I find this not satisfactory because it leaves out important applications of the algorithm like optimization over an intersection of constraints.
A different approach was proposed in our recent paper, and consists in reformulating the original optimization problem as a saddle point problem. Although this approach seems to be new in the context of analyzing the three operator splitting, it is a classical technique for other primal-dual methods; see for instance the analysis of (Chambolle and Pock, 2016) for a different class of splitting methods. Let $h^*$ denote the convex conjugate of $h$; then we have the following identities:
\begin{align}
&\min_{\xx \in \RR^d} f(\xx) + g(\xx) + h(\xx) \\
&\quad = \min_{\xx \in \RR^d} f(\xx) + g(\xx) + \max_{\uu \in \RR^d}\left\{\langle \xx, \uu \rangle - h^*(\uu)\right\}\\
&\quad= \min_{\xx \in \RR^d} \max_{\uu \in \RR^d} \underbrace{f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)}_{\defas \mathcal{L}(\xx, \uu)}~.
\end{align}
We have transformed the original minimization into the problem of finding a saddle point of $\mathcal{L}(\xx, \uu) = f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)$.
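As a concrete example of this reformulation, take $h = \|\cdot\|_1$. Its convex conjugate is the indicator function of the $\ell_\infty$ unit ball, so the saddle point problem becomes
\begin{equation}
\min_{\xx \in \RR^d} \max_{\|\uu\|_\infty \leq 1} f(\xx) + g(\xx) + \langle \xx, \uu \rangle~,
\end{equation}
where the nonsmooth term $h$ now only enters through a constraint on the dual variable $\uu$.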
Formally, a saddle point of $\mathcal{L}$ is a pair $(\xx^\star, \uu^\star)$ such that the following is verified for any $(\xx, \uu)$ in the domain:
\begin{equation}\label{eq:saddle_point}
\mathcal{L}(\xx^\star, \uu) \leq \mathcal{L}(\xx, \uu^\star) ~.
\end{equation}
From this definition it follows that the quantity $\mathcal{L}(\xx', \uu) - \mathcal{L}(\xx, \uu')$ is nonpositive for all $(\xx, \uu)$ only when $(\xx', \uu')$ is a saddle point, and so it is a meaningful convergence criterion.
An unexpected side effect of using the saddle point suboptimality is that it considerably simplifies the iteration complexity analysis. Below is a proof of the $\mathcal{O}(1/t)$ convergence rate of the saddle point suboptimality for the step-size $\gamma=\frac{1}{L}$. Later we will see that, under some extra assumptions, it is also possible to derive a convergence rate on the objective suboptimality from this result.
**Theorem 1** (Ergodic convergence rate). Let $\yy_t, \xx_t, \zz_t$ denote the sequence of iterates produced by the three operator splitting, let $\uu_t$ be defined as $\uu_t \defas \frac{1}{\gamma}(\yy_t - \zz_t)$, and let $\overline{\xx}_t \defas \frac{1}{t+1}\sum_{i=0}^t \xx_i$, $\overline{\uu}_t \defas \frac{1}{t+1}\sum_{i=0}^t \uu_i$ denote the ergodic (or average) sequences of iterates. Then for $\gamma=\frac{1}{L}$ we have the following suboptimality bound, valid for all $t \geq 0$ and all $\xx, \uu$ in the domain of $\mathcal{L}$: \begin{equation} \mathcal{L}(\overline\xx_{t}, \uu) - \mathcal{L}(\xx, \overline\uu_{t}) \leq \frac{L\|\yy_0 - \xx - \gamma \uu\|^2}{2(t+1)}~. \end{equation}
The proof technique is the same as in our recent paper, and consists, roughly speaking, in interpreting the three operator splitting as two consecutive applications of the proximal-gradient algorithm and then using known properties of this method.
The proof crucially relies on the following technical lemma that relates the output of a proximal-gradient step with its input and an arbitrary point in the domain:
Lemma 1 (three point inequality). Let $\psi$ be convex and $L$-smooth and let $\eta$ be convex. If $\aa^+ = \prox_{\gamma \eta}(\aa - \gamma \nabla \psi(\aa))$ for a step-size $\gamma > 0$, then for all $\bb$ in the domain of $\eta$ we have
\begin{equation}
\begin{aligned}
&\psi(\aa^+) + \eta(\aa^+) - \psi(\bb) - \eta(\bb)\\
&\qquad \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2
\end{aligned}\label{eq:three_point_inequality}
\end{equation}
By $L$-smoothness and convexity, respectively, we have the inequalities
\begin{equation}
\begin{aligned}
\psi(\aa^+) &\leq \psi(\aa) + \langle \nabla \psi(\aa), \aa^+ - \aa\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2\\
\psi(\aa) &\leq \psi(\bb) + \langle \nabla \psi(\aa), \aa - \bb\rangle~.
\end{aligned}
\end{equation}
Adding them together gives the inequality
\begin{equation}
\psi(\aa^+) \leq \psi(\bb) + \langle \nabla \psi(\aa), \aa^+ - \bb\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2
\end{equation}
By the definition of the proximal operator, $\aa^+$ is characterized by the first-order optimality condition \begin{equation} 0 \in \frac{1}{\gamma}(\aa^+ - \aa) + \nabla\psi(\aa) + \partial \eta(\aa^+)~, \end{equation} from which $\frac{1}{\gamma}(\aa - \aa^+) - \nabla\psi(\aa)\in \partial \eta(\aa^+)$, and so by convexity of $\eta$ we have \begin{equation} \eta(\aa^+) - \eta(\bb) \leq \big\langle \tfrac{1}{\gamma}(\aa - \aa^+) - \nabla \psi(\aa), \aa^+ - \bb \big\rangle~. \end{equation}
Finally, adding both inequalities we obtain the claimed bound $$ \psi(\aa^+) - \psi(\bb) + \eta(\aa^+) - \eta(\bb) \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2~. $$
We will start the proof of this theorem by rewriting the algorithm in an equivalent but more convenient form. The Moreau decomposition relates the proximal operator of a function $\varphi$ and that of its convex conjugate $\varphi^*$ through the identity
\begin{equation}
\prox_{\gamma \varphi}(\xx) + \gamma \prox_{\varphi^*/\gamma}(\xx/\gamma) = \xx \quad\text{for all } \xx \in \RR^d,\ \gamma > 0~.
\end{equation}
Using this decomposition and the definition of $\uu_t$, the original formulation of the algorithm in Eq. \eqref{eq:z_update}-\eqref{eq:y_update} is equivalent to:
\begin{align}
\uu_t &= \prox_{h^*/\gamma}(\uu_{t-1} + \xx_{t-1}/\gamma)\\
\zz_t &= \xx_{t-1} + \gamma (\uu_{t-1} - \uu_t)\label{eq:pd_z_update}\\
\xx_t &= \prox_{\gamma g}(\zz_t - \gamma (\nabla f(\zz_t) + \uu_t))
\end{align}
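To see why this reformulation holds, note that the Moreau decomposition applied to the $\zz$ update gives
\begin{equation}
\uu_t = \frac{1}{\gamma}\big(\yy_t - \prox_{\gamma h}(\yy_t)\big) = \prox_{h^*/\gamma}(\yy_t/\gamma)~,
\end{equation}
while the $\yy$ update \eqref{eq:y_update} yields $\yy_t = \yy_{t-1} - \zz_{t-1} + \xx_{t-1} = \gamma \uu_{t-1} + \xx_{t-1}$, so that $\yy_t/\gamma = \uu_{t-1} + \xx_{t-1}/\gamma$. The second and third lines then follow from $\zz_t = \yy_t - \gamma \uu_t$ and $2\zz_t - \yy_t = \zz_t - \gamma \uu_t$.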