This was a busy week and I had no time to read anything new, so I’m sharing a note I wrote for myself, for no other reason than to understand things better. It’s a kind of cookbook of various “transformations” you can apply to a machine learning problem to eventually turn it into something we know how to solve: seeking stable attractors of a tractable vector field.
The typical setup is: you have some model parameters $\theta$. You seek to optimize some objective criterion, but the optimization problem is intractable or hard in one of the ways listed below. You then apply the corresponding transformation to your problem if you can. If your problem is now one you can efficiently optimize, great. If not, you can recursively apply the transformations until it is.
UPDATE: Although I called this post a cookbook, as readers so rightly pointed out, it is too light on details to be considered a cookbook. Consider it instead a demonstration of a way of thinking about machine learning research: as a compiler that compiles an abstract machine learning objective into the canonical optimization problem of finding stable attractors of a tractable vector field.
For the first batch, I have written up the following problem transformations: variational bounds, adversarial games, evolution strategies, and convex relaxation.
There are many more transformations not included here, like the duality principle, half-quadratic splitting, Lagrangian multipliers, etc. Feel free to leave comments about what else I should include next.
Variational bounds
Typical problem:
My loss function $f(\theta)$ is intractable to compute, typically because it involves intractable marginalization. I can’t evaluate it, let alone minimize it.
Solution:
Let’s construct a family of, typically differentiable, upper bounds $g(\theta, \psi) \geq f(\theta)$, indexed by an auxiliary parameter $\psi$, and solve the optimization problem
$$\min_{\theta, \psi} g(\theta, \psi)$$
instead. Technically, once optimization is finished, you can discard the auxiliary parameter $\psi^\ast$, although it often turns out to be meaningful and useful in itself, typically for approximate inference, such as the recognition model of VAEs.
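To make the pattern concrete, here is a tiny numpy sketch (my own toy example, not one from the post): the awkward $\vert\theta\vert$ term in a one-dimensional lasso-like objective is replaced by a quadratic upper bound that is tight at $\psi = \vert\theta\vert$, and the joint bound $g(\theta, \psi)$ is minimized by coordinate descent, each step available in closed form:

```python
import numpy as np

# Toy illustration: minimize f(theta) = 0.5*(theta - 3)**2 + lam*|theta|.
# The |theta| term is awkward, so use the family of upper bounds
#   |theta| <= theta**2 / (2*psi) + psi/2   for any psi > 0
# (AM-GM, tight at psi = |theta|) and minimize the joint bound
#   g(theta, psi) = 0.5*(theta - 3)**2 + lam*(theta**2/(2*psi) + psi/2)
# by coordinate descent.
lam = 1.0
theta, psi = 1.0, 1.0
for _ in range(100):
    psi = abs(theta) + 1e-12          # tighten the bound: argmin_psi g
    theta = 3.0 / (1.0 + lam / psi)   # minimize the quadratic g over theta

print(theta)  # ~2.0, the soft-thresholded solution max(3 - lam, 0)
```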
Tricks of the trade:
Jensen’s inequality: The mean value of a convex function is never lower than the value of the convex function applied to the mean. It generally appears in some variant of the standard evidence lower bound (ELBO) derivation below:
\begin{align}
- \log p(x) &= - \log \int p(x,y) dy \\
&= - \log \int q(y\vert x) \frac{p(y,x)}{q(y\vert x)}dy \\
&\leq - \int q(y\vert x) \log \frac{p(y,x)}{q(y\vert x)} dy
\end{align}
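As a sanity check, here is a small numpy sketch (my own example) using a conjugate Gaussian model where $-\log p(x)$ is available in closed form, so we can verify that the Monte Carlo estimate of the bound is indeed an upper bound, and that it is tight when $q$ is the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y ~ N(0,1), x|y ~ N(y,1), so marginally x ~ N(0,2)
# and -log p(x) is available in closed form.
x = 1.5
neg_log_px = 0.5 * x**2 / 2 + 0.5 * np.log(2 * np.pi * 2)

def neg_elbo(m, s, n=200_000):
    """Monte Carlo estimate of -E_q[log p(y,x) - log q(y|x)], q = N(m, s^2)."""
    y = m + s * rng.standard_normal(n)
    log_p_joint = (-0.5 * y**2 - 0.5 * np.log(2 * np.pi)          # log p(y)
                   - 0.5 * (x - y)**2 - 0.5 * np.log(2 * np.pi))  # log p(x|y)
    log_q = -0.5 * ((y - m) / s)**2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return -np.mean(log_p_joint - log_q)

print(neg_log_px)                      # ~1.829
print(neg_elbo(0.0, 1.0))              # a loose upper bound, larger
print(neg_elbo(x / 2, np.sqrt(0.5)))   # q = exact posterior: bound is tight
```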
Reparametrization trick: In variational inference we often encounter gradients of the form
$$\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right].$$
If we can find a function $h$ and a fixed noise distribution $p_\epsilon$ such that
$$x = h(\epsilon, \theta), \epsilon \sim p_\epsilon \iff x \sim q_{\theta},$$
then we can move the differentiation inside the expectation:
$$\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right] = \mathbb{E}_{\epsilon \sim p_\epsilon}\left[ \frac{\partial}{\partial \theta_i} f(h(\epsilon, \theta), q_\theta(h(\epsilon, \theta)))\right].$$
Monte Carlo estimators of this expectation typically have substantially lower variance than REINFORCE estimators of the same quantity.
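A minimal numpy comparison of the two estimators (my own toy example, where the true gradient is $2\mu$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: d/dmu E_{x ~ N(mu, 1)}[x^2] = 2*mu.
mu, n = 2.0, 100_000
eps = rng.standard_normal(n)
x = mu + eps  # reparametrization: x = h(eps, mu) = mu + eps

# Reparametrized (pathwise) per-sample gradient: d/dmu x^2 = 2x
reparam = 2 * x

# REINFORCE per-sample gradient: d/dmu log N(x; mu, 1) * x^2 = (x - mu) * x^2
reinforce = (x - mu) * x**2

for name, g in [("reparam", reparam), ("reinforce", reinforce)]:
    print(f"{name}: mean={g.mean():.3f} (true 4.0), std={g.std():.3f}")
# Both are unbiased, but the REINFORCE samples have much higher variance.
```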
Adversarial games
Typical problem:
I can’t estimate my loss function $f(\theta)$ directly from samples, typically because the loss function depends on the pdf of the data distribution or that of my model, or both.
Solution:
We can sometimes construct a pair of auxiliary objectives $g(\theta, \psi)$ and $h(\theta, \psi)$ so that
$$f(\theta) \approx h(\theta, \psi^\ast(\theta)), \quad \text{where } \psi^\ast(\theta) = \operatorname{argmin}_\psi g(\theta, \psi),$$
and then we can solve the problem of finding a stable equilibrium of the two-player game where players minimize losses $g$ and $h$ with respect to $\psi$ and $\theta$, respectively. Sometimes, the approximations may come in the form of lower bounds, which is the case when $h=-g$:
$$f(\theta) \geq h(\theta, \psi) \quad \text{for all } \psi,$$
in which case we can solve the following minimax problem instead:
$$\min_\theta \max_\psi h(\theta, \psi).$$
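As a toy illustration of the game-solving view (my own example, much simpler than a GAN), simultaneous gradient descent-ascent on a small quadratic game spirals into its stable equilibrium:

```python
# Toy two-player game: h(theta, psi) = -(psi - theta)**2 + theta**2.
# theta minimizes h, psi maximizes h; the unique equilibrium is (0, 0).
theta, psi, lr = 1.0, -1.0, 0.05
for _ in range(2000):
    d_theta = 2 * (psi - theta) + 2 * theta   # dh/dtheta
    d_psi = -2 * (psi - theta)                # dh/dpsi
    theta -= lr * d_theta                     # descent on h
    psi += lr * d_psi                         # ascent on h

print(theta, psi)  # both spiral in close to 0
```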
Tricks of the trade:
Bayes optimality in auxiliary tasks: When your loss function depends on densities of probability distributions you can easily sample from, you can often construct an auxiliary task whose Bayes-optimal solution depends on the values of the densities in question. Examples of auxiliary tasks are: binary classification for likelihood ratio estimation, denoising or score matching for estimating the score function.
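Here is a minimal numpy sketch of the first of these (my own toy example): a logistic classifier trained to distinguish samples of $p = \mathcal{N}(0,1)$ from samples of $q = \mathcal{N}(1,1)$ recovers the log likelihood ratio, which in this case is exactly $0.5 - x$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Label 1: samples from p = N(0,1); label 0: samples from q = N(1,1).
# The Bayes-optimal logit equals log p(x)/q(x) = 0.5 - x, linear in x.
n = 50_000
x = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])
t = np.concatenate([np.ones(n), np.zeros(n)])

w, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * x + b)))
    err = p - t                    # gradient of the cross-entropy loss
    w -= 0.5 * np.mean(err * x)
    b -= 0.5 * np.mean(err)

print(w, b)  # close to the true log-ratio coefficients: w ~ -1, b ~ 0.5
```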
Convex conjugate: In case your loss function involves a convex functional of the densities (such as in f-divergences), you can transform your problem by re-expressing it in terms of the convex conjugate. The expression for $f$ in terms of its convex conjugate $f^{\ast}$ is:
\begin{align}
f(u) &= \sup_{v \in \operatorname{dom}(f^{\ast})} \left\{ \langle u, v \rangle - f^{\ast}(v) \right\} \\
&\geq \sup_{\psi} \left\{ \langle u, v_\psi \rangle - f^{\ast}(v_\psi) \right\},
\end{align}
where the inequality comes from restricting the supremum to a parametric family $v_\psi$. Note that if $u$ is a density function, then the inner product $\langle u,v_\psi\rangle$ is the expectation of $v_\psi$ under $u$, which can be approximated by Monte Carlo sampling.
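A quick numeric sanity check (my own example) for the generator of the KL divergence, $f(u) = u\log u$, whose conjugate is $f^{\ast}(v) = e^{v-1}$:

```python
import numpy as np

# For f(u) = u*log(u) we have f*(v) = exp(v - 1), so
#   f(u) = sup_v { u*v - exp(v - 1) },  attained at v = 1 + log(u).
u = 0.7
v = np.linspace(-5, 5, 100_001)
lower_bounds = u * v - np.exp(v - 1)   # one lower bound per candidate v

print(lower_bounds.max())              # ~ u*log(u) = -0.2497
print(u * np.log(u))
print(v[lower_bounds.argmax()])        # ~ 1 + log(u) = 0.643
```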
Evolution strategies
Typical problem:
My $f(\theta)$ is easy to evaluate but hard to optimize, perhaps because it contains discrete operations, or the function is piecewise constant. Backpropagation is not possible.
Solution:
Observe that for any probability distribution $p_\psi$ over $\theta$, the following holds:
$$\min_\theta f(\theta) \leq \mathbb{E}_{\theta \sim p_\psi}\left[f(\theta)\right],$$
therefore in ES we concentrate on the following optimization problem instead:
$$\min_\psi \mathbb{E}_{\theta \sim p_\psi}\left[f(\theta)\right].$$
Often, depending on the function $f$ and the class of distributions $p_\psi$, a local minimum of $f$ can be recovered from a local minimum $\psi^\ast$ of this surrogate.
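To see why this helps, here is a tiny sketch (my own example): the expectation of a step function under a Gaussian $p_\psi$ is a smooth function of $\psi$, even though the step function itself has zero gradient almost everywhere:

```python
import math

# f(theta) = 1 if theta > 0 else 0 is piecewise constant, but with
# p_psi = N(psi, sigma^2) the surrogate is the smooth function
#   E_{theta ~ N(psi, sigma^2)}[f(theta)] = Phi(psi / sigma).
def surrogate(psi, sigma=1.0):
    return 0.5 * (1 + math.erf(psi / (sigma * math.sqrt(2))))

for psi in (-2, -1, 0, 1, 2):
    print(psi, surrogate(psi))  # increases smoothly from ~0 to ~1
```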
Tricks of the trade:
REINFORCE gradient estimator: It relies on the following identity:
$$\frac{\partial}{\partial \psi_i} \mathbb{E}_{\theta\sim p_\psi}\left[f(\theta)\right] = \mathbb{E}_{\theta \sim p_\psi}\left[\frac{\partial}{\partial \psi_i}\log p_\psi(\theta)\, f(\theta)\right],$$
where the RHS can be easily approximated by Monte Carlo. The variance of Monte Carlo REINFORCE estimators tends to be quite high.
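Putting the two together, here is a small numpy sketch (my own toy example) that minimizes a piecewise-constant objective with ES, using the REINFORCE estimator with a mean baseline to keep the variance manageable:

```python
import numpy as np

rng = np.random.default_rng(0)

# A piecewise-constant objective that backprop cannot handle;
# it has a flat minimum region around theta = 3.
def f(theta):
    return np.floor(np.abs(theta - 3.0))

psi, sigma, lr = -4.0, 1.0, 0.05
for _ in range(500):
    theta = psi + sigma * rng.standard_normal(200)   # sample a population
    fs = f(theta)
    # REINFORCE estimate of d/dpsi E[f]; the score of N(psi, sigma^2)
    # w.r.t. psi is (theta - psi) / sigma^2. A mean baseline cuts variance.
    grad = np.mean((theta - psi) / sigma**2 * (fs - fs.mean()))
    psi -= lr * grad

print(psi)  # close to 3, where f is minimal
```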
Convex relaxation
Typical problem:
My $f(\theta)$ is hard to optimize, because it has non-differentiable and non-convex components like the $\ell_0$-norm of a vector in sparse methods, or the Heaviside step function in classification.
Solution:
Replace the non-convex component with a convex approximation, turning your objective $f$ into a, now typically convex, surrogate $g(\theta) \approx f(\theta)$, and solve $\min_\theta g(\theta)$ instead.
Tricks of the trade:
$\ell_1$ loss: In many sparse learning situations we wish to minimize the count of non-zero entries in a vector, known as its $\ell_0$ norm. You can typically replace this with the $\ell_1$ norm of the vector.
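A small numpy sketch of this relaxation in action (my own example): ISTA, i.e. proximal gradient descent on the lasso objective, recovers a sparse vector from noisy linear measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Recover a sparse vector by replacing the l0 penalty with l1 and running
# ISTA on the lasso objective 0.5*||A @ theta - y||^2 + lam*||theta||_1.
n, d = 80, 200
A = rng.standard_normal((n, d)) / np.sqrt(n)
theta_true = np.zeros(d)
theta_true[rng.choice(d, 5, replace=False)] = 3.0
y = A @ theta_true + 0.01 * rng.standard_normal(n)

lam = 0.05
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
theta = np.zeros(d)
for _ in range(2000):
    z = theta - step * A.T @ (A @ theta - y)                      # gradient step
    theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

print(sorted(np.flatnonzero(theta)), sorted(np.flatnonzero(theta_true)))
# the recovered support should typically match the true one
```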
Hinge loss and large margin methods: The error rate of a binary classifier, the zero-one loss, is typically a piecewise constant function of the classifier’s parameters and is thus hard to optimize. We can replace the zero-one loss with the hinge loss, which can be understood as a convex upper bound on it. The resulting optimization problem has a tendency to maximize the classifier’s margin.
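And a final numpy sketch (my own toy example): subgradient descent on the hinge loss for a linear classifier on separable two-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Replace the zero-one loss with the hinge loss max(0, 1 - t*(w @ x + b))
# and minimize it by subgradient descent on two well-separated blobs.
n = 200
x = np.vstack([rng.normal(-2, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
t = np.concatenate([-np.ones(n), np.ones(n)])

w, b = np.zeros(2), 0.0
for _ in range(1000):
    margin = t * (x @ w + b)
    active = margin < 1                       # points inside the margin
    # subgradient of the mean hinge loss (+ tiny l2 penalty to bound w)
    gw = -(t[active, None] * x[active]).sum(0) / len(x) + 1e-3 * w
    gb = -t[active].sum() / len(x)
    w -= 0.1 * gw
    b -= 0.1 * gb

print(np.mean(np.sign(x @ w + b) == t))  # training accuracy, close to 1.0
```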