Two exciting NLP papers at ICML 2018!
ICML 2018 accepts are out, and I am excited about two papers that I will briefly outline here. I think both papers are phenomenally good and will help bring structured prediction in NLP back into modern deep learning architectures.
Differentiable Dynamic Programming for Structured Prediction and Attention
Arthur Mensch and Mathieu Blondel [arXiv]
Dynamic Programming (DP) is the bread and butter of popular NLP algorithms. Many NLP problems have an implicit (overlapping) substructure that makes them natural candidates for a DP solution. DPs are so central to NLP that researchers have even built specialized declarative programming languages for specifying and solving them.
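To make this concrete, the best-path (Viterbi) recursion for a linear-chain sequence model, whose computational graph the figure below depicts, can be written as (notation mine, not the paper's):

$$
v_1(j) = \theta_1(j), \qquad v_t(j) = \max_i \big[ v_{t-1}(i) + \theta_t(i, j) \big], \qquad \text{score}^\star = \max_j v_T(j),
$$

where \theta_t(i, j) is the score of moving from tag i to tag j at position t. Each cell reuses the cells of the previous position, which is exactly the overlapping substructure that DP exploits.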
(Figure: computational graph for the Viterbi algorithm)
With the recent advent of deep learning and computational graph frameworks, we saw a trend of representing everything as simple layers with nonlinearities, and of hammering complex structured inputs and outputs, such as those found in language, into submission with these simple architectures plus large amounts of data and compute. This view of the world is quite limiting, as it doesn’t allow us to encode structured priors directly in the models.
Traditional DPs are usually non-differentiable (due to the “max” operator in the recursion), which is why we never saw any “DP layers” in modern DL architectures. This paper changes that by proposing a smooth version of \max, \max_{\Omega}, where \Omega is a strongly convex regularizer such as negative entropy or \ell_{2}, and building “differentiable” DP (DDP) layers on top of it. Of course, this line of thinking is not completely new. I know Jason Eisner has been explaining this for a while, and for the particular case of the inside-outside algorithm he has a tutorial paper showing that inside-outside is nothing but backprop. (The forward-backward algorithm is a special case of inside-outside.)
“Computational linguists have been backpropagating through their arithmetic circuits for a long time without realizing it—indeed, since before Rumelhart et al. (1986) popularized the use of this technique to train neural networks. Recognizing this connection can help us to understand, teach, develop, and implement many core algorithms of the field.” – Eisner, 2016
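To make the smoothed operator concrete, the paper's \max_{\Omega} can be written roughly as (my paraphrase of the definition):

$$
\max_{\Omega}(\mathbf{x}) \;=\; \max_{\mathbf{q} \in \triangle} \; \langle \mathbf{q}, \mathbf{x} \rangle - \Omega(\mathbf{q}),
$$

where \triangle is the probability simplex. With \Omega set to negative entropy this recovers log-sum-exp, whose gradient is the familiar softmax; with a squared \ell_{2} regularizer the gradient is a sparse (sparsemax-style) distribution. Either way, swapping \max_{\Omega} into a DP recursion such as Viterbi's makes the whole recursion differentiable end to end.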
However, what’s new is the generality of the Mensch and Blondel paper: a tutorial-like synthesis of recent work, a description of DDP layers and DP operators in the context of computational graph frameworks, and an accompanying PyTorch implementation.
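For intuition, here is a minimal sketch (mine, not the authors' reference implementation) of the entropy-smoothed case in PyTorch: replacing \max with torch.logsumexp in the Viterbi recursion yields a differentiable score, and backpropagating through it recovers exactly the forward-backward marginals.

```python
import torch

def smoothed_viterbi_score(emissions, transitions):
    """Entropy-smoothed Viterbi: replacing max with logsumexp turns the
    best-path score into the log-partition function, and its gradient
    w.r.t. the potentials gives the marginals (forward-backward).

    emissions:   (T, K) tensor of per-position tag scores
    transitions: (K, K) tensor of tag-to-tag scores (position-independent here)
    """
    T, K = emissions.shape
    v = emissions[0]                      # scores of length-1 prefixes
    for t in range(1, T):
        # entry (i, j): v[i] + transitions[i, j] + emissions[t, j]
        scores = v.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        v = torch.logsumexp(scores, dim=0)  # smoothed max over previous tag i
    return torch.logsumexp(v, dim=0)        # smoothed max over final tags

# toy usage: gradients w.r.t. the potentials are the expected (marginal) counts
T_len, K = 5, 3
emissions = torch.randn(T_len, K, requires_grad=True)
transitions = torch.randn(K, K, requires_grad=True)
log_Z = smoothed_viterbi_score(emissions, transitions)
log_Z.backward()
print(emissions.grad)    # unary marginals, shape (5, 3)
print(transitions.grad)  # expected transition counts, shape (3, 3)
```

This pattern, a DP recursion built from smoothed max operators whose backward pass produces the marginals, is, as far as I can tell, what the paper packages up as reusable DDP layers.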
SparseMAP: Differentiable Sparse Structured Inference
Vlad Niculae, André F. T. Martins, Mathieu Blondel, and Claire Cardie [arXiv]
Many NLP tasks involve structured prediction, where the outputs are combinatorial structures such as sequences, sets, trees, and alignments. Enumerating all possible output structures (say, when doing inference) is expensive and most of the time impossible. For example, if you are parsing, the number of possible parses of a sentence grows with the Catalan numbers. A big chunk of the probabilistic graphical models literature is dedicated to handling this problem. MAP (maximum a posteriori) inference is one way to keep things tractable, but it produces only a single highest-scoring structure. The other option is to “marginalize” over all possible structures and predict a distribution over them, but this requires a lot of compute or complex inference algorithms; there is a long line of work on how to do this marginalization efficiently without explicitly enumerating the possibilities.
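To get a feel for the blow-up, here is a quick back-of-the-envelope check (standard Catalan formula, not tied to either paper):

```python
from math import comb

def catalan(n):
    """C_n = (2n choose n) / (n + 1): the number of binary bracketings
    of a sequence with n + 1 leaves (e.g., words in a sentence)."""
    return comb(2 * n, n) // (n + 1)

for n in (5, 10, 20):
    print(n, catalan(n))
# 5 42
# 10 16796
# 20 6564120420
```

Even a 21-word sentence already has billions of binary bracketings, so anything that requires touching each structure explicitly is off the table.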
SparseMAP sits somewhere in between MAP and marginal inference, and it comes from the following observation: softmax is the entropy-regularized argmax (and argmax is what MAP does). If you replace the entropy term with a particular sparsity-inducing regularizer, you get SparseMAP. (Not to be confused with the earlier, related work on sparsemax.) When can you use SparseMAP? By design, in pretty much any structured prediction problem where MAP inference is available.
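In symbols, my paraphrase of the contrast (with \triangle the probability simplex for the unstructured case and \mathcal{M} the marginal polytope for the structured case):

$$
\operatorname{softmax}(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{p} \in \triangle} \; \langle \mathbf{p}, \mathbf{x} \rangle + H(\mathbf{p}),
\qquad
\operatorname{SparseMAP}(\boldsymbol{\eta}) = \operatorname*{argmax}_{\boldsymbol{\mu} \in \mathcal{M}} \; \langle \boldsymbol{\mu}, \boldsymbol{\eta} \rangle - \tfrac{1}{2} \lVert \boldsymbol{\mu} \rVert^{2}.
$$

MAP is the unregularized argmax over \mathcal{M}, marginal inference is its entropy-regularized counterpart, and the squared \ell_{2} term in between yields solutions that put nonzero weight on only a handful of structures, hence the “sparse” in the name.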
Both papers, unsurprisingly, are related to each other and worth careful reading.