ACL 2018 Highlights： Understanding Representations and Evaluation in More Challenging Settings

I attended the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia from July 15-20, 2018 and presented three papers . It is foolhardy to try to condense an entire conference into one topic; however, in retrospect, certain themes appear particularly pronounced. In 2015 and 2016, NLP conferences were dominated by word embeddings and some people were musing that Embedding Methods in Natural Language Processing was a more appropriate name for the Conference on Empirical Methods in Natural Language Processing, one of the top conferences in the field.

According to Chris Manning, 2017 was the year of the BiLSTM with attention. While BiLSTMs optionally with attention are still ubiquitous, the main themes of this conference for me were to gain a better understanding what the representations of such models capture and to expose them to more challenging settings. In my review, I will mainly focus on contributions that touch on these themes but will also discuss other themes that I found of interest.

Probing models

It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture. This was most commonly done by automatically creating a dataset that focuses on one particular aspect of the generalization behaviour and evaluating different trained models on this dataset:

Conneau et al. for instance evaluate different sentence embedding methods on ten datasets designed to capture certain linguistic features, such as predicting the length of a sentence, recovering word content, sensitivity to bigram shift, etc. They find that different encoder architectures can result in embeddings with different characteristics and that bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results.
Zhu et al. evaluate sentence embeddings by observing the change in similarity of generated triplets of sentences that differ in a certain semantic or syntactic aspect. They find—among other things—that SkipThought and InferSent can distinguish negation from synonymy, while InferSent is better at identifying semantic equivalence and dealing with quantifiers.
Pezzelle et al. focus specifically on quantifiers and test different CNN and LSTM models on their ability to predict quantifiers in single-sentence and multi-sentence contexts. They find that in single-sentence context, models outperform humans, while humans are slightly better in multi-sentence contexts.
Kuncoro et al. evaluate LSTMs on modeling subject-verb agreement. They find that with enough capacity, LSTMs can model subject-verb agreement, but that more syntax-sensitive models such as recurrent neural network grammars do even better.
Blevins et al. evaluate models pretrained on different tasks whether they capture a hierarchical notion of syntax. Specifically, they train the models to predict POS tags as well as constituent labels at different depths of a parse tree. They find that all models indeed encode a significant amount of syntax and—in particular—that language models learn some syntax.
Another interesting result regarding the generalization ability of language models is due to Lau et al. who find that a language model trained on a sonnet corpus captures meter implicitly at human-level performance.
Language models, however, also have their limitations. Spithourakis and Riedel observe that language models are bad at modelling numerals and propose several strategies to improve them.
Liu et al. show that LSTMs trained on natural language data are able to recall tokens from much longer sequence than models trained on non-language data at the Repl4NLP workshop.

In particular, I think better understanding what information LSTMs and language models will become more important, as they seem to be a key driver of progress in NLP going forward, as evidenced by our ACL paper on language model fine-tuning and related approaches.

Understanding state-of-the-art models

While the above studies try to understand a specific aspect of the generalization ability of a particular model class, several papers focus on better understanding state-of-the-art models for a particular task:

Glockner et al. focused on the task of natural language inference. They created a dataset with sentences that differ by at most one word from sentences in the training data in order to probe if models can deal with simple lexical inferences. They find that current state-of-the-art models fail to capture many simple inferences.
Mudrakarta et al. analyse state-of-the-art QA models across different modalities and find that the models often ignore key question terms. They then perturb questions to craft adversarial examples that significantly lower models’ accuracy.

I found many of the papers probing different aspects of models stimulating. I hope that the generation of such probing datasets will become a standard tool in the toolkit of every NLP researchers so that we will not only see more of such papers in the future but that such an analysis may also become part of the standard model evaluation, besides error and ablation analyses.

Analyzing the inductive bias

Another way to gain a better understanding of a model is to analyze its inductive bias. The Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer’s talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
People have used training regimes like reversing the input sequence for machine translation.
People have used enhancements like attention to have direct connections back in time.
For modeling subject-verb agreement, the error rate increases with the number of attractors.

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency.

However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

Using Deep Learning to understand language

In his talk at the workshop, Mark Johnson opined that while Deep Learning has revolutionized NLP, its primary benefit is economic: complex component pipelines have been replaced with end-to-end models and target accuracy can often be achieved more quickly and cheaply. Deep Learning has not changed our understanding of language. Its main contribution in this regard is to demonstrate that a neural network aka a computational model can perform certain NLP tasks, which shows that these tasks are not indicators of intelligence. While DL methods can pattern match and perform perceptual tasks really well, they struggle with tasks relying on deliberate reflection and conscious thought.

Incorporating linguistic structure

Jason Eisner questioned in his talk whether linguistic structures and categories actually exist or whether “scientist just like to organize data into piles” given that a linguistics-free approach works surprisingly well for MT. He finds that even “arbitrarily defined” categories such as the difference between the /b/ and /p/ phonemes can become hardened and accrue meaning. However, neural models are pretty good sponges to soak up whatever isn’t modeled explicitly.

He outlines four common ways to introduce linguistic information into models: a) via a pipeline-based approach, where linguistic categories are used as features; b) via data augmentation, where the data is augmented with linguistic categories; c) via multi-task learning; d) via structured modeling such as using a transition-based parser, a recurrent neural network grammar, or even classes that depend on each other such as BIO notation.

In her talk at the workshop, Emily Bender questioned the premise of linguistics-free learning altogether: Even if you had a huge corpus in a language that you knew nothing about, without any other priors, e.g. what function words are, you would not be able to learn sentence structure or meaning. She also pointedly called out many ML papers that describe their approach as similar to how babies learn, without citing any actual developmental psychology or language acquisition literature. Babies in fact learn in situated, joint, emotional context, which carries a lot of signal and meaning.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned. He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes.

On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year’s recipient of the ACL lifetime achievement award, Mark Steedman. He asked “LSTMs work in practice, but can they work in theory?”.

Adversarial examples

A theme that is closely interlinked with gaining a better understanding of the limitations of state-of-the-art models is to propose ways how they can be improved. In particular, similar to adversarial example paper mentioned above, several papers tried to make models more robust to adversarial examples:

Cheng et al.  propose to make both the encoder and decoder in NMT models more robust against input perturbations.
Ebrahimi et al. propose white-box adversarial examples to trick a character-level neural classifier by swapping few tokens.
Ribeiro et al. improve upon the previous method with semantic-preserving perturbations that induce changes in the model’s predictions, which they generalize to rules that induce adversaries on many instances.
Bose et al. incorporate adversarial examples into noise contrastive estimation using an adversarially learned sampler. The sampler finds harder negative examples, which forces the model to learn better representations.

Learning robust and fair representations

Tim Baldwin discussed different ways to make models more robust to a domain shift during his talk at the RepL4NLP workshop. The slides can be found here. For using a single source domain, he discussed a method to linguistically perturb training instances based on different types of syntactic and semantic noise. In the setting with multiple source domains, he proposed to train an adversarial model on the source domains. Finally, he discussed a method that allows to learn robust and privacy-preserving text representations.

Margaret Mitchell focused on fair and privacy-preserving representations during her talk at the workshop. In particular, she highlighted the difference between a descriptive and a normative view of the world. ML models learn representations that reflect a descriptive view of the data they’re trained on. The data represents “the world as people talk about it”. Research in fairness conversely seeks to create representations that reflect a normative view of the world, which captures our values and seeks to instill them in the representations.

Improving evaluation methodology

Besides making models more robust, several papers sought to improve the way we evaluate our models:

Finegan-Dollak et al. identify limitations and propose improvements to current evaluations of text-to-SQL systems. They show that the current train-test split and practice of anonymization of variables are flawed and release standardized and improved versions of seven datasets to mitigate these.
Dror et al. focus on a practice that should be commonplace, but is often not done or done poorly: statistical significance testing. In particular, they survey recent empirical papers in ACL and TACL 2017 finding that statistical significance testing is often ignored or misused and propose a simple protocol for statistical significance test selection for NLP tasks.
Chaganty et al. investigate the bias of automatic metrics such as BLEU and ROUGE and find that even an unbiased estimator only achieves a comparatively low error reduction. This highlights the need to improve both the correlation of automatic metric as well as reduce the variance of human annotation.

Strong baselines

Another way to improve model evaluation is to compare new models against stronger baselines, in order to make sure that improvements are actually significant. Some papers focused on this line of research:

Shen et al. systematically compare simple word embedding-based methods with pooling to more complex models such as LSTMs and CNNs. They find that for most datasets, word embedding-based methods exhibit competitive or even superior performance.
Ethayarajh proposes a strong baseline for sentence embedding models at the RepL4NLP workshop.
In a similar vein, Ruder and Plank find that classic bootstrapping algorithms such as tri-training make for strong baselines for semi-supervised learning and even outperform recent state-of-the-art methods.

In the above paper, we also emphasize the importance of evaluating in more challenging settings, such as on out-of-distribution data and on different tasks. Our findings would have been different if we had just focused on a single task or only on in-domain data. We need to test our models under such adverse conditions to get a better sense of their robustness and how well they can actually generalize.

Creating harder datasets

In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that “SQuAD is the MNIST of NLP”. Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don’t work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference:

QAngaroo focuses on reading comprehension that requires to gather several pieces of information via multiple steps of inference.
NarrativeQA requires understand of an underlying narrative by asking the reader to answer questions about stories by reading entire books or movie scripts.

Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

Evaluation on multiple and low-resource languages

Another facet of this is to evaluate our models on multiple languages. Emily Bender surveyed 50 NAACL 2018 papers in her talk mentioned above and found that 42 papers evaluate on an unnamed mystery language (i.e. English). She emphasizes that it is important to name the language you work on as languages have different linguistic structures; not mentioning the language obfuscates this fact.

If our methods are designed to be cross-lingual, then we should additionally evaluate them on the more challenging setting of low-resource languages. For instance, both of the following two papers observe that current methods for unsupervised bilingual dictionary methods fail if the target language is dissimilar to language such as with Estonian or Finnish:

Søgaard et al. probe the limitations of current methods further and highlight that such methods also fail when embeddings are trained on different domains or using different algorithms. They finally propose a metric to quantify the potential of such methods.
Artetxe et al. propose a new unsupervised self-training method that employs a better initialization to steer the optimization process and is particularly robust for dissimilar language pairs.

Several other papers also evaluate their approaches on low-resource languages:

Dror et al. propose to use orthographic features for bilingual lexicon induction. Though these mostly help for related languages, they also evaluate on the dissimilar language pair English-Finnish.
Ren et al. finally propose to leverage another rich language for translation into a resource-poor language . They find that their model significantly improves the translation quality of rare languages.
Currey and Heafield propose an unsupervised tree-to-sequence model for NMT by adapting the Gumbel tree-LSTM. Their model proves particularly useful for low-resource languages.

Another theme during the conference for me was that the field is visibly making progress. Marti Hearst, president of the ACL, echoed this sentiment during her presidential address. She used to demonstrate what our models can and can’t do using the example of Stanley Kubrick’s HAL 9000 (seen below). In recent years, this has become a less useful exercise as our models have learned to perform tasks that seemed previously decades away such as recognizing and producing human speech or lipreading[1]. Naturally, we are still far away from tasks that require deep language understanding and reasoning such as having an argument; nevertheless, this progress is remarkable.

Marti also paraphrased NLP and IR pioneer Karen Spärck Jones saying that research is not going around in circles, but climbing a spiral or—maybe more fittingly—different staircases that are not necessarily linked but go in the same direction. She also expressed a sentiment that seems to resonate with a lot of people: In the 1980s and 90s, with only a few papers to read, it was definitely easier to keep track of the state of the art. To make this easier, I’ve recently created a document to collect the state of the art across different NLP tasks.

With the community growing, she encouraged people to participate and volunteer and announced an ACL Distinguished Service Award for the most dedicated members. ACL 2018 also saw the launch (after EACL in 1982 and NAACL in 2000) of its third chapter, AACL, the Asia-Pacific Chapter of the Association for Computational Linguistics.

The business meeting during the conference focused on measures to address a particular challenge of the growing community: the escalating number of submissions and the need for more reviewers. We can expect to see new efforts to deal with the large number of submissions at the conferences next year.

Back in 2016, it seemed as though reinforcement learning (RL) was finding its footing in NLP and being applied to more and more tasks. These days, it seems that the dynamic nature of RL makes it most useful for tasks that intrinsically have some temporal dependency such as selecting data during training[1][1] and modelling dialogue, while supervised learning seems to be better suited for most other tasks. Another important application of RL is to optimize the end metric such as ROUGE or BLEU directly instead of optimizing a surrogate loss such as cross-entropy. Successful applications of this are summarization[1][1] and machine translation[1].

Inverse reinforcement learning can be valuable in settings where the reward is too complex to be specified. A successful application of this is visual storytelling[1]. RL is particularly promising for sequential decision making problems in NLP such as playing text-based games, navigating webpages, and completing tasks. The Deep Reinforcement Learning for NLP tutorial provided a comprehensive overview of the space.

There were other great tutorials as well. I particularly enjoyed the Variational Inference and Deep Generative Models tutorial. The tutorials on Semantic Parsing and about “100 things you always wanted to know about semantics & pragmatics” also seemed really worthwhile. A complete list of the tutorials can be found here.

Cover image: View from the conference venue.

Thanks to Isabelle Augenstein for some paper suggestions.