When (not) to use Deep Learning for NLP

We are preparing for the second edition of our PyTorch-based Deep Learning for NLP training. It’s a two-day affair, crammed with a lot of learning and hands-on model building, where we get to play the intricate dance of introducing the topics from the ground up while still making sure folks are not far from the state-of-the-art. Compared to our first attempt in NYC this year, we are adding new content and reworking existing content to explain some of the basic ideas better. One subtopic I am quite excited to add is a discussion of “When to use Deep Learning for NLP and when not to”. This post expands on that.

When to use Deep Learning for NLP? If you’ve attended any of the *CL conferences lately, the answer might seem to be a resounding “ALL THE TIME”! There is some truth to this. Deep Learning approaches to NLP have come a long way since 2014, and so have the software and hardware needed to run them, to the point that it is feasible to successfully apply DL approaches to many complex NLP problems. One emerging theme of 2017 is that, as a community, we have (relatively) more understanding of what deep models do with language, or at least more attempts at understanding it. I see more papers showing that a certain natural language hypothesis holds or does not hold under model X than papers simply saying model X does better than model Y on task Z. So in a way, an overview of the recent literature gives a good picture of when to use Deep Learning for NLP and how. So, in this post, I will focus on the second question: *When not to use Deep Learning for NLP?* This is written mostly from a practitioner’s point of view, i.e., someone who cares not just about building useful models but also about putting them to use in real life.

Scenario 1: When a simpler solution exists

This is the DL version of “If a regex solves a problem, don’t model it.” If simpler baselines work, implement and deploy them first. Will a choice of simple input representations (TF, TF-IDF, etc.) coupled with simple learners like Naive Bayes, Logistic Regression, or SVMs give you a usable system? If yes, do that first.
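To make that concrete, here is a minimal sketch of such a baseline, assuming scikit-learn and a few made-up sentiment labels (none of this is from the training material):

```python
# A minimal baseline sketch (assumes scikit-learn and toy sentiment data,
# purely illustrative): TF-IDF features fed into a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great service", "terrible support", "loved it", "awful experience"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

# Vectorize with TF-IDF (unigrams + bigrams), then fit a logistic regression.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(train_texts, train_labels)

print(baseline.predict(["the support was great"]))
```

If a pipeline like this gets you within striking distance of your target metrics, the deep model can wait.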

Scenario 2: When you are required to be cost sensitive while building solutions.

Some of the costs in building NLP systems include annotation costs, training costs, and system maintenance costs. While it is possible to throw a lot of resources at a problem to get more labeled data (thereby driving up your annotation costs), some of the most innovative work and breakthroughs happen when you are resource-constrained. If you are leading a startup research group or, more likely, a small research/development team in a larger business trying to adopt NLP approaches in your work, your resources are finite, and using them judiciously will determine whether your product sees the light of day. Similarly, for most non-trivial models on production data, the training costs associated with deep learning are not negligible either. If another alternative achieves similar metrics, insisting on a deep model becomes more a matter of dogma. Regarding the last point, system maintenance costs, something curious is happening here. It is no longer the case that building, training, and deploying deep learning models require more software engineering resources than other approaches. This is primarily due to the large community interest in DL leading to better tools for model building with DL.

Scenario 3: When you are working at the wrong level of problem abstraction (or when “End-to-End” is not glamorous)

This is probably the most important of the list and a bit nuanced, so it requires some explanation. Deep Learning (via the computation graph abstraction) gives an illusion of “general-purposeness” of the technique. If you can represent the input as a vector and the output as a vector, then, using the right combination of primitives (layers) in between, you can “transform” any input to any output, right? Unfortunately, the answer is “it’s complicated”. To illustrate what I mean by “working at the wrong level of problem abstraction”, consider sorting. Say you have sequences of numbers (X) and their sorted counterparts (Y). You could try modeling an end-to-end sort-learner as a sequence-to-sequence problem. This is exactly the wrong thing to do. The sorting problem decomposes into a high-level structural component (something you learn in algorithms textbooks) and a comparator that “knows” a rank order. Having this insight about the problem lets you work at the right level of abstraction: learning an optimal comparator function from the data, which is a far less expensive learning-to-rank problem rather than a sequence modeling one (a toy sketch of this follows at the end of this scenario).

Parallels to this exist in NLP tasks too. For example, treating the dialog problem (chatbots!) as a sequence-to-sequence problem incurs high inefficiencies by not taking advantage of regularities in whatever task the dialog serves. A dialog engine could instead be built using insights from the task and the nature of language to learn patterns of semantic and pragmatic content. This could be as simple as modeling the relationship between utterances and the underlying task state, or as complex as modeling the dialog structure and classifying each utterance by its rhetorical relation. In either case, blindly using a sequence-to-sequence model conflates the variance in language due to word choice with the variance due to task semantics.
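Coming back to the sorting example, here is a toy sketch, assuming PyTorch and synthetic numeric data (nothing here comes from the training material): learn only the comparator, and hand the structural part back to an ordinary sorting routine.

```python
# A toy sketch: learn a pairwise comparator from data instead of an
# end-to-end sequence-to-sequence "sort-learner". Assumes PyTorch and
# synthetic data; illustrative only.
import torch
import torch.nn as nn
from functools import cmp_to_key

# Pairwise comparator: given (a, b), predict whether a should precede b.
comparator = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(comparator.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(2000):
    a, b = torch.rand(256, 1), torch.rand(256, 1)
    target = (a < b).float()  # 1 if a precedes b in ascending order
    loss = loss_fn(comparator(torch.cat([a, b], dim=1)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Plug the learned comparator into an ordinary sorting algorithm.
def learned_cmp(x, y):
    p = torch.sigmoid(comparator(torch.tensor([[x, y]]))).item()
    return -1 if p > 0.5 else 1

print(sorted([0.9, 0.1, 0.5, 0.3], key=cmp_to_key(learned_cmp)))
```

The learned model’s job shrinks to a two-input decision, while the algorithmic scaffolding we already understand stays fixed.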

The above three scenarios are not exhaustive, but cover most situations in my experience.

(Credits: Brian McMahan for contributing to this discussion)