In a previous tutorial series I went over some of the theory behind Recurrent Neural Networks (RNNs) and the implementation of a simple RNN from scratch. That’s a useful exercise, but in practice we use libraries like Tensorflow with high-level primitives for dealing with RNNs.
With that using an RNN should be as easy as calling a function, right? Unfortunately that’s not quite the case. In this post I want to go over some of the best practices for working with RNNs in Tensorflow, especially the functionality that isn’t well documented on the official site.
The post comes with a Github repository that contains Jupyter notebooks with minimal examples for:
Preprocessing Data: Use tf.SequenceExample
Check out the tf.SequenceExample Jupyter Notebook here!
RNNs are used for sequential data that has inputs and/or outputs at multiple time steps. Tensorflow comes with a protocol buffer definition to deal with such data: tf.SequenceExample.
You can load data directly from your Python/Numpy arrays, but it’s probably in your best interest to use tf.SequenceExample
instead. This data structure consists of a “context” for non-sequential features and “feature_lists” for sequential features. It’s somewhat verbose (it blew up my latest dataset by 10x), but it comes with a few benefits that are worth it:
-
Easy distributed training. Split up data into multiple TFRecord files, each containing many SequenceExamples, and use Tensorflow’s built-in support for distributed training.
-
Reusability. Other people can re-use your model by bringing their own data into
tf.SequenceExample
format. No model code changes required. -
Use of Tensorflow data loading pipelines functions like
tf.parse_single_sequence_example
. Libraries like tf.learn also come with convenience function that expect you to feed data in protocol buffer format. -
Separation of data preprocessing and model code. Using
tf.SequenceExample
forces you to separate your data preprocessing and Tensorflow model code. This is good practice, as your model shouldn’t make any assumptions about the input data it gets.
In practice, you write a little script that converts your data into tf.SequenceExample
format and then writes one or more TFRecord files. These TFRecord files are parsed by Tensorflow to become the input to your model:
-
Convert your data into
tf.SequenceExample
format -
Write one or more TFRecord files with the serialized data
-
Use
tf.TFRecordReader
to read examples from the file -
Parse each example using
tf.parse_single_sequence_example
(Not in the official docs yet)
And, to parse an example:
Batching and Padding Data
Check out the Jupyer Notebook on Batching and Padding here!
Tensorflow’s RNN functions expect a tensor of shape [B, T, ...]
as input, where B
is the batch size and T
is the length in time of each input (e.g. the number of words in a sentence). The last dimensions depend on your data. Do you see a problem with this? Typically, not all sequences in a single batch are of the same length T, but in order to feed them into the RNN they must be. Usually that’s done by padding them: Appending 0
‘s to examples to make them equal in length.
Now imagine that one of your sequences is of length 1000, but the average length is 20. If you pad all your examples to length 1000 that would be a huge waste of space (and computation time)! That’s where batch padding comes in. If you create batches of size 32, you only need to pad examples within the batch to the same length (the maximum length of examples in that batch). That way, a really long example will only affect a single batch, not all of your data.
That all sounds pretty messy to deal with. Luckily, Tensorflow has built-in support for batch padding. If you set dynamic_pad=True
when calling tf.train.batch
the returned batch will be automatically padded with 0s. Handy! A lower-level option is to use tf.PaddingFIFOQueue.
Side note: Be careful with 0’s in your vocabulary/classes
If you have a classification problem and your input tensors contain class IDs (0, 1, 2, …) then you need to be careful with padding. Because you are padding tensos with 0’s you may not be able to tell the difference between 0-padding and “class 0”. Whether or not this can be a problem depends on what your model does, but if you want to be safe it’s a good idea to not use “class 0” and instead start with “class 1”. (An example of where this may become a problem is in masking the loss function. More on that later).
This gives:
And, the same with PaddingFIFOQueue
:
rnn vs. dynamic_rnn Check out the dynamic_rnn Jupyter Notebook here!
You may have noticed that Tensorflow contains two different functions for RNNs: tf.nn.rnn
and tf.nn.dynamic_rnn
. Which one to use?
Internally, tf.nn.rnn
creates an unrolled graph for a fixed RNN length. That means, if you call tf.nn.rnn
with inputs having 200 time steps you are creating a static graph with 200 RNN steps. First, graph creation is slow. Second, you’re unable to pass in longer sequences (> 200) than you’ve originally specified.
tf.nn.dynamic_rnn
solves this. It uses a tf.While
loop to dynamically construct the graph when it is executed. That means graph creation is faster and you can feed batches of variable size. What about performance? You may think the static rnn is faster than its dynamic counterpart because it pre-builds the graph. In my experience that’s not the case.
In short, just use tf.nn.dynamic_rnn. There is no benefit to tf.nn.rnn
and I wouldn’t be surprised if it was deprecated in the future.
Passing sequence_length to your RNN
Check out the Jupyter Notebook on Bidirectional RNNs here!
When using any of Tensorflow’s rnn functions with padded inputs it is important to pass the sequence_length
parameter. In my opinion this parameter should be required, not optional. sequence_length
serves two purposes: 1. Save computational time and 2. Ensure Correctness.
Let’s say you have a batch of two examples, one is of length 13, and the other of length 20. Each one is a vector of 128 numbers. The length 13 example is 0-padded to length 20. Then your RNN input tensor is of shape [2, 20, 128]
. The dynamic_rnn function returns a tuple of (outputs, state), where outputs
is a tensor of size [2, 20, ...]
with the last dimension being the RNN output at each time step. state
is the last state for each example, and it’s a tensor of size [2, ...]
where the last dimension also depends on what kind of RNN cell you’re using.
So, here’s the problem: Once your reach time step 13, your first example in the batch is already “done” and you don’t want to perform any additional calculation on it. The second example isn’t and must go through the RNN until step 20. By passing sequence_length=[13,20]
you tell Tensorflow to stop calculations for example 1 at step 13 and simply copy the state from time step 13 to the end. The output will be set to 0 for all time steps past 13. You’ve just saved some computational cost. But more importantly, if you didn’t pass sequence_length
you would get incorrect results! Without passing sequence_length
, Tensorflow will continue calculating the state until T=20 instead of simply copying the state from T=13. This means you would calculate the state using the padded elements, which is not what you want.
Bidirectional RNNs
When using a standard RNN to make predictions we are only taking the “past” into account. For certain tasks this makes sense (e.g. predicting the next word), but for some tasks it would be useful to take both the past and the future into account. Think of a tagging task, like part-of-speech tagging, where we want to assign a tag to each word in a sentence. Here we already know the full sequence of words, and for each word we want to take not only the words to the left (past) but also the words to the right (future) into account when making a prediction. Bidirectional RNNs do exactly that. A bidirectional RNN is a combination of two RNNs – one runs forward from “left to right” and one runs backward from “right to left”. These are commonly used for tagging tasks, or when we want to embed a sequence into a fixed-length vector (beyond the scope of this post).
Just like for standard RNNs, Tensorflow has static and dynamic versions of the bidirectional RNN. As of the time of this writing, the bidirectional_dynamic_rnn
is still undocumented, but it’s preferred over the static bidirectional_rnn
.
The key differences of the bidirectional RNN functions are that they take a separate cell argument for both the forward and backward RNN, and that they return separate outputs and states for both the forward and backward RNN:
RNN Cells, Wrappers and Multi-Layer RNNs
Check out the Jupyter Notebook on RNN Cells here!
All Tensorflow RNN functions take a cell
argument. LSTMs and GRUs are the most commonly used cells, but there are many others, and not all of them are documented. Currently, the best way to get a sense of what cells are available is to look at at rnn_cell.py and contrib/rnn_cell.
As of the time of this writing, the basic RNN cells and wrappers are:
-
BasicRNNCell
– A vanilla RNN cell. -
GRUCell
– A Gated Recurrent Unit cell. -
BasicLSTMCell
– An LSTM cell based on Recurrent Neural Network Regularization. No peephole connection or cell clipping. -
LSTMCell
– A more complex LSTM cell that allows for optional peephole connections and cell clipping. -
MultiRNNCell
– A wrapper to combine multiple cells into a multi-layer cell. -
DropoutWrapper
– A wrapper to add dropout to input and/or output connections of a cell.
and the contributed RNN cells and wrappers:
Using these wrappers and cells is simple, e.g.
Calculating sequence loss on padded examples Check out the Jupyter Notebook on Calculating Loss here!
For sequence prediction tasks we often want to make a prediction at each time step. For example, in Language Modeling we try to predict the next word for each word in a sentence. If all of your sequences are of the same length you can use Tensorflow’s sequence_loss and sequence_loss_by_example functions (undocumented) to calculate the standard cross-entropy loss.
However, as of the time of this writing sequence_loss
does not support variable-length sequences (like the ones you get from a dynamic_rnn). Naively calculating the loss at each time step doesn’t work because that would take into account the padded positions. The solution is to create a weight matrix that “masks out” the losses at padded positions.
Here you can see why 0-padding can be a problem when you also have a “0-class”. If that’s the case you cannotuse tf.sign(tf.to_float(y))
to create a mask because that would mask out the “0-class” as well. You can still create a mask using the sequence length information, it just becomes more complicated.