Task
In this post, we’ll use Tensorflow to construct an RNN that operates on input sequences of variable lengths. We’ll use this RNN to classify bloggers by age bracket and gender using sentence-long writing samples. One time step will represent a single word, with the complete input sequence representing a single sentence. The challenge is to build a model that can classify multiple sentences of different lengths at the same time.
Other tutorials on variable length sequences
There are a couple of other tutorials on this topic. For example, the official Tensorflow seq2seq tutorial model accommodates variable length sequences. That official model, however, is a bit advanced for a first exposure and a little too specialized to be easily portable to other contexts. Danijar Hafner has written a more approachable guide here, which I recommend. In contrast to Danijar’s post, this post is written in a linear IPython notebook style to make it easy for you to follow along step by step. This post also includes a section on bucketing, a technique that can significantly improve your model’s training time.
Data
The data for this post is sourced from the “Blog Authorship Corpus”, available here. The original dataset was tokenized and split into sentences using spacy. Sentences with fewer than 5 tokens or more than 30 tokens were discarded. Number-like tokens were replaced by “<#>”. Tokens outside the 9999 most common were replaced by “<UNK>”, giving a 10000-token vocabulary. Each sentence was tagged with the author’s gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s) and placed into a pandas dataframe. The modified data and the code to import it can be found here.
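For reference, the preprocessing amounts to roughly the following sketch; it is illustrative only, and the actual blogs_data code, number detection, and vocabulary handling may differ:

import spacy
from collections import Counter

nlp = spacy.load('en')

def preprocess(raw_text, vocab_size=10000):
    # Tokenize and split into sentences with spacy, replacing number-like tokens with <#>
    sents = [['<#>' if tok.like_num else tok.text for tok in sent]
             for sent in nlp(raw_text).sents]
    # Discard sentences with fewer than 5 or more than 30 tokens
    sents = [s for s in sents if 5 <= len(s) <= 30]
    # Keep the 9999 most common tokens; everything else becomes <UNK>
    counts = Counter(tok for s in sents for tok in s)
    keep = set(tok for tok, _ in counts.most_common(vocab_size - 1))
    return [[tok if tok in keep else '<UNK>' for tok in s] for s in sents]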
Below is the head of the dataframe (tokens in “string” are delimited by spaces):
import numpy as np
import tensorflow as tf
import blogs_data # available at the link above

df = blogs_data.loadBlogs().sample(frac=1).reset_index(drop=True)
vocab, reverse_vocab = blogs_data.loadVocab()
train_len, test_len = np.floor(len(df)*0.8), np.floor(len(df)*0.2)
train, test = df.ix[:train_len-1], df.ix[train_len:train_len + test_len]
df = None
train.head()
class SimpleDataIterator():
    def __init__(self, df):
        self.df = df
        self.size = len(self.df)
        self.epochs = 0
        self.shuffle()

    def shuffle(self):
        self.df = self.df.sample(frac=1).reset_index(drop=True)
        self.cursor = 0

    def next_batch(self, n):
        if self.cursor + n - 1 > self.size:
            self.epochs += 1
            self.shuffle()
        res = self.df.ix[self.cursor:self.cursor+n-1]
        self.cursor += n
        return res['as_numbers'], res['gender']*3 + res['age_bracket'], res['length']
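A quick look at what the iterator returns; a call along these lines (with the names defined above) presumably produced the output shown below:

d = SimpleDataIterator(train)
data = d.next_batch(3)
print('Input sequences\n', data[0], end='\n\n')
print('Target values\n', data[1], end='\n\n')
print('Sequence lengths\n', data[2])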
Input sequences
0    [27, 3, 576, 146, 13, 204, 37, 150, 6, 804, 94…
1    [10, 210, 30, 1554, 10, 22, 325, 6240, 11, 4, …
2    [2927, 78, 9324, 5, 2273, 4, 5937, 8, 1058, 4,…

Target values
0    4
1    4
2    1

Sequence lengths
0    13
1    18
2    22
class PaddedDataIterator(SimpleDataIterator):
    def next_batch(self, n):
        if self.cursor + n > self.size:
            self.epochs += 1
            self.shuffle()
        res = self.df.ix[self.cursor:self.cursor+n-1]
        self.cursor += n

        # Pad sequences with 0s so they are all the same length
        maxlen = max(res['length'])
        x = np.zeros([n, maxlen], dtype=np.int32)
        for i, x_i in enumerate(x):
            x_i[:res['length'].values[i]] = res['as_numbers'].values[i]

        return x, res['gender']*3 + res['age_bracket'], res['length']
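And the padded version; a similar call presumably produced the zero-padded batch shown below:

d = PaddedDataIterator(train)
data = d.next_batch(3)
print('Input sequences\n', data[0])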
Input sequences
[[  34   90    5  470   16   19   16    7  159    2    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0]
 [  82    1  109    7  377    8  421    8    0   33  124    3   69  180
    17   90    5  133   16   19   33   34   12 3819   85  164  129   25]
 [1786 5570    1   13 7817  235   60 6168   19    2    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0]]
def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()

def build_graph(
    vocab_size = len(vocab),
    state_size = 64,
    batch_size = 256,
    num_classes = 6):

    reset_graph()

    # Placeholders
    x = tf.placeholder(tf.int32, [batch_size, None]) # [batch_size, num_steps]
    seqlen = tf.placeholder(tf.int32, [batch_size])
    y = tf.placeholder(tf.int32, [batch_size])
    keep_prob = tf.constant(1.0)

    # Embedding layer
    embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size])
    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    # RNN
    cell = tf.nn.rnn_cell.GRUCell(state_size)
    init_state = tf.get_variable('init_state', [1, state_size],
                                 initializer=tf.constant_initializer(0.0))
    init_state = tf.tile(init_state, [batch_size, 1])
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs,
                                                 sequence_length=seqlen,
                                                 initial_state=init_state)

    # Add dropout, as the model otherwise quickly overfits
    rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)

    """
    Obtain the last relevant output. The best approach in the future will be to use:

        last_rnn_output = tf.gather_nd(rnn_outputs, tf.pack([tf.range(batch_size), seqlen-1], axis=1))

    which is the Tensorflow equivalent of numpy's rnn_outputs[range(30), seqlen-1, :], but the
    gradient for this op has not been implemented as of this writing.

    The below solution works, but throws a UserWarning re: the gradient.
    """
    idx = tf.range(batch_size)*tf.shape(rnn_outputs)[1] + (seqlen - 1)
    last_rnn_output = tf.gather(tf.reshape(rnn_outputs, [-1, state_size]), idx)

    # Softmax layer
    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(last_rnn_output, W) + b
    preds = tf.nn.softmax(logits)
    correct = tf.equal(tf.cast(tf.argmax(preds, 1), tf.int32), y)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y))
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

    return {
        'x': x,
        'seqlen': seqlen,
        'y': y,
        'dropout': keep_prob,
        'loss': loss,
        'ts': train_step,
        'preds': preds,
        'accuracy': accuracy
    }
def train_graph(g, batch_size = 256, num_epochs = 10, iterator = PaddedDataIterator):
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        tr = iterator(train)
        te = iterator(test)

        step, accuracy = 0, 0
        tr_losses, te_losses = [], []
        current_epoch = 0
        while current_epoch < num_epochs:
            step += 1
            batch = tr.next_batch(batch_size)
            feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2], g['dropout']: 0.6}
            accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
            accuracy += accuracy_

            if tr.epochs > current_epoch:
                current_epoch += 1
                tr_losses.append(accuracy / step)
                step, accuracy = 0, 0

                # eval test set
                te_epoch = te.epochs
                while te.epochs == te_epoch:
                    step += 1
                    batch = te.next_batch(batch_size)
                    feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2]}
                    accuracy_ = sess.run([g['accuracy']], feed_dict=feed)[0]
                    accuracy += accuracy_

                te_losses.append(accuracy / step)
                step, accuracy = 0, 0
                print("Accuracy after epoch", current_epoch, " - tr:", tr_losses[-1], "- te:", te_losses[-1])

    return tr_losses, te_losses
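The training run that produced the accuracies below was presumably kicked off along these lines (the defaults give 10 epochs with the PaddedDataIterator):

g = build_graph()
tr_losses, te_losses = train_graph(g)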
Accuracy after epoch 1  - tr: 0.319347791963 - te: 0.351068906904
Accuracy after epoch 2  - tr: 0.355731238225 - te: 0.357366258375
Accuracy after epoch 3  - tr: 0.361505161451 - te: 0.358625811348
Accuracy after epoch 4  - tr: 0.363629598859 - te: 0.359358642169
Accuracy after epoch 5  - tr: 0.365078599278 - te: 0.358609453518
Accuracy after epoch 6  - tr: 0.365907767689 - te: 0.359358642169
Accuracy after epoch 7  - tr: 0.367192406322 - te: 0.359833019263
Accuracy after epoch 8  - tr: 0.368336397059 - te: 0.360304124791
Accuracy after epoch 9  - tr: 0.369028188455 - te: 0.360434987437
Accuracy after epoch 10  - tr: 0.37021715381 - te: 0.36041535804
tr = PaddedDataIterator(train)
padding = 0
for i in range(100):
    lengths = tr.next_batch(256)[2].values
    max_len = max(lengths)
    padding += np.sum(max_len - lengths)
print("Average padding / batch:", padding/100)
Average padding / batch: 3279.9
class BucketedDataIterator():
    def __init__(self, df, num_buckets = 5):
        df = df.sort_values('length').reset_index(drop=True)
        self.size = len(df) / num_buckets
        self.dfs = []
        for bucket in range(num_buckets):
            self.dfs.append(df.ix[bucket*self.size: (bucket+1)*self.size - 1])
        self.num_buckets = num_buckets

        # cursor[i] will be the cursor for the ith bucket
        self.cursor = np.array([0] * num_buckets)
        self.shuffle()

        self.epochs = 0

    def shuffle(self):
        # sorts dataframe by sequence length, but keeps it random within the same length
        for i in range(self.num_buckets):
            self.dfs[i] = self.dfs[i].sample(frac=1).reset_index(drop=True)
            self.cursor[i] = 0

    def next_batch(self, n):
        if np.any(self.cursor + n + 1 > self.size):
            self.epochs += 1
            self.shuffle()

        i = np.random.randint(0, self.num_buckets)

        res = self.dfs[i].ix[self.cursor[i]:self.cursor[i]+n-1]
        self.cursor[i] += n

        # Pad sequences with 0s so they are all the same length
        maxlen = max(res['length'])
        x = np.zeros([n, maxlen], dtype=np.int32)
        for i, x_i in enumerate(x):
            x_i[:res['length'].values[i]] = res['as_numbers'].values[i]

        return x, res['gender']*3 + res['age_bracket'], res['length']
tr = BucketedDataIterator(train, 5)
padding = 0
for i in range(100):
    lengths = tr.next_batch(256)[2].values
    max_len = max(lengths)
    padding += np.sum(max_len - lengths)
print("Average padding / batch:", padding/100)
Average padding / batch: 573.49
from time import time

g = build_graph()
t = time()
tr_losses, te_losses = train_graph(g, num_epochs=1, iterator=PaddedDataIterator)
print("Total time for 1 epoch with PaddedDataIterator:", time() - t)
g = build_graph()
t = time()
tr_losses, te_losses = train_graph(g, num_epochs=1, iterator=BucketedDataIterator)
print("Total time for 1 epoch with BucketedDataIterator:", time() - t)
Note how easy it was to move to a bucketed model: all we had to do was change our data generator. This was made possible by giving our input placeholder a partially-known shape, with the num_steps dimension left unknown. Contrast this with the more complicated approach in Tensorflow’s seq2seq tutorial, which builds a separate graph for each of four buckets.
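The key line is the placeholder definition from build_graph, repeated here; leaving the second (num_steps) dimension as None is what lets batches padded to any length flow through the same graph:

x = tf.placeholder(tf.int32, [batch_size, None]) # [batch_size, num_steps], num_steps unspecified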
A note on awkward sequence lengths
Suppose we had a dataset with awkward sequence lengths that made even a bucketed approach inefficient. For example, we might have lots of very short sequences of lengths 1, 2 and 3. Alternatively, we might have a few very long sequences among our shorter ones; we want to propagate the internal state forward through time for the long sequences, but don’t have enough of them to train efficiently in parallel. One solution in both of these scenarios is to combine short sequences into longer ones, but have the internal state of the RNN reset between each original sequence. I believe this is not possible with Tensorflow’s default RNN functions (e.g., dynamic_rnn), so if you’re looking for a way to do this, I would look into writing a custom RNN method using tf.scan. I show how to use tf.scan to build a custom RNN in my post, Recurrent Neural Networks in Tensorflow II. With the right accumulator function, you could program in the state resets dynamically, based either on a special PAD symbol or on an auxiliary input sequence that indicates where the state should be reset, as in the sketch below.
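To make the idea concrete, here is a minimal sketch (not from that post) of how such state resets might look with tf.scan; `reset_indicators` is an assumed auxiliary float32 input of shape [batch_size, num_steps, 1] marking where a new short sequence begins, and `cell`, `rnn_inputs`, `batch_size` and `state_size` are as in build_graph above:

def step(prev_state, elems):
    x_t, reset_t = elems                      # x_t: [batch_size, state_size], reset_t: [batch_size, 1]
    prev_state = prev_state * (1. - reset_t)  # zero the state wherever a new sequence begins
    _, new_state = cell(x_t, prev_state)
    return new_state

# tf.scan iterates over the leading dimension, so transpose to time-major first
inputs_tm = tf.transpose(rnn_inputs, [1, 0, 2])        # [num_steps, batch_size, state_size]
resets_tm = tf.transpose(reset_indicators, [1, 0, 2])  # [num_steps, batch_size, 1]
rnn_outputs = tf.scan(step, (inputs_tm, resets_tm),
                      initializer=tf.zeros([batch_size, state_size]))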
A basic model for sequence to sequence learning
Finally, we extend our sequence classification model to do sequence-to-sequence learning. We’ll use the same dataset, but instead of having our model guess the author’s age bracket and gender at the end of the sequence (i.e., only once), we’ll have it guess at every timestep.
The added wrinkle when moving to a sequence-to-sequence model is that we need to make sure that time steps holding a PAD symbol do not contribute to our loss, since they are only there as filler. We do so by zeroing out the loss at these time steps, which is known as applying a “mask” to, or “masking”, the loss. This is achieved by pointwise multiplying the loss tensor (in which each entry represents a time step) by a tensor of 1s and 0s, where 1s mark valid steps and 0s mark PAD steps. A similar modification is made to the “accuracy” calculation below, as noted in the comments.
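As a rough sketch of the masking step (the seq2seq graph itself is not reproduced here, and the original may construct the mask differently), with per-timestep logits of shape [batch_size, num_steps, num_classes] and labels y of shape [batch_size, num_steps]:

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y)         # [batch_size, num_steps]
mask = tf.cast(tf.sequence_mask(seqlen, tf.shape(logits)[1]), tf.float32)  # 1s at real steps, 0s at PAD steps
masked_losses = losses * mask
loss = tf.reduce_sum(masked_losses) / tf.reduce_sum(mask)

# The accuracy gets the same treatment: count correct predictions only at unmasked steps
correct = tf.cast(tf.equal(tf.cast(tf.argmax(logits, 2), tf.int32), y), tf.float32) * mask
accuracy = tf.reduce_sum(correct) / tf.reduce_sum(mask)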
g = build_seq2seq_graph()
tr_losses, te_losses = train_graph(g, iterator=BucketedDataIterator)
As expected, our sequence-to-sequence model achieves slightly worse accuracy than our sequence classification model, because its early guesses, made before the model has seen much of the sentence, are nearly random and drag the average accuracy down.
Conclusion
In this post, we covered four concepts, all related to building RNNs that work with variable length sequences. First, we learned how to pad input sequences so that we can feed in a single zero-padded input tensor. Second, we learned how to get the last relevant output in a sequence classification model. Third, we learned how to use bucketing to significantly cut down on training time. Finally, we learned how to “mask” our loss function so that we can train sequence-to-sequence models with variable length sequences.