This is the second in a series of posts about recurrent neural networks in Tensorflow. The first post lives here. In this post, we will build upon our vanilla RNN by learning how to use Tensorflow’s scan and dynamic_rnn functions, upgrading the RNN cell and stacking multiple RNNs, and adding dropout and layer normalization. We will then use our upgraded RNN to generate some text, character by character.
Note 3/14/2017: This tutorial is quite a bit deprecated by changes to the TF api. Leaving it up since it may still be useful, and most changes to the API are cosmetic (biggest change is that many of the RNN cells and functions are in the tf.contrib.rnn module). There was also a change to the ptb_iterator. A (slightly modified) copy of the old version which should work until I update this tutorial is uploaded here.
Recap of our model
In the last post, we built a very simple, no frills RNN that was quickly able to learn to solve the toy task we created for it.
Here is the formal statement of our model from last time:
(S_t = \text{tanh}(W(X_t \ @ \ S_{t-1}) + b_s))
(P_t = \text{softmax}(US_t + b_p))
where (@) represents vector concatenation, (X_t \in R^n) is an input vector, (W \in R^{d \times (n + d)}, \ b_s \in R^d, \ U \in R^{n \times d}), (b_p \in R^n), (n) is the size of the input and output vectors, and (d) is the size of the hidden state vector. At time step 0, (S_{-1}) (the initial state) is initialized as a vector of zeros.
Task and data
This time around we will be building a character-level language model to generate character sequences, a la Andrej Karpathy’s char-rnn (and see, e.g., a Tensorflow implementation by Sherjil Ozair here).
Why do something that’s already been done? Well, this is a much harder task than the toy model from last time: the model needs to handle long sequences and learn long-range time dependencies. That makes it a great task for learning how to add features to our RNN, and for seeing how each change affects the results as we go.
To start, let’s create our data generator. We’ll use the tiny-shakespeare corpus as our data, though we could use any plain text file. We’ll use all of the characters in the text file as our vocabulary, treating lowercase and capital letters as separate characters. In practice, there may be some advantage to forcing the network to use similar representations for capital and lowercase letters by using the same one-hot representation for each, plus a binary flag to indicate whether or not the letter is a capital. Additionally, it is likely a good idea to restrict the vocabulary (i.e., the set of characters) used, by replacing uncommon characters with an UNK token (like a square: □).
```python
"""
Load and process data, utility functions
"""
import os
import time
import urllib.request
import numpy as np
import tensorflow as tf

# `reader` supplies ptb_iterator; its exact location depends on your TF version
# (see the note at the top of this post about API changes).
from tensorflow.models.rnn.ptb import reader

file_url = 'https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt'
file_name = 'tinyshakespeare.txt'
if not os.path.exists(file_name):
    urllib.request.urlretrieve(file_url, file_name)

with open(file_name, 'r') as f:
    raw_data = f.read()
    print("Data length:", len(raw_data))

vocab = set(raw_data)
vocab_size = len(vocab)
idx_to_vocab = dict(enumerate(vocab))
vocab_to_idx = dict(zip(idx_to_vocab.values(), idx_to_vocab.keys()))

data = [vocab_to_idx[c] for c in raw_data]
del raw_data

def gen_epochs(n, num_steps, batch_size):
    for i in range(n):
        yield reader.ptb_iterator(data, batch_size, num_steps)

def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()

def train_network(g, num_epochs, num_steps = 200, batch_size = 32, verbose = True, save=False):
    tf.set_random_seed(2345)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        training_losses = []
        for idx, epoch in enumerate(gen_epochs(num_epochs, num_steps, batch_size)):
            training_loss = 0
            steps = 0
            training_state = None
            for X, Y in epoch:
                steps += 1
                feed_dict = {g['x']: X, g['y']: Y}
                if training_state is not None:
                    feed_dict[g['init_state']] = training_state
                training_loss_, training_state, _ = sess.run([g['total_loss'],
                                                              g['final_state'],
                                                              g['train_step']],
                                                             feed_dict)
                training_loss += training_loss_
            if verbose:
                print("Average training loss for Epoch", idx, ":", training_loss/steps)
            training_losses.append(training_loss/steps)

        if isinstance(save, str):
            g['saver'].save(sess, save)

    return training_losses
```
Data length: 1115394
```python
def build_basic_rnn_graph_with_list(
    state_size = 100,
    num_classes = vocab_size,
    batch_size = 32,
    num_steps = 200,
    learning_rate = 1e-4):

    reset_graph()

    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')

    x_one_hot = tf.one_hot(x, num_classes)
    rnn_inputs = [tf.squeeze(i, squeeze_dims=[1]) for i in tf.split(1, num_steps, x_one_hot)]

    cell = tf.nn.rnn_cell.BasicRNNCell(state_size)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state = tf.nn.rnn(cell, rnn_inputs, initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
    logits = [tf.matmul(rnn_output, W) + b for rnn_output in rnn_outputs]

    y_as_list = [tf.squeeze(i, squeeze_dims=[1]) for i in tf.split(1, num_steps, y)]

    loss_weights = [tf.ones([batch_size]) for i in range(num_steps)]
    losses = tf.nn.seq2seq.sequence_loss_by_example(logits, y_as_list, loss_weights)
    total_loss = tf.reduce_mean(losses)
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

    return dict(
        x = x,
        y = y,
        init_state = init_state,
        final_state = final_state,
        total_loss = total_loss,
        train_step = train_step
    )
```
It took over 5 seconds to build the graph of the most basic RNN model! This could be bad… what happens when we move up to a 3-layer LSTM?
Below, we switch out the RNN cell for a Multi-layer LSTM cell. We’ll go over the details of how to do this in the next section.
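The full build_multilayer_lstm_graph_with_list function isn’t reproduced here; as a rough sketch (assuming num_layers = 3, to match the dynamic_rnn version below), the key change is the cell:

```python
# Sketch: swap the BasicRNNCell for a stacked LSTM (num_layers assumed to be 3)
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
```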
```python
t = time.time()
build_multilayer_lstm_graph_with_list()
print("It took", time.time() - t, "seconds to build the graph.")
```
It took 25.640846967697144 seconds to build the graph.
```python
def build_multilayer_lstm_graph_with_dynamic_rnn(
    state_size = 100,
    num_classes = vocab_size,
    batch_size = 32,
    num_steps = 200,
    num_layers = 3,
    learning_rate = 1e-4):

    reset_graph()

    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')

    embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])

    # Note that our inputs are no longer a list, but a tensor of dims batch_size x num_steps x state_size
    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))

    # reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

    return dict(
        x = x,
        y = y,
        init_state = init_state,
        final_state = final_state,
        total_loss = total_loss,
        train_step = train_step
    )
```
Much better. One would think that pushing the graph construction to execution time would cause execution of the graph to go slower, but in this case, using dynamic_rnn actually speeds things up:
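The code for this comparison isn’t preserved in this copy of the post; presumably each graph was trained for 3 epochs along these lines (a sketch), first for the graph built with lists:

```python
g = build_multilayer_lstm_graph_with_list()
t = time.time()
train_network(g, 3)
print("It took", time.time() - t, "seconds to train for 3 epochs.")
```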
Average training loss for Epoch 0 : 3.53323210245
Average training loss for Epoch 1 : 3.31435756163
Average training loss for Epoch 2 : 3.21755325109
It took 117.78161263465881 seconds to train for 3 epochs.
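And the same run (again a sketch of the presumed invocation) for the dynamic_rnn graph:

```python
g = build_multilayer_lstm_graph_with_dynamic_rnn()
t = time.time()
train_network(g, 3)
print("It took", time.time() - t, "seconds to train for 3 epochs.")
```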
Average training loss for Epoch 0 : 3.55792756053
Average training loss for Epoch 1 : 3.3225021006
Average training loss for Epoch 2 : 3.28286816745
It took 96.69413661956787 seconds to train for 3 epochs.
We can get the same deferred unrolling by writing the loop ourselves with tf.scan:
```python
def build_multilayer_lstm_graph_with_scan(
    state_size = 100,
    num_classes = vocab_size,
    batch_size = 32,
    num_steps = 200,
    num_layers = 3,
    learning_rate = 1e-4):

    reset_graph()

    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')

    embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])

    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_states = \
        tf.scan(lambda a, x: cell(x, a[1]),
                tf.transpose(rnn_inputs, [1,0,2]),
                initializer=(tf.zeros([batch_size, state_size]), init_state))

    # there may be a better way to do this:
    final_state = tuple([tf.nn.rnn_cell.LSTMStateTuple(
        tf.squeeze(tf.slice(c, [num_steps-1,0,0], [1, batch_size, state_size])),
        tf.squeeze(tf.slice(h, [num_steps-1,0,0], [1, batch_size, state_size])))
        for c, h in final_states])

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))

    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(tf.transpose(y,[1,0]), [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

    return dict(
        x = x,
        y = y,
        init_state = init_state,
        final_state = final_state,
        total_loss = total_loss,
        train_step = train_step
    )
```
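Building and training the scan version the same way (a sketch of the presumed invocation):

```python
t = time.time()
g = build_multilayer_lstm_graph_with_scan()
print("It took", time.time() - t, "seconds to build the graph.")
t = time.time()
train_network(g, 3)
print("It took", time.time() - t, "seconds to train for 3 epochs.")
```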
It took 0.6475389003753662 seconds to build the graph.
Average training loss for Epoch 0 : 3.55362293501
Average training loss for Epoch 1 : 3.32045680079
Average training loss for Epoch 2 : 3.27433713688
It took 101.60246014595032 seconds to train for 3 epochs.
Scan was only marginally slower than dynamic_rnn to train. Upgrading the RNN cell is just as easy: to use a GRU, we swap out cell = tf.nn.rnn_cell.BasicRNNCell(state_size) for cell = tf.nn.rnn_cell.GRUCell(state_size), or swap it with this for LSTM:
```python
cell = tf.nn.rnn_cell.LSTMCell(state_size)
```
The LSTM keeps two sets of internal state vectors, (c) (for memory cell or constant error carousel) and (h) (for hidden state). By default, they are concatenated into a single vector, but as of this writing, using the default arguments to LSTMCell will produce a warning message:
WARNING:tensorflow:<tensorflow.python.ops.rnn_cell.LSTMCell object at 0x7faade1708d0>: Using a concatenated state is slower and will soon be deprecated. Use state_is_tuple=True.
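Following the warning’s advice, we pass state_is_tuple=True when constructing the cell (this matches the cells used in the graphs above):

```python
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
```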
Note that if you are wrapping an LSTMCell that uses state_is_tuple=True, you should pass this same argument to the MultiRNNCell as well.
Writing a custom RNN cell
It’s almost too easy to use the standard GRU or LSTM cells, so let’s define our own RNN cell. Here’s a random idea that may or may not work: starting with a GRU cell, instead of taking a single transformation of its input, we enable it to take a weighted average of multiple transformations of its input. That is, using the notation from Cho et al. (2014), instead of using (Wx) in our candidate state, (\tilde h^{(t)} = \text{tanh}(Wx + U(r \odot h^{(t-1)}))), we use a weighted average of (W_1 x, \ W_2 x \dots W_n x) for some (n). In other words, we will replace (Wx) with (\sum_i \lambda_i W_i x) for some weights (\lambda_i) that sum to 1. The vector of weights, (\lambda), will be calculated as (\lambda = \text{softmax}(W_{avg}x^{(t)} + U_{avg}h^{(t-1)} + b)). The idea is that we might benefit from treating the input differently in different scenarios (e.g., we may want to treat verbs differently than nouns).
To write the custom cell, we need to extend tf.nn.rnn_cell.RNNCell. Specifically, we need to fill in 3 abstract methods (the state_size and output_size properties and the __call__ method) and write an __init__ method (take a look at the Tensorflow code here). First, let’s start with a GRU cell, adapted from Tensorflow’s implementation:
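The original adaptation isn’t reproduced in this copy of the post, but a minimal sketch of a GRU cell in the same spirit looks like this (the exact variable names and scoping are assumptions; the built-in implementation uses a shared linear helper instead of explicit matmuls):

```python
class GRUCell(tf.nn.rnn_cell.RNNCell):
    """Sketch of a GRU cell (cf. Cho et al. 2014)."""

    def __init__(self, num_units):
        self._num_units = num_units

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        input_size = inputs.get_shape()[1].value
        with tf.variable_scope(scope or type(self).__name__):
            with tf.variable_scope("Gates"):
                # Reset gate r and update gate u, computed from [inputs, state].
                # Bias starts at 1.0 so the cell initially neither resets nor updates.
                W_g = tf.get_variable('W_g', [input_size + self._num_units, 2 * self._num_units])
                b_g = tf.get_variable('b_g', [2 * self._num_units],
                                      initializer=tf.constant_initializer(1.0))
                gates = tf.nn.sigmoid(tf.matmul(tf.concat(1, [inputs, state]), W_g) + b_g)
                r, u = tf.split(1, 2, gates)
            with tf.variable_scope("Candidate"):
                # Candidate state: tanh(W x + U (r * h) + b)
                W = tf.get_variable('W', [input_size, self._num_units])
                U = tf.get_variable('U', [self._num_units, self._num_units])
                b = tf.get_variable('b', [self._num_units],
                                    initializer=tf.constant_initializer(0.0))
                c = tf.nn.tanh(tf.matmul(inputs, W) + tf.matmul(r * state, U) + b)
            new_h = u * state + (1 - u) * c
        return new_h, new_h
```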
We modify the __init__ method to take a parameter (n) at initialization, which will determine the number of transformation matrices (W_i) it will create:
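A sketch of what that might look like (the class name CustomCell and the parameter name num_weights are assumptions):

```python
class CustomCell(tf.nn.rnn_cell.RNNCell):
    """GRU-like cell that mixes num_weights transformations of its input."""

    def __init__(self, num_units, num_weights):
        self._num_units = num_units
        self._num_weights = num_weights

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units
```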
Then, we modify the Candidate variable scope of the __call__ method to do a weighted average as shown below (note that all of the (W_i) matrices are created as a single variable and then split into multiple tensors):
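Again, a sketch rather than the original code: the Gates scope is unchanged from the GRU sketch above, and only the Candidate scope differs:

```python
    def __call__(self, inputs, state, scope=None):
        input_size = inputs.get_shape()[1].value
        with tf.variable_scope(scope or type(self).__name__):
            with tf.variable_scope("Gates"):
                # Unchanged from the GRU sketch above.
                W_g = tf.get_variable('W_g', [input_size + self._num_units, 2 * self._num_units])
                b_g = tf.get_variable('b_g', [2 * self._num_units],
                                      initializer=tf.constant_initializer(1.0))
                gates = tf.nn.sigmoid(tf.matmul(tf.concat(1, [inputs, state]), W_g) + b_g)
                r, u = tf.split(1, 2, gates)
            with tf.variable_scope("Candidate"):
                # Mixing weights: lambda = softmax(W_avg x + U_avg h + b_avg)
                W_avg = tf.get_variable('W_avg', [input_size, self._num_weights])
                U_avg = tf.get_variable('U_avg', [self._num_units, self._num_weights])
                b_avg = tf.get_variable('b_avg', [self._num_weights],
                                        initializer=tf.constant_initializer(0.0))
                lambdas = tf.nn.softmax(tf.matmul(inputs, W_avg) + tf.matmul(state, U_avg) + b_avg)
                lambdas = tf.split(1, self._num_weights, lambdas)

                # All W_i created as a single variable, then split into num_weights matrices.
                Ws = tf.get_variable('Ws', [self._num_weights, input_size, self._num_units])
                Ws = [tf.squeeze(w, [0]) for w in tf.split(0, self._num_weights, Ws)]

                # Weighted average of the W_i x transformations.
                Wx = tf.add_n([tf.matmul(inputs, W) * l for W, l in zip(Ws, lambdas)])

                U = tf.get_variable('U', [self._num_units, self._num_units])
                b = tf.get_variable('b', [self._num_units],
                                    initializer=tf.constant_initializer(0.0))
                c = tf.nn.tanh(Wx + tf.matmul(r * state, U) + b)
            new_h = u * state + (1 - u) * c
        return new_h, new_h
```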
Let’s see how the custom cell stacks up to a regular GRU cell (using num_steps = 30, since this performs much better than num_steps = 200 after 5 epochs – can you see why that might happen?):
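The build_multilayer_graph_with_custom_cell function used below isn’t shown in this copy; presumably it mirrors build_multilayer_lstm_graph_with_dynamic_rnn with a cell_type switch along these lines (a sketch; the parameter names are assumptions):

```python
# Inside the graph builder: choose the cell based on cell_type
if cell_type == 'Custom':
    cell = CustomCell(state_size, num_weights_for_custom_cell)
elif cell_type == 'GRU':
    cell = tf.nn.rnn_cell.GRUCell(state_size)
elif cell_type == 'LSTM':
    cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
else:
    cell = tf.nn.rnn_cell.BasicRNNCell(state_size)
```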
```python
g = build_multilayer_graph_with_custom_cell(cell_type='GRU', num_steps=30)
t = time.time()
train_network(g, 5, num_steps=30)
print("It took", time.time() - t, "seconds to train for 5 epochs.")
```
```python
g = build_multilayer_graph_with_custom_cell(cell_type='Custom', num_steps=30)
t = time.time()
train_network(g, 5, num_steps=30)
print("It took", time.time() - t, "seconds to train for 5 epochs.")
```
So much for that idea. Our custom cell took almost twice as long to train and seems to perform worse than a standard GRU cell.
Adding Dropout
Adding features like dropout to the network is easy: we figure out where they belong and drop them in.
Dropout belongs in between layers, not on the state or in intra-cell connections. See Zaremba et al. (2015), Recurrent Neural Network Regularization (“The main idea is to apply the dropout operator only to the non-recurrent connections.”)
Thus, to apply dropout, we need to wrap the input and/or output of each cell. In our list-based RNN implementation, we might do something like this:
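A sketch of what that could look like (keep_prob is an illustrative value):

```python
keep_prob = 0.9  # illustrative keep probability
rnn_inputs = [tf.nn.dropout(rnn_input, keep_prob) for rnn_input in rnn_inputs]
rnn_outputs, final_state = tf.nn.rnn(cell, rnn_inputs, initial_state=init_state)
rnn_outputs = [tf.nn.dropout(rnn_output, keep_prob) for rnn_output in rnn_outputs]
```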
In our dynamic_rnn or scan implementations, we might apply dropout directly to the rnn_inputs or rnn_outputs:
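Again a sketch, with dropout applied to the whole 3-D inputs and outputs tensors:

```python
rnn_inputs = tf.nn.dropout(rnn_inputs, keep_prob)
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)
rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)
```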
But what happens when we use MultiRNNCell? How can we have dropout in between layers like in Zaremba et al. (2015)? The answer is to wrap our base RNN cell with dropout, thereby including it as part of the base cell, similar to how we wrapped our three RNN cells into a single MultiRNNCell above. Tensorflow allows us to do this without writing a new RNNCell by using tf.nn.rnn_cell.DropoutWrapper:
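A sketch of the wrapping (the keep probabilities are illustrative):

```python
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
```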
Note that if we wrap a base cell with both input and output dropout and then use it to build a MultiRNNCell, dropout gets applied twice in between layers: once by the lower layer’s output dropout and once by the upper layer’s input dropout (so if both keep probabilities are, say, 0.9, the effective keep probability in between layers will be 0.9 * 0.9 = 0.81). If we want equal dropout on all inputs and outputs of a multi-layered RNN, we can use only output or input dropout on the base cell, and then wrap the entire MultiRNNCell with the input or output dropout like so:
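A sketch:

```python
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
```

This way, every boundary (the data going into the first layer, each layer-to-layer connection, and the output of the last layer) gets exactly one round of dropout.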
Layer normalization
Layer normalization is a technique published just a few days ago by Lei Ba et al. (2016) that we can use to improve our RNN. It was inspired by batch normalization, which you can read about and learn how to implement in my post here. Batch normalization (for feed-forward and convolutional neural networks) and layer normalization (for recurrent neural networks) generally improve training time and achieve better overall performance. In this section, we’ll apply what we’ve learned in this post to implement layer normalization in Tensorflow.
Layer normalization is applied as follows: the initial layer normalization function is applied individually to each training example, normalizing the output vector of a linear transformation to have a mean of 0 and a variance of 1. In math: (LN_{initial}: v \mapsto \frac{v - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}}) for some vector (v) and some small value of (\epsilon) for numerical stability. For the same reasons we add scale and shift parameters to the initial batch normalization transform (see my batch normalization post for details), we add scale, (\alpha), and shift, (\beta), parameters here as well, so that the final layer normalization function is:
[LN: v \mapsto \alpha \odot \frac{v - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}} + \beta]
Note that (\odot) is point-wise multiplication.
To add layer normalization to our network, we first write a function that will layer normalize a 2D tensor along its second dimension:
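A minimal sketch of such a helper, following the LN formula above (the scoping scheme is an assumption):

```python
def ln(tensor, scope=None, epsilon=1e-5):
    """Layer normalizes a 2D tensor along its second dimension."""
    assert len(tensor.get_shape()) == 2
    m, v = tf.nn.moments(tensor, [1], keep_dims=True)
    if not isinstance(scope, str):
        scope = ''
    with tf.variable_scope(scope + 'layer_norm'):
        scale = tf.get_variable('scale', shape=[tensor.get_shape()[1]],
                                initializer=tf.constant_initializer(1))
        shift = tf.get_variable('shift', shape=[tensor.get_shape()[1]],
                                initializer=tf.constant_initializer(0))
    # LN_initial: subtract the per-example mean, divide by the per-example std
    ln_initial = (tensor - m) / tf.sqrt(v + epsilon)

    return ln_initial * scale + shift
```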
Let’s apply our layer normalization function as Lei Ba et al. (2016) applied it to LSTMs (in their experiments “Teaching machines to read and comprehend” and “Handwriting sequence generation”). Lei Ba et al. apply layer normalization to the output of each gate inside the LSTM cell, which means that we get to take a second shot at writing a new type of RNN cell. We’ll start with Tensorflow’s official code, located here, and modify it accordingly:
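The modified cell isn’t reproduced in this copy of the post; the sketch below captures the idea, layer normalizing each gate’s pre-activation and the new cell state. It follows the structure of BasicLSTMCell, but the details are assumptions rather than the original code:

```python
class LayerNormalizedLSTMCell(tf.nn.rnn_cell.RNNCell):
    """Sketch of an LSTM cell with layer normalization on each gate's
    pre-activation and on the new cell state."""

    def __init__(self, num_units, forget_bias=1.0):
        self._num_units = num_units
        self._forget_bias = forget_bias

    @property
    def state_size(self):
        return tf.nn.rnn_cell.LSTMStateTuple(self._num_units, self._num_units)

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        with tf.variable_scope(scope or type(self).__name__):
            c, h = state  # state_is_tuple-style state

            # One linear map of [inputs, h], split into the four gate pre-activations.
            # No bias here: ln's shift parameter plays that role.
            input_size = inputs.get_shape()[1].value
            W = tf.get_variable('W', [input_size + self._num_units, 4 * self._num_units])
            concat = tf.matmul(tf.concat(1, [inputs, h]), W)
            i, j, f, o = tf.split(1, 4, concat)

            # Layer normalize each gate separately.
            i = ln(i, scope='i_')
            j = ln(j, scope='j_')
            f = ln(f, scope='f_')
            o = ln(o, scope='o_')

            new_c = (c * tf.nn.sigmoid(f + self._forget_bias) +
                     tf.nn.sigmoid(i) * tf.nn.tanh(j))
            new_h = tf.nn.tanh(ln(new_c, scope='new_h_')) * tf.nn.sigmoid(o)

            return new_h, tf.nn.rnn_cell.LSTMStateTuple(new_c, new_h)
```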
And that’s it! Let’s try this out.
Final model
At this point, we’ve covered all of the graph modifications we planned to cover, so here is our final model, which allows for dropout and layer normalized LSTM cells:
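The final graph-building function isn’t preserved in this copy of the post; the sketch below (the name build_final_graph, the parameter names, and the defaults are all assumptions) combines the pieces covered above: a selectable cell type, optional dropout, dynamic_rnn, softmax predictions for generation, and a saver:

```python
def build_final_graph(
    cell_type = None,
    num_weights_for_custom_cell = 5,
    state_size = 100,
    num_classes = vocab_size,
    batch_size = 32,
    num_steps = 200,
    num_layers = 3,
    build_with_dropout = False,
    keep_prob = 0.9,
    learning_rate = 1e-4):

    reset_graph()

    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')

    embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])
    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    # Choose among the cells covered above.
    if cell_type == 'Custom':
        cell = CustomCell(state_size, num_weights_for_custom_cell)
    elif cell_type == 'GRU':
        cell = tf.nn.rnn_cell.GRUCell(state_size)
    elif cell_type == 'LN_LSTM':
        cell = LayerNormalizedLSTMCell(state_size)
    else:
        cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)

    # Optional dropout: output dropout on the base cell, input dropout on the stack.
    if build_with_dropout:
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    if build_with_dropout:
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)

    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))

    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b
    predictions = tf.nn.softmax(logits)  # used for character generation below

    total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

    return dict(
        x = x,
        y = y,
        init_state = init_state,
        final_state = final_state,
        total_loss = total_loss,
        train_step = train_step,
        preds = predictions,
        saver = tf.train.Saver()
    )
```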
Let’s compare the GRU, LSTM and LN_LSTM after training each for 20 epochs using 80 step sequences.
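The comparison runs presumably looked something like this (a sketch; the checkpoint paths are assumptions):

```python
for cell_type in ['GRU', 'LSTM', 'LN_LSTM']:
    g = build_final_graph(cell_type=cell_type, num_steps=80)
    t = time.time()
    losses = train_network(g, 20, num_steps=80, save="saves/" + cell_type + "_20_epochs")
    print("It took", time.time() - t, "seconds to train for 20 epochs.")
    print("The average loss on the final epoch was:", losses[-1])
```

The three results below appear in that order (GRU, LSTM, LN_LSTM).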
It took 1051.6652357578278 seconds to train for 20 epochs. The average loss on the final epoch was: 1.75318197903
It took 614.4890048503876 seconds to train for 20 epochs. The average loss on the final epoch was: 2.02813237837
It took 3867.550405740738 seconds to train for 20 epochs. The average loss on the final epoch was: 1.71850851623
To generate text with a trained model, we rebuild the graph with batch_size = 1 and num_steps = 1, restore a saved checkpoint, and repeatedly sample a character from the model’s predictions, feeding each sample back in as the next input:
```python
def generate_characters(g, checkpoint, num_chars, prompt='A', pick_top_chars=None):
    """ Accepts a current character, initial state"""

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        g['saver'].restore(sess, checkpoint)

        state = None
        current_char = vocab_to_idx[prompt]
        chars = [current_char]

        for i in range(num_chars):
            if state is not None:
                feed_dict = {g['x']: [[current_char]], g['init_state']: state}
            else:
                feed_dict = {g['x']: [[current_char]]}

            preds, state = sess.run([g['preds'], g['final_state']], feed_dict)

            if pick_top_chars is not None:
                # zero out everything but the top k characters, then renormalize
                p = np.squeeze(preds)
                p[np.argsort(p)[:-pick_top_chars]] = 0
                p = p / np.sum(p)
                current_char = np.random.choice(vocab_size, 1, p=p)[0]
            else:
                current_char = np.random.choice(vocab_size, 1, p=np.squeeze(preds))[0]

            chars.append(current_char)

    chars = [idx_to_vocab[c] for c in chars]  # a list, so we can join it more than once
    print("".join(chars))
    return("".join(chars))
```
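The sample below came from a call along these lines (a sketch; the checkpoint path, the number of characters, and the pick_top_chars value are assumptions, and build_final_graph is the sketch from above):

```python
g = build_final_graph(cell_type='GRU', num_steps=1, batch_size=1)
generate_characters(g, "saves/GRU_20_epochs", 750, prompt='A', pick_top_chars=5)
```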
ATOOOS
UIEAOUYOUZZZZZZUZAAAYAYf n fsflflrurctuateot t ta’s a wtutss ESGNANO: Whith then, a do makes and them and to sees, I wark on this ance may string take thou honon To sorriccorn of the bairer, whither, all I’d see if yiust the would a peid.
LARYNGLe: To would she troust they fould.
PENMES: Thou she so the havin to my shald woust of As tale we they all my forder have As to say heant thy wansing thag and Whis it thee shath his breact, I be and might, she Tirs you desarvishensed and see thee: shall, What he hath with that is all time, And sen the have would be sectiens, way thee, They are there to man shall with me to the mon, And mere fear would be the balte, as time an at And the say oun touth, thy way womers thee.
It’s not exactly Shakespeare, but the structure of a play is recognizably there. Let’s also try a larger dataset: a collection of movie and TV scripts.
```python
"""
Load new data
"""
file_url = 'https://gist.githubusercontent.com/spitis/59bfafe6966bfe60cc206ffbb760269f/'+\
    'raw/030a08754aada17cef14eed6fac7797cda830fe8/variousscripts.txt'
file_name = 'variousscripts.txt'
if not os.path.exists(file_name):
    urllib.request.urlretrieve(file_url, file_name)

with open(file_name, 'r') as f:
    raw_data = f.read()
    print("Data length:", len(raw_data))

vocab = set(raw_data)
vocab_size = len(vocab)
idx_to_vocab = dict(enumerate(vocab))
vocab_to_idx = dict(zip(idx_to_vocab.values(), idx_to_vocab.keys()))

data = [vocab_to_idx[c] for c in raw_data]
del raw_data
```
Data length: 3299132
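The result below presumably came from training on this data for 30 epochs, along these lines (a sketch; the cell type, sequence length, and checkpoint path are assumptions):

```python
g = build_final_graph(cell_type='LN_LSTM', num_steps=80)
t = time.time()
losses = train_network(g, 30, num_steps=80, save="saves/LN_LSTM_30_epochs_variousscripts")
print("It took", time.time() - t, "seconds to train for 30 epochs.")
print("The average loss on the final epoch was:", losses[-1])
```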
It took 4877.8002140522 seconds to train for 30 epochs. The average loss on the final epoch was: 0.726858645461
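And the sample below from a generation call like this one (again a sketch; the prompt and checkpoint path are guesses):

```python
g = build_final_graph(cell_type='LN_LSTM', num_steps=1, batch_size=1)
generate_characters(g, "saves/LN_LSTM_30_epochs_variousscripts", 750, prompt='D', pick_top_chars=5)
```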
DENT’SUEENCK
Bartholomew of the TIE FIGHTERS are stunned. There is a crowd and armored switcheroos.
PICARD (continuing) Couns two dim is tired. In order to the sentence…
The sub bottle appears on the screen into a small shuttle shift of the ceiling. The DAMBA FETT splash fires and matches them into the top, transmit to stable high above upon their statels, falling from an alien shaft.
ANAKIN and OBI-WAN stand next to OBI-WAN down the control plate of smoke at the TIE fighter. They stare at the centre of the station loose into a comlink cover – comes up to the General, the GENERAL HUNTAN AND FINNFURMBARD from the PICADOR to a beautiful Podracisly.
ENGINEER Naboo from an army seventy medical security team area re-weilergular.
EXT.
Not sure these are that much better than before, but it’s sort of readable?
Conclusion
In this post, we used a character sequence generation task to learn how to use Tensorflow’s scan and dynamic_rnn functions, how to use advanced RNN cells and stack multiple RNNs, and how to add features to our RNN like dropout and layer normalization. In the next post, we will use a machine translation task to look at handling variable length sequences and building RNN encoders and decoders.