Alas! what a great **loss** there will be to **learning** before the cycle of the Moon is completed.
“The Prophecies of Nostradamus”, Century I, 62
In 1555, did Nostradamus predict deep learning built on stochastic optimization using a loss function’s gradient? Almost certainly not. But what can deep learning predict about Nostradamus in 2017? Let us begin by assuming that’s an interesting question.
The Rise of Deep Learning
Deep learning has evolved so rapidly in the last five years that it’s hard to keep up. It’s being applied to predict fraud or illness, to play games, and even to playfully generate art. Within a year, new research makes its way into mainstream open-source software like TensorFlow and Keras. It also continues to drive a whole hardware market segment for ever bigger and more powerful GPUs and other specialized processors.
We’ve written about how to apply distributed deep learning frameworks like Deeplearning4j and BigDL, using Cloudera Data Science Workbench and a CDH cluster, to classify images. This blog will explore generating text instead: for fun, new prophecies in the style of Nostradamus. Instead of an Apache Spark-based distributed framework, this example will make use of TensorFlow and Keras (and thus Python), and will show how to use GPUs to accelerate training of a particular type of model with the new 1.1 release of Cloudera Data Science Workbench.
Big Data Paradigms vs Deep Learning
The success of deep learning is in large part a triumph of massive brute-force computation. Much of its magic lies in automatically discovering meaningful representations of raw input by sheer volume of number-crunching. The big data movement, of course, is all about cheap commodity hardware and smart open source software as a way to economically access nearly unlimited compute and storage. Deep learning should therefore be a natural fit for frameworks like Spark and Apache Hadoop.
Deep learning in some ways challenges that paradigm, however. For example, Spark assumes problems are data-parallel, and that scaling out means splitting up large jobs into more and more independent tasks. While deep learning training algorithms are naturally parallelizable, they depend on intense and constant communication between tasks.
Training a model is compute-intensive: it requires only basic math, but in huge quantities. CPUs are generalists, whereas specialized hardware like GPUs provides exactly the kind of massively parallel simple math that training needs, so deep learning benefits significantly from specialized hardware. Finally, small-data problems can still be big-compute problems in deep learning. Not everything starts with a terabyte of data!
It’s no wonder that many popular deep learning frameworks have implemented training on single-machine GPUs. Many successful applications and popular examples can even run on a modern laptop. Sometimes, they are simply the right tool for the job.
GPUs: Now an Option for Deep Learning on CDH
That’s not to say that deep learning is a poor fit for platforms like CDH. Distributed training and CPU-based training remain possible, and even essential in some instances. After all, you’re more likely to have a cluster of CPUs lying around already, ready to use. CDH maximizes options, making it possible to use Spark-based distributed frameworks where needed, as well as non-distributed frameworks where appropriate, including frameworks written in Python that weren’t even designed for Hadoop technologies.
The release of the Cloudera Data Science Workbench earlier this year has made the latter far more feasible. Now it’s easy to run scripts that use any deep learning framework for Python, in a friendly notebook-like environment with secured access to cluster resources and data.
This is good news, because deep learning projects are undergoing their own Cambrian explosion, and practitioners are spoiled for choice. It’s powerful to have access to the current best tooling regardless of language or compute paradigm. With the 1.1 release, these tools can now also access GPUs that are installed on the Cloudera Data Science Workbench worker nodes.
Below, TensorFlow, arguably the most popular framework today, will be used to generate text with deep learning. Developed by Google, TensorFlow provides a Python language API for expressing the complex chains of mathematical operations underneath a deep learning model’s training. It can support single-node or distributed training, and CPUs or GPUs.
It is, however, more of a programming language for deep learning than a user-friendly toolkit. Keras has gained popularity alongside TensorFlow as a higher-level front-end for TensorFlow and other frameworks like Theano. TensorFlow and Keras are a natural choice for exploring deep learning.
Generative Deep Learning with Text
Machine learning models are often thought of as black boxes that consume input and spit out a number or decision. Useful, but austere. Yet machine learning techniques have now helped crack game playing and machine translation, and they can generate outputs that feel creative.
Models that generate novel data aren’t new, but their outputs remain among the most intriguing. Transferring the style of one painting onto another is a prominent example. See the ‘painting’ of Nostradamus in the style of Van Gogh’s “Starry Night” below, for example.
This idea is already within range of its ominous logical conclusion: faking convincing audio or video recordings from a public figure!
Among others, Andrej Karpathy produced an excellent explanation of modeling text with recurrent neural networks (RNNs) in “The Unreasonable Effectiveness of Recurrent Neural Networks”. RNNs are a specialization of deep learning for sequences of data, and are not specific to text. However, text can be viewed as a sequence of words to be learned, and his examples show RNNs learning to generate text in the style of a given corpus: new sonnets given Shakespeare’s plays, or mostly-valid C code from the Linux kernel’s source. The model does so by continuously predicting the next character given the previous characters. It’s astounding that it works as well as it does.
These aren’t even the most recent or sophisticated examples of models that generate creative outputs: see variational autoencoders and generative adversarial networks (GANs). However, this example will stick to simple RNNs, applying them to the prophecies of Nostradamus in order to generate new predictions of what the rest of 2017 might have in store.
The Problem
The task is to learn to predict each successive word, based on immediately preceding words. A model that can do this can continuously predict a likely next word given previously-predicted words, and so generate a stream of new text. Consider the phrase:
The lily of the Dauphin will reach into Nancy,
As far as Flanders the Elector of the Empire
To learn from it effectively, the text needs a little preprocessing, to join lines, lower-case words, mark the start and end breaks, and break out punctuation as distinct “words”:
*<BREAK>* the lily of the dauphin will reach into nancy , as far as flanders the elector of the empire *<BREAK>*
So, the model will learn that, for example, “as” is likely to follow a phrase ending in “into nancy ,”.
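To make that concrete, here is a minimal sketch of the preprocessing and windowing. The `preprocess` helper and its regular expression are hypothetical stand-ins, not the repository’s actual code:

```python
import re

def preprocess(lines):
    # Join lines, lower-case, break punctuation out as distinct "words",
    # and mark the start and end of the quatrain
    text = " ".join(line.strip().lower() for line in lines)
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text)
    return ["<BREAK>"] + tokens + ["<BREAK>"]

tokens = preprocess([
    "The lily of the Dauphin will reach into Nancy,",
    "As far as Flanders the Elector of the Empire",
])

# Each training example pairs a window of preceding words with the next word
phrase_len = 4  # hypothetical window size; the script's setting may differ
for i in range(len(tokens) - phrase_len):
    print(tokens[i:i + phrase_len], "->", tokens[i + phrase_len])
```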
Deep learning techniques work on numerical data, not text. The words must be encoded as numbers in a meaningful way to feed them into a training process. One-hot encoding is a possibility. Because there are about 5,000 distinct words in this text, each word becomes a vector of about 5,000 numbers, all 0, except for a single 1. Each distinct word is assigned a unique position in these vectors, and the vector representation for a word has a 1 in that word’s assigned position only. In fact, it’s common to see text-generation models actually learn to predict each successive character rather than word using this type of representation. This example won’t do so, in order to take advantage of an alternative representation of entire words.
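For instance, a one-hot encoding over a toy five-word vocabulary (standing in for the real ~5,000 words) could be built like this sketch:

```python
import numpy as np

# Toy vocabulary standing in for the ~5,000 distinct words in the text
vocab = ["the", "lily", "of", "dauphin", ","]
word_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # All zeros, except a single 1 in the word's assigned position
    vec = np.zeros(len(vocab))
    vec[word_index[word]] = 1.0
    return vec

print(one_hot("lily"))  # [0. 1. 0. 0. 0.]
```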
One-hot encoding is common, but isn’t the only or necessarily most efficient representation. Techniques like Word2vec can discover a mapping from words to vectors such that similar words map to vectors that are “close”. The vectors live in a continuous space, and are not sparse (mostly 0 elements). They have a much smaller fixed dimension, in the hundreds instead of thousands. The mapping is known as an “embedding”.
It’s possible to learn the embedding as part of the model training. However, given how computationally-intensive it is to learn models, it’s useful to reuse existing models or parts of models where possible. This is called “transfer learning.”
In particular, Stanford NLP has produced a pre-trained set of embeddings called GloVe, of various dimensions, trained on large sets of English text. (Hat tip for this idea to the book Deep Learning with Python, by the author of Keras, François Chollet.) These give a useful vector representation of almost all the words that will be found in the input text. Reusing these mappings from words to vectors gives training a head start, though training will still augment and tailor the embeddings to this corpus.
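Loading these embeddings is straightforward, since GloVe ships as plain text with one word and its vector per line. Here is a sketch of the idea, not the repository’s exact code; the tiny `word_index` is hypothetical, and glove.6B.100d.txt is one of the files in the GloVe download described later in this post:

```python
import numpy as np

# Hypothetical tiny vocabulary; the script derives its own from the corpus
word_index = {"the": 1, "lily": 2, "dauphin": 3}
num_distinct_words = len(word_index)
embedding_dim = 100

# Parse GloVe's plain-text format: a word followed by its vector, per line
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Row i holds the pre-trained vector for the word assigned index i; words
# missing from GloVe keep a zero row and are learned from scratch
embeddings = np.zeros((num_distinct_words + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embeddings[i] = vector
```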
Given embeddings of several words in a phrase, a Gated Recurrent Unit (GRU) will learn how sequences of words influence the current state of a sentence. Finally, the model will predict the next word using a dense (affine) layer that learns the relationship between the state output by the GRU and the actual next word. For technical reasons, it’s more effective to learn a one-hot encoding of the output word, rather than its embedding.
This example won’t go into detail about how these techniques work or why. For that, do see “Deep Learning with Python” above, or the excellent book “Deep Learning” by Goodfellow, Bengio and Courville, available online at http://www.deeplearningbook.org/.
Modeling with Keras and TensorFlow
The complete source code for this example is available at https://github.com/srowen/quatrains-rnn. The repository contains brief notes on downloading GLoVe and running the script, and the single Python script itself has comments that should make its operation fairly self-explanatory.
The core model definition is quite simple with Keras. After some code that is just concerned with loading and transforming the source data and embeddings, this is all that’s required to define the fairly sophisticated text model described above:
```python
# Optionally pin the model's operations to a GPU; otherwise use the CPU
if use_gpu:
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    K.set_session(session)
    device = '/gpu:0'
else:
    device = '/cpu:0'

model = Sequential()
with tf.device(device):
    # Map word indices to embedding vectors
    model.add(Embedding(num_distinct_words + 1, embedding_dim,
                        input_length=phrase_len))
    # Learn how the sequence of words evolves the sentence's state
    model.add(GRU(gru_dim,
                  dropout=dropout,
                  recurrent_dropout=dropout,
                  kernel_regularizer=l2(regularization),
                  bias_regularizer=l2(regularization),
                  recurrent_regularizer=l2(regularization)))
    # Predict a probability distribution over the next word
    model.add(Dense(num_distinct_words + 1, activation='softmax'))

# Initialize the embedding layer with the pre-trained GloVe vectors
model.layers[0].set_weights([embeddings])
model.compile(optimizer=RMSprop(lr=learning_rate),
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()
```
To generate predictions, the script simply repeatedly trains the model for a few epochs (full passes through the data) and then generates words from scratch:
```python
model.fit(phrases_train, next_word_embeddings_train,
          class_weight=word_index_weights,
          epochs=5,
          batch_size=batch_size,
          shuffle=True,
          validation_data=(phrases_val, next_word_embeddings_val),
          verbose=2)

# Generate five new "quatrains" by repeatedly sampling a next word
for i in range(0, 5):
    random_phrase = np.array([0] * phrase_len)
    emitted_quatrain = []
    while len(emitted_quatrain) < 32:
        random_phrase_t = np.copy(random_phrase)
        random_phrase_t.shape = (1, phrase_len)
        pred_next_word = model.predict(random_phrase_t)[0]
        # Sample a next-word index according to the predicted distribution
        draw = np.random.uniform()
        for pred_next_word_index in range(0, len(pred_next_word)):
            draw -= pred_next_word[pred_next_word_index]
            if draw < 0.0:
                break
        if pred_next_word_index == 0:
            # Index 0 is reserved and maps to no real word; end the quatrain
            break
        else:
            emitted_quatrain.append(index_word[pred_next_word_index])
            # Slide the window: drop the oldest word, append the new one
            random_phrase = np.append(random_phrase[1:], pred_next_word_index)
    print(" ".join(emitted_quatrain))
```
The repository has further comments on this code, for the interested.
Running with Cloudera Data Science Workbench
To execute this script in Cloudera Data Science Workbench, first create a New Project. Name it “quatrains-rnn” and choose a git-based initial setup, cloning the repo https://github.com/srowen/quatrains-rnn :
Open the file quatrains.py and choose “Open In Workbench”. Start a Python 3 session, and choose a container size that has at least 2GB of memory, and as many CPUs as you can allocate!
Choose “Terminal access” and use the terminal window to download and unzip the GloVe embedding files, as described in the repository README file:
```
curl -O https://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
```
Then simply choose Run > Run All to execute the entire script. TensorFlow and Keras will be installed, if not already.
Using GPUs
Note that enabling GPUs on Cloudera Data Science Workbench nodes does involve some extra setup of packages and drivers on the workers. You will also need to choose to add at least 1 GPU to your workbench session when it is run.
If using GPUs, edit the line containing `pip3 install tensorflow` to instead install `tensorflow-gpu`. Also edit the line `use_gpu = False` to read `use_gpu = True`.
Optimizing for CPUs
The TensorFlow package is built to be compatible with a wide range of CPUs, and you may see warnings that it’s not optimized for your particular CPU’s advanced features, like AVX. TensorFlow can be compiled from source to take advantage of these features, which can speed up training by 30-40% and make CPU-based training more cost-effective than GPUs in some cases.
While it’s optional, and not something a data scientist would normally ever do, it’s actually not hard to compile TensorFlow even in Cloudera Data Science Workbench. The quatrains-rnn repository describes how to do this.
TensorBoard
While the training process is running, it can be useful to explore the model itself and the details of its training progress. The code is already logging details of each training pass to the directory logs/[run number]. TensorFlow’s TensorBoard web UI is a tool for parsing these logs and analyzing them visually.
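That logging is the kind produced by Keras’s TensorBoard callback; here is a sketch of the idea, reusing the model and data arrays defined earlier (the run-numbering shown is hypothetical, and the script’s actual callback setup may differ):

```python
from keras.callbacks import TensorBoard

run_number = 12  # hypothetical; the script picks a fresh number per run
tensorboard = TensorBoard(log_dir="logs/{}".format(run_number))

# Passing the callback to fit() writes loss and accuracy summaries per epoch
model.fit(phrases_train, next_word_embeddings_train,
          epochs=5, batch_size=batch_size,
          validation_data=(phrases_val, next_word_embeddings_val),
          callbacks=[tensorboard])
```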
To run TensorBoard, just open Terminal Access and type:
```
python3 -m tensorflow.tensorboard --port 8080 --logdir=logs
```
You should now find that the TensorBoard UI is available from your workbench session:
The Graphs tab gives a visualization of the complex underlying TensorFlow model that Keras has defined:
The Scalars tab plots output metrics like validation loss and accuracy over time, and over successive runs of the model training process. Try selecting “Wall” and comparing loss versus validation loss:
Predictions for 2017
If you run this script on, for example, the CPU in a modern laptop, you’ll find it outputs something like the following, repeatedly:
```
Run 12
Train on 30152 samples, validate on 1587 samples
Epoch 1/5
15s - loss: 22.5385 - acc: 0.4203 - val_loss: 10.6454 - val_acc: 0.2035
Epoch 2/5
16s - loss: 22.3075 - acc: 0.4255 - val_loss: 10.6076 - val_acc: 0.1941
Epoch 3/5
16s - loss: 22.2522 - acc: 0.4292 - val_loss: 10.7896 - val_acc: 0.1960
Epoch 4/5
16s - loss: 22.0518 - acc: 0.4292 - val_loss: 10.6768 - val_acc: 0.1960
Epoch 5/5
16s - loss: 21.8356 - acc: 0.4334 - val_loss: 10.7958 - val_acc: 0.1928

vicenza , , seven , the not will be joined to pass the pyrenees mountains , the great of the lady who will cause the great one of the blood of the
will the house of the put out death before his wife , he will be in great people in the gulf of the long great which will come to flash the king
noon the is of the great one of the great deed , the city of the great blood will come from the languedoc , the and the water of the great one
and leagues from the sky , a very high an taken , london through false , not the great one will be heard , blood , not the pig's king will hold
through a very great king , his slavonia , he will langres the mountains of france the great cock will be joined .
```
It’s getting somewhere. The cross-entropy loss is decreasing, and hasn’t yet gone below the validation loss, so it’s not overfitting yet. Some overfitting isn’t necessarily so bad here, given the task; it’s also hard to avoid given the small data set size.
The generated text is starting to look vaguely coherent in parts, but not interpretable even for a cryptic prophecy. Epochs are taking 16 seconds to run, and it may take hundreds of epochs to reach a sensible model.
Fortunately, the very same code can be run on a GPU. Just one change to the configuration in the script will select a GPU. Epochs complete almost 20 times faster on a GPU, in under a second. It doesn’t take long to get to something more like:
```
Run 59
Train on 30152 samples, validate on 1587 samples
Epoch 1/5
1s - loss: 12.5874 - acc: 0.6998 - val_loss: 13.1232 - val_acc: 0.1626
Epoch 2/5
1s - loss: 12.5844 - acc: 0.6979 - val_loss: 13.1669 - val_acc: 0.1594
Epoch 3/5
1s - loss: 12.3868 - acc: 0.7034 - val_loss: 13.1369 - val_acc: 0.1651
Epoch 4/5
1s - loss: 12.4778 - acc: 0.7039 - val_loss: 13.1442 - val_acc: 0.1619
Epoch 5/5
1s - loss: 12.5344 - acc: 0.7013 - val_loss: 13.1395 - val_acc: 0.1632

through force of arms the great city of the great queen will make the way of the sun , will not be a new king to peace . penetrate the mountains of
holy law will hold him .
holy laws through a new king but the old one will see a very great plague , his people will it be , be in the realm , and the will be
taken by the king of the gray bird , the black city where the great prince of annemark , with the surname he will be at his high place , he will
will attain to the great empire , soon and his wife .
```
It’s overfitting a little now, but producing lines that might just pass for the real thing, especially with formatting restored:
Taken by the King of the gray bird,
The black city where the great Prince of Annemark,
With the surname he will be at his high place, he will
Will attain to the great empire, soon and his wife.
Letting it run all night will produce some impressive text, like:
By the king of the great temple of Le Mas by the Garonne,
Taken, those of Orléans will give a lawful wall.
The greatest ones will be in despair. Those of the soil uneven,
Port Selyn will make mighty invasions.
However, at this point, the model has mostly memorized the input, and you can readily find whole phrases here that were ‘quoted’ from the source.
Feel free to edit the training parameters in the file, like dropout, layer size, regularization, and phrase length, and re-run to observe the effects.
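For example, near the top of the script you might try values like these; all of these names appear in the model code above, but the particular values here are hypothetical alternatives, not the script’s defaults:

```python
# Hypothetical alternative settings to experiment with
dropout = 0.2          # more dropout fights overfitting
gru_dim = 256          # a larger recurrent state can capture more context
regularization = 1e-4  # stronger L2 penalties also fight overfitting
phrase_len = 8         # how many preceding words the model sees
learning_rate = 0.001  # smaller steps; slower, but possibly more stable
```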
Conclusion
The deep learning renaissance is in full swing, and there are more options than ever for building deep learning models on a CDH cluster. The popular combination of Keras and TensorFlow (with TensorBoard) is one of several frameworks readily usable for deep learning with CDH, bringing relatively simple yet powerful modeling and visualization tools to deep learning practitioners.
As of today, these can be accelerated by GPUs, where appropriate, in the Cloudera Data Science Workbench as well. Models can be used to generate new text and images that are often delightfully creative. They can even generate prophecies of the future, ones with (as far as we know) no less accuracy than those of big-name seers! The court astrologer may be the occupation most under threat from automation by deep learning this year.