Deep learning with Apache MXNet on Cloudera Data Science Workbench

With the abundance of deep learning frameworks available today, it can be difficult to know which to choose for a particular application. Given the contrasting strengths and weaknesses of these frameworks, the ability to work with and switch between more than one is particularly important. Recent Cloudera blogs have shown examples of applying deep learning on the Cloudera ecosystem using the popular frameworks Deeplearning4j, BigDL, and Keras+TensorFlow. This post extends those examples by providing details on performing image recognition with convolutional neural networks using the Apache MXNet (incubating) framework, leveraging Cloudera’s Data Science Workbench for an interactive development environment.

The MXNet Python interface was easy to learn, in large part because of its dynamic computation model, which is familiar to anyone with traditional Python experience. The performance and scaling of MXNet were on par with other GPU-accelerated frameworks like PyTorch and TensorFlow, and the ability to easily add multiple GPUs for linear speedups in computation time was a big win.

Apache MXNet

MXNet is now an Apache incubator project, but originally came out of the Distributed (Deep) Machine Learning Community (DMLC) initiative. It has an efficient backend implementation written in C++, but supports many different API languages, including Python, R, and Scala. It is a full-featured deep learning framework that supports multi-GPU and distributed training.

Nearly every framework has strengths and weaknesses: TensorFlow and Caffe are great for model deployment and productionizing, while PyTorch excels during the model building and research phases. MXNet, short for “mix-net,” intends to combine the strengths of a few different approaches to deep learning in a single framework. One of the original goals of the project was to provide both imperative and declarative approaches to deep learning (the tradeoffs of which are discussed later in this post), giving users the option to choose. This is further reinforced by the recent release of the MXNet Gluon API, which allows seamless switching between static and dynamic computation graphs for deep learning models. The ability to use both static and dynamic computation models makes MXNet a good choice both for building custom, modular algorithms and for deploying fast, optimized networks in the production phase.

Image Classification via Transfer Learning

Recent Cloudera blogs have already demonstrated the concept of using pre-trained image classification models to achieve state-of-the-art results on other datasets. Those examples used distributed computation frameworks like Apache Spark and Apache Hadoop to distribute the heavy computation among processing cores of CPUs across a cluster of machines. This example instead explores using GPUs on a cluster edge node using a Python environment on the Cloudera Data Science Workbench.

The goal of this post is to perform image classification on the Caltech-256 dataset, predicting the label of each image from one of 257 different possibilities. Instead of training a deep neural network from scratch on this data, it is generally a best practice to leverage pre-trained neural networks – in this case, networks that were trained on the popular ImageNet dataset – and transfer their learned weights to this task.

The first step in transfer learning with images is to “featurize” them, or to pass the images through a portion of the pretrained model to extract higher-level visual features about the images. The Gluon API provides an easy way to pull popular image recognition models from the MXNet “model zoo.”

    import mxnet as mx
    from mxnet import gluon

    devices = [mx.gpu(i) for i in range(num_gpus)]  # num_gpus is set elsewhere in the session
    vgg16 = gluon.model_zoo.vision.vgg16(pretrained=True, ctx=devices)

The code above pulls the model structure and model weights from a remote repository, and loads the parameters onto the devices specified in the ctx argument. Inspecting the model is as easy as typing the model variable name into the CDSW prompt.

    VGG(
      (features): HybridSequential(
        (0): Conv2D(64, kernel_size=(3, 3), …)
        (1): Activation(relu)
        (2): Conv2D(64, kernel_size=(3, 3), …)
        (3): Activation(relu)
        (4): MaxPool2D(size=(2, 2), …)
        (5): Conv2D(128, kernel_size=(3, 3))
        (6): Activation(relu)
        (7): Conv2D(128, kernel_size=(3, 3))
        (8): Activation(relu)
        (9): MaxPool2D(size=(2, 2), …)
        (10): Conv2D(256, kernel_size=(3, 3))
        (11): Activation(relu)
        (12): Conv2D(256, kernel_size=(3, 3))
        (13): Activation(relu)
        (14): Conv2D(256, kernel_size=(3, 3), …)
        (15): Activation(relu)
        (16): MaxPool2D(size=(2, 2), …)
        (17): Conv2D(512, kernel_size=(3, 3), …)
        (18): Activation(relu)
        (19): Conv2D(512, kernel_size=(3, 3), …)
        (20): Activation(relu)
        (21): Conv2D(512, kernel_size=(3, 3), …)
        (22): Activation(relu)
        (23): MaxPool2D(size=(2, 2), …)
        (24): Conv2D(512, kernel_size=(3, 3), …)
        (25): Activation(relu)
        (26): Conv2D(512, kernel_size=(3, 3), …)
        (27): Activation(relu)
        (28): Conv2D(512, kernel_size=(3, 3), …)
        (29): Activation(relu)
        (30): MaxPool2D(size=(2, 2), …)
      )
      (classifier): HybridSequential(
        (0): Dense(4096, Activation(relu))
        (1): Dropout(p = 0.5)
        (2): Dense(4096, Activation(relu))
        (3): Dropout(p = 0.5)
        (4): Dense(1000, linear)
      )
    )

This provides the standard VGG16 model, separated into two distinct blocks: features and classifier. For the featurization step, the features block (referenced as vgg16_feat in the code that follows) will be used with its pretrained weights to generate the features for each image. Reading the images is easy using Gluon’s built-in dataset and loading tools.

    phases = ['train', 'test', 'valid']
    datasets = {phase: gluon.data.vision.ImageFolderDataset(
                    data_dir + '256_ObjectCategories/' + phase,
                    flag=1, transform=image_transform)
                for phase in phases}
    loaders = {phase: gluon.data.DataLoader(
                   datasets[phase], batch_size=batch_size, shuffle=False)
               for phase in phases}

A dataset of images and a model loaded in the session are all that is needed to start exploring some of the neat aspects of the Gluon framework. Gluon allows users to switch seamlessly between the dynamic graph (“define-by-run”) and the static graph (“define-then-run”) computation models. The dynamic graph model, which PyTorch also uses, is generally more intuitive and easier to work with since it is closer to traditional programming, where users can view the outputs as lines of code are executed. The drawback of this approach is that it sacrifices the performance optimizations available when the whole program is compiled at once, before execution.
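The difference between the two models can be illustrated without any deep learning library. The sketch below is a toy in plain Python, not MXNet's implementation: define_by_run computes each intermediate value immediately so it can be inspected, while the hypothetical build_graph and run_graph helpers first describe the computation as a list of operations and only evaluate it later, which is the point at which a real framework could optimize the whole program.

```python
# Toy illustration of the two computation models (not MXNet code).

def define_by_run(x):
    """Imperative style: every line executes immediately."""
    h = x * 2.0                      # the intermediate value exists right away
    print("intermediate h =", h)     # so it can be inspected while debugging
    return h + 1.0


def build_graph():
    """Declarative style: describe the computation without running it."""
    # Each node is (operation name, constant operand).
    return [("mul", 2.0), ("add", 1.0)]


def run_graph(graph, x):
    """Execute a previously built graph on a concrete input."""
    ops = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}
    value = x
    for name, operand in graph:
        value = ops[name](value, operand)
    return value


eager_result = define_by_run(3.0)
graph = build_graph()                # nothing has been computed yet
static_result = run_graph(graph, 3.0)
print(eager_result, static_result)
```

Both paths compute the same function; the difference lies in when the work happens and how much of the program is visible to the system before it runs.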

The loaded model can be used as a dynamic graph by default, and can be used to check that a sample batch passes successfully through the network.

    > first_data, first_label = next(iter(loaders['valid']))
    > vgg16_feat.forward(first_data.as_in_context(mx.gpu()))
    …
    <NDArray 32x512x7x7 @gpu(0)>

The output of the forward pass in this case is an actual NDArray that can be used just like traditional NumPy arrays. The advantage of this is that it makes control flow, looping, and experimentation easier and there is less overhead to learning this paradigm since it feels like NumPy. Still, in the style of TensorFlow it is possible to compile the computation graph and use it as a “define-then-run” program.
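The 32x512x7x7 shape reported above is consistent with the printed architecture. As a rough check, the sketch below assumes 224x224 input images (the usual ImageNet crop size, which this post does not state), a batch size of 32, and pooling layers that halve the spatial size, and it treats the convolutions as size-preserving. These are assumptions for illustration rather than values read from the model.

```python
# Back-of-the-envelope shape check (assumes 224x224 inputs and stride-2 pooling).
batch_size = 32
input_size = 224       # assumption: standard ImageNet crop
num_pool_layers = 5    # MaxPool2D layers (4), (9), (16), (23), (30) in the printout
out_channels = 512     # channels of the last Conv2D layer

spatial = input_size
for _ in range(num_pool_layers):
    spatial //= 2      # each 2x2 pooling halves height and width

feature_shape = (batch_size, out_channels, spatial, spatial)
features_per_image = out_channels * spatial * spatial
print(feature_shape, features_per_image)
```

Under these assumptions each image is reduced to 512 feature maps of 7x7 values, which is what the classifier block consumes once flattened.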

In this case there is no real need for the dynamic graph approach since featurization just involves taking an unchanging pre-trained model to compute the forward pass. To boost the performance, convert the model to a static graph using the hybridize method.

Also, be sure to check out the official Gluon documentation for fantastic tutorials, with code, exploring all sorts of useful deep learning concepts.

Image Featurization

The featurize function below can be used to distribute data across multiple GPUs, compute the forward pass through the model, and save the result in MXNet’s efficient binary format, RecordIO.

    def featurize(file_name, iterator):
        t0 = time.time()
        batches = []
        k = 0  # number of images featurized so far
        for data, label in iterator:
            k += data.shape[0]
            datas = gluon.utils.split_and_load(data, devs, even_split=False)
            labels = gluon.utils.split_and_load(label, devs, even_split=False)
            featurized = [(label, vgg16_feat.forward(data)) for label, data in zip(labels, datas)]
            batches.extend(featurized)
        write_batches(batches, file_name)
        print("Featurized %d images in %0.1f seconds." % (k, time.time() - t0))

The split_and_load utility function is used to split up each minibatch onto the GPUs available in the cluster. MXNet then passes each batch through a copy of the network that lives on each GPU, and finally writes those batches to disk on the local file system.
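To picture what the splitting step does, here is a plain-Python approximation. The split_batch helper is hypothetical and only models the slicing; the real split_and_load also copies each slice to its device, and the exact slice sizes MXNet chooses for an uneven batch may differ by version.

```python
def split_batch(batch, num_devices):
    """Split a list of samples into num_devices contiguous slices.

    Slice sizes differ by at most one when the batch does not divide
    evenly, which is the situation even_split=False is meant to allow.
    """
    base, extra = divmod(len(batch), num_devices)
    slices, start = [], 0
    for i in range(num_devices):
        size = base + (1 if i < extra else 0)   # first `extra` slices get one more
        slices.append(batch[start:start + size])
        start += size
    return slices


batch = list(range(10))              # e.g. a final, partially filled minibatch
slices = split_batch(batch, 3)
print([len(s) for s in slices])
```

Allowing uneven slices matters mostly for the last minibatch of a dataset, whose size is rarely a multiple of the number of GPUs.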

Kick off the featurization step for each of the train/test/valid phases as below.

    featurize('data/train_feat', loaders['train'])
    featurize('data/valid_feat', loaders['valid'])
    featurize('data/test_feat', loaders['test'])

As this runs, it can be helpful to inspect the GPU utilization by opening a terminal session and executing the nvidia-smi command.

In this case, the nvidia-smi output indicates that two GPUs are being utilized to perform featurization for each batch in parallel. The featurization step is the most compute-intensive part of the entire pipeline, more so than the training phase, and it parallelizes well because it amounts to matrix computations with relatively little communication overhead. Using the Workbench makes it trivial to add hardware during compute-intensive phases like featurization, and to free resources that aren’t needed during later phases (training, in this case). GPUs can be added to achieve near-linear speedups.

The ability to freely add and remove resources depending on the workload is particularly nice for deep learning applications, since some stages of exploration and training require significantly more resources than others.

MXNet with CDSW provides the tools to easily distribute heavy deep learning computations across multiple GPUs and/or CPUs, while the Gluon API serves as a flexible, easy-to-learn interface to the optimized MXNet backend.

Training

Now that the images have been featurized, they are available in the data directory in the MXNet RecordIO format.

    data
    - valid_feat.idx
    - test_feat.idx
    - train_feat.idx
    - valid_feat.rec
    - train_feat.rec
    - test_feat.rec

Before, the built-in ImageFolderDataset was used to read the JPEG files from disk, but in this case the format being read is RecordIO files, so a different approach is necessary.

Fortunately, it is possible to create a new type of dataset that extends the built-in RecordFileDataset, reads the binary format, and converts the bytes back into the NDArray format needed for training. The full code for this new ArrayRecordDataset can be found here.
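The essential step in such a dataset is turning each stored byte string back into a label and a feature array. The sketch below shows one possible round trip using only Python's standard library. The record layout (a float32 label followed by float32 features) and the pack_record and unpack_record helpers are assumptions made for illustration; they are not the format used by the ArrayRecordDataset mentioned above or by MXNet's RecordIO.

```python
import struct


def pack_record(label, features):
    """Serialize a label and a flat feature list as little-endian float32."""
    return struct.pack("<%df" % (len(features) + 1), label, *features)


def unpack_record(record):
    """Inverse of pack_record: returns (label, features)."""
    count = len(record) // 4                 # 4 bytes per float32 value
    values = struct.unpack("<%df" % count, record)
    return int(values[0]), list(values[1:])


record = pack_record(3, [0.5, 1.25, -2.0])   # values chosen to be exact in float32
label, features = unpack_record(record)
print(label, features)
```

A real dataset class would apply the same kind of decoding to each record it reads and then wrap the result in the framework's array type.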

For the model, a reasonable starting point is to use the classifier section of the VGG16 model, but only train the final dense layer – leaving the other dense layers unchanged.

    > vgg16.classifier
    HybridSequential(
      (0): Dense(4096, Activation(relu))
      (1): Dropout(p = 0.5)
      (2): Dense(4096, Activation(relu))
      (3): Dropout(p = 0.5)
      (4): Dense(1000, linear)
    )

Two things need to happen before training: the parameters of the first two dense layers must be “frozen” so that they are not changed during optimization, and the last layer must be replaced with one that has 257 outputs instead of the 1000 originally used for the ImageNet dataset. First, replace the final layer of the model:

    vgg16_cls = vgg16.classifier
    with vgg16_cls.name_scope():
        vgg16_cls._children.pop()
        vgg16_cls.register_child(gluon.nn.Dense(257, prefix="predictions"))

With the model layers correctly constructed, separate fixed and trainable parameters, and tell the MXNet engine that the backward pass is not required for fixed parameters.

    trainable_params = gluon.parameter.ParameterDict()
    fixed_params = gluon.parameter.ParameterDict()
    for layer in vgg16_cls._children:
        if 'predictions' in layer.name:
            trainable_params.update(layer.collect_params())
        else:
            fixed_params.update(layer.collect_params())
    fixed_params.setattr('grad_req', 'null')
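The effect of setting grad_req to 'null' can be pictured with a small framework-free sketch. The parameter names, values, and gradients below are made up for illustration, and the sgd_step helper is hypothetical; MXNet itself goes further and does not compute gradients for such parameters at all, whereas this toy simply skips them during the update.

```python
# Hypothetical parameter table: name -> value, gradient, and grad_req flag.
params = {
    "dense0_weight": {"value": 1.0, "grad": 0.3, "grad_req": "write"},
    "dense1_weight": {"value": -2.0, "grad": 0.1, "grad_req": "write"},
    "predictions_weight": {"value": 0.5, "grad": -0.2, "grad_req": "write"},
}

# Freeze everything that does not belong to the new output layer.
for name, param in params.items():
    if "predictions" not in name:
        param["grad_req"] = "null"


def sgd_step(params, lr):
    """Update only the parameters that still request gradients."""
    for param in params.values():
        if param["grad_req"] != "null":
            param["value"] -= lr * param["grad"]


sgd_step(params, lr=0.1)
print({name: param["value"] for name, param in params.items()})
```

After the step, only the parameter whose name contains "predictions" has moved; the frozen ones keep their pretrained values.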

While other deep learning libraries like Keras provide a model.fit function (and in fact, the original MXNet API does as well), libraries like MXNet Gluon and PyTorch are designed to make it so easy to work with layers, tensors, and inputs/outputs that writing a fit function by hand amounts to little more than writing a simple Python for loop. Fitting a deep learning model usually just involves calling the forward and backward passes on the model, and then updating model parameters according to some flavor of stochastic gradient descent. An example of such a function is shown below.

    def run(num_gpus, lr, num_epochs):
        # the list of GPUs that will be used
        ctx = [mx.gpu(i) for i in range(num_gpus)]
        print('Running on {}'.format(ctx))

        train_data = loaders['train']
        valid_data = loaders['valid']
        print('Batch size is {}'.format(train_data._batch_sampler._batch_size))

        trainable_params.initialize(ctx=ctx, force_reinit=True)

        trainer = gluon.Trainer(trainable_params, 'adam', {'learning_rate': lr})
        for epoch in range(num_epochs):
            # train
            start = time.time()
            num_samples = 0
            for data, label in train_data:
                num_samples += data.shape[0]
                train_batch(data, label, ctx, vgg16_cls, trainer)
            mx.nd.waitall()  # wait until all computations are finished to benchmark the time
            print('Epoch %d, samples=%d, training time = %.1f sec' %
                (epoch, num_samples, time.time() - start))
            # validation set
            correct, num = 0.0, 0.0
            for data, label in valid_data:
                correct += valid_batch(data, label, ctx, vgg16_cls)
                num += data.shape[0]
            print('         validation accuracy = %.4f' % (correct / num))

The run function is responsible for initializing the parameters on the correct devices, and then running a configurable number of epochs for training. Each epoch consists of training on all mini-batches, and then displaying results of the evaluation of the new parameters on the validation set. An equivalent but simplified version of the train_batch function above is shown below.

    def train_batch(data, label, network, trainer):
        # record the forward pass so that gradients can be computed (mxnet.autograd)
        with autograd.record():
            loss = loss_function(network.forward(data), label)
        loss.backward()
        trainer.step(batch_size)

That is basically all that is needed to build and train a deep learning model in MXNet Gluon! A call to the run function will kick off training on one or more GPUs (and CPUs if desired) and produce an optimized model for image classification. For this task of object recognition on the Caltech-256 dataset, the Gluon model produces results in line with other frameworks and matches the state of the art.
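To make the forward, backward, and update cycle concrete outside of any framework, the sketch below trains only a final softmax layer on fixed inputs. It is a toy under stated assumptions, not the code from this post: the four two-dimensional "features" are hard-coded stand-ins for frozen-layer outputs, the gradient is written out by hand, and plain gradient descent replaces the Adam optimizer used above. The division by the number of samples plays the same role as the batch_size argument to trainer.step, which normalizes the gradient by the batch size.

```python
import math

# Fixed "featurized" inputs (stand-ins for frozen-layer outputs) and labels.
features = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]
labels = [0, 0, 1, 1]
num_classes = 2

# Trainable parameters of the final dense layer only.
weights = [[0.0, 0.0] for _ in range(num_classes)]
bias = [0.0] * num_classes


def forward(x):
    """Dense layer followed by softmax; returns class probabilities."""
    logits = [sum(w * xi for w, xi in zip(weights[c], x)) + bias[c]
              for c in range(num_classes)]
    peak = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr):
    """One full-batch gradient descent step; returns the mean loss before the update."""
    loss = 0.0
    grad_w = [[0.0, 0.0] for _ in range(num_classes)]
    grad_b = [0.0] * num_classes
    for x, y in zip(features, labels):
        probs = forward(x)                    # forward pass
        loss -= math.log(probs[y])            # cross-entropy loss
        for c in range(num_classes):          # backward pass, written by hand
            delta = probs[c] - (1.0 if c == y else 0.0)
            grad_b[c] += delta
            for j, xj in enumerate(x):
                grad_w[c][j] += delta * xj
    n = len(features)
    for c in range(num_classes):              # parameter update, averaged over the batch
        bias[c] -= lr * grad_b[c] / n
        for j in range(len(weights[c])):
            weights[c][j] -= lr * grad_w[c][j] / n
    return loss / n


def predict(x):
    """Index of the most probable class."""
    probs = forward(x)
    return probs.index(max(probs))


losses = [train_epoch(lr=0.5) for _ in range(50)]
predictions = [predict(x) for x in features]
print(losses[0], losses[-1], predictions)
```

The loop has the same three parts as train_batch: a forward pass, a backward pass, and a parameter update. A framework such as Gluon automates the middle step and runs all three on the GPU.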

Conclusions

Apache MXNet is performant, versatile, and user-friendly

MXNet proved to be another great tool for building deep learning applications, further improved through the release of the new Gluon API. Though MXNet provides many of the same features and interfaces as other open-source frameworks, it manages to stand out in several ways:

  1. It provides an imperative interface for working directly with tensors and layers that can leverage GPU acceleration. The dynamic graph framework that MXNet provides makes it easier to learn, since it feels like normal Python programming. It is easier to debug since the output can be inspected and validated at any point in the network. It is easier to create custom modules, when the need inevitably arises, because all of the building blocks are first-class citizens that users are already familiar with. Composing these into something custom feels almost as natural as using the standard tools. The ease of use of a dynamic framework lowers the barrier for new users of deep learning, which is crucially important at a time when deep learning has proven itself to be such an important differentiator in many industries and applications.

  2. MXNet allows users to choose between dynamic and static graphs. While dynamic graphs are superior for debugging and general intuition, static graphs can be optimized in ways that dynamic graphs cannot. MXNet Gluon provides the ability to choose the paradigm that best suits the needs of the users.

  3. MXNet provides APIs in several popular languages: Python, R, Scala, C++, Matlab, and others. The ability to choose between APIs in these languages is certainly convenient, but the Python API remains the main focus and other languages may lag behind.

  4. MXNet supports multi-GPU and multi-machine training, and boasts near-linear scaling across hundreds of machines. Any serious deep learning framework is going to support multi-GPU and multi-machine training now or in the future, but MXNet is on the bleeding edge here.

BYODLF: Bring your own deep learning framework with CDSW

While Apache MXNet is an exciting framework that seeks to combine the best of PyTorch and TensorFlow, it is just one of many popular and worthy deep learning frameworks available today. Working in the Cloudera Data Science Workbench, deployed and integrated on top of the full suite of Cloudera tools, makes it easy to leverage any of these frameworks, and painless to switch between them when the need arises. In addition to this example using MXNet Gluon, the git repository for this blog also contains examples for Keras+TensorFlow and PyTorch.

All code used in this blog post can be found here. The examples are highly inspired by the MXNet Gluon documentation.

Note: Cloudera Data Science Workbench is available to Cloudera customers who license Cloudera Enterprise Data Hub and Cloudera Data Engineering editions. Learn more about Apache MXNet here. Cloudera does not distribute or support Apache MXNet.