Getting Started with Cloudera Data Science Workbench

Last week, Cloudera announced the General Availability release of Cloudera Data Science Workbench. In this post, I’ll give a brief overview of its capabilities and architecture, along with a quick-start guide to connecting Cloudera Data Science Workbench to your existing CDH cluster in three simple steps.

At its core, Cloudera Data Science Workbench enables self-service data science for the enterprise. Data scientists can build, scale, and deploy data science and machine learning solutions in a fraction of the time, all while leveraging the full power and security capabilities of Cloudera’s Enterprise Data Hub.

Core capabilities of Cloudera Data Science Workbench

Since we first started building Cloudera Data Science Workbench, our goal has been to deliver a solution that data scientists, analytics leaders, and IT administrators can love. That means zero configuration and true flexibility for data scientists, multi-tenancy and seamless collaboration for analytics leaders, and easy integration and high security for IT administrators. For too long these goals have been in conflict.

With our Cloudera Data Science Workbench 1.0 release, we believe we’ve achieved these goals. Specifically, Cloudera Data Science Workbench delivers benefits to each group.

Key benefits of Cloudera Data Science Workbench

These capabilities and benefits are made possible by Cloudera Data Science Workbench’s underlying architecture. To understand how this is accomplished, let me dive a bit deeper.

Secure, scalable, multi-tenant gateways for data science

Cloudera Data Science Workbench runs on one or more dedicated gateway hosts on a CDH cluster. Cloudera Manager ensures that Cloudera Data Science Workbench has the libraries and configuration necessary to securely access the CDH cluster, without additional configuration. Data scientists access Cloudera Data Science Workbench directly from their web browser, with no required downloads or installation steps.

Cloudera Data Science Workbench connected to an existing CDH cluster

To ensure that users can bring all the tools and libraries they need without IT intervention, Cloudera Data Science Workbench uses Docker containers to run isolated user workloads. On a per project basis, users can run R, Python, and Scala workloads with different versions of libraries and system packages. CPU and memory are also isolated, ensuring reliable, scalable execution in a multi-tenant setting. Each Docker container running user workloads provides a virtualized gateway with secure access to cluster services such as Apache HDFS, Apache Spark 2, Apache Hive, and Apache Impala (incubating).

Cloudera Data Science Workbench is built from the ground up to support data science teams working together on a single shared environment. Each installation starts with one master gateway node. Worker gateway nodes can be added and removed at any time to increase the total capacity. This makes it easy to add capacity completely transparently to end users as usage expands.

Cloudera Data Science Workbench transparently schedules containers across multiple nodes. This scheduling is done using Kubernetes, a container orchestration system used internally by Cloudera Data Science Workbench. Neither Docker nor Kubernetes are directly exposed to end users, with users interacting with Cloudera Data Science Workbench through a web application. By isolating users from direct access to edge hosts, Cloudera Science Workbench provides additional flexibility to end-users, while preserving security.

Native Spark 2 support from R, Python, and Scala

In addition to supporting standalone R and Python access to CDH services such as HDFS, Hive, and Impala, Cloudera Data Science Workbench has native support for interactive and batch access to Spark 2.1 — the latest and greatest release of Spark. There’s no need to submit a Spark application, wait for the results, and then resubmit the application when you discover an error or unexpected result. Data scientists can work directly within an interactive workbench from exploration to production.

To harness the full power of existing CDH clusters, Cloudera Data Science Workbench leverages Spark using YARN client mode, where the Spark driver runs within a Cloudera Data Science Workbench project container and Spark executors run with full access to CDH cluster resources. With Spark’s dynamic allocation enabled, Spark only claims resources when necessary, enabling cluster resources to be shared dynamically across different workloads in a fine-grained manner. Running the driver within a container, allows data scientists to easily install packages and work interactively in a fully customizable environment, while still leveraging the full power of Spark’s distributed execution and robust multi-tenancy under YARN.

Spark 2 support from R, Python, and Scala integrated with YARN, including dynamic allocation for long running interactive sessions and batch jobs.

Easy installation in three steps

Cloudera Data Science Workbench delivers a self-service data science experience that data scientists, analytics leaders, and IT administrators can love. Thankfully, bringing these capabilities to your existing CDH cluster is also easy.

You can download the official 1.0 RPM by visiting our downloads page and then follow a few simple installation steps. At a high level, all you’ll need to do is:

Configure gateway host in Cloudera Manager.
Install Cloudera Data Science Workbench on the master gateway host.
Add zero or more worker hosts, if desired.

With that, you’ll be able to use R, Python, and Scala to securely connect to your CDH cluster, collaborate and share projects and results, and accelerate data science from exploration to production in a single, secure, multi-tenant environment.

To learn more about Cloudera Data Science Workbench’s capabilities you can read the product overview or watch a webinar overview and demo with Matt Brandwein, Director of Products at Cloudera. If you have questions, please join the discussion on the Cloudera Community portal.