Create a conda recipe to use a C-extension Python library on a PySpark cluster with Cloudera Data Science Workbench

Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use libraries, such as XGBoost, that include C/C++ extensions. This post shows how to solve this problem by creating a conda recipe for a package with a C extension. The sample repository is here.

See the official documentation for more detail on conda packages. For C extensions especially, this tutorial is a useful resource to read.

(Optional) Prepare Anaconda Docker image

We developed this recipe in an Anaconda image for Docker. A Docker container keeps the environment isolated, which helps ensure library compatibility and prevents trouble from mixing system-installed and Anaconda-installed libraries. macOS users, in particular, need to build inside a Docker container because the C libraries that ship with macOS are not compatible with what’s needed here.

Let’s prepare the Docker image from Anaconda. The following commands are executed on your host machine.

  $ docker pull continuumio/anaconda
  $ docker run -i -v $(pwd):/root/mecab -t continuumio/anaconda /bin/bash

Note: Because of a glibc incompatibility, a package built in this Docker image will not work on older Linux distributions such as CentOS 6.

Prepare Conda Environment

According to the official documentation, installing GCC/G++ is required. If you don’t have any development tools, make sure to install the dependencies first. For this recipe, install the required packages as follows (e.g. on Debian/Ubuntu):

  $ sudo apt-get install g++ autoconf make

Before you start development, install the conda build tool.

  $ conda install conda-build
  $ conda upgrade conda
  $ conda upgrade conda-build

Write Your Recipe

After creating the working directory, write meta.yaml to set the package name, version, source repository, and dependencies. In this case, we will make a package named mecab.

  $ cd ~/
  $ mkdir mecab
  $ cd mecab

Example recipe:

  package:
    name: mecab
    version: "0.996"

  source:
    git_url: https://github.com/taku910/mecab
    git_rev: 32041d9504d11683ef80a6556173ff43f79d1268

  build:
    number: 0

  requirements:
    build:
      - g++
    run:
      - libgcc

  about:
    home: http://taku910.github.io/mecab
    license: BSD,LGPL,GPL

Write Your Build Script

Write build.sh to compile your package.

Example script:

  # Build MeCab
  cd mecab
  ./configure --prefix=$PREFIX --with-charset=utf8
  make
  make install

  # Build dictionary for MeCab
  cd ../mecab-ipadic
  ./configure --with-mecab-config=$PREFIX/bin/mecab-config --prefix=$PREFIX --with-charset=utf8 --with-dicdir=$PREFIX/lib/mecab/dic/ipadic
  make
  make install

  # Build MeCab Python wrapper
  cd ../mecab/python
  swig -python -shadow -c++ ../swig/MeCab.i
  python setup.py build
  python setup.py install --prefix=$PREFIX

NOTE: Don’t forget to set $PREFIX as the installation prefix for every dependent component. $PREFIX is a reserved environment variable that points to the package’s build-time installation path. I will describe it later, but if you forget to add $PREFIX to the configure options, the component will not be packaged into your conda package.

Build a Package

Let’s build your package. The following command creates a conda package archived as a tar.bz2 file.
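This is the standard conda-build invocation, assuming you are still in the recipe directory (~/mecab) that contains meta.yaml and build.sh:

  $ conda build .
  # The built archive lands under conda-bld/, e.g.
  # /opt/conda/conda-bld/linux-64/mecab-0.996-0.tar.bz2
  # (the exact name depends on the version and build number).
  # To check that the $PREFIX-installed files were packaged, list the contents:
  $ tar tjf /opt/conda/conda-bld/linux-64/mecab-0.996-0.tar.bz2 | head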

After the build succeeds, you can install the package from the local build directory.

  $ conda install mecab --use-local
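As a quick smoke test of the local install, you can try a parse. This one-liner assumes the default dictionary configured at build time is picked up; the sample sentence is MeCab’s canonical example:

  $ python -c 'import MeCab; print(MeCab.Tagger().parse("すもももももももものうち"))'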

Upload and Install Your Package from the Anaconda Repository

If you want to distribute your package, you can use the Anaconda repository. This requires the anaconda client, which you should install via conda.

  $ conda install anaconda-client

After creating your anaconda.org account, you can log in and upload your package.

  $ anaconda login
  # Input your anaconda.org username and password…
  $ anaconda upload /opt/conda/conda-bld/linux-64/mecab-0.996-1.tar.bz2

Now, we can install the package from the Anaconda repository.

  $ conda install -c <username> mecab

NOTE: Replace <username> with your anaconda.org username.
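For example, using the author’s chezou channel that appears later in this post:

  $ conda install -c chezou mecab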

How to use the package with PySpark on CDSW

You can create a conda environment with a C extension as follows:

  $ conda create --copy -q -y -c chezou -n mecab_env python=2 mecab

After creating the environment, the directory structure is as follows:

  .conda/envs/mecab_env/
      bin/
      conda-meta/
      etc/
      include/
      lib/
      libexec/
      share/
      ssl/

If you set $PREFIX in your conda build script, conda installs everything under the conda environment appropriately. With the --copy option, conda copies dependencies into the environment instead of symlinking them; that’s why you can distribute the C extension.
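As a quick check that the environment is self-contained, you can invoke its own interpreter directly; the path follows the listing above:

  $ ~/.conda/envs/mecab_env/bin/python -c "import MeCab"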

You can distribute the package along with its conda environment. For more detail, see:

Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench
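For reference, outside CDSW the same distribution idea maps onto a spark-submit invocation. This is only a rough sketch under assumptions not made in this post: a YARN cluster, and the hypothetical names mecab_env.zip, MECAB, and sample.py:

  # Pack the --copy environment into an archive
  $ cd ~/.conda/envs && zip -r ~/mecab_env.zip mecab_env

  # Ship the archive with the job and point PySpark at the bundled interpreter
  $ PYSPARK_PYTHON=./MECAB/mecab_env/bin/python spark-submit \
      --master yarn --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MECAB/mecab_env/bin/python \
      --archives ~/mecab_env.zip#MECAB \
      sample.py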

Summary

Creating a conda recipe enables you to use C-extension-based Python packages on a Spark cluster with Cloudera Data Science Workbench, without installing them on each node. Data scientists can run their favorite packages without modifying the cluster.

To learn more about the Data Science Workbench, visit our website.

Aki Ariga is a Field Data Scientist at Cloudera and a sparklyr contributor.