Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use libraries, such as XGBoost, that include C/C++ extensions. This post shows how to solve this problem by creating a conda recipe for a package with a C extension. The sample repository is here.
See the official documentation for more detail about conda packages. For C extensions in particular, this tutorial is a useful resource.
(Optional) Prepare Anaconda Docker image
We developed this recipe in an Anaconda image for Docker. A Docker container keeps the build environment isolated, preventing conflicts between system-installed and Anaconda-installed libraries. macOS users, in particular, need to build inside a Docker container because the C libraries that ship with macOS are not compatible with what’s needed here.
Let’s prepare the Docker image from Anaconda. The following commands are executed on your host machine.
$ docker pull continuumio/anaconda
$ docker run -i -v $(pwd):/root/mecab -t continuumio/anaconda /bin/bash
Note: Because of a glibc incompatibility, packages built in this Docker image will not work on older Linux distributions such as CentOS 6.
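If you are unsure whether a target distribution is affected, you can compare glibc versions between the build container and the target host (this check is not part of the original workflow):

# Print the glibc version; run this both inside the container and on the target host
$ ldd --version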
Prepare Conda Environment
According to the official documentation, GCC/G++ is required. If you don’t have any development tools, install them first. For this recipe, install the required packages as follows (Debian/Ubuntu example):
$ sudo apt-get install g++ autoconf make
Before you start development, install the conda build tool.
$ conda install conda-build
$ conda upgrade conda
$ conda upgrade conda-build
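As an optional sanity check before proceeding, confirm both tools are available; each command should print a version number:

$ conda --version
$ conda build --version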
Write Your Recipe
After creating the working directory, write meta.yaml to set the package name, version, source repositories, and dependencies. In this case, we will make a package named mecab.
$ cd ~/
$ mkdir mecab
$ cd mecab
Example recipe:
package:
  name: mecab
  version: "0.996"

source:
  git_url: https://github.com/taku910/mecab
  git_rev: 32041d9504d11683ef80a6556173ff43f79d1268

build:
  number: 0

requirements:
  build:
    - g++
  run:
    - libgcc

about:
  home: http://taku910.github.io/mecab
  license: BSD,LGPL,GPL
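As an optional step (not part of the original workflow), conda-build can validate the recipe without running a full build:

# Validate the recipe in the current directory without building it
$ conda build --check .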
Write Your Build Script
Write build.sh to compile your package.
Example script:
# Build MeCab
cd mecab
./configure --prefix=$PREFIX --with-charset=utf8
make
make install

# Build dictionary for MeCab
cd ../mecab-ipadic
./configure --with-mecab-config=$PREFIX/bin/mecab-config --prefix=$PREFIX --with-charset=utf8 --with-dicdir=$PREFIX/lib/mecab/dic/ipadic
make
make install

# Build MeCab Python wrapper
cd ../mecab/python
swig -python -shadow -c++ ../swig/MeCab.i
python setup.py build
python setup.py install --prefix=$PREFIX
NOTE: Don’t forget to set $PREFIX as the installation prefix for every dependent component. $PREFIX is a reserved environment variable. I will describe it in more detail later, but if you forget to add $PREFIX to the compilation options, the component will not be packaged into your conda package.
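Optionally, you can also add a test section to meta.yaml so that conda build verifies the Python wrapper imports cleanly right after packaging. This is a minimal sketch, not part of the recipe above:

test:
  imports:
    - MeCab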
Build a Package
Let’s build your package. Run the following command in the recipe directory (the one containing meta.yaml and build.sh); it creates a conda package archived as a tar.bz2 file.
$ conda build .
After the build succeeds, you can install the package from the local build files.
$ conda install mecab --use-local
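As a quick smoke test of the freshly installed package (assuming the ipadic dictionary built above is picked up by default), you can tokenize a Japanese sentence from Python:

# Parse a sample sentence with the wrapper we just packaged
import MeCab

tagger = MeCab.Tagger("-Owakati")  # -Owakati outputs space-separated tokens
print(tagger.parse("すもももももももものうち"))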
Upload and Install Your Package from the Anaconda Repository
If you want to distribute your package, you can use the Anaconda repository. Uploading requires the anaconda client, which you can install via conda.
$ conda install anaconda-client
After creating your anaconda.org account, you can log in and upload your package.
$ anaconda login
# Input your anaconda.org username and password…
$ anaconda upload /opt/conda/conda-bld/linux-64/mecab-0.996-1.tar.bz2
Now, we can install the package from the Anaconda repository.
$ conda install -c chezou mecab
NOTE: Replace chezou with your anaconda.org username.
How to use the package with PySpark on CDSW
You can create a conda environment with a C extension as follows:
$ conda create --copy -q -y -c chezou -n mecab_env python=2 mecab
After creating the environment, the directory structure is as follows:
.conda/envs/mecab_env/
    bin/
    conda-meta/
    etc/
    include/
    lib/
    libexec/
    share/
    ssl/
If you add $PREFIX in your conda build script, conda installs the components under the conda environment appropriately. With the --copy option, conda copies dependencies instead of symlinking them; that’s why you can distribute the C extension.
You can distribute the package with a conda environment. For more detail, see:
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench
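As a rough sketch of that workflow (the exact settings are described in the linked post; the archive name and alias below are illustrative), you zip the environment and ship it to the executors:

$ cd ~/.conda/envs
$ zip -r ~/mecab_env.zip mecab_env

Then, in your PySpark session, point the executors at the shipped interpreter before creating the SparkSession (a minimal sketch, assuming a YARN cluster and that mecab_env.zip sits in the session's working directory):

# Ship the zipped conda environment to the executors and use its Python
import os
os.environ["PYSPARK_PYTHON"] = "./MECAB/mecab_env/bin/python"

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mecab-example")
         .config("spark.yarn.dist.archives", "mecab_env.zip#MECAB")
         .getOrCreate())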
Summary
With Cloudera Data Science Workbench, creating a conda recipe enables you to use C-extension-based Python packages on a Spark cluster without installing them on each node. Data scientists can run their favorite packages without modifying the cluster.
To learn more about the Data Science Workbench, visit our website.
Aki Ariga is a Field Data Scientist at Cloudera and a sparklyr contributor.