conda-forge and PyData's CentOS moment

Wed 20 April 2016

Summary: It’s finally time we worked as a community to create a reliable, community-governed repository of trusted Python binary package artifacts, just like Linux, R, Java, and many other open source tool ecosystems have already done. Enterprise-friendly platform distributions do play an important role, though, and I examine the nuances of this. I also talk about the new conda-forge project, which may offer a way forward.

Python environment management hell: a personal story

When I needed to get Python code using “primordial pandas” into production at AQR in 2008, the hardest part by far was the “installation problem”. At that time we didn’t have a centralized cluster framework for running all production jobs; many of the processes where Python needed to run lived on Windows desktops sitting under people’s desks. Later, in 2009, we bought a rack of machines and wrote a distributed task queue and scheduler (similar to Celery) to run jobs on them, but those machines still ran Windows.

I didn’t want my efforts to use more Python to be stymied, so I engaged in a stressful tango with our IT staff to get consistent Python environments with NumPy, SciPy, matplotlib, and other heavy packages deployed on dozens of Windows desktops. Part of the difficulty was Windows itself, but the bigger issue was that installing a “blessed” Python environment was not as simple as “hey, run this one weird batch script”. It required a lot of typing, clicking, and sometimes force-quitting python.exe (and a large helping of tears and sadness).

R, where I had also done a lot of work, had its act completely together by comparison. Install R, run install.packages(c(...)), and you were done. Like magic. With pip, especially on Windows, installing anything containing C extensions or depending on third-party libraries was unreliable.

Binary package installers: apt, brew, yum, conda, enstaller, and friends

One of the ways to solve packaging hell is to have a tool that can analyze package dependency graphs and then download and install the appropriate binary artifacts from one or more trusted channels. Strong emphasis here must be put on the word trusted.
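
To make the dependency-graph idea concrete, here is an illustrative Python sketch. It is a toy, not the algorithm conda, apt, or yum actually use (real solvers must also reconcile version constraints), and the package graph below is a simplified, hypothetical one:

```python
# Toy dependency resolver: a depth-first topological sort that yields an
# order in which every package is installed only after its dependencies.

def install_order(package, dependencies, ordered=None, seen=None):
    ordered = [] if ordered is None else ordered
    seen = set() if seen is None else seen
    for dep in dependencies.get(package, []):
        if dep not in seen:
            seen.add(dep)
            install_order(dep, dependencies, ordered, seen)
    if package not in ordered:
        ordered.append(package)
    return ordered

# Hypothetical, simplified dependency graph
deps = {
    "pandas": ["numpy", "python-dateutil", "pytz"],
    "python-dateutil": ["six"],
    "numpy": [],
    "pytz": [],
    "six": [],
}

print(install_order("pandas", deps))
# -> ['numpy', 'six', 'python-dateutil', 'pytz', 'pandas']
```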

Installing any software, whether open source or proprietary, is a risky proposition because you are purposefully executing code written by someone else. You may be giving that software access to your networks, data, secrets (security credentials, SSH keys), or even existing production processes. If you are a business working with sensitive data, it is justifiable to be extremely paranoid about the provenance of any x86 instruction executed on hardware under your control. The most sensitive of businesses (e.g. 3-letter US government agencies) may use air-gapping to protect against hackers or malicious code run from within their walls.

The way around the trust issue is to use a packaging tool, like apt or yum, along with a trusted binary artifact repository. Often the binary artifacts are provided by a company you trust not to modify the code maliciously (e.g. by inserting telemetry or malware) in binary builds of otherwise open source software. The packaging tool verifies an MD5 or SHA-1 hash of the artifacts to make sure that a man-in-the-middle has not tampered with the compiled code inside. This turns out not to be an unrealistic threat, as recently happened with Transmission.
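
As a minimal sketch of that verification step (the artifact name and digest below are hypothetical, and I use SHA-256 rather than the older MD5/SHA-1 digests):

```python
# Check a downloaded artifact against the digest published in the channel
# metadata; a mismatch indicates corruption or tampering.
import hashlib

def verify_artifact(path, expected_digest, algorithm="sha256"):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time
            h.update(chunk)
    if h.hexdigest() != expected_digest:
        raise ValueError("hash mismatch for %s: do not install" % path)

# Hypothetical usage:
# verify_artifact("numpy-1.11.0-py35_0.tar.bz2", "9f1c...")
```

Of course, this only protects you if the expected digest itself arrives over a trusted, authenticated channel, which is why signed repository metadata matters so much.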

Platform distributions: making open source work for the enterprise

As soon as a collection of related open source projects becomes viable as a solution to a major business problem (obvious examples: Hadoop, Linux, R, Python, Kafka, etc.), a common business strategy is to create a platform distribution, a big bundle of code, with the intent of making open source software more palatable for use by large companies.

The notion of a platform distribution is appealing to enterprise users for many reasons. The distribution provider is handling a bunch of annoying problems for you:

  • Assembling components and all of the correct versions which are known to work well together.

  • Packaging components together and making them easy to install.

  • Compiling binaries for multiple platforms and running the test suites for individual components to verify a valid build.

  • Performing integration testing to verify that the components work well together.

  • Providing tools for upgrading components over time.

Usually, the distribution gets its own umbrella version number, like “Red Hat Enterprise Linux 5” to indicate the “blessed” collection of software.

Making money from 100% open-source platforms is very difficult. One of the more successful models used is known as “open core” or (increasingly) “hybrid open source”, where anyone (individuals or businesses) can download and use the open source components for free, but you can buy services, support, indemnity, and valuable add-on proprietary software from the platform vendor.

One of the most important aspects of paid support for open source is having priority bug fixes and patched builds when something goes wrong in production. All software has bugs, and given the inherently anarchic nature of open source development, paying for peace of mind is something many big companies are willing to do.

The importance of community-governed package channels

In the late 1990s and early 2000s, there were many efforts to create community-led Linux distributions. Red Hat was founded in 1993, and as Red Hat and other vendors worked to commercialize open source Linux in the enterprise, I suspect the push for community-governed distributions became all the stronger. I won’t present a revisionist history of what motivated the creators of Debian, CentOS, and others, but the commercialization of Linux likely played some significant role.

In Linux, like other open source ecosystems, one of the most important components outside of the Linux kernel itself is the package repository. From a minimal kernel installation with networking and a package manager, you can install a complete system. Thus, the stewardship of the source and binary packages is extremely important, including:

  • Governance: in general, no single commercial entity can decide what packages can or cannot be installed.

  • Quality standards: packages are deemed of acceptable quality and suitable for a production environment.

  • Build verification: a binary’s build has been tested as appropriate for that package.

  • Trusted distribution: packages are signed so that package managers and users can trust the provenance of a binary build.

  • Dependency management: installing a package also installs its dependencies, which have been similarly verified and are known to work together.
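
As a toy illustration of the “trusted distribution” point above, here is a signing/verification sketch using the PyNaCl library (my choice purely for brevity; real distributions typically rely on GPG/OpenPGP keys and more elaborate key distribution, and the metadata string below is hypothetical):

```python
from nacl.signing import SigningKey

# The repository operator signs package metadata (e.g. name plus content
# hash) with a private key that never leaves the build infrastructure.
channel_key = SigningKey.generate()
signed = channel_key.sign(b"pandas-0.18.0 sha256=<digest>")

# Clients ship with the repository's *public* key and check provenance
# before installing; verify() raises BadSignatureError on tampering.
channel_key.verify_key.verify(signed)
```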

Each distribution may have different goals. CentOS, for example, aims for compatibility with Red Hat Enterprise Linux and accordingly uses the yum package management tool. Debian and Ubuntu, by contrast, don’t target compatibility with any enterprise distributions, but provide multiple flavors focusing respectively on long-term stability versus bleeding-edge innovation.

Community-led packaging and distribution is not unique to Linux: R, for example, has CRAN and the CRAN submission policy. It also has R-Forge to provide a community-governed service for posting project builds.

Python: Enterprise distributions and conda

Anaconda is a freely-available Python platform distribution created by the venture-backed start-up Continuum Analytics (folks I know quite well!) that has grown extremely popular in recent years. It plays a similar role to any other enterprise platform distribution based on open source software, just like Red Hat (Linux), Revolution Analytics (R), or Cloudera (Hadoop; disclaimer: this is where I work) have done.

Anaconda is not the only Python platform distribution, nor the first. Canopy (formerly EPD: Enthought Python Distribution) preceded it, and many of the same people have worked on both projects.

One of the sources of Anaconda’s success is that it makes cross-platform Python environment management much easier than it used to be, and Continuum provides trusted builds of multiple Python versions and all of the Python and non-Python library dependencies needed to assemble a complete Python data analysis environment.

At the heart of Anaconda is a new packaging tool, conda. I won’t go into the technical or open-source-politics reasons why Continuum created a new binary packaging tool for installation and dependency management for Anaconda. The bottom line is that it:

  • Provides an alternative to the pip / distribute / setuptools / virtualenv stack: one command-line tool for everything, and in general it just works

  • Works well for managing both Python and non-Python (e.g. C/C++) library dependencies

  • Works consistently on all major platforms (Windows, OS X, Linux)

  • Is freely available (the conda tool itself)

In practice, conda works extremely well. It’s the packaging tool I wish existed in 2008, and I’m glad we got it eventually!
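
As a minimal sketch of what this looks like in practice, scripted here from Python for reproducibility (it assumes a 2016-era conda is on your PATH; the environment name and package list are my own example):

```python
import subprocess

# One command creates an isolated environment; conda resolves and installs
# the Python interpreter and all Python and non-Python dependencies.
subprocess.run(
    ["conda", "create", "--yes", "--name", "analysis",
     "python=3.5", "numpy", "scipy", "pandas", "matplotlib"],
    check=True,
)

# Show everything that was installed, compiled C/C++ libraries included.
subprocess.run(["conda", "list", "--name", "analysis"], check=True)
```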

The downside of the Anaconda distribution itself is that, just like other enterprise platform companies, Continuum will inevitably be faced with decisions that pit the needs of enterprise customers (valuing stability and long-term support) against those of the community (valuing innovation and community governance). The most obvious source of conflict would be new packages whose inclusion in Anaconda requires a lot of work (from Continuum employees) to build, test, and package. Another would be providing updated builds for projects that matter only to a small but passionate subset of users (who may not be paying customers).

Community governance of the code is typically handled through organizations like the Apache Software Foundation. This is a whole different beast of a problem, and I’ll write about my thoughts on open source project governance some other time.

conda-forge: Community-led packaging using conda

The point of this is not to say “Anaconda is bad for the Python community”. Quite the contrary: Anaconda has been and continues to be good for the community, and conda is an excellent packaging tool. Bigger picture: acceptance of the Python data stack in the enterprise is existentially important for the continued success and growth of the ecosystem. Just like Linux needed RHEL and Hadoop needed CDH, PyData needed Anaconda to get where it is now. High quality projects like pandas and scikit-learn were not enough by themselves.

Having a community-governed package channel for conda and a community process for submitting, verifying, and storing signed project releases would be ideal. Additionally, there would need to be shared build and continuous integration infrastructure so that we don’t all have to install Visual Studio on our desktops to create reliable Windows builds.

Some will note that anaconda.org already exists, a product created by Continuum that is free to use, with paid plans for private builds. The problem with anaconda.org is that it is mainly a place to put binary artifacts, not a community process for vetting them.

Given all this, I was incredibly excited when I learned about conda-forge. If we all as a community can throw our weight behind this effort, I believe we can achieve:

  • A community-led process for posting trusted binary builds of projects, as CRAN has provided for R for many years

  • Easier integration testing amongst groups of projects, especially on Windows

  • Someday, a community-governed Anaconda-like Python distribution

It’s complicated, though. Doing this well will require a lot of money and people’s time. The R community has sustained its infrastructure over the years through the support of academic institutions, something the Python data stack has much less of. NumFOCUS may be able to provide a fiscal conduit for tax-deductible support of conda-forge.

Summary

I’ve long been envious of the community package management infrastructure that the R community has. But many enterprises prefer to use “blessed” platform distributions; R, for example, has Revolution R (now Microsoft R Open). So we need both, and I encourage readers to help where possible, through development effort or money, to help the nascent community effort (i.e. conda-forge) grow.