Mon 02 November 2015
I really enjoyed the cheeky blog post by my pal Rob Story.
Like many other data tool creators, I’ve been annoyed by the assorted “Python vs R” click-bait articles and Hacker News posts, written by folks who in all likelihood would not survive an interview panel with me on it.
The worst part of the superficial “R vs Python” articles is that they’re adding noise where there ought to be more signal about some of the real problems facing the data science community. Let me say some very brief words about my present perspective on this.
2015: It was the best of times, it was the worst of times
I am happy to say that we are no longer struggling (as much) to read CSV files. The funny part of this is that my friend, the man, the myth, the legend Hadley Wickham is literally working on a new CSV reader library for R. It goes to show how important it is to solve the small data problems well. I’ve spent enough time working on CSV parsing software myself to know that time invested in this domain is time well spent.
Since I’ve been at Cloudera for a little over a year now, I’ve gotten to learn more about the role of languages like Python and R inside many of the largest and most sophisticated data-driven companies in the world.
A morsel of truth
The truth is, for some use cases Python and R are having a more difficult time than most people realize.
First, let me point out that most people (i.e. more than 50% of people) do not have Big Data. If you can fit it on your laptop, it’s decidedly Small Data or Medium Data, and analyzing it effectively may require some cleverness on your part or new tools that have not yet been created.
Many companies have properly Big Data requiring HDFS, HBase, S3, or some other distributed, fault-tolerant storage (hey Kudu!). Sometimes big data is lots and lots of medium-sized datasets. While it’s true that you could analyze these on your laptop, you can get the answers faster and avoid time-consuming data transfer by analyzing them in situ on large clusters. This is especially true if you are developing internal applications with tight SLAs. Hence the Hadoop ecosystem, YARN, and the whole shebang.
Python and R have a mix of proprietary and open source big data system support, but it’s not the 5-star smorgasbord that you have with the JVM ecosystem. Some people who aren’t very knowledgeable have chalked this up to Python or R being “slow”. This is pure silliness.
For example, Spark, the new data processing engine that’s replacing MapReduce (but not the Hadoop Distributed File System, HDFS), has Python and R APIs (PySpark and SparkR, respectively), but they suffer from significant interoperability issues, which may manifest as performance and memory-use problems.
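To make that concrete, here is a minimal PySpark sketch (the app name and element count are arbitrary, and timings will vary). The point is the data flow: the RDD’s contents are managed by the JVM, so applying even a trivial Python function means shuttling data between the JVM and Python worker processes, with serialization at each hop:

```python
# pyspark_interop_sketch.py -- illustrative only, not a benchmark
from pyspark import SparkContext

sc = SparkContext("local[*]", "interop-cost-sketch")

# PySpark stores this RDD's data under JVM management (as serialized bytes).
rdd = sc.parallelize(range(1000000))

# The lambda runs in Python worker processes, so every element must cross
# the JVM <-> Python boundary, paying serialization cost in both directions.
total = rdd.map(lambda x: x * 2).sum()
print(total)

sc.stop()
```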
Playing nice together
As I’ve discussed in recent talks, any data processing engine that allows you to extend it with user-defined code written in a “foreign language” like Python or R has to solve at least these three essential problems, each illustrated with a small sketch after the list:
Data movement or access: making runtime data accessible in a form consumable by Python, say. Unfortunately, this often requires expensive serialization or deserialization and may dominate the system runtime. Serialization costs can be avoided by carefully creating shared byte-level memory layouts, but doing this requires a lot of experienced and well-compensated people to agree to make major engineering investments for the greater good.
Vectorized computation: enabling interpreted languages like Python or R to amortize interpreter overhead by calling into fast, array-oriented compiled code (e.g. NumPy or pandas operations). Most libraries in these languages also expect to work with array / vector values rather than scalar values, so if you want to use your favorite Python or R packages, you need this feature.
IPC overhead: the low-level mechanics of invoking an external function. This might involve sending a brief message with a few curt instructions over a UNIX socket.
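A back-of-the-envelope sketch of the first problem, using only the standard library and NumPy (the array size is arbitrary): serializing values element by element costs real time, while handing over a shared byte-level memory layout is nearly free.

```python
# Sketch of the data-movement cost; timings will vary by machine.
import pickle
import time

import numpy as np

values = np.random.rand(1000000)

# Element-wise serialization: the kind of cost that can dominate runtime.
start = time.time()
blob = pickle.dumps(values.tolist())
restored = pickle.loads(blob)
print("serialize round trip: %.3fs" % (time.time() - start))

# Byte-level handoff: reinterpret the same buffer without copying anything.
start = time.time()
view = np.frombuffer(values.data, dtype=values.dtype)
print("buffer handoff:       %.6fs" % (time.time() - start))
```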
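The second problem, in miniature: a Python-level loop pays interpreter overhead once per element, while a single NumPy call amortizes that overhead across the whole array.

```python
# Sketch of why vectorized computation matters; timings will vary by machine.
import time

import numpy as np

values = np.random.rand(1000000)

# Scalar loop: interpreter overhead on every single element.
start = time.time()
total = 0.0
for v in values:
    total += v * 2.0
print("scalar loop: %.3fs" % (time.time() - start))

# Vectorized: one call into compiled, array-oriented code.
start = time.time()
total = float((values * 2.0).sum())
print("vectorized:  %.4fs" % (time.time() - start))
```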
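And the third: even a tiny instruction message has a fixed round-trip cost. Here a socket pair inside a single process stands in, purely hypothetically, for the channel between an engine and an external interpreter.

```python
# Sketch of per-call IPC overhead; socketpair() requires POSIX-style sockets.
import socket
import time

engine, worker = socket.socketpair()

start = time.time()
for _ in range(10000):
    engine.sendall(b"call f(x)")   # brief message with a few curt instructions
    worker.recv(64)                # "worker" reads the request...
    worker.sendall(b"ok")          # ...and replies
    engine.recv(64)
elapsed = time.time() - start
print("10000 round trips: %.3fs (%.1f us each)" % (elapsed, elapsed / 10000 * 1e6))

engine.close()
worker.close()
```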
Any system that doesn’t do a good job of solving these problems will seriously struggle in many large-scale workloads involving user-created code. It will also have difficulty leveraging the “walled garden” ecosystems of data science languages like Python and R.
As an example, the response to AWS adding Python UDF support to Redshift has been muted for precisely these reasons.
The future will get here, eventually
At present I am personally focused on creating software to address these interoperability challenges (“Mr. Gorbachev, tear down this wall!”) facing not only Python but also R and any other programming language that wants to participate. My new project, Ibis, is beginning to address the Python side of the problem, but significant work remains to enable “foreign” languages in general to interoperate with big data compute engines.
Some people think of me as a Python zealot, but I’m really not. I think Python is a great choice for many use cases, so I continue to build software for its users as an effective way to solve real-world problems. I also think it’s good for users to have choices. No language is inherently “good” or “bad”; it’s that kind of moral absolutism that needs to be eradicated from our collective consciousness. As toolmakers, it’s our job to build the software and demonstrate its usefulness in the real world.
I’m looking forward to sharing my progress on these ongoing efforts.