Compiling DataFrame code is harder than it looks

Wed 16 March 2016

Many people have asked me about the proliferation of DataFrame APIs like Spark DataFrames, Ibis, Blaze, and others.

As it turns out, executing pandas-like code in a scalable environment is a difficult compiler engineering problem: composable, imperative Python or R code must be translated into a SQL or Spark/MapReduce-like representation. I show an example of what I mean and some work that I’ve done to create a better “pandas compiler” with Ibis.

Recently, I did an exploration of data access performance across various pandas data access layers, plus Impala’s and Spark’s Python data connectors. You can check out the notebook for more detail on that.

I used a 1 million row DataFrame containing a group column with 10 distinct members:
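Reconstructed here as a sketch (the column names and random data are illustrative, not necessarily what the benchmark used):

```python
import numpy as np
import pandas as pd

N = 1_000_000
K = 10  # number of distinct group members

rng = np.random.RandomState(42)
df = pd.DataFrame({
    'group': rng.randint(0, K, size=N),  # group column with 10 distinct values
    'value': rng.randn(N),               # the data being aggregated
})
```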

While working on Ibis recently, I looked at the pandas code for comparing the averages of two groups:
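The pandas version is a pair of filtered aggregates combined with ordinary arithmetic. A sketch, assuming hypothetical group and value columns:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame({
    'group': rng.randint(0, 10, size=1_000_000),
    'value': rng.randn(1_000_000),
})

# percent difference between the means of two groups,
# written as composable, imperative pandas expressions
pct_diff = df[df.group == 0].value.mean() / df[df.group == 1].value.mean() - 1
```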

So then I looked at how to make this work with Ibis in a scalable way inside a SQL-on-Hadoop engine like Impala. After writing the data to HDFS in Parquet format, I eventually improved the Ibis compiler so that the identical code works:

This isn’t that easy to do in a single query. The SQL that gets generated by Ibis under the hood is:
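A single query needs two aggregate subqueries joined back together. The generated SQL has roughly this shape (a hedged reconstruction, not Ibis’s exact output; table and column names are hypothetical):

```sql
SELECT t0.`mean_0` / t1.`mean_1` - 1 AS `pct_diff`
FROM (
  SELECT avg(`value`) AS `mean_0`
  FROM groups
  WHERE `group` = 0
) t0
  CROSS JOIN (
    SELECT avg(`value`) AS `mean_1`
    FROM groups
    WHERE `group` = 1
  ) t1
```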

Looking more closely at Spark DataFrames

After doing some of this arduous static analysis of pandas expressions to get this working, I was curious if Spark DataFrames were solving these problems in a way I could learn from.

First off, I found that executing simple scalar aggregates with Spark DataFrames requires forming an explicit SQL-like aggregation:

But computing the percent difference I was after (from above) did not work as I expected, nor did it raise an error telling me I was doing something wrong.

I also tried some other variations, none of which behaved the way I wanted.

To get the answer you’re looking for, you would either need to form the cross-join yourself or run multiple queries (perhaps calling toPandas on the results), neither of which is ideal from a usability perspective.

Afterthoughts

I’ve approached Ibis development from the perspective of pandas users, working to enable as much of the pandas API as possible to behave as expected in a scalable Hadoop environment. I also wanted to make sure that all (or nearly all) SQL could be portably rewritten in Ibis using the pandas-like API style.

In my view, Spark DataFrames have approached the problem differently: providing a DataFrame-like API for Spark SQL. But at the moment, you are still writing SQL.

Through the Apache Arrow project, we’ll be working to enable not only code that can be compiled to SQL, but also arbitrary user-defined Python code to be run in a performant way inside systems like Spark and Impala.