Tue 27 December 2016
2017 is shaping up to be an exciting year in Python data development. In this post I’ll give you a flavor of what to expect from my end. In follow-up blog posts, I plan to go into more depth about how all the pieces fit together. I have been a bit delinquent in blogging in 2016, since my hands have been quite full doing development and working on the 2nd edition of Python for Data Analysis. I am going to do my best to write more in 2017.
New position
After a productive 2 years with Cloudera, a few months ago I transitioned to a software architect role at Two Sigma Investments. Since leaving the quant finance world in 2010, I have observed the profound impact that open source tools have had on the financial industry. It was refreshing to find that forward-thinking institutions like Two Sigma have increased their engagement with the open source ecosystem, releasing internally-developed tools and contributing work back to established OSS projects.
In my new role, I am working to solve data analysis problems that are well-aligned with the open source software I’ve been developing over the last decade:
- User-friendly API design
- High performance IO and data access
- Fast and expressive computational engines
With regard to open source and business, there are a couple of interesting trends happening right now, in my view:
- Companies not participating in open source (as users and/or developers) are getting left behind.
- Many of the best software engineers won’t work for a company that forbids them from working on open source projects (I certainly would not).
pandas 2.0
Over the last year, we have been publicly discussing a plan to improve the internals of pandas to better suit the needs of today’s data problems. I spoke a bit about this in a recent talk.
Generally speaking, the goals of pandas 2.0 are:
- Fixing design warts and accumulated technical debt from the last 9 years
- Faster single-threaded performance
- More efficient memory management, less memory usage
- Improved performance and scalability through true multithreaded execution
My goal is to deliver the same quality pandas user experience on 10x as much data. pandas works well on 1GB of data, but less well on 10GB. This has to change for pandas to remain a relevant tool in the future.
In the meantime, the pandas team is toiling away with a major 0.20 release in the works, followed by pandas 1.0, which will mark a period of API stability while we work on refactoring the pandas core.
Apache Arrow
Last February, we announced Apache Arrow, a collaboration amongst open source data projects to establish a standard for high-performance in-memory columnar data structures and IO.
Arrow development has been proceeding well since then. We recently reached a major development milestone by achieving full binary compatibility between the initial Java and C++ library implementations. This is a critical step in delivering high performance IO between the JVM (and systems like Spark) and C/C++ or Python-based systems.
One of my goals in building Arrow’s C++ libraries is to facilitate low-overhead columnar memory management and high performance, multithreaded IO in pandas 2.0. When I can, I will write some posts about these tools and how they help.
While pandas 2.0 is in development, we’ll continue to do work to make Arrow a high speed “data conduit” for data en route to or from the current production version of pandas.
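For example, here is a minimal sketch of what that conduit looks like today, moving a pandas DataFrame into Arrow’s columnar format and back. The data here is made up, and the pyarrow conversion API shown may evolve as the libraries mature:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'ints': [1, 2, 3], 'strs': ['a', 'b', 'c']})

# convert the pandas DataFrame into an Arrow Table (columnar memory)
table = pa.Table.from_pandas(df)

# the Table can be handed to any Arrow-aware system, then converted back
df_again = table.to_pandas()
```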
Apache Parquet for Python
In 2016, we’ve worked to create a production-grade C++ library for reading and writing the Apache Parquet file format. En route, I became a committer and then a PMC member of Apache Parquet. Uwe Korn, from Blue Yonder, has also become a Parquet committer.
Since Parquet is a columnar file format designed for small size and IO efficiency, Arrow, as an in-memory columnar container, makes an ideal transport layer to and from Parquet.
To read and write Parquet files from Python using Arrow and parquet-cpp, you can install pyarrow from conda-forge:
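```
# pyarrow plus the Arrow and Parquet C++ libraries; exact package names may vary by release
conda install pyarrow arrow-cpp parquet-cpp -c conda-forge
```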
Then, the code to read looks something like this:
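```python
import pyarrow.parquet as pq

# read a Parquet file into an Arrow Table ('example.parquet' is a placeholder path)
table = pq.read_table('example.parquet')

# convert to a pandas DataFrame for analysis
df = table.to_pandas()
```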
We’ll be writing more documentation and blog posts about Parquet in the coming months.
Continuum Analytics recently started developing a separate Parquet implementation for Python that uses Numba for accelerating encoding and decoding routines. I’m glad to see more Python developers working on these problems.
Feather file format
Earlier this year, I worked with Hadley Wickham to design and deliver the Feather file format for R and Python. Feather uses Apache Arrow’s columnar representation and sports a simple metadata specification that can handle the main R and Python data types.
As one might expect, over time quite a bit of code overlap has developed between Feather’s C++ and Python components and Apache Arrow’s respective C++ and Python libraries. To address this, I’m planning to merge the Feather implementation into the Arrow codebase, which will enable me to provide better performance and new features to Feather users.
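For the unfamiliar, here is a minimal sketch of Feather usage from Python; the DataFrame and file path are just illustrative:

```python
import pandas as pd
import feather

df = pd.DataFrame({'ints': [1, 2, 3], 'floats': [1.5, 2.5, 3.5]})

# write a DataFrame to disk and read it back
feather.write_dataframe(df, 'example.feather')
df2 = feather.read_dataframe('example.feather')
```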
Improving PySpark
PySpark, the Python API for Apache Spark, has well-known performance issues (compared with its Scala equivalents) around data serialization (Spark’s DataFrame.toPandas and sqlContext.createDataFrame) and, relatedly, UDF evaluation (rdd.map, rdd.mapPartitions, or sqlContext.registerFunction).
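To make this concrete, here is an illustrative sketch of the two boundary crossings where that serialization cost is paid today; the session setup and data are made up for the example:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('serialization-demo').getOrCreate()

pdf = pd.DataFrame({'x': range(100000)})

# pandas -> Spark: rows are currently converted one by one into Python objects
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: results come back through the JVM as pickled Python rows
result = sdf.groupBy().sum().toPandas()
```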
In line with the above work on fast columnar IO for Python and the JVM using Arrow, some of my Two Sigma colleagues and I are collaborating with IBM and the Spark community to accelerate PySpark using these new technologies. If you would like to get involved with this, please let me know.
The PySpark-Arrow work is progressing well, and we should have some interesting updates to share in February at the Spark Summit East conference.
Update on Ibis
Last, but not least, I am still maintaining Ibis and helping its users however I can. I’m proud of the level of deep SQL semantic support that Ibis provides its users. You can write very complex queries with pandas-like Python code that is composable and reusable.
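As an illustration of that style, here is a small sketch against a hypothetical SQLite database; the database path, table, and column names are all made up:

```python
import ibis

con = ibis.sqlite.connect('purchases.db')  # hypothetical database
purchases = con.table('purchases')

# a named, reusable aggregate expression
total = purchases.amount.sum().name('total_spend')

# compose grouping, sorting, and limiting before any SQL runs
expr = (purchases.group_by('customer_id')
                 .aggregate([total])
                 .sort_by(ibis.desc('total_spend'))
                 .limit(10))

df = expr.execute()  # returns a pandas DataFrame
```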
As pandas 2.0 progresses, I am interested in building an in-memory backend for Ibis. I had thought about doing this in the past, but decided it would be better to wait.