Feather: it's about metadata

Tue 26 April 2016

Summary: Feather’s good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages.

Feather performance

Several people asked me about Matt Dowle’s blog post about fast CSV writing. I say: bravo!

The dirty secret with Feather’s performance is that neither Hadley nor I spent much effort on performance optimization. Through the project’s complete git history, you’ll be hard-pressed to find anything relating to performance tuning.

Due to Feather’s design (using Arrow’s columnar memory layout and simple metadata using Google’s Flatbuffers library), there was simply no way for it to be slow (ruling out some gross programming error). I was pleased, but not too surprised, to find that the first feather.read_dataframe call in Python nearly saturated my laptop’s IO bandwidth.
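To make the design argument concrete, here is a minimal sketch of a columnar file with a small metadata footer. It is not Feather's actual format: JSON stands in for Flatbuffers, the standard library's array module stands in for Arrow buffers, and write_table, read_table, and the footer placement are hypothetical choices made for this illustration. The point is only that reading becomes a bulk copy per column, with no parsing and no type guessing.

```python
import io
import json
import struct
from array import array


def write_table(stream, columns):
    """Write each column as one raw buffer, then a metadata footer.

    `columns` maps column name -> array.array. Buffers use the machine's
    native byte order, so this toy file is only portable between machines
    that share it.
    """
    meta = []
    for name, col in columns.items():
        offset = stream.tell()
        data = col.tobytes()
        stream.write(data)
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": offset, "nbytes": len(data)})
    footer = json.dumps(meta).encode("utf-8")
    stream.write(footer)
    # The last 4 bytes record the footer length so a reader can find it.
    stream.write(struct.pack("<I", len(footer)))


def read_table(stream):
    """Read the footer first, then copy each column buffer back verbatim."""
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(stream.read(footer_len).decode("utf-8"))
    columns = {}
    for entry in meta:
        stream.seek(entry["offset"])
        col = array(entry["typecode"])  # the type comes from metadata
        col.frombytes(stream.read(entry["nbytes"]))
        columns[entry["name"]] = col
    return columns


table = {"id": array("q", [1, 2, 3]), "price": array("d", [0.5, 1.25, 2.0])}
buf = io.BytesIO()
write_table(buf, table)
restored = read_table(buf)
```

Because the types travel with the data, the reader's work is dominated by IO rather than by computation, which is the sense in which a format like this has little room to be slow.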

Feather does perform a limited amount of conversion between the Arrow memory layout and R or Python data frames. As a result, there are several performance optimization opportunities available:

  Multi-threaded conversion to/from Arrow (convert multiple columns simultaneously)

  Pipeline reads or writes: perform disk IO concurrent with conversion to or from data frames.

We haven’t spent any energy on this, because the performance in the project’s first draft was good enough. Patches welcome, of course.
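For anyone curious what the first of those ideas might look like, here is a rough sketch of converting several columns at once with a thread pool. It is an illustration under assumptions, not code from Feather: convert_column and convert_table are hypothetical names, and raw byte buffers from the array module stand in for Arrow columns. Note that in standard CPython, pure-Python work like this is serialized by the global interpreter lock, so a real speedup would need the conversion to run in native code that releases it.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def convert_column(typecode, raw):
    """Decode one raw column buffer into a Python list."""
    col = array(typecode)
    col.frombytes(raw)
    return col.tolist()


def convert_table(raw_columns, max_workers=4):
    """Convert all columns concurrently.

    `raw_columns` maps column name -> (typecode, bytes). Each column is
    independent of the others, so each conversion is its own task.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(convert_column, typecode, raw)
                   for name, (typecode, raw) in raw_columns.items()}
        return {name: future.result() for name, future in futures.items()}


raw_columns = {
    "id": ("q", array("q", [1, 2, 3]).tobytes()),
    "price": ("d", array("d", [0.5, 1.25, 2.0]).tobytes()),
}
converted = convert_table(raw_columns)
```

Pipelining would follow the same pattern, with one task reading the next column from disk while another converts the previous one.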

Metadata and metadata-free file formats

One of my personal goals in working on Feather was to start a broader discussion about what I call metadata-free file formats, with CSV being the most popular one. By “metadata-free”, I mean that in storing data in these formats, you lose type information (for example: factor levels) that may be impossible to recover. You may also lose numerical precision. When you read the files, you have to perform expensive type inference to “guess” the data types of columns. As you can imagine, there are a large number of esoteric edge cases for any type inference engine for CSVs.
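Here is a small illustration of that loss, using only Python's standard csv module. The table, the declared levels, and the infer helper are made up for the example, and the inference rule is deliberately naive. Real CSV readers are far more careful, but they face the same underlying problem.

```python
import csv
import io

# Hypothetical table: an integer id, a zip code stored as text, and a
# categorical column whose declared levels include one that is unused.
declared_levels = ["low", "medium", "high"]
rows = [
    {"id": 1, "zip": "02134", "grade": "low"},
    {"id": 2, "zip": "90210", "grade": "high"},
]

# Write to CSV: only text survives, with no record of the column types.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "zip", "grade"])
writer.writeheader()
writer.writerows(rows)

# Read back: every value is now a string.
buf.seek(0)
raw = list(csv.DictReader(buf))


def infer(text):
    """Naive type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


guessed = [{key: infer(value) for key, value in row.items()} for row in raw]

# The best a reader can do for the categorical column is collect the values
# it sees; the unused level and the original ordering are gone.
recovered_levels = sorted({row["grade"] for row in raw})
```

After the round trip, the zip code "02134" is guessed to be the integer 2134, and the "medium" level has disappeared, because nothing in the file says the column was categorical or that the zip codes were text.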

That being said, metadata-free formats like CSV are still unfortunately a lowest common denominator for data exchange in many systems. In both the Python and R communities, we’ve spent an extraordinary amount of time writing fast code for parsing and doing type inference on all of the bizarre delimited text files generated in the real world. In my opinion, this has been time well spent.

I often tell people that one of the things that made pandas successful early on was that pandas.read_csv usually just worked and was fairly fast, and that wasn’t true of any other Python CSV readers at the time.

Code sharing

Another goal of Feather was to share a common C++ library between the Python and R implementations. As Python and R libraries nowadays are often wrappers around C, C++, and Fortran code, it has bummed me out that so little of the native code used in Python and R packages is reusable. I’d like to see more sharing of user-invisible compiled C or C++ code.

Code sharing is hard, though, because much of it relates to “proprietary” data structures and memory layouts. This is why Apache Arrow is so important: it gives the community a common table / data frame memory layout on which we can collaborate more easily.

Summary

Feather’s priorities, in order, have been:

  1. Interoperable metadata and a shared memory layout

  2. Shared code

  3. Performance

I look forward to more interoperability and more code sharing in the Python and R communities. If we can also make things fast, of course, let’s do that, too.