** Thu 21 April 2016
Summary: I explain the relationship between Feather and Apache Arrow in more technical detail.
Memory representation and file formats
I was recently asked to explain the difference between Apache Arrow (providing a standard in-memory columnar memory representation) and Feather (a file format using Apache Arrow).
Before going deeper into Feather and Arrow, let’s look at how memory representations (these are also probably more commonly called data structures) can lead to file formats.
Let’s take a simplified array data structure in C:
This array_t
is an in-memory data structure. I haven’t told you how to put
it in a file yet.
Now suppose we have multiple arrays:
Now you want to write the data to a file. You have a memory representation for this data, but no file format. So now you have a bunch of decisions to make:
-
How do you lay out the bytes contained in the
array_t
structs in the file? -
How to you encode the metadata (a description that allows code to reconstruct
my_data
)? -
Where do you put the metadata in the file?
-
How do you verify that you have a valid file?
-
What happens when the metadata needs to evolve (file versioning)?
So you have a bunch of work to do if you want to go from data structures to a full blown file format.
Feather: the devil is in the metadata
From Feather’s specification document, the structure of a file currently looks like this:
The ARRAY
blobs are data that originated from in-memory Arrow data
structures. Here is what the data structure representing a single primitive
array looks like. There’s some C++ memory ownership stuff you can ignore for
this blog post.
We made an arbitrary decision of how to arrange the raw data (the null and
values buffers), but the bigger project was coming up with the metadata. We
used an open source project from Google called Flatbuffers to do this. In
Flatbuffers, the information about PrimitiveArray
looks like this:
When it comes down to it, most of the effort of creating Feather was in defining a metadata specification that works for both Python and R, and implementing code for reading and writing the metadata from in-memory data structures. Copying bytes into R or Python in-memory arrays is one of the simplest parts.