Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture enables connectivity to multiple datastores.
To ring in the new year the Drill team knocked out a new 1.15.0 release with a cadre of new functionality including:
One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL |
GH] and make some accommodations for some of these new features. |
Ushering In A New Order
Drill REST API queries return a "columns"
field and "metadata"
field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite
to parse the results and there’s no support for 64-bit integers in jsonlite
(more on this later).
We’ll use the example from DRILL-6847 and use the example provided by Charles Givre in his Jira issue since it will let me demonstrate more of that “mostly” comment and show off another new feature:
1 |
|
The above is (logically):
1 |
|
What do we get when we take a preview of the result?
1 |
|
1 |
|
The first thing sergeant
users will notice is proper column order (before it just returned the columns in the order they came back in the JSON rows[]
structure). The second thing is that we didn’t get integer64
s back. Instead, we got double
s plus an information warning about why and what you can do about it. Said warning only displays once per-session and can be silenced with the option sergeant.bigint.warnonce
. i.e. just put:
1 |
|
in your script or ~/.Rprofile
and you won’t hear from it again.
The as.integer64()
we used is not from the bit64
package but an internal sergeant
package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT )
.
You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr
at the C++-level to provide direct in-package support for BIGINTs as well.
Better Error Messages
Drill query errors that the sergeant
package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:
1 |
|
1 |
|
As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig in to the gnarly details of complex query issues. As you work with this and find edge cases I missed for messages, drop an issue on your social-coding site of choice.
SUPPORT ALL THE PCAPs!
Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng
format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] },
to the "formats"
section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).
We’ll use one of the Wireshark example captures to demonstrate:
1 |
|
More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.
Drill Metadata As Data
Drill has provided ways to lookup Drill operational information as actual tables but the Drill team has added support for even more metadata-as-data queries.
First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:
1 |
|
The Drill system options
table now has full descriptions for the options and also provides a new table that knows about all of Drills functions and all your custom UDFs. drill_opts()
and drill_functions()
return a data frame of all this info and have an optional browse
parameter which, if set to TRUE
, will show a DT
interactive data table for them. I find this especially handy when I forget something like regexp_like
syntax (I use alot of back-ends and many are wildly different) and can now do this:
FIN
Keep on the lookout for the rapidjsonr
/BIGINT integration and more new features of the sergeant
package. NOTE: The better error messages have been ported over to the sergeant.caffeinated
package (the RJDBC interface) and the other niceties will make their way into that package soon as well.
So, make sure you’re using the 0.8.0
GL / GH, kick the tyres, file issues where you’re most comfortable working.
May your queries all be optimized and results sets complete in the new year!
Related