On the Road to 0.8.0 — Some Additional New Features Coming in the sergeant Package

It was probably not difficult to discern from my previous Drill-themed post that Iâ€™m fairly excited about the Apache Drill 1.15.0 release. Iâ€™ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.

Drill Docker Goodness

Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single node standalone mode so youâ€™re not going to be running this in â€œproductionâ€� (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you havenâ€™t already played with it since you donâ€™t have do do much except run the Docker image.

Iâ€™ve simplified this even more thanks to @rgfitzjohnâ€™s most excellent stevedore package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate. The new drill_up() function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.

Just running the vanilla image isnâ€™t enough since your goal is likely to do more than work with the built-in cp data source. The default container launch scenario also doesnâ€™t hook up any local filesystem paths to the container so you really canâ€™t do much other than cp-oriented queries. Rather than make you do all the work of figuring out how to machinate Docker command line arguments and manually configure a workspace that points to a local filesystem area in the Drill web admin GUI the drill_up() function provides a data_dir argument (that defaults to the getwd() where you are in your R session) which will then auto-wire up that path into the container and create a dfs.d workspace which auto-points to it for you. Hereâ€™s a sample execution:

library(sergeant)
library(tidyverse)

dr <- drill_up(data_dir = "~/Data")
## Drill container started. Waiting for the service to become active (this may take up to 30s).
## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0

You can use dc$stop() to stop the container or use the printed container id to do it from the command line.

Weâ€™ll use this containerized Drill instance with the next feature but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)

Taking the Tedium out of CTAS

@dseverski, an intrepid R, Drill & sergeant user noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.

Prior to 1.14.0 sergeant had no way to accurately, precisely tell data types of the columns coming back since the REST API didnâ€™t provide them (as noted in the previous Drill post). Now, it did rely on the JSON types to create the initial data frames but id also did something *kinda horribad: it ran readr::type_convert() on the result sets . Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didnâ€™t seem to happen).

With 1.15.0, the readr::type_convert() crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:

# using the Drill container we just started above

write_csv(nycflights13::flights, "~/Data/flights.csvh")

con <- src_drill("localhost")

tbl(con, "dfs.d.`flights.csvh`") %>% 
 glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013â€¦
## $ month "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "â€¦
## $ day "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "â€¦
## $ dep_time "517", "533", "542", "544", "554", "554", "555", "557", "557"â€¦
## $ sched_dep_time "515", "529", "540", "545", "600", "558", "600", "600", "600"â€¦
## $ dep_delay "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"â€¦
## $ arr_time "830", "850", "923", "1004", "812", "740", "913", "709", "838â€¦
## $ sched_arr_time "819", "830", "850", "1022", "837", "728", "854", "723", "846â€¦
## $ arr_delay "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",â€¦
## $ carrier "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "â€¦
## $ flight "1545", "1714", "1141", "725", "461", "1696", "507", "5708", â€¦
## $ tailnum "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "â€¦
## $ origin "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"â€¦
## $ dest "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"â€¦
## $ air_time "227", "227", "160", "183", "116", "150", "158", "53", "140",â€¦
## $ distance "1400", "1416", "1089", "1576", "762", "719", "1065", "229", â€¦
## $ hour "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "â€¦
## $ minute "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "â€¦
## $ time_hour "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1â€¦

So the package does what it finally should have been doing all along. But, as noted, thatâ€™s not great if you just wanted to quickly work with a directory of CSV files. In theory, youâ€™re supposed to use Drillâ€™s CREATE TABLE AS then do a bunch of CASTS and TO_s to get proper data types. But who has time for that?

David had a stellar idea, might sergeant be able to automagically create CTAS statements from a query?. Yes. Yes it just might be able to do that with the new ctas_profile() function.

Letâ€™s pipe the previous tbl() into ctas_profile() and see what we get:

tbl(con, "dfs.d.`flights.csvh`") %>% 
 ctas_profile() %>% 
 cat()

-- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 **

CREATE TABLE CHANGE____ME AS
SELECT
 CAST(`year` AS DOUBLE) AS `year`,
 CAST(`month` AS DOUBLE) AS `month`,
 CAST(`day` AS DOUBLE) AS `day`,
 CAST(`dep_time` AS DOUBLE) AS `dep_time`,
 CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
 CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
 CAST(`arr_time` AS DOUBLE) AS `arr_time`,
 CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
 CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
 CAST(`carrier` AS VARCHAR) AS `carrier`,
 CAST(`flight` AS DOUBLE) AS `flight`,
 CAST(`tailnum` AS VARCHAR) AS `tailnum`,
 CAST(`origin` AS VARCHAR) AS `origin`,
 CAST(`dest` AS VARCHAR) AS `dest`,
 CAST(`air_time` AS DOUBLE) AS `air_time`,
 CAST(`distance` AS DOUBLE) AS `distance`,
 CAST(`hour` AS DOUBLE) AS `hour`,
 CAST(`minute` AS DOUBLE) AS `minute`,
 TO_TIMESTAMP(`time_hour`, 'FORMATSTRING') AS `time_hour` -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z]. 
FROM (SELECT * FROM dfs.d.`flights.csvh`)


-- TIMESTAMP and/or DATE columns were detected.
Drill's date/time format string reference can be found at:
--
-- 

Thereâ€™s a parameter for the new table name which will cause the CHANGE____ME to go away and when the function finds TIMESTAMP or DATE fields it knows to switch to their TO_ cousins and gives sample data with a reminder that you need to make a format string (Iâ€™ll eventually auto-generate them unless someone PRs it first). And, since nodoby but Java programmers remember Joda format strings (theyâ€™re different than what youâ€™re used to) it provides a handy link to them if it detects the presence of those column types.

Now, we donâ€™t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl():

tbl(con, sql("
SELECT
 CAST(`year` AS DOUBLE) AS `year`,
 CAST(`month` AS DOUBLE) AS `month`,
 CAST(`day` AS DOUBLE) AS `day`,
 CAST(`dep_time` AS DOUBLE) AS `dep_time`,
 CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
 CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
 CAST(`arr_time` AS DOUBLE) AS `arr_time`,
 CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
 CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
 CAST(`carrier` AS VARCHAR) AS `carrier`,
 CAST(`flight` AS DOUBLE) AS `flight`,
 CAST(`tailnum` AS VARCHAR) AS `tailnum`,
 CAST(`origin` AS VARCHAR) AS `origin`,
 CAST(`dest` AS VARCHAR) AS `dest`,
 CAST(`air_time` AS DOUBLE) AS `air_time`,
 CAST(`distance` AS DOUBLE) AS `distance`,
 CAST(`hour` AS DOUBLE) AS `hour`,
 CAST(`minute` AS DOUBLE) AS `minute`,
 TO_TIMESTAMP(`time_hour`, 'yyyy-MM-dd''T''HH:mm:ssZ') AS `time_hour` -- [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.`flights.csvh`)
")) %>% 
 glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2â€¦
## $ month 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1â€¦
## $ day 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1â€¦
## $ dep_time 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5â€¦
## $ sched_dep_time 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6â€¦
## $ dep_delay 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -â€¦
## $ arr_time 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, â€¦
## $ sched_arr_time 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, â€¦
## $ arr_delay 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,â€¦
## $ carrier "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "â€¦
## $ flight 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,â€¦
## $ tailnum "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "â€¦
## $ origin "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"â€¦
## $ dest "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"â€¦
## $ air_time 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34â€¦
## $ distance 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, â€¦
## $ hour 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6â€¦
## $ minute 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0â€¦
## $ time_hour 2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0â€¦

Ahhhhâ€¦ Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string so add a comment if you need the actual string.)

FIN

As you can see questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though yâ€™all still seem to be stuck on GH ).

NOTE: Iâ€™ll be subbing out most install_github() links in READMEs and future blog posts for install_git() counterparts pointing to my sr.ht repos (as I co-locate/migrate them there).

You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0").

Related