implyr: R Interface for Apache Impala

New R package implyr enables R users to query Impala using dplyr.

Apache Impala (incubating) enables low-latency interactive SQL queries on data stored in HDFS, Amazon S3, Apache Kudu, and Apache HBase. With the availability of the R package implyr on CRAN and GitHub, it’s now possible to query Impala from R using the popular package dplyr.

dplyr provides a grammar of data manipulation, consisting of set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

Installing implyr

You can install the latest release of implyr from CRAN:

  install.packages(“implyr”)

Or you can install the current development version from GitHub:

  devtools::install_github(“ianmcook/implyr”)

implyr requires a connection package and a driver to provide connectivity to Impala. For superior performance and compatibility, I recommend using the package odbc with the latest Cloudera ODBC Driver for Impala. JDBC is also supported, but with some compatibility problems. For more details about connectivity to Impala, see the implyr README.

After the packages are installed and the Impala ODBC driver is installed and configured, load the packages in your R session:

  library(odbc)library(implyr)library(dplyr)

library(implyr)

Connecting to Impala Once the packages are loaded, the next step is to create an ODBC driver object:

 

 

Then call src_impala() to connect to Impala and create a dplyr data source. In the call to src_impala(), pass the ODBC driver object and specify the arguments required by odbc::dbConnect(). These arguments can consist of individual ODBC keywords (driver, host, port, database, uid, pwd, and others), an ODBC data source name (dsn), or an ODBC connection string (.connection_string). For example:

  impala <- src_impala( drv = drv, driver = “Cloudera ODBC Driver for Impala”, host = “host”, port = 21050, database = “default”, uid = “username”, pwd = “password”)

 drv = drv,

 host = “host”,

 database = “default”,

 pwd = “password”

The returned object impala provides a remote dplyr data source to Impala.

The Impala drivers support multiple authentication methods, including Kerberos. To use Kerberos, specify properties including AuthMech, KrbRealm, KrbHostFQDN, and KrbServiceName. Consult the Impala ODBC driver installation guide for details.

Using dplyr

Now you can use dplyr verbs against tables in Impala. In the examples below, Impala has two tables named flights and airlines, containing data from the R package nycflights13. For information about how to create Impala tables and load data into them using R, see the implyr README.

To see what tables are in the current database in Impala, issue the command:

  src_tbls(impala) ##  [1] “airlines” “flights”

 

For this example, start by creating a remote table object named flights_tbl representing the data in the Impala table named flights:

  flights_tbl <- tbl(impala, “flights”)

To specify the database that contains the table, use the function in_schema(). For example, if the Impala table named flights were in a database named nycflights13, then you would use the command:

  flights_tbl <- tbl(impala, in_schema(“nycflights13”, “flights”))

You can then use dplyr verbs to manipulate the remote table. When using implyr, you must specify table names and column names using lowercase characters. To ensure that Impala returns results in sorted order, you must use arrange() after other dplyr verbs. (If you apply arrange() before other verbs, implyr issues a warning.)

  delay_tbl <- flights_tbl %>%  select(tailnum, distance, arr_delay) %>% group_by(tailnum) %>% summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>% filter(count > 20L, dist < 2000L, !is.na(delay)) %>% arrange(delay, dist, count)

 select(tailnum, distance, arr_delay) %>%

 summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%

 arrange(delay, dist, count)

dplyr, with help from implyr, translates this into an Impala SQL statement. The resulting object delay_tbl is a remote table object representing the dataset that would be generated by executing this SQL statement. When you display this remote table in R, dplyr queries and displays only the first few rows, not the full result:

12345678910111213141516171819 delay_tbl ## # Source:     lazy query [?? x 4]## # Database:   impalad version 2.8.0-cdh5.11.0 through odbc## #   [username@host:21050/default]## # Ordered by: delay, dist, count##    tailnum count      dist     delay##                ##  1  N433AA    22 1003.0000 -17.09091##  2  N548AA    49 1051.8776 -15.46939##  3  N479AA    34 1026.0294 -13.78788##  4  N647DL    34 1895.8824 -13.64516##  5  N595AA    54  942.4259 -13.29412##  6  N516AA    27 1036.7407 -13.11538##  7  N548UW    52 1688.0577 -13.03846##  8  N5EJAA    21 1311.8571 -12.47619##  9  N540AA    38 1012.0000 -12.29412## 10  N484AA    37  971.9189 -11.64706## # ... with more rows

2

4

6

8

10

12

14

16

18

 

# Database:   impalad version 2.8.0-cdh5.11.0 through odbc

# Ordered by: delay, dist, count

               

 2  N548AA    49 1051.8776 -15.46939

 4  N647DL    34 1895.8824 -13.64516

 6  N516AA    27 1036.7407 -13.11538

 8  N5EJAA    21 1311.8571 -12.47619

10  N484AA    37  971.9189 -11.64706

Use collect() to execute the query and return the full result to R as a local data frame:

  delay <- delay_tbl %>% collect()

Before using collect(), ensure that your data manipulation steps have filtered or aggregated the data to a small enough size that it can fit in memory in your R session.

You can visualize the result using ggplot2:

  library(ggplot2)ggplot(delay, aes(dist, delay)) + geom_point(   aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area(max_size = 2)

ggplot(delay, aes(dist, delay)) +

   aes(size = count), alpha = 1/2) +

 scale_size_area(max_size = 2)

If you are familiar with sparklyr, you may recognize this example and the resulting visualization from RStudio’s sparklyr website. Just as sparklyr enables you to use dplyr verbs with Spark on the back end, implyr enables you to use dplyr verbs with Impala on the back end.

For more examples, see the implyr README.

Using SQL

In addition to using dplyr grammar, you can also use implyr to directly write and execute SQL queries on Impala.

To execute a statement that returns no result set, use the dbExecute() function:

  dbExecute(impala, “REFRESH flights”)

To execute a query and return the result to R as a data frame, use the dbGetQuery() function.

  flights_by_carrier_df <- dbGetQuery( impala, ”SELECT carrier, COUNT(*) FROM flights GROUP BY carrier”)

 impala,

)

Only use dbGetQuery() when the query result will be small enough to fit in memory in your R session.

When you are finished, close the connection to Impala:

 

Future Work I intend to actively develop implyr to preserve compatibility with dplyr and related packages as they evolve, and to take advantage of relevant new features of Impala as they are added. Please report any problems you experience using implyr through GitHub issues, and please see the implyr README for more details about the intended scope and known limitations of this package.