This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.
Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.
We are excited to release this effort, dubbed hs2client, as a new Apache-licensed open source project on GitHub. As you may guess from the name, this library implement the HiveServer2 Thrift API as a C++ library, with careful handling of result sets to allow languages like Python to access data at high performance.
The initial alpha preview release contains:
-
A C++ library,
libhs2client
, which provides a clean C++ API for the HiveServer2 Thrift API. This can be built and dynamically or statically linked in C/C++ applications with no direct exposure to Apache Thrift. -
Python bindings, with optimized reads to
pandas.DataFrame
To try out the library, you can install a dev build of the project right now with conda:
conda install hs2client -c cloudera/channel/dev |
Optimizing for Speed and Memory Use
The Python bindings for libhs2client
are careful about performance and memory in a few key ways:
-
Converting row batches directly into pandas-compatible NumPy arrays
-
Care with categorical data: string data is deduplicated while it’s being converted for pandas. In the future, it would be straightforward to add an option to return all string data as
pandas.Categorical
arrays.
hs2client Demo and Simple Benchmark
The initial API is oriented at modeling the HiveServer2 protocol closely. Connecting to a cluster yields a Service
instance:
In [1]: import hs2client service = hs2client.connect(‘localhost’, 21050, ‘wesm’) |
To execute queries, you open a session then use its execute
method.
In [2]: session = service.open_session() op = session.execute(“select 1”) |
Queries are asynchronous by default.
In [3]: op.is_finished Out[3]: True In [4]: op.get_state() Out[4]: ‘finished’ |
Result sets can be fetched currently to pandas.DataFrame
objects:
In [5]: df = op.fetchall_pandas() In [6]: df Out[6]: 1 0 1 |
0 1
Compared with comparable database clients, we’ve optimized the fetch path to have excellent performance (IO speeds) and memory use.
1234567891011121314151617181920212223 | In [10]: import numpy as np import pandas as pd N, K = 1000000, 10 df = pd.DataFrame(np.random.randn(N, K), columns=[‘data{0}’.format(i) for i in range(K)]) import ibis con = ibis.impala.connect(‘localhost’, hdfs_client=ibis.hdfs_connect(port=5070)) con.drop_database(‘hs2_test’, force=True) con.create_database(‘hs2_test’) con.create_table(‘test_data’, df, database=’hs2_test’) In [16]: op = session.execute_sync(‘select * from hs2_test.test_data’) %time df = op.fetchall_pandas() CPU times: user 200 ms, sys: 64 ms, total: 264 ms Wall time: 682 ms In [21]: speed = df.memory_usage().sum() / 0.682 / 2**20 print(‘Speed: {0:.2f} MB/s’.format(speed)) Speed: 111.87 MB/s |
2
4
6
8
10
12
14
16
18
20
22
import pandas as pd
N, K = 1000000, 10
df = pd.DataFrame(np.random.randn(N, K),
con = ibis.impala.connect(‘localhost’,
con.drop_database(‘hs2_test’, force=True)
con.create_table(‘test_data’, df, database=’hs2_test’)
In [16]: op = session.execute_sync(‘select * from hs2_test.test_data’)
CPU times: user 200 ms, sys: 64 ms, total: 264 ms
print(‘Speed: {0:.2f} MB/s’.format(speed))
This benchmark is being run on localhost, but it shows we can move over 110MB/s through the Thrift protocol and convert to a fully-formed pandas DataFrame.
hs2client Compared with impyla
In Ibis and other Python projects, we have been using impyla to execute queries and access result sets. impyla either uses Apache Thrift’s official Python implementation (on Python 2) or thriftpy (on Python 3) for interacting with the Impala or Hive Thrift service. Because of this, the main difference is the performance in fetching large result sets.
Here, I will perform an equivalent fetch using impyla via Ibis:
In [29]: # Connect to Impala via impyla (on Python 2) con = ibis.impala.connect(‘localhost’, port=21050) expr = con.sql(‘select * from hs2_test.test_data’) %time df = expr.execute(limit=None) CPU times: user 1.15 s, sys: 52 ms, total: 1.2 s Wall time: 1.66 s In [30]: speed = df.memory_usage().sum() / 1.66 / 2**20 print(‘Speed: {0:.2f} MB/s’.format(speed)) Speed: 45.96 MB/s |
con = ibis.impala.connect(‘localhost’, port=21050)
expr = con.sql(‘select * from hs2_test.test_data’)
%time df = expr.execute(limit=None)
Wall time: 1.66 s
In [30]: speed = df.memory_usage().sum() / 1.66 / 2**20
Speed: 45.96 MB/s
In this particular use case with an all-numeric table (which is ideal for deserialization performance), the speed difference is only 2x or a bit more.
hs2client Roadmap
The C++ library does not yet implement some important features that are needed to be a drop-in replacement for impyla or other Hive or Impala drivers:
-
SSL transport (with certificate verification)
-
SASL Thrift transport for secure (i.e. Kerberos) clusters, or insecure clusters using SASL
-
Note that many Hive clusters, even unsecured ones, use the SASL transport, so using the library with Hive right now may be a challenge until that is completed.
On the Python side, we must implement a DB API 2.0 compatibility layer, since currently data can only be fetched to pandas, not Python tuples as with most Python database drivers.
We of course welcome contributions from the community to build out some of these features.
Wes McKinney is a Software Engineer at Cloudera. He is the creator of Python’s pandas library and the Ibis project, a committer to Apache Arrow and Apache Parquet, and the author of *Python for Data Analysis. *
Marcel Kornacker is the architect of Impala and its tech lead.
Matt Jacobs and Thomas Tauber-Marshall are Software Engineers working on Cloudera’s Impala team.