The 0.2.0 release of the spark-ts package includes a fleshed-out Java API, among other things.
The spark-ts library, which was initially developed by Cloudera’s Data Science team, enables analysis of datasets comprising millions of time series, each with millions of measurements. It runs atop Apache Spark and exposes Scala, Java, and Python APIs. Check out this recent post for a closer look at the library and how to use it.
Time Series for Spark 0.2.0 was released earlier in January 2016. With more than 100 commits from seven contributors, the release includes a number of features and changes. We’ll cover the major ones here.
Full documentation is available here. The examples repo contains updated examples that work with the release. (As a reminder, Cloudera does not currently offer support for spark-ts.)
Nanosecond Precision and java.time
spark-ts 0.1 used the JodaTime library for representing and operating on DateTimes, but JodaTime only supports timestamps at millisecond precision, which is too constraining for domains like finance. The 0.2 release instead relies on Java 8’s java.time library, which supports nanosecond precision. Because the spark-ts library often uses longs to represent timestamps internally, and a long holding nanoseconds spans only about 292 years on either side of the Unix epoch, it can only represent times between roughly the years 1678 and 2262 AD. Another consequence of this change is that the library no longer works with Java 7 and below.
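The 292-year figure can be checked directly with java.time. The following is a minimal sketch, not code from spark-ts: it assumes timestamps are stored as nanoseconds since the Unix epoch in a signed 64-bit long, and the class and method names (NanoRange, toInstant, utcYear) are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class NanoRange {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Converts nanoseconds since the Unix epoch (held in a long) to an Instant. */
    static Instant toInstant(long epochNanos) {
        // floorDiv/floorMod keep the nanosecond remainder non-negative for pre-1970 values.
        long seconds = Math.floorDiv(epochNanos, NANOS_PER_SECOND);
        long nanos = Math.floorMod(epochNanos, NANOS_PER_SECOND);
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /** Calendar year (UTC) of a nanosecond timestamp. */
    static int utcYear(long epochNanos) {
        return toInstant(epochNanos).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // Earliest and latest instants that fit in a long of epoch nanoseconds.
        System.out.println("min: " + toInstant(Long.MIN_VALUE));
        System.out.println("max: " + toInstant(Long.MAX_VALUE));
    }
}
```

Under these assumptions the upper bound falls in 2262 and the lower bound in late 1677, so 1678 is the first year that is representable in full.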
Java API
The library now has a fleshed-out Java API. In many cases, Java users can use the Scala APIs directly. For example, there is only one version of the DateTimeIndex class, accessible to both Java and Scala. However, for better compatibility with Spark’s JavaRDD API, as well as to smooth out some of the quirks of Java/Scala interaction, spark-ts now provides a JavaTimeSeriesRDD and factories for creating TimeSeriesRDDs and DateTimeIndexes. As a consequence, the library has transitioned from Breeze to Spark’s Vector and Matrix classes, which pay better attention to Java support, although Breeze is still used under the covers in many cases.
Non-String Time Series Identifiers
The library represents multivariate time series as collections of <key, series> tuples, where the series is a univariate series and the key is an identifier. In 0.1, these keys were restricted to String objects, but in 0.2, they can be arbitrary Java or Scala objects. In the Python API, these identifiers are still restricted to strings.
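To illustrate why non-String keys are convenient, here is a plain-Java sketch of a <key, series> collection with a composite key. It does not use the spark-ts API: the class KeyedSeriesSketch, the helper meanByKey, and the (symbol, exchange) key are all hypothetical, and a Map of double arrays stands in for the library's own types.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyedSeriesSketch {
    /** Hypothetical helper: computes the mean of each univariate series, per key. */
    static <K> Map<K, Double> meanByKey(Map<K, double[]> collection) {
        Map<K, Double> means = new LinkedHashMap<>();
        for (Map.Entry<K, double[]> entry : collection.entrySet()) {
            double sum = 0.0;
            for (double value : entry.getValue()) {
                sum += value;
            }
            means.put(entry.getKey(), sum / entry.getValue().length);
        }
        return means;
    }

    public static void main(String[] args) {
        // A composite (symbol, exchange) key instead of a single String.
        Map<SimpleImmutableEntry<String, String>, double[]> series = new LinkedHashMap<>();
        series.put(new SimpleImmutableEntry<>("AAA", "NYSE"), new double[] {1.0, 2.0, 3.0});
        series.put(new SimpleImmutableEntry<>("AAA", "LSE"), new double[] {2.0, 4.0, 6.0});
        meanByKey(series).forEach((key, mean) ->
            System.out.println(key.getKey() + "@" + key.getValue() + " mean=" + mean));
    }
}
```

With String-only keys, the same distinction would have to be encoded into and parsed back out of a single string such as "AAA@NYSE".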
Lags
TimeSeries and TimeSeriesRDDs now support lag operators that augment the original dataset with shifted versions of each univariate series.
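The idea behind a lag operator can be shown in plain Java. The sketch below is not the spark-ts implementation: LagSketch and laggedMatrix are hypothetical names, and dropping the leading observations that lack a full history is an assumption about how the edges are handled.

```java
import java.util.Arrays;

final class LagSketch {
    /**
     * Returns a matrix whose row i holds series[i + maxLag] in column 0,
     * followed by its lag-1 through lag-maxLag predecessors.
     * The first maxLag observations are dropped because they lack a full history.
     */
    static double[][] laggedMatrix(double[] series, int maxLag) {
        if (maxLag < 0 || maxLag >= series.length) {
            throw new IllegalArgumentException("maxLag must be in [0, series.length)");
        }
        int rows = series.length - maxLag;
        double[][] out = new double[rows][maxLag + 1];
        for (int i = 0; i < rows; i++) {
            for (int lag = 0; lag <= maxLag; lag++) {
                out[i][lag] = series[i + maxLag - lag];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] series = {1.0, 2.0, 3.0, 4.0, 5.0};
        // Each printed row: current value, lag-1 value, lag-2 value.
        for (double[] row : laggedMatrix(series, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each column after the first is the original series shifted back by one more step, which is the shape typically fed to autoregressive models.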
Acknowledgements
Many thanks to Ahmed Mahran and Simon Ouelette for their enormous contributions, as well as to Mohamed Baddar, Sun He, Gabor Liptak, and Sean Owen.
Sandy Ryza is a Data Scientist at Cloudera and a committer to the Apache Spark and Apache Hadoop projects. He is a co-author of Advanced Analytics with Spark, from O’Reilly Media.