We here at Win-Vector LLC have some really big news, and we would like the R community’s help sharing it.
vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.
vtreat is a very complete and rigorous tool for preparing messy real-world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the rquery package, this data preparation transform can now be applied directly to databases and big data systems such as PostgreSQL, Amazon Redshift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.
We have some basic examples of the new vtreat capabilities here and here.