We here at Win-Vector LLC have some really big news we would please like the
R-community’s help sharing.
vtreat version 1.2.0 is now available on CRAN, and this version of
vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as
vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the
rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as
Apache Spark, or
Google BigQuery. Or, thanks to the
rqdatatable packages, even fast large in-memory transforms are possible.
We have some basic examples of the new
vtreat capabilities here and here.