Approximate Inclusion Probabilities for Survey Sampling (jipApprox)
Approximate joint-inclusion probabilities in Unequal Probability Sampling, or compute Monte Carlo approximations of the first and second-order inclusio …
Document worth reading: “Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation”
Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time-consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions is not keeping up with the ever-changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, rather than machine cost, needed for providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understand features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruit from the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.
Understanding Chicago’s homicide spike; comparisons to other cities
Michael Masinter writes:
What's new on arXiv
Equality Constrained Decision Trees: For the Algorithmic Enforcement of Group Fairness
How to import a directory of csvs at once with base R and data.table. Can you guess which way is the fastest?
Inspired by a recent post by Garrick on how to import a directory of csv files at once using purrr and readr, in this post we will try achieving the same using base R with no extra packages, and with data.table, another very popular package. As an added bonus, we will play a bit with benchmarking to see which of the methods is the fastest, including the tidyverse approach in the benchmark.
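As a rough illustration of the two approaches the teaser names, here is a minimal sketch, not the post's own code. The temporary directory, the file names `a.csv` and `b.csv`, and the use of `mtcars` as sample data are assumptions made only to keep the example self-contained. The data.table part runs only if that package is installed.

```r
# Assumption: a throwaway directory with two small csv files stands in
# for the real directory of csvs described in the post.
csv_dir <- tempfile("csv_demo_")
dir.create(csv_dir)
write.csv(head(mtcars), file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(tail(mtcars), file.path(csv_dir, "b.csv"), row.names = FALSE)

# Collect the full paths of all csv files in the directory.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Base R: read each file into a data.frame, then row-bind them all.
base_df <- do.call(rbind, lapply(files, read.csv))

# data.table: fread() each file, then rbindlist() the results.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::rbindlist(lapply(files, data.table::fread))
}

# A crude timing of the base R variant; a dedicated benchmarking
# package would give more reliable comparisons.
base_time <- system.time(do.call(rbind, lapply(files, read.csv)))
```

Both variants assume every file shares the same columns; files with differing columns would need extra handling, for example the `fill = TRUE` argument of `rbindlist()`.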
GitHub Streak: Round Five
Four years ago I referenced the Seinfeld Streak used in an earlier post of regular updates to the Rcpp Gallery:
Open Workshop: Deep Learning in R and Keras, November 14th in Frankfurt
Piping into ggplot2
In our wrapr pipe RJournal article we used piping into ggplot2 layers/geoms/items as an example.
RcppNLoptExample 0.0.1: Use NLopt from C/C++
A new package of ours, RcppNLoptExample, arrived on CRAN yesterday after a somewhat longer-than-usual wait for new packages as CRAN seems really busy these days. As always, a big and very grateful Thank You! for all they do to keep this community humming.
Prophets of gloom: Using NLP to analyze Radiohead lyrics
From the first time I listened to Radiohead’s The Bends, the band has been my favorite. I was a grad student in England at the time, and I recall listening to “Fake Plastic Trees” on repeat as I made my way to and from the library each day. By the time OK Computer came out, I was hooked. I remain hooked to this day.