In this post in the R:case4base series we will look at one of the most common operations on multiple data frames – merge, also known as JOIN in SQL terms.
Can we do better than using averaged measurements?
Angus Reynolds writes:
RConsortium — Building an R Certification
For the past few months, ThinkR has been involved (with Mango, Procogia and the Linux Foundation) in a working group for an RConsortium R Certification.
Because it's Friday: Parable of the Polygons
What if we lived in a society where everyone was really happy about living in a diverse neighborhood? What if people only wanted to move when the disparity was really extreme: say, when fewer than 33% of people nearby looked like them? Well, we’d end up with a society like this:
R Packages worth a look
Fetch Sections of XML Scholarly Articles (pubchunks): Get chunks of XML scholarly articles without having to know how to work with XML. Custom mappers for each publisher and for each article section pull o …
Whats new on arXiv
Topic representation: finding more representative words in topic models
Document worth reading: “Causal inference and the data-fusion problem”
We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion – piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks.
Whats new on arXiv
Overoptimization Failures and Specification Gaming in Multi-agent Systems
The Final Data Science Roadshow is Just the Beginning
The Dataiku Data Science Roadshow wrapped up its 12-country tour last week, and while they say all good things must come to an end, we’re happy to say that we still have more on the docket to satisfy the worldwide data community.
CRAN’s New Missing Data Task View
It is a relatively rare event, and cause for celebration, when CRAN gets a new Task View. This week the r-miss-tastic team (Julie Josse, Nicholas Tierney and Nathalie Vialaneix) launched the Missing Data Task View. Even though I did some research on R packages for a post on missing values a couple of years ago, I was dumbfounded by the number of packages included in the new Task View.