In this paper we argue that the data management community should devote far more effort to building data integration (DI) systems, in order to truly advance the field. Toward this goal, we make three contributions. First, we draw on our recent industrial experience to discuss the limitations of current DI systems. Second, we propose an agenda to build a new kind of DI system that addresses these limitations. These systems guide users through the DI workflow step by step, providing tools that address the ‘pain points’ of each step; the tools are built on top of the Python data science and Big Data ecosystem (PyData). We discuss how to foster an ecosystem of such tools within PyData, then use it to build DI systems for collaborative/cloud/crowd/lay user settings. Finally, we discuss ongoing work at Wisconsin, which suggests that these DI systems are highly promising and that building them raises many interesting research challenges. Toward a System Building Agenda for Data Integration
If you did not already know
Robust Sparse Principal Component Analysis (ROSPCA)
A new sparse PCA algorithm is presented, which is robust against outliers. The approach is based on the ROBPCA algorithm that generates robust but nonsparse loadings. The construction of the new ROSPCA method is detailed, as well as a selection criterion for the sparsity parameter. An extensive simulation study and a real data example are performed, showing that it is capable of accurately finding the sparse structure of datasets, even when challenging outliers are present. In comparison with a projection pursuit-based algorithm, ROSPCA demonstrates superior robustness properties and comparable sparsity estimation capability, as well as significantly faster computation time. …
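A minimal sketch of trying this in R, assuming the CRAN package rospca exposes a rospca() function taking the data matrix, the number of components k, and a sparsity parameter lambda (the argument names and return fields used here are assumptions; check the package documentation):

# Hedged sketch: assumes rospca::rospca(X, k, lambda) and a $loadings field.
library(rospca)

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)   # 100 observations, 10 variables
X[1:5, ] <- X[1:5, ] + 8                   # contaminate a few rows with outliers

fit <- rospca(X, k = 2, lambda = 1)        # robust, sparse loadings for 2 components
fit$loadings                               # many exact zeros; not driven by the outliers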
Document worth reading: “Lectures on Statistics in Theory: Prelude to Statistics in Practice”
This is a writeup of lectures on ‘statistics’ that have evolved from the 2009 Hadron Collider Physics Summer School at CERN to the forthcoming 2018 school at Fermilab. The emphasis is on foundations, using simple examples to illustrate the points that are still debated in the professional statistics literature. The three main approaches to interval estimation (Neyman confidence, Bayesian, likelihood ratio) are discussed and compared in detail, with and without nuisance parameters. Hypothesis testing is discussed mainly from the frequentist point of view, with pointers to the Bayesian literature. Various foundational issues are emphasized, including the conditionality principle and the likelihood principle. Lectures on Statistics in Theory: Prelude to Statistics in Practice
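Not from the lectures, but a small R illustration of how two of the interval-estimation approaches mentioned above can differ on the same data, a single Poisson count (the exact frequentist interval versus a Bayesian credible interval under a Jeffreys prior):

# Illustration (not from the lectures): interval estimates for a Poisson mean.
x <- 3                                              # observed count

# Exact frequentist 95% confidence interval (Neyman construction)
poisson.test(x)$conf.int

# Bayesian 95% credible interval with a Jeffreys prior: posterior is Gamma(x + 1/2, 1)
qgamma(c(0.025, 0.975), shape = x + 0.5, rate = 1)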
More on sigr
If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:
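A minimal sketch (not from the original tip) of where summary.lm() stores those quantities; the commented sigr call is how the earlier tip formatted them, assuming sigr::wrapFTest() as used there:

# Sketch: the R-squared and F statistic inside a summary.lm() object.
set.seed(2018)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)

fit <- lm(y ~ x, data = d)
s <- summary(fit)

s$r.squared    # R-squared
s$fstatistic   # F statistic plus numerator/denominator degrees of freedom

# From the earlier sigr tip (assumed interface):
# sigr::wrapFTest(fit)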
Data Feminism
Data is growing more intertwined with everyday life and more involved in important decisions. Yet data is biased in many ways, from collection to analysis to the conclusions drawn, which is a problem when it is so often presented as providing an objective point of view. In their recently released manuscript for Data Feminism, Catherine D’Ignazio and Lauren Klein discuss the importance of varied points of view:
xts 0.11-2 on CRAN
xts version 0.11-2 was published to CRAN yesterday. xts provides data structures and functions to work with time-indexed data. This is a bug-fix release, with notable changes below:
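For readers new to the package, a minimal sketch of the core xts idea (a vector or matrix ordered by a time index, subsettable with ISO-8601 date strings); this is generic usage, not something specific to the 0.11-2 release:

library(xts)

# A small series indexed by dates
dates <- as.Date("2018-12-01") + 0:4
x <- xts(c(10, 12, 11, 13, 12), order.by = dates)

head(x)
x["2018-12-02/2018-12-04"]   # subset by an ISO-8601 date range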
Happy 10th Bday, Rcpp – and welcome release 1.0 !!
Ten years ago today I wrote the NEWS.Rd entry in this screenshot for the very first Rcpp release:
R Packages worth a look
Unifying Estimation Results with Binary Dependent Variables (urbin): Calculate unified measures that quantify the effect of a covariate on a binary dependent variable (e.g., for meta-analyses). This can be particularly i …
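The urbin interface itself is not shown here; as a stand-in, a base-R sketch of the kind of raw estimate such unified measures are derived from, a covariate's average marginal effect on a binary outcome in a logit model:

# Stand-in illustration (base R only, not the urbin API).
set.seed(42)
n <- 500
d <- data.frame(x = rnorm(n), z = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 + 1.2 * d$x - 0.4 * d$z))

fit <- glm(y ~ x + z, data = d, family = binomial())

# Average marginal effect of x: mean of dP(y = 1)/dx over the sample
p <- predict(fit, type = "response")
mean(coef(fit)["x"] * p * (1 - p))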
“Statistical and Machine Learning forecasting methods: Concerns and ways forward”
Roy Mendelssohn points us to this paper by Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos, which begins:
What’s new on arXiv
On Meta-Learning for Dynamic Ensemble Selection