This is a great idea! Unfortunately, only students at Columbia can submit. I encourage other institutions to do such contests too. We did something similar at Columbia, maybe 10 or 15 years ago? It went well; we just didn’t have the energy to do it again every year, as we’d initially planned. So I’m very happy to see the Data Science Institute start it up again.
Distilled News
How to Design a Successful Data Lake
High-profile statistical errors occur in the physical sciences too; it’s not just a problem in social science.
In an email with subject line, “Article full of forking paths,” John Williams writes:
Better R Code with wrapr Dot Arrow
Our R package wrapr supplies a “piping operator” that we feel is a real improvement in R code piped-style coding.
R Packages worth a look
Opinionated Approach for Digitizing Semi-Structured Qualitative GIS Data (qualmap) Provides a set of functions for taking qualitative GIS data, hand drawn on a map, and converting it to a simple features object. These tools are focuse …
If you did not already know
Document-Context Language Model (DCLM)
Text documents are structured on multiple levels of detail: individual words are related by syntax, and larger units of text are related by discourse structure. Existing language models generally fail to account for discourse structure, but it is crucial if we are to have language models that reward coherence and generate coherent texts. We present and empirically evaluate a set of multi-level recurrent neural network language models, called Document-Context Language Models (DCLMs), which incorporate contextual information both within and beyond the sentence. In comparison with word-level recurrent neural network language models, the DCLMs obtain slightly better predictive likelihoods, and considerably better assessments of document coherence. …
On “Competition” in the R Ecosystem
I’ve been thinking a bit on “competition” in the R ecosystem.
If you did not already know
Coral Reefs Optimization (CRO)
This paper presents a novel bioinspired algorithm to tackle complex optimization problems: the coral reefs optimization (CRO) algorithm. The CRO algorithm artificially simulates a coral reef, where different corals (namely, solutions to the optimization problem considered) grow and reproduce in coral colonies, fighting by choking out other corals for space in the reef. This fight for space, along with the specific characteristics of the corals’ reproduction, produces a robust metaheuristic algorithm shown to be powerful for solving hard optimization problems. In this research the CRO algorithm is tested in several continuous and discrete benchmark problems, as well as in practical application scenarios (i.e., optimum mobile network deployment and off-shore wind farm design). The obtained results confirm the excellent performance of the proposed algorithm and open a line of research for further application of the algorithm to real-world problems. …
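To make the "fight for space" idea concrete, here is a heavily simplified Python sketch of a CRO-style loop. It is not the paper's reference implementation: the toy objective (`sphere`), every parameter name and default value, the one-point crossover, the Gaussian mutation, and the unconditional removal of the worst corals each generation are all illustrative assumptions, and some steps of the full algorithm are left out.

```python
import random


def sphere(x):
    """Toy objective to minimize (an assumption for this sketch): sum of squares."""
    return sum(v * v for v in x)


def cro_minimize(fitness, dim, reef_size=40, occupied_frac=0.6,
                 broadcast_frac=0.8, depredation_frac=0.1,
                 settle_attempts=3, generations=100, seed=0):
    """Simplified CRO-style search; all parameters here are hypothetical defaults."""
    rng = random.Random(seed)

    def new_coral():
        return [rng.uniform(-5.0, 5.0) for _ in range(dim)]

    # Each reef cell holds one solution (a coral) or None (free space).
    reef = [new_coral() if rng.random() < occupied_frac else None
            for _ in range(reef_size)]
    if all(c is None for c in reef):
        reef[0] = new_coral()

    history = []  # best fitness in the reef after each generation
    for _ in range(generations):
        corals = [c for c in reef if c is not None]
        rng.shuffle(corals)
        n_broadcast = int(broadcast_frac * len(corals))
        larvae = []

        # Reproduction between pairs of corals: one-point crossover.
        spawners = corals[:n_broadcast]
        for a, b in zip(spawners[::2], spawners[1::2]):
            cut = rng.randrange(1, dim) if dim > 1 else 0
            larvae.append(a[:cut] + b[cut:])

        # Remaining corals produce a mutated copy of themselves.
        for c in corals[n_broadcast:]:
            larvae.append([v + rng.gauss(0.0, 0.3) for v in c])

        # Fight for space: a larva settles in a random cell only if the
        # cell is free or the larva is fitter than the current occupant.
        for larva in larvae:
            for _ in range(settle_attempts):
                i = rng.randrange(reef_size)
                if reef[i] is None or fitness(larva) < fitness(reef[i]):
                    reef[i] = larva
                    break

        # Free up space by removing a fraction of the worst corals.
        occupied = sorted((i for i in range(reef_size) if reef[i] is not None),
                          key=lambda i: fitness(reef[i]), reverse=True)
        for i in occupied[:int(depredation_frac * len(occupied))]:
            reef[i] = None

        history.append(min(fitness(c) for c in reef if c is not None))

    best = min((c for c in reef if c is not None), key=fitness)
    return best, history


best, history = cro_minimize(sphere, dim=3)
print("best solution:", best)
print("best fitness:", history[-1])
```

Because a coral is only ever displaced by a fitter larva, and only the worst corals are removed, the best fitness in the reef can never get worse from one generation to the next in this sketch.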
Limit access to a Jupyter notebook instance by IP address
For increased security, Amazon SageMaker customers can now limit access to a notebook instance to a range of IP addresses.
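As a rough illustration of what such a restriction can look like, the Python sketch below builds an IAM policy document that denies creation of the presigned notebook URL unless the request comes from an allowed CIDR range. The excerpt above does not spell out the mechanism, so treat this as an assumption: it relies on the `sagemaker:CreatePresignedNotebookInstanceUrl` action and the `aws:SourceIp` condition key as I understand them from AWS's IAM conventions. The helper name `build_ip_restriction_policy` and the example address ranges are hypothetical.

```python
import ipaddress
import json


def build_ip_restriction_policy(allowed_cidrs):
    """Return an IAM policy dict that denies notebook access from other IPs.

    Hypothetical helper for illustration; check the policy against the
    current AWS documentation before attaching it to any user or role.
    """
    # ip_network raises ValueError for malformed CIDR blocks, including
    # ones with host bits set (e.g. "192.0.2.5/24").
    cidrs = [str(ipaddress.ip_network(cidr)) for cidr in allowed_cidrs]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "sagemaker:CreatePresignedNotebookInstanceUrl",
                "Resource": "*",
                # Deny applies whenever the caller's IP is NOT in the list.
                "Condition": {"NotIpAddress": {"aws:SourceIp": cidrs}},
            }
        ],
    }


# Documentation-only example ranges (RFC 5737), not real office networks.
policy = build_ip_restriction_policy(["192.0.2.0/24", "203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```

A deny statement like this only narrows access; the user or role still needs a separate allow for the same action to open the notebook at all.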
Because it's Friday: Hurricane Trackers
With Hurricane Florence battering the US and Typhoon Mangkhut bearing down on the Philippines, it’s a good time to take a look at the art of visualizing predicted hurricane paths. (By the way, did you know that “typhoon”, “hurricane” and “cyclone” are just different names for the same weather phenomenon?) Flowing Data has a good overview of the ways media have been visualizing the predicted path (hat tip: reader MB), including this animation from Axios which does a good job of demonstrating the uncertainty in the forecast: