Python retains its top spot in the fifth annual IEEE Spectrum top programming language rankings, and also gains a designation as an “embedded language”. Data science language R remains the only domain-specific slot in the top 10 (where it is listed as an “enterprise language”) and drops one place compared to its 2017 ranking to take the #7 spot.
“The most important aspect of a statistical analysis is not what you do with the data, it’s what data you use” (survey adjustment edition)
Dean Eckles pointed me to this recent report by Andrew Mercer, Arnold Lau, and Courtney Kennedy of the Pew Research Center, titled, “For Weighting Online Opt-In Samples, What Matters Most? The right variables make a big difference for accuracy. Complex statistical methods, not so much.”
If you did not already know
Principal Points
The k principal points of a p-variate random variable X are defined as those points x1, …, xk which minimize the expected squared distance of X from the nearest of the xj.
High Precision Numerical Computation of Principal Points For Univariate Distributions …
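To make the definition concrete, here is a minimal sketch that approximates principal points from a random sample with Lloyd's (k-means) iteration in one dimension. This is not the high-precision method of the paper referenced above; the function name `estimate_principal_points`, the quantile-based starting values, and the sample size are all assumptions chosen for illustration.

```python
import random


def estimate_principal_points(samples, k, iterations=30):
    """Approximate k principal points of a univariate sample (Lloyd's algorithm)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Assumption: start from evenly spaced sample quantiles (deterministic).
    points = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        counts = [0] * k
        for x in ordered:
            # Assign each observation to its nearest current point.
            nearest = min(range(k), key=lambda j: (x - points[j]) ** 2)
            sums[nearest] += x
            counts[nearest] += 1
        # Move each point to the mean of the observations assigned to it.
        updated = [sums[j] / counts[j] if counts[j] else points[j] for j in range(k)]
        if updated == points:
            break
        points = updated
    return sorted(points)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]
points = estimate_principal_points(samples, k=2)
print(points)
```

For a standard normal variable with k = 2, the principal points are known to be ±sqrt(2/π) ≈ ±0.798, so the sample-based estimate should land close to those values, up to sampling error.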
Meta-packages, nails in CRAN’s coffin
Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.
“Optimized” floor plan with genetic algorithms
Genetic algorithms are inspired by natural selection, where the system is given a set of inputs and the “best” iteration is chosen until there’s some kind of convergence to a solution. Joel Simon applied this process to floor plan design.
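As a rough illustration of the select-and-iterate loop described above, here is a minimal genetic algorithm sketch on a toy problem (maximizing the number of 1s in a bit string). It is not Joel Simon's floor plan method; the `evolve` function, its parameters, and the tournament, crossover, and mutation choices are assumptions picked to keep the example short.

```python
import random


def evolve(fitness, length, pop_size=40, generations=60, mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit strings; returns the fittest individual found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def pick():
        # Tournament selection: the fitter of two random candidates is chosen.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged.
        children = [max(population, key=fitness)[:]]
        while len(children) < pop_size:
            mother, father = pick(), pick()
            cut = rng.randrange(1, length)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Toy objective (an assumption for this sketch): count the 1s in the string.
best = evolve(fitness=sum, length=20)
print(best, sum(best))
```

Swapping in a different `fitness` function and a different encoding is where a real application, such as scoring candidate floor plans, would do its work.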
Azure Functions for Data Science
Data scientists do more than build fancy AI and machine learning models. They often need to get involved with the data acquisition process. It is common for data to be pulled from other databases or even an API. Plus, the models need to be deployed. These tasks fall to the data scientist to solve (unless there is a data engineer willing to help). Recently, I have discovered Azure Functions to be an extremely useful tool for solving these types of tasks.
Let’s be open about the evidence for the benefits of open science
Kidwell et al. 2017, Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency. Published in PLOS Biology, criticized on a PLOS blog here. Brian Nosek responds at great length therein.
When Recurrent Models Don't Need to be Recurrent
An earlier version of this post was published on Off the Convex Path. It is reposted here with the author’s permission.
2018 Data Sources for Cool Data Science Projects, provided by Thinknum
Quick and Dirty Serverless Integer Programming
We all know that Python has risen above its humble beginnings such that it now powers billion-dollar companies. Let’s not forget Python’s roots, though! It’s still an excellent language for running quick and dirty scripts that automate some task. While this works fine for automating my own tasks because I know how to navigate the command line, it’s a bit much to ask a layperson to somehow install Python and its dependencies, open Terminal on a Mac (god help you if they have a Windows computer), type a random string of characters, and hit enter. Ideally, you would give the layperson a button, they hit it, and they get their result.