When you speak with researchers, data scientists, and practitioners who work with data in any capacity, you are bound to hear one word multiple times in a conversation: Python.
Evaluating the Business Value of Predictive Models in Python and R
By Jurriaan Nagelkerke, Data Science Consultant, and Pieter Marcus, Data Scientist
Decolonising Artificial Intelligence
Machine Reading Comprehension: Learning to Ask & Answer
By Han Xiao, Tencent AI.
Using Confusion Matrices to Quantify the Cost of Being Wrong
There are so many confusing and sometimes even counter-intuitive concepts in statistics. I mean, come on… even explaining the difference between the null hypothesis and the alternative hypothesis can be an ordeal. All I want to do is understand and quantify the cost of my analytical models being wrong.
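To make that idea concrete, here is a minimal sketch of how a confusion matrix can be combined with a cost matrix to put a price on misclassifications. The counts and dollar figures below are entirely hypothetical, and this is a generic illustration of the technique rather than the article's own example.

```python
import numpy as np

# Hypothetical confusion matrix for a binary classifier:
# rows = actual class, columns = predicted class.
confusion = np.array([[850,  50],   # actual negative: TN, FP
                      [ 30,  70]])  # actual positive: FN, TP

# Assumed business costs per outcome (illustrative numbers only):
# correct predictions cost nothing, a false positive costs $10
# (say, a wasted mailing), a false negative costs $100 (a lost customer).
cost_matrix = np.array([[  0,  10],
                        [100,   0]])

# Element-wise product prices every cell, then sum over all outcomes.
total_cost = (confusion * cost_matrix).sum()
expected_cost_per_case = total_cost / confusion.sum()

print(f"Total cost of errors: ${total_cost}")
print(f"Expected cost per prediction: ${expected_cost_per_case:.2f}")
```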
Guest Post: Galin Jones on criteria for promotion and tenure in (bio)statistics departments
Editor’s Note: I attended an ASA Chair’s meeting and spoke about ways we could support junior faculty in data science. After my talk, Galin Jones, Professor and Director of Statistics at the University of Minnesota, and I had an interesting conversation about how his department had changed its promotion criteria in response to a faculty candidate with a unique profile. I asked him to write about his experience, and he kindly contributed the following post.
Document worth reading: “The Risk of Machine Learning”
Many applied settings in empirical economics involve simultaneous estimation of a large number of parameters. In particular, applied economists are often interested in estimating the effects of many-valued treatments (like teacher effects or location effects), treatment effects for many groups, and prediction models with many regressors. In these settings, machine learning methods that combine regularized estimation and data-driven choices of regularization parameters are useful to avoid over-fitting. In this article, we analyze the performance of a class of machine learning estimators that includes ridge, lasso and pretest in contexts that require simultaneous estimation of many parameters. Our analysis aims to provide guidance to applied researchers on (i) the choice between regularized estimators in practice and (ii) data-driven selection of regularization parameters. To address (i), we characterize the risk (mean squared error) of regularized estimators and derive their relative performance as a function of simple features of the data generating process. To address (ii), we show that data-driven choices of regularization parameters, based on Stein’s unbiased risk estimate or on cross-validation, yield estimators with risk uniformly close to the risk attained under the optimal (unfeasible) choice of regularization parameters. We use data from recent examples in the empirical economics literature to illustrate the practical applicability of our results. The Risk of Machine Learning
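As a rough illustration of point (ii), the snippet below selects ridge and lasso regularization parameters by cross-validation on simulated many-regressor data. This is a generic scikit-learn sketch under assumed settings, not the paper's estimators, data, or risk calculations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Simulated many-regressor setting (illustrative only): 100 regressors,
# of which just 10 carry signal, so regularization matters.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# Data-driven choice of the regularization parameter via cross-validation,
# one of the selection rules whose risk properties the paper analyzes.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print(f"CV-selected ridge penalty: {ridge.alpha_:.4f}")
print(f"CV-selected lasso penalty: {lasso.alpha_:.4f}")
```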
Distilled News
Practicing ‘No Code’ Data Science
Top KDnuggets tweets, Oct 3–9: 5 Reasons Logistic Regression should be the first thing you learn when becoming a Data Scientist
Most Retweeted, Favorited, Viewed & Clicked: 5 Reasons Logistic Regression should be the first thing you learn when becoming a Data Scientist https://t.co/lobXcyIzpj https://t.co/3xuHbVQvR3