Happy New Year to all of you! Let us start the year with something for your inner maths nerd.
Top 5 Data Visualization Tools for 2019
All the best datasets, Artificial Intelligence, Machine Learning, and Business Intelligence tools are useless without effective visualization capabilities. In the end, data science is all about presentation. Whether you are a chief data scientist at Google or an all-in-one ‘many-hats’ data scientist at a start-up, you still have to show the results of your algorithm to a management executive for approval. We have all heard the adage “a picture is worth a thousand words”. I would rephrase that for data science as “an effective infographic is worth an infinite amount of data”, because even if you present the most amazing algorithms and statistics in the universe to your management, they will be unable to comprehend them. But present even a simple infographic, and everyone in the boardroom, from the CEO to your personnel manager, will be able to understand what your findings mean for your business enterprise.
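That boardroom point is easy to demonstrate in a few lines of R. The sketch below uses ggplot2 (my choice here, not necessarily one of the post's five tools), and the numbers are invented purely for illustration:

```r
library(ggplot2)

# Invented example numbers: predicted revenue uplift by customer segment.
results <- data.frame(
  segment = c("New", "Returning", "Churn risk"),
  uplift  = c(0.12, 0.31, 0.07)
)

# One simple chart communicates the finding faster than the raw table would.
ggplot(results, aes(x = segment, y = uplift)) +
  geom_col(fill = "steelblue") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Predicted revenue uplift by segment", x = NULL, y = "Uplift")
```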
Published in 2018
- [2018] R-squared for Bayesian regression models. *The American Statistician*. (Andrew Gelman, Ben Goodrich, Jonah Gabry, and Aki Vehtari)
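The paper's central idea is that for each posterior draw s, R² can be computed as Var(fitted values) / (Var(fitted values) + residual variance), giving a posterior distribution for R² rather than a single number. A minimal sketch with rstanarm (mtcars here is just a stand-in dataset):

```r
library(rstanarm)

# Per posterior draw s:  R^2_s = Var(yhat_s) / (Var(yhat_s) + sigma_s^2)
fit <- stan_glm(mpg ~ wt + hp, data = mtcars, refresh = 0)

yhat    <- posterior_linpred(fit)       # draws x observations matrix of fits
var_fit <- apply(yhat, 1, var)          # variance of fitted values, per draw
var_res <- as.data.frame(fit)$sigma^2   # residual variance, per draw
r2      <- var_fit / (var_fit + var_res)

quantile(r2, c(0.05, 0.5, 0.95))        # posterior summary of R^2
# rstanarm also provides this directly as bayes_R2(fit)
```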
Magister Dixit
“1. The nature of statistics
Statistics is the original computing with data. It is the field that deals with data with the most portability (it isn’t dependent on one type of physical model) and rigor. Statistics can be a pessimal field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data. Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.
I often say the best source of good statistical work is bad experiments. If all experiments were well conducted, we wouldn’t need a lot of statistics. However, we live in the real world; most experiments have significant shortcomings, and statistics is incredibly valuable.
Another aspect of statistics is that it is the only field that really emphasizes the risks of small data. There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets. This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.
It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO). And in some cases big data is promoted as valuable only because it is the cheapest to produce. Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because it is a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross-section of search engine users as to their actual intent).
If your business is directly producing truly valuable data (not just producing useful proxy data), you likely have small data issues. If you have any hint of a small data issue, you want to consult with a good statistician.
2. The nature of machine learning
In some sense machine learning rushes in where statisticians fear to tread. Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.
The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians, who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).
My opinion is that the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265-1281). Good machine learning papers use good optimization techniques, and bad machine learning papers (most of them, in fact) use bad, out-of-date, ad-hoc optimization techniques.
3. The nature of data mining
Data mining is a term that was quite hyped and is now somewhat derided. One of the reasons more people use the term “data science” nowadays is that they are loath to say “data mining” (though in my opinion the two activities have different goals).
The goal of data mining is to find relations in data, not necessarily to make predictions or come up with explanations.
Data mining is often what I call “an x’s only enterprise” (meaning you have many driver or “independent” variables but no pre-ordained outcome or “dependent” variables), and some of the typical goals are clustering, outlier detection, and characterization.
There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy. Actual exploratory statistics (as defined by Tukey) is exciting and always an important “get your hands into the data” step of any predictive analytics project.
4. The nature of informatics
Informatics, and in particular bioinformatics, are very hot terms. A lot of good data scientists (a term I will explain later) come from the bioinformatics field.
Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology, we are left with data infrastructure and matching algorithms. We have the creation and management of data stores and databases and the design of efficient matching and query algorithms. This isn’t meant to be a left-handed compliment: algorithms are a first love of mine, and some of the matching algorithms bioinformaticians use (like online suffix trees) are quite brilliant.
5. The nature of big data
Big data is a white-hot topic. The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, NoSQL, and so on). It is the platform you perform modeling (or usually just report generation) on top of.
6. The nature of predictive analytics
Wikipedia defines predictive analytics as the variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. It is a set of goals and techniques emphasizing making models. It is very close to what is also meant by data science.
I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms, and machine learning background and not from an analytics background. To my ear, analytics is more associated with visualization, reporting, and summarization than with modeling. I also try to use the term modeling over prediction (when I remember), as prediction in non-technical English often implies something like forecasting into the future (which is but one modeling task).
7. The nature of data science
Wikipedia defines data science as a field that incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing, with the goal of extracting meaning from data and creating data products.
Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models, and deploying models into production.
8. Conclusion
Machine learning and statistics may be the stars, but data science is the whole show.” John Mount (April 19, 2013)
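Mount's mention of Simpson's paradox is easy to make concrete. The few lines of R below use the classic kidney-stone-treatment success counts, the standard textbook illustration of the effect:

```r
# Simpson's paradox: treatment A beats B within BOTH subgroups,
# yet appears to lose once the subgroups are pooled.
a <- data.frame(stone = c("small", "large"), success = c(81, 192),  total = c(87, 263))
b <- data.frame(stone = c("small", "large"), success = c(234, 55),  total = c(270, 80))

a$success / a$total              # 0.93 and 0.73: A wins in both strata
b$success / b$total              # 0.87 and 0.69

sum(a$success) / sum(a$total)    # 0.78 pooled ...
sum(b$success) / sum(b$total)    # ... vs 0.83: B "wins" overall
```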
How to Write a Great Data Science Resume
Writing a resume for job applications is rarely a fun task, but it is a necessary evil. The majority of companies require a resume in order to apply to any of their open jobs, and a resume is often the first layer of the process in getting past the “Gatekeeper” — the Recruiter or Hiring Manager.
Adding Firebase Authentication to Shiny
Firebase and Shiny
Firebase is a mobile and web application development platform owned by Google. Firebase provides front-end solutions for authentication, database storage, object storage, messaging, and more. Firebase drastically reduces the time needed to develop certain types of highly scalable and secure applications.
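A minimal sketch of one way this wiring can look, using the Firebase JavaScript SDK (v8-style API) directly inside a Shiny UI. All config values are placeholders, and the post's own approach may well differ (for instance, it may use an R wrapper package instead):

```r
library(shiny)

ui <- fluidPage(
  tags$head(
    # Firebase JS SDK; the apiKey/authDomain below are placeholders
    tags$script(src = "https://www.gstatic.com/firebasejs/8.10.1/firebase-app.js"),
    tags$script(src = "https://www.gstatic.com/firebasejs/8.10.1/firebase-auth.js"),
    tags$script(HTML("
      firebase.initializeApp({
        apiKey: 'YOUR-API-KEY',
        authDomain: 'YOUR-APP.firebaseapp.com'
      });
      // Triggered from the server side; reports the signed-in uid to Shiny
      Shiny.addCustomMessageHandler('signin', function(msg) {
        firebase.auth().signInWithEmailAndPassword(msg.email, msg.pass)
          .then(function(cred) {
            Shiny.setInputValue('firebase_uid', cred.user.uid);
          })
          .catch(function(err) { alert(err.message); });
      });
    "))
  ),
  textInput("email", "Email"),
  passwordInput("pass", "Password"),
  actionButton("go", "Sign in"),
  verbatimTextOutput("who")
)

server <- function(input, output, session) {
  observeEvent(input$go, {
    session$sendCustomMessage("signin",
                              list(email = input$email, pass = input$pass))
  })
  output$who <- renderText({
    req(input$firebase_uid)   # renders only after authentication succeeds
    paste("Signed in as uid:", input$firebase_uid)
  })
}

shinyApp(ui, server)
```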
Purr yourself into a math genius
Abstract: We use the purrr package to solve a popular math puzzle via a combinatorial functional programming approach. A small shiny app is provided to allow the user to solve their own variations of the puzzle.
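The post's exact puzzle is not shown in this excerpt, but the flavour of the combinatorial purrr approach can be sketched on a classic of the genre (an assumed stand-in): insert "+", "-", or nothing between the digits 1..9 so that the expression evaluates to 100.

```r
library(purrr)

ops <- list("+", "-", "")            # candidate operators for each gap

# All 3^8 operator choices for the 8 gaps between the digits
# (cross() is superseded in purrr >= 1.0 but still works)
combos <- cross(rep(list(ops), 8))

exprs <- map_chr(combos, ~ paste0(1:9, c(unlist(.x), ""), collapse = ""))
solutions <- keep(exprs, ~ eval(parse(text = .x)) == 100)

head(solutions)                      # e.g. "123-45-67+89"
```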
Approaches to Text Summarization: An Overview
‘data:’ Scraping & Chart Reproduction: Arrows of Environmental Destruction
Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past, the vast majority of its regulatory corpus is quite sane and has made Americans safer and healthier as a result. What’s happened to an EPA left in the hands of evil (yep, “evil”) in the past two years is beyond lamentable, and we likely have two more years of lamenting ahead of us (unless you actually like your water with a coal ash chaser).
If you did not already know
Draw and Discard
In this work, we propose a novel framework for privacy-preserving client-distributed machine learning. It is motivated by the desire to achieve differential privacy guarantees in the local model of privacy in a way that satisfies all systems constraints using asynchronous client-server communication and provides attractive model learning properties. We call it ‘Draw and Discard’ because it relies on random sampling of models for load distribution (scalability), which also provides additional server-side privacy protections and improved model quality through averaging. We present the mechanics of client and server components of ‘Draw and Discard’ and demonstrate how the framework can be applied to learning Generalized Linear models. We then analyze the privacy guarantees provided by our approach against several types of adversaries and showcase experimental results that provide evidence for the framework’s viability in practical deployments. …
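From the abstract alone, the mechanics can be caricatured in a few lines of R. This toy sketch is my own reading, not the paper's algorithm: the server keeps k instances of a linear model; each simulated client draws one at random, applies a Laplace-noised gradient step, and the result overwrites a randomly discarded instance, with the k instances averaged at the end.

```r
rlaplace <- function(n, scale) {             # Laplace noise for local DP
  u <- runif(n, -0.5, 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

k <- 5; d <- 3
instances <- replicate(k, rnorm(d), simplify = FALSE)  # k model instances

client_update <- function(w, x, y, lr = 0.05, noise = 0.1) {
  grad <- as.numeric(crossprod(x, w) - y) * x  # squared-loss gradient
  w - lr * grad + rlaplace(length(w), noise)   # noisy local step
}

set.seed(1)
for (step in 1:500) {
  i <- sample(k, 1)                            # DRAW a random instance
  x <- rnorm(d); y <- sum(x * c(1, -2, 3))     # client's local datum
  j <- sample(k, 1)                            # DISCARD a random instance
  instances[[j]] <- client_update(instances[[i]], x, y)
}

Reduce(`+`, instances) / k   # averaged model; roughly recovers c(1, -2, 3)
```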