by Mark Niemann-Ross
Mango Solutions presented the EARL conference in Seattle this November and I was fortunate enough to have time to attend. During the conference I took notes on the presentations, which I’ll pass along to you.
Thoughts on the future of R in industry
The EARL conference started with a panel discussion on R. Moderated by Doug Ashton from Mango Solutions, the panel included Julia Silge with Stack Overflow, David Smith with Microsoft, and Joe Cheng with Rstudio.
Topics ranged from the tidyverse, R certification, R as a general purpose language, Shiny, and the future of R. I captured some interesting quotes from the panel members:
Regarding R certification, David Smith points out that certifications are important for big companies trying to filter employment applications. He mentions that certification is a minimum bar for some HR departments. Julia Silge mentions that Stack Overflow has found certifications to be less influential in the hiring process.
R as a general purpose language: Joe Cheng feels that R is useful for more than just statistics, but that Rstudio isn’t interested in developing general purpose tools. There was discussion around Python as the “second best” language for a lot of applications, and an agreement that R should remain focused on data science.
Most interesting was the discussion regarding the future of R. Julia Silge points out that Stack Overflow data shows R growing fast year over year — at about the same rate as Python. There are a lot of new users and packages need to take that into account.
I learned more about Natural Language Processing
Tim Oldfield introduces this conference as NOT a code-based conference. However, Julia Silge doesn’t hesitate to illustrate her discussion with appropriate code. And seriously, how would you discuss natural language processing without using the language of the trade?
I won’t get into the terms (TF-IDF) and technology (tidytext) of Mz Silge’s presentation. I will mention she does a great job of summarizing how and why to perform text mining. Like all good tech, you can easily scratch the surface of text mining in fifteen minutes. A thorough understanding requires years of hard research. If you’d like an introduction to her work, take a look at her paper She Giggles, He Gallops - analyzing gender tropes in film.
I gained an understanding of machine-learning
David Smith with Microsoft presented a session on neural nets, machine learning and transfer learning. More than just a theoretical chat, David illustrated the concepts with working tools. I’ve read quite a bit about machine learning – but this illustration really brings it home. Oh — and it’s pretty damn funny. ( David posted this on a blog entry here )
I learned to (sort of) love Excel
Eina Ooka found herself forced to incorporate Excel with her data workflow. Now — we all have opinions about Excel in data science — but Eina points out that for multidisciplinary data science teams, it’s good for data storage, modeling, and reports. There are issues about reproducibility and source control and for that, R is a good solution. But Eina summarizes that excel is still useful. Not all projects can move away from it.
Data science teams without structured, intentional collaboration leak knowledge and waste resources
Stephanie Kirmer provided real-life experience and lessons learned on Data Science teams. Her themes included collaboration, version control, reproducibility, institutional knowledge, and other concerns. She has accomplished this with the consistent use of R packages.
One of her most interesting concepts was using packages to capture institutional knowledge. Documenting procedures with a function, contained in a standardized package provides stability and a common tool. This package then becomes an on-boarding tool for new hires and a knowledge repository for departing staff.
I learned risk can be studied and quantified
Risk is the chance that something bad will happen. David Severski with Starbucks revealed some of the tools used by the coffee giant, specifically OpenFAIR (Open Factor Analysis of Information Risk) and EvaluatoR (an R package for risk evaluation). Dave points out that R is an elegant language for data tasks. It has an ecosystem of tools for simulations and statistics, making risk evaluation a plug-and-play process.
Personally, I don’t have call for risk evaluation. But it’s interesting to get a quick peek into the language and concerns of this specialty.
I was reminded of the Science in Data Science
Mark Gallivan reminds us about the science *in *data science. He researched the effect of Zika on travel by searching twitter for #babymoon. With that data, he cross-references on the geolocation of the tweet to draw conclusions of the impact of Zika on travel plans of pregnant women. This is one of those useful presentations on the nuts and bolts of research.
I gained knowledge for non-R conversations
On November 7th I attended the EARL (Enterprise Applications of the R Language) conference in Seattle. Two days later I attended Orycon, the annual science fiction convention in Portland, Oregon. After every conference I attend, I do a private post-mortem. I ask myself if the conference was worthwhile, if I’d attend again, and if my time was well-spent.
EARL is a deep dive into the application of the R language. Attendees are assumed to have deep understanding of R, statistics, and a domain knowledge; the quintessential definition of data science.
Orycon is a gathering of science fiction fans. It includes a crowd of cosplayers imitating the latest Star Wars characters — but I’ll ignore them for this discussion. To be clear, the people that appreciate science fiction are deeply involved in science fact.
As a result of attending EARL, I’m better prepared to understand the talent and experience slightly under the radar at Orycon. I already knew the methods the experts used to perform and document their research. Thanks to David Smith’s “not hotdog” I understand machine learning at an operational level, so can skip over that discussion — or correct bad science from a misinformed espouser of pseudo-fact.
Mark is an author, educator, and writer and teaches about R and Raspberry Pi at LinkedIn Learning.