Note: A version of this article was also published via LinkedIn here.
*With the rise of ‘Big Data’, ‘Machine Learning’ and the ‘Data Scientist’ has come an explosion in the popularity of using open-source programming tools for data analysis.*
*This article provides a short summary of some of the evidence of these tools overtaking commercial alternatives and why, if you work with data, adding an open programming language, like R or Python, to your professional repertoire is likely to be a worthwhile career investment for 2019 and beyond.*
Like most faithful public policy wonks, I’ve spent more hours than I can count dragging numbers across a screen to understand, analyse or predict whatever segment of the world I have data on.
Exploring where the money was flowing in the world’s youngest democracy; analysing which government program was delivering the biggest impact; or predicting which roads were likely to disappear first as a result of climate change.
New policy questions, new approaches to answer them and a fresh set of data.
Yet, every silver lining has a cloud. And in my experience with data it’s often the need to scale a new learning curve to adhere to legacy systems and fulfil an organizational fetish for using their statistical software of choice.
Excel, SAS, SPSS, EViews, Minitab, Stata, and the list goes on.
Which is why I’ve decided this article needed to be written:
Because not only am I tired of talking to fellow analytical wonks about why they’re limiting themselves by only being able to work on data with spreadsheets, but there are also distinct professional advantages to unshackling yourself from the professional tyranny of proprietary tools:
1. Open-Source Statistics is Becoming the Global Standard
Firstly, if you haven’t been watching, the world is increasingly full of data. So much data, that the world is chasing after nerds to analyse it. As a result, the demand for a ‘jack of all trades’ data person, or “data scientist”, has been outstripping that of a more vanilla-flavoured ‘statistician’:
% Job Advertisements with term “data scientist” vs. “statistician”
(Credit: Bob Muenchen – r4stats.com)
And although you might not have aspirations to work in what the Harvard Business Review called the ‘Sexiest Job of the 21st Century’, the data gold rush has had implications far beyond the sex appeal of nerds.
For one, online communities like Stack Overflow, Kaggle and Data for Democracy have flourished, providing practical avenues for learning how to do some science with data and driving demand for tools that make applying this science accessible to everyone, like R and Python.
So much so that some of the best evidence suggests that not only is demand for quants with R and Python skills booming, but the practical use of open-source statistical tools like R and Python is starting to eclipse that of their proprietary relatives:
Statistical software by Google Scholar Hits:
(More credit to Bob Muenchen – r4stats.com)
Of course, I’m not here to conclusively make the point that a particular piece of software is a ‘silver bullet’. Only that something has happened in the world of data that the quantitatively inclined shouldn’t ignore: Not only are R and Python becoming programming languages for the masses, but they’re increasingly proving themselves as powerful complements to more traditional business analysis tools like Excel and SAS.
2. R is for Renaissance Wo(Man)
*For those watching the news, you’ll no doubt have heard of the great battle being waged between the R and Python languages that has tragically left the internet strewn with the blood of programmers and their pocket protectors.*
*But I’m going to goosestep right over the issue as in my opinion much of what I say for R is increasingly applicable to Python.*
For those of you unfamiliar with R, in essence it’s a programming language made to use computers to do stuff with numbers.
Enter: “10*10” and it will tell you ‘100’.
Enter: “print(‘Sup?’)” and the computer will speak to you like that kid loitering on your lawn.
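Since much of what I say for R also applies to Python, here is a rough sketch of the same two commands in Python. The variable names are my own and purely illustrative.

```python
# Arithmetic: Python evaluates the expression much as R does.
product = 10 * 10
print(product)

# Text output: print() sends the string to the screen.
greeting = "Sup?"
print(greeting)
```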
R was developed around 25 years ago. The idea behind it was, in essence, to develop a simpler, more open and extensible programming language for statisticians: something which allowed you greater power and flexibility than a ‘point and click’ interface, but that was quicker than punch cards or manually keying in 1s and 0s to tell the computer what to do.
The result: R – a free statistical tool whose sustained growth in use has led to one of the most flexible statistical tools in existence.
So much growth, in fact, that in 2014 enough new functionality was added to R by the community that “R added more functions/procs than the SAS Institute has written in its entire history.” And while it’s not the quantity of your software packages that counts, the speed of development is impressive and a good indication of the likely future trajectory of R’s functionality, particularly as many heavy hitters, including the likes of Microsoft, IBM and Google, are already using R and making their own contributions to the ecosystem:
Using R for Analytics – Get in Before George Clooney Does:
Image source. Also, see here.
Not only that, but with much of this growth being driven by user contributions, it is also a great reminder of the active and supportive community you have access to as an R and Python user, making it easier to get help, access free resources and find example code to ~~steal~~ base your analysis on.
**3. R is Data and Discipline Agnostic**
(Source: xkcd)
One of the first things that motivated me to learn R was the observation that many of the most interesting questions I encountered went unanswered because they crossed disciplines, involved obscure analytical techniques, or were locked away in a long-forgotten format. It therefore seemed logical to me that if I could become a data analytics “MacGyver”, I’d have greater opportunities to work on interesting problems.
Which is how I discovered R. You see, as somebody who is interested in almost everything, I found R’s adoption by such a diverse range of fields made it nearly impossible to overlook. With extensions being freely available to work with a wide variety of data formats (proprietary or otherwise) and apply a range of nerdy methods, R made a lot of sense.
I think it was Richard Branson that once said “If somebody offers you a problem but you are not sure you can do it, say yes. R probably has a package for it”:
Then R (and increasingly Python) has you covered.
Yet there is perhaps a subtler reason adopting R made sense: by being ‘discipline agnostic’, it’s well-suited for multidisciplinary teams, applied multipotentialites and anyone uncertain about exactly where their career might take them.
4. R Helps Avoid Fitting the Problem to the Tool
As an economist, I love a good echo chamber. Not only does everybody speak my language and get my jokes, but my diagnosis of the problem is always spot-on. Unfortunately, thanks to the errors of others, I’m aware that such cosy teams of specialists aren’t always a good idea, with homogeneous specialist teams at risk of developing solutions which aren’t fit for purpose by too narrowly defining a problem and misunderstanding the scope of the system it’s embedded in.
(Source: chainsawsuit.com)
While good organizations are doing their best to address this, creating teams that are multidisciplinary and have more diverse networks can be a useful means to protect against these risks while also driving better performance. This of course stands to be another useful advantage of using more general statistical tools with a diverse user base like R: you can more fluidly collaborate across disciplines while being better able to pick the right technique for your problem, reducing the risk that everything looks like a nail merely because you have a hammer.
5. Programming Encourages Reproducibility
Yet programming languages also hold an additional advantage over more typical ‘point and click’ interfaces for conducting analysis: transparency and reproducibility.
For instance, because software like R encourages you to write down each step in your analysis, your work is more likely to be ‘reproducible’ than had it been done using more traditional ‘point and click’ solutions. This is because you’re encouraged to record each step needed to achieve the final result, making it easier for your colleagues to understand what the hell you’re doing and increasing the likelihood you’ll be able to reproduce the results when you need to (or somebody else will).
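To make the idea concrete, here is a minimal sketch in Python of what ‘writing down each step’ can look like. Everything in it is made up for illustration: the data, the column names and the function names are all hypothetical, and a real analysis would read its data from a file rather than a string.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """program,spend_musd,outcome_score
Roads,12.5,61
Health,30.0,74
Schools,18.0,
Water,9.5,58
"""

def load_rows(text):
    """Step 1: read the raw data exactly as supplied."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_incomplete(rows):
    """Step 2: remove records with a missing outcome score."""
    return [r for r in rows if r["outcome_score"].strip()]

def score_per_musd(rows):
    """Step 3: compute outcome score per million dollars spent."""
    return {
        r["program"]: float(r["outcome_score"]) / float(r["spend_musd"])
        for r in rows
    }

def run_analysis(text):
    """Run every step in order so the result can be regenerated."""
    rows = drop_incomplete(load_rows(text))
    ratios = score_per_musd(rows)
    return {
        "n_programs": len(rows),
        "best_program": max(ratios, key=ratios.get),
        "mean_score": statistics.mean(float(r["outcome_score"]) for r in rows),
    }

if __name__ == "__main__":
    print(run_analysis(RAW_CSV))
```

Because the cleaning rule (dropping the record with no outcome score) lives in the script rather than in somebody’s memory of which cells they deleted, a colleague can rerun it on the same data and should get the same answer.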
In addition to this being practically useful for tracing your journey down the data-analysis-maze, for analytical teams it can also serve as a means for encouraging collaboration by allowing others to more easily understand your work and replicate your results. This assists with organizational knowledge retention and provides an incentive for ensuring analysis is accurate, by often making it easier to spot errors before they impact your analysis or soil your reputation.
Finally, while the use of scripting isn’t unique to open-source programming languages, by being free, R and Python come with an additional advantage: in the instance you decide to release your analysis, the potential audience is likely to be greater and more diverse than had it been written using proprietary software. Which is why in a world of the “Open Government Partnership” open-source programming languages make a lot of sense, providing a means of easing the transition towards governments publicly releasing their policy models.
6. R Helps Make Bytes Beautiful
As data-driven-everything becomes all the rage, making data pretty is becoming an increasingly important skill. R is great at this, with virtually unlimited options for unleashing your creativity on the world and communicating your results to the masses. Bar graphs, scatter diagrams, histograms and heat maps. Easy.
Just not pie graphs. They’re terrible.
But R’s visualization tools don’t finish at your desk, with the ‘Shiny’ package allowing you to take your pie graphs to the big time by publishing interactive dashboards for the web. Boss asking you to redo a graph 20 times each day? Outsource your work to the web by automating it through a dashboard and send them a link while you sip cocktails at the beach.
7. R and Python are free, but the Cost of Ignoring the Trend Towards Open-Source Statistics Won’t Be
Finally, R and Python are free, meaning not only can you install them wherever you want, but you can also take them with you throughout your career:
- Statistics lecturers prescribing you textbooks that are trying to get you hooked on expensive software that likely won’t exist when you graduate? Tell them it’s not 1999 and send them a link to this.
- Working for a not-for-profit organization that needs statistical software but can’t afford the costs of proprietary software? Install R and show them how to install Swirl’s free interactive lessons.
- Want to install it at home? No problem. You can even give a copy to your cat.
- Got a promotion and been gifted a new team of statisticians? Swap the Christmas bonuses for the gift that keeps giving: R!
But I’m not here to tell you R (or Python) is perfect. After all, there are good reasons some companies are reluctant to switch their analysis to R or Python. Nor am I interested in convincing you that it can, or should, replace every proprietary tool you’re using: I’m an avid spreadsheeter, and programs like Excel have distinct advantages.
Rather, I’d like to suggest that for all the immediate costs involved in learning an open-source programming language, whether it be R or Python, the long-term benefits are more than likely to surpass them.
(Source)
Not only that, but as a new generation of data scientists continues to push for the use of open-source tools, it’s reasonable to expect R and Python will become as pervasive a business tool as the spreadsheet and as important to your career as laughing at your boss’ terrible jokes.
*Interested in learning R? Check out this link here for a range of free resources.*
You can also read my review of the online specialization I took to scale the R learning curve here.