The post 5½ Reasons to Ditch Spreadsheets for Data Science: Code is Poetry appeared first on The Lucid Manager.
When I studied civil engineering some decades ago, we solved all our computing problems by writing code. Writing in BASIC or PASCAL, I could quickly perform fundamental engineering analysis, such as reinforced concrete beams, with my home-brew software library. Soon after I started my career, spreadsheets became widely available, and I fully embraced this fantastic business tool, first Lotus 123 and later grudgingly moved to MS Excel.
Spreadsheets were excellent in those early days because data, code, visualisations and tabular output are all stored in one convenient file. Creating graphs with computer code was a bit of a nightmare in those days, so spreadsheets were a minor miracle. The next twenty years, I must have created thousands of spreadsheets of varying complexity. I even developed a ‘jungle’ of interlinked spreadsheets to manage progress reporting.
In the pioneering days of spreadsheets, they provided enormous convenience for engineers and other professionals to quickly develop analytical tools. But after using this tool for a few years, cracks started to appear.
Spreadsheets are Chaos
Throughout my career, I had many nightmarish experiences trying to reverse engineer spreadsheets, even the ones I wrote myself. The combination of data, code and output that I loved at the start of my career was reaching its limits.
Spreadsheets use incomprehensible names for variables (
,
and so on) and the formulas are impossible to read because all code is cramped on one line with deeply nested logic. The multiple parentheses make Excel formulas are even harder to read than LISP expressions. Furthermore, spreadsheets hide the formulas behind the results, which renders spreadsheets notoriously hard to understand.
My love affair with the spreadsheet came to an end when I started writing my dissertation about customer service for water utilities. Excel was incapable of helping me with the complex machine learning I needed to draw my conclusions. A colleague suggested I look into this new thing called ‘Data Science’ and this advice changed my career.
Code is Poetry
I decided to learn how to write code in the R language for statistical analysis. The R language is like a Swiss army chainsaw for engineers with capabilities that far exceed anything a spreadsheet can do.
Writing in code, such as R or Python, is like writing an instruction manual on how to analyse data. Anyone who understands the language will be able to know how you derived your conclusions. Modern data science languages can generate print-quality visualisations and can output results in many formats, including a spreadsheet.
In my job as a data science manager for a water utility I use R code for everything. The awesome power of being able to easily combine large data sets, visualise data and undertake complex analysis.
Now that I have rediscovered the poetry of writing computer code, I advocate learning to use R or perhaps Python and ditch the spreadsheet. On my data science blog, I share examples of creating value and having fun with the R language. The only purpose I still have for spreadsheets is an interface for small data sets.
5½ Reasons to Ditch the Spreadsheet
If you are still using spreadsheets, or you are trying to convince a colleague to ditch this tool, here are 5½ reasons to start using code to analyse data:
- Good analysis is reproducible and can be peer-reviewed. Spreadsheets are hard to understand because of non-sequential references. Computer code is like an instruction book that can be read step-by-step.
Spreadsheet variables are hard to understand (e.g. ZX81:ZX99 ). In computer code, you give them meaningful names (e.g. sales[81:99] ).
-
Best practice in data management is to separate data, code and output. In spreadsheets, it is not immediately clear which cell is the result of another cell and which ones are raw data. Computer code separates the data from the code and the output.
-
You can only share spreadsheet output with people who have access to the relevant software package. Computer code can produce output in multiple formats, such as HTML, PDF or even Excel, including interactive dashboards you can publish on the web.
-
Functionality in spreadsheets is limited to what is made available by Microsoft. The R and Python languages are extendable and have extensive libraries to solve complex problems.
The bonus reason to ditch the spreadsheet is that the best data science software, such as R and Python, is Open Source and freely available on the web. No license fees and it comes with terrific community support.
Feel free to leave a comment if you like to defend the spreadsheet. If you have additional reasons to ditch this venerable but largely obsolete tool.
Subscribe to this monthly blog if you are interested in using the R language for practical data science and some fun.
The post 5½ Reasons to Ditch Spreadsheets for Data Science: Code is Poetry appeared first on The Lucid Manager.
Related