Running the Same Task in Python and R

According to a KDD poll fewer respondents (by rate) used only R in 2017 than in 2018. At the same time more respondents (by rate) used only Python in 2017 than in 2016.

Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.

For our task we picked the painful exercise of directly reading a 50,000,000 row by 50 column data set into memory on a machine with only 8GB of ram.

In Python the Pandas package takes around 6 minutes to read the data, and then one is ready to work.

In R both utils::read.csv() and readr::read_csv() fail with out of memory messages. So if your view of R is “base R only”, or “base R plus tidyverse only”, or “tidyverse only”: reading this file is a “hard task.”

With the above narrow view one would have no choice but to move to Python if one wants to get the job done.

Or, we could remember data.table. While data.table is obviously not part of the tidyverse, data.table has been a best-practice in R for around 12 years. It can read the data and is ready to work in R in under a minute.

In conclusion, to get things done in a pinch: learn Python or learn data.table. And, in my opinion, “tidyverse first teaching” (commonly code for “tidyverse only teaching”) may not serve the R community well in the long run.

Like this:

Like Loading…

Related