Inspired by Garrick's recent post on how to import a directory of csv files at once using purrr and readr, in this post we will try to achieve the same using base R with no extra packages, and with data.table, another very popular package. As an added bonus, we will play a bit with benchmarking to see which of the methods is the fastest, including the tidyverse approach in the benchmark.
Let us show how to import all csvs from a folder into a data frame, with nothing but base R
To get the source data, download the zip file from this link and unzip it into a folder. We will refer to the folder path as data_dir.
To import all .csv files from the data_dir directory and place them into a single data frame called result, all we have to do is:
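A minimal sketch of that call, pieced together from the explanation that follows. The temporary folder and the two tiny csv files are made up here so the example runs on its own; with the real data, data_dir would simply point to the unzipped folder.

```r
# Hypothetical sample data: two tiny csv files in a temporary folder
data_dir <- file.path(tempdir(), "csv_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          file.path(data_dir, "first.csv"), row.names = FALSE)
write.csv(data.frame(id = 3L, value = "c"),
          file.path(data_dir, "second.csv"), row.names = FALSE)

# Find the csv files, read each one, and stack the rows into one data frame
result <- do.call(
  rbind,
  lapply(
    list.files(data_dir, pattern = "\\.csv$", full.names = TRUE),
    read.csv
  )
)
```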
A quick explanation of the code:
- list.files – produces a character vector of the names of the files in the named directory, in our case data_dir. We have also passed a pattern argument "\\.csv$" to make sure we only process files with .csv at the end of the name, and full.names = TRUE to get the file path and not just the name.
- read.csv – reads a file in table format and creates a data frame from its content.
- lapply(X, FUN, ...) – gives us a list of data frames, one for each of the files found by list.files. More generally, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. In our case X is the vector of file paths in data_dir (returned by list.files) and FUN is read.csv, so we are applying read.csv to each of the file paths.
- rbind – in our case combines the rows of multiple data frames into one, similarly (even though a bit more rigidly) to UNION in SQL.
- do.call – will combine all the data frames produced by lapply into one using rbind. More generally, it constructs and executes a function call from a name or a function and a list of arguments to be passed to it. In our case the function is rbind and the list is the list of data frames containing the data loaded from the csvs, produced by lapply.
To fully reconstruct the results from the original post, we need to do two extra operations:
- Add the source file names to the data frame
- Fix and reformat the dates
To do this, we will simply adjust the FUN in the lapply. In the above example, we have only used read.csv. Below, we will make a small function to do the extra steps:
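A sketch of what such a function could look like, following the steps described below. The sample files, the read_one name and the assumed Month_Year formats ("Jan-17" mixed with "Feb-2017") are illustrative guesses, not the actual source data, so the sub() pattern would need adapting to the real inconsistencies.

```r
# Hypothetical sample data with inconsistent Month_Year values
data_dir <- file.path(tempdir(), "csv_demo_dates")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Month_Year,value", "Jan-17,1", "Feb-2017,2"),
           file.path(data_dir, "a.csv"))
writeLines(c("Month_Year,value", "Mar-17,3"),
           file.path(data_dir, "b.csv"))

# %b is locale dependent; the "C" locale parses English month abbreviations
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

read_one <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # One copy of the file path per row (also fine for 0-row data frames)
  df[["source"]] <- rep(path, nrow(df))
  # Normalise "Feb-2017" to "Feb-17", then prepend a day so as.Date can parse it
  month_year <- sub("-20([0-9]{2})$", "-\\1", df[["Month_Year"]])
  df[["Month_Year"]] <- as.Date(sprintf("01-%s", month_year), format = "%d-%b-%y")
  df
}

result <- do.call(
  rbind,
  lapply(list.files(data_dir, pattern = "\\.csv$", full.names = TRUE), read_one)
)

# Restore the original locale
invisible(Sys.setlocale("LC_TIME", old_locale))
```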
Let's look at the extra code in the lapply:
- Instead of just using read.csv, we have defined our own little function that will do the extra work for each of the file paths, which are passed to the function as path.
- We read the data into a data frame called df using read.csv, and we specify stringsAsFactors = FALSE, as the tidyverse packages do this by default, while base R's default was TRUE before R 4.0.0.
- We add a new column source with the file name stored in path, repeated as many times as df has rows. This is a bit overkill here and could be done more simply, but it is quite robust and will also work with 0-row data frames.
- We transform the Month_Year into the requested date format with as.Date. Note that the relatively ugly sub() part is caused mostly by inconsistency in the source data itself.
- Using [[ instead of $ is less pleasing to the eye, but we find it to be good practice, so we sacrifice a bit of readability.
Using data.table
Another popular package that can help us achieve the same is data.table, so let's have a look and reconstruct the results with data.table's features:
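A sketch of how this could look, assuming the data.table package is installed. As before, the sample files and the assumed Month_Year formats are made up for illustration and the date clean-up would need adjusting to the real data.

```r
library(data.table)

# Hypothetical sample data with inconsistent Month_Year values
data_dir <- file.path(tempdir(), "csv_demo_dt")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Month_Year,value", "Jan-17,1", "Feb-2017,2"),
           file.path(data_dir, "a.csv"))
writeLines(c("Month_Year,value", "Mar-17,3"),
           file.path(data_dir, "b.csv"))

# %b is locale dependent; the "C" locale parses English month abbreviations
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

result <- rbindlist(lapply(
  list.files(data_dir, pattern = "\\.csv$", full.names = TRUE),
  function(path) {
    dt <- fread(path)
    # Add the source column and overwrite Month_Year in a single := call
    dt[, `:=`(
      source = path,
      Month_Year = as.Date(
        sprintf("01-%s", sub("-20([0-9]{2})$", "-\\1", Month_Year)),
        format = "%d-%b-%y"
      )
    )]
    dt
  }
))

# Restore the original locale
invisible(Sys.setlocale("LC_TIME", old_locale))
```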
Where:
- rbindlist does the same as do.call("rbind", l) on data frames, but much faster.
- fread is similar to read.table (and read.csv, which uses read.table) but faster and more convenient.
- ':='() is the data.table syntax to create multiple new columns in a data.table (data frame).
[Figure: boxplot of the benchmark results]
Visualizing the results in this case shows that data.table is a winner, with base R being the slowest of the options.
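A rough sketch of how such a comparison could be set up using only system.time from base R. The time_runs helper, the generated sample files and the number of repetitions are all assumptions made for illustration; the data.table variant is only timed if the package is installed, and the tidyverse variant is left out. Timings on toy files like these say little about real workloads.

```r
# Hypothetical sample data: five small csv files in a temporary folder
data_dir <- file.path(tempdir(), "csv_demo_bench")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:5) {
  write.csv(data.frame(id = seq_len(100), value = rnorm(100)),
            file.path(data_dir, sprintf("file_%d.csv", i)),
            row.names = FALSE)
}
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: elapsed seconds for each of `times` runs of f()
time_runs <- function(f, times = 20) {
  vapply(seq_len(times),
         function(i) system.time(f())[["elapsed"]],
         numeric(1))
}

timings <- list(
  base = time_runs(function() do.call(rbind, lapply(files, read.csv)))
)
if (requireNamespace("data.table", quietly = TRUE)) {
  timings[["data.table"]] <- time_runs(function() {
    data.table::rbindlist(lapply(files, data.table::fread))
  })
}

# One box per timed method
boxplot(timings, ylab = "elapsed seconds")
```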