The data.table
R
package is really good at sorting. Below is a comparison of it versus dplyr
for a range of problem sizes.
The graph is using a log-log scale (so things are very compressed). But data.table
is routinely 7 times faster than dplyr
. The ratio of run times is shown below.
Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table
uses a radix sort which has the potential to perform in near linear time (faster than the n log(n)
lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).
In fact, if we divide the y
in the above graph by log(rows)
we get something approaching a constant.
The above is consistent with data.table
not only being faster than dplyr
, but also having a fundamentally different asymptotic running time.
Performance like the above is one of the reasons you should strongly consider data.table
for your R
projects.
All details of the timings can be found here.
Like this:
Like Loading…
Related