One billion row challenge using base R

Categories: R

Author: David Schoch
Published: January 8, 2024

Edit (2025-01-30): The text and benchmarks about Rfast are no longer correct. I had to rerun the post for various reasons and the code didn’t work anymore. After fixing it, Rfast was running slower than back in January 2024. What do we learn from this? Use renv….
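
For reference, a minimal renv workflow that would have pinned the package versions used back then (just a sketch; renv::init(), renv::snapshot(), and renv::restore() are the core functions):

renv::init()      # set up a project-local library and an renv.lock lockfile
renv::snapshot()  # record the package versions the post was built with
renv::restore()   # later: reinstall exactly those recorded versions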

One of my New Year’s resolutions is to blog a bit more about the random shenanigans I do with R. This is one of those posts.1

The One Billion Row challenge by Gunnar Morling is as follows:

write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!

It didn’t take long, thanks in part to Hacker News, for the challenge to spread to other programming languages. The original repository contains a show & tell where results in other languages can be discussed.

Obviously it also spread to R, and there is a GitHub repository from Alejandro Hagan dedicated to the challenge. There was some critical discussion about the seemingly bad performance of data.table, but that issue thread also evolved into a discussion of other solutions.

The obvious candidates for fast solutions in R are dplyr, data.table, collapse, and polars. Of those, it appears that polars might solve the task the fastest.
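
collapse and polars do not appear in the benchmarks below; purely for illustration, a collapse solution could look roughly like the following sketch (untested here, and using the data frame D that is read in further down):

library(collapse)
# grouped min/max/mean via collapse's fast statistical functions
collap(D, measurement ~ state, FUN = list(fmin, fmax, fmean))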

I was curious how far one can get with base R.

Creating the data

The R repository contains a script to generate benchmark data. For the purpose of this post, I created files with 1e6 and 1e8 rows. Unfortunately, my personal laptop cannot handle 1 billion rows without dying.
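
The generator script itself is not reproduced here; as a rough, hypothetical sketch, a comparable file (US state abbreviations plus normally distributed measurements, matching the columns used below) can be produced like this:

set.seed(1)
n <- 1e6                  # or 1e8
D_gen <- data.frame(
  measurement = rnorm(n), # standard normal "temperatures"
  state = sample(state.abb, n, replace = TRUE)
)
data.table::fwrite(D_gen, "measurements1e6.csv")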

Reading the data

All base R functions benefit from reading the state column as a factor instead of a plain character string.

D <- data.table::fread("measurements1e6.csv", stringsAsFactors = TRUE)
D
         measurement  state
               <num> <fctr>
      1:   0.9819694     NC
      2:   0.4687150     MA
      3:  -0.1079713     TX
      4:  -0.2128782     VT
      5:   1.1580985     OR
     ---                   
 999996:   0.7432489     FL
 999997:  -1.6855612     KS
 999998:  -0.1184549     TX
 999999:   1.2774375     MS
1000000:  -0.2800853     MD

Who would have thought that stringsAsFactors = TRUE could be useful.
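
If the reading step should also be dependency-free, base R can do the same, albeit considerably slower on files of this size. A minimal sketch, assuming the two-column layout shown above:

# columns in file order: measurement (numeric), state (read directly as factor)
D_base <- read.csv("measurements1e6.csv", colClasses = c("numeric", "factor"))
str(D_base)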

The obvious: aggregate and split/lapply

The most obvious choice for me was to use aggregate().

sum_stats_vec <- function(x) c(min = min(x), max = max(x), mean = mean(x))
aggregate(measurement ~ state, data = D, FUN = sum_stats_vec) |> head()
  state measurement.min measurement.max measurement.mean
1    AK   -4.1044940643    4.2710094897     0.0030819995
2    AL   -3.6350249689    4.5386719185    -0.0078110501
3    AR   -3.8138004849    4.1011149408     0.0084538758
4    AZ   -4.4150509298    3.9651248287     0.0002287343
5    CA   -4.1597256267    4.1024463673     0.0136493032
6    CO   -3.8604891180    4.2314151167    -0.0013340964

I was pretty sure that this would be the best solution.

The other obvious solution is to split the data frame by state and then lapply the stats calculation to each list element.

split_lapply <- function(D) {
  result <- lapply(split(D, D$state), function(x) {
    stats <- sum_stats_vec(x$measurement)
    data.frame(
      state = unique(x$state),
      min = stats[1],
      max = stats[2],
      mean = stats[3]
    )
  })
  do.call("rbind", result)
}
split_lapply(D) |> head()
   state       min      max          mean
AK    AK -4.104494 4.271009  0.0030819995
AL    AL -3.635025 4.538672 -0.0078110501
AR    AR -3.813800 4.101115  0.0084538758
AZ    AZ -4.415051 3.965125  0.0002287343
CA    CA -4.159726 4.102446  0.0136493032
CO    CO -3.860489 4.231415 -0.0013340964

The elegant: by

I stumbled upon by when searching for alternatives. I think it is quite an elegant way of solving a group/summarize task with base R. Unfortunately, it returns a list and not a data frame or matrix (I made that an implicit requirement).
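
A quick look at the raw return value illustrates this (a small sketch using the objects defined above):

b <- by(D$measurement, D$state, sum_stats_vec)
class(b) # "by": a list-like object holding one min/max/mean vector per state
b["AK"]  # the element for a single state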

In the help for by I stumbled upon a function I wasn’t aware of yet: array2DF!

array2DF(by(D$measurement, D$state, sum_stats_vec)) |> head()
  D$state       min      max          mean
1      AK -4.104494 4.271009  0.0030819995
2      AL -3.635025 4.538672 -0.0078110501
3      AR -3.813800 4.101115  0.0084538758
4      AZ -4.415051 3.965125  0.0002287343
5      CA -4.159726 4.102446  0.0136493032
6      CO -3.860489 4.231415 -0.0013340964

It does exactly what is needed here. For the benchmarks, I will also include a version without the array2DF call to check its overhead.

Another apply: tapply

In the help for by, I also stumbled upon this sentence:

Function by is an object-oriented wrapper for tapply applied to data frames.

So maybe we can construct a solution that uses tapply directly, without the built-in overhead of by.

do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)) |> head()
         min      max          mean
AK -4.104494 4.271009  0.0030819995
AL -3.635025 4.538672 -0.0078110501
AR -3.813800 4.101115  0.0084538758
AZ -4.415051 3.965125  0.0002287343
CA -4.159726 4.102446  0.0136493032
CO -3.860489 4.231415 -0.0013340964

At this point, I was also curious whether the do.call("rbind", list) step could be sped up, so I constructed a second tapply solution.

sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind) |> head()
            AK          AL           AR            AZ         CA           CO
[1,] -4.104494 -3.63502497 -3.813800485 -4.4150509298 -4.1597256 -3.860489118
[2,]  4.271009  4.53867192  4.101114941  3.9651248287  4.1024464  4.231415117
[3,]  0.003082 -0.00781105  0.008453876  0.0002287343  0.0136493 -0.001334096
               CT           DE           FL           GA           HI
[1,] -3.912464242 -4.139947796 -3.741261373 -3.946609985 -4.012318121
[2,]  4.424393025  4.103203429  3.614619126  3.928906751  3.542107460
[3,]  0.003656912  0.003007263  0.003644747  0.009984163  0.008314731
              IA           ID           IL          IN          KS           KY
[1,] -3.79082275 -3.964539123 -3.894335277 -4.03965959 -4.02908514 -4.048575194
[2,]  4.16662411  3.577876528  3.795398515  3.89974329  3.62109161  3.867433732
[3,]  0.01217377  0.003644028  0.002829236 -0.01164553  0.01471751 -0.001413558
               LA           MA          MD           ME           MI
[1,] -3.839122088 -4.037138674 -3.74470558 -3.923094861 -3.952862515
[2,]  3.569243248  3.792630597  3.71628977  3.663992861  4.155783544
[3,]  0.001506674 -0.006959067 -0.01049735  0.001780304  0.005783873
               MN            MO           MS           MT           NC
[1,] -4.410583765 -3.9391034949 -4.110337731 -4.087100626 -4.377367847
[2,]  4.311070983  4.4288231699  4.040434363  4.162059065  3.981134558
[3,]  0.002225353 -0.0005302625  0.004016168 -0.005199102 -0.007913031
               ND           NE           NH           NJ           NM
[1,] -4.068938069 -4.060063227 -3.915743483 -3.897490988 -3.807738095
[2,]  3.743970751  4.178671897  4.127142052  4.092974297  3.853332924
[3,] -0.006566703 -0.001995199  0.001732865 -0.008216333 -0.004512297
               NV           NY           OH          OK            OR
[1,] -4.499979315 -4.106887785 -3.970732385 -4.01749904 -3.7554051729
[2,]  3.984158422  3.775390296  4.055413781  3.92196743  4.2991202553
[3,]  0.005909496 -0.004012489  0.003917906  0.00272036  0.0004278573
              PA           RI          SC           SD           TN
[1,] -3.87864920 -3.656146720 -3.63600174 -4.252121844 -3.630113181
[2,]  4.18986336  4.257514030  4.11314448  3.742961731  3.920525368
[3,] -0.00303136 -0.004191743 -0.01102259  0.007743447  0.006712156
               TX           UT          VA            VT           WA
[1,] -4.205888754 -4.034818321 -4.29800813 -3.7493697276 -3.760239357
[2,]  3.929358388  3.995302051  3.86146553  3.9909618948  3.874636930
[3,] -0.007846478 -0.005618143 -0.01242856 -0.0005795627  0.006844534
               WI         WV           WY
[1,] -3.763026455 -3.6034317 -3.860997763
[2,]  4.322641193  3.9029500  4.116532763
[3,]  0.002972319 -0.0090395 -0.002380265

and we should obviously also include our newfound array2DF:

array2DF(tapply(D$measurement, D$state, sum_stats_vec)) |> head()
  Var1       min      max          mean
1   AK -4.104494 4.271009  0.0030819995
2   AL -3.635025 4.538672 -0.0078110501
3   AR -3.813800 4.101115  0.0084538758
4   AZ -4.415051 3.965125  0.0002287343
5   CA -4.159726 4.102446  0.0136493032
6   CO -3.860489 4.231415 -0.0013340964

The obscure: reduce

I thought that would be it, but then I remembered that Reduce exists. The solution is somewhat similar to split/lapply.

reduce <- function(D) {
  state_list <- split(D$measurement, D$state)
  Reduce(function(x, y) {
    res <- sum_stats_vec(state_list[[y]])
    rbind(x, data.frame(state = y, min = res[1], max = res[2], mean = res[3]))
  }, names(state_list), init = NULL)
}

reduce(D) |> head()
     state       min      max          mean
min     AK -4.104494 4.271009  0.0030819995
min1    AL -3.635025 4.538672 -0.0078110501
min2    AR -3.813800 4.101115  0.0084538758
min3    AZ -4.415051 3.965125  0.0002287343
min4    CA -4.159726 4.102446  0.0136493032
min5    CO -3.860489 4.231415 -0.0013340964

The unfair contender: Rfast

Pondering how these functions could be sped up in general, I remembered the package Rfast and managed to construct a solution using it.

Rfast <- function(D) {
  lev_int <- as.numeric(D$state)
  data.frame(
    state = levels(D$state),
    mean = Rfast::group(D$measurement, lev_int, method = "mean"),
    min = Rfast::group(D$measurement, ina = lev_int, method = "min"),
    max = Rfast::group(D$measurement, ina = lev_int, method = "max")
  )
}

Rfast(D) |> head()
  state          mean       min      max
1    AK  0.0030819995 -4.104494 4.271009
2    AL -0.0078110501 -3.635025 4.538672
3    AR  0.0084538758 -3.813800 4.101115
4    AZ  0.0002287343 -4.415051 3.965125
5    CA  0.0136493032 -4.159726 4.102446
6    CO -0.0013340964 -3.860489 4.231415

I was pretty sure that this would be the fastest, maybe even competitive with the other big packages!

Benchmark

For better readability, I reorder the benchmark results from microbenchmark according to median runtime, using a function provided by Dirk Eddelbuettel.

reorderMicrobenchmarkResults <- function(res, order = "median") {
  stopifnot("Argument 'res' must be a 'microbenchmark' result" = inherits(res, "microbenchmark"))

  smry <- summary(res)
  res$expr <- factor(res$expr,
    levels = levels(res$expr)[order(smry[["median"]])],
    ordered = TRUE
  )
  res
}

First up is the “small” dataset with 1e6 rows. I added the dplyr and data.table results as reference points.

sum_stats_list <- function(x) list(min = min(x), max = max(x), mean = mean(x))
sum_stats_tibble <- function(x) tibble::tibble(min = min(x), max = max(x), mean = mean(x))

bench1e6 <- microbenchmark::microbenchmark(
  aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
  split_lapply = split_lapply(D),
  array2DF_by = array2DF(by(D$measurement, D$state, sum_stats_vec)),
  raw_by = by(D$measurement, D$state, sum_stats_vec),
  docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
  sapply_tapply = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
  array2DF_tapply = array2DF(tapply(D$measurement, D$state, sum_stats_vec)),
  reduce = reduce(D),
  Rfast = Rfast(D),
  dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
  datatable = D[, .(sum_stats_list(measurement)), by = state],
  times = 25
)
Warning in microbenchmark::microbenchmark(aggregate = aggregate(measurement ~ :
less accurate nanosecond times to avoid potential integer overflows
ggplot2::autoplot(reorderMicrobenchmarkResults(bench1e6))

expr min lq mean median uq max neval
datatable 6.958848 7.405297 8.297311 7.933295 8.546245 13.83311 25
Rfast 11.743384 12.139526 13.755574 12.197664 12.969981 41.29192 25
sapply_tapply 12.767072 13.401711 14.626178 13.467024 13.522784 41.15502 25
docall_tapply 12.923692 13.404417 13.476324 13.472600 13.535248 15.58541 25
array2DF_tapply 12.840462 13.439144 17.402112 13.643693 13.766037 45.40615 25
raw_by 17.751565 18.374970 22.616228 18.889930 20.492866 48.64662 25
array2DF_by 17.893876 18.837942 21.434011 18.973857 21.597898 48.19763 25
reduce 19.132240 20.191311 22.698038 20.466585 21.401713 49.20381 25
dplyr 18.921623 20.267776 24.996619 20.922218 22.132005 122.47163 25
split_lapply 29.365840 30.348610 36.139006 32.277455 33.338781 62.49318 25
aggregate 121.751919 126.151547 135.702799 129.262094 145.984149 157.07530 25

First off, I was very surprised by the bad performance of aggregate. I looked at the source code, and it appears to be a fancier split/lapply type of function with a lot of if/else branches and for loops that slow it down considerably. For the benchmark with the bigger dataset, I actually discarded the function because it was way too slow.

Apart from that, there are roughly three groups. data.table is the fastest, followed by Rfast and the tapply versions, which are close together. I am quite pleased that building the data frame via do.call, sapply, and array2DF is very much comparable, because I really like my array2DF discovery. The by, reduce, and dplyr solutions trail a bit behind. I am surprised, though, that dplyr falls behind many of the base solutions.2

Moving on to the 100 million row file to see whether size makes a difference.

D <- data.table::fread("measurements1e8.csv", stringsAsFactors = TRUE)

bench1e8 <- microbenchmark::microbenchmark(
  # aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
  split_lapply = split_lapply(D),
  array2DF_by = array2DF(by(D$measurement, D$state, sum_stats_vec)),
  raw_by = by(D$measurement, D$state, sum_stats_vec),
  docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
  sapply_tapply = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
  array2DF_tapply = array2DF(tapply(D$measurement, D$state, sum_stats_vec)),
  reduce = reduce(D),
  Rfast = Rfast(D),
  dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
  datatable = D[, .(sum_stats_list(measurement)), by = state],
  times = 10
)
ggplot2::autoplot(reorderMicrobenchmarkResults(bench1e8))

expr min lq mean median uq max neval
datatable 952.2624 965.2889 1025.441 979.1874 1030.698 1307.946 10
dplyr 1122.5819 1148.4585 1217.971 1163.2574 1257.394 1434.936 10
Rfast 1238.9806 1279.6060 1322.938 1300.6930 1327.787 1545.191 10
reduce 1300.1480 1311.4900 1355.998 1325.8422 1379.634 1474.427 10
array2DF_tapply 1330.6773 1344.4986 1371.125 1353.8964 1394.718 1452.030 10
docall_tapply 1316.0870 1341.4056 1379.221 1366.9425 1382.036 1574.872 10
sapply_tapply 1313.0628 1328.9188 1448.447 1404.6124 1446.478 1742.124 10
array2DF_by 2182.8691 2231.2909 2262.856 2245.7254 2273.568 2395.180 10
raw_by 2186.5236 2228.7242 2271.197 2263.1893 2313.219 2334.976 10
split_lapply 2513.3321 2591.7153 2638.842 2627.3450 2686.659 2792.027 10

Again we see three groups, but this time with clearer cut-offs. data.table dominates, with dplyr not far behind. Rfast, reduce, and the tapply versions form the middle group. Surprisingly, by falls behind here, together with split/lapply.

Update (2024-01-09)

I managed to run some of the functions on a 1e9 file.

bench1e9 <- microbenchmark::microbenchmark(
    docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
    reduce = reduce(D),
    Rfast = Rfast(D),
    dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
    datatable = D[, .(sum_stats_list(measurement)), by = state],
    times = 5
)

The previously fastest base solutions fall off a little but are, in my opinion, still very good and still comparable with dplyr! Also, I learned that one can reorder microbenchmark results with the print command!
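
For reference, a short sketch of that reordering (print.microbenchmark accepts an order argument):

print(bench1e9, order = "median") # sort the printed summary by median runtime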

Summary

This was a fun little exercise, and I think I learned a lot of new things about base R, especially the existence of array2DF!

What was surprising is how competitive base R actually is with the “big guns”. I was expecting a much bigger margin between data.table and the base solutions, but that was not the case.

Footnotes

  1. Also inspired by a post by Danielle Navarro about the cultural loss of today’s serious blogging business.↩︎

  2. It is far more readable though.↩︎

Citation

BibTeX citation:
@online{schoch2024,
  author = {Schoch, David},
  title = {One Billion Row Challenge Using Base {R}},
  date = {2024-01-08},
  url = {http://blog.schochastics.net/posts/2024-01-08_one-billion-rows-challenge/},
  langid = {en}
}
For attribution, please cite this work as:
Schoch, David. 2024. “One Billion Row Challenge Using Base R.” January 8, 2024. http://blog.schochastics.net/posts/2024-01-08_one-billion-rows-challenge/.