One billion row challenge using base R – schochastics

Edit (2025-01-30): The text and benchmarks about Rfast are not correct anymore. I had to rerub the post for various reasons and the code didnt work anymore. After fixing it, Rfast was running slower than back in January 2024. What du we learn from this? Use renv….

One of my new years resolutions is to blog a bit more on the random shenanigans I do with R. This is one of those.¹

The One Billion Row challenge by Gunnar Morling is as follows:

write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!

I didn’t take long, also thanks to Hacker News, that the challenge spread to other programming languages. The original repository contains a show & tell where other results can be discussed.

Obviously it also spread to R and there is a GitHub repository from Alejandro Hagan dedicated to the challenge. There were some critical discussions on the seemingly bad performance of data.table but that issue thread also evolved to a discussion on other solutions.

The obvious candidates for fast solutions with R are dplyr, data.table, collapse, and polars. From those, it appears that polars might solve the tasks the fastest.

I was curious, how far one can get with base R.

Creating the data

The R repository contains a script to generate benchmark data. For the purpose of this post, I created files with 1e6 and 1e8 rows. Unfortunately, my personal laptop cannot handle 1 billion rows without dying.

Reading the data

All base R functions will profit from reading the state column as a factor instead of a usual string.

D <- data.table::fread("measurements1e6.csv", stringsAsFactors = TRUE)
D

         measurement  state
               <num> <fctr>
      1:   0.9819694     NC
      2:   0.4687150     MA
      3:  -0.1079713     TX
      4:  -0.2128782     VT
      5:   1.1580985     OR
     ---                   
 999996:   0.7432489     FL
 999997:  -1.6855612     KS
 999998:  -0.1184549     TX
 999999:   1.2774375     MS
1000000:  -0.2800853     MD

Who would have thought that stringAsFactors = TRUE can be useful.

The obvious: aggregate and split/lapply

The most obvious choice for me was to use aggregate().

sum_stats_vec <- function(x) c(min = min(x), max = max(x), mean = mean(x))
aggregate(measurement ~ state, data = D, FUN = sum_stats_vec) |> head()

  state measurement.min measurement.max measurement.mean
1    AK   -4.1044940643    4.2710094897     0.0030819995
2    AL   -3.6350249689    4.5386719185    -0.0078110501
3    AR   -3.8138004849    4.1011149408     0.0084538758
4    AZ   -4.4150509298    3.9651248287     0.0002287343
5    CA   -4.1597256267    4.1024463673     0.0136493032
6    CO   -3.8604891180    4.2314151167    -0.0013340964

I was pretty sure that this might be the best solution.

The other obvious solution is to split the data frame according to stats and then lapply the stats calculation on each list element.

split_lapply <- function(D) {
  result <- lapply(split(D, D$state), function(x) {
    stats <- sum_stats_vec(x$measurement)
    data.frame(
      state = unique(x$state),
      min = stats[1],
      max = stats[2],
      mean = stats[3]
    )
  })
  do.call("rbind", result)
}
split_lapply(D) |> head()

   state       min      max          mean
AK    AK -4.104494 4.271009  0.0030819995
AL    AL -3.635025 4.538672 -0.0078110501
AR    AR -3.813800 4.101115  0.0084538758
AZ    AZ -4.415051 3.965125  0.0002287343
CA    CA -4.159726 4.102446  0.0136493032
CO    CO -3.860489 4.231415 -0.0013340964

The elegant: by

I stumbled upon by when searching for alternatives. I think it is a quite elegant way of solving a group/summarize task with base R. Unfortunately it returns a list and not a data frame or matrix (I made that an implicit requirement).

In the help for by I stumbled upon a function I wasn’t aware of yet: array2DF!

array2DF(by(D$measurement, D$state, sum_stats_vec)) |> head()

  D$state       min      max          mean
1      AK -4.104494 4.271009  0.0030819995
2      AL -3.635025 4.538672 -0.0078110501
3      AR -3.813800 4.101115  0.0084538758
4      AZ -4.415051 3.965125  0.0002287343
5      CA -4.159726 4.102446  0.0136493032
6      CO -3.860489 4.231415 -0.0013340964

Does exactly what is needed here. For the benchmarks, I will also include a version without the array2DF call, to check its overhead.

Another apply: tapply

In the help for by, I also stumbled upon this sentence

Function by is an object-oriented wrapper for tapply applied to data frames.

So maybe we can construct a solution that uses tapply, but without any inbuilt overhead in by.

do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)) |> head()

         min      max          mean
AK -4.104494 4.271009  0.0030819995
AL -3.635025 4.538672 -0.0078110501
AR -3.813800 4.101115  0.0084538758
AZ -4.415051 3.965125  0.0002287343
CA -4.159726 4.102446  0.0136493032
CO -3.860489 4.231415 -0.0013340964

At this point, I was also curious if the do.call("rbind",list) can be sped up, so I constructed a second tapply solution.

sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind) |> head()

            AK          AL           AR            AZ         CA           CO
[1,] -4.104494 -3.63502497 -3.813800485 -4.4150509298 -4.1597256 -3.860489118
[2,]  4.271009  4.53867192  4.101114941  3.9651248287  4.1024464  4.231415117
[3,]  0.003082 -0.00781105  0.008453876  0.0002287343  0.0136493 -0.001334096
               CT           DE           FL           GA           HI
[1,] -3.912464242 -4.139947796 -3.741261373 -3.946609985 -4.012318121
[2,]  4.424393025  4.103203429  3.614619126  3.928906751  3.542107460
[3,]  0.003656912  0.003007263  0.003644747  0.009984163  0.008314731
              IA           ID           IL          IN          KS           KY
[1,] -3.79082275 -3.964539123 -3.894335277 -4.03965959 -4.02908514 -4.048575194
[2,]  4.16662411  3.577876528  3.795398515  3.89974329  3.62109161  3.867433732
[3,]  0.01217377  0.003644028  0.002829236 -0.01164553  0.01471751 -0.001413558
               LA           MA          MD           ME           MI
[1,] -3.839122088 -4.037138674 -3.74470558 -3.923094861 -3.952862515
[2,]  3.569243248  3.792630597  3.71628977  3.663992861  4.155783544
[3,]  0.001506674 -0.006959067 -0.01049735  0.001780304  0.005783873
               MN            MO           MS           MT           NC
[1,] -4.410583765 -3.9391034949 -4.110337731 -4.087100626 -4.377367847
[2,]  4.311070983  4.4288231699  4.040434363  4.162059065  3.981134558
[3,]  0.002225353 -0.0005302625  0.004016168 -0.005199102 -0.007913031
               ND           NE           NH           NJ           NM
[1,] -4.068938069 -4.060063227 -3.915743483 -3.897490988 -3.807738095
[2,]  3.743970751  4.178671897  4.127142052  4.092974297  3.853332924
[3,] -0.006566703 -0.001995199  0.001732865 -0.008216333 -0.004512297
               NV           NY           OH          OK            OR
[1,] -4.499979315 -4.106887785 -3.970732385 -4.01749904 -3.7554051729
[2,]  3.984158422  3.775390296  4.055413781  3.92196743  4.2991202553
[3,]  0.005909496 -0.004012489  0.003917906  0.00272036  0.0004278573
              PA           RI          SC           SD           TN
[1,] -3.87864920 -3.656146720 -3.63600174 -4.252121844 -3.630113181
[2,]  4.18986336  4.257514030  4.11314448  3.742961731  3.920525368
[3,] -0.00303136 -0.004191743 -0.01102259  0.007743447  0.006712156
               TX           UT          VA            VT           WA
[1,] -4.205888754 -4.034818321 -4.29800813 -3.7493697276 -3.760239357
[2,]  3.929358388  3.995302051  3.86146553  3.9909618948  3.874636930
[3,] -0.007846478 -0.005618143 -0.01242856 -0.0005795627  0.006844534
               WI         WV           WY
[1,] -3.763026455 -3.6034317 -3.860997763
[2,]  4.322641193  3.9029500  4.116532763
[3,]  0.002972319 -0.0090395 -0.002380265

and we should obviously also include our new found array2DF

array2DF(tapply(D$measurement, D$state, sum_stats_vec)) |> head()

  Var1       min      max          mean
1   AK -4.104494 4.271009  0.0030819995
2   AL -3.635025 4.538672 -0.0078110501
3   AR -3.813800 4.101115  0.0084538758
4   AZ -4.415051 3.965125  0.0002287343
5   CA -4.159726 4.102446  0.0136493032
6   CO -3.860489 4.231415 -0.0013340964

The obscure: reduce

I thought that this should be it, but then I remembered reduce exists. The solution is somewhat similar to split/lapply.

reduce <- function(D) {
  state_list <- split(D$measurement, D$state)
  Reduce(function(x, y) {
    res <- sum_stats_vec(state_list[[y]])
    rbind(x, data.frame(state = y, mean = res[1], min = res[2], max = res[3]))
  }, names(state_list), init = NULL)
}

reduce(D) |> head()

     state      mean      min           max
min     AK -4.104494 4.271009  0.0030819995
min1    AL -3.635025 4.538672 -0.0078110501
min2    AR -3.813800 4.101115  0.0084538758
min3    AZ -4.415051 3.965125  0.0002287343
min4    CA -4.159726 4.102446  0.0136493032
min5    CO -3.860489 4.231415 -0.0013340964

The unfair contender: Rfast

Pondering about how this functions could be sped up in general, I remembered the package Rfast and managed to construct a solution using this package.

Rfast <- function(D) {
  lev_int <- as.numeric(D$state)
  minmax <- Rfast::group(D$measurement, ina = lev_int, method = "min")
  data.frame(
    state = levels(D$state),
    mean = Rfast::group(D$measurement, lev_int, method = "mean"),
    min = Rfast::group(D$measurement, ina = lev_int, method = "min"),
    max = Rfast::group(D$measurement, ina = lev_int, method = "max")
  )
}

Rfast(D) |> head()

  state          mean       min      max
1    AK  0.0030819995 -4.104494 4.271009
2    AL -0.0078110501 -3.635025 4.538672
3    AR  0.0084538758 -3.813800 4.101115
4    AZ  0.0002287343 -4.415051 3.965125
5    CA  0.0136493032 -4.159726 4.102446
6    CO -0.0013340964 -3.860489 4.231415

Pretty sure that this will be the fastest, maybe even competitive with the other big packages!

Benchmark

For better readability I reorder the benchmark results from microbenchmark according to median runtime, with a function provided by Dirk Eddelbuettel.

reorderMicrobenchmarkResults <- function(res, order = "median") {
  stopifnot("Argument 'res' must be a 'microbenchmark' result" = inherits(res, "microbenchmark"))

  smry <- summary(res)
  res$expr <- factor(res$expr,
    levels = levels(res$expr)[order(smry[["median"]])],
    ordered = TRUE
  )
  res
}

First up the “small” dataset with 1e6 rows. I added the dplyr and data.table results as references.

sum_stats_list <- function(x) list(min = min(x), max = max(x), mean = mean(x))
sum_stats_tibble <- function(x) tibble::tibble(min = min(x), max = max(x), mean = mean(x))

bench1e6 <- microbenchmark::microbenchmark(
  aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
  split_lapply = split_lapply(D),
  array2DF_by = array2DF(by(D$measurement, D$state, sum_stats_vec)),
  raw_by = by(D$measurement, D$state, sum_stats_vec),
  docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
  sapply_tapply = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
  array2DF_tapply = array2DF(tapply(D$measurement, D$state, sum_stats_vec)),
  reduce = reduce(D),
  Rfast = Rfast(D),
  dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
  datatable = D[, .(sum_stats_list(measurement)), by = state],
  times = 25
)

Warning in microbenchmark::microbenchmark(aggregate = aggregate(measurement ~ :
less accurate nanosecond times to avoid potential integer overflows

ggplot2::autoplot(reorderMicrobenchmarkResults(bench1e6))

expr	min	lq	mean	median	uq	max	neval
datatable	6.958848	7.405297	8.297311	7.933295	8.546245	13.83311	25
Rfast	11.743384	12.139526	13.755574	12.197664	12.969981	41.29192	25
sapply_tapply	12.767072	13.401711	14.626178	13.467024	13.522784	41.15502	25
docall_tapply	12.923692	13.404417	13.476324	13.472600	13.535248	15.58541	25
array2DF_tapply	12.840462	13.439144	17.402112	13.643693	13.766037	45.40615	25
raw_by	17.751565	18.374970	22.616228	18.889930	20.492866	48.64662	25
array2DF_by	17.893876	18.837942	21.434011	18.973857	21.597898	48.19763	25
reduce	19.132240	20.191311	22.698038	20.466585	21.401713	49.20381	25
dplyr	18.921623	20.267776	24.996619	20.922218	22.132005	122.47163	25
split_lapply	29.365840	30.348610	36.139006	32.277455	33.338781	62.49318	25
aggregate	121.751919	126.151547	135.702799	129.262094	145.984149	157.07530	25

First of, I was very surprised by the bad performance of aggregate. I looked at the source code and it appears to be a more fancy lapply/split type of functions with a lot of if/else and for which do slow down the function heavily. For the benchmark with the bigger dataset, I actually discarded the function because it was way too slow.

Apart from that, there are three groups. Rfast and data.table are the fastest. The second group are the tapply versions. I am quite pleased with the fact that the data frame building via do.call, sapply and array2DF are very much comparable, because I really like my array2DF discovery. The remaining solutions are pretty much comparable. I am surprised though, that dplyr falls behind many of the base solutions.²

Moving on to the 100 million file to see if size makes a difference.

D <- data.table::fread("measurements1e8.csv", stringsAsFactors = TRUE)

bench1e8 <- microbenchmark::microbenchmark(
  # aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
  split_lapply = split_lapply(D),
  array2DF_by = array2DF(by(D$measurement, D$state, sum_stats_vec)),
  raw_by = by(D$measurement, D$state, sum_stats_vec),
  docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
  sapply_tapply = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
  array2DF_tapply = array2DF(tapply(D$measurement, D$state, sum_stats_vec)),
  reduce = reduce(D),
  Rfast = Rfast(D),
  dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
  datatable = D[, .(sum_stats_list(measurement)), by = state],
  times = 10
)

ggplot2::autoplot(reorderMicrobenchmarkResults(bench1e8))

expr	min	lq	mean	median	uq	max	neval
datatable	952.2624	965.2889	1025.441	979.1874	1030.698	1307.946	10
dplyr	1122.5819	1148.4585	1217.971	1163.2574	1257.394	1434.936	10
Rfast	1238.9806	1279.6060	1322.938	1300.6930	1327.787	1545.191	10
reduce	1300.1480	1311.4900	1355.998	1325.8422	1379.634	1474.427	10
array2DF_tapply	1330.6773	1344.4986	1371.125	1353.8964	1394.718	1452.030	10
docall_tapply	1316.0870	1341.4056	1379.221	1366.9425	1382.036	1574.872	10
sapply_tapply	1313.0628	1328.9188	1448.447	1404.6124	1446.478	1742.124	10
array2DF_by	2182.8691	2231.2909	2262.856	2245.7254	2273.568	2395.180	10
raw_by	2186.5236	2228.7242	2271.197	2263.1893	2313.219	2334.976	10
split_lapply	2513.3321	2591.7153	2638.842	2627.3450	2686.659	2792.027	10

Again we see three groups, but this time with clearer cut-offs. Rfast and data.table dominate and Rfast actually has a slight edge! The second group are tapply, reduce and dplyr. Surprisingly, by falls behind here, together with split/lapply.

Update(2024-01-09)

I managed to run some of the functions on a 1e9 file.

bench1e9 <- microbenchmark::microbenchmark(
    docall_tapply = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
    reduce = reduce(D),
    Rfast = Rfast(D),
    dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_tibble(measurement)) |> dplyr::ungroup(),
    datatable = D[, .(sum_stats_list(measurement)), by = state],
    times = 5
)

The previously fastest base solutions fall of a little bit, but are in my opinion still very good and still comparable with dplyr! Also, I learned that one can reorder microbenchmark results with the print command!

Summary

This was a fun little exercise, and I think I learned a lot of new things about base R, especially the existence of arry2DF!

What was surprising is how competitive base R actually is with the “big guns”. I was expecting a much bigger margin between data.table and the base solutions, but that was not the case.

Footnotes

Also inspired by a post of Danielle Navarro about the cultural loss of today’s serious blogging business.↩︎
It is far more readable though.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{schoch2024,
  author = {Schoch, David},
  title = {One Billion Row Challenge Using Base {R}},
  date = {2024-01-08},
  url = {http://blog.schochastics.net/posts/2024-01-08_one-billion-rows-challenge/},
  langid = {en}
}

For attribution, please cite this work as:

Schoch, David. 2024. “One Billion Row Challenge Using Base R.” January 8, 2024. http://blog.schochastics.net/posts/2024-01-08_one-billion-rows-challenge/.