Six degrees of Hadley Wickham: The CRAN co-authorship network

Categories: R, networks, data analysis
Author: David Schoch
Published: January 17, 2024

Once upon a time I was a dedicated network scientist. These days networks are more peripheral to my work, and I just like to toy around with interesting datasets. One of those is the CRAN co-authorship network. In a co-authorship network, two individuals (in our case R developers) are connected if they authored a piece of work together. Here, a “piece of work” is an R package. This network can be assembled quite easily from the authors field of the DESCRIPTION files of all packages available on CRAN. I have done a low-level analysis on GitHub, also featured in tidytuesday, including the introduction of the Hadley number, but I always wanted to do a longer write-up. And voilà, this is said write-up.

library(tidyverse)
library(igraph)
library(netUtils)

Getting the Data

It is actually quite easy to get all metadata (and more!) of the DESCRIPTION files from CRAN. It takes a single function call:

db <- tools::CRAN_package_db()
str(db)
'data.frame':   20301 obs. of  67 variables:
 $ Package                : chr  "A3" "AalenJohansen" "AATtools" "ABACUS" ...
 $ Version                : chr  "1.0.0" "1.0" "0.0.2" "1.0.0" ...
 $ Priority               : chr  NA NA NA NA ...
 $ Depends                : chr  "R (>= 2.15.0), xtable, pbapply" NA "R (>= 3.6.0)" "R (>= 3.1.0)" ...
 $ Imports                : chr  NA NA "magrittr, dplyr, doParallel, foreach" "ggplot2 (>= 3.1.0), shiny (>= 1.3.1)," ...
 $ LinkingTo              : chr  NA NA NA NA ...
 $ Suggests               : chr  "randomForest, e1071" "knitr, rmarkdown" NA "rmarkdown (>= 1.13), knitr (>= 1.22)" ...
 $ Enhances               : chr  NA NA NA NA ...
 $ License                : chr  "GPL (>= 2)" "GPL (>= 2)" "GPL-3" "GPL-3" ...
 $ License_is_FOSS        : chr  NA NA NA NA ...
 $ License_restricts_use  : chr  NA NA NA NA ...
 $ OS_type                : chr  NA NA NA NA ...
 $ Archs                  : chr  NA NA NA NA ...
 $ MD5sum                 : chr  "027ebdd8affce8f0effaecfcd5f5ade2" "d7eb2a6275daa6af43bf8a980398b312" "bc59207786e9bc49167fd7d8af246b1c" "50c54c4da09307cb95a70aaaa54b9fbd" ...
 $ NeedsCompilation       : chr  "no" "no" "no" "no" ...
 $ Additional_repositories: chr  NA NA NA NA ...
 $ Author                 : chr  "Scott Fortmann-Roe" "Martin Bladt [aut, cre],\n  Christian Furrer [aut]" "Sercan Kahveci [aut, cre]" "Mintu Nath [aut, cre]" ...
 $ Authors@R              : chr  NA "c(person(\"Martin\", \"Bladt\", email = \"martinbladt@math.ku.dk\", role = c(\"aut\", \"cre\")),\n             "| __truncated__ "person(\"Sercan\", \"Kahveci\", email = \"sercan.kahveci@sbg.ac.at\", role = c(\"aut\", \"cre\"))" NA ...
 $ Biarch                 : chr  NA NA NA NA ...
 $ BugReports             : chr  NA NA "https://github.com/Spiritspeak/AATtools/issues" NA ...
 $ BuildKeepEmpty         : chr  NA NA NA NA ...
 $ BuildManual            : chr  NA NA NA NA ...
 $ BuildResaveData        : chr  NA NA NA NA ...
 $ BuildVignettes         : chr  NA NA NA NA ...
 $ Built                  : chr  NA NA NA NA ...
 $ ByteCompile            : chr  NA NA "true" NA ...
 $ Classification/ACM     : chr  NA NA NA NA ...
 $ Classification/ACM-2012: chr  NA NA NA NA ...
 $ Classification/JEL     : chr  NA NA NA NA ...
 $ Classification/MSC     : chr  NA NA NA NA ...
 $ Classification/MSC-2010: chr  NA NA NA NA ...
 $ Collate                : chr  NA NA NA NA ...
 $ Collate.unix           : chr  NA NA NA NA ...
 $ Collate.windows        : chr  NA NA NA NA ...
 $ Contact                : chr  NA NA NA NA ...
 $ Copyright              : chr  NA NA NA NA ...
 $ Date                   : chr  "2015-08-15" NA NA NA ...
 $ Date/Publication       : chr  "2015-08-16 23:05:52" "2023-03-01 10:42:09 UTC" "2022-08-12 13:40:09 UTC" "2019-09-20 07:40:06 UTC" ...
 $ Description            : chr  "Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicab"| __truncated__ "Provides the conditional Nelson-Aalen and Aalen-Johansen estimators. The methods are based on Bladt & Furrer (2"| __truncated__ "Compute approach bias scores using different scoring algorithms,\n    compute bootstrapped and exact split-half"| __truncated__ "A set of Shiny apps for effective communication and understanding in statistics. The current version includes p"| __truncated__ ...
 $ Encoding               : chr  NA "UTF-8" "UTF-8" "UTF-8" ...
 $ KeepSource             : chr  NA NA NA NA ...
 $ Language               : chr  NA NA NA NA ...
 $ LazyData               : chr  NA NA "true" "true" ...
 $ LazyDataCompression    : chr  NA NA NA NA ...
 $ LazyLoad               : chr  NA NA NA NA ...
 $ MailingList            : chr  NA NA NA NA ...
 $ Maintainer             : chr  "Scott Fortmann-Roe <scottfr@berkeley.edu>" "Martin Bladt <martinbladt@math.ku.dk>" "Sercan Kahveci <sercan.kahveci@sbg.ac.at>" "Mintu Nath <dr.m.nath@gmail.com>" ...
 $ Note                   : chr  NA NA NA NA ...
 $ Packaged               : chr  "2015-08-16 14:17:33 UTC; scott" "2023-02-28 18:01:12 UTC; martinbladt" "2022-08-12 13:12:35 UTC; b1066151" "2019-09-12 14:16:35 UTC; s02mn9" ...
 $ RdMacros               : chr  NA NA NA NA ...
 $ StagedInstall          : chr  NA NA NA NA ...
 $ SysDataCompression     : chr  NA NA NA NA ...
 $ SystemRequirements     : chr  NA NA NA NA ...
 $ Title                  : chr  "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels" "Conditional Aalen-Johansen Estimation" "Reliability and Scoring Routines for the Approach-Avoidance Task" "Apps Based Activities for Communicating and Understanding\nStatistics" ...
 $ Type                   : chr  "Package" "Package" "Package" NA ...
 $ URL                    : chr  NA NA NA "https://shiny.abdn.ac.uk/Stats/apps/" ...
 $ UseLTO                 : chr  NA NA NA NA ...
 $ VignetteBuilder        : chr  NA "knitr" NA "knitr" ...
 $ ZipData                : chr  NA NA NA NA ...
 $ Path                   : chr  NA NA NA NA ...
 $ X-CRAN-Comment         : chr  NA NA NA NA ...
 $ Published              : chr  "2015-08-16" "2023-03-01" "2022-08-12" "2019-09-20" ...
 $ Reverse depends        : chr  NA NA NA NA ...
 $ Reverse imports        : chr  NA NA NA NA ...
 $ Reverse linking to     : chr  NA NA NA NA ...
 $ Reverse suggests       : chr  NA NA NA NA ...
 $ Reverse enhances       : chr  NA NA NA NA ...

That is a lot of data one can do a lot of things with, but we only need two fields: the package name and the authors.

The really hard part is cleaning up the authors field. While there are some standardized ways of entering author names into the DESCRIPTION file, it is still a wild-west free-text field. I tried to do the cleaning semi-automatically with a script, which was very tedious, and I am sure it is not perfect.[1]

author_pkg_cran <- author_cleaner(db) |>
    dplyr::filter(!authorsR %in% c("Posit Software", "R Core Team", "R Foundation", "Rstudio", "Company"))
str(author_pkg_cran)
tibble [52,350 × 2] (S3: tbl_df/tbl/data.frame)
 $ Package : chr [1:52350] "A3" "AalenJohansen" "AalenJohansen" "AATtools" ...
 $ authorsR: chr [1:52350] "Scott Fortmann-Roe" "Martin Bladt" "Christian Furrer" "Sercan Kahveci" ...
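
The cleaning script itself is not shown in this post. Purely as an illustration, the base R sketch below handles only the simplest cases: it strips role tags such as [aut, cre] and anything in angle brackets, then splits on commas. The helper split_authors() and the toy data frame are hypothetical and are not the author_cleaner() used above; real Author fields need far more special-casing.

```r
# Toy stand-in for two columns of tools::CRAN_package_db();
# the values are copied from the str() output above.
toy_db <- data.frame(
  Package = c("A3", "AalenJohansen"),
  Author = c(
    "Scott Fortmann-Roe",
    "Martin Bladt [aut, cre],\n  Christian Furrer [aut]"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT the author's author_cleaner()).
# Role tags are removed before splitting so that the comma inside
# "[aut, cre]" does not produce a bogus author.
split_authors <- function(author) {
  author <- gsub("\\[[^]]*\\]", "", author) # drop role tags like [aut, cre]
  author <- gsub("<[^>]*>", "", author)     # drop anything in angle brackets
  parts <- trimws(strsplit(author, ",", fixed = TRUE)[[1]])
  parts[nzchar(parts)]                      # discard empty fragments
}

authors <- lapply(toy_db$Author, split_authors)

# One row per (package, author) pair, shaped like author_pkg_cran
author_pkg_toy <- data.frame(
  Package = rep(toy_db$Package, lengths(authors)),
  authorsR = unlist(authors),
  stringsAsFactors = FALSE
)
author_pkg_toy
```

For these two packages the sketch yields three package-author rows, one for A3 and two for AalenJohansen.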

The co-authorship network

The code below is used to build the co-authorship network as a weighted network. The weight shows how many packages two developers have authored together.

author_pkg_cran_net <- netUtils::bipartite_from_data_frame(author_pkg_cran, "authorsR", "Package")
A <- as_biadjacency_matrix(author_pkg_cran_net, sparse = TRUE)
A <- as(A, "sparseMatrix")
B <- Matrix::t(A) %*% A
auth_auth_net <- graph_from_adjacency_matrix(B, "undirected", diag = FALSE, weighted = TRUE)
auth_auth_net
IGRAPH 4cf3f2a UNW- 28895 145248 -- 
+ attr: name (v/c), weight (e/n)
+ edges from 4cf3f2a (vertex names):
 [1] Scott Fortmann-Roe--Clement Calenge    
 [2] Martin Bladt      --Christian Furrer   
 [3] Martin Bladt      --Alexander Mcneil   
 [4] Martin Bladt      --Jorge Yslas        
 [5] Martin Bladt      --Alaric Muller      
 [6] Sigbert Klinke    --Jaroslav Myslivec  
 [7] Sigbert Klinke    --Robert King        
 [8] Sigbert Klinke    --Benjamin Dean      
+ ... omitted several edges
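
To see why the matrix product gives the edge weights, here is a minimal base R sketch on made-up data (the package and author names are invented). It assumes every package-author pair appears only once in the table; it mirrors the idea of the projection above, not the exact sparse-matrix code.

```r
# Invented toy data: three packages, three developers
toy <- data.frame(
  Package  = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgB", "pkgC"),
  authorsR = c("Ann", "Bob", "Ann", "Bob", "Cem", "Cem"),
  stringsAsFactors = FALSE
)

# Package-by-author incidence matrix: 1 if the author is listed on the package
M <- unclass(table(toy$Package, toy$authorsR))

# t(M) %*% M counts, for every pair of authors, the packages they share
B <- crossprod(M)

# The diagonal holds the number of packages per author, not a co-authorship
diag(B) <- 0
B
```

Ann and Bob share pkgA and pkgB, so their entry is 2; each of them shares only pkgB with Cem, giving a weight of 1.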

To check if this is a connected network (there is a path connecting any pair of developers), we use the igraph::components() function.

comps_cran <- components(auth_auth_net)
comps_cran$no
[1] 5888

That's quite a large number of components, but it is not really surprising. Many package authors (or teams of authors) have only ever worked on one package (actually more than 40% of all packages are single-authored) and thus never interacted with the broader R developer community on any other package.
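
A share like the single-authored figure can be read off a package-author table by counting rows per package. The sketch below uses invented data and base R only; on the real data the result would also depend on which entries the cleaning step removed.

```r
# Invented toy data: four packages, two of them with a single author
toy <- data.frame(
  Package  = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgD", "pkgD", "pkgD"),
  authorsR = c("Ann", "Bob", "Cem", "Dee", "Ann", "Eve", "Fay"),
  stringsAsFactors = FALSE
)

# Number of listed authors per package
authors_per_pkg <- table(toy$Package)

# Fraction of packages with exactly one author
share_single <- mean(as.vector(authors_per_pkg) == 1)
share_single
```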

The biggest component can be extracted with igraph::largest_component().

auth_auth_net_largest <- largest_component(auth_auth_net)
auth_auth_net_largest
IGRAPH f24d1a7 UNW- 15787 127067 -- 
+ attr: name (v/c), weight (e/n)
+ edges from f24d1a7 (vertex names):
 [1] Scott Fortmann-Roe--Clement Calenge    
 [2] Martin Bladt      --Christian Furrer   
 [3] Martin Bladt      --Alexander Mcneil   
 [4] Martin Bladt      --Jorge Yslas        
 [5] Martin Bladt      --Alaric Muller      
 [6] Sigbert Klinke    --Jaroslav Myslivec  
 [7] Sigbert Klinke    --Robert King        
 [8] Sigbert Klinke    --Benjamin Dean      
+ ... omitted several edges

Of the 28,895 recorded package authors, 15,787 (54.64%) are part of the largest connected component. All subsequent analyses are done on this network.

Figure: Plot of the biggest component of the CRAN co-authorship network

On average, every developer in the largest component has 16.1 co-authors; the median is 6. The two individuals who co-authored the most packages together (21) are Hadley Wickham and Jim Hester. The person with the most co-authors (756) is Hadley Wickham. What a great transition to the next section.
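
Summary numbers of this kind come from vertex degrees and edge weights. The following is a minimal igraph sketch on an invented four-person network, not the code that produced the figures above. As a sanity check on the real network, the mean degree equals 2 × 127067 / 15787 ≈ 16.1, which matches the reported average.

```r
library(igraph)

# Invented weighted co-authorship edges; weight = number of shared packages
el <- data.frame(
  from   = c("Ann", "Ann", "Bob", "Cem"),
  to     = c("Bob", "Cem", "Cem", "Dee"),
  weight = c(2, 1, 1, 1)
)
g <- graph_from_data_frame(el, directed = FALSE)

# Degree = number of distinct co-authors (the graph has no multi-edges)
n_coauthors <- degree(g)
mean(n_coauthors)   # same as 2 * ecount(g) / vcount(g)
median(n_coauthors)

# Developer with the most co-authors
most_connected <- names(which.max(n_coauthors))

# Pair with the most jointly authored packages
top_edge <- which.max(E(g)$weight)
top_pair <- ends(g, top_edge)
top_pair
```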

Six Degrees of Hadley Wickham

If you are familiar with the Erdős number and/or the Bacon number, then you know where this is going. Erdős was an incredibly prolific mathematician who travelled the world and published more than 1500 papers with a large number of co-authors. In honor of his prolific (and eccentric) life, the “Erdős number” was created. This number describes the “collaboration distance” (or the degree of separation) between Paul Erdős and other mathematicians, measured by the authorship of papers. Authors who have written a paper with Erdős have an Erdős number of 1. Mathematicians who have co-authored with those but not with Erdős himself have an Erdős number of 2, and so on.[2] The same principle has been employed in other domains[3], most prominently in the movie industry with the “Six degrees of Kevin Bacon”. The Bacon number shows how far away an actor is from appearing in a movie with Kevin Bacon.

The “Hadley number” can similarly be defined as the distance of R developers to Hadley Wickham in the co-authorship network. Someone (“A”) who co-developed a package with Hadley has a Hadley number of 1. Someone who co-developed a package with A, but none with Hadley, has a Hadley number of 2, and so on. Hadley himself is the only person with Hadley number 0. Below is the distribution of the Hadley number for all developers in the largest connected component.

The maximum Hadley number is 10 and the average is 3.01.
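
Since the Hadley number is defined in steps along co-authorships, it can be sketched as an unweighted shortest-path distance. The igraph example below uses an invented five-person network; the names other than the root are made up, and this is an assumption about how such a number could be computed, not the post's actual code.

```r
library(igraph)

# Invented co-authorship edges; weight = number of shared packages
el <- data.frame(
  from   = c("Hadley", "Ann", "Bob", "Hadley"),
  to     = c("Ann", "Bob", "Cem", "Dee"),
  weight = c(3, 1, 1, 2)
)
g <- graph_from_data_frame(el, directed = FALSE)

# weights = NA makes distances() count steps and ignore the weight attribute
hadley_number <- distances(g, v = "Hadley", weights = NA)[1, ]
hadley_number

# Distribution of the Hadley number
table(hadley_number)
```

Passing weights = NA matters on a weighted network like this one: by default distances() would use the weight attribute as edge length, so Ann, who shares three packages with the root, would end up at distance 3 instead of 1.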

To check your own Hadley number (if you are in the largest connected component, and my cleaning script didn’t butcher your name), scroll to the end of this post.[4]

The center of the collaboration network

Another interesting question, in network-analytic terms, is who is at the center of the network. The center is defined as the person with the smallest average distance to all other developers. The top ten developers in that regard are shown below. The full list can again be explored at the end of this post.

name                 centrality
Hadley Wickham       3.00918
Ben Bolker           3.13207
Dirk Eddelbuettel    3.15988
Martin Maechler      3.20023
Romain Francois      3.21309
Michael Friendly     3.23076
Jim Hester           3.25356
Kevin Ushey          3.25692
Duncan Murdoch       3.28530
Yihui Xie            3.29898

Surprise, surprise, it is Hadley again!
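
The average distance to all other developers can be sketched from the matrix of pairwise step distances. The igraph example below uses an invented five-person chain and assumes a connected network (as in the largest component), with the mean taken over the other n - 1 people; whether the table above used exactly this normalization is an assumption.

```r
library(igraph)

# Invented chain of co-authorships: Ann - Bob - Cem - Dee - Eve
el <- data.frame(
  from = c("Ann", "Bob", "Cem", "Dee"),
  to   = c("Bob", "Cem", "Dee", "Eve")
)
g <- graph_from_data_frame(el, directed = FALSE)

# All pairwise distances in steps; weights = NA ignores any weight attribute
D <- distances(g, weights = NA)

# Mean distance from each person to everyone else (distance to self is 0)
avg_dist <- rowSums(D) / (vcount(g) - 1)
sort(avg_dist)

# The center is the person with the smallest average distance
center <- names(which.min(avg_dist))
center
```

In the chain, the middle person is the center: Cem reaches two people in one step and two in two steps, so the average distance is 1.5.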

Full results

In the table below, you can search for your own Hadley number and where you rank in terms of centrality. If you find any mistakes, please let me know in the comments.

Footnotes

  1. I did some extended cleaning, which included removing companies such as “Posit Software”.

  2. My Erdős number is 4.

  3. I love these kinds of numbers. On this blog I also introduced the Zlatan number.

  4. My Hadley number is 2.

Citation

BibTeX citation:
@online{schoch2024,
  author = {Schoch, David},
  title = {Six Degrees of {Hadley} {Wickham:} {The} {CRAN} Co-Authorship
    Network},
  date = {2024-01-17},
  url = {http://blog.schochastics.net/posts/2024-01-17_six-degrees-of-hadley-wickham},
  langid = {en}
}
For attribution, please cite this work as:
Schoch, David. 2024. “Six Degrees of Hadley Wickham: The CRAN Co-Authorship Network.” January 17, 2024. http://blog.schochastics.net/posts/2024-01-17_six-degrees-of-hadley-wickham.