About me

So before I get started on my talk - a little about me:
• I am originally from a research psychology background.
• I completed my master’s degree in research psychology not too long ago here in Johannesburg.
• For the last couple of years, I have been working at Kantar and I am currently a senior analytics researcher there.

knitr::include_graphics("Kantar_analytics_practice.png")

knitr::include_graphics("Kantar_Global.png")

knitr::include_graphics("Kantar_practice_offer.png")

• At Kantar, we do a lot of market research and typically assist companies with their business problems and data science needs.
• Globally, we are one of the largest marketing research companies with offices around the world.
• Depending on the client and problem we offer a host of analytical solutions.

Introduction - Why this topic? What actually led me down this path?

knitr::include_graphics("2017_2018downloadsR.png")

Original source: https://acastillogill.com/2018/12/linear-regression-r-downloads-around-the-world/

• So the inspiration for this topic started early in January, when I saw a post on Twitter about R downloads around the world.
• I could verify that the final number of downloads was: \(\color{green}{\text{6752}}\).
• Beyond that, one cannot say much else about this figure, except that downloads appear to be close to the ceiling of this level (a quick way to pull such figures yourself is sketched below).
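As an aside, if you want to pull such figures yourself, the cranlogs package wraps the same RStudio CRAN mirror logs. A minimal sketch (note that cranlogs reports global counts and does not split them by country):

library(cranlogs)

# daily downloads of R itself for 2018, as recorded by the RStudio mirror
r_downloads_2018 <- cran_downloads(packages = "R", from = "2018-01-01", to = "2018-12-31")

head(r_downloads_2018)
sum(r_downloads_2018$count)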

This got me thinking - are there other ways in which to measure growth and usage of R?
• It turns out there are. While exploring the data they used I saw that we also have access to CRAN’s logs for package downloads globally.

Package downloads by day can tell us far more about how often R is being used and what it is being used for. R downloads only tell us how many people have downloaded R, not whether or how they are using it; someone can easily download R and never use it. Package downloads, on the other hand, are a much stronger indicator of actual use.

It is unlikely that you will stay with only a small subset of packages throughout your data analysis journey in R. As you encounter new problems and challenges, it is likely that a different package can provide the support and the specific techniques you require to address the problem at hand. Thus, as more people use R and as some people use R more, we should be seeing more package downloads happening.

The available data and research questions

So the files can be downloaded directly from: http://cran-logs.rstudio.com/

Each file contains the following variables:
1) date
2) time (in UTC)
3) size (in bytes)
4) r_version, the version of R used to download the package
5) r_arch (i386 = 32-bit, x86_64 = 64-bit)
6) r_os (darwin9.8.0 = Mac, mingw32 = Windows)
7) package
8) country, two-letter ISO country code (geocoded from the IP address using MaxMind's free database)
9) ip_id, a daily unique ID assigned to each IP address

• It should be noted that the ip_id variable is our only way to track people but only \(\color{red}{\text{on a daily level and not over weeks or months.}}\)
• They have opted for this approach in order to allow for some tracking while prioritising user privacy and being cognisant of the issues surrounding it.
• Since \(\color{red}{\text{ip_id is based on the IP address}}\), we also have no way of knowing whether one person or four people are behind a set of downloads on a given day.
• Both of these points are methodological limitations, but there are still ways to work with and evaluate the data (a small sketch of what a single day's file looks like follows below).
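To make these variables and the daily ip_id concrete, here is a minimal sketch of reading a single day's log; the file name is just an example and the file is assumed to have been downloaded already:

library(readr)
library(dplyr)

# read one day's compressed log (example date)
one_day <- read_csv("2018-06-01.csv.gz")

one_day %>%
  filter(country == "ZA") %>%
  summarise(
    downloads = n(),                       # package downloads from SA on that day
    distinct_daily_ids = n_distinct(ip_id) # ip_id is only unique within this one day
  )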

With this in mind some of the questions I had were:
1) Are there any specific trends that we can uncover across the year?
2) Is the use of R growing within SA? If so, what does that growth look like?
3) Which are the most popular downloaded packages in SA?
4) Which packages are often downloaded together? Can we somehow visualise this in a meaningful way?

Data and preparation - Getting to the data!

Step 1: Downloading all the log files

library(tidyverse)
# Downloading all the logs from the start of 2015 to the start of 2019
start <- as.Date("2015-01-01")
today <- as.Date("2019-01-01")

all_days <- seq(start, today, by = "day")

# create the folder for the raw logs if it does not exist yet
if (!dir.exists("CRANlogs")) dir.create("CRANlogs")

# only download the files you don't have:
missing_days <- setdiff(
  as.character(all_days),
  tools::file_path_sans_ext(dir("CRANlogs"), compression = TRUE)
)

# the logs are organised by year, so build each URL from its missing date
year <- format(as.Date(missing_days), "%Y")
urls <- paste0("http://cran-logs.rstudio.com/", year, "/", missing_days, ".csv.gz")

for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(urls[i], paste0("CRANlogs/", missing_days[i], ".csv.gz"))
}

• I downloaded all the files from 2015-01-01 to 2019-01-01.
• This came to 1462 files, totalling 17.6 GB compressed.

You might be asking: if this talk is about 2018, why was the additional data downloaded?
• I had to download the additional data for the time series decomposition.
• Since there are seasonality effects at play, you cannot fit a trend line across a single year of data and take it at face value (a quick illustration follows below) - but more on this later!
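As a quick illustration (with simulated data, not the CRAN logs), stl(), which we use later for the decomposition, refuses to estimate a seasonal component from fewer than two full periods:

set.seed(1)
# purely illustrative daily counts
one_year   <- ts(rpois(365, lambda = 5000), frequency = 365)
four_years <- ts(rpois(4 * 365, lambda = 5000), frequency = 365)

try(stl(one_year, s.window = "periodic"))     # errors: fewer than two periods
fit <- stl(four_years, s.window = "periodic") # works once multiple years are available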

Step 2: Reading the files in and extracting the entries we require

library(tidyverse)
library(data.table)
library(here)

# handy little command that will show you all files in this folder
file_list <- list.files("CRANlogs", full.names = TRUE)

# We split the full list into partitions of 30 files each (an arbitrary size I chose,
# which gives 49 partitions) - the next step is rather memory intensive, hence the splitting
file_list_split <- split(file_list, ceiling(seq_along(file_list) / 30))

# a loop repeating for the total length of our file split list
for (i in 1:length(file_list_split)) {
  # pulling this partition out as a simple vector of file names
  files_to_read <- unlist(file_list_split[i])
  # creating an empty list to save the new files
  logs_temp <- list()
  # inner loop reading the respective days in the set
  for (file in files_to_read) {
    # printing in the loop so we can keep track of where we are
    print(paste("Reading", file, "..."))
    # the actual read.table() for the day
    logs_temp[[file]] <- read.table(file,
      header = TRUE, sep = ",",
      quote = "\"", dec = ".", fill = TRUE,
      comment.char = "", as.is = TRUE
    )
  }
  # binding all the days together in the set
  pulled_data <- rbindlist(logs_temp)

  # filtering the full frame to get rid of other countries
  data_to_write <- pulled_data %>% filter(country == "ZA")
  # saving files in a different directory so we can easily work with the extract
  write.csv(data_to_write,
    file = paste(here("write_location"), "/test", i, ".csv", sep = "")
  )


  # clearing the memory of these massive files
  rm("data_to_write", "logs_temp", "pulled_data")
}

• I took the list of 1462 files and split it into 49 partitions of up to 30 days each (30 × 49 = 1470 slots, so the last partition is not completely full).
• Each small set was read in, the data combined into one file, filtered on South Africa, and saved.
• After all the entries were extracted, it was a simple task of combining the 49 files into one final data file.

Step 3: Merging it all together

setwd(here("write_location"))

to_merge_file_list <- list.files(full.names = TRUE)

# creating an empty list to save the new files
logs_reduced <- list()
# same as before but we are reading all partitions we have kept
for (file in to_merge_file_list) {
  print(paste("Reading", file, "..."))
  logs_reduced[[file]] <- read.table(file,
    header = TRUE, sep = ",",
    quote = "\"", dec = ".", fill = TRUE,
    comment.char = "", as.is = TRUE
  )
}

# binding all parts together for the final frame
DF_ZA <- rbindlist(logs_reduced)
# write.csv(DF_ZA,"2018_CRAN_DOWNLOADS_forZA.csv")
DF_ZA_sorted <- DF_ZA %>% arrange(date, time)
setwd(here())
# write.csv(DF_ZA_sorted,"2015_2018_CRAN_DOWNLOADS_forZA_sorted.csv")

• My final dataset consisted of 5 749 500 observations, of which 2 880 316 downloads occurred in 2018. Let that serve as some \(\color{red}{\text{terrible foreshadowing}}\).

library(tidyverse)
library(data.table)
library(here)

DF_ZA_sorted <- read.table(here("2015_2018_CRAN_DOWNLOADS_forZA_sorted.csv.gz"),
  header = TRUE, sep = ",",
  quote = "\"", dec = ".", fill = TRUE,
  comment.char = "", as.is = TRUE
)

# Cutting off the single day of 2019 from the file
DF_ZA_sorted <- DF_ZA_sorted %>% filter(format(as.Date(date), "%Y") < 2019)

# Creating a subset that only includes 2018
DF_ZA_sorted_2018 <- DF_ZA_sorted %>% filter(format(as.Date(date), "%Y") == 2018)

# Adding variables to the frame to use for later
DF_ZA_sorted_2018_mutated <- DF_ZA_sorted_2018 %>%
  mutate(day = lubridate::wday(date, label = TRUE)) %>%
  mutate(month = lubridate::month(date, label = TRUE)) %>%
  mutate(package_download = as.factor(package))

• All of the above was just to create some year frames along with some basic mutations to get the graphs going.

head(DF_ZA_sorted_2018)
##       X.1     X       date     time   size r_version r_arch      r_os
## 1 2869185 28450 2018-01-01 00:09:22 445258     3.4.2 x86_64   mingw32
## 2 2869186 27342 2018-01-01 04:44:46  44565     3.4.2 x86_64 linux-gnu
## 3 2869187 27343 2018-01-01 04:44:46 447441     3.4.2 x86_64 linux-gnu
## 4 2869188 27344 2018-01-01 04:44:47 305818     3.4.2 x86_64 linux-gnu
## 5 2869189 27345 2018-01-01 04:44:47 193855     3.4.2 x86_64 linux-gnu
## 6 2869190 27346 2018-01-01 04:44:48  38579     3.4.2 x86_64 linux-gnu
##      package version country ip_id
## 1  animation     2.5      ZA 15265
## 2 assertthat   0.2.0      ZA   798
## 3 colorspace   1.3-2      ZA   798
## 4        cli   1.0.0      ZA   798
## 5       utf8   1.1.2      ZA   798
## 6       mime     0.5      ZA   798

• So this is the final dataframe for 2018.

Monthly downloads (2018)

library(tidyverse)
# turn off scientific notation like 1e+06
options(scipen = 999)

Total_monthly <- DF_ZA_sorted_2018_mutated %>%
  group_by(month) %>%
  summarize(downloaded = n())

Total_monthly %>%
  ggplot(aes(x = month, y = downloaded, group = 1)) +
  geom_area(color = "#ba6072", fill = "#ba6072") +
  geom_point() +
  geom_line() +
  labs(x = "Month of the year", y = "Total number of packages downloaded") +
  theme(axis.text.y = element_text(angle = 90, hjust = 1), legend.position = "none") +
  ggtitle("Total packages downloaded per month for 2018")

Looking at the monthly downloads for 2018 we have the following image:
• There is quite a bit of variance.
• It is evident that the second half of the year had far more downloads.
• It is quite interesting to see the mid-year dip and then the immense uptake afterwards.
• In December there is an extreme drop, back down to close to the value we saw in January.

If we zoom in and look at the individual days instead of the months we end up with the following picture.

Day of the year plotting (2018)

Total_daily <- DF_ZA_sorted_2018_mutated %>%
  group_by(date) %>%
  summarize(downloaded = n())

Total_daily %>%
  ggplot(aes(x = date, y = downloaded, fill = downloaded, group = 1)) +
  geom_area(color = "#135c89", fill = "#135c89") +
  geom_point() +
  geom_line(group = 1, color = "black") +
  theme(
    legend.position = "none",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(angle = 90, hjust = 1)
  ) +
  labs(x = "Day of the year", y = "Total number of packages downloaded") +
  ggtitle("Total package downloaded per day for 2018")

• It is a rather jagged graph, but this is what the whole year's data looks like.
• We see more use within the second half of the year, but not much else from this view.
• In the next graph it will become obvious why we have this jaggedness.

Day of the week plotting (2018)

Total_day_count <- DF_ZA_sorted_2018_mutated %>%
  group_by(day) %>%
  summarize(downloaded = n())

Total_day_count %>%
  ggplot(aes(x = day, y = downloaded, group = 1)) +
  geom_area(color = "#00ffbf", fill = "#00ffbf") +
  geom_point() +
  geom_line(aes(day)) +
  labs(x = "Day of the week", y = "Total number of packages downloaded") +
  theme(axis.text.y = element_text(angle = 90, hjust = 1), legend.position = "none") +
  ggtitle("Package downloads grouped by day of the week for 2018")

Moving from day of the year to day of the week, we see the following trend, which also largely explains the jaggedness seen throughout:
• Monday is the most popular day (not completely surprising).
• Wednesday has a bit of a mid-week slump and then after a slight increase on Thursday there is a decline to the weekend.
• People are clearly not downloading as many packages on the weekend.

Decomposition of the year’s data (2015-2018)

We have seen some basic patterns in the data showing considerable variability depending on day of the week and month of the year. With all of this in mind – has 2018 been a good year? How does the actual trend look?

Total_daily <- DF_ZA_sorted %>%
  group_by(date) %>%
  summarize(downloaded = n())

Total_daily %>%
  ggplot(aes(x = date, y = downloaded, fill = downloaded, group = 1)) +
  geom_area(color = "#135c89", fill = "#135c89") +
  geom_point() +
  geom_line(group = 1, color = "black") +
  theme(
    legend.position = "none",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(angle = 90, hjust = 1)
  ) +
  labs(x = "", y = "Total number of packages downloaded") +
  ggtitle("Total package downloaded per day from 2015 to 2018")

To answer this, we need to do some basic time series decomposition. A time series can be thought of as the sum of a trend, a cyclical component, a seasonal component, and a remainder. In many cases the trend and cyclical components are grouped together into one, and we can represent the additive decomposition as follows: \[Y_{t} = T_{t} + S_{t} + R_{t}\]

To be able to decompose your time series and to estimate a seasonal effect you require a couple of years’ data as you cannot derive a seasonal effect from a single year’s data.

So to do the decomposition:
• I made use of the stats package. It has a convenient function called stl(), which implements STL as developed by Cleveland et al. (1990). STL (Seasonal-Trend decomposition procedure based on Loess) is a filtering procedure for decomposing a time series into trend, seasonal, and remainder components. STL is widely used, in part because of its availability in R today.

• In short - this method uses iterative Loess (locally weighted regression) smoothing to obtain an estimate of the trend and then Loess smoothing again to extract a changing additive seasonal component. The remainder is then the original data minus the seasonal and trend components.

• If you want to know more about it in detail you can look at the paper on STL by Cleveland et al. (1990)

Total_daily <- DF_ZA_sorted %>%
  group_by(date) %>%
  summarize(downloaded = n())

Total_daily_for_ts <- Total_daily

library(xts)
library(forecast)
time_series_2015_2018 <- ts(Total_daily_for_ts$downloaded, frequency = 365, start = c(2015, 1))
# time_series_2015_2018 %>% Acf(plot = FALSE) %>% autoplot()
library(stats)
Decomposition_2015_2018 <- time_series_2015_2018 %>%
  stl(s.window = "periodic") %>%
  autoplot()
Decomposition_2015_2018

So in terms of the actual results:

Trend:
2018 was definitely a great year. There is clear growth, and we can see that from late 2016 the uptake has been persistent and has only grown stronger.

Seasonal & Remainder:
• The seasonal pattern we saw with just 2018 seems to be partially evident here, specifically the drop in January and December.

• The seasonal component suggests that, overall, more happens in the first six months of the year than in the last six. Looking at the data we know this was a bit of a change for 2018, and we also see it clearly in the remainder extracted for 2018. This is likely just concerted additional use.

• I tried to look online for specific events or activities that could have pushed this spike, but there was nothing I could easily link on a temporal level.

Likely what is driving this strong ongoing trend is:
• More and more academic institutes are using and training students in R.
• Businesses have also started seeing the value of R and are likely pushing for it.
• The general drive to more data science.
• The strong presence that R maintains within data science communities online.

Package downloads

Alright, so now with the total number of downloads out of the way we can get into the packages!

knitr::include_graphics("Packages_are_coming.png")

Package_overall <- DF_ZA_sorted_2018_mutated %>%
  group_by(package) %>%
  summarize(downloaded = n())

# number of unique packages
nrow(Package_overall)
## [1] 14156
# average number of times a package was downloaded in 2018
mean(Package_overall$downloaded)
## [1] 203.4696
# the median is far lower than the mean - we have a very long tail
median(Package_overall$downloaded)
## [1] 23
# sd is further evidence of that
sd(Package_overall$downloaded)
## [1] 1429.088
# sorting dataframe to include number downloaded by package and a rank
Package_overall_sorted <- DF_ZA_sorted_2018_mutated %>%
  group_by(package) %>%
  summarize(downloaded = n()) %>%
  arrange(desc(downloaded))
Package_overall_sorted <- Package_overall_sorted %>% mutate(rank = 1:n())

• For 2018 in South Africa we had 14156 unique packages downloaded.
• On average, a respective package was downloaded about 203 times.
• However, the median was \(\color{red}{\text{23}}\).

• Sorting the packages by number of downloads (descending) and plotting them reveals a graph with a sharp peak and a tail that goes on forever (a rough sketch of that plot follows after these points).

• Essentially, we have a small subset of packages that are downloaded a lot while we have an enormous set that is downloaded very infrequently.
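A rough sketch of that long-tail plot, built from the Package_overall_sorted frame above (a log scale on the y-axis keeps the tail visible):

Package_overall_sorted %>%
  ggplot(aes(x = rank, y = downloaded)) +
  geom_line(color = "#135c89") +
  scale_y_log10() +
  labs(x = "Package rank (by downloads)", y = "Total downloads in 2018 (log scale)") +
  ggtitle("All 14156 packages: a sharp peak and a very long tail")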

Top 100 Package downloads

Package_overall_sorted_100 <- Package_overall_sorted[1:100, ]

Package_top100 <- Package_overall_sorted_100 %>%
  ggplot(aes(reorder(x = package, rank), y = downloaded, group = 1)) +
  geom_area(color = "#ba6072", fill = "#ba6072") +
  geom_point() +
  geom_line(aes(package)) +
  theme(legend.position = "none") +
  labs(x = "Packages", y = "Total number of downloads") +
  theme(
    axis.text.y = element_text(angle = 90, hjust = 1),
    axis.text.x = element_text(angle = 90, hjust = 1),
    legend.position = "none"
  ) +
  ggtitle("Top 100 packages for 2018")

Package_top100

library(knitr)
knitr::kable(Package_overall_sorted_100)
package downloaded rank
Rcpp 36889 1
rlang 30500 2
stringi 30293 3
ggplot2 29210 4
tibble 27107 5
pillar 26885 6
cli 25126 7
glue 24934 8
digest 24931 9
stringr 24876 10
crayon 24712 11
utf8 24052 12
dplyr 23192 13
assertthat 23116 14
R6 23008 15
magrittr 22334 16
scales 21860 17
BH 21455 18
plyr 20654 19
reshape2 20550 20
readxl 20112 21
pkgconfig 19122 22
lazyeval 18877 23
viridisLite 18745 24
RColorBrewer 18737 25
munsell 18599 26
jsonlite 18522 27
yaml 18319 28
colorspace 18063 29
curl 17960 30
readr 17833 31
gtable 17557 32
knitr 17346 33
labeling 17192 34
hms 17065 35
withr 16902 36
mime 16688 37
purrr 16424 38
bindrcpp 16283 39
cellranger 16159 40
plogr 16033 41
tidyselect 15773 42
tidyr 15689 43
rematch 15422 44
htmltools 15364 45
evaluate 15074 46
bindr 15029 47
backports 15023 48
markdown 14429 49
highr 14306 50
rmarkdown 13950 51
base64enc 13937 52
data.table 13921 53
haven 13804 54
openssl 13552 55
fansi 13384 56
lubridate 12769 57
httr 12680 58
forcats 12508 59
rstudioapi 12481 60
rprojroot 12386 61
zoo 11823 62
tidyverse 11499 63
broom 11334 64
httpuv 10717 65
sp 10541 66
DBI 10512 67
shiny 10282 68
xml2 10134 69
lme4 9892 70
dichromat 9852 71
bitops 9595 72
car 9523 73
whisker 9495 74
tinytex 9342 75
gridExtra 9238 76
RcppEigen 9215 77
htmlwidgets 9142 78
callr 9001 79
caTools 8763 80
xtable 8740 81
quantreg 8467 82
xfun 8274 83
sourcetools 8162 84
processx 8144 85
devtools 8019 86
nloptr 7988 87
later 7912 88
maptools 7907 89
mvtnorm 7840 90
dbplyr 7836 91
rvest 7692 92
SparseM 7466 93
selectr 7403 94
abind 7367 95
reprex 7287 96
rJava 7204 97
xts 7145 98
MatrixModels 7101 99
modelr 7091 100

Network analysis - Doing something more with this data?

• As with the previous data, I wanted to do something more meaningful here, something that goes beyond the descriptive.
• That got me thinking of chains or more specifically edges and networks. I wasn’t sure if this would lead to anything meaningful but I attempted it nonetheless.

A basic example:

knitr::include_graphics("Chain.png")

We can take the above diagram and represent it within a network of connected nodes. Essentially, we have 5 nodes and we have an edge list (showing how the nodes are connected) with 4 entries. We can visualise it below:

library(visNetwork)
nodes <- data.frame(
  id = 1:5,
  color.background = c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#CCEEFF"),
  label = c("ggplot2", "car", "stats", "leaflet", "caret")
)
edges <- data.frame(from = c(1, 2, 3, 4), to = c(2, 3, 4, 5))
visNetwork(nodes, edges, width = "100%")

Depending on the edges we can have some nodes that are more connected (i.e. they have a higher degree):

library(visNetwork)
nodes <- data.frame(
  id = 1:5,
  color.background = c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#CCEEFF"),
  label = c("ggplot2", "car", "stats", "leaflet", "caret")
)
edges <- data.frame(from = c(1, 3, 4, 3), to = c(2, 4, 5, 5))
visNetwork(nodes, edges, width = "100%")

There is a lot more to network analysis and graph theory but essentially we will be using these basic concepts to build a far more complex network.

Deep into the network:

library(igraph)
library(qgraph)

DF_ZA_adapt <- DF_ZA_sorted_2018_mutated %>% mutate(month = lubridate::month(date))
DF_ZA_adapt <- DF_ZA_adapt %>% mutate(package_download = as.factor(package))

DF_ZA_adapt_wID <- DF_ZA_adapt %>% mutate(package_index = as.numeric(DF_ZA_adapt$package_download))

# Creating a unique ID across all the days of the year.
DF_ZA_adapt_wID <- DF_ZA_adapt_wID %>% mutate(
  UniqueD = paste(lubridate::yday(date), "_", ip_id, sep = "")
)

DF_ZA_adapt_wID_clean <- DF_ZA_adapt_wID %>% drop_na(package)

• Because ip_id is only unique within a day, we had to make it unique across the whole year. Essentially, we just took the day of the year and pasted it in front of the ip_id. With this simple ID creation we could move on.

list_temp <- unique(DF_ZA_adapt_wID_clean$UniqueD)
length(list_temp)

Accumulate_instance <- NULL

# Loop to create the correct stacking for my edge list
for (i in 1:length(list_temp)) {
  df_temp <- DF_ZA_adapt_wID_clean %>% filter(UniqueD == list_temp[i])

  df_temp_trail <- df_temp$package_index

  temp_instance <- cbind(
    df_temp_trail[-(length(df_temp_trail))], # need to cut the last entry from the first copy
    df_temp_trail[-1], # and the first entry from the second copy
    df_temp$UniqueD[-1]
  ) # just adding the ID so I have some way to link other things back

  Accumulate_instance <- rbind(Accumulate_instance, temp_instance)
}

• We filter the dataframe on each newly created ID, take that downloader's sequence of packages, and stack the shifted copies to create the edge list.

Accumulate_instance_check <- as.data.frame(Accumulate_instance) %>% drop_na()
head(Accumulate_instance_check)

Accumulate_instance_df <- as.data.frame(Accumulate_instance)
head(Accumulate_instance_df)

# save for later use
# write.csv(Accumulate_instance_df, here("Accumulate_instance_backup.csv"))
nodelist <- as.data.frame(DF_ZA_adapt_wID_clean$package_download)
nodelist <- cbind(nodelist, DF_ZA_adapt_wID_clean$package_index)

nodelist <- arrange(nodelist, nodelist$`DF_ZA_adapt_wID_clean$package_index`)

# Final node list creation
nodelist_unique <- unique(nodelist)
# write.csv(nodelist_unique, here("nodelist_unqiue.csv"))

# group the same to-from pairs so we know how often that specific chain occurred
edge_weights <- Accumulate_instance_df %>%
  setNames(c("FROM", "TO", "M_ID")) %>%
  group_by(TO, FROM) %>%
  summarize(n())

# the node list is our unique set of packages, used to map the numbers back to names
nodelist_dictionary <- nodelist_unique
nodelist_dictionary <- nodelist_dictionary[, 2:1]
names(nodelist_dictionary) <- c("ID", "Name")
# Tables on from and to
library(tidyverse)
library(data.table)

set.seed(1337)

# read created file
DF_Accumulate_instance <- read.csv(here("Accumulate_instance_backup.csv"))

# give columns more meaningful names
names(DF_Accumulate_instance) <- c("L_ID", "FROM", "TO", "M_ID")

## Edge-

# group the same to-from pairs so we know how often that specific chain occurred
edge_weights <- DF_Accumulate_instance %>% group_by(TO, FROM) %>% summarize(n())

# read in the nodelist or our unique set of packages to reference our numbers back
nodelist_dictionary <- read.csv(here("nodelist_unqiue.csv"))
nodelist_dictionary <- nodelist_dictionary[, -1]
nodelist_dictionary <- nodelist_dictionary[, 2:1]
names(nodelist_dictionary) <- c("ID", "Name")


library(expss)
# look up the unique number to find the corresponding package name for the TO set
TO_lookup <- expss::vlookup(
  lookup_value = edge_weights$TO,
  dict = nodelist_dictionary,
  lookup_column = 1,
  result_column = 2
)

# look up the unique number to find the corresponding package name for the FROM set
FROM_lookup <- expss::vlookup(
  lookup_value = edge_weights$FROM,
  dict = nodelist_dictionary,
  lookup_column = 1,
  result_column = 2
)

# combine the new lookups with the actual edge_weights
edge_weights_matched <- cbind.data.frame(edge_weights, TO_lookup, FROM_lookup)

edge_weights_prefinal <- edge_weights_matched[, -1:-2]
names(edge_weights_prefinal) <- c("weight", "to", "from")
# reorder frame in the right format
edge_weights_prefinal <- cbind(
  edge_weights_prefinal[, 2:3],
  edge_weights_prefinal[, 1]
)
names(edge_weights_prefinal) <- c("to", "from", "weight")

Final_edge_weights <- edge_weights_prefinal

# Nodes
# It is unnecessary to create this if you already have the full nodelist_dictionary.
# It is only required if you trim or cut the network down somehow, leaving some nodes unlinked.

pre_local_index <- edge_weights_prefinal %>%
  select(to) %>%
  bind_rows(
    edge_weights_prefinal %>%
      transmute(to = from)
  ) %>%
  unique()

new_id_for_index <- 1:nrow(pre_local_index)
mid_local_index <- cbind(pre_local_index, new_id_for_index)

Final_node_list <- mid_local_index
# Graphing the first network with visNetwork
library(igraph)
library(visNetwork)

names(Final_edge_weights) <- c("Target", "Source", "Weight")
names(Final_node_list) <- c("name", "id")

Original_network <- graph_from_data_frame(
  d = Final_edge_weights,
  vertices = Final_node_list
)

# test.visn <- toVisNetworkData(Original_network)

# visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
#  visIgraphLayout(layout = "layout_with_fr") %>%
#  visNodes(color = list(
#    background = "lightblue",
#    border = "lancet",
#    highlight = "yellow"
#  )) %>%
#  visOptions(
#    highlightNearest = TRUE,
#    nodesIdSelection = TRUE,
#    collapse = TRUE
#  ) %>%
# visNodes(size = 10)

• I just saved a visual of the network as a .png as it is an extremely large object.

knitr::include_graphics("original_network.png")

• This network is not really interpretable. It does, however, show you the data in its rawest form and the magnitude of what we are sitting with.

# Since we technically have an undirected graph setup we should create a final weight
# based on the sum of the “to-from” edge that corresponds to a “from-to” for the same two packages.
Special_edge_weights_filtered <- Final_edge_weights %>%
  rowwise() %>%
  mutate(temp = paste(sort(x = c(Source, Target)),
    collapse = " "
  )) %>%
  ungroup() %>%
  group_by(temp) %>%
  mutate(result = sum(Weight)) %>%
  ungroup() %>%
  select(-temp)

Special_edge_weights_filtered <- Special_edge_weights_filtered[, -3]
names(Special_edge_weights_filtered) <- c("Target", "Source", "Weight")


# Reducing our edge list with weights based on size of the weight and cutting redundant self reference nodes
Final_edge_weights_filtered <- Special_edge_weights_filtered %>%
  filter(Weight > 100) %>%
  filter(Target != Source)

# We have to create a new node list given that some small nodes will drop out and
# we don't want empty node points 

pre_local_index2 <- Final_edge_weights_filtered %>%
  select(Source) %>%
  bind_rows(
    Final_edge_weights_filtered %>%
      transmute(Source = Target)
  ) %>%
  unique()

pre_local_index2 <- pre_local_index2 %>% arrange(Source)

new_id_for_index <- 1:nrow(pre_local_index2)
mid_local_index <- cbind(pre_local_index2, new_id_for_index)

Final_node_list_2 <- mid_local_index
names(Final_node_list_2) <- c("name", "id")


Adjusted_network <- graph_from_data_frame(
  d = Final_edge_weights_filtered,
  vertices = Final_node_list_2
)

test.visn <- toVisNetworkData(Adjusted_network)


visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE,
    collapse = TRUE
  ) %>%
  visNodes(size = 10)
# To get larger nodes based on degree value
graph <- graph_from_data_frame(Final_edge_weights_filtered, directed = TRUE)
degree_value <- degree(graph, mode = "all")
Final_node_list_2$value <- degree_value[match(Final_node_list_2$name, names(degree_value))]


Final_network <- graph_from_data_frame(
  d = Final_edge_weights_filtered,
  vertices = Final_node_list_2
)

test.visn <- toVisNetworkData(Final_network)

# In order to change line thickness we associate the weights with the value      
test.visn$edges$value <- Final_edge_weights_filtered$Weight

# Creating my colour palette by interpolating between three base colours.
rbPal <- colorRampPalette(c("#12cef0", "#f395c3", "#6f1383"))

# to add colour to our test.visn we need to add it for the color.background in nodes
test.visn$nodes$color.background <- rbPal(nrow(test.visn$nodes))[as.numeric(cut(test.visn$nodes$value * 100, breaks = nrow(test.visn$nodes)))]

#Visualising final network: 
visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE,
    collapse = TRUE
  )

If you would like to read more about the layout algorithm you can read: Fruchterman & Reingold (1991)
This is a reduced version of the original. We pruned the network by excluding edge weights smaller than 100.

If an edge occurred fewer than 100 times, it was dropped from the visualisation. I produced two versions above: one without any alterations beyond the weight filter, while the other had a couple of changes:
• Node size dependent on degree.
• Colouring dependent on node size.
• Line thickness dependent on edge list weight.

Even though we sit with a far smaller set of packages it is still rather complex.

Things you would want to evaluate are (a small igraph sketch follows after these points):
• How well connected is a respective package?
• Which packages are directly adjacent to the selected node and which are secondary?
• Where are the strongest ties with the respective package? What functions do they bring to the main selected package?
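A small sketch of how you might start answering these questions with igraph, using the Final_network object built above (the package name is a hypothetical choice and assumes it survived the weight filter):

pkg <- "ggplot2" # hypothetical example package

# how many other packages share an edge with it
degree(Final_network, v = pkg)

# degree weighted by how often those co-download chains occurred
strength(Final_network, vids = pkg, weights = E(Final_network)$Weight)

# the directly adjacent (first-order) packages
neighbors(Final_network, v = pkg, mode = "all")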

What does this type of visualisation offer?
• You can see packages that are often downloaded with your favourite packages and explore those relationships.
• It gives developers an idea of how their package is used in conjunction with others.

Conclusion - 5 take-away points:

1. R is definitely growing in South Africa and the momentum is there.

It is likely that this momentum will continue and we as a community can do a lot to foster that growth through a variety of activities. Be it conferences like today, meetups, tutorials, etc. Decide if you want to be more involved and think about the how!

2. We finally have a more concrete idea of what R usage looks like in South Africa and how it is currently functioning.

To some extent, we did not have such a concrete representation before this point. There is likely far more to uncover still from this data which is available to all of us.

3. If the network analysis was at all overwhelming for you, then in some ways it performed its very function.

This might sound like a counterintuitive point but this visualisation has shown you a glimpse of how complex the landscape is and how difficult the task is to navigate that landscape.

This network ties in nicely with a point made by Silge, Nash, & Graves (2018) in their article Navigating the R Package Universe:

“Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other.”

This is not meant to deter people from downloading packages but to give some thought of how best to navigate this space - what will your strategy be?

For many people, it would seem package selection comes down to the social media we consume, who we follow, and who we interact with regularly:

knitr::include_graphics("Discovering_packages.png")

• This was part of an online survey (n = 1039) conducted by Silge, Nash, and Graves (2018) for their paper.

4. It is likely that the package proliferation will continue and it will become even more difficult to select packages.

• On the one hand, I am sure we are mostly in agreement that the contributions from various communities and individuals are a phenomenal side of both the open source and R communities. We are spared many, many hours by not having to develop a host of things from scratch.

• On the other hand, there are in some instances so many good alternatives that the choice becomes extremely difficult. How many packages' documentation and examples will you go through before selecting a package? There is something of a decision fatigue to be noted here. I suppose all I want to say is: being spoiled for choice can spoil the choice.

Why not simple unification?
• “A real downside of unification is the burden of work for the developers involved, including dealing with complex dependencies and reverse dependencies” (Silge, Nash, & Graves, 2018)

5. We now know something about the relative merits of the packages based on downloads but that is not sufficient.

This is too narrow a metric for a variety of reasons. What is the way forward? There doesn't seem to be any consensus on this matter.

Some have opted to create metrics or platforms by which to search and evaluate packages. Just to list a few (a small example of the first follows after the list):
• R Site Search by Baron (or via RSiteSearch() function in utils (R Core Team, 2018))
• CRAN Task Views (Zeileis, 2005)
• packagemetrics (Firke et al., 2018)
• CRANsearcher RStudio addin (Krouse and Calatroni, 2017)
• rseek site of Goodman (2007)
• R Package Documentation site of Howson (2018)
• RDocumentation.org
• METACRAN (Csárdi and Ooms)
• crantastic (Wickham and Mæland)
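As a tiny example of the first option, RSiteSearch() can be called straight from the console (it opens the results in your browser):

# search R's online documentation for a topic of interest
utils::RSiteSearch("seasonal decomposition")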

Others emphasise the value of communities and investing in them:
• “Our survey does show that personal recommendations have been important for many individuals in evaluating R packages. This is yet another area where local user groups can continue to have important impact…Our survey and discussions show how impactful these community networks are; investing in community building is not something we need do only because of idealism, but because it is effective.” (Silge, Nash, & Graves, 2018)

Going forward

• I’ll likely deploy a more interactive version of this network.

• Something else I have been thinking about is developing a shiny app mini game. So one take on classifying functions in R has been through a shiny app developed by Lucy McGowan:

https://lucy.shinyapps.io/classify/

Something similar could be done for R packages.

knitr::include_graphics("End.png")