So before I get started on my talk - a little about me:
• I am originally from a research psychology background.
• I completed my master’s degree in research psychology not too long ago here in Johannesburg.
• For the last couple of years, I have been working at Kantar, where I am currently a senior analytics researcher.
knitr::include_graphics("Kantar_analytics_practice.png")
knitr::include_graphics("Kantar_Global.png")
knitr::include_graphics("Kantar_practice_offer.png")
• At Kantar we do a lot of market research and typically assist companies with their business problems and data science needs.
• We are one of the largest market research companies globally, with offices around the world.
• Depending on the client and the problem, we offer a host of analytical solutions.
knitr::include_graphics("2017_2018downloadsR.png")
Original source: https://acastillogill.com/2018/12/linear-regression-r-downloads-around-the-world/
• So the inspiration for this topic came early in January, when I saw a post on Twitter about R downloads around the world.
• From that data I could verify that the final number of downloads was: \(\color{green}{\text{6752}}\).
• Beyond that, one cannot say much about this figure, except that downloads seem to be approaching a ceiling at this level.
This got me thinking - are there other ways in which to measure growth and usage of R?
• It turns out there are. While exploring the data they used, I saw that we also have access to CRAN’s logs for package downloads globally.
Package downloads by day can tell us far more about how often R is being used and what it is being used for. R downloads only tell us how many people have downloaded it, not whether or how they are using it: someone can easily download R and never use it. Package downloads, on the other hand, are a far better indicator of actual usage.
It is unlikely that you will stay with only a small subset of packages throughout your data analysis journey in R. As you encounter new problems and challenges, a different package can often provide the support and the specific techniques you require to address the problem at hand. Thus, as more people use R, and as some people use R more, we should see more package downloads.
So the files can be downloaded directly from: http://cran-logs.rstudio.com/
Each file contains the following variables (a quick look at one day’s file follows below):
1) date
2) time (in UTC)
3) size (in bytes)
4) r_version, the version of R used to download the package
5) r_arch (i386 = 32 bit, x86_64 = 64 bit)
6) r_os (darwin9.8.0 = mac, mingw32 = windows)
7) package
8) country, two letter ISO country code (geocoded from IP using MaxMind’s free database)
9) ip_id, a daily unique id assigned to each IP address
• It should be noted that the ip_id variable is our only way to track people, but only \(\color{red}{\text{on a daily level and not over weeks or months.}}\)
• They have opted for this approach in order to allow for some tracking while prioritising user privacy and being cognisant of the issues surrounding it.
• Since \(\color{red}{\text{ip_id is based off an IP address}}\), we also have no way of knowing whether 1 or 4 people are behind a set of downloads for a day.
• Both of these points are methodological limitations, but there are still ways to work with and evaluate the data.
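To get a feel for the structure, here is a minimal sketch of pulling in a single day’s log and inspecting it (the date is just an arbitrary example):
download.file("http://cran-logs.rstudio.com/2018/2018-01-01.csv.gz",
  "2018-01-01.csv.gz",
  mode = "wb"
)
one_day <- read.csv("2018-01-01.csv.gz", stringsAsFactors = FALSE)
# shows the variables listed above (plus the package version)
str(one_day)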
With this in mind some of the questions I had were:
1) Are there any specific trends that we can uncover across the year?
2) Is the use of R growing within SA? If so, what does that growth look like?
3) Which are the most popular downloaded packages in SA?
4) Which packages are often downloaded together? Can we somehow visualise this in a meaningful way?
library(tidyverse)
# Downloading all the logs from the start of 2015 to the start of 2019
start <- as.Date("2015-01-01")
end <- as.Date("2019-01-01")
all_days <- seq(start, end, by = "day")
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0("http://cran-logs.rstudio.com/", year, "/", all_days, ".csv.gz")
# only download the files you don't have yet:
if (!dir.exists("CRANlogs")) dir.create("CRANlogs")
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))
missing_urls <- urls[match(missing_days, as.character(all_days))]
for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(missing_urls[i], paste0("CRANlogs/", missing_days[i], ".csv.gz"))
}
• I downloaded all the files from 2015-01-01 to 2019-01-01.
• This amounted to 1462 files and 17.6 GB compressed.
You might be asking: if this talk is about 2018, why download the additional data?
• I needed the additional data for the time series decomposition.
• Since there will be some seasonality effects at play, you cannot plot a trend line across a single year of data and take it at face value. But more on this later!
library(tidyverse)
library(data.table)
library(here)
# list all the log files we downloaded into the CRANlogs folder
file_list <- list.files("CRANlogs", full.names = TRUE)
# split the full list into partitions of 30 files each (an arbitrary size), giving 49 partitions
# the next step is rather memory intensive, hence the splitting
file_list_split <- split(file_list, ceiling(seq_along(file_list) / 30))
# a loop repeating for the total length of our file split list
for (i in seq_along(file_list_split)) {
# pull this partition out as a plain character vector of file paths
files_to_read <- unlist(file_list_split[i])
# creating an empty list to save the new files
logs_temp <- list()
# inner loop reading the respective days in the set
for (file in files_to_read) {
# printing in the loop so we can keep track of where we are
print(paste("Reading", file, "..."))
# the actual read.table() for the day
logs_temp[[file]] <- read.table(file,
header = TRUE, sep = ",",
quote = "\"", dec = ".", fill = TRUE,
comment.char = "", as.is = TRUE
)
}
# binding all the days together in the set
pulled_data <- rbindlist(logs_temp)
# filtering the full frame to get rid of other countries
data_to_write <- pulled_data %>% filter(country == "ZA")
# saving files in a different directory so we can easily work with the extract
write.csv(data_to_write,
file = paste(here("write_location"), "/test", i, ".csv", sep = "")
)
# clearing the memory of these massive files
rm("data_to_write", "logs_temp", "pulled_data")
}
• I took the list of 1462 files and created a smaller set of lists consisting of 30 days each (30 x 49 = 1470 files).
• Each small set was read in, the data combined into one file, filtered on South Africa and saved.
• After extracting all the entries, it was a simple task of combining the 49 files into one final data file.
setwd(here("write_location"))
to_merge_file_list <- list.files(full.names = TRUE)
# creating an empty list to save the new files
logs_reduced <- list()
# same as before but we are reading all partitions we have kept
for (file in to_merge_file_list) {
print(paste("Reading", file, "..."))
logs_reduced[[file]] <- read.table(file,
header = TRUE, sep = ",",
quote = "\"", dec = ".", fill = TRUE,
comment.char = "", as.is = TRUE
)
}
# binding all parts together for the final frame
DF_ZA <- rbindlist(logs_reduced)
# write.csv(DF_ZA,"2018_CRAN_DOWNLOADS_forZA.csv")
DF_ZA_sorted <- DF_ZA %>% arrange(date, time)
setwd(here())
# write.csv(DF_ZA_sorted,"2015_2018_CRAN_DOWNLOADS_forZA_sorted.csv")
• My final dataset consisted of 5 749 500 observations. 2018 constituted 2 880 316 of these downloads. May that act as some \(\color{red}{\text{terrible foreshadowing}}\).
library(tidyverse)
library(data.table)
library(here)
DF_ZA_sorted <- read.table(here("2015_2018_CRAN_DOWNLOADS_forZA_sorted.csv.gz"),
header = TRUE, sep = ",",
quote = "\"", dec = ".", fill = TRUE,
comment.char = "", as.is = TRUE
)
# Cutting off the single day of 2019 from the file
DF_ZA_sorted <- DF_ZA_sorted %>% filter(format(as.Date(date), "%Y") < 2019)
# Creating a subset that only includes 2018
DF_ZA_sorted_2018 <- DF_ZA_sorted %>% filter(format(as.Date(date), "%Y") == 2018)
# Adding variables to the frame to use for later
DF_ZA_sorted_2018_mutated <- DF_ZA_sorted_2018 %>%
mutate(day = lubridate::wday(date, label = TRUE)) %>%
mutate(month = lubridate::month(date, label = TRUE)) %>%
mutate(package_download = as.factor(package))
• All of the above was just to create some year frames along with some basic mutations to get the graphs going.
head(DF_ZA_sorted_2018)
## X.1 X date time size r_version r_arch r_os
## 1 2869185 28450 2018-01-01 00:09:22 445258 3.4.2 x86_64 mingw32
## 2 2869186 27342 2018-01-01 04:44:46 44565 3.4.2 x86_64 linux-gnu
## 3 2869187 27343 2018-01-01 04:44:46 447441 3.4.2 x86_64 linux-gnu
## 4 2869188 27344 2018-01-01 04:44:47 305818 3.4.2 x86_64 linux-gnu
## 5 2869189 27345 2018-01-01 04:44:47 193855 3.4.2 x86_64 linux-gnu
## 6 2869190 27346 2018-01-01 04:44:48 38579 3.4.2 x86_64 linux-gnu
## package version country ip_id
## 1 animation 2.5 ZA 15265
## 2 assertthat 0.2.0 ZA 798
## 3 colorspace 1.3-2 ZA 798
## 4 cli 1.0.0 ZA 798
## 5 utf8 1.1.2 ZA 798
## 6 mime 0.5 ZA 798
• So this is the final dataframe for 2018.
library(tidyverse)
# turn off scientific notation like 1e+06
options(scipen = 999)
Total_monthly <- DF_ZA_sorted_2018_mutated %>%
group_by(month) %>%
summarize(downloaded = n())
Total_monthly %>%
  ggplot(aes(x = month, y = downloaded, group = 1)) +
  geom_area(color = "#ba6072", fill = "#ba6072") +
  geom_point() +
  geom_line() +
  labs(x = "Month of the year", y = "Total number of packages downloaded") +
  theme(axis.text.y = element_text(angle = 90, hjust = 1), legend.position = "none") +
  ggtitle("Total packages downloaded per month for 2018")
Looking at the monthly downloads for 2018 we have the following image:
• There is quite a bit of variance.
• It is evident that the second half of the year had far more downloads.
• It is quite interesting to see the dip taking place mid-year and then the immense uptake afterwards.
• When we get to December there is an extreme drop, down to close to the value we see in January.
If we zoom in and look at the individual days instead of the months we end up with the following picture.
Total_daily <- DF_ZA_sorted_2018_mutated %>%
group_by(date) %>%
summarize(downloaded = n())
Total_daily %>%
  ggplot(aes(x = date, y = downloaded, group = 1)) +
  geom_area(color = "#135c89", fill = "#135c89") +
  geom_point() +
  geom_line(color = "black") +
  theme(
    legend.position = "none",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(angle = 90, hjust = 1)
  ) +
  labs(x = "Day of the year", y = "Total number of packages downloaded") +
  ggtitle("Total packages downloaded per day for 2018")
• It is a rather jagged graph, but this is what the whole year’s data looks like.
• We see more use within the second half of the year, but not much else from this view.
• In the next graph it will become obvious why we have this jaggedness.
Total_day_count <- DF_ZA_sorted_2018_mutated %>%
group_by(day) %>%
summarize(downloaded = n())
Total_day_count %>%
  ggplot(aes(x = day, y = downloaded, group = 1)) +
  geom_area(color = "#00ffbf", fill = "#00ffbf") +
  geom_point() +
  geom_line() +
  labs(x = "Day of the week", y = "Total number of packages downloaded") +
  theme(axis.text.y = element_text(angle = 90, hjust = 1), legend.position = "none") +
  ggtitle("Package downloads grouped by day of the week for 2018")
Moving from day of the year to day of the week we can see the following trend and this also explains to a large extent the jaggedness we see throughout:
• Monday is the most popular day (not completely surprising).
• Wednesday has a bit of a mid-week slump and then after a slight increase on Thursday there is a decline to the weekend.
• People are clearly not downloading as many packages on the weekend.
We have seen some basic patterns in the data showing considerable variability depending on day of the week and month of the year. With all of this in mind – has 2018 been a good year? How does the actual trend look?
Total_daily <- DF_ZA_sorted %>%
group_by(date) %>%
summarize(downloaded = n())
Total_daily %>%
  ggplot(aes(x = date, y = downloaded, group = 1)) +
  geom_area(color = "#135c89", fill = "#135c89") +
  geom_point() +
  geom_line(color = "black") +
  theme(
    legend.position = "none",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(angle = 90, hjust = 1)
  ) +
  labs(x = "", y = "Total number of packages downloaded") +
  ggtitle("Total packages downloaded per day from 2015 to 2018")
In order to answer this, we need to do some basic decomposition of the time series. A time series can be thought of as consisting of a trend, a cyclical component, a seasonal component, and a remainder. In some instances the trend and cyclical components are grouped together into one, and we can then represent the series as: \[Y_{t} = T_{t} + S_{t} + R_{t}\]
To be able to decompose your time series and estimate a seasonal effect, you need a couple of years’ worth of data - you cannot derive a seasonal effect from a single year.
So to do the decomposition:
• I made use of the stats package, which has a convenient function called stl() based on STL as developed by Cleveland et al. (1990). STL (Seasonal-Trend decomposition procedure based on Loess) is a filtering procedure for decomposing a time series into trend, seasonal, and remainder components. STL is widely used, in part because of its availability in R.
• In short, this method uses iterative Loess (locally weighted regression) smoothing to obtain an estimate of the trend, and then Loess smoothing again to extract a changing additive seasonal component. The remainder is the original data minus the seasonal and trend components.
• If you want to know more of the detail, have a look at the STL paper by Cleveland et al. (1990).
Total_daily <- DF_ZA_sorted %>%
group_by(date) %>%
summarize(downloaded = n())
Total_daily_for_ts <- Total_daily
library(xts)
library(forecast)
time_series_2015_2018 <- ts(Total_daily_for_ts$downloaded, frequency = 365, start = c(2015, 1))
# time_series_2015_2018 %>% Acf(plot = FALSE) %>% autoplot()
library(stats)
Decomposition_2015_2018 <- time_series_2015_2018 %>%
stl(s.window = "periodic") %>%
autoplot()
Decomposition_2015_2018
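As an aside, if you want the estimated components as numbers rather than only as a plot, they can be pulled straight out of the stl fit. A minimal sketch using the objects created above (the Decomposition_fit and components names are only for this illustration):
Decomposition_fit <- stl(time_series_2015_2018, s.window = "periodic")
# the fit stores the components as a multiple time series: seasonal, trend and remainder
components <- as.data.frame(Decomposition_fit$time.series)
head(components)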
So in terms of the actual results:
Trend:
• 2018 was definitely a great year. There is clear growth, and we can see that the uptake has been persistent since late 2016 and has only become stronger.
Seasonal & Remainder:
• The seasonal pattern we saw with just 2018 seems to be partially evident here, specifically the drop in January and December.
• The seasonal component suggests that, overall, more happens in the first 6 months of the year than in the last 6. From the data we know 2018 was a bit of a change in this regard, and we see that clearly in the remainder extracted for 2018. This is likely just concerted additional use.
• I tried to find events or activities online that could have pushed this spike, but there was nothing I could easily link on a temporal level.
Likely what is driving this strong ongoing trend is:
• More and more academic institutions are using R and training their students in it.
• Businesses have also started seeing the value of R and are likely pushing for it.
• The general drive to more data science.
• The strong presence that R maintains within data science communities online.
Alright, so now with the total number of downloads out of the way we can get into the packages!
knitr::include_graphics("Packages_are_coming.png")
Package_overall <- DF_ZA_sorted_2018_mutated %>%
group_by(package) %>%
summarize(downloaded = n())
# number of unique packages
nrow(Package_overall)
## [1] 14156
# average number of times a package was downloaded in 2018
mean(Package_overall$downloaded)
## [1] 203.4696
# the median is far lower than the mean - we have a very long tail
median(Package_overall$downloaded)
## [1] 23
# sd is further evidence of that
sd(Package_overall$downloaded)
## [1] 1429.088
# sorting dataframe to include number downloaded by package and a rank
Package_overall_sorted <- DF_ZA_sorted_2018_mutated %>%
group_by(package) %>%
summarize(downloaded = n()) %>%
arrange(desc(downloaded))
Package_overall_sorted <- Package_overall_sorted %>% mutate(rank = 1:n())
• For 2018 in South Africa we had 14156 unique packages downloaded.
• On average, a respective package was downloaded about 203 times.
• However, the median was \(\color{red}{\text{23}}\).
• Sorting the packages based on number of downloads (descending) and plotting it reveals a graph with a sharp peak and a tail that goes on forever.
• Essentially, we have a small subset of packages that are downloaded a lot, while an enormous set is downloaded very infrequently - which we can quantify with the quick check below.
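A quick sketch (using the Package_overall_sorted frame created above) of just how concentrated the downloads are - the share of all 2018 downloads accounted for by the top 100 packages (the top100_share name is only for this illustration):
top100_share <- Package_overall_sorted %>%
  mutate(cum_share = cumsum(downloaded) / sum(downloaded)) %>%
  slice(100) %>%
  pull(cum_share)
top100_share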
Package_overall_sorted_100 <- Package_overall_sorted[1:100, ]
Package_top100 <- Package_overall_sorted_100 %>%
  ggplot(aes(x = reorder(package, rank), y = downloaded, group = 1)) +
  geom_area(color = "#ba6072", fill = "#ba6072") +
  geom_point() +
  geom_line() +
  labs(x = "Packages", y = "Total number of downloads") +
  theme(
    axis.text.y = element_text(angle = 90, hjust = 1),
    axis.text.x = element_text(angle = 90, hjust = 1),
    legend.position = "none"
  ) +
  ggtitle("Top 100 packages for 2018")
Package_top100
library(knitr)
knitr::kable(Package_overall_sorted_100)
package | downloaded | rank |
---|---|---|
Rcpp | 36889 | 1 |
rlang | 30500 | 2 |
stringi | 30293 | 3 |
ggplot2 | 29210 | 4 |
tibble | 27107 | 5 |
pillar | 26885 | 6 |
cli | 25126 | 7 |
glue | 24934 | 8 |
digest | 24931 | 9 |
stringr | 24876 | 10 |
crayon | 24712 | 11 |
utf8 | 24052 | 12 |
dplyr | 23192 | 13 |
assertthat | 23116 | 14 |
R6 | 23008 | 15 |
magrittr | 22334 | 16 |
scales | 21860 | 17 |
BH | 21455 | 18 |
plyr | 20654 | 19 |
reshape2 | 20550 | 20 |
readxl | 20112 | 21 |
pkgconfig | 19122 | 22 |
lazyeval | 18877 | 23 |
viridisLite | 18745 | 24 |
RColorBrewer | 18737 | 25 |
munsell | 18599 | 26 |
jsonlite | 18522 | 27 |
yaml | 18319 | 28 |
colorspace | 18063 | 29 |
curl | 17960 | 30 |
readr | 17833 | 31 |
gtable | 17557 | 32 |
knitr | 17346 | 33 |
labeling | 17192 | 34 |
hms | 17065 | 35 |
withr | 16902 | 36 |
mime | 16688 | 37 |
purrr | 16424 | 38 |
bindrcpp | 16283 | 39 |
cellranger | 16159 | 40 |
plogr | 16033 | 41 |
tidyselect | 15773 | 42 |
tidyr | 15689 | 43 |
rematch | 15422 | 44 |
htmltools | 15364 | 45 |
evaluate | 15074 | 46 |
bindr | 15029 | 47 |
backports | 15023 | 48 |
markdown | 14429 | 49 |
highr | 14306 | 50 |
rmarkdown | 13950 | 51 |
base64enc | 13937 | 52 |
data.table | 13921 | 53 |
haven | 13804 | 54 |
openssl | 13552 | 55 |
fansi | 13384 | 56 |
lubridate | 12769 | 57 |
httr | 12680 | 58 |
forcats | 12508 | 59 |
rstudioapi | 12481 | 60 |
rprojroot | 12386 | 61 |
zoo | 11823 | 62 |
tidyverse | 11499 | 63 |
broom | 11334 | 64 |
httpuv | 10717 | 65 |
sp | 10541 | 66 |
DBI | 10512 | 67 |
shiny | 10282 | 68 |
xml2 | 10134 | 69 |
lme4 | 9892 | 70 |
dichromat | 9852 | 71 |
bitops | 9595 | 72 |
car | 9523 | 73 |
whisker | 9495 | 74 |
tinytex | 9342 | 75 |
gridExtra | 9238 | 76 |
RcppEigen | 9215 | 77 |
htmlwidgets | 9142 | 78 |
callr | 9001 | 79 |
caTools | 8763 | 80 |
xtable | 8740 | 81 |
quantreg | 8467 | 82 |
xfun | 8274 | 83 |
sourcetools | 8162 | 84 |
processx | 8144 | 85 |
devtools | 8019 | 86 |
nloptr | 7988 | 87 |
later | 7912 | 88 |
maptools | 7907 | 89 |
mvtnorm | 7840 | 90 |
dbplyr | 7836 | 91 |
rvest | 7692 | 92 |
SparseM | 7466 | 93 |
selectr | 7403 | 94 |
abind | 7367 | 95 |
reprex | 7287 | 96 |
rJava | 7204 | 97 |
xts | 7145 | 98 |
MatrixModels | 7101 | 99 |
modelr | 7091 | 100 |
• As with the previous data, I wanted to do something more meaningful with this that goes beyond the descriptive.
• That got me thinking of chains, or more specifically edges and networks. I wasn’t sure if this would lead to anything meaningful, but I attempted it nonetheless.
knitr::include_graphics("Chain.png")
We can take the above diagram and represent it within a network of connected nodes. Essentially, we have 5 nodes and we have an edge list (showing how the nodes are connected) with 4 entries. We can visualise it below:
library(visNetwork)
nodes <- data.frame(
id = 1:5,
color.background = c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#CCEEFF"),
label = c("ggplot2", "car", "stats", "leaflet", "caret")
)
edges <- data.frame(from = c(1, 2, 3, 4), to = c(2, 3, 4, 5))
visNetwork(nodes, edges, width = "100%")
Depending on the edges we can have some nodes that are more connected (i.e. they have a higher degree):
library(visNetwork)
nodes <- data.frame(
id = 1:5,
color.background = c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#CCEEFF"),
label = c("ggplot2", "car", "stats", "leaflet", "caret")
)
edges <- data.frame(from = c(1, 3, 4, 3), to = c(2, 4, 5, 5))
visNetwork(nodes, edges, width = "100%")
There is a lot more to network analysis and graph theory but essentially we will be using these basic concepts to build a far more complex network.
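For example, the degree of each node in the small toy network above can be computed directly with igraph (a minimal sketch; the toy_graph name is only for this illustration):
library(igraph)
# build an undirected graph from the toy edge list and count each node's connections
toy_graph <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
degree(toy_graph) # nodes 3, 4 and 5 each have degree 2 here; nodes 1 and 2 have degree 1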
library(igraph)
library(qgraph)
library(lubridate) # for month() and yday() used below
DF_ZA_adapt <- DF_ZA_sorted_2018_mutated %>% mutate(month = month(date))
DF_ZA_adapt <- DF_ZA_adapt %>% mutate(package_download = as.factor(package))
# numeric index for each package, used later to build the edge list
DF_ZA_adapt_wID <- DF_ZA_adapt %>% mutate(package_index = as.numeric(package_download))
# Creating a unique ID across all the days of the year.
DF_ZA_adapt_wID <- DF_ZA_adapt_wID %>% mutate(
UniqueD =
paste(yday(DF_ZA_adapt_wID$date), "_", DF_ZA_adapt_wID$ip_id, sep = "")
)
DF_ZA_adapt_wID_clean <- DF_ZA_adapt_wID %>% drop_na(package)
• Because ip_id is only unique within a single day, we had to make it unique across the whole year. Essentially, we just pasted the day of the year in front of the ip_id. With this simple ID creation we could move on.
list_temp <- unique(DF_ZA_adapt_wID_clean$UniqueD)
length(list_temp)
Accumulate_instance <- NULL
# Loop to create the correct stacking for my edge list
for (i in 1:length(list_temp)) {
df_temp <- DF_ZA_adapt_wID_clean %>% filter(UniqueD == list_temp[i])
df_temp_trail <- df_temp$package_index
temp_instance <- cbind(
df_temp_trail[-(length(df_temp_trail))], # need to cut the last entry from the first copy
df_temp_trail[-1], # and the first entry from the second copy
df_temp$UniqueD[-1]
) # Just adding the ID so I have some way to link other things back
Accumulate_instance <- rbind(Accumulate_instance, temp_instance)
}
• We filter the dataframe on each newly created ID, take its sequence of packages, and pair consecutive downloads to create the edge list (see the toy illustration below).
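To make the pairing logic concrete, here is a toy illustration with a hypothetical download sequence (not taken from the real data):
# a hypothetical sequence of packages downloaded by one ip_id on one day
seq_example <- c("ggplot2", "dplyr", "tidyr")
# pairing each download with the next one gives the edges of the chain
cbind(from = seq_example[-length(seq_example)], to = seq_example[-1])
#      from      to
# [1,] "ggplot2" "dplyr"
# [2,] "dplyr"   "tidyr"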
Accumulate_instance_check <- as.data.frame(Accumulate_instance) %>% drop_na()
head(Accumulate_instance_check)
Accumulate_instance_df <- as.data.frame(Accumulate_instance)
head(Accumulate_instance_df)
# save for later use
# write.csv(Accumulate_instance_df, here("Accumulate_instance_backup.csv"))
# node list: each package name alongside its numeric index
nodelist <- data.frame(
  package_download = DF_ZA_adapt_wID_clean$package_download,
  package_index = DF_ZA_adapt_wID_clean$package_index
)
nodelist <- arrange(nodelist, package_index)
#Final node list creation
nodelist_unique <- unique(nodelist)
# write.csv(nodelist_unique, here("nodelist_unqiue.csv"))
# group identical to-from pairs so we know how often that specific chain occurred
names(Accumulate_instance_df) <- c("FROM", "TO", "M_ID")
edge_weights <- Accumulate_instance_df %>% group_by(TO, FROM) %>% summarize(n())
# read in the nodelist or our unique set of packages to reference our numbers back
nodelist_dictionary <- nodelist_unique
# nodelist_dictionary <- nodelist_dictionary[, -1]
nodelist_dictionary <- nodelist_dictionary[, 2:1]
names(nodelist_dictionary) <- c("ID", "Name")
# Tables on from and to
library(tidyverse)
library(data.table)
set.seed(1337)
# read created file
DF_Accumulate_instance <- read.csv(here("Accumulate_instance_backup.csv"))
# give columns more meaningful names
names(DF_Accumulate_instance) <- c("L_ID", "FROM", "TO", "M_ID")
## Edge list
# group identical to-from pairs so we know how often that specific chain occurred
edge_weights <- DF_Accumulate_instance %>% group_by(TO, FROM) %>% summarize(n())
# read in the node list (our unique set of packages) so we can map the numbers back to package names
nodelist_dictionary <- read.csv(here("nodelist_unqiue.csv"))
nodelist_dictionary <- nodelist_dictionary[, -1]
nodelist_dictionary <- nodelist_dictionary[, 2:1]
names(nodelist_dictionary) <- c("ID", "Name")
library(expss)
# look up the unique number and get the corresponding package name for the TO set
TO_lookup <- expss::vlookup(
lookup_value = edge_weights$TO,
dict = nodelist_dictionary,
lookup_column = 1,
result_column = 2
)
# look up the unique number and get the corresponding package name for the FROM set
FROM_lookup <- expss::vlookup(
lookup_value = edge_weights$FROM,
dict = nodelist_dictionary,
lookup_column = 1,
result_column = 2
)
# combine the new lookups with the actual edge_weights
edge_weights_matched <- cbind.data.frame(edge_weights, TO_lookup, FROM_lookup)
edge_weights_prefinal <- edge_weights_matched[, -1:-2]
names(edge_weights_prefinal) <- c("weight", "to", "from")
# reorder the frame into to / from / weight order
edge_weights_prefinal <- edge_weights_prefinal %>% select(to, from, weight)
Final_edge_weights <- edge_weights_prefinal
# Nodes
# It is unnecessary to create this if you already have the full nodelist_dictionary
# This is only required if you trim or cut down the edges somehow, leaving some nodes unlinked
pre_local_index <- edge_weights_prefinal %>%
select(to) %>%
bind_rows(
edge_weights_prefinal %>%
transmute(to = from)
) %>%
unique()
new_id_for_index <- 1:nrow(pre_local_index)
mid_local_index <- cbind(pre_local_index, new_id_for_index)
Final_node_list <- mid_local_index
# Graphing the first network with visNetwork
library(igraph)
library(visNetwork)
names(Final_edge_weights) <- c("Target", "Source", "Weight")
names(Final_node_list) <- c("name", "id")
Original_network <- graph_from_data_frame(
d = Final_edge_weights,
vertices = Final_node_list
)
# test.visn <- toVisNetworkData(Original_network)
# visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
# visIgraphLayout(layout = "layout_with_fr") %>%
# visNodes(color = list(
# background = "lightblue",
# border = "lancet",
# highlight = "yellow"
# )) %>%
# visOptions(
# highlightNearest = TRUE,
# nodesIdSelection = TRUE,
# collapse = TRUE
# ) %>%
# visNodes(size = 10)
• I just saved a visual of the network as a .png as it is an extremely large object.
knitr::include_graphics("original_network.png")
• This network is not really interpretable. It does, however, show the data in its rawest form and the sheer magnitude of what we are dealing with.
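If you do want to keep the interactive version around without re-rendering it each time, one option is to save the widget to a standalone HTML file. A minimal sketch, assuming the commented-out visNetwork() call above has been run and stored in an object (the big_network_widget name is hypothetical):
# big_network_widget <- visNetwork(test.visn$nodes, test.visn$edges, width = "100%")
visNetwork::visSave(big_network_widget, file = "original_network.html")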
# Since we technically have an undirected graph, we should create a final weight based on the sum of
# each "to-from" edge and the corresponding "from-to" edge for the same two packages.
Special_edge_weights_filtered <- Final_edge_weights %>%
rowwise() %>%
mutate(temp = paste(sort(x = c(Source, Target)),
collapse = " "
)) %>%
ungroup() %>%
group_by(temp) %>%
mutate(result = sum(Weight)) %>%
ungroup() %>%
select(-temp)
Special_edge_weights_filtered <- Special_edge_weights_filtered[, -3]
names(Special_edge_weights_filtered) <- c("Target", "Source", "Weight")
# Reducing our edge list based on the size of the weight and cutting redundant self-referencing edges
Final_edge_weights_filtered <- Special_edge_weights_filtered %>%
filter(Weight > 100) %>%
filter(Target != Source)
# We have to create a new node list given that some small nodes will drop out and
# we don't want empty node points
pre_local_index2 <- Final_edge_weights_filtered %>%
select(Source) %>%
bind_rows(
Final_edge_weights_filtered %>%
transmute(Source = Target)
) %>%
unique()
pre_local_index2 <- pre_local_index2 %>% arrange(Source)
new_id_for_index <- 1:nrow(pre_local_index2)
mid_local_index <- cbind(pre_local_index2, new_id_for_index)
Final_node_list_2 <- mid_local_index
names(Final_node_list_2) <- c("name", "id")
Adjusted_network <- graph_from_data_frame(
d = Final_edge_weights_filtered,
vertices = Final_node_list_2
)
test.visn <- toVisNetworkData(Adjusted_network)
visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(
highlightNearest = TRUE,
nodesIdSelection = TRUE,
collapse = TRUE
) %>%
visNodes(size = 10)
#To get larger nodes based on degree value
graph <- graph_from_data_frame(Final_edge_weights_filtered, directed = TRUE)
degree_value <- degree(graph, mode = "all")
Final_node_list_2$value <- degree_value[match(Final_node_list_2$name, names(degree_value))]
Final_network <- graph_from_data_frame(
d = Final_edge_weights_filtered,
vertices = Final_node_list_2
)
test.visn <- toVisNetworkData(Final_network)
# In order to change line thickness we associate the weights with the value
test.visn$edges$value <- Final_edge_weights_filtered$Weight
# Creating my colour palette from three base colours, interpolating the colours in between
rbPal <- colorRampPalette(c("#12cef0", "#f395c3", "#6f1383"))
# to add colour to our test.visn we need to add it for the color.background in nodes
test.visn$nodes$color.background <- rbPal(nrow(test.visn$nodes))[as.numeric(cut(test.visn$nodes$value * 100, breaks = nrow(test.visn$nodes)))]
#Visualising final network:
visNetwork(test.visn$nodes, test.visn$edges, width = "100%", height = "100vh") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(
highlightNearest = TRUE,
nodesIdSelection = TRUE,
collapse = TRUE
)
If you would like to read more about the layout algorithm, see Fruchterman & Reingold (1991). If an edge was used fewer than 100 times, it was dropped from the visualisation. I produced two versions above: one without any alterations beyond the weight filter, while the other had a couple of changes:
• Node size dependent on degrees.
• Colouring dependent on node size.
• Line thickness dependent on edge list weight.
Even though we sit with a far smaller set of packages it is still rather complex.
Things you would want to evaluate are:
• How well connected is a respective package?
• Which packages are directly adjacent to the selected node and which are secondary?
• Where are the strongest ties with the respective package? What functions do they bring to the main selected package?
What does this type of visualisation offer?
• You can see packages that are often downloaded with your favourite packages and explore those relationships.
• It gives developers an idea of how their package is used in conjunction with others.
It is likely that this momentum will continue and we as a community can do a lot to foster that growth through a variety of activities. Be it conferences like today, meetups, tutorials, etc. Decide if you want to be more involved and think about the how!
To some extent, we did not have such a concrete representation before this point. There is likely far more to uncover still from this data which is available to all of us.
This might sound like a counterintuitive point, but this visualisation has given you a glimpse of how complex the package landscape is and how difficult a task it is to navigate.
This network ties in nicely with a point made by Silge, Nash, & Graves (2018) in their article Navigating the R Package Universe:
“Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other.”
This is not meant to deter people from downloading packages but to give some thought of how best to navigate this space - what will your strategy be?
For many people, it would seem package selection comes down to the social media we consume, who we follow, and who we interact with regularly:
knitr::include_graphics("Discovering_packages.png")
• This was part of an online survey (n = 1039) conducted by Silge, Nash, and Graves (2018) for their paper.
• On the one hand, I am sure we mostly agree that the contributions from various communities and individuals are a phenomenal side of both the open source and the R community. We are spared many, many hours by not having to develop a host of things from scratch.
• On the other hand, there are in some instances so many good alternatives that the choice becomes extremely difficult. How many packages’ documentation and examples will you go through before selecting one? There is something of a decision fatigue to be noted here. I suppose all I want to say is: being spoiled for choice can spoil the choice.
Why not simple unification?
• “A real downside of unification is the burden of work for the developers involved, including dealing with complex dependencies and reverse dependencies” (Silge, Nash, & Graves, 2018)
This is too narrow a metric for a variety of reasons. So what is the way forward? There doesn’t seem to be any consensus on the matter.
Some have opted to create metrics or platforms by which to search and evaluate packages. Just to list a few:
• R Site Search by Baron (or via the RSiteSearch() function in utils (R Core Team, 2018)) - see the one-liner after this list
• CRAN Task Views (Zeileis, 2005)
• packagemetrics (Firke et al., 2018)
• CRANsearcher RStudio addin (Krouse and Calatroni, 2017)
• rseek site of Goodman (2007)
• R Package Documentation site of Howson (2018)
• RDocumentation.org
• METACRAN (Csárdi and Ooms)
• crantastic (Wickham and Mæland)
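As an example of the first of these, the utils option can be tried straight from the console (the search term is just an example):
# opens the R site search results for the query in your browser
utils::RSiteSearch("time series decomposition")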
Others emphasise the value of communities and investing in them:
• “Our survey does show that personal recommendations have been important for many individuals in evaluating R packages. This is yet another area where local user groups can continue to have important impact…Our survey and discussions show how impactful these community networks are; investing in community building is not something we need do only because of idealism, but because it is effective.” (Silge, Nash, & Graves, 2018)
• I’ll likely deploy a more interactive version of this network.
• Something else I have been thinking about is developing a Shiny app mini-game. One take on classifying functions in R has been a Shiny app developed by Lucy McGowan:
https://lucy.shinyapps.io/classify/
Something similar could be done for R packages.
knitr::include_graphics("End.png")