Last updated: 2022-11-24
Checks: 6 passed, 1 warning
Knit directory: workflowr-policy-landscape/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you'll want to first commit it to the Git repo. If you're still working on the analysis, you can ignore this warning. When you're finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it's best to always run the code in an empty environment.
The command set.seed(20220505) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version c95aa82. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Unstaged changes:
Modified: analysis/2_Topic_modeling.Rmd
Modified: analysis/3_Text_similarities_Figure_2B.Rmd
Modified: output/created_datasets/dataset_words_stm_5topics.csv
Modified: output/created_datasets/df_doc_level_stm_gamma.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2_Topic_modeling.Rmd) and HTML (docs/2_Topic_modeling.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 0a21152 | zuzannazagrodzka | 2022-09-21 | Build site. |
html | 796aa8e | zuzannazagrodzka | 2022-09-21 | Build site. |
Rmd | efb1202 | zuzannazagrodzka | 2022-09-21 | Publish other files |
To investigate our research questions regarding the topics of interest and the language choices in the stakeholders' statements, the study employs a computational text analysis method, Structural Topic Modeling (STM), using the stm package. We chose STM because it relaxes the assumption that topics are independent (as Correlated Topic Models do) while also enabling the discovery of topics and their prevalence based on document metadata, such as stakeholder group. Topic modeling was conducted at the sentence level, and the metadata variables contained the stakeholder group and the document each sentence comes from.
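As a self-contained illustration of an STM fit with a metadata covariate for topic prevalence, a minimal sketch using the stm package's built-in Gadarian example data (not our corpus); the `prevalence` argument plays the same role as the `~ name + stakeholder` formula in the analysis below:

```r
library(stm)

# Preprocess the example survey responses shipped with the stm package
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit a small model; topic prevalence may vary with the `treatment` covariate
fit <- stm(out$documents, out$vocab, K = 3,
           prevalence = ~ treatment, data = out$meta,
           init.type = "Spectral", seed = 1, verbose = FALSE)

labelTopics(fit, n = 5)  # top words per topic
```

With `K = 0` and spectral initialization, as in the analysis below, stm chooses the number of topics itself rather than using a fixed K.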
After generating the topics, we used the grounded theory method of analysing qualitative data (Corbin and Strauss, 1990) to identify the main categories present in the topics.
Open Research and the values it brings seem necessary to maximise the impact of science and help solve global challenges. We expect that stakeholders who care about the Open Research agenda will frequently use words such as:
open, research, science, datum, access, accessibility, share, transparency
Research results and findings should be used to understand and solve challenges, but also to educate people. Some stakeholders might emphasise the importance of this without necessarily seeing the role of Open Research in the process. The words that we could expect to be associated with this topic are:
impact, solve, development, sustainable, education, policy, climate, change, wildlife. We also expect words such as open and access to be absent.
In our work, we hypothesise that some organisations are run as for-profit businesses. Their approach could be more monetary/profit driven and, therefore, they would use business language or financial terms in their aim and mission statements. In these topics we could expect to find words such as:
profit, fund, service, management, pay, financially
The stm model generated 78 topics, which were later characterised and categorised following the above description. In the end we were able to identify four main topics, which we called: Open Research, Community & Support, Innovation & Solution, and Publication process. Only the Open Research topic was identified as predicted beforehand.
1. Open Research topic contains: Topic 1, Topic 13, Topic 58, Topic 69, Topic 43, Topic 25, Topic 32, Topic 54, Topic 52, Topic 60
2. Community & Support topic contains: Topic 5, Topic 7, Topic 10, Topic 11, Topic 21, Topic 23, Topic 26, Topic 39, Topic 41, Topic 42, Topic 63, Topic 65
3. Innovation & Solution topic contains: Topic 17, Topic 24, Topic 30, Topic 14, Topic 2, Topic 34, Topic 38, Topic 4, Topic 44, Topic 48, Topic 50, Topic 51, Topic 55, Topic 61, Topic 66, Topic 71, Topic 20, Topic 57, Topic 62
4. Publication process topic contains: Topic 3, Topic 12, Topic 16, Topic 22, Topic 35, Topic 49, Topic 47, Topic 53
Topics that could not be assigned to any of these categories were excluded from our analysis and interpretation.
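As a compact alternative to the repeated `%in%` assignments used in the code below, the same topic-to-category mapping can be sketched with a named lookup vector (the numbers follow the lists above; `assign_category` is a hypothetical helper name):

```r
# Map each retained stm topic number to one of the four merged categories;
# any topic not listed falls into category 5 ("other") and is excluded later.
topic_map <- c(
  setNames(rep(1, 10), c(1, 13, 58, 69, 43, 25, 32, 54, 52, 60)),
  setNames(rep(2, 12), c(5, 7, 10, 11, 21, 23, 26, 39, 41, 42, 63, 65)),
  setNames(rep(3, 19), c(17, 24, 30, 14, 2, 34, 38, 4, 44, 48, 50, 51,
                         55, 61, 66, 71, 20, 57, 62)),
  setNames(rep(4, 8),  c(3, 12, 16, 22, 35, 49, 47, 53))
)

assign_category <- function(stm_topic) {
  out <- topic_map[as.character(stm_topic)]
  out[is.na(out)] <- 5          # uncategorised topics
  unname(out)
}

assign_category(c(1, 5, 17, 3, 99))  # -> 1 2 3 4 5
```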
A beta value is calculated for each word and each topic, and gives the probability that a certain word belongs to a certain topic. The score value measures how exclusive each word is to a certain topic; for example, a word with a low score is used roughly equally across all topics. The score was calculated by dividing a word's beta value for a topic by the sum of its beta values across all topics.
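The score computation can be sketched on toy per-topic beta values (hypothetical numbers, not taken from the fitted model):

```r
library(dplyr)

# Toy beta values for two words across three topics (hypothetical numbers)
td <- tibble::tribble(
  ~topic, ~term,  ~beta,
  1,      "open", 0.08,
  2,      "open", 0.01,
  3,      "open", 0.01,
  1,      "fund", 0.02,
  2,      "fund", 0.02,
  3,      "fund", 0.02
)

# score = beta of the word in this topic / total beta for the word in all topics
td_score <- td %>%
  group_by(term) %>%
  mutate(score = beta / sum(beta)) %>%
  ungroup()

# "open" is concentrated in topic 1 (score 0.8);
# "fund" is spread evenly across topics (score 1/3 each)
```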
After merging topics into the new topics we had categorised, we calculated the mean beta and score values of the merged topics. Later we used these values in our text similarities analysis to create Fig. 2B (3_Text_similarities, Figure_2B).
Next, we calculated the proportion of the topics appearing in each of the documents by using the mean gamma value for each sentence and new topic. The gamma value gives the probability that a certain document (here: a sentence) belongs to a certain topic.
We chose the topic with the highest mean gamma value as the dominant topic for each sentence and later calculated the proportion of sentences belonging to the new topics.
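The dominant-topic choice can be sketched on toy gamma values (hypothetical numbers; sentences tied across topics are excluded later in the code below):

```r
library(dplyr)

# Toy gamma values: probability that each sentence belongs to each merged topic
td_gamma <- tibble::tribble(
  ~sentence_doc, ~topic, ~gamma,
  "s1",          1,      0.70,
  "s1",          2,      0.20,
  "s1",          3,      0.10,
  "s2",          1,      0.15,
  "s2",          2,      0.60,
  "s2",          3,      0.25
)

# Keep the topic with the highest gamma per sentence: its dominant topic
dominant <- td_gamma %>%
  group_by(sentence_doc) %>%
  top_n(1, gamma) %>%
  ungroup()

dominant  # s1 -> topic 1, s2 -> topic 2
```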
This data set was used to create Fig. 2A (Figure_2A).
rm(list=ls())
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.5
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Package version: 3.2.3
Unicode version: 14.0
ICU version: 70.1
Parallel computing: 12 of 12 threads used.
See https://quanteda.io for tutorials and examples.
Loading required package: NLP
Attaching package: 'NLP'
The following objects are masked from 'package:quanteda':
meta, meta<-
The following object is masked from 'package:ggplot2':
annotate
Attaching package: 'tm'
The following object is masked from 'package:quanteda':
stopwords
Loading required package: RColorBrewer
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Attaching package: 'igraph'
The following object is masked from 'package:quanteda.textplots':
as.igraph
The following objects are masked from 'package:dplyr':
as_data_frame, groups, union
The following objects are masked from 'package:purrr':
compose, simplify
The following object is masked from 'package:tidyr':
crossing
The following object is masked from 'package:tibble':
as_data_frame
The following objects are masked from 'package:stats':
decompose, spectrum
The following object is masked from 'package:base':
union
stm v1.3.6 successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
data_words <- read.csv(file = "./output/created_datasets/cleaned_data.csv")
Code follows: https://juliasilge.com/blog/sherlock-holmes-stm/
# Creating metadata and connecting it with my data to perform topic modeling on the documents. Metadata includes: document name and stakeholder name
data_dfm <- data_words %>%
count(sentence_doc, word, sort = TRUE) %>%
cast_dfm(sentence_doc, word, n)
data_sparse <- data_words %>%
count(sentence_doc, word, sort = TRUE) %>%
cast_sparse(sentence_doc, word, n)
# Creating metadata: document name, stakeholder name
data_metadata <- data_words %>%
select(sentence_doc, name, stakeholder) %>%
distinct(sentence_doc, .keep_all = TRUE)
# Connecting my metadata and data_dfm for stm()
covs = data.frame(sentence_doc = data_dfm@docvars$docname, row = c(1:length(data_dfm@docvars$docname)))
covs = left_join(covs, data_metadata)
Joining, by = "sentence_doc"
data_beta <- data_words
topic_model <- stm(data_dfm, K = 0, verbose = FALSE, init.type = "Spectral", prevalence = ~ name + stakeholder, data = covs, seed = 1) # running stm() function to fit a model and generate topics
tpc = topicCorr(topic_model)
plot(tpc) # plotting topic correlations; there is no clear clustering among the topics
# Getting beta values from the topic modeling and adding beta value to the data_words
td_beta <- tidy(topic_model) # getting beta values
td_beta %>%
group_by(term) %>%
arrange(term, -beta)
# A tibble: 220,896 × 3
# Groups: term [2,832]
topic term beta
<int> <chr> <dbl>
1 51 abide 3.55e- 3
2 33 abide 2.95e- 3
3 69 abide 2.10e-126
4 11 abide 3.55e-230
5 13 abide 3.68e-254
6 78 abide 6.10e-263
7 53 abide 1.38e-284
8 71 abide 1.34e-285
9 70 abide 4.02e-289
10 55 abide 2.99e-289
# … with 220,886 more rows
# Topics generated by the model were coded and categorised into five topics defined by us. Below, a new "topic" column is created and topics are assigned
td_beta$stm_topic <- td_beta$topic
td_beta$topic <- "NA"
# 1. Open Research topic contains: Topic 1, Topic 13, Topic 58, Topic 69, Topic 43, Topic 25, Topic 32, Topic 54, Topic 52, Topic 60
td_beta$topic[td_beta$stm_topic%in% c(1, 13, 58, 69, 43, 25, 32, 54, 52, 60)] <- 1
# 2. Community & Support topic contains: Topic 5, Topic 7, Topic 10, Topic 11, Topic 21, Topic 23, Topic 26, Topic 39, Topic 41, Topic 42, Topic 63, Topic 65
td_beta$topic[td_beta$stm_topic%in% c(5, 7, 10, 11, 21, 23, 26, 39, 41, 42, 63, 65)] <- 2
# 3.Innovation & Solution topic contains: Topic 17, Topic 24, Topic 30, Topic 14, Topic 2, Topic 34, Topic 38, Topic 4, Topic 44, Topic 48, Topic 50, Topic 51, Topic 55, Topic 61, Topic 66, Topic 71, Topic 20, Topic 57, Topic 62
td_beta$topic[td_beta$stm_topic%in% c(17, 24, 30, 14, 2, 34, 38, 4, 44, 48, 50, 51, 55, 61, 66, 71, 20, 57, 62)] <- 3
# 4. Publication process topic contains: Topic 3, Topic 12, Topic 16, Topic 22, Topic 35, Topic 49, Topic 47, Topic 53
td_beta$topic[td_beta$stm_topic%in% c(3, 12, 16, 22, 35, 49, 47, 53)] <- 4
# Topics that could not be categorised into any of our four topics are grouped into topic 5 and excluded from our analysis and interpretation.
td_beta$topic[td_beta$topic %in% "NA"] <- 5
td_beta$topic <- as.integer(td_beta$topic)
td_beta$term <- as.factor(td_beta$term)
# Sum of beta values for all topics for each category for each word
td_beta_sum <- td_beta %>%
select(-stm_topic) %>%
group_by(topic, term) %>%
summarise(beta = sum(beta))
`summarise()` has grouped output by 'topic'. You can override using the
`.groups` argument.
# Mean value of beta values for all topics for each category for each word
td_beta_mean <- td_beta %>%
select(-stm_topic) %>%
group_by(topic, term) %>%
summarise(beta = mean(beta))
`summarise()` has grouped output by 'topic'. You can override using the
`.groups` argument.
td_beta_groups_sum <- td_beta_sum %>%
spread(topic, beta)
# Calculating score - beta value of the word in topic / total beta value for the word in all topics
td_beta_total <- td_beta %>%
group_by(term) %>%
summarise(beta_word_total = sum(beta))
td_beta_score <- td_beta %>%
left_join(td_beta_total, by = c("term" = "term"))
td_beta_score$score = td_beta_score$beta/td_beta_score$beta_word_total
head(td_beta_score)
# A tibble: 6 × 6
topic term beta stm_topic beta_word_total score
<int> <fct> <dbl> <int> <dbl> <dbl>
1 1 abide 0 1 0.00650 0
2 3 abide 0 2 0.00650 0
3 4 abide 0 3 0.00650 0
4 3 abide 0 4 0.00650 0
5 2 abide 0 5 0.00650 0
6 5 abide 0 6 0.00650 0
td_beta_score <- td_beta_score %>%
select(topic, term, beta, score, stm_topic)
# Calculating mean beta and score value for new topics
# Grouping by word and then grouping by the category to calculate mean values
td_beta_mean <- td_beta_score %>%
group_by(term, topic) %>%
summarise(mean_beta = mean(beta)) %>%
mutate(merge_col = paste(term, topic, sep = "_"))
`summarise()` has grouped output by 'term'. You can override using the
`.groups` argument.
td_score_mean <- td_beta_score %>%
group_by(term, topic) %>%
summarise(mean_score = mean(score)) %>%
mutate(merge_col = paste(term, topic, sep = "_")) %>%
ungroup() %>%
select(-term, -topic)
`summarise()` has grouped output by 'term'. You can override using the
`.groups` argument.
# Creating a data frame with score and beta mean values
td_beta_score_mean <- td_beta_mean %>%
left_join(td_score_mean, by = c("merge_col" = "merge_col"))
# Adding a beta sum column
td_beta_sum_w <- td_beta_sum %>%
mutate(merge_col = paste(term, topic, sep = "_")) %>%
ungroup() %>%
select(- term, - topic)
td_beta_score_mean_max <- td_beta_score_mean %>%
left_join(td_beta_sum_w, by = c("merge_col" = "merge_col")) %>%
select(-merge_col) %>%
rename(sum_beta = beta)
# Getting the highest beta value for each word, with the information about the topic number
td_beta_select <- td_beta_score_mean_max
td_beta_mean_wide <- td_beta_score_mean_max %>%
select(term, topic, mean_beta) %>%
spread(topic, mean_beta) %>%
rename(mean_beta_t1 = `1`, mean_beta_t2 = `2`, mean_beta_t3 = `3`, mean_beta_t4 = `4`, mean_beta_t5 = `5`)
td_score_mean_wide <- td_beta_score_mean_max %>%
select(term, topic, mean_score) %>%
spread(topic, mean_score) %>%
rename(mean_score_t1 = `1`, mean_score_t2 = `2`, mean_score_t3 = `3`, mean_score_t4 = `4`, mean_score_t5 = `5`)
td_beta_sum_wide <- td_beta_score_mean_max %>%
select(term, topic, sum_beta) %>%
spread(topic, sum_beta) %>%
rename(sum_beta_t1 = `1`, sum_beta_t2 = `2`, sum_beta_t3 = `3`, sum_beta_t4 = `4`, sum_beta_t5 = `5`)
# Highest score value
td_score_topic <- td_beta_select %>%
select(-sum_beta, -mean_beta) %>%
group_by(term) %>%
top_n(1, mean_score) %>%
rename(highest_mean_score = mean_score)
td_score_topic %>% group_by(topic) %>% count()
# A tibble: 5 × 2
# Groups: topic [5]
topic n
<int> <int>
1 1 491
2 2 571
3 3 662
4 4 410
5 5 698
# Highest mean beta value
td_beta_topic <- td_beta_select %>%
select(-sum_beta, -mean_score) %>%
group_by(term) %>%
top_n(1, mean_beta) %>%
rename(highest_mean_beta = mean_beta)
# Merging data_words with: td_beta_mean_wide, td_score_mean_wide, td_beta_sum_wide, td_score_topic
data_words$word <- as.factor(data_words$word)
to_merge <- td_beta_mean_wide %>%
left_join(td_score_mean_wide, by= c("term" = "term")) %>%
left_join(td_beta_sum_wide, by= c("term" = "term")) %>%
left_join(td_score_topic, by= c("term" = "term")) %>%
left_join(td_beta_topic, by= c("term" = "term"))
data_words_stm <- data_words %>%
left_join(to_merge, by = c("word" = "term")) %>%
select(-topic.y) %>%
rename(topic = topic.x)
# Saving the csv file
write_csv(data_words_stm, file = "./output/created_datasets/dataset_words_stm_5topics.csv")
# Getting gamma values from topic_modeling
td_gamma <- tidy(topic_model, matrix = "gamma",
document_names = rownames(data_dfm))
td_gamma_prog <- td_gamma
td_gamma_prog$stm_topic <- td_gamma_prog$topic
td_gamma_prog$topic <- "NA"
td_gamma_prog$topic[td_gamma_prog$stm_topic%in% c(1, 13, 58, 69, 43, 25, 32, 54, 52, 60)] <- 1
td_gamma_prog$topic[td_gamma_prog$stm_topic%in% c(5, 7, 10, 11, 21, 23, 26, 39, 41, 42, 63, 65)] <- 2
td_gamma_prog$topic[td_gamma_prog$stm_topic%in% c(17, 24, 30, 14, 2, 34, 38, 4, 44, 48, 50, 51, 55, 61, 66, 71, 20, 57, 62)] <- 3
td_gamma_prog$topic[td_gamma_prog$stm_topic%in% c(3, 12, 16, 22, 35, 49, 47, 53)] <- 4
td_gamma_prog$topic[td_gamma_prog$topic %in% "NA"] <- 5
td_gamma_prog$topic <- as.integer(td_gamma_prog$topic)
td_gamma_prog$document <- as.factor(td_gamma_prog$document)
# Removing topic 5 as we are not interested in it
td_gamma_prog <- td_gamma_prog %>%
select(-stm_topic) %>%
rename(sentence_doc = document) %>%
filter(topic != 5)
# Choosing the highest gamma value for each sentence
# Sort by the sentence and take top_n(1, gamma) to choose the topic with the biggest gamma value
td_gamma_prog_info <- td_gamma_prog %>%
rename(topic_sentence = topic) %>%
group_by(sentence_doc) %>%
top_n(1, gamma) %>%
ungroup()
# If a sentence has the same top gamma value in more than one topic, exclude it from the analysis as not belonging to a single topic.
td_gamma_prog_info_keep <- td_gamma_prog_info %>%
group_by(sentence_doc) %>%
count() %>%
filter(n == 1) %>%
select(-n) %>%
ungroup()
td_gamma_prog_info <- td_gamma_prog_info_keep %>%
left_join(td_gamma_prog_info, by= c("sentence_doc" = "sentence_doc"))
# Calculating a proportion of sentences in each of the document,
# for that I need to add columns: document and stakeholder
# Adding info:
info_sentence_doc <- data_words %>%
select(sentence_doc, name, stakeholder) %>%
distinct(sentence_doc, .keep_all = TRUE)
td_gamma_prog_info <- td_gamma_prog_info %>%
left_join(info_sentence_doc, by= c("sentence_doc"="sentence_doc"))
# Calculating proportion
# I will do so by counting name (document) to get a number of sentences
# in each document, then I will count a number of each topic for a document
# and then I will create a column with the proportion
sentence_count_gamma <- data_words %>%
distinct(sentence_doc, .keep_all = TRUE) %>%
group_by(name) %>%
count() %>%
rename(total_sent = n)
topic_count_gamma <- td_gamma_prog_info %>%
group_by(name, topic_sentence) %>%
count() %>%
rename(total_topic = n)
doc_level_stm_gamma <- topic_count_gamma %>%
left_join(sentence_count_gamma, by= c("name" = "name")) %>%
mutate(merge_col = paste(name, topic_sentence, sep = "_"))
# There are some missing values, replacing it with 0 as not present
df_base <- info_sentence_doc %>%
select(-sentence_doc) %>%
distinct(name, .keep_all = TRUE) %>%
slice(rep(1:n(), each = 4)) %>%
group_by(name) %>%
mutate(topic_sentence = 1:n()) %>%
mutate(merge_col = paste(name, topic_sentence, sep = "_")) %>%
ungroup() %>%
select(-name, -topic_sentence)
df_doc_level_stm_gamma <- df_base %>%
left_join(doc_level_stm_gamma, by= c("merge_col" = "merge_col")) %>%
select(-name, -topic_sentence) %>%
separate(merge_col, c("name","topic"), sep = "_")
df_doc_level_stm_gamma$prop <- df_doc_level_stm_gamma$total_topic/df_doc_level_stm_gamma$total_sent
df_doc_level_stm_gamma <- df_doc_level_stm_gamma %>%
mutate_at(vars(prop), ~replace_na(., 0)) # replacing NA with 0, when a topic not present
write_excel_csv(df_doc_level_stm_gamma, "./output/created_datasets/df_doc_level_stm_gamma.csv")
Topic 1: Open Research
Topic 2: Community & Support
Topic 3: Innovation & Solution
Topic 4: Publication process
# Number of words belonging to new topics
no_words_topics <- data_words_stm %>%
select(word, topic) %>%
distinct(word, .keep_all = TRUE) %>%
group_by(topic) %>%
count()
no_words_topics %>%
kbl(caption = "No of words belonging to new topics") %>%
kable_classic("hover", full_width = F)
topic | n |
---|---|
1 | 491 |
2 | 571 |
3 | 662 |
4 | 410 |
5 | 698 |
# The most relevant words for each topic: top 4 by highest mean beta
words_high_beta_topic <- data_words_stm %>%
select(word, topic, highest_mean_beta) %>%
distinct(word, .keep_all = TRUE) %>%
group_by(topic) %>%
top_n(4) %>%
ungroup() %>%
arrange(topic, -highest_mean_beta)
Selecting by highest_mean_beta
words_high_beta_topic %>%
kbl(caption = "Most relevant words in topics") %>%
kable_classic("hover", full_width = F)
word | topic | highest_mean_beta |
---|---|---|
datum | 1 | 0.0493624 |
share | 1 | 0.0262492 |
researcher | 1 | 0.0164228 |
tool | 1 | 0.0105743 |
scientific | 2 | 0.0256745 |
technology | 2 | 0.0165909 |
high | 2 | 0.0114022 |
programme | 2 | 0.0099910 |
community | 3 | 0.0194908 |
contribute | 3 | 0.0138369 |
support | 3 | 0.0137941 |
fund | 3 | 0.0098685 |
research | 4 | 0.0619931 |
member | 4 | 0.0262439 |
biological | 4 | 0.0145711 |
make | 4 | 0.0139964 |
science | 5 | 0.0198494 |
work | 5 | 0.0158738 |
publish | 5 | 0.0139126 |
open | 5 | 0.0134846 |
# Advocates
advocates_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "advocates") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
advocates_info
# A tibble: 12 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 CoData 1 1
2 ROpenSci 1 1
3 Free our knowledge 1 0.769
4 COPDESS 2 1
5 Amelica 2 0.846
6 FAIRsharing 2 0.76
7 DOAJ 3 1
8 Research4life 3 1
9 Bioline International 3 0.818
10 coalitionS 4 1
11 Jisc 4 1
12 DataCite 4 0.833
advocates_info %>%
kbl(caption = "Advocates associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
CoData | 1 | 1.0000000 |
ROpenSci | 1 | 1.0000000 |
Free our knowledge | 1 | 0.7692308 |
COPDESS | 2 | 1.0000000 |
Amelica | 2 | 0.8461538 |
FAIRsharing | 2 | 0.7600000 |
DOAJ | 3 | 1.0000000 |
Research4life | 3 | 1.0000000 |
Bioline International | 3 | 0.8181818 |
coalitionS | 4 | 1.0000000 |
Jisc | 4 | 1.0000000 |
DataCite | 4 | 0.8333333 |
# Funders
funders_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "funders") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
funders_info
# A tibble: 12 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 Russian Academy of Science 1 0.727
2 Alexander von Humboldt Foundation 1 0.684
3 CONICYT 1 0.667
4 JST 2 0.9
5 Wellcome 2 0.794
6 National Science Foundation 2 0.692
7 Consortium of African Funds for the Environment 3 1
8 Sea World Research and Rescue Foundation 3 1
9 Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior 3 0.833
10 National Research Council Italy 4 1
11 NRC Egypt 4 0.571
12 Helmholtz-Gemeinschaft 4 0.56
funders_info %>%
kbl(caption = "Funders associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
Russian Academy of Science | 1 | 0.7272727 |
Alexander von Humboldt Foundation | 1 | 0.6842105 |
CONICYT | 1 | 0.6666667 |
JST | 2 | 0.9000000 |
Wellcome | 2 | 0.7941176 |
National Science Foundation | 2 | 0.6923077 |
Consortium of African Funds for the Environment | 3 | 1.0000000 |
Sea World Research and Rescue Foundation | 3 | 1.0000000 |
Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior | 3 | 0.8333333 |
National Research Council Italy | 4 | 1.0000000 |
NRC Egypt | 4 | 0.5714286 |
Helmholtz-Gemeinschaft | 4 | 0.5600000 |
# Journals
journals_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "journals") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
journals_info
# A tibble: 13 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 eLifeJournal 1 1
2 Ecology and Evolution 1 0.755
3 Neobiota 1 0.679
4 PeerJJournal 2 1
5 BioSciences 2 0.6
6 Evolutionary Applications 2 0.538
7 Ecological Applications 3 1
8 Journal of Applied Ecology 3 1
9 Remote Sensing in Ecology and Conservation 3 1
10 American Naturalist 4 1
11 Conservation Biology 4 0.6
12 Proceedings of the Royal Society B Biological Sciences 4 0.6
13 Conservation Letters 4 0.6
journals_info %>%
kbl(caption = "Journals associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
eLifeJournal | 1 | 1.0000000 |
Ecology and Evolution | 1 | 0.7547170 |
Neobiota | 1 | 0.6785714 |
PeerJJournal | 2 | 1.0000000 |
BioSciences | 2 | 0.6000000 |
Evolutionary Applications | 2 | 0.5384615 |
Ecological Applications | 3 | 1.0000000 |
Journal of Applied Ecology | 3 | 1.0000000 |
Remote Sensing in Ecology and Conservation | 3 | 1.0000000 |
American Naturalist | 4 | 1.0000000 |
Conservation Biology | 4 | 0.6000000 |
Proceedings of the Royal Society B Biological Sciences | 4 | 0.6000000 |
Conservation Letters | 4 | 0.6000000 |
# Publishers
publishers_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "publishers") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
publishers_info
# A tibble: 12 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 PLOS 1 0.636
2 eLife 1 0.4
3 Wiley 1 0.286
4 The Royal Society Publishing 2 1
5 Cell Press 2 0.652
6 eLife 2 0.6
7 BioOne 3 1
8 PeerJ 3 0.667
9 Springer Nature 3 0.571
10 AIBS 4 0.8
11 Elsevier 4 0.517
12 The University of Chicago Press 4 0.273
publishers_info %>%
kbl(caption = "Publishers associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
PLOS | 1 | 0.6363636 |
eLife | 1 | 0.4000000 |
Wiley | 1 | 0.2857143 |
The Royal Society Publishing | 2 | 1.0000000 |
Cell Press | 2 | 0.6521739 |
eLife | 2 | 0.6000000 |
BioOne | 3 | 1.0000000 |
PeerJ | 3 | 0.6666667 |
Springer Nature | 3 | 0.5714286 |
AIBS | 4 | 0.8000000 |
Elsevier | 4 | 0.5172414 |
The University of Chicago Press | 4 | 0.2727273 |
# Repositories
repositories_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "repositories") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
repositories_info
# A tibble: 12 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 Harvard Dataverse 1 1
2 TERN 1 1
3 Australian Antarctic Data Centre 1 0.875
4 Marine Data Archive 2 1
5 Dryad 2 0.727
6 NCBI 2 0.7
7 DNA Databank of Japan 3 0.923
8 BCO-DMO 3 0.75
9 European Bioinformatics Institute 3 0.476
10 OSF 4 0.3
11 NCBI 4 0.1
12 European Bioinformatics Institute 4 0.0952
repositories_info %>%
kbl(caption = "Repositories associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
Harvard Dataverse | 1 | 1.0000000 |
TERN | 1 | 1.0000000 |
Australian Antarctic Data Centre | 1 | 0.8750000 |
Marine Data Archive | 2 | 1.0000000 |
Dryad | 2 | 0.7272727 |
NCBI | 2 | 0.7000000 |
DNA Databank of Japan | 3 | 0.9230769 |
BCO-DMO | 3 | 0.7500000 |
European Bioinformatics Institute | 3 | 0.4761905 |
OSF | 4 | 0.3000000 |
NCBI | 4 | 0.1000000 |
European Bioinformatics Institute | 4 | 0.0952381 |
# Societies
societies_info <- df_doc_level_stm_gamma %>%
select(-total_topic, - total_sent) %>%
filter(stakeholder == "societies") %>%
group_by(topic) %>%
slice_max(order_by = prop, n = 3) %>%
select(-stakeholder)
societies_info
# A tibble: 13 × 3
# Groups: topic [4]
name topic prop
<chr> <chr> <dbl>
1 National Academy of Sciences 1 0.667
2 SORTEE 1 0.333
3 Royal Society Te Aparangi 1 0.174
4 Australasian Evolution Society 2 1
5 Society for the Study of Evolution 2 1
6 The Zoological Society of London 2 0.643
7 Ecological Society of America 3 1
8 European Society for Evolutionary Biology 3 1
9 Ecological Society of Australia 3 0.929
10 The Society for Conservation Biology 4 0.545
11 The Royal Society 4 0.333
12 British Ecological Society 4 0.222
13 National Academy of Sciences 4 0.222
societies_info %>%
kbl(caption = "Societies associated with topics") %>%
kable_classic("hover", full_width = F)
name | topic | prop |
---|---|---|
National Academy of Sciences | 1 | 0.6666667 |
SORTEE | 1 | 0.3333333 |
Royal Society Te Aparangi | 1 | 0.1739130 |
Australasian Evolution Society | 2 | 1.0000000 |
Society for the Study of Evolution | 2 | 1.0000000 |
The Zoological Society of London | 2 | 0.6428571 |
Ecological Society of America | 3 | 1.0000000 |
European Society for Evolutionary Biology | 3 | 1.0000000 |
Ecological Society of Australia | 3 | 0.9285714 |
The Society for Conservation Biology | 4 | 0.5454545 |
The Royal Society | 4 | 0.3333333 |
British Ecological Society | 4 | 0.2222222 |
National Academy of Sciences | 4 | 0.2222222 |
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kableExtra_1.3.4 stm_1.3.6
[3] ggraph_2.1.0 igraph_1.3.5
[5] reshape2_1.4.4 wordcloud_2.6
[7] RColorBrewer_1.1-3 topicmodels_0.2-12
[9] tm_0.7-9 NLP_0.2-1
[11] quanteda.dictionaries_0.31 quanteda.textplots_0.94.2
[13] quanteda_3.2.3 tidytext_0.3.4
[15] forcats_0.5.2 stringr_1.4.1
[17] dplyr_1.0.10 purrr_0.3.5
[19] readr_2.1.3 tidyr_1.2.1
[21] tibble_3.1.8 ggplot2_3.3.6
[23] tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] Rtsne_0.16 googledrive_2.0.0 colorspace_2.0-3
[4] ellipsis_0.3.2 modeltools_0.2-23 rprojroot_2.0.3
[7] fs_1.5.2 rstudioapi_0.14 farver_2.1.1
[10] graphlayouts_0.8.3 SnowballC_0.7.0 ggrepel_0.9.2
[13] fansi_1.0.3 lubridate_1.8.0 xml2_1.3.3
[16] cachem_1.0.6 knitr_1.40 polyclip_1.10-4
[19] jsonlite_1.8.3 workflowr_1.7.0 broom_1.0.1
[22] dbplyr_2.2.1 ggforce_0.4.1 compiler_4.2.1
[25] httr_1.4.4 backports_1.4.1 assertthat_0.2.1
[28] Matrix_1.4-1 fastmap_1.1.0 gargle_1.2.1
[31] cli_3.4.1 later_1.3.0 tweenr_2.0.2
[34] htmltools_0.5.3 tools_4.2.1 rsvd_1.0.5
[37] gtable_0.3.1 glue_1.6.2 fastmatch_1.1-3
[40] Rcpp_1.0.9 slam_0.1-50 cellranger_1.1.0
[43] jquerylib_0.1.4 vctrs_0.5.0 svglite_2.1.0
[46] xfun_0.33 stopwords_2.3 rvest_1.0.3
[49] lifecycle_1.0.3 googlesheets4_1.0.1 MASS_7.3-57
[52] scales_1.2.1 tidygraph_1.2.2 hms_1.1.2
[55] promises_1.2.0.1 parallel_4.2.1 yaml_2.3.6
[58] gridExtra_2.3 sass_0.4.2 stringi_1.7.8
[61] highr_0.9 tokenizers_0.2.3 geometry_0.4.6.1
[64] systemfonts_1.0.4 rlang_1.0.6 pkgconfig_2.0.3
[67] evaluate_0.16 lattice_0.20-45 tidyselect_1.2.0
[70] plyr_1.8.7 magrittr_2.0.3 R6_2.5.1
[73] generics_0.1.3 DBI_1.1.3 pillar_1.8.1
[76] haven_2.5.1 whisker_0.4 withr_2.5.0
[79] abind_1.4-5 janeaustenr_1.0.0 modelr_0.1.9
[82] crayon_1.5.2 utf8_1.2.2 tzdb_0.3.0
[85] rmarkdown_2.16 viridis_0.6.2 grid_4.2.1
[88] readxl_1.4.1 data.table_1.14.6 git2r_0.30.1
[91] webshot_0.5.4 reprex_2.0.2 digest_0.6.29
[94] httpuv_1.6.6 RcppParallel_5.1.5 stats4_4.2.1
[97] munsell_0.5.0 viridisLite_0.4.1 bslib_0.4.0
[100] magic_1.6-1