Last updated: 2026-02-23

Checks: 7 0

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220216)

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 1fc02e6

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1fc02e6. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    Asthma_Bothsex_inv_var_meta_GBMI_052021_nbbkgt1.txt.gz
    Ignored:    Aus_School_Profile.xlsx
    Ignored:    BC2GM/
    Ignored:    SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    bc2GMtest_1.0.tar.gz
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  schools.R
    Untracked:  testing.R

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_dbgap_ids.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/text_for_cohort_labels.Rmd) and HTML (docs/text_for_cohort_labels.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	1fc02e6	IJbeasley	2026-02-23	Update text-extraction
html	db3e0bd	IJbeasley	2025-12-28	Build site.
Rmd	79191ff	IJbeasley	2025-12-28	Update text cohort extracting to print out information on numbers of abstracts
html	cb5cf9e	IJbeasley	2025-12-28	Build site.
Rmd	697dbb1	IJbeasley	2025-12-28	Update text cohort extracting
html	2593d6a	IJbeasley	2025-12-28	Build site.
Rmd	410a36a	IJbeasley	2025-12-28	Include GWAS Catalog cohorts in grep search for cohort sentences
html	238486e	IJbeasley	2025-10-24	Build site.
Rmd	0d8b872	IJbeasley	2025-10-24	Cleaning up abstract collecting code again
html	2afc108	IJbeasley	2025-10-24	Build site.
Rmd	748dac2	IJbeasley	2025-10-24	Cleaning up abstract collecting

Set up

Required packages

library(stringr)
library(readxl)
library(dplyr)
library(stringi)
library(httr)
library(rentrez)
library(xml2)
library(jsonlite)
library(tokenizers)

Getting required information from GWAS catalog

Step 1: Get only disease studies, and select relevant columns

## Step 1: 
# get only relevant disease studies
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
    dplyr::filter(!cause %in% c("HIV/AIDS",
                             "Tuberculosis",
                             "Malaria",
                             "Lower respiratory infections",
                             "Diarrhoeal diseases",
                             "Neonatal disorders",
                             "Tetanus",
                             "Diphtheria",
                             "Pertussis" ,
                             "Measles",
                             "Maternal disorders"))

# group conditions into broader categories to ensure we have enough papers in each category for sampling
lancet_cause_mapping <- readxl::read_xlsx(here::here("data/icd/lancet_conditions_icd10.xlsx"),
                                     sheet = 1) |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

set.seed(500)

training_sample = gwas_study_info |>
  left_join(lancet_cause_mapping |> select(cause = mapped_gbd_term,
                                           lancet_condition), 
            by = "cause",
            relationship = "many-to-many") |>
  select(PUBMED_ID,
         lancet_condition) |>
  distinct() |>
  dplyr::group_by(lancet_condition) |>
  dplyr::slice_sample(prop = 0.25) |>
  pull(PUBMED_ID) |>
  unique()

pmids <- training_sample

gwas_study_info =
  gwas_study_info |>
  filter(PUBMED_ID %in% training_sample) 

print("Number of unique pubmed ids for disease studies:")

[1] "Number of unique pubmed ids for disease studies:"

gwas_study_info$PUBMED_ID |> unique() |> length()

[1] 241

How much overlap with abstracts and full-texts retrieved?

converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))

full_text_files <-
  list.files(here::here("output/fulltexts"), 
             recursive = T, 
             pattern = "\\.html$|\\.xml$") 

full_text_files <- basename(full_text_files) |> 
  stringr::str_remove_all("\\.html$|\\.xml$") |>
  unique()

# convert pmcids to pmids 
converted_fulltext_pmcids <-
  converted_ids |>
  filter(pmcids %in% full_text_files) |>
  pull(PMID) |>
  unique() 

full_text_files  <- c(full_text_files, 
                      converted_fulltext_pmcids)

full_text_pmids <- grep("PMC", 
                        full_text_files, 
                        invert = T, 
                        value = T)


print("Number of training sample papers with full text retrieved:")

[1] "Number of training sample papers with full text retrieved:"

sum(training_sample %in% full_text_pmids)

[1] 231

print("Number of unique pubmed ids for disease studies with full text retrieved:")

[1] "Number of unique pubmed ids for disease studies with full text retrieved:"

gwas_study_info |>
  filter(PUBMED_ID %in% full_text_pmids) |>
  select(PUBMED_ID, cause) |>
  distinct() |>
  group_by(cause) |>
  summarise(n = n())

# A tibble: 21 × 2
   cause                                                      n
   <chr>                                                  <int>
 1 Cervical cancer                                           22
 2 Chronic kidney disease due to diabetes mellitus type 1     3
 3 Chronic kidney disease due to diabetes mellitus type 2     5
 4 Chronic obstructive pulmonary disease                     17
 5 Cirrhosis and other chronic liver diseases                80
 6 Diabetes mellitus                                         87
 7 Intracerebral hemorrhage                                   7
 8 Ischemic heart disease                                    73
 9 Ischemic stroke                                           31
10 Larynx cancer                                             18
# ℹ 11 more rows

Step 2: Get list of terms to use for cohort context matching

Source of terms: GWAS Catalog Ancestry Metadata

GWAS ancestry data.frame contains: country of recruitment, ancestry label, population descriptors

gwas_ancest_info <-  data.table::fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-ancestries-r2025-07-21.tsv"),
                           sep = "\t",
                           quote = "")

gwas_ancest_info = gwas_ancest_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

Countries

countries <- 
gwas_ancest_info$COUNTRY_OF_RECRUITMENT |>
  stringr::str_split("\\,") |>
  unlist() |>
  str_trim() |>
  unique() 
  
countries <- countries[countries != "NR"]
countries <- c(countries, 
               "United States", 
               "United Kingdom",
               "Korea",
               "America",
               "USA",
               "United States of America",
               "Latin America")

Population descriptors

population_descriptors <- 
  c(gwas_ancest_info$BROAD_ANCESTRAL_CATEGORY, gwas_ancest_info$ADDITIONAL_ANCESTRY_DESCRIPTION) |>
  str_replace_all("Recruitment in North, Central, and South America, Asia, and Europe", 
                  "North America, Central America, Asia, Europe") |>
  #stringr::str_split("\\,|;|heritage:\\|") |>
  stringr::str_split("\\,|;|heritage:|") |>
  unlist() |>
  str_trim() 

# processing population descriptors to remove extra information and get more general terms to match cohort descriptions in abstracts (e.g. "European" instead of "European (non-Finnish)")
population_descriptors <- population_descriptors |>
  str_remove_all("\\.$")  |> # Escape the dot to make it literal
  str_remove_all("^and ") |> 
  str_remove_all(" \\(founder/genetic isolate\\)$|\\(founder/genetic isolate$") |>
  #str_remove_all("^[0-9+]% ") |>
  str_remove_all("^\\d+% ") |>  # \d is shorthand for [0-9]
  str_remove_all("^[‡†*]+") |> # Remove dagger symbols at start 
  str_remove_all(" - Korea Association Resource \\(KARE\\)$") |>
  str_remove_all("or unknown ancestry$|\\(middle eastern$") |>
  str_remove_all("\\(see Springelkamp 2017\\)") |>
  str_remove_all("parents & grandparents born in ") |>
  str_remove_all(" cell lines$| cell line$") |>
  str_remove_all(" cohort$") |>
  str_remove_all(" population$") |> 
  str_remove_all("^Erasmus Rucphen in |^Erasmus Rucphen") |>
  str_remove_all("^including ") |>
  str_remove_all("\\(Middle Eastern$") |>
  unique() 


# removing population descriptors that are not valid to match cohort descriptions in abstracts (e.g. "cases", "controls", "Bipolar disorder")
not_valid_pop_descriptors <- c("NR", "N.R", "Other", "other", "", "unspecified",
                               "East", "Euopre",
                               "UKB", "UKBB", "DECODE", "Controls from UKBiobank", "UKBB and deCODE", 
                               "NHAPC", "GeneID-I", " Family", 
                               "non-Hispaniic white", "Zimbabweian", "Portugese", "Europen American",
                               "(see Moffatt et al 2010)", "See Wu J. H", 
                               "et al. 2013. PMID: 23362303", "See Wu et al 23362303", "23362303",
                               "See Locke (PubMed 25673413) for BMI and Shungin (PubMed 25673412) for WHR")

disease_descriptors <- grep("cases|Bipolar|controls|Schizophrenia|disorder", 
                            population_descriptors, 
                            value = T)

population_descriptors <- population_descriptors[!(population_descriptors %in% 
                                                     c(not_valid_pop_descriptors,
                                                       disease_descriptors
                                                       )
                                                     )]

Source of terms: hancestro ontology of human ancestry terms

# Get all terms in one go (if total < 500)
response <- GET("https://www.ebi.ac.uk/ols4/api/ontologies/hancestro/terms?size=1000")
data <- fromJSON(content(response, "text"), flatten = TRUE)
hancestro_terms <- data$`_embedded`$terms$label

# Remove terms with specific biobanks / datasets in brackets, as these are unlikely to be used in cohort descriptions in abstracts / texts
hancestro_terms <- str_remove_all(hancestro_terms, 
                                  " \\(SGDP\\)$| \\(1KGP\\)$| \\(HGDP\\)$| \\(GGVP\\)$")

# Remove obsolete terms
hancestro_terms <- grep("obsolete|obsolescence", 
                        hancestro_terms, 
                        value = TRUE, 
                        invert = TRUE)

# remove terms with specific regions in brackets, as these are unlikely to be used in cohort descriptions in abstracts / texts
hancestro_terms <- str_remove_all(hancestro_terms, pattern = "\\(Carmel\\)$|\\(pre1989\\)$|\\(Bergamo\\)$|\\(Negev\\)$|\\(Central\\)|\\(Caucasus\\)")

# replace underscores with spaces
hancestro_terms <- str_replace_all(hancestro_terms, 
                                   pattern = "_", 
                                   replacement = " ")

hancestro_terms <- unique(hancestro_terms)

# specific hancestro terms to exclude: 
terms_to_exclude <- c("ancestry category",
                      "ancestry status",
                      "BFO 0000006",
                      "Country",
                      "continent",
                      "continuant",
                      "curation status specification",
                      "denotator type",
                      "entity",
                      "ethnicity category",
                      "ethnicity descriptor",
                      "geographic location",
                      "geographic descriptor",
                      "geography-based population category",
                      "immaterial entity",
                      "independent continuant",
                      "!Kung",
                      "material entity",
                      "quality",
                      "reference population",
                      "region",
                      "organization",
                      "population",
                      "specifically dependent continuant",
                      "Thing",
                      "uncategorised population",
                      "undefined ancestry population")

hancestro_terms <- hancestro_terms[!c(hancestro_terms %in% terms_to_exclude)]

? Other possible sources

Country R packages Population descriptors in: https://pmc.ncbi.nlm.nih.gov/articles/PMC8715140/#sec2

Combine sources

# combine terms from different sources, and add some extra terms based on looking at cohort descriptions in abstracts
cohort_context_terms <-
c(countries, hancestro_terms, population_descriptors) |> 
  unique() |>
  sort()

# terms to add: 
cohort_context_terms <- c(cohort_context_terms, 
                          "Scandinavians",
                          "Native Hawaiians")

# terms to remove ... "Qatar", "UK", "Taiwan", "Korean"
cohort_context_terms <- cohort_context_terms[!cohort_context_terms %in% c("Qatar", 
                                                                          "UK", 
                                                                          "Taiwan",
                                                                          "Korean",
                                                                          "population")]

# replace brackets with \\( & and \\)
cohort_context_terms <- cohort_context_terms |> 
  str_replace_all(pattern = "\\(", replacement = "\\\\(") |>
  str_replace_all(pattern = "\\)", replacement = "\\\\)")



cohort_context_terms <- c(cohort_context_terms, 
                          "UK\\b(?! Biobank)",
                          "UK\\b\\(?! Biobank\\)",
                          "Qatar\\b\\(?! Biobank\\)",
                          "Taiwan\\b\\(?! Biobank\\)",
                          "Korean\\b\\(?! Biobank\\)"
                          )

# sort by length of term (longest first) to match longest names first
cohort_context_terms <- cohort_context_terms[order(-nchar(cohort_context_terms))]


writeLines(cohort_context_terms, 
           here::here("output/gwas_cat/ancestry_population_terms.txt"))

cohort_context_terms <- readLines(here::here("output/gwas_cat/ancestry_population_terms.txt"))

print("Number of cohort context terms:")

[1] "Number of cohort context terms:"

length(cohort_context_terms)

[1] 924

Step 3: Get cohort information

gwas_study_info_cohort =
  data.table::fread(here::here("output/gwas_cohorts/gwas_cohort_name_corrected.csv"))

gwas_study_info_cohort =
  gwas_study_info_cohort |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_study_info_cohort =
  gwas_study_info_cohort |>
  select(PUBMED_ID,
         COHORT) |>
  distinct()

print("Check adding cohort information has only added columns, not rows:")

[1] "Check adding cohort information has only added columns, not rows:"

print("Before adding:")

[1] "Before adding:"

dim(gwas_study_info)

[1] 8241   13

gwas_study_info =
  left_join(gwas_study_info,
            gwas_study_info_cohort,
            by = "PUBMED_ID"
  )

print("After adding:")

[1] "After adding:"

dim(gwas_study_info)

[1] 8241   14

print("Check adding cohort information has not increaed number of unique pubmed ids:")

[1] "Check adding cohort information has not increaed number of unique pubmed ids:"

gwas_study_info |>
  pull(PUBMED_ID) |>
  unique() |>
  length()

[1] 241

print("Number of unique pubmed ids for disease studies with cohort info:")

[1] "Number of unique pubmed ids for disease studies with cohort info:"

gwas_study_info |>
  filter(COHORT != "") |>
  pull(PUBMED_ID) |>
  unique() |>
  length()

[1] 52

Step 4: Add ancestry/population info

gwas_ancest_info =
  gwas_ancest_info |>
  select(PUBMED_ID,
         BROAD_ANCESTRAL_CATEGORY,
         COUNTRY_OF_RECRUITMENT) |>
  distinct() |>
  group_by(PUBMED_ID) |>
  summarise(
    BROAD_ANCESTRAL_CATEGORY = paste(
      unique(
        unlist(strsplit(BROAD_ANCESTRAL_CATEGORY, split = "\\|"))
      ),
      collapse = "|"
    ),
    COUNTRY_OF_RECRUITMENT = paste(
      unique(
        unlist(strsplit(COUNTRY_OF_RECRUITMENT, split = "\\|"))
      ),
      collapse = "|"
    )
  )

print("Check adding ancestry information has only added columns, not rows:")

[1] "Check adding ancestry information has only added columns, not rows:"

print("Before adding:")

[1] "Before adding:"

dim(gwas_study_info)

[1] 8241   14

gwas_study_info =
  left_join(gwas_study_info,
            gwas_ancest_info,
            by = "PUBMED_ID"
  )

print("After adding:")

[1] "After adding:"

dim(gwas_study_info)

[1] 8241   16

Step 5: Extract information required to get paper text & abstracts

gwas_study_info <-
  gwas_study_info |>
  #filter(COHORT != "") |>
  select(PUBMED_ID,
         COHORT,
         YEAR,
         BROAD_ANCESTRAL_CATEGORY,
         COUNTRY_OF_RECRUITMENT) |>
  distinct() |>
  group_by(PUBMED_ID,
           YEAR,
           BROAD_ANCESTRAL_CATEGORY,
           COUNTRY_OF_RECRUITMENT) |>
  summarise(
    COHORT = paste(
      unique(
        unlist(strsplit(COHORT, split = "\\|"))
      ),
      collapse = "|"
    )
  )

`summarise()` has grouped output by 'PUBMED_ID', 'YEAR',
'BROAD_ANCESTRAL_CATEGORY'. You can override using the `.groups` argument.

gwas_study_info = 
  gwas_study_info |>
  ungroup() |>
  arrange(PUBMED_ID)

pmids = gwas_study_info$PUBMED_ID

cohort = gwas_study_info$COHORT
names(cohort) = pmids

date = gwas_study_info$YEAR
names(date) = pmids

country = gwas_study_info$COUNTRY_OF_RECRUITMENT
names(country) = pmids

ancestry = gwas_study_info$BROAD_ANCESTRAL_CATEGORY
names(ancestry) = pmids

print("Number of papers without cohort information:")

[1] "Number of papers without cohort information:"

gwas_study_info |> 
  filter(COHORT == "") |>
  nrow()

[1] 189

print("Number of papers with cohort information:")

[1] "Number of papers with cohort information:"

gwas_study_info |>
  filter(COHORT != "") |>
  nrow()

[1] 52

Step 6: Get data dictionary of cohort names

# this xlsx was built from looking at acrynyms / cohort names in the gwas catalog
# and finding the corresponding full names / details of cohorts

cohort_names <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
                                 sheet = 1) |>
  mutate(across(everything(), 
                ~stringr::str_replace_all(.x,
                                          pattern = "\u00A0",
                                          replacement = " "))
         )

New names:
• `` -> `...15`

cohort_full_names = cohort_names$full_name[!is.na(cohort_names$full_name)]
cohort_full_names <- str_trim(cohort_full_names)
cohort_full_names <- iconv(cohort_full_names, to = "UTF-8")
cohort_full_names <- gsub("[\u00A0\r\n]", " ", cohort_full_names)  # replace non-breaking spaces, CR, LF with space
cohort_full_names <- str_squish(cohort_full_names)  # trims and removes extra spaces
# sort by length of name (longest first) to match longest names first
cohort_full_names <- cohort_full_names[order(-nchar(cohort_full_names))]
cohort_full_names <- cohort_full_names[cohort_full_names != "Not Reported"]
cohort_full_names <- unique(cohort_full_names)

print("Number of unique cohort full names:")

[1] "Number of unique cohort full names:"

length(cohort_full_names)

[1] 756

cohort_abbr_names = cohort_names$cohort[!is.na(cohort_names$cohort)]
cohort_abbr_names <- str_trim(cohort_abbr_names)
cohort_abbr_names <- iconv(cohort_abbr_names, to = "UTF-8")
cohort_abbr_names <- gsub("[\u00A0\r\n]", " ", cohort_abbr_names)  # replace non-breaking spaces, CR, LF with space
cohort_abbr_names <- str_squish(cohort_abbr_names)  # trims and removes extra spaces

# remove abbreviations that are too short
# cohort_abbr_names <- cohort_abbr_names[nchar(cohort_abbr_names) >= 4]
# small_abbr_to_keep <- c("C4D", 
#                         "BBJ", 
#                         "UKB", 
#                         "MVP", 
#                         "TWB", 
#                         "QBB",
#                         "MEC",
#                         "WHI"
#                         )

# cohort_abbr_names <- unique(c(cohort_abbr_names, 
#                               small_abbr_to_keep
#                               ))

cohort_abbr_names <- cohort_abbr_names[!str_detect(pattern = "\\?", 
                                                   cohort_abbr_names)]
# sort by length of name (longest first) to match longest names first
cohort_abbr_names <- cohort_abbr_names[order(-nchar(cohort_abbr_names))]


# add cohort names from GWAS catalog not yet added to data-dictionary
gwas_cat_cohorts = gwas_study_info_cohort$COHORT
gwas_cat_cohorts = unlist(strsplit(gwas_cat_cohorts, "\\|"))
gwas_cat_cohorts = gwas_cat_cohorts[!(gwas_cat_cohorts %in% c("", "other", "multiple"))]

cohort_abbr_names = unique(c(cohort_abbr_names,
                           gwas_cat_cohorts))

# remove small abbreviations that are likely to be false positives
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "DN"]
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "CHB"]
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "FG"]

print("Number of unique cohort abbreviation names:")

[1] "Number of unique cohort abbreviation names:"

length(cohort_abbr_names)

[1] 1019

Get cohort-relevant abstract sentences

Load relevant abstracts

Convert abstract sections to sentences


source spacyr_venv/bin/activate

python3 code/extract_text/spacy_obtain_sentences.py \
--input_dir output/abstracts \
--output_dir output/abstracts

abstract_files <- list.files(here::here("output/abstracts/"), 
                             pattern = "*.json", 
                             full.names = FALSE
                             ) |>
                   sort()

abstract_pmids = str_remove_all(abstract_files, "_sentences.json$")
abstract_pmids = abstract_pmids[abstract_pmids %in% pmids]


abstracts <- sapply(abstract_pmids,
                    function(file) {
                      json_data <- fromJSON(here::here(paste0("output/abstracts/", file, "_sentences.json")))
                      
                      abstract_lines <- json_data[json_data != ""]
                      
                      # readLines(here::here(paste0("output/abstracts/",file, ".txt")), 
                      #           warn = FALSE) |> 
                      #   paste(collapse = " ")
                    }
                    )

#pmids = pmids_with_abstracts 
abstract_cohort = cohort[which(pmids %in% abstract_pmids)]
abstract_country = country[which(pmids %in% abstract_pmids)]
abstract_date = date[which(pmids %in% abstract_pmids)]
abstract_ancestry = ancestry[which(pmids %in% abstract_pmids)]


# check lengths of these vectors are the same
print("Check lengths of vectors are the same:")

[1] "Check lengths of vectors are the same:"

print(paste("Length of pmids:", length(pmids)))

[1] "Length of pmids: 241"

print(paste("Length of abstract pmids:", length(abstract_pmids)))

[1] "Length of abstract pmids: 239"

print(paste("Length of abstracts:", length(abstracts)))

[1] "Length of abstracts: 239"

print(paste("Length of cohort:", length(cohort)))

[1] "Length of cohort: 241"

print(paste("Length of date:", length(date)))

[1] "Length of date: 241"

Check: What abstracts are missing?

missing_abstracts = pmids[!(pmids %in% abstract_pmids)] 

print("Number of missing abstracts:")

[1] "Number of missing abstracts:"

length(missing_abstracts)

[1] 2

#pmids <- pmids[pmids %in% abstract_pmids]

Cohort name matching in abstract sentences

Function to extract sentences with cohort names

extract_cohort_sentences <- function(abstract_list, 
                                     cohort_names, 
                                     column_name = "COHORT",
                                     tokenize = FALSE,
                                     ignore_case) {
  
  cohort_names_grep <- paste0("\\b", cohort_names, "\\b") # add word boundaries to match whole words only
  
  results <- lapply(seq_along(abstract_list), function(i) {
    
    #abstract <- text_vector[i]
    # Split abstract into sentences
    if(tokenize) {
      sentences <- tokenizers::tokenize_sentences(abstract_list[[i]])[[1]]
    } else {
      sentences <-  abstract_list[[i]]
    }
    
    # For each sentence, find all matching cohort names
    lapply(seq_along(sentences), function(s) {
      
      sentence <- sentences[s]
      
      # Identify cohort names present in this sentence
      matched_cohorts <- cohort_names[str_detect(sentence, 
                                                 regex(cohort_names_grep, 
                                                       ignore_case = ignore_case))]
      cohort_df <-
      data.frame(
        article_id = i,
        sentence_id = s,
        sentence = sentence,
        has_cohort = length(matched_cohorts) > 0,
        #!!column_name = if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else "",
        #COHORT = if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else "",
        stringsAsFactors = FALSE
      )
      
      # Add dynamic column
      cohort_df[[column_name]] <- if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else ""
      
      return(cohort_df)
      
    }) |> bind_rows()
  })

 results <- bind_rows(results)
 
 results$pubmed_id <- names(abstract_list)[results$article_id]
 
 return(results) #returns a df of sentences, with labelled columns for whether they contain cohort names, and which cohort names they contain (if any)
 # id of sentence, and abstract / article
}

Get cohort relevant sentences

Get cohort name sentences

# extract sentences with cohort full names (case-insensitive matching, as full names are less likely to be ambiguous)
cohort_sentences_df_p1 <- extract_cohort_sentences(abstracts,
                                                cohort_full_names,
                                                ignore_case = TRUE
                                                )

# extract sentences with cohort abbreviation names (case-sensitive matching, as abbreviations are more likely to be ambiguous)
cohort_sentences_df_p2 = extract_cohort_sentences(abstracts,
                                                  cohort_abbr_names,
                                                  ignore_case = FALSE
                                                  )

cohort_sentences_df =
  bind_rows(cohort_sentences_df_p1,
            cohort_sentences_df_p2
            )


cohort_sentences_df = 
  cohort_sentences_df |>
  distinct()

separate_cohorts <- function(COHORT) {
  
  if (any(grepl("\\|", COHORT))) {
    return(unlist(strsplit(COHORT, "\\|")))
  } else {
    return(COHORT)
  }
  
}

cohort_sentences_df =
  cohort_sentences_df |>
  group_by(article_id, sentence_id, sentence) |>
  summarise(
    COHORT = str_flatten(unique(separate_cohorts(COHORT)), 
                         collapse = "|", 
                         na.rm = T),
    has_cohort = ifelse(any(COHORT != ""), TRUE, FALSE)
  ) |>
  ungroup()

`summarise()` has grouped output by 'article_id', 'sentence_id'. You can
override using the `.groups` argument.

cohort_sentences_df =
  cohort_sentences_df |>
  mutate(COHORT = str_remove_all(COHORT, 
                               pattern = "\\|$|^\\|"
                               )
         )

Distinguishing words between sentences with cohort names, and sentences without cohort names

# remove stop words and common words that are not informative for cohort context
removal_words <- c("the", "and", "of", "in", "to", "with", "a",
                   "for", "on", "by", "is", "are", "was", "were", "each", "all", "had", "have", "it",
                   "as", "from", "that", "this", "which", "be", "at", "or", "an", "then", "than",  "into",
                   "if", "not", "only", "both", "same", "after", "across", "between", "out", "up", "any", 
                   "we", "our", "us", "these", "within", "per",
                   "used", "using", "use",
                  "0.01", "0.05", "0.5", "1", "one", "2", "two", "3", "three", "iii",
                  "4", "5", "6", "p", "8",  "10", "value", "significant", "30",
                   "i", "r", "wide"
)

cohort_sentences_words <- 
cohort_sentences_df |>
  filter(has_cohort) |>
  pull(sentence) |>
  tokenizers::tokenize_words() |>
  unlist()

cohort_sentence_words_df =
data.frame(word = cohort_sentences_words) |>
  filter(!(tolower(word) %in% removal_words)) |>
  group_by(word) |>
  summarise(n_in_cohort = n()) |>
  filter(n_in_cohort > 5)

non_cohort_sentences_words <-
cohort_sentences_df |>
  filter(!has_cohort) |>
  pull(sentence) |>
  tokenizers::tokenize_words() |>
  unlist()

# sample down non-cohort sentences words
set.seed(500)
non_cohort_sentences_words_sample <- sample(non_cohort_sentences_words, 
                                              size = length(cohort_sentences_words), 
                                              replace = FALSE)

non_cohort_sentence_words_df =
data.frame(word = non_cohort_sentences_words_sample) |>
  filter(!(tolower(word) %in% removal_words)) |>
  group_by(word) |>
  summarise(n_not_cohort = n()) 


n_sentences <- cohort_sentences_df |>
  filter(COHORT != "") |>
  select(article_id, sentence_id) |>
  distinct() |>
  nrow()


print("Words more common in sentences with cohort names than sentences without cohort names:")

[1] "Words more common in sentences with cohort names than sentences without cohort names:"

left_join(cohort_sentence_words_df,
          non_cohort_sentence_words_df,
          by = "word"
          ) |>
  mutate(n_not_cohort = ifelse(is.na(n_not_cohort), 0, n_not_cohort)) |>
  mutate(delta_n = n_in_cohort - n_not_cohort) |>
  mutate(recall = n_in_cohort / n_sentences) |>
  arrange(desc(delta_n)) |>
  head(20)

# A tibble: 20 × 5
   word        n_in_cohort n_not_cohort delta_n recall
   <chr>             <int>        <dbl>   <dbl>  <dbl>
 1 biobank              53            0      53 0.310 
 2 study                62           19      43 0.363 
 3 uk                   41            1      40 0.240 
 4 data                 38            9      29 0.222 
 5 n                    29            4      25 0.170 
 6 cases                35           16      19 0.205 
 7 controls             32           15      17 0.187 
 8 consortium           15            0      15 0.0877
 9 cohorts              15            2      13 0.0877
10 project              13            0      13 0.0760
11 african              12            0      12 0.0702
12 based                18            7      11 0.105 
13 genomes              11            0      11 0.0643
14 individuals          25           14      11 0.146 
15 performed            25           14      11 0.146 
16 1000                 10            0      10 0.0585
17 19                   10            0      10 0.0585
18 ancestry             14            4      10 0.0819
19 analysis             40           31       9 0.234 
20 cohort               15            6       9 0.0877

print("Number of abstracts containing probable cohort reference")

[1] "Number of abstracts containing probable cohort reference"

abstracts_cohorts <- cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(article_id) |>
  unique() 

n_abstracts_cohorts =  abstracts_cohorts |>
  length()

print(n_abstracts_cohorts)

[1] 99

print("Percentage of sampled abstracts containing probable cohort reference:")

[1] "Percentage of sampled abstracts containing probable cohort reference:"

100 * n_abstracts_cohorts / length(unique(pmids))

[1] 41.07884

print("Number of sentences containing probable cohort reference")

[1] "Number of sentences containing probable cohort reference"

  cohort_sentences_df |>
  filter(COHORT != "") |>
  select(article_id, sentence_id) |>
  distinct() |>
  nrow()

[1] 171

print("Number of entities")

[1] "Number of entities"

cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(COHORT) |>
  str_split("\\|") |>
  unlist() |>
  length()

[1] 279

print("Number of unique cohorts referenced (not unique cohort names, but unique strings in the COHORT column):")

[1] "Number of unique cohorts referenced (not unique cohort names, but unique strings in the COHORT column):"

detected_cohorts <- cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(COHORT) |>
  str_split("\\|") |>
  unlist() 

n_cohorts <- detected_cohorts |>
  unique()  |>
  length()

print(n_cohorts)

[1] 115

print("Most common cohort names detected (by number of sentences they are mentioned in):")

[1] "Most common cohort names detected (by number of sentences they are mentioned in):"

data.frame(cohort = detected_cohorts) |>
  group_by(cohort) |>
  summarise(n = n()) |>
  arrange(desc(n)) |>
  head(10)

# A tibble: 10 × 2
   cohort            n
   <chr>         <int>
 1 UK Biobank       38
 2 1000 Genomes     10
 3 Biobank Japan     7
 4 HTN               7
 5 WTCCC             7
 6 Taiwan            6
 7 AA-DHS            5
 8 COPDGene          5
 9 DCCT              5
10 DHS               5

print("Most common cohort names detected (by number of abstracts they are mentioned in):")

[1] "Most common cohort names detected (by number of abstracts they are mentioned in):"

cohort_sentences_df |>
  filter(COHORT != "") |>
  select(COHORT, article_id) |>
  tidyr::separate_rows(COHORT, sep = "\\|") |>
  distinct() |>
  group_by(COHORT) |>
  summarise(n_abstracts = n()) |>
  arrange(desc(n_abstracts)) |>
  head(5)

# A tibble: 5 × 2
  COHORT        n_abstracts
  <chr>               <int>
1 UK Biobank             29
2 1000 Genomes            9
3 Biobank Japan           6
4 FinnGen                 4
5 Taiwan                  4

Add cohort-related context sentences

# extract sentences with cohort context terms (case-sensitive matching, as some context terms are common words that could be ambiguous)
cohort_context_df = extract_cohort_sentences(abstracts,
                                            cohort_context_terms,
                                            column_name = "CONTEXT",
                                            ignore_case = FALSE
                                            )

cohort_sentences_df <-
left_join(cohort_sentences_df,
          cohort_context_df |> select(article_id, 
                                      sentence_id, 
                                      CONTEXT),
          by = c("article_id", "sentence_id")
          ) 

print("Number of sentences with cohort names and/or cohort-related context:")

[1] "Number of sentences with cohort names and/or cohort-related context:"

cohort_sentences_df |> 
  mutate(has_context = ifelse(CONTEXT == "", F, T)) |>
  group_by(has_cohort, has_context) |> 
  summarise(n = n())

`summarise()` has grouped output by 'has_cohort'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   has_cohort [2]
  has_cohort has_context     n
  <lgl>      <lgl>       <int>
1 FALSE      FALSE        1876
2 FALSE      TRUE          214
3 TRUE       FALSE         124
4 TRUE       TRUE           47

filtered_cohort_sentences_df =
  cohort_sentences_df |>
  filter(COHORT != "" | CONTEXT != "")

print("Number of sentences with cohort names or cohort-related context (population, ancestry, country terms):")

[1] "Number of sentences with cohort names or cohort-related context (population, ancestry, country terms):"

nrow(filtered_cohort_sentences_df)

[1] 385

Get other sentences with words that indicate sample or cohort is being discussed

Using this to correct for the over-representation of sentences without cohort names, by identifying well-matched sentences

# words that indicate sample or cohort is being discussed:
sample_terms <- c(
  # Your original list
  "ancestry", "biobank", "cases", "cohort", "controls", 
  "consortium", "consortia", "descent", "founder", 
  "enrolled", "enrollment", "ethnicity", "heritage",
  "isolate", "isolated", "individuals", "participants", 
  "patients", "population", "recruitment", "registry", 
  "sample", "study", "subjects", "volunteer",
  
  # Additional core terms
  "recruited", "enroll", "ascertained", "volunteers",
  "probands", "affected", "unaffected", "families", "twins",
  "cohorts", "populations", "samples", "subgroup", "subset",
  "subcohort", "demographic", "ethnic", "racial",
  "admixed", "admixture", "ancestral", "origin",
  "case-control", "prospective", "genotyped", "sequenced",
  "meta-analysis", "pooled", "residing", "nationwide",
  
  # Useful variants
  "enrollees", "sampling", "lineage", "self-reported",
  "self-identified", "replication", "discovery", "validation"
)

control_cohort_sentences_df <-
cohort_sentences_df |>
  filter(COHORT == "" & CONTEXT == "") |>
  filter(grepl(paste(sample_terms, collapse = "|"), 
               sentence, 
               ignore.case = TRUE)
         ) #|>

print("Number of sentences without (detected) cohort names or cohort-context but with sample-related context:")

[1] "Number of sentences without (detected) cohort names or cohort-context but with sample-related context:"

nrow(control_cohort_sentences_df)

[1] 706

filtered_cohort_sentences_df =
  bind_rows(filtered_cohort_sentences_df,
            control_cohort_sentences_df
            ) |>
  distinct()

print("Percentage of sentences with cohort names vs other cohort-related context")

[1] "Percentage of sentences with cohort names vs other cohort-related context"

total_n_sentences <- nrow(filtered_cohort_sentences_df)

filtered_cohort_sentences_df |>
  group_by(has_cohort) |> 
  summarise(n = n(),
            percentage = round(100 * n/total_n_sentences, digits = 2)
  )

# A tibble: 2 × 3
  has_cohort     n percentage
  <lgl>      <int>      <dbl>
1 FALSE        920       84.3
2 TRUE         171       15.7

abstract_pmids = filtered_cohort_sentences_df$pubmed_id |> unique()

Warning: Unknown or uninitialised column: `pubmed_id`.

abstract_ids <- filtered_cohort_sentences_df$article_id |> unique()
abstract_date = abstract_date[abstract_ids]
abstract_cohort = abstract_cohort[abstract_ids]
abstract_country = abstract_country[abstract_ids]

Prepare data in Doccano JSON format

# file path for intermediate json file output
json_file = here::here("output/doccano/abstracts_with_cohort_info.json")

# file path for final jsonl file output
jsonl_file = here::here("output/doccano/abstracts_with_cohort_info.jsonl")

convert_to_doccano_json_sentence_level <- function(pmids,
                                                   date,
                                                   cohort,
                                                   country,
                                                   cohort_sentences_df) {
  
  # set up json list
  doccano_list <- list()
  example_id <- 1

  for(current_sentence in cohort_sentences_df$sentence) {

      # Filter cohort sentences that match this sentence (safe matching)
      df <- cohort_sentences_df |>
             dplyr::filter(sentence == current_sentence) 
      
      # article_id <- df$article_id
      # matched_cohort <- df$COHORT
      
      for (i in seq_len(nrow(df))) {
        
        matched_cohort <- df$COHORT[i]
        article_id <- df$article_id[i]
        
        #browser()
      
        if(matched_cohort == ""){
        
          doccano_list[[example_id]] <- list(
          text = current_sentence,
          pubmed_id = pmids[article_id],
          date = date[article_id],
          country = country[article_id],
          gwas_cat_cohort_label = cohort[article_id],
          label = list()
          )
        
        } else {
        
        # Find the location of all matches of the cohort name in the sentence
        if(grepl("\\|", matched_cohort)) {
           
           # If multiple cohort names, separate
           matched_cohort <- unlist(
                                    strsplit(matched_cohort, 
                                             split = "\\|")
                                    )
        
           match_locations <- list()
        
           for(current_matched_cohort in matched_cohort){
          
               matches <- str_locate_all(current_sentence,
                                         fixed(current_matched_cohort, 
                                               ignore_case = TRUE)
                                         )[[1]]
          
               match_locations <- append(match_locations, 
                                         list(matches)
                                         )
          
           }
        
           # Combine all match locations into a single matrix
           matches <- do.call(rbind, match_locations)
        
        }  else {
        
           matches <- str_locate_all(current_sentence,
                                     fixed(matched_cohort, 
                                           ignore_case = TRUE)
                                     )[[1]]
      
      }
      
      # Convert matches to 0-based indexing (for doccano)
      matches[, "start"] <- matches[, "start"] - 1  
      
      # Turn match locations into entity list
      entities <- list()
      
      for(k in seq_len(nrow(matches))) {
              entities <- append(entities, list(list(
                start_offset = matches[k, "start"],
                end_offset = matches[k, "end"],
                label = "COHORT"
              )))
      }
      
      # Create Doccano JSON entry
      doccano_list[[example_id]] <- list(
        # id = example_id,
        text = current_sentence,
        pubmed_id = pmids[article_id],
        date = date[article_id],
        country = country[article_id],
        gwas_cat_cohort_label = cohort[article_id],
        label = entities
      )
      
      }
      }

      example_id <- example_id + 1
  }
  
  
# Suppose each element of json_list has 'labels' as a list of named lists
  doccano_list <- lapply(doccano_list, function(x) {
  x$label <- lapply(x$label, function(l) {
    # convert named list to vector/list format [start, end, label]
    c(l$start_offset, l$end_offset, l$label)
  })
  x
})

  return(doccano_list)
}


# Create JSON as a list
json_list <- convert_to_doccano_json_sentence_level(pmids = abstract_pmids, #pmids,
                                                    date = abstract_date,
                                                    cohort = abstract_cohort,
                                                    country = abstract_country,
                                                    filtered_cohort_sentences_df)



# Write to JSON file
writeLines(toJSON(json_list,
                  auto_unbox = TRUE,
                  pretty = TRUE),
           json_file)


json_data <- fromJSON(json_file,
                      simplifyVector = FALSE)

# Open connection to JSONL file
con <- file(jsonl_file, "w")

# Loop over each element (object) and write as one line
for (i in seq_along(json_data)) {
  writeLines(toJSON(json_data[[i]], auto_unbox = TRUE), con)
}

# Close connection
close(con)

cat("JSONL saved to:", jsonl_file, "\n")

JSONL saved to: /Users/ibeasley/code/genomics_ancest_disease_dispar/output/doccano/abstracts_with_cohort_info.jsonl

Get cohort-revelevant methods section sentence extraction

Convert methods sections to sentences


source spacyr_venv/bin/activate

python3 code/extract_text/spacy_obtain_sentences.py \
--input_dir output/fulltexts/methods_sections \
--output_dir  output/fulltexts/methods_sentences

Get relevant methods text:

methods_sections <- list.files(here::here("output/fulltexts/methods_sentences/"), 
                             pattern = "*.json", 
                             full.names = TRUE
                             )

pmcids_with_methods <- methods_sections |>
  gsub(pattern = ".*/", replacement = "") |>
  gsub(pattern = "_methods_sentences\\.json$", replacement = "")

pubmeds_with_methods <- converted_ids |>
  filter(pmcids %in% pmcids_with_methods) |>
  pull(PMID)

pubmeds_with_methods <- c(pubmeds_with_methods, 
                          grep("PMC", pmcids_with_methods, value = TRUE, invert = TRUE)
                          )

print("Number of papers with methods sections extracted:")

[1] "Number of papers with methods sections extracted:"

sum(pubmeds_with_methods %in% pmids)

[1] 212

pmids_methods_to_get <-  pubmeds_with_methods[pubmeds_with_methods %in% pmids]

pmcids_methods_to_get <- converted_ids |>
  filter(PMID %in% pmids_methods_to_get) |>
  pull(pmcids)

ids_methods_to_get <- c(pmids_methods_to_get, pmcids_methods_to_get)

ids_methods_to_get <- ids_methods_to_get[ids_methods_to_get != ""]

files_to_get <-
paste0(ids_methods_to_get, 
       "_methods_sentences.json") |>
  sort()

methods_sections_to_get <- methods_sections[basename(methods_sections) %in% files_to_get]
methods_sections_to_get <- basename(methods_sections_to_get) |> 
                           gsub(pattern = "_methods_sentences\\.json$", replacement = "")

# read in methods sections for available papers
methods_texts <- sapply(methods_sections_to_get,
                    function(id) {
                      
                      file <- here::here(paste0("output/fulltexts/methods_sentences/",
                                                id, 
                                                "_methods_sentences.json"))
                      
                      json_data <- fromJSON(file)
                      
                      abstract_lines <- json_data[json_data != ""]
                      # readLines(file, 
                      #           warn = FALSE, 
                      #           encoding = "UTF-8") |>
                      # paste(collapse = " ")
                    }
                    )

# convert pmcids to pubmeds
names(methods_texts) <- sapply(names(methods_texts), function(id) {
  if(grepl("^PMC", id)) {
    pmcid <- id
    pubmed_id <- converted_ids$PMID[converted_ids$pmcids == pmcid]
    if(length(pubmed_id) > 0) {
      return(as.character(pubmed_id))
    } else {
      return(NA)
    }
  } else {
    return(id)
  }
})

# order methods texts by pmids
methods_texts <- methods_texts[order(names(methods_texts))]

methods_pmids <- names(methods_texts)
methods_cohort <- cohort[which(pmids %in% methods_pmids)]
methods_country <- country[which(pmids %in% methods_pmids)]
methods_date <- date[which(pmids %in% methods_pmids)]

Extract sentences with cohort names or cohort-related context (population, ancestry, country terms) from methods sections

# extract sentences with cohort full names (case-insensitive matching, as full names are less likely to be ambiguous)
cohort_sentences_df_p1 <- extract_cohort_sentences(methods_texts,
                                                cohort_full_names,
                                                ignore_case = TRUE
                                                )

# extract sentences with cohort abbreviation names (case-sensitive matching, as abbreviations are more likely to be ambiguous)
cohort_sentences_df_p2 = extract_cohort_sentences(methods_texts,
                                                  cohort_abbr_names,
                                                  ignore_case = FALSE
                                                  )

methods_cohort_sentences_df = 
  bind_rows(cohort_sentences_df_p1,
            cohort_sentences_df_p2
            ) |>
  distinct()


methods_cohort_sentences_df =
  methods_cohort_sentences_df |>
  group_by(article_id, sentence_id, sentence) |>
  summarise(
    COHORT = str_flatten(unique(separate_cohorts(COHORT)), 
                         collapse = "|", 
                         na.rm = T),
    has_cohort = ifelse(any(COHORT != ""), TRUE, FALSE)
  ) |>
  ungroup()

`summarise()` has grouped output by 'article_id', 'sentence_id'. You can
override using the `.groups` argument.

methods_cohort_sentences_df =
  methods_cohort_sentences_df |>
  mutate(COHORT = str_remove_all(COHORT, 
                               pattern = "\\|$|^\\|"
                               )
         ) 

print("Number of methods sections containing probable cohort reference:")

[1] "Number of methods sections containing probable cohort reference:"

methods_cohorts <- methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(article_id) |>
  unique()

methods_cohorts  |> length()

[1] 197

print("Percentage of methods sections containing probable cohort reference:")

[1] "Percentage of methods sections containing probable cohort reference:"

100 * length(methods_cohorts) / length(methods_texts)

[1] 92.92453

print("Number of sentences containing probable cohort reference in methods sections")

[1] "Number of sentences containing probable cohort reference in methods sections"

  methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  select(article_id, sentence_id) |>
  distinct() |>
  nrow()

[1] 1873

print("Number of entities in methods sections")

[1] "Number of entities in methods sections"

methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(COHORT) |>
  str_split("\\|") |>
  unlist() |>
  length()

[1] 2792

print("Most common cohort names detected in methods sections:")

[1] "Most common cohort names detected in methods sections:"

methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  select(COHORT, article_id) |>
  tidyr::separate_rows(COHORT, sep = "\\|") |>
  distinct() |>
  group_by(COHORT) |>
  summarise(n_abstracts = n()) |>
  arrange(desc(n_abstracts)) |>
  head(5)

# A tibble: 5 × 2
  COHORT        n_abstracts
  <chr>               <int>
1 1000 Genomes           82
2 UK Biobank             47
3 GTEx                   39
4 UKB                    18
5 Biobank Japan          16

print("Number of cohort names only mentioned in one paper")

[1] "Number of cohort names only mentioned in one paper"

methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  select(COHORT, article_id) |>
  tidyr::separate_rows(COHORT, sep = "\\|") |>
  distinct() |>
  group_by(COHORT) |>
  summarise(n_abstracts = n()) |>
  filter(n_abstracts == 1) |>
  nrow()

[1] 229

print("Number of unique cohorts referenced in methods sections (not unique cohort names, but unique strings in the COHORT column):")

[1] "Number of unique cohorts referenced in methods sections (not unique cohort names, but unique strings in the COHORT column):"

detected_cohorts_methods <- methods_cohort_sentences_df |>
  filter(COHORT != "") |>
  pull(COHORT) |>
  str_split("\\|") |>
  unlist()

n_cohorts_methods <- detected_cohorts_methods |>
  unique()  |>
  length()

print(n_cohorts_methods)

[1] 389

print("Most common cohort names detected in methods sections by number of sentences they are mentioned in:")

[1] "Most common cohort names detected in methods sections by number of sentences they are mentioned in:"

data.frame(cohort = detected_cohorts_methods) |>
  group_by(cohort) |>
  summarise(n = n()) |>
  arrange(desc(n)) |>
  head(10)

# A tibble: 10 × 2
   cohort           n
   <chr>        <int>
 1 UK Biobank     210
 2 1000 Genomes   165
 3 UKB            138
 4 UKBB            74
 5 GTEx            71
 6 ARIC            45
 7 MVP             45
 8 FHS             43
 9 GERA            37
10 WTCCC           37

# 
# methods_cohort_sentences_df |>
#   filter(!c(article_id %in% methods_abstracts_cohorts)) |>
#   View()

Distinguishing words between sentences with cohort names, and sentences without cohort names in methods sections

cohort_sentences_words <- 
cohort_sentences_df |>
  filter(has_cohort) |>
  pull(sentence) |>
  tokenizers::tokenize_words() |>
  unlist()

cohort_sentence_words_df =
data.frame(word = cohort_sentences_words) |>
  filter(!(tolower(word) %in% removal_words)) |>
  group_by(word) |>
  summarise(n_in_cohort = n()) |>
  filter(n_in_cohort > 5)

non_cohort_sentences_words <-
cohort_sentences_df |>
  filter(!has_cohort) |>
  pull(sentence) |>
  tokenizers::tokenize_words() |>
  unlist()

# sample down non-cohort sentences words
set.seed(500)
non_cohort_sentences_words_sample <- sample(non_cohort_sentences_words, 
                                              size = length(cohort_sentences_words), 
                                              replace = FALSE)

non_cohort_sentence_words_df =
data.frame(word = non_cohort_sentences_words_sample) |>
  filter(!(tolower(word) %in% removal_words)) |>
  group_by(word) |>
  summarise(n_not_cohort = n()) 


print("Words more common in sentences with cohort names than sentences without cohort names:")

[1] "Words more common in sentences with cohort names than sentences without cohort names:"

left_join(cohort_sentence_words_df,
          non_cohort_sentence_words_df,
          by = "word"
          ) |>
  mutate(n_not_cohort = ifelse(is.na(n_not_cohort), 0, n_not_cohort)) |>
  mutate(delta_n = n_in_cohort - n_not_cohort) |>
  arrange(desc(delta_n)) |>
  head(20)

# A tibble: 20 × 4
   word        n_in_cohort n_not_cohort delta_n
   <chr>             <int>        <dbl>   <dbl>
 1 biobank              53            0      53
 2 study                62           19      43
 3 uk                   41            1      40
 4 data                 38            9      29
 5 n                    29            4      25
 6 cases                35           16      19
 7 controls             32           15      17
 8 consortium           15            0      15
 9 cohorts              15            2      13
10 project              13            0      13
11 african              12            0      12
12 based                18            7      11
13 genomes              11            0      11
14 individuals          25           14      11
15 performed            25           14      11
16 1000                 10            0      10
17 19                   10            0      10
18 ancestry             14            4      10
19 analysis             40           31       9
20 cohort               15            6       9

Add cohort-related context sentences to methods sections

# extract sentences with cohort context terms (case-sensitive matching, as some context terms are common words that could be ambiguous)
cohort_context_df = extract_cohort_sentences(methods_texts,
                                            cohort_context_terms,
                                            column_name = "CONTEXT",
                                            ignore_case = FALSE
                                            )

methods_cohort_sentences_df <-
left_join(methods_cohort_sentences_df,
          cohort_context_df |> select(article_id, sentence_id, CONTEXT),
          by = c("article_id", "sentence_id")
          ) 

print("Number of sentences with cohort names and/or cohort-related context:")

[1] "Number of sentences with cohort names and/or cohort-related context:"

methods_cohort_sentences_df |> 
  mutate(has_context = ifelse(CONTEXT == "", F, T)) |>
  group_by(has_cohort, has_context) |> 
  summarise(n = n())

`summarise()` has grouped output by 'has_cohort'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   has_cohort [2]
  has_cohort has_context     n
  <lgl>      <lgl>       <int>
1 FALSE      FALSE       12169
2 FALSE      TRUE         1154
3 TRUE       FALSE        1516
4 TRUE       TRUE          357

filtered_cohort_sentences_df =
  methods_cohort_sentences_df |>
  filter(COHORT != "" | CONTEXT != "")

print("Number of sentences with cohort names or cohort-related context (population, ancestry, country terms):")

[1] "Number of sentences with cohort names or cohort-related context (population, ancestry, country terms):"

nrow(filtered_cohort_sentences_df)

[1] 3027

Get other sentences with words that indicate sample or cohort is being discussed

control_cohort_sentences_df <-
cohort_sentences_df |>
  filter(COHORT == "" & CONTEXT == "") |>
  filter(grepl(paste(sample_terms, collapse = "|"), 
               sentence, 
               ignore.case = TRUE)
         ) #|>

print("Number of sentences without (detected) cohort names or cohort-context but with sample-related context:")

[1] "Number of sentences without (detected) cohort names or cohort-context but with sample-related context:"

nrow(control_cohort_sentences_df)

[1] 706

filtered_cohort_sentences_df =
  bind_rows(filtered_cohort_sentences_df,
            control_cohort_sentences_df
            ) |>
  distinct()

print("Percentage of sentences with cohort names vs other cohort-related context")

[1] "Percentage of sentences with cohort names vs other cohort-related context"

total_n_sentences <- nrow(filtered_cohort_sentences_df)

filtered_cohort_sentences_df |>
  group_by(has_cohort) |> 
  summarise(n = n(),
            percentage = round(100 * n/total_n_sentences, digits = 2)
  )

# A tibble: 2 × 3
  has_cohort     n percentage
  <lgl>      <int>      <dbl>
1 FALSE       1860       49.8
2 TRUE        1873       50.2

Prepare data in Doccano JSON format

# file path for intermediate json file output
json_file = here::here("output/doccano/methods_with_cohort_info.json")

# file path for final jsonl file output
jsonl_file = here::here("output/doccano/methods_with_cohort_info.jsonl")


methods_pmids <- filtered_cohort_sentences_df$pubmed_id |> unique()

Warning: Unknown or uninitialised column: `pubmed_id`.

article_ids <- filtered_cohort_sentences_df$article_id |> unique()

methods_cohort = methods_cohort[article_ids]
country = country[article_ids]
date = date[article_ids]
ancestry = ancestry[article_ids]

# Create JSON as a list
json_list <- convert_to_doccano_json_sentence_level(pmids = pmids,
                                                    date = date,
                                                    cohort = cohort,
                                                    country = country,
                                                    filtered_cohort_sentences_df)

# Write to JSON file
writeLines(toJSON(json_list,
                  auto_unbox = TRUE,
                  pretty = TRUE),
           json_file)


json_data <- fromJSON(json_file,
                      simplifyVector = FALSE)

# Open connection to JSONL file
con <- file(jsonl_file, "w")

# Loop over each element (object) and write as one line
for (i in seq_along(json_data)) {
  writeLines(toJSON(json_data[[i]], auto_unbox = TRUE), con)
}

# Close connection
close(con)

cat("JSONL saved to:", jsonl_file, "\n")

JSONL saved to: /Users/ibeasley/code/genomics_ancest_disease_dispar/output/doccano/methods_with_cohort_info.jsonl

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] tokenizers_0.3.0 jsonlite_2.0.0   xml2_1.4.0       rentrez_1.2.4   
 [5] httr_1.4.7       stringi_1.8.7    dplyr_1.1.4      readxl_1.4.5    
 [9] stringr_1.6.0    workflowr_1.7.1 

loaded via a namespace (and not attached):
 [1] sass_0.4.10         utf8_1.2.6          generics_0.1.4     
 [4] tidyr_1.3.1         renv_1.0.3          digest_0.6.37      
 [7] magrittr_2.0.4      evaluate_1.0.5      fastmap_1.2.0      
[10] cellranger_1.1.0    rprojroot_2.1.0     processx_3.8.6     
[13] whisker_0.4.1       ps_1.9.1            promises_1.3.3     
[16] BiocManager_1.30.26 purrr_1.1.0         XML_3.99-0.19      
[19] jquerylib_0.1.4     cli_3.6.5           rlang_1.1.6        
[22] withr_3.0.2         cachem_1.1.0        yaml_2.3.10        
[25] tools_4.3.1         httpuv_1.6.16       here_1.0.1         
[28] vctrs_0.6.5         R6_2.6.1            lifecycle_1.0.4    
[31] git2r_0.36.2        fs_1.6.6            pkgconfig_2.0.3    
[34] callr_3.7.6         pillar_1.11.1       bslib_0.9.0        
[37] later_1.4.4         glue_1.8.0          data.table_1.17.8  
[40] Rcpp_1.1.0          xfun_0.55           tibble_3.3.0       
[43] tidyselect_1.2.1    rstudioapi_0.17.1   knitr_1.50         
[46] htmltools_0.5.8.1   SnowballC_0.7.1     rmarkdown_2.30     
[49] compiler_4.3.1      getPass_0.2-4

Get abstract text for GWAS Catalog studies

Isobel Beasley

2025-10-17

Set up

Required packages

Getting required information from GWAS catalog

Step 1: Get only disease studies, and select relevant columns

How much overlap with abstracts and full-texts retrieved?

Step 2: Get list of terms to use for cohort context matching

Source of terms: GWAS Catalog Ancestry Metadata

Countries

Population descriptors

Source of terms: hancestro ontology of human ancestry terms

? Other possible sources

Combine sources

Step 3: Get cohort information

Step 4: Add ancestry/population info

Step 5: Extract information required to get paper text & abstracts

Step 6: Get data dictionary of cohort names

Get cohort-relevant abstract sentences

Load relevant abstracts

Convert abstract sections to sentences

Check: What abstracts are missing?

Cohort name matching in abstract sentences

Function to extract sentences with cohort names

Get cohort relevant sentences

Get cohort name sentences

Distinguishing words between sentences with cohort names, and sentences without cohort names

Get other sentences with words that indicate sample or cohort is being discussed

Prepare data in Doccano JSON format

Get cohort-revelevant methods section sentence extraction

Convert methods sections to sentences

Get relevant methods text:

Distinguishing words between sentences with cohort names, and sentences without cohort names in methods sections

Get other sentences with words that indicate sample or cohort is being discussed

Prepare data in Doccano JSON format