Last updated: 2025-10-27

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
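
As a toy illustration (not part of this analysis), re-running seeded code always yields the same draw:

set.seed(20220216)
sample(1:10, 3)  # identical result on every run with this seed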

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3df4096. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/who/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/

Untracked files:
    Untracked:  analysis/specific_aims_stats.Rmd

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/correcting_cohort_names.Rmd
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gbd_data_plots.Rmd
    Modified:   analysis/group_cancer_diseases.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/trait_ontology_categorization.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
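
As mentioned above, wflow_publish() can commit the R Markdown file together with any scripts or data it depends on before building; for example (file paths illustrative):

wflow_publish(c("analysis/get_full_text.Rmd",
                "code/helper_script.R"),  # illustrative extra dependency
              message = "Update full text downloading")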


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)     # API requests
library(xml2)     # parse full-text XML
library(stringr)  # string cleaning
library(here)     # project-relative paths
library(dplyr)    # data wrangling

# (data.table, purrr, jsonlite, rlang and readxl are used via ::;
#  tokenizers is attached later)

Get PMCIDs

Get PubMed IDs from the GWAS Catalog

gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))

gwas_study_info <- gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_study_info <- gwas_study_info |>
  filter(DISEASE_STUDY == TRUE)

pmids <- unique(gwas_study_info$PUBMED_ID)

# test run: subsample 500 studies (deterministic given the seed set above)
pmids <- sample(pmids, size = 500)

length(pmids)
[1] 500

Convert PubMed IDs to PMCIDs

# Convert PMIDs to PMCIDs via the NCBI ID Converter API, in batches
convert_pmid_to_pmcid <- function(pmid_vec,
                                  tool = "myTool",
                                  email = "you@example.com",
                                  format = "json",
                                  batch_size = 100,
                                  sleep_time = 1) {

  base_url <- "https://pmc.ncbi.nlm.nih.gov/tools/idconv/api/v1/articles/"

  batches <- split(pmid_vec,
                   ceiling(seq_along(pmid_vec) / batch_size))

  pmcid_list <- purrr::map(batches,
             function(batch) {

               ids_param <- paste(batch,
                                  collapse = ",")

               query <- list(ids = ids_param,
                             idtype = "pmid",
                             tool = tool,
                             email = email,
                             format = format
                             )

               resp <- httr::GET(base_url,
                                 query = query)

               httr::stop_for_status(resp)

               # Be polite to the API between batches
               Sys.sleep(sleep_time)

               content_text <- httr::content(resp,
                                             as = "text",
                                             encoding = "UTF-8")

               parsed <- jsonlite::fromJSON(content_text,
                                            flatten = TRUE)

               # Handle cases where records are empty or lack a pmcid column
               if (is.null(parsed$records) ||
                   !"pmcid" %in% names(parsed$records)) {
                 return(rep("", length(batch)))
               }

               parsed$records[is.na(parsed$records)] <- ""

               pmcid <- parsed$records |>
                        dplyr::pull(pmcid)

               return(pmcid)
             }
  )

  pmcid_list <- unlist(pmcid_list)
  names(pmcid_list) <- pmid_vec
  return(pmcid_list)
}

pmcids <- convert_pmid_to_pmcid(pmids)
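
As a sanity check, the converter can also be queried for a single ID by hand (the PMID below is arbitrary, chosen only for illustration):

resp <- httr::GET(
  "https://pmc.ncbi.nlm.nih.gov/tools/idconv/api/v1/articles/",
  query = list(ids = "23193287",  # arbitrary example PMID
               idtype = "pmid",
               format = "json")
)
jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))$records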

get_pmcid_europepmc <- function(pmid_vec) {
  base_url <- "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

  purrr::map_dfr(pmid_vec, function(pmid) {
    query <- list(
      query = paste0("ext_id:", pmid),
      format = "json"
    )
    resp <- httr::GET(base_url, query = query)
    if (httr::status_code(resp) != 200) {
      return(tibble(pmid = pmid,
                    pmcid = NA_character_))
    }

    dat <- jsonlite::fromJSON(httr::content(resp,
                                            as = "text",
                                            encoding = "UTF-8"))

    pmcid <- dat$resultList$result$pmcid

    # No results, or results without a pmcid field
    if (length(pmcid) == 0) {
      return(tibble(pmid = pmid, pmcid = NA_character_))
    }

    tibble(pmid = pmid, pmcid = pmcid)
  })
}

pmids_missing <- names(pmcids[pmcids == ""])

pmcid_europepmc_df <- get_pmcid_europepmc(pmids_missing)

converted_ids <- data.frame(pmids = names(pmcids),
                            pmcids = pmcids)

data.table::fwrite(converted_ids,
                   here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
                   )

# How many missing? 
sum(pmcids == "")
[1] 115

Download + clean full texts from European PMC

converted_ids <- data.table::fread(
  here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
)


download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europepmc")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )

  resp <- GET(url_xml)

  if (status_code(resp) != 200) stop("Failed to fetch XML for ", pmcid)

  xml_content <- read_xml(content(resp,
                                  as = "text",
                                  encoding = "UTF-8")
                          )

  # Get the <body> element containing the main text
  xml_body <- xml_child(xml_content, "body")
  
  # Build the text file by converting the XML structure
  # into sections and subsections
  text <- c()
  # seq_len() avoids looping over c(1, 0) when the body is empty
  for(section in seq_len(xml_length(xml_body))){

    section_node = xml_child(xml_body, section)

    # Flat section: no nested node contains both a title and paragraphs
    if(sum(
            xml_length(xml_find_all(section_node,
                                    ".//*[.//title and .//p]"))
            ) == 0
       )
      {

      # Get section name:
      section_name = xml_text(xml_find_all(section_node, ".//title"))
      section_name = str_squish(section_name)
      if(!rlang::is_empty(section_name)) {
         text = c(text, paste0("\n\n", section_name, "\n"))
      }

      # Get paragraphs
      para_nodes = xml_find_all(section_node, ".//p")
      para_texts = xml_text(para_nodes)
      para_texts = str_squish(para_texts)
      
      
      if(!rlang::is_empty(para_texts)) {
        text = c(text, paste0("\n", para_texts, "\n"))
      }
      
      if(rlang::is_empty(section_name) && 
         rlang::is_empty(para_texts)) {
        
        all_node_text <- xml_text(section_node)
        
        if(!rlang::is_empty(all_node_text)){
        
        text = c(text, paste0("\n", all_node_text, "\n"))
        
        }
      }
      
      label <- xml_text(xml_find_all(section_node, 
                                     ".//label"))
      href <- xml_attr(xml_find_all(section_node, 
                                    ".//media"), 
                       "href")
      
      if(!rlang::is_empty(label) && !rlang::is_empty(href)) {
        
        text = c(text, paste0("\n", label, ". ", href, "\n"))
        
      }
      
      

    } else {

      for(subsection in seq_len(xml_length(section_node))){

        subsection_node = xml_child(section_node,
                                    subsection
                                    )

        # Get subsection name:
        subsection_name = xml_text(xml_find_all(subsection_node, ".//title"))
        subsection_name = str_squish(subsection_name)
        if(!rlang::is_empty(subsection_name)) {
          # Add spaces around section titles
          text = c(text, paste0("\n\n", subsection_name, "\n"))
        }

        # Get paragraphs
        para_nodes = xml_find_all(subsection_node, ".//p")
        para_texts = xml_text(para_nodes)
        para_texts = str_squish(para_texts)

        if(!rlang::is_empty(para_texts)) {
          # Add spaces around paragraphs
          text = c(text, paste0("\n", para_texts, "\n"))
        }

        if(rlang::is_empty(subsection_name) &&
           rlang::is_empty(para_texts)) {

          # Fall back to the subsection's own raw text when it has
          # neither a title nor <p> paragraphs
          all_node_text <- xml_text(subsection_node)

          if(!rlang::is_empty(all_node_text)){
            text = c(text, paste0("\n", all_node_text, "\n"))
          }
        }

      }
    }
  }
  
  # if Nat Genet article
  # (any() guards against a missing <journal-title>, where grepl()
  #  would return logical(0) and break the if())
  if(any(grepl("Nature genetics",
               xml_text(xml_find_all(xml_content,
                                     ".//journal-title")),
               ignore.case = TRUE
               ))
  ){
    
    # Find all figure nodes
    xml_figures <- xml_find_all(xml_content,
                                     ".//fig")
    
    if (length(xml_figures) != 0){
     
      text = c(text, "\n\nFigures:\n")
      
    }
    
    for(nodes in seq_along(xml_figures)){
      
      figure_node = xml_figures[nodes]
      
      # Extract label:
      label <- xml_text(xml_find_all(figure_node, 
                                     ".//label"))
      
      # Extract title
      title = xml_text(xml_find_all(figure_node, ".//title"))
      
      if(!rlang::is_empty(label) | !rlang::is_empty(title)){
        
        text = c(text,
                 paste0("\n", label, ". ", title, "\n")
                 )
      }
      
      
      # Extract caption
      caption = xml_text(xml_find_all(figure_node, 
                                      ".//caption//p"))
      
      if(!rlang::is_empty(caption)){
        
        text = c(text,
                 paste0("\n", caption, "\n")
                 )
      }
      
    }
    
    # Find all tables
    xml_tables <- xml_find_all(xml_content,
                               ".//table-wrap")
    
    if(length(xml_tables) != 0){
      
      text = c(text, "\n\nTables:\n")
      
    }
    
    for(nodes in seq_along(xml_tables)){
      
      table_node = xml_tables[nodes]
      
      # Extract label:
      label <- xml_text(xml_find_all(table_node, 
                                     ".//label"))
      
       if(!rlang::is_empty(label)){
        
        text = c(text,
                 paste0("\n",label, "\n")
                 )
      }
      
      # Extract caption
      caption = xml_text(xml_find_all(table_node, 
                                      ".//caption//p"))
      
      if(!rlang::is_empty(caption)){
        
        text = c(text,
                 paste0("\n", 
                        caption, "\n")
        )
      }
      
    }
    
    
    
  }

  # --- Save ---
  text_full <- paste(text, collapse = " ")

  txt_file <- file.path(out_dir,
                        paste0(pmcid, ".txt")
                        )
  writeLines(text_full,
             txt_file,
             useBytes = TRUE)

  #message("✅ Cleaned text saved for ", pmcid)
  invisible(text_full)
}
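
A quick spot check on a single article before looping over everything (placeholder PMCID; assumes the output directory already exists):

txt <- download_pmc_text("PMC1234567")  # placeholder PMCID
cat(substr(txt, 1, 300))  # eyeball the first few hundred characters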


safe_download_pmc_text <- purrr::safely(download_pmc_text)
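
safely() makes each call return list(result, error) instead of aborting, so a single failed download cannot stop the loop; errors can be inspected afterwards, e.g.:

res <- safe_download_pmc_text("PMC0000000")  # placeholder, likely-failing ID
if (!is.null(res$error)) message("Failed: ", conditionMessage(res$error))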


# Download full texts for all PMCIDs
for(pmcid in pmcids){
  if(pmcid != ""){
    
    result <- safe_download_pmc_text(pmcid)

    # if(!is.null(result$error)){
    #   message("❌ Failed to download text for ", pmcid,
    #           ": ", result$error)
    # }
  }
}

# How many texts saved? 
length(list.files(here::here("output/fulltexts/europepmc"),
                  pattern = "\\.txt$")
       )
[1] 234

For the remaining PMCIDs / PMIDs without full text, try downloading via the NCBI cloud service (the PMC Open Access dataset on AWS S3)


# Get the list of PMCIDs with full text available as author manuscripts,
# provided in XML and plain text for text-mining purposes
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/ --no-sign-request

# Get the list of PMCIDs with full text available for non-commercial use
# (the oa_noncomm subset)
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
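
Previewing a filelist confirms the columns relied on below; per the PMC Open Access documentation these include Key (the S3 object path) and PMID:

head(data.table::fread(
  here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt")
))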

Identify full texts downloaded

europeanpmc_full_texts <- 
list.files(here::here("output/fulltexts/europepmc"),
                  pattern = "\\.txt$"
           )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.txt$", 
       "", 
       europeanpmc_full_texts
       ) 

Get the paths of the full texts we could download:

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

length(left_over_pmcids)
[1] 266
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))
oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  filter(PMID %in% names(left_over_pmcids))

nrow(author_manu_to_get)
[1] 94
oa_noncomm_to_get = 
oa_noncomm |>
  filter(PMID %in% names(left_over_pmcids)) 

nrow(oa_noncomm_to_get)
[1] 1
not_available = names(left_over_pmcids)[!names(left_over_pmcids) %in% 
                                          c(oa_noncomm_to_get$PMID, 
                                            author_manu_to_get$PMID)]

length(not_available)
[1] 171
file_paths = 
c(oa_noncomm_to_get$Key,
  author_manu_to_get$Key)

# percentage not available, out of all sampled PMIDs:
100 * length(not_available) / length(pmcids)
[1] 34.2

Download remaining full texts from NCBI Cloud Service

# download all the available ones:
writeLines(file_paths, here::here("output/fulltexts/aws_locations/selected_paths.txt"))
system("xargs -a output/fulltexts/aws_locations/selected_paths.txt -I {} aws s3 cp s3://pmc-oa-opendata/{} output/fulltexts/ncbi_cloud/ --no-sign-request")

# for (i in seq_along(file_paths)) {
#   
#   system(
#   paste(
#   "aws s3 cp",
#   paste0("s3://pmc-oa-opendata/", file_paths[i]),
#   here::here("output/fulltexts/ncbi_cloud/"),
#   "--no-sign-request"
#   )
#   )
# }

Get dbGaP ids / EGA sentences

library(tokenizers)

all_full_texts = 
c(
  list.files(here::here("output/fulltexts/europepmc"),
             pattern = "\\.txt$",
             full.names = TRUE
           ),
  list.files(here::here("output/fulltexts/ncbi_cloud"),
             pattern = "\\.txt$",
             full.names = TRUE)
)

# get pmcids of these files
all_pmcids <-
  all_full_texts |>
  gsub(pattern = ".*/", replacement = "") |>
  gsub(pattern = "\\.txt$", replacement = "") 

get_grep_sentences <- function(file_path,
                               grep_pattern) {

  txt_in <- readLines(file_path,
                      warn = FALSE,
                      encoding = "UTF-8")

  sentences <- unlist(tokenize_sentences(txt_in))

  # Keep only sentences containing an accession-style identifier
  matched_sentences <- sentences[grepl(grep_pattern, sentences)]

  return(matched_sentences)
}
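
For example, applied to a single downloaded text (assuming at least one file exists):

get_grep_sentences(all_full_texts[1],
                   grep_pattern = "phs\\d+")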

# Patterns for dbGaP (phs), EGA (EGAC/EGAD/EGAF/EGAS),
# JGA (JGAS/JGAD), and BioProject (PRJEB/PRJNA) accession ids;
# EGAS is included to match the extraction step below
grep_pattern <- "phs\\d+|EGAC\\d+|EGAD\\d+|EGAF\\d+|EGAS\\d+|JGAS\\d+|JGAD\\d+|PRJEB\\d+|PRJNA\\d+" 
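
A toy check of the pattern on invented sentences:

grepl(grep_pattern,
      c("Genotype data were deposited in dbGaP under accession phs000280.v3.p1.",
        "Data are available on request."))
# expected: TRUE FALSE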

pmcid_dbgap_sentences <- 
purrr::map(all_full_texts,
           ~get_grep_sentences(.x, 
                               grep_pattern = grep_pattern
                               )
           ) 

names(pmcid_dbgap_sentences) <- all_pmcids

keep_pmcid_dbgap_sentences <-
  purrr::discard(pmcid_dbgap_sentences,
                 ~rlang::is_empty(.x))

# Convert to data frame
sentences_df <-
  purrr::imap(keep_pmcid_dbgap_sentences,
              ~data.frame(pmcid = .y,
                          sentence = unique(.x),
                          stringsAsFactors = FALSE)
              ) |>
  dplyr::bind_rows()

# Extract all dbGaP / EGA / JGA / BioProject IDs from the sentences
sentences_df <- sentences_df |>
  mutate(
    dbgap_id = str_extract_all(sentence, "phs\\d+(\\.v\\d+)?(\\.p\\d+)?"),
    ega_id = str_extract_all(sentence, "EGA[CDFS]\\d+"),
    jgas_id = str_extract_all(sentence, "JGAS\\d+|JGAD\\d+"),
    prj_id = str_extract_all(sentence, "PRJEB\\d+|PRJNA\\d+")
  )
             
sentences_df = 
  sentences_df |>
  distinct()

# number of PMCIDs with dbGaP / EGA ids found
sentences_df$pmcid |> unique() |> length()
[1] 26
data.table::fwrite(sentences_df,
                   here::here("output/gwas_cat/gwas_study_dbgap_ega_sentences.csv")
                   )

Are there any dbGaP ids not in our cohort description mapping?

sentences_df <- data.table::fread(
                   here::here("output/gwas_cat/gwas_study_dbgap_ega_sentences.csv")
)

cohort_dbgap_mapping <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
                                 sheet = 1)
New names:
• `` -> `...14`
current_dbgap_ids =
  cohort_dbgap_mapping |>
  pull(dbGaP) |>
  strsplit(",") |>
  unlist() |>
  str_trim() |>
  unique()

old_dbgap_ids = 
  cohort_dbgap_mapping |>
  pull(old_dbGaP) |>
  strsplit(",") |>
  unlist() |>
  str_trim() |>
  unique()

mapped_dbgap_ids = c(current_dbgap_ids,
                      old_dbgap_ids
                      )
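
Note: fwrite() collapses list columns with "|" (its default sep2 separator), which is why the dbgap_id values read back from the csv are split on "\\|" below.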

found_dbgap_ids =
sentences_df$dbgap_id |>
  strsplit("\\|") |>
  unlist() |>
  unique()

not_found_dbgap_ids <- 
sort(found_dbgap_ids[!found_dbgap_ids %in% mapped_dbgap_ids])

not_found_dbgap_ids
[1] "phs001478.v1.p1"

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] tokenizers_0.3.0 dplyr_1.1.4      here_1.0.1       stringr_1.5.2   
[5] xml2_1.4.0       httr_1.4.7       workflowr_1.7.1 

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0    compiler_4.3.1    renv_1.0.3        promises_1.3.3   
 [5] tidyselect_1.2.1  Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6      
 [9] later_1.4.4       jquerylib_0.1.4   readxl_1.4.5      yaml_2.3.10      
[13] fastmap_1.2.0     R6_2.6.1          SnowballC_0.7.1   generics_0.1.4   
[17] curl_7.0.0        knitr_1.50        tibble_3.3.0      rprojroot_2.1.0  
[21] bslib_0.9.0       pillar_1.11.1     rlang_1.1.6       cachem_1.1.0     
[25] stringi_1.8.7     httpuv_1.6.16     xfun_0.53         getPass_0.2-4    
[29] fs_1.6.6          sass_0.4.10       cli_3.6.5         withr_3.0.2      
[33] magrittr_2.0.4    ps_1.9.1          digest_0.6.37     processx_3.8.6   
[37] rstudioapi_0.17.1 lifecycle_1.0.4   vctrs_0.6.5       data.table_1.17.8
[41] evaluate_1.0.5    glue_1.8.0        cellranger_1.1.0  whisker_0.4.1    
[45] purrr_1.1.0       rmarkdown_2.30    tools_4.3.1       pkgconfig_2.0.3  
[49] htmltools_0.5.8.1