Last updated: 2026-01-12

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b43e9a9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd
    Modified:   analysis/trait_ontology_categorization.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd b43e9a9 IJbeasley 2026-01-12 Update getting full text
html ac0d1a7 IJbeasley 2025-10-27 Build site.
html 8642872 IJbeasley 2025-10-27 Build site.
Rmd da4d730 IJbeasley 2025-10-27 Now run on all texts
html fb5cfd9 IJbeasley 2025-10-27 Build site.
Rmd 8ed4c37 IJbeasley 2025-10-27 Now run on all texts
html 8610283 IJbeasley 2025-10-27 Build site.
Rmd 7d504e3 IJbeasley 2025-10-27 More fixing of download full text
html 16f4c19 IJbeasley 2025-10-27 Build site.
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs

Get PubMed IDs from GWAS Catalog

## Step 1: 
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# gwas_study_info <- gwas_study_info |>
#   dplyr::filter(DISEASE_STUDY == TRUE)

pmids <- unique(gwas_study_info$PUBMED_ID)

#pmids <- sample(pmids, size = 500)

length(pmids)
[1] 993

Convert PubMed IDs to PMCIDs

# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
Warning in fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv")): Found and
resolved improper quoting out-of-sample. First healed line 8637909:
<<17234576,PMC2742408,"https://doi.org/10.1102/1470-7330.2007.0001>>. If the
fields are not quoted (e.g. field separator does not appear within any field),
try quote="" to avoid this warning.
convert_pmid_df <- convert_pmid_df |>
  dplyr::rename(pmcids = PMCID
                ) |>
  dplyr::mutate(pmcids = ifelse(is.na(pmcids),
                                "",
                                pmcids
                                )
                )

convert_pmid_df =
  convert_pmid_df |>
  select(-DOI)

convert_pmid_df <-
  convert_pmid_df |>
  dplyr::filter(!is.na(PMID))

converted_ids = 
  convert_pmid_df |>
  filter(PMID %in% pmids)

dim(converted_ids)
[1] 993   2
length(pmids)
[1] 993
converted_ids |>
  filter(pmcids == "") |>
  dim()
[1] 202   2
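The empty-PMCID filter above can be spot-checked from the shell with awk; the CSV rows below are hypothetical stand-ins for data/europe_pmc/PMID_PMCID_DOI.csv, which uses the same PMID,PMCID,DOI column layout.

```shell
# Hypothetical three-row stand-in for the Europe PMC bulk mapping file
cat > /tmp/pmid_map_sample.csv <<'EOF'
PMID,PMCID,DOI
11111111,PMC100001,https://doi.org/10.x/a
22222222,,https://doi.org/10.x/b
33333333,PMC100003,https://doi.org/10.x/c
EOF

# PMIDs whose PMCID field (column 2) is empty, skipping the header
awk -F',' 'NR > 1 && $2 == "" { print $1 }' /tmp/pmid_map_sample.csv
```

Run against the real file restricted to the study PMIDs, this recovers the 202 PMIDs with no PMCID reported above.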
# convert PMID to PMCID
# convert_pmid_to_pmcid <- function(pmid_vec,
#                                   tool = "myTool",
#                                   email = "you@example.com",
#                                   format = "json",
#                                   batch_size = 50,
#                                   sleep_time = 1) {
# 
#   base_url <- "https://pmc.ncbi.nlm.nih.gov/tools/idconv/api/v1/articles/"
# 
#   batches <- split(pmid_vec,
#                    ceiling(seq_along(pmid_vec) / batch_size))
#   #browser()
# 
#   pmcid_list = purrr::map(batches,
#              function(pmid_vec) {
# 
# 
#                ids_param <- paste(pmid_vec,
#                                   collapse = ",")
# 
#                query <- list(ids = ids_param,
#                              idtype = "pmid",
#                              tool = tool,
#                              email = email,
#                              format = format
#                              )
# 
#                resp <- httr::GET(base_url,
#                                  query = query)
# 
#                httr::stop_for_status(resp)
# 
#                content_text <- httr::content(resp,
#                                              as = "text",
#                                              encoding = "UTF-8")
# 
#                # Handle cases where records might be empty or missing pmcid
#                parsed <- jsonlite::fromJSON(content_text,
#                                             flatten = TRUE)
# 
#                parsed$records[is.na(parsed$records)] = ""
# 
#                pmcid <- parsed$records |>
#                         pull(pmcid)
# 
#                Sys.sleep(sleep_time)
# 
#                return(pmcid)
#              }
#   )
# 
#  pmcid_list = unlist(pmcid_list)
#  names(pmcid_list) <- pmid_vec
#  return(pmcid_list)
# }
# 
# pmcids <- convert_pmid_to_pmcid(pmids)
# 
# get_pmcid_europepmc <- function(pmid_vec) {
#   base_url <- "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
#   
#   purrr::map_dfr(pmid_vec, function(pmid) {
#     query <- list(
#       query = paste0("ext_id:", pmid),
#       format = "json"
#     )
#     resp <- httr::GET(base_url, query = query)
#     if (httr::status_code(resp) != 200) {
#       return(tibble(pmid = pmid, 
#                     pmcid = NA_character_))
#     }
#     
#     dat <- jsonlite::fromJSON(httr::content(resp, 
#                                             as = "text", 
#                                             encoding = "UTF-8"))
#     
#     if (length(dat$resultList$result) == 0) {
#       return(tibble(pmid = pmid, pmcid = NA_character_))
#     }
#     
#     pmcid <- dat$resultList$result$pmcid
#     tibble(pmid = pmid, pmcid = pmcid)
#   })
# }
# 
# pmids_missing = names(pmcids[pmcids == ""])
# 
# get_pmcid_europepmc(pmids_missing) -> pmcid_europepmc_df
# 
# converted_ids <-
# data.frame(pmids = names(pmcids),
#            pmcids = pmcids
#            ) 
# 
# # checked using pmids to pmcs conversion from europe pmc webservice
# # can map 38367033 -> PMC12560237
# converted_ids <- converted_ids |>
#   mutate(pmcids = ifelse(pmids == "38367033", 
#                          "PMC12560237", 
#                          pmcids
#                          )
#          )
# 
# data.table::fwrite(converted_ids,
#                    here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
#                    )
# 
# # How many missing? 
# sum(pmcids == "")
# 
# pmcids <- pmcids[pmcids != ""]

Download full texts from Europe PMC

pmcids <-
converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

length(pmcids)
[1] 791
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(pmids), digits = 2)
[1] 79.66
download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europe_pmc/")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/",
                    "europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )
  
  resp <- GET(url_xml)
  
  # MED/20708005
  
  # ---- Fallback URL ----
  if(status_code(resp) != 200){
    
    #print(paste0("Trying alternative URL for ", pmcid))
    
    url_xml <- paste0("https://europepmc.org/",
                       "oai.cgi?verb=GetRecord",
                       "&metadataPrefix=pmc",
                       "&identifier=oai:europepmc.org:",
                       pmcid)
    
    resp <- GET(url_xml)
  
  }
  
  # ---- Fail if still bad ----
  if(status_code(resp) != 200){
    
  #message("Failed to fetch XML for ", pmcid)
  
  return(NULL)
    
  }
  
  # ---- Parse XML ----
  xml_content <- read_xml(content(resp,
                                  as = "text",
                                  encoding = "UTF-8")
                          )
  
  article_node = xml_find_first(xml_content, 
                               "//*[local-name() = 'article']"
                               )
  
   if (is.na(article_node)) {
    message("No <article> node found for ", pmcid)
     
    return(NULL)
   }
  
    # --- Save ---
  write_xml(article_node, 
            paste0(out_dir, pmcid, ".xml")
            )
  
} 


for(article in pmcids[pmcids != ""]){

download_pmc_text(article)

}
print("Number of downloaded full text files")
[1] "Number of downloaded full text files"
print("From European PMC:")
[1] "From European PMC:"
n_euro_pmc <- length(list.files(here::here("output/fulltexts/europe_pmc/"),
                  pattern = "\\.xml$")
       )

print("Percentage of pmids with full text from European PMC:")
[1] "Percentage of pmids with full text from European PMC:"
round(100 * n_euro_pmc / length(pmids), digits = 2)
[1] 52.77
converted_ids <-
data.table::fread(here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
)

convert_xml_text <- function(xml_content,
                             text # character vector accumulating output text
                             ){
  
    for(section in 1:xml_length(xml_content)){

    section_node = xml_child(xml_content, section)

    if(length(xml_path(xml_find_all(section_node,
                                    ".//*[.//title and .//p]"
                                    )
                       )
            ) == 0
       )
      {

      # Get section name:
      section_name = xml_text(xml_find_all(section_node, 
                                           ".//title"))
      
      section_name = str_squish(section_name)
      
      if(!rlang::is_empty(section_name)) {
         text = c(text, paste0("\n\n", section_name, "\n"))
      }

      # Get paragraphs
      para_nodes = xml_find_all(section_node, ".//p")
      para_texts = xml_text(para_nodes)
      para_texts = str_squish(para_texts)
      
      
      if(!rlang::is_empty(para_texts)) {
        text = c(text, paste0("\n", 
                              para_texts, 
                              "\n")
                 )
      }
      
      if(rlang::is_empty(section_name) && 
         rlang::is_empty(para_texts)) {
        
        all_node_text <- xml_text(section_node)
        
        if(!rlang::is_empty(all_node_text)){
        
        text = c(text, paste0("\n", 
                              all_node_text, 
                              "\n")
                 )
        
        }
      }
      
      label <- xml_text(xml_find_all(section_node, 
                                     ".//label"))
      
      href <- xml_attr(xml_find_all(section_node, 
                                    ".//media"), 
                       "href")
      
      if(!rlang::is_empty(label) && !rlang::is_empty(href)) {
        
        text = c(text, paste0("\n", label, ". ", href, "\n"))
        
      }
      
      

    } else {

    for(subsection in 1:xml_length(section_node)){

    subsection_node = xml_child(section_node,
                                subsection
                                )
    
    if(length(xml_children(subsection_node)) == 0){
      
      if(xml_name(subsection_node) == "title"){
        
        text = c(text, 
               paste0("\n\n", 
                      xml_text(subsection_node), 
                      "\n")
               )
        
      } else {
      
      text = c(text, 
               paste0("\n", 
                      xml_text(subsection_node), 
                      "\n")
               )
      
      }
      
      next
      
    }
                

    # Get section name:
    subsection_name = xml_text(xml_find_all(subsection_node, 
                                            ".//title")
                               )
    
    subsection_name = str_squish(subsection_name)
    
    if(!rlang::is_empty(subsection_name)) {
    # Add spaces around section titles
    text = c(text, paste0("\n\n", 
                          subsection_name, 
                          "\n")
             )
    }

    # Get paragraphs
    para_nodes = xml_find_all(subsection_node, 
                              ".//p")
    
    para_texts = xml_text(para_nodes)
    para_texts = str_squish(para_texts)
    
    if(!rlang::is_empty(para_texts)) {
    # Add spaces around paragraphs
    text = c(text, paste0("\n", 
                          para_texts, 
                          "\n")
             )
    }
    
    if(rlang::is_empty(subsection_name) &&
         rlang::is_empty(para_texts)) {    
    
        all_node_text <- xml_text(subsection_node)
        
        if(!rlang::is_empty(all_node_text)){
        
        text = c(text, paste0("\n", 
                              all_node_text, 
                              "\n")
                 )
        
        }
      }
            
    }
}
  }
  
  return(text)
  
}

extract_app_text <- function(xml_back, 
                             text){
  
  # Add separator for appendices section
  text <- c(text, "\n\n=== APPENDICES ===\n")
  
  #browser()
  
  for(node_id in seq_len(xml_length(xml_back))){
    
    node = xml_child(xml_back, 
                     node_id)
    
    #print(node)
    
    if(xml_name(node) == "app-group" && 
       length(xml_find_all(node, ".//sec")) > 0) {
      
      app_node = xml_find_all(node, ".//sec")
      
      text = convert_xml_text(app_node, text)
      
    } else if(xml_name(node) == "ref-list") {
      
      next 
      
    } else {
      
      text = convert_xml_text(node, 
                              text)
    }
  }

  return(text)
}

download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europepmc")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )

  resp <- GET(url_xml)

  if (status_code(resp) != 200) stop("Failed to fetch XML for ", pmcid)

  xml_content <- read_xml(content(resp,
                                  as = "text",
                                  encoding = "UTF-8")
                          )

  # Get text body xml content
  #browser()
  xml_body = xml_child(xml_content, "body")
  xml_back = xml_child(xml_content, "back")
  #browser()
  
  # Build text file
  # By converting xml structure into sections and subsections
  text = c()
  text = convert_xml_text(xml_body,
                          text
                          )
  
  text = c(text, "\n\n")
  text = extract_app_text(xml_back,
                          text
                          )
  
    # if Nat Genet article
  if(grepl("Nature genetics", 
        xml_text(xml_find_all(xml_content, 
                              ".//journal-title")),
        ignore.case = TRUE
        )
  ){
    
    # Find all figure nodes
    xml_figures <- xml_find_all(xml_content,
                                     ".//fig")
    
    if (length(xml_figures) != 0){
     
      text = c(text, "\n\nFigures:\n")
      
    }
    
    for(nodes in seq_along(xml_figures)){
      
      figure_node = xml_figures[nodes]
      
      # Extract label:
      label <- xml_text(xml_find_all(figure_node, 
                                     ".//label"))
      
      # Extract title
      title = xml_text(xml_find_all(figure_node, ".//title"))
      
      if(!rlang::is_empty(label) || !rlang::is_empty(title)){
        
        text = c(text,
                 paste0("\n", label, ". ", title, "\n")
                 )
      }
      
      # Extract caption
      caption = xml_text(xml_find_all(figure_node, 
                                      ".//caption//p"))
      
      if(!rlang::is_empty(caption)){
        
        text = c(text,
                 paste0("\n", caption, "\n")
                 )
      }
      
    }
    
    }
  
  
  # --- Save ---
  text_full <- paste(text, collapse = " ")

  txt_file <- file.path(out_dir,
                        paste0(pmcid, ".txt")
                        )
  writeLines(text_full,
             txt_file,
             useBytes = TRUE)

  #message("✅ Cleaned text saved for ", pmcid)
  invisible(text_full)
}


safe_download_pmc_text <- purrr::safely(download_pmc_text)


# Download full texts for all PMCIDs
for(pmcid in pmcids){
  if(pmcid != ""){
    
    result <- safe_download_pmc_text(pmcid)

    # if(!is.null(result$error)){
    #   message("❌ Failed to download text for ", pmcid,
    #           ": ", result$error)
    # }
  }
}

# How many texts saved? 
length(list.files(here::here("output/fulltexts/europepmc"),
                  pattern = "\\.txt$")
       )

For the remaining PMCIDs / PMIDs without full text, try downloading via the NCBI Cloud Service (PMC open data on AWS S3)


# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text - non-commercial use
# oa_noncomm 
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, other licence list
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 
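Before moving to R, the downloaded filelists can be spot-checked from the shell; the two tab-separated rows below are hypothetical (the real filelists carry more columns), but in both the S3 Key comes first and the PMCID appears in an AccessionID column.

```shell
# Hypothetical two-row excerpt of an oa_comm filelist (tab-separated)
printf 'oa_comm/txt/all/PMC100001.txt\tPMC100001\t11111111\n' >  /tmp/filelist_sample.txt
printf 'oa_comm/txt/all/PMC100002.txt\tPMC100002\t22222222\n' >> /tmp/filelist_sample.txt

# S3 key for one wanted PMCID
grep -w 'PMC100002' /tmp/filelist_sample.txt | cut -f1
```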

Identify full texts downloaded

europeanpmc_full_texts <- 
list.files(here::here("output/fulltexts/europe_pmc"),
                  pattern = "\\.xml"
           )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.xml$", 
       "", 
       europeanpmc_full_texts
       ) 
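The filename-to-PMCID stripping above has a one-line shell equivalent (filenames here are hypothetical):

```shell
# Recover PMCIDs from downloaded XML filenames by stripping the extension
printf '%s\n' PMC100001.xml PMC100002.xml | sed 's|\.xml$||'
```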

Get paths of the full texts we could download:

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 267
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))

oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  dplyr::filter(AccessionID %in% left_over_pmcids)

print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 160
oa_noncomm_to_get = 
oa_noncomm |>
#  dplyr::filter(PMID %in% names(left_over_pmcids)) 
  dplyr::filter(AccessionID %in% left_over_pmcids)

oa_comm_to_get = 
oa_comm |>
#  dplyr::filter(PMID %in% names(left_over_pmcids)) 
  dplyr::filter(AccessionID %in% left_over_pmcids)

print("Number of papers to download in Open Access PMC section:")
[1] "Number of papers to download in Open Access PMC section:"
nrow(oa_noncomm_to_get) + nrow(oa_comm_to_get)
[1] 1
oa_noncomm_to_get <-
  oa_noncomm_to_get |>
  dplyr::filter(!c(AccessionID %in% author_manu_to_get$AccessionID))

not_available = left_over_pmcids[!c(left_over_pmcids %in% 
                                          c(oa_noncomm_to_get$AccessionID, 
                                            oa_comm_to_get$AccessionID,
                                            author_manu_to_get$AccessionID)
                                          )]

print("Number of papers without full text available in NCBI Cloud Service:")
[1] "Number of papers without full text available in NCBI Cloud Service:"
length(not_available)
[1] 106
file_paths = 
c(oa_noncomm_to_get$Key,
  oa_comm_to_get$Key,
  author_manu_to_get$Key)

file_paths <- str_replace_all(file_paths,
                              pattern = "txt",
                              replacement = "xml")

# percentage not available, from all papers
100 * length(not_available) / length(pmids)
[1] 10.67472
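Note that str_replace_all(pattern = "txt") rewrites every "txt" substring in a key, both the txt/ directory component and the .txt extension, which appears to be the intent given the bucket's mirrored xml/ layout. A more explicit shell sketch of the same rewrite (the path below is hypothetical):

```shell
# Rewrite a txt key to its xml counterpart, touching only the
# directory component and the extension (path is hypothetical)
path="author_manuscript/txt/all/PMC100001.txt"
printf '%s\n' "$path" | sed -e 's|/txt/|/xml/|' -e 's|\.txt$|.xml|'
```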

Download remaining full texts from NCBI Cloud Service

writeLines(
  file_paths,
  here::here("output/fulltexts/aws_locations/selected_paths.txt")
)

system(
  paste(
    "xargs -I {} aws s3 cp",
    "s3://pmc-oa-opendata/{}",
    shQuote(here::here("output/fulltexts/ncbi_cloud/")),
    "--no-sign-request",
    "<",
    shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
  )
)
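The xargs pipeline can be dry-run by echoing each command instead of executing aws, which is a cheap way to confirm the key-to-command expansion before the real transfer (sample keys below are hypothetical):

```shell
# Dry run: print the aws command for each key instead of executing it
printf '%s\n' \
  'oa_comm/xml/all/PMC100001.xml' \
  'oa_comm/xml/all/PMC100002.xml' > /tmp/selected_paths_sample.txt

xargs -I {} echo aws s3 cp "s3://pmc-oa-opendata/{}" \
  output/fulltexts/ncbi_cloud/ --no-sign-request \
  < /tmp/selected_paths_sample.txt
```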

Download PDFs using Open Access information from OpenAlex

PMIDs that couldn’t be converted to PMCIDs

# old approach for getting dois (requires library(rentrez); not_convertable_pmids is defined below):
entrez_info <-
entrez_summary(db="pubmed", 
               id=not_convertable_pmids)

dois <-
entrez_info |>
  purrr::map(function(x) {
    
    x$articleids |> 
      filter(idtype == "doi") |> 
      pull(value)
  }
)
library(openalexR)
openalexR v2.0.0 introduces breaking changes.
See NEWS.md for details.

To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
Warning in fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv")): Found and
resolved improper quoting out-of-sample. First healed line 8637909:
<<17234576,PMC2742408,"https://doi.org/10.1102/1470-7330.2007.0001>>. If the
fields are not quoted (e.g. field separator does not appear within any field),
try quote="" to avoid this warning.
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

doi_information <-
convert_pmid_df |>
  filter(PMID %in% not_convertable_pmids)

doi_information |>
  filter(DOI == "")
Empty data.table (0 rows and 3 cols): PMID,PMCID,DOI
doi_information$PMID |> unique() |> length()
[1] 202
length(not_convertable_pmids)
[1] 202
# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information$DOI),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()
[1] 110
# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()
[1] 65
to_download_pdfs <-
open_alex_works |> 
  filter(grepl(".pdf", oa_url)) |>
  pull(oa_url)

  writeLines(
    to_download_pdfs,
    here::here("output/fulltexts/pdfs/pdf_links_to_download.txt"))

cd output/fulltexts/pdfs

while read -r url; do
  curl -O "$url"
done < pdf_links_to_download.txt

PMCIDs not found in Author Manuscripts or Open Access sections

doi_information <-
convert_pmid_df |>
  filter(PMCID %in% not_available)

doi_information |>
  filter(DOI == "")

doi_information <-
  doi_information |>
  filter(DOI != "")

# get open alex works for pmcids
open_alex_works <- oa_fetch(
  doi = doi_information$DOI,
  entity = "works",
  options = list(select = c(#"title",
                            "doi", 
                            "open_access"
                            ))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl(".pdf", oa_url)) |>
  pull(oa_url)

  writeLines(
    to_download_pdfs,
    here::here("output/fulltexts/pdfs/pdf_links_to_download_pt2.txt"))

cd output/fulltexts/pdfs

while read -r url; do
  curl -O "$url"
done < pdf_links_to_download_pt2.txt

Test: not used - Europe PMC Author Manuscripts


curl -s https://europepmc.org/ftp/manuscripts/ \
  | grep -o 'author_manuscript_txt[^"]*\.filelist\.txt' \
  | sort -u \
  | while read -r file; do
      curl -O "https://europepmc.org/ftp/manuscripts/$file"
    done
all_file_lists <- list.files(here::here("data/epmc"))

author_manu_epmc <- all_file_lists |>
                    purrr::map(function(file_name) {
                      
                      file_path = here::here("data/epmc",
                                             file_name
                                             )
                      
                      df <- fread(file_path)
                      
                      return(df)
                    }
                    ) |>
                    bind_rows()

author_manu_epmc |>
  filter(AccessionID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% not_avaliable_pmids)

author_manu_epmc |>
  filter(PMID %in% pmids)

author_manu_epmc |>
  filter(PMID %in% not_convertable_pmids)

Test: not used - Download with FTP service, where available

# not_available
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

all_tgz_links = c()

for(article_id in "PMC2613843"){

url <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=",
              article_id)
  
resp <- GET(url)

xml_data <- xml_child(content(resp), "records")

tgz_link <- xml_find_first(xml_data, 
                           ".//link[@format='tgz']/@href")
tgz_link <- xml_text(tgz_link)

if (is.na(tgz_link)) {
  
  print("No tar.gz link found.")
  
} else {
  
  all_tgz_links <- append(all_tgz_links, 
                          tgz_link)
  
}
}
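The tgz-href extraction can be checked offline against a hand-made stand-in for an oa.fcgi response; the record and URL below are hypothetical, not a real API payload.

```shell
# Hypothetical oa.fcgi-style record with a tgz download link
cat > /tmp/oa_record_sample.xml <<'EOF'
<OA><records><record id="PMC2613843">
<link format="tgz" href="ftp://example.org/pub/pmc/PMC2613843.tar.gz"/>
</record></records></OA>
EOF

# Pull the href out of the tgz link with sed
sed -n 's|.*<link format="tgz" href="\([^"]*\)".*|\1|p' /tmp/oa_record_sample.xml
```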

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] openalexR_2.0.1   data.table_1.17.8 dplyr_1.1.4       here_1.0.1       
[5] stringr_1.6.0     xml2_1.4.0        httr_1.4.7        workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0      compiler_4.3.1      BiocManager_1.30.26
 [4] renv_1.0.3          promises_1.3.3      tidyselect_1.2.1   
 [7] Rcpp_1.1.0          git2r_0.36.2        callr_3.7.6        
[10] later_1.4.4         jquerylib_0.1.4     yaml_2.3.10        
[13] fastmap_1.2.0       R6_2.6.1            generics_0.1.4     
[16] curl_7.0.0          knitr_1.50          tibble_3.3.0       
[19] rprojroot_2.1.0     bslib_0.9.0         pillar_1.11.1      
[22] rlang_1.1.6         cachem_1.1.0        stringi_1.8.7      
[25] httpuv_1.6.16       xfun_0.55           getPass_0.2-4      
[28] fs_1.6.6            sass_0.4.10         cli_3.6.5          
[31] withr_3.0.2         magrittr_2.0.4      ps_1.9.1           
[34] digest_0.6.37       processx_3.8.6      rstudioapi_0.17.1  
[37] lifecycle_1.0.4     vctrs_0.6.5         evaluate_1.0.5     
[40] glue_1.8.0          whisker_0.4.1       rmarkdown_2.30     
[43] tools_4.3.1         pkgconfig_2.0.3     htmltools_0.5.8.1