Last updated: 2026-01-12

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b43e9a9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd
    Modified:   analysis/trait_ontology_categorization.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd b43e9a9 IJbeasley 2026-01-12 Update getting full text
html ac0d1a7 IJbeasley 2025-10-27 Build site.
html 8642872 IJbeasley 2025-10-27 Build site.
Rmd da4d730 IJbeasley 2025-10-27 Now run on all texts
html fb5cfd9 IJbeasley 2025-10-27 Build site.
Rmd 8ed4c37 IJbeasley 2025-10-27 Now run on all texts
html 8610283 IJbeasley 2025-10-27 Build site.
Rmd 7d504e3 IJbeasley 2025-10-27 More fixing of download full text
html 16f4c19 IJbeasley 2025-10-27 Build site.
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs

Get PubMed IDs from GWAS Catalog

## Step 1: 
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# gwas_study_info <- gwas_study_info |>
#   dplyr::filter(DISEASE_STUDY == TRUE)

pmids <- unique(gwas_study_info$PUBMED_ID)

#pmids <- sample(pmids, size = 500)

length(pmids)
[1] 993

Convert PubMed IDs to PMCIDs

# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
Warning in fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv")): Found and
resolved improper quoting out-of-sample. First healed line 8637909:
<<17234576,PMC2742408,"https://doi.org/10.1102/1470-7330.2007.0001>>. If the
fields are not quoted (e.g. field separator does not appear within any field),
try quote="" to avoid this warning.
convert_pmid_df <- convert_pmid_df |>
  dplyr::rename(pmcids = PMCID
                ) |>
  dplyr::mutate(pmcids = ifelse(is.na(pmcids),
                                "",
                                pmcids
                                )
                )

convert_pmid_df =
  convert_pmid_df |>
  select(-DOI)

convert_pmid_df <-
  convert_pmid_df |>
  dplyr::filter(!is.na(PMID))

converted_ids = 
  convert_pmid_df |>
  filter(PMID %in% pmids)

dim(converted_ids)
[1] 993   2
length(pmids)
[1] 993
converted_ids |>
  filter(pmcids == "") |>
  dim()
[1] 202   2
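The empty-PMCID filter above can be spot-checked from the shell with awk; the CSV rows below are hypothetical stand-ins for data/europe_pmc/PMID_PMCID_DOI.csv, which uses the same PMID,PMCID,DOI column layout.

```shell
# Hypothetical three-row stand-in for the Europe PMC bulk mapping file
cat > /tmp/pmid_map_sample.csv <<'EOF'
PMID,PMCID,DOI
11111111,PMC100001,https://doi.org/10.x/a
22222222,,https://doi.org/10.x/b
33333333,PMC100003,https://doi.org/10.x/c
EOF

# PMIDs whose PMCID field (column 2) is empty, skipping the header
awk -F',' 'NR > 1 && $2 == "" { print $1 }' /tmp/pmid_map_sample.csv
```

Run against the real file restricted to the study PMIDs, this recovers the 202 PMIDs with no PMCID reported above.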
# convert PMID to PMCID
# convert_pmid_to_pmcid <- function(pmid_vec,
#                                   tool = "myTool",
#                                   email = "you@example.com",
#                                   format = "json",
#                                   batch_size = 50,
#                                   sleep_time = 1) {
# 
#   base_url <- "https://pmc.ncbi.nlm.nih.gov/tools/idconv/api/v1/articles/"
# 
#   batches <- split(pmid_vec,
#                    ceiling(seq_along(pmid_vec) / batch_size))
#   #browser()
# 
#   pmcid_list = purrr::map(batches,
#              function(pmid_vec) {
# 
# 
#                ids_param <- paste(pmid_vec,
#                                   collapse = ",")
# 
#                query <- list(ids = ids_param,
#                              idtype = "pmid",
#                              tool = tool,
#                              email = email,
#                              format = format
#                              )
# 
#                resp <- httr::GET(base_url,
#                                  query = query)
# 
#                httr::stop_for_status(resp)
# 
#                content_text <- httr::content(resp,
#                                              as = "text",
#                                              encoding = "UTF-8")
# 
#                # Handle cases where records might be empty or missing pmcid
#                parsed <- jsonlite::fromJSON(content_text,
#                                             flatten = TRUE)
# 
#                parsed$records[is.na(parsed$records)] = ""
# 
#                pmcid <- parsed$records |>
#                         pull(pmcid)
# 
#                Sys.sleep(sleep_time)
# 
#                return(pmcid)
#              }
#   )
# 
#  pmcid_list = unlist(pmcid_list)
#  names(pmcid_list) <- pmid_vec
#  return(pmcid_list)
# }
# 
# pmcids <- convert_pmid_to_pmcid(pmids)
# 
# get_pmcid_europepmc <- function(pmid_vec) {
#   base_url <- "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
#   
#   purrr::map_dfr(pmid_vec, function(pmid) {
#     query <- list(
#       query = paste0("ext_id:", pmid),
#       format = "json"
#     )
#     resp <- httr::GET(base_url, query = query)
#     if (httr::status_code(resp) != 200) {
#       return(tibble(pmid = pmid, 
#                     pmcid = NA_character_))
#     }
#     
#     dat <- jsonlite::fromJSON(httr::content(resp, 
#                                             as = "text", 
#                                             encoding = "UTF-8"))
#     
#     if (length(dat$resultList$result) == 0) {
#       return(tibble(pmid = pmid, pmcid = NA_character_))
#     }
#     
#     pmcid <- dat$resultList$result$pmcid
#     tibble(pmid = pmid, pmcid = pmcid)
#   })
# }
# 
# pmids_missing = names(pmcids[pmcids == ""])
# 
# get_pmcid_europepmc(pmids_missing) -> pmcid_europepmc_df
# 
# converted_ids <-
# data.frame(pmids = names(pmcids),
#            pmcids = pmcids
#            ) 
# 
# # checked using pmids to pmcs conversion from europe pmc webservice
# # can map 38367033 -> PMC12560237
# converted_ids <- converted_ids |>
#   mutate(pmcids = ifelse(pmids == "38367033", 
#                          "PMC12560237", 
#                          pmcids
#                          )
#          )
# 
# data.table::fwrite(converted_ids,
#                    here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
#                    )
# 
# # How many missing? 
# sum(pmcids == "")
# 
# pmcids <- pmcids[pmcids != ""]

Download full texts from Europe PMC

pmcids <-
converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

length(pmcids)
[1] 791
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(pmids), digits = 2)
[1] 79.66
download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europe_pmc/")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/",
                    "europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )
  
  resp <- GET(url_xml)
  
  # MED/20708005
  
  # ---- Fallback URL ----
  if(status_code(resp) != 200){
    
    #print(paste0("Trying alternative URL for ", pmcid))
    
    url_xml <- paste0("https://europepmc.org/",
                       "oai.cgi?verb=GetRecord",
                       "&metadataPrefix=pmc",
                       "&identifier=oai:europepmc.org:",
                       pmcid)
    
    resp <- GET(url_xml)
  
  }
  
  # ---- Fail if still bad ----
  if(status_code(resp) != 200){
    
  #message("Failed to fetch XML for ", pmcid)
  
  return(NULL)
    
  }
  
  # ---- Parse XML ----
  xml_content <- read_xml(content(resp,
                                  as = "text",
                                  encoding = "UTF-8")
                          )
  
  article_node = xml_find_first(xml_content, 
                               "//*[local-name() = 'article']"
                               )
  
   if (is.na(article_node)) {
    message("No <article> node found for ", pmcid)
     
    return(NULL)
   }
  
    # --- Save ---
  write_xml(article_node, 
            paste0(out_dir, pmcid, ".xml")
            )
  
} 


for(article in pmcids[pmcids != ""]){

download_pmc_text(article)

}
print("Number of downloaded full text files")
[1] "Number of downloaded full text files"
print("From European PMC:")
[1] "From European PMC:"
n_euro_pmc <- length(list.files(here::here("output/fulltexts/europe_pmc/"),
                  pattern = "\\.xml$")
       )

print("Percentage of pmids with full text from European PMC:")
[1] "Percentage of pmids with full text from European PMC:"
round(100 * n_euro_pmc / length(pmids), digits = 2)
[1] 52.77
converted_ids <-
data.table::fread(here::here("output/gwas_cat/gwas_pubmed_to_pmcid_mapping.csv")
)

convert_xml_text <- function(xml_content,
                             text # character vector accumulating output text
                             ){
  
    for(section in 1:xml_length(xml_content)){

    section_node = xml_child(xml_content, section)

    if(length(xml_path(xml_find_all(section_node,
                                    ".//*[.//title and .//p]"
                                    )
                       )
            ) == 0
       )
      {

      # Get section name:
      section_name = xml_text(xml_find_all(section_node, 
                                           ".//title"))
      
      section_name = str_squish(section_name)
      
      if(!rlang::is_empty(section_name)) {
         text = c(text, paste0("\n\n", section_name, "\n"))
      }

      # Get paragraphs
      para_nodes = xml_find_all(section_node, ".//p")
      para_texts = xml_text(para_nodes)
      para_texts = str_squish(para_texts)
      
      
      if(!rlang::is_empty(para_texts)) {
        text = c(text, paste0("\n", 
                              para_texts, 
                              "\n")
                 )
      }
      
      if(rlang::is_empty(section_name) && 
         rlang::is_empty(para_texts)) {
        
        all_node_text <- xml_text(section_node)
        
        if(!rlang::is_empty(all_node_text)){
        
        text = c(text, paste0("\n", 
                              all_node_text, 
                              "\n")
                 )
        
        }
      }
      
      label <- xml_text(xml_find_all(section_node, 
                                     ".//label"))
      
      href <- xml_attr(xml_find_all(section_node, 
                                    ".//media"), 
                       "href")
      
      if(!rlang::is_empty(label) && !rlang::is_empty(href)) {
        
        text = c(text, paste0("\n", label, ". ", href, "\n"))
        
      }
      
      

    } else {

    for(subsection in 1:xml_length(section_node)){

    subsection_node = xml_child(section_node,
                                subsection
                                )
    
    if(length(xml_children(subsection_node)) == 0){
      
      if(xml_name(subsection_node) == "title"){
        
        text = c(text, 
               paste0("\n\n", 
                      xml_text(subsection_node), 
                      "\n")
               )
        
      } else {
      
      text = c(text, 
               paste0("\n", 
                      xml_text(subsection_node), 
                      "\n")
               )
      
      }
      
      next
      
    }
                

    # Get section name:
    subsection_name = xml_text(xml_find_all(subsection_node, 
                                            ".//title")
                               )
    
    subsection_name = str_squish(subsection_name)
    
    if(!rlang::is_empty(subsection_name)) {
    # Add spaces around section titles
    text = c(text, paste0("\n\n", 
                          subsection_name, 
                          "\n")
             )
    }

    # Get paragraphs
    para_nodes = xml_find_all(subsection_node, 
                              ".//p")
    
    para_texts = xml_text(para_nodes)
    para_texts = str_squish(para_texts)
    
    if(!rlang::is_empty(para_texts)) {
    # Add spaces around paragraphs
    text = c(text, paste0("\n", 
                          para_texts, 
                          "\n")
             )
    }
    
    if(rlang::is_empty(subsection_name) &&
         rlang::is_empty(para_texts)) {    
    
        all_node_text <- xml_text(subsection_node)
        
        if(!rlang::is_empty(all_node_text)){
        
        text = c(text, paste0("\n", 
                              all_node_text, 
                              "\n")
                 )
        
        }
      }
            
    }
}
  }
  
  return(text)
  
}

extract_app_text <- function(xml_back, 
                             text){
  
  # Add separator for appendices section
  text <- c(text, "\n\n=== APPENDICES ===\n")
  
  #browser()
  
  for(node_id in seq_len(xml_length(xml_back))){
    
    node = xml_child(xml_back, 
                     node_id)
    
    #print(node)
    
    if(xml_name(node) == "app-group" && 
       length(xml_find_all(node, ".//sec")) > 0) {
      
      app_node = xml_find_all(node, ".//sec")
      
      text = convert_xml_text(app_node, text)
      
    } else if(xml_name(node) == "ref-list") {
      
      next 
      
    } else {
      
      text = convert_xml_text(node, 
                              text)
    }
  }

  return(text)
}

download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europepmc")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )

  resp <- GET(url_xml)

  if (status_code(resp) != 200) stop("Failed to fetch XML for ", pmcid)

  xml_content <- read_xml(content(resp,
                                  as = "text",
                                  encoding = "UTF-8")
                          )

  # Get text body xml content
  #browser()
  xml_body = xml_child(xml_content, "body")
  xml_back = xml_child(xml_content, "back")
  #browser()
  
  # Build text file
  # By converting xml structure into sections and subsections
  text = c()
  text = convert_xml_text(xml_body,
                          text
                          )
  
  text = c(text, "\n\n")
  text = extract_app_text(xml_back,
                          text
                          )
  
    # if Nat Genet article
  if(grepl("Nature genetics", 
        xml_text(xml_find_all(xml_content, 
                              ".//journal-title")),
        ignore.case = TRUE
        )
  ){
    
    # Find all figure nodes
    xml_figures <- xml_find_all(xml_content,
                                     ".//fig")
    
    if (length(xml_figures) != 0){
     
      text = c(text, "\n\nFigures:\n")
      
    }
    
    for(nodes in seq_along(xml_figures)){
      
      figure_node = xml_figures[nodes]
      
      # Extract label:
      label <- xml_text(xml_find_all(figure_node, 
                                     ".//label"))
      
      # Extract title
      title = xml_text(xml_find_all(figure_node, ".//title"))
      
      if(!rlang::is_empty(label) || !rlang::is_empty(title)){
        
        text = c(text,
                 paste0("\n", label, ". ", title, "\n")
                 )
      }
      
      # Extract caption
      caption = xml_text(xml_find_all(figure_node, 
                                      ".//caption//p"))
      
      if(!rlang::is_empty(caption)){
        
        text = c(text,
                 paste0("\n", caption, "\n")
                 )
      }
      
    }
    
    }
  
  
  # --- Save ---
  text_full <- paste(text, collapse = " ")

  txt_file <- file.path(out_dir,
                        paste0(pmcid, ".txt")
                        )
  writeLines(text_full,
             txt_file,
             useBytes = TRUE)

  #message("✅ Cleaned text saved for ", pmcid)
  invisible(text_full)
}


safe_download_pmc_text <- purrr::safely(download_pmc_text)


# Download full texts for all PMCIDs
for(pmcid in pmcids){
  if(pmcid != ""){
    
    result <- safe_download_pmc_text(pmcid)

    # if(!is.null(result$error)){
    #   message("❌ Failed to download text for ", pmcid,
    #           ": ", result$error)
    # }
  }
}

# How many texts saved? 
length(list.files(here::here("output/fulltexts/europepmc"),
                  pattern = "\\.txt$")
       )

For the remaining PMCIDs / PMIDs without full text, try downloading via the NCBI Cloud Service (PMC open data on AWS S3)


# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text - non-commercial use
# oa_noncomm 
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, other licence list
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 
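Before moving to R, the downloaded filelists can be spot-checked from the shell; the two tab-separated rows below are hypothetical (the real filelists carry more columns), but in both the S3 Key comes first and the PMCID appears in an AccessionID column.

```shell
# Hypothetical two-row excerpt of an oa_comm filelist (tab-separated)
printf 'oa_comm/txt/all/PMC100001.txt\tPMC100001\t11111111\n' >  /tmp/filelist_sample.txt
printf 'oa_comm/txt/all/PMC100002.txt\tPMC100002\t22222222\n' >> /tmp/filelist_sample.txt

# S3 key for one wanted PMCID
grep -w 'PMC100002' /tmp/filelist_sample.txt | cut -f1
```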

Identify full texts downloaded

europeanpmc_full_texts <- 
list.files(here::here("output/fulltexts/europe_pmc"),
                  pattern = "\\.xml"
           )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.xml$", 
       "", 
       europeanpmc_full_texts
       ) 
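The filename-to-PMCID stripping above has a one-line shell equivalent (filenames here are hypothetical):

```shell
# Recover PMCIDs from downloaded XML filenames by stripping the extension
printf '%s\n' PMC100001.xml PMC100002.xml | sed 's|\.xml$||'
```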

Get paths of the full texts we could download:

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 267
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))

oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  dplyr::filter(AccessionID %in% left_over_pmcids)

print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 160
oa_noncomm_to_get = 
oa_noncomm |>
#  dplyr::filter(PMID %in% names(left_over_pmcids)) 
  dplyr::filter(AccessionID %in% left_over_pmcids)

oa_comm_to_get = 
oa_comm |>
#  dplyr::filter(PMID %in% names(left_over_pmcids)) 
  dplyr::filter(AccessionID %in% left_over_pmcids)

print("Number of papers to download in Open Access PMC section:")
[1] "Number of papers to download in Open Access PMC section:"
nrow(oa_noncomm_to_get) + nrow(oa_comm_to_get)
[1] 1
oa_noncomm_to_get <-
  oa_noncomm_to_get |>
  dplyr::filter(!c(AccessionID %in% author_manu_to_get$AccessionID))

not_available = left_over_pmcids[!c(left_over_pmcids %in% 
                                          c(oa_noncomm_to_get$AccessionID, 
                                            oa_comm_to_get$AccessionID,
                                            author_manu_to_get$AccessionID)
                                          )]

print("Number of papers without full text available in NCBI Cloud Service:")
[1] "Number of papers without full text available in NCBI Cloud Service:"
length(not_available)
[1] 106
file_paths = 
c(oa_noncomm_to_get$Key,
  oa_comm_to_get$Key,
  author_manu_to_get$Key)

file_paths <- str_replace_all(file_paths,
                              pattern = "txt",
                              replacement = "xml")

# percentage not available, from all papers
100 * length(not_available) / length(pmids)
[1] 10.67472
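Note that str_replace_all(pattern = "txt") rewrites every "txt" substring in a key, both the txt/ directory component and the .txt extension, which appears to be the intent given the bucket's mirrored xml/ layout. A more explicit shell sketch of the same rewrite (the path below is hypothetical):

```shell
# Rewrite a txt key to its xml counterpart, touching only the
# directory component and the extension (path is hypothetical)
path="author_manuscript/txt/all/PMC100001.txt"
printf '%s\n' "$path" | sed -e 's|/txt/|/xml/|' -e 's|\.txt$|.xml|'
```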

Download remaining full texts from NCBI Cloud Service

writeLines(
  file_paths,
  here::here("output/fulltexts/aws_locations/selected_paths.txt")
)

system(
  paste(
    "xargs -I {} aws s3 cp",
    "s3://pmc-oa-opendata/{}",
    shQuote(here::here("output/fulltexts/ncbi_cloud/")),
    "--no-sign-request",
    "<",
    shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
  )
)
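The xargs pipeline can be dry-run by echoing each command instead of executing aws, which is a cheap way to confirm the key-to-command expansion before the real transfer (sample keys below are hypothetical):

```shell
# Dry run: print the aws command for each key instead of executing it
printf '%s\n' \
  'oa_comm/xml/all/PMC100001.xml' \
  'oa_comm/xml/all/PMC100002.xml' > /tmp/selected_paths_sample.txt

xargs -I {} echo aws s3 cp "s3://pmc-oa-opendata/{}" \
  output/fulltexts/ncbi_cloud/ --no-sign-request \
  < /tmp/selected_paths_sample.txt
```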

Download PDFs using Open Access information from OpenAlex

PMIDs that couldn’t be converted to PMCIDs

# old approach for getting dois (requires library(rentrez); not_convertable_pmids is defined below):
entrez_info <-
entrez_summary(db="pubmed", 
               id=not_convertable_pmids)

dois <-
entrez_info |>
  purrr::map(function(x) {
    
    x$articleids |> 
      filter(idtype == "doi") |> 
      pull(value)
  }
)
library(openalexR)
openalexR v2.0.0 introduces breaking changes.
See NEWS.md for details.

To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
Warning in fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv")): Found and
resolved improper quoting out-of-sample. First healed line 8637909:
<<17234576,PMC2742408,"https://doi.org/10.1102/1470-7330.2007.0001>>. If the
fields are not quoted (e.g. field separator does not appear within any field),
try quote="" to avoid this warning.
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

doi_information <-
convert_pmid_df |>
  filter(PMID %in% not_convertable_pmids)

doi_information |>
  filter(DOI == "")
Empty data.table (0 rows and 3 cols): PMID,PMCID,DOI
doi_information$PMID |> unique() |> length()
[1] 202
length(not_convertable_pmids)
[1] 202
# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information$DOI),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()
[1] 110
# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()
[1] 65
to_download_pdfs <-
open_alex_works |> 
  filter(grepl(".pdf", oa_url)) |>
  pull(oa_url)

  writeLines(
    to_download_pdfs,
    here::here("output/fulltexts/pdfs/pdf_links_to_download.txt"))

cd output/fulltexts/pdfs

while read -r url; do
  curl -O "$url"
done < pdf_links_to_download.txt

PMCIDs not found in Author Manuscripts or Open Access sections

doi_information <-
convert_pmid_df |>
  filter(PMCID %in% not_available)

doi_information |>
  filter(DOI == "")

doi_information <-
  doi_information |>
  filter(DOI != "")

# get open alex works for pmcids
open_alex_works <- oa_fetch(
  doi = doi_information$DOI,
  entity = "works",
  options = list(select = c(#"title",
                            "doi", 
                            "open_access"
                            ))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl(".pdf", oa_url)) |>
  pull(oa_url)

  writeLines(
    to_download_pdfs,
    here::here("output/fulltexts/pdfs/pdf_links_to_download_pt2.txt"))

cd output/fulltexts/pdfs

while read -r url; do
  curl -O "$url"
done < pdf_links_to_download_pt2.txt

Test: not used - Europe PMC Author Manuscripts


curl -s https://europepmc.org/ftp/manuscripts/ \
  | grep -o 'author_manuscript_txt[^"]*\.filelist\.txt' \
  | sort -u \
  | while read -r file; do
      curl -O "https://europepmc.org/ftp/manuscripts/$file"
    done
all_file_lists <- list.files(here::here("data/epmc"))

author_manu_epmc <- all_file_lists |>
                    purrr::map(function(file_name) {
                      
                      file_path = here::here("data/epmc",
                                             file_name
                                             )
                      
                      df <- fread(file_path)
                      
                      return(df)
                    }
                    ) |>
                    bind_rows()

author_manu_epmc |>
  filter(AccessionID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% not_avaliable_pmids)

author_manu_epmc |>
  filter(PMID %in% pmids)

author_manu_epmc |>
  filter(PMID %in% not_convertable_pmids)

Test: not used - Download with FTP service, where available

# not_available
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

all_tgz_links = c()

for(article_id in "PMC2613843"){

url <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=",
              article_id)
  
resp <- GET(url)

xml_data <- xml_child(content(resp), "records")

tgz_link <- xml_find_first(xml_data, 
                           ".//link[@format='tgz']/@href")
tgz_link <- xml_text(tgz_link)

if (is.na(tgz_link)) {
  
  print("No tar.gz link found.")
  
} else {
  
  all_tgz_links <- append(all_tgz_links, 
                          tgz_link)
  
}
}
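The tgz-href extraction can be checked offline against a hand-made stand-in for an oa.fcgi response; the record and URL below are hypothetical, not a real API payload.

```shell
# Hypothetical oa.fcgi-style record with a tgz download link
cat > /tmp/oa_record_sample.xml <<'EOF'
<OA><records><record id="PMC2613843">
<link format="tgz" href="ftp://example.org/pub/pmc/PMC2613843.tar.gz"/>
</record></records></OA>
EOF

# Pull the href out of the tgz link with sed
sed -n 's|.*<link format="tgz" href="\([^"]*\)".*|\1|p' /tmp/oa_record_sample.xml
```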

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] openalexR_2.0.1   data.table_1.17.8 dplyr_1.1.4       here_1.0.1       
[5] stringr_1.6.0     xml2_1.4.0        httr_1.4.7        workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0      compiler_4.3.1      BiocManager_1.30.26
 [4] renv_1.0.3          promises_1.3.3      tidyselect_1.2.1   
 [7] Rcpp_1.1.0          git2r_0.36.2        callr_3.7.6        
[10] later_1.4.4         jquerylib_0.1.4     yaml_2.3.10        
[13] fastmap_1.2.0       R6_2.6.1            generics_0.1.4     
[16] curl_7.0.0          knitr_1.50          tibble_3.3.0       
[19] rprojroot_2.1.0     bslib_0.9.0         pillar_1.11.1      
[22] rlang_1.1.6         cachem_1.1.0        stringi_1.8.7      
[25] httpuv_1.6.16       xfun_0.55           getPass_0.2-4      
[28] fs_1.6.6            sass_0.4.10         cli_3.6.5          
[31] withr_3.0.2         magrittr_2.0.4      ps_1.9.1           
[34] digest_0.6.37       processx_3.8.6      rstudioapi_0.17.1  
[37] lifecycle_1.0.4     vctrs_0.6.5         evaluate_1.0.5     
[40] glue_1.8.0          whisker_0.4.1       rmarkdown_2.30     
[43] tools_4.3.1         pkgconfig_2.0.3     htmltools_0.5.8.1