Last updated: 2026-02-04

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version c0dc676. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    Aus_School_Profile.xlsx
    Ignored:    SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacyr_venv/
    Ignored:    test_37689528.xml

Untracked files:
    Untracked:  code/full_text_conversion/elsevier_to_jats_v2.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v3.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v4.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v5.R
    Untracked:  code/full_text_conversion/fix_elsevier_xml.py
    Untracked:  code/full_text_conversion/testing_fix_elsevier.R
    Untracked:  debug_elsevier.R
    Untracked:  schools.R
    Untracked:  testing.R

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_dbgap_ids.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd
    Modified:   code/full_text_conversion/elsevier_to_jats.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c0dc676 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 1898c02 IJbeasley 2026-02-04 Build site.
Rmd d214580 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 6ba1e1f IJbeasley 2026-01-12 Build site.
Rmd b43e9a9 IJbeasley 2026-01-12 Update getting full text
html ac0d1a7 IJbeasley 2025-10-27 Build site.
html 8642872 IJbeasley 2025-10-27 Build site.
Rmd da4d730 IJbeasley 2025-10-27 Now run on all texts
html fb5cfd9 IJbeasley 2025-10-27 Build site.
Rmd 8ed4c37 IJbeasley 2025-10-27 Now run on all texts
html 8610283 IJbeasley 2025-10-27 Build site.
Rmd 7d504e3 IJbeasley 2025-10-27 More fixing of download full text
html 16f4c19 IJbeasley 2025-10-27 Build site.
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs

Get Pubmed ids from GWAS catalog

## Step 1: 
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

# replace spaces in column names with underscores
gwas_study_info <- gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
    dplyr::filter(!cause %in% c("HIV/AIDS",
                             "Tuberculosis",
                             "Malaria",
                             "Lower respiratory infections",
                             "Diarrhoeal diseases",
                             "Neonatal disorders",
                             "Tetanus",
                             "Diphtheria",
                             "Pertussis" ,
                             "Measles",
                             "Maternal disorders"))

# gwas_study_info <- gwas_study_info |>
#   dplyr::filter(DISEASE_STUDY == TRUE)

print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
pmids <- unique(gwas_study_info$PUBMED_ID)
length(pmids)
[1] 821

Convert Pubmed IDs to PMCIDs

# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

convert_pmid_df <- convert_pmid_df |>
  dplyr::rename(pmcids = PMCID
                ) |>
  dplyr::mutate(pmcids = ifelse(is.na(pmcids),
                                "",
                                pmcids
                                )
                )

convert_pmid_df <-
  convert_pmid_df |>
  dplyr::filter(!is.na(PMID))

converted_ids = 
  convert_pmid_df |>
  filter(PMID %in% pmids)

data.table::fwrite(converted_ids,
                   here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")
                   )
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))

print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
       PMID     pmcids                                           DOI
      <int>     <char>                                        <char>
1: 17223258             https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876                      https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697                           https://doi.org/10.1038/ng2043
5: 17463246                  https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617       https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 821   3
length(pmids)
[1] 821
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <-
converted_ids |>
  filter(pmcids == "")  |>
  pull(PMID)

print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
pmcids <-
converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 648
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(pmids), digits = 2)
[1] 78.93

Download full texts from European PMC

Downloading full-text XML from the Europe PMC RESTful API requires a PMCID, so this step can only be applied to papers with a PMCID mapping.

# Function to download full text xml from Europe PMC Restful API
download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europe_pmc/")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/",
                    "europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )
  
  resp <- GET(url_xml)
  
  # ---- Fallback URL ----
  if(status_code(resp) != 200){
    
    url_xml <- paste0("https://europepmc.org/",
                       "oai.cgi?verb=GetRecord",
                       "&metadataPrefix=pmc",
                       "&identifier=oai:europepmc.org:",
                       pmcid)
    
    resp <- GET(url_xml)
  
  }
  
  # ---- Fail if still bad ----
  if(status_code(resp) != 200){
    
  return(NULL)
    
  }
  
  # ---- Parse XML ----
  xml_content <- read_xml(
    content(resp, 
            as = "text", 
            encoding = "UTF-8")
  )
  
  # the OAI fallback wraps the record, so locate the <article> node
  # regardless of namespace
  article_node <- xml_find_first(xml_content, 
                                 "//*[local-name() = 'article']"
                                 )
  
  if (is.na(article_node)) {
    message("No <article> node found for ", pmcid)
    
    return(NULL)
  }
  
  # --- Save ---
  write_xml(article_node, 
            paste0(out_dir, pmcid, ".xml")
            )
  
} 


for (article in pmcids[pmcids != ""]) {

  download_pmc_text(article)

}
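The loop above makes one request per PMCID with no pause or retry, so a transient API failure silently drops that paper. A minimal retry wrapper could help; this is a sketch, and `with_retries` is a hypothetical helper, not part of the analysis. It treats a `NULL` result as failure, which matches `download_pmc_text` returning `NULL` on bad responses (worth verifying that the success path returns something non-NULL before relying on it).

```r
# Hypothetical helper: call `f` up to `times` times, pausing between
# attempts; `succeeded` decides whether a result counts as success.
with_retries <- function(f, times = 3, pause = 1,
                         succeeded = function(x) !is.null(x)) {
  result <- NULL
  for (i in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (succeeded(result)) break      # success: stop retrying
    if (i < times) Sys.sleep(pause)   # back off before the next attempt
  }
  result
}

# e.g. with_retries(function() download_pmc_text(article))
```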
print("Number of downloaded full text files from European PMC:")
[1] "Number of downloaded full text files from European PMC:"
n_euro_pmc <- length(list.files(here::here("output/fulltexts/europe_pmc/"),
                  pattern = "\\.xml$")
       )
print(n_euro_pmc)
[1] 524
print("Percentage of pmids with full text from European PMC:")
[1] "Percentage of pmids with full text from European PMC:"
round(100 * n_euro_pmc / length(pmids), digits = 2)
[1] 63.82

Download full texts from NCBI Cloud Service

For the remaining PMCIDs / PMIDs without full text, try downloading them from the NCBI Cloud Service (the PMC article datasets on AWS S3).


# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text - non-commercial use
# oa_noncomm 
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, other-license list
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

Identify full texts already downloaded through European PMC

europeanpmc_full_texts <- 
  list.files(here::here("output/fulltexts/europe_pmc"),
             pattern = "\\.xml$"
             )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.xml$", 
       "", 
       europeanpmc_full_texts
       ) 

Get NCBI download paths for remaining full texts (where available)

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 220
print("+ Number of pmids without pmcid mapping:")
[1] "+ Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))

oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 135
oa_noncomm_to_get = 
oa_noncomm |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

# remove any overlaps between sections
oa_noncomm_to_get <-
  oa_noncomm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID))

print("Number of additional papers to download in the Non-commercial Open Access PMC section:")
[1] "Number of additional papers to download in the Non-commercial Open Access PMC section:"
nrow(oa_noncomm_to_get)
[1] 0
oa_comm_to_get = 
oa_comm |>
  dplyr::filter(AccessionID %in% left_over_pmcids |
                PMID %in% not_converted_pmids)

oa_comm_to_get <-
  oa_comm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID)) |>
  dplyr::filter(!c(PMID %in% oa_noncomm_to_get$PMID))

# remove any overlaps between sections
print("Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:")
[1] "Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:"
nrow(oa_comm_to_get)
[1] 0
file_paths = 
c(oa_noncomm_to_get$Key,
  oa_comm_to_get$Key,
  author_manu_to_get$Key)

# the filelist Keys point at the .txt copies; switch the directory and
# extension to get the XML versions
file_paths <- str_replace_all(file_paths,
                              pattern = "txt",
                              replacement = "xml")
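As a sanity check on the substitution above: replacing every `txt` with `xml` switches both the subdirectory and the file extension of a filelist Key. The example Key below is hypothetical, following the `collection/txt/...` layout of the filelists.

```r
library(stringr)

# Hypothetical Key in the format used by the PMC filelists
example_key <- "oa_comm/txt/all/PMC1234567.txt"

str_replace_all(example_key, pattern = "txt", replacement = "xml")
# → "oa_comm/xml/all/PMC1234567.xml"
```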

Download remaining full texts from NCBI Cloud Service

writeLines(
  file_paths,
  here::here("output/fulltexts/aws_locations/selected_paths.txt")
)

system(
  paste(
    "xargs -I {} aws s3 cp",
    "s3://pmc-oa-opendata/{}",
    shQuote(here::here("output/fulltexts/ncbi_cloud/")),
    "--no-sign-request",
    "<",
    shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
  )
)
# not_available = left_over_pmcids[!c(left_over_pmcids %in% 
#                                           c(oa_noncomm_to_get$AccessionID, 
#                                             oa_comm_to_get$AccessionID,
#                                             author_manu_to_get$AccessionID)
#                                           )]

# get list of pmcids already retrieved
pmcids_retrieved <- 
list.files(c(here::here("output/fulltexts/europe_pmc"),
             here::here("output/fulltexts/ncbi_cloud/")
             ),
             pattern = "\\.xml$"
)

pmcids_retrieved <-
  gsub("\\.xml$", 
       "", 
       pmcids_retrieved
       )

pmids_retrieved <-
converted_ids  |>
  filter(pmcids %in% pmcids_retrieved) |>
  pull(PMID)

all_pmids <- unique(gwas_study_info$PUBMED_ID)

not_available <- all_pmids[!c(all_pmids %in% pmids_retrieved)] 

print("Percentage of pmids with full text from NCBI Cloud Service:")
[1] "Percentage of pmids with full text from NCBI Cloud Service:"
100 * (length(pmids_retrieved) - n_euro_pmc) / length(pmids)
[1] 4.872107
print("Percentage of pmids without full text from either European PMC or NCBI Cloud Service:")
[1] "Percentage of pmids without full text from either European PMC or NCBI Cloud Service:"
100 * length(not_available) / length(pmids)
[1] 31.30329

Download from publishers (uses DOIs)

Get DOIs for the remaining articles still missing full text

doi_information <-
converted_ids |>
  filter(PMID %in% not_available)
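The per-publisher counts below can be cross-checked in one pass by tallying DOI registrant prefixes. This is a sketch with hypothetical DOI strings; in the analysis the input would be `doi_information$DOI`, which stores DOIs as `https://doi.org/...` URLs.

```r
# Hypothetical DOIs in the format used by the Europe PMC mapping file
dois <- c("https://doi.org/10.1152/japplphysiol.0001",
          "https://doi.org/10.1152/ajpheart.0002",
          "https://doi.org/10.2337/db20-0001")

# Extract the "10.XXXX" registrant prefix and count papers per publisher
prefixes <- sub("^https://doi\\.org/(10\\.[0-9]+)/.*$", "\\1", dois)
sort(table(prefixes), decreasing = TRUE)
```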

Publishers to get text-mining license info from:

American Physiological Society, doi: 10.1152

# check, how many papers:
aps_doi_patterns <- "10.1152"

aps_links <-
  grep(aps_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from APS:")
[1] "Number of papers potentially available from APS:"
length(aps_links)
[1] 2

American Association for Cancer Research, doi: 10.1158

aacr_doi_patterns <- "10.1158"

aacr_links <-
  grep(aacr_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AACR:")
[1] "Number of papers potentially available from AACR:"
length(aacr_links)
[1] 4

AHA, doi: 10.1161

aha_doi_patterns <- "10.1161"

aha_links <-
  grep(aha_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AHA:")
[1] "Number of papers potentially available from AHA:"
length(aha_links)
[1] 4

? ATS: doi: 10.1164, 10.1165 (moving to Oxford Academic in March 2026)

ats_doi_patterns <- "10.1164|10.1165"

ats_links <-
  grep(ats_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ATS:")
[1] "Number of papers potentially available from ATS:"
length(ats_links)
[1] 12

ASH, doi: 10.1182

ash_doi_patterns <- "10.1182"

ash_links <-
  grep(ash_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASH:")
[1] "Number of papers potentially available from ASH:"
length(ash_links)
[1] 1

ERS, doi: 10.1183

ers_doi_patterns <- "10.1183"

ers_links <-
  grep(ers_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ERS:")
[1] "Number of papers potentially available from ERS:"
length(ers_links)
[1] 1

ASCO, doi: 10.1200

asco_doi_patterns <- "10.1200"

asco_links <-
  grep(asco_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASCO:")
[1] "Number of papers potentially available from ASCO:"
length(asco_links)
[1] 1

AAN, doi: 10.1212

aan_doi_patterns <- "10.1212"

aan_links <-
  grep(aan_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AAN:")
[1] "Number of papers potentially available from AAN:"
length(aan_links)
[1] 3

J-STAGE: doi: 10.1248

jstage_doi_patterns <- "10.1248"

jstage_links <-
  grep(jstage_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from J-STAGE:")
[1] "Number of papers potentially available from J-STAGE:"
length(jstage_links)
[1] 1

JASN: doi: 10.1681

jasn_doi_patterns <- "10.1681"

jasn_links <-
  grep(jasn_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from JASN:")
[1] "Number of papers potentially available from JASN:"
length(jasn_links)
[1] 4

(ADA) Diabetes, doi: 10.2337

diabetes_doi_patterns <- "10.2337"

diabetes_links <-
  grep(diabetes_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from Diabetes:")
[1] "Number of papers potentially available from Diabetes:"
length(diabetes_links)
[1] 12

Testing:

Download PDFs using Open Access information from OpenAlex

PMIDs that couldn’t be converted to PMCIDs

# old approach for getting dois (uses the rentrez package):
library(rentrez)

entrez_info <-
entrez_summary(db = "pubmed", 
               id = not_convertable_pmids)

dois <-
entrez_info |>
  purrr::map(function(x) {
    
    x$articleids |> 
      filter(idtype == "doi") |> 
      pull(value)
  }
)
library(openalexR)

convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

doi_information <-
convert_pmid_df |>
  filter(PMID %in% not_convertable_pmids)

doi_information |>
  filter(DOI == "")

doi_information$PMID |> unique() |> length()

length(not_convertable_pmids)

# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information$DOI),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl("\\.pdf", oa_url)) |>
  pull(oa_url)

writeLines(
  to_download_pdfs,
  here::here("output/fulltexts/pdfs/pdf_links_to_download.txt")
)

cd output/fulltexts/pdfs

while read -r url; do
  curl -fLO "$url"
done < pdf_links_to_download.txt

PMCIDs not found in Author Manuscripts or Open Access sections

doi_information <-
convert_pmid_df |>
  # not_available holds PMIDs (computed above), so match on PMID
  filter(PMID %in% not_available)

doi_information |>
  filter(DOI == "")

doi_information <-
  doi_information |>
  filter(DOI != "")

# get open alex works for pmcids
open_alex_works <- oa_fetch(
  doi = doi_information$DOI,
  entity = "works",
  options = list(select = c(#"title",
                            "doi", 
                            "open_access"
                            ))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl("\\.pdf", oa_url)) |>
  pull(oa_url)

writeLines(
  to_download_pdfs,
  here::here("output/fulltexts/pdfs/pdf_links_to_download_pt2.txt")
)

cd output/fulltexts/pdfs

while read -r url; do
  curl -fLO "$url"
done < pdf_links_to_download_pt2.txt

Test: not used - Europe PMC Author Manuscripts


curl -s https://europepmc.org/ftp/manuscripts/ \
  | grep -o 'author_manuscript_txt[^"]*\.filelist\.txt' \
  | sort -u \
  | while read -r file; do
      curl -O "https://europepmc.org/ftp/manuscripts/$file"
    done
all_file_lists <- list.files(here::here("data/epmc"))

author_manu_epmc <- all_file_lists |>
                    purrr::map(function(file_name) {
                      
                      file_path = here::here("data/epmc",
                                             file_name
                                             )
                      
                      df <- fread(file_path)
                      
                      return(df)
                    }
                    ) |>
                    bind_rows()

author_manu_epmc |>
  filter(AccessionID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% pmids)

author_manu_epmc |>
  filter(PMID %in% not_convertable_pmids)

Test: not used - Download with ftp service, where available

# not_available
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

all_tgz_links = c()

for(article_id in "PMC2613843"){

url <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=",
              article_id)
  
resp <- GET(url)

xml_data <- xml_child(content(resp), "records")

tgz_link <- xml_find_first(xml_data, 
                           ".//link[@format='tgz']/@href")
tgz_link <- xml_text(tgz_link)

if (is.na(tgz_link)) {
  
  print("No tar.gz link found.")
  
} else {
  
  all_tgz_links <- append(all_tgz_links, 
                          tgz_link)
  
}
}

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] rcrossref_1.2.1   data.table_1.17.8 dplyr_1.1.4       here_1.0.1       
[5] stringr_1.6.0     xml2_1.4.0        httr_1.4.7        workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] sass_0.4.10         generics_0.1.4      renv_1.0.3         
 [4] stringi_1.8.7       httpcode_0.3.0      digest_0.6.37      
 [7] magrittr_2.0.4      evaluate_1.0.5      fastmap_1.2.0      
[10] plyr_1.8.9          rprojroot_2.1.0     jsonlite_2.0.0     
[13] processx_3.8.6      whisker_0.4.1       crul_1.6.0         
[16] ps_1.9.1            promises_1.3.3      BiocManager_1.30.26
[19] jquerylib_0.1.4     cli_3.6.5           shiny_1.11.1       
[22] rlang_1.1.6         withr_3.0.2         cachem_1.1.0       
[25] yaml_2.3.10         tools_4.3.1         httpuv_1.6.16      
[28] DT_0.34.0           curl_7.0.0          vctrs_0.6.5        
[31] R6_2.6.1            mime_0.13           lifecycle_1.0.4    
[34] git2r_0.36.2        fs_1.6.6            htmlwidgets_1.6.4  
[37] miniUI_0.1.2        pkgconfig_2.0.3     callr_3.7.6        
[40] pillar_1.11.1       bslib_0.9.0         later_1.4.4        
[43] glue_1.8.0          Rcpp_1.1.0          xfun_0.55          
[46] tibble_3.3.0        tidyselect_1.2.1    rstudioapi_0.17.1  
[49] knitr_1.50          xtable_1.8-4        htmltools_0.5.8.1  
[52] rmarkdown_2.30      compiler_4.3.1      getPass_0.2-4