Last updated: 2026-03-24

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ce8519d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    Aus_School_Profile.xlsx
    Ignored:    BC2GM/
    Ignored:    BioC.dtd
    Ignored:    FormatConverter.jar
    Ignored:    FormatConverter.zip
    Ignored:    SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    doccano_venv/
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/supplement/
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacy_venv_requirements.txt
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  code/full_text_conversion/html_to_xml.R
    Untracked:  code/test_cohort_desc_file.R
    Untracked:  code/text_mining_models/tokenise_data.py
    Untracked:  output/fulltexts/
    Untracked:  schools.R

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_dbgap_ids.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_supplement.Rmd) and HTML (docs/get_supplement.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 4afd39d IJbeasley 2026-03-24 Add rmarkdown page for getting supplement

Ideas/help for downloading supplemental files:

- https://pmc.ncbi.nlm.nih.gov/articles/PMC12371329/

What file types are likely to be relevant?

- pdf
- Excel spreadsheets (xls, xlsx)
- Data files (csv, txt)
- Word documents (docx, doc)
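As a quick illustration (not part of the pipeline below), this relevance check can be expressed as a small helper over file extensions; the helper name is ours, and the extension set is taken from the list above:

```r
# Flag supplementary files whose extension is in the "relevant" set.
# File names here are illustrative.
relevant_ext <- c("pdf", "xls", "xlsx", "csv", "txt", "doc", "docx")

is_relevant_supplement <- function(paths) {
  ext <- tolower(tools::file_ext(paths))  # normalise case, e.g. "XLSX" -> "xlsx"
  ext %in% relevant_ext
}

is_relevant_supplement(c("table_s1.XLSX", "figure_s2.gif", "methods.docx"))
#> [1]  TRUE FALSE  TRUE
```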

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs and PMIDs for GWAS Catalog studies of diseases

PubMed IDs from the GWAS Catalog

## Step 1: get only the relevant disease studies
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

# replace spaces in column names with underscores
gwas_study_info <- gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
    dplyr::filter(!cause %in% c("HIV/AIDS",
                             "Tuberculosis",
                             "Malaria",
                             "Lower respiratory infections",
                             "Diarrhoeal diseases",
                             "Neonatal disorders",
                             "Tetanus",
                             "Diphtheria",
                             "Pertussis",
                             "Measles",
                             "Maternal disorders"))

print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
all_pmids <- unique(gwas_study_info$PUBMED_ID)
length(all_pmids)
[1] 821

PMCIDs for these PMIDs from the Europe PMC mapping file

converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))

print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
       PMID     pmcids                                           DOI
      <int>     <char>                                        <char>
1: 17223258             https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876                      https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697                           https://doi.org/10.1038/ng2043
5: 17463246                  https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617       https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 821   3
length(all_pmids)
[1] 821
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <- converted_ids |>
  dplyr::filter(pmcids == "") |>
  dplyr::pull(PMID)

print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
pmcids <- converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 648
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(all_pmids), digits = 2)
[1] 78.93
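For reference, a mapping file like pmid_to_pmcid_mapping.csv can be built with the Europe PMC search REST API, whose JSON results include a pmcid field when one exists. A sketch of the query URL only (the helper name is ours; no request is made here):

```r
# Build a Europe PMC REST search URL for one PMID. The "EXT_ID:<pmid> AND
# SRC:MED" query restricts the search to that PubMed record; format=json
# asks for a JSON response.
epmc_lookup_url <- function(pmid) {
  paste0(
    "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
    "?query=EXT_ID:", pmid, "%20AND%20SRC:MED",
    "&format=json"
  )
}

epmc_lookup_url(17434096)
```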

Download from Europe PMC

library(httr)
library(here)
library(tools)

download_pmc_supplements <- function(pmcid,
                                     out_dir = here::here("output/supplement")) {
  
  # Create PMCID-specific subdirectory
  pmcid_dir <- file.path(out_dir, pmcid)
  
  # Check if directory already exists and has files
  if (dir.exists(pmcid_dir) && length(list.files(pmcid_dir)) > 0) {
    message("Supplementary files for ", pmcid, " already exist. Skipping.")
    return(invisible(list(pmcid = pmcid, status = "skipped", files = list.files(pmcid_dir))))
  }
  
  # Create directory if it doesn't exist
  dir.create(pmcid_dir, recursive = TRUE, showWarnings = FALSE)
  
  # Build URL
  sup_url <- paste0(
    "https://www.ebi.ac.uk/europepmc/webservices/rest/",
    pmcid,
    "/supplementaryFiles"
  )
  
  message("Fetching supplementary files for ", pmcid, "...")
  
  # Save zip to a temp file
  zip_path <- file.path(pmcid_dir, 
                        paste0(pmcid, "_supplements.zip"))
  resp <- GET(sup_url, 
              write_disk(zip_path, overwrite = TRUE))
  
  if (status_code(resp) != 200) {
    warning("Failed to retrieve supplements for ", pmcid,
            ". HTTP status: ", status_code(resp))
    unlink(pmcid_dir, recursive = TRUE)
    return(invisible(list(pmcid = pmcid, status = "failed", files = NULL)))
  }
  
  # Unzip into the PMCID folder
  unzip_result <- tryCatch({
    unzip(zip_path, exdir = pmcid_dir)
  }, error = function(e) {
    warning("Failed to unzip supplements for ", pmcid, ": ", e$message)
    NULL
  })
  
  # Remove the zip file after extraction
  file.remove(zip_path)
  
  if (is.null(unzip_result)) {
    return(invisible(list(pmcid = pmcid, status = "unzip_failed", files = NULL)))
  }
  
  extracted_files <- list.files(pmcid_dir, recursive = TRUE, full.names = TRUE)
  message("Extracted ", length(extracted_files), " file(s) to ", pmcid_dir)
  
  return(invisible(list(
    pmcid  = pmcid,
    status = "success",
    dir    = pmcid_dir,
    files  = extracted_files
  )))
}


# Batch across multiple PMCIDs
results <- lapply(pmcids, function(id) {
  tryCatch(
    download_pmc_supplements(id),
    error = function(e) {
      message("Error processing ", id, ": ", e$message)
      list(pmcid = id, status = "error", files = NULL)
    }
  )
})

find output/supplement -name "*.zip" | while read f; do
  unzip -o "$f" -d "$(dirname "$f")"
done

find output/supplement -type d -empty -delete
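The shell pass above exists because some supplements are themselves zip archives. From R, any archives still unextracted after that pass can be located the same way; the directory tree below is a throwaway illustration, not the real output/supplement folder:

```r
# Create a temporary tree mimicking one PMCID folder, then list any .zip
# files that would still need extracting.
tmp <- file.path(tempdir(), "supplement_demo")
dir.create(file.path(tmp, "PMC0000001"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(tmp, "PMC0000001", "tables.zip"),
            file.path(tmp, "PMC0000001", "methods.pdf"))

leftover_zips <- list.files(tmp, pattern = "\\.zip$", recursive = TRUE)
leftover_zips
#> [1] "PMC0000001/tables.zip"
```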

Download from NCBI via AWS

# check how many pmcids I have downloaded supplementary materials for
sup_dir <- here::here("output/supplement")

folders <- list.dirs(sup_dir, 
                     full.names = TRUE, 
                     recursive = FALSE)

folders_with_files <- folders[
  sapply(folders, function(d) length(list.files(d, recursive = TRUE)) > 0)
]

not_retrieved_pmcids <- setdiff(pmcids, basename(folders_with_files))

print("PMCIDs for which I could not retrieve supplements from Europe PMC:")
[1] "PMCIDs for which I could not retrieve supplements from Europe PMC:"
print(length(not_retrieved_pmcids))
[1] 226
writeLines(
  not_retrieved_pmcids,
  here::here("output/supplement/selected_pmcids.txt")
)

bash code/extract_text/download_pmc_supplements_aws.sh --file output/supplement/selected_pmcids.txt

find output/supplement -type d -empty -delete

Check how many supplements I have downloaded, and what file types they are

# check how many pmcids I have downloaded supplementary materials for
sup_dir <- here::here("output/supplement")

folders <- list.dirs(sup_dir, 
                     full.names = TRUE, 
                     recursive = FALSE)

folders_with_files <- folders[
  sapply(folders, function(d) length(list.files(d, recursive = TRUE)) > 0)
]

message(length(folders_with_files), 
        " / ", 
        length(folders), 
        " folders contain at least one file")
423 / 423 folders contain at least one file
# check extensions of files downloaded
all_files <- unlist(lapply(folders_with_files, 
                           list.files, 
                           recursive = TRUE, 
                           full.names = TRUE))

file_extensions <- tools::file_ext(all_files) |>
  table() |>
  as.data.frame() |>
  dplyr::arrange(desc(Freq))

print("File extensions of downloaded supplementary files:")
[1] "File extensions of downloaded supplementary files:"
print(file_extensions)
    Var1 Freq
1    gif 2075
2    jpg 1950
3    xml  588
4    pdf  586
5   xlsx  584
6   docx  263
7    tif  167
8    doc  131
9    zip   38
10  tiff   22
11  pptx   20
12   xls   20
13  XLSX   18
14   png   15
15    ai   12
16  html   11
17   txt    9
18    py    8
19   PNG    7
20    sh    6
21  DOCX    5
22   csv    4
23   TIF    4
24   eps    2
25  jpeg    1
26   mp4    1
27    pl    1
28   ppt    1
29     R    1
30 tifff    1
folders_with_relevant_files <- folders[
  sapply(folders, function(d) length(list.files(d, 
                                                recursive = TRUE,
                                                # pattern is a regex, not a glob
                                                pattern = "\\.(pdf|docx?|xlsx?)$",
                                                ignore.case = TRUE)) > 0)
]

message(length(folders_with_relevant_files), 
        " / ", length(folders), 
        " folders contain at least one relevant file type (pdf, docx, doc, xls, xlsx)")
371 / 423 folders contain at least one relevant file type (pdf, docx, doc, xls, xlsx)
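Because the extension tally above counts e.g. xlsx and XLSX as separate rows, folding case before tabulating gives one row per extension. A minimal sketch with made-up file names:

```r
# Fold extension case before tabulating so "XLSX" and "xlsx" share a row.
exts <- c("a.xlsx", "b.XLSX", "c.pdf", "d.PNG", "e.png")

ext_counts <- tolower(tools::file_ext(exts)) |>
  table() |>
  sort(decreasing = TRUE)

ext_counts
```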

Convert supplementary materials

folders <- list.dirs(sup_dir, 
                     full.names = TRUE, 
                     recursive = FALSE)

folders_with_relevant_files <- folders[
  sapply(folders, function(d) length(list.files(d, 
                                                recursive = TRUE,
                                                # pattern is a regex, not a glob
                                                pattern = "\\.(pdf|docx?|xlsx?|txt)$",
                                                ignore.case = TRUE)) > 0)
]

print("Number of folders with at least one relevant file type (pdf, docx, doc, txt, xls, xlsx):")
[1] "Number of folders with at least one relevant file type (pdf, docx, doc, txt, xls, xlsx):"
print(length(folders_with_relevant_files))
[1] 375
no_relevant_files <- setdiff(folders, folders_with_relevant_files)

print("File extensions of folders without relevant file types:")
[1] "File extensions of folders without relevant file types:"
list.files(no_relevant_files, 
           recursive = TRUE, 
           full.names = TRUE) |>
  tools::file_ext()  |>
  unique() 
[1] "gif" "jpg" "tif"
# what sup files to convert?

sup_files <- list.files(sup_dir,
                        recursive = TRUE,
                        # pattern is a regex; match relevant extensions case-insensitively
                        pattern = "\\.(pdf|docx?|xlsx?)$",
                        ignore.case = TRUE
                        )


sup_files <- paste0("output/supplement/", sup_files)

writeLines(sup_files, 
           here::here("output/supplement/supplemental_files_to_convert.txt"))

./code/extract_text/convert_supplemental_materials.sh 

Archived / not run

Download supplements from Elsevier

elseiver_xmls <- list.files(here::here("output/fulltexts/elsevier/elsevier_xml/"), full.names = TRUE)


xml <- xml2::read_xml(here::here("output/fulltexts/elsevier/elsevier_xml/17223258.xml")) 

use_local <- function(x) {
  # local-name() never includes a namespace prefix, so drop any prefix
  # (e.g. "xocs:doc" -> "doc") before building the XPath
  x <- sub("^.*:", "", x)
  paste0(".//*[local-name()='", x, "']")
}
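One subtlety worth noting: XPath's local-name() returns an element's name without its namespace prefix, so a pattern built from a prefixed name like "xocs:doc" can never match unless the prefix is stripped first. A base-R illustration (the helper name here is ours):

```r
# Strip any namespace prefix before building the local-name() XPath,
# so "xocs:attachment-eid" and "attachment-eid" behave the same.
use_local_stripped <- function(x) {
  x <- sub("^.*:", "", x)  # "xocs:attachment-eid" -> "attachment-eid"
  paste0(".//*[local-name()='", x, "']")
}

use_local_stripped("xocs:attachment-eid")
#> [1] ".//*[local-name()='attachment-eid']"
```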

get_elsevier_supplement_links <- function(xml_file_path,
                                          api_key    = Sys.getenv("ELSEVIER_API_KEY"),
                                          out_dir = here::here("output/supplement/elsevier")) {
  
  
  # Build auth headers
  headers <- c("X-ELS-APIKey" = api_key)
  
  xml <- xml2::read_xml(xml_file_path)
  doc <- xml_find_first(xml, use_local("xocs:doc"))
  meta <- xml_find_first(doc, use_local("xocs:meta"))
  
  attachments <- xml_find_all(meta, 
                                use_local("xocs:attachments")
                                )
  supplement_eids <- xml_find_all(attachments, 
                                   use_local("xocs:attachment-eid")
                                   ) |> 
    xml_text()
  
  supplement_eids <- supplement_eids[!grepl("main", 
                                              supplement_eids, 
                                              ignore.case = TRUE)]
  

if(length(supplement_eids) > 0) {
  
  
   out_dir = paste0(out_dir, 
                    "/",
                    tools::file_path_sans_ext(basename(xml_file_path))
                    )
   
   # make directory for this article
   dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  
  for(eid in supplement_eids) {
    
    message("Supplement link: ", eid)
    
    out_path  <- paste0(out_dir, "/", basename(eid))
    
    api_url  = paste0("https://api.elsevier.com/content/object/eid/", eid)
    
    file_resp <- GET(api_url,
                     add_headers(.headers = headers),  # <-- auth headers
                write_disk(out_path,
                           overwrite = TRUE)
                )
    
    http_status  <- status_code(file_resp)
    file_size_kb <- file.size(out_path) / 1024
      
    is_html_error <- tryCatch({
        first_bytes <- readLines(out_path, n = 1, warn = FALSE)
        grepl("^<!DOCTYPE|^<html", first_bytes, ignore.case = TRUE)
      }, error = function(e) FALSE)
    
    status <- dplyr::case_when(
        http_status == 403               ~ "permission_denied",
        http_status == 401               ~ "unauthorized",
        http_status != 200               ~ paste0("http_error_", http_status),
        is_html_error                    ~ "permission_denied_html_response",
        file_size_kb < 5                 ~ "suspiciously_small",
        TRUE                             ~ "success"
      )
    
    if (status != "success") {
        warning(sprintf("  [%s] %s (%.1f KB, HTTP %d)",
                        status, basename(eid), file_size_kb, http_status))
        # Remove the bad file so it doesn't look like a successful download
        file.remove(out_path)
      } else {
        message(sprintf("  [OK] %s (%.1f KB)", basename(eid), file_size_kb))
      }
    
    
  }
  

} else {
  message("No supplement links found in ", xml_file_path)
}

}

purrr::map(elseiver_xmls, get_elsevier_supplement_links)

More


sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.17.8 dplyr_1.1.4       here_1.0.1        stringr_1.6.0    
[5] xml2_1.4.0        httr_1.4.7        workflowr_1.7.2  

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0      compiler_4.3.1      BiocManager_1.30.26
 [4] renv_1.1.8          promises_1.3.3      tidyselect_1.2.1   
 [7] Rcpp_1.1.0          git2r_0.36.2        callr_3.7.6        
[10] later_1.4.4         jquerylib_0.1.4     yaml_2.3.10        
[13] fastmap_1.2.0       R6_2.6.1            generics_0.1.4     
[16] knitr_1.50          tibble_3.3.0        rprojroot_2.1.0    
[19] bslib_0.9.0         pillar_1.11.1       rlang_1.1.6        
[22] cachem_1.1.0        stringi_1.8.7       httpuv_1.6.16      
[25] xfun_0.55           getPass_0.2-4       fs_1.6.6           
[28] sass_0.4.10         cli_3.6.5           withr_3.0.2        
[31] magrittr_2.0.4      ps_1.9.1            digest_0.6.37      
[34] processx_3.8.6      rstudioapi_0.17.1   lifecycle_1.0.4    
[37] vctrs_0.6.5         evaluate_1.0.5      glue_1.8.0         
[40] whisker_0.4.1       rmarkdown_2.30      tools_4.3.1        
[43] pkgconfig_2.0.3     htmltools_0.5.8.1