Last updated: 2026-03-24
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it's best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ce8519d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: Aus_School_Profile.xlsx
Ignored: BC2GM/
Ignored: BioC.dtd
Ignored: FormatConverter.jar
Ignored: FormatConverter.zip
Ignored: SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: code/.DS_Store
Ignored: code/full_text_conversion/.DS_Store
Ignored: data/.DS_Store
Ignored: data/RCDCFundingSummary_01042026.xlsx
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/epmc/
Ignored: data/europe_pmc/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/gbd_2019_california_percent_deaths.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: doccano_venv/
Ignored: figures/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/pubmedbert_entity_predictions.csv
Ignored: output/pubmedbert_entity_predictions.jsonl
Ignored: output/pubmedbert_predictions.csv
Ignored: output/pubmedbert_predictions.jsonl
Ignored: output/supplement/
Ignored: output/text_mining_predictions/
Ignored: output/trait_ontology/
Ignored: population_description_terms.txt
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: renv/
Ignored: spacy_venv_requirements.txt
Ignored: spacyr_venv/
Untracked files:
Untracked: code/full_text_conversion/html_to_xml.R
Untracked: code/test_cohort_desc_file.R
Untracked: code/text_mining_models/tokenise_data.py
Untracked: output/fulltexts/
Untracked: schools.R
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/get_dbgap_ids.Rmd
Modified: analysis/get_full_text.Rmd
Modified: analysis/gwas_to_gbd.Rmd
Modified: analysis/map_trait_to_icd10.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/specific_aims_stats.Rmd
Modified: analysis/text_for_cohort_labels.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_supplement.Rmd) and HTML (docs/get_supplement.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 4afd39d | IJbeasley | 2026-03-24 | Add rmarkdown page for getting supplement |
Ideas/help for downloading supplemental files:

- https://pmc.ncbi.nlm.nih.gov/articles/PMC12371329/

What file types are likely to be relevant?

- pdf
- Excel spreadsheets (xls, xlsx)
- Data files (csv, txt)
- Word documents (docx, doc)
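These types can be encoded as a small helper for later filtering; a minimal sketch (the helper name and extension set are assumptions, not part of the pipeline):

```r
# Extensions considered relevant for text mining, per the list above
relevant_exts <- c("pdf", "xls", "xlsx", "csv", "txt", "docx", "doc")

# Hypothetical helper: TRUE if a file's extension is in the relevant set
# (case-insensitive, since the archives mix e.g. "xlsx" and "XLSX")
is_relevant_supplement <- function(path) {
  tolower(tools::file_ext(path)) %in% relevant_exts
}

is_relevant_supplement(c("tables_S1.XLSX", "figure_S2.tif"))
#> [1]  TRUE FALSE
```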
library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)
## Step 1:
# get only relevant disease studies
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))
gwas_study_info <- gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))
# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
dplyr::filter(!cause %in% c("HIV/AIDS",
"Tuberculosis",
"Malaria",
"Lower respiratory infections",
"Diarrhoeal diseases",
"Neonatal disorders",
"Tetanus",
"Diphtheria",
"Pertussis",
"Measles",
"Maternal disorders"))
print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
all_pmids <- unique(gwas_study_info$PUBMED_ID)
length(all_pmids)
[1] 821
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))
print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
PMID pmcids DOI
<int> <char> <char>
1: 17223258 https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876 https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697 https://doi.org/10.1038/ng2043
5: 17463246 https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617 https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 821 3
length(all_pmids)
[1] 821
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
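For reference, the mapping file read here is produced upstream; one standard way to build such a PMID-to-PMCID mapping is NCBI's PMC ID Converter API. A minimal sketch of the request URL (the helper name is an assumption; the response would be fetched with httr::GET() and parsed as JSON):

```r
# Build a request URL for the NCBI PMC ID Converter service,
# which maps PMIDs to PMCIDs and DOIs
build_idconv_url <- function(pmids) {
  paste0(
    "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/",
    "?ids=", paste(pmids, collapse = ","),
    "&format=json"
  )
}

build_idconv_url(c(17434096, 17463248))
```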
not_converted_pmids <-
converted_ids |>
filter(pmcids == "") |>
pull(PMID)
print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
pmcids <-
converted_ids$pmcids |>
unique()
pmcids <- pmcids[pmcids != ""]
print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 648
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(all_pmids), digits = 2)
[1] 78.93
library(httr)
library(here)
library(tools)
download_pmc_supplements <- function(pmcid,
out_dir = here::here("output/supplement")) {
# Create PMCID-specific subdirectory
pmcid_dir <- file.path(out_dir, pmcid)
# Check if directory already exists and has files
if (dir.exists(pmcid_dir) && length(list.files(pmcid_dir)) > 0) {
message("Supplementary files for ", pmcid, " already exist. Skipping.")
return(invisible(list(pmcid = pmcid, status = "skipped", files = list.files(pmcid_dir))))
}
# Create directory if it doesn't exist
dir.create(pmcid_dir, recursive = TRUE, showWarnings = FALSE)
# Build URL
sup_url <- paste0(
"https://www.ebi.ac.uk/europepmc/webservices/rest/",
pmcid,
"/supplementaryFiles"
)
message("Fetching supplementary files for ", pmcid, "...")
# Save zip to a temp file
zip_path <- file.path(pmcid_dir,
paste0(pmcid, "_supplements.zip"))
resp <- GET(sup_url,
write_disk(zip_path, overwrite = TRUE))
if (status_code(resp) != 200) {
warning("Failed to retrieve supplements for ", pmcid,
". HTTP status: ", status_code(resp))
unlink(pmcid_dir, recursive = TRUE)
return(invisible(list(pmcid = pmcid, status = "failed", files = NULL)))
}
# Unzip into the PMCID folder
unzip_result <- tryCatch({
unzip(zip_path, exdir = pmcid_dir)
}, error = function(e) {
warning("Failed to unzip supplements for ", pmcid, ": ", e$message)
NULL
})
# Remove the zip file after extraction
file.remove(zip_path)
if (is.null(unzip_result)) {
return(invisible(list(pmcid = pmcid, status = "unzip_failed", files = NULL)))
}
extracted_files <- list.files(pmcid_dir, recursive = TRUE, full.names = TRUE)
message("Extracted ", length(extracted_files), " file(s) to ", pmcid_dir)
return(invisible(list(
pmcid = pmcid,
status = "success",
dir = pmcid_dir,
files = extracted_files
)))
}
# Batch across multiple PMCIDs
results <- lapply(pmcids, function(id) {
tryCatch(
download_pmc_supplements(id),
error = function(e) {
message("Error processing ", id, ": ", e$message)
list(pmcid = id, status = "error", files = NULL)
}
)
})
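Since each element of `results` carries a `status` field, the batch can be summarised afterwards; a sketch using a mock list (in the real run, `results` comes from the lapply above):

```r
# Mock of the list returned by the batch lapply above
results <- list(
  list(pmcid = "PMC1", status = "success"),
  list(pmcid = "PMC2", status = "failed"),
  list(pmcid = "PMC3", status = "skipped"),
  list(pmcid = "PMC4", status = "success")
)

# Tally download outcomes by status
status_counts <- table(vapply(results, function(r) r$status, character(1)))
status_counts
```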
# unzip any nested zip archives in place, then drop empty folders
find output/supplement -name "*.zip" | while read f; do
  unzip -o "$f" -d "$(dirname "$f")"
done
find output/supplement -type d -empty -delete
# check how many pmcids I have downloaded supplementary materials for
sup_dir <- here::here("output/supplement")
folders <- list.dirs(sup_dir,
full.names = TRUE,
recursive = FALSE)
folders_with_files <- folders[
sapply(folders, function(d) length(list.files(d, recursive = TRUE)) > 0)
]
not_retrieved_pmcids <- setdiff(pmcids, basename(folders_with_files))
print("PMCIDs for which I could not retrieve supplements from European PMC:")
[1] "PMCIDs for which I could not retrieve supplements from European PMC:"
print(length(not_retrieved_pmcids))
[1] 226
writeLines(
not_retrieved_pmcids,
here::here("output/supplement/selected_pmcids.txt")
)
bash code/extract_text/download_pmc_supplements_aws.sh --file output/supplement/selected_pmcids.txt
find output/supplement -type d -empty -delete
# check how many pmcids I have downloaded supplementary materials for
sup_dir <- here::here("output/supplement")
folders <- list.dirs(sup_dir,
full.names = TRUE,
recursive = FALSE)
folders_with_files <- folders[
sapply(folders, function(d) length(list.files(d, recursive = TRUE)) > 0)
]
message(length(folders_with_files),
" / ",
length(folders),
" folders contain at least one file")
423 / 423 folders contain at least one file
# check extensions of files downloaded
all_files <- unlist(lapply(folders_with_files,
list.files,
recursive = TRUE,
full.names = TRUE))
file_extensions <- tools::file_ext(all_files) |>
table() |>
as.data.frame() |>
dplyr::arrange(desc(Freq))
print("File extensions of downloaded supplementary files:")
[1] "File extensions of downloaded supplementary files:"
print(file_extensions)
Var1 Freq
1 gif 2075
2 jpg 1950
3 xml 588
4 pdf 586
5 xlsx 584
6 docx 263
7 tif 167
8 doc 131
9 zip 38
10 tiff 22
11 pptx 20
12 xls 20
13 XLSX 18
14 png 15
15 ai 12
16 html 11
17 txt 9
18 py 8
19 PNG 7
20 sh 6
21 DOCX 5
22 csv 4
23 TIF 4
24 eps 2
25 jpeg 1
26 mp4 1
27 pl 1
28 ppt 1
29 R 1
30 tifff 1
folders_with_relevant_files <- folders[
sapply(folders, function(d) length(list.files(d,
recursive = TRUE,
pattern = "*.pdf|*.docx|*.doc|*.xls|*.XLSX")) > 0)
]
message(length(folders_with_relevant_files),
" / ", length(folders),
" folders contain at least one relevant file type (pdf, docx, doc, xls, xlsx)")
371 / 423 folders contain at least one relevant file type (pdf, docx, doc, xls, xlsx)
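One caveat with the filter above: the `pattern` argument of `list.files()` is a regular expression, not a glob, so `"*.pdf|*.docx|..."` is not anchored to the end of the file name. A stricter, case-insensitive alternative (written here as an assumption about the intended match; `list.files()` also accepts `ignore.case = TRUE`):

```r
# Anchored, case-insensitive extension match for the relevant file types
relevant_pattern <- "\\.(pdf|docx?|xlsx?)$"

files <- c("tables_S1.xlsx", "supplement.PDF", "figure.gif", "notes.docx")
files[grepl(relevant_pattern, files, ignore.case = TRUE)]
#> [1] "tables_S1.xlsx" "supplement.PDF" "notes.docx"
```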
folders <- list.dirs(sup_dir,
full.names = TRUE,
recursive = FALSE)
folders_with_relevant_files <- folders[
sapply(folders, function(d) length(list.files(d,
recursive = TRUE,
pattern = "*.pdf|*.docx|*.doc|*.DOCX|*.txt|*.xls|*.XLSX")) > 0)
]
print("Number of folders with at least one relevant file type (pdf, docx, doc, txt, xls, xlsx):")
[1] "Number of folders with at least one relevant file type (pdf, docx, doc, txt, xls, xlsx):"
print(length(folders_with_relevant_files))
[1] 375
no_relevant_files <- folders[!(folders %in% folders_with_relevant_files)]
print("File extensions of folders without relevant file types:")
[1] "File extensions of folders without relevant file types:"
list.files(no_relevant_files,
recursive = TRUE,
full.names = TRUE) |>
tools::file_ext() |>
unique()
[1] "gif" "jpg" "tif"
# what sup files to convert?
sup_files <- list.files(sup_dir,
recursive = TRUE,
pattern = "*.pdf|*.docx|*.doc|*.xls|*.XLSX"
)
sup_files <- paste0("output/supplement/", sup_files)
writeLines(sup_files,
here::here("output/supplement/supplemental_files_to_convert.txt"))
./code/extract_text/convert_supplemental_materials.sh
get_fair_smart_links <- function(pmid) {
  url <- paste0("https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/supplmat_query_request.cgi/bioc_xml/",
                pmid,
                "/CPC")
response <- GET(url)
if (status_code(response) == 200) {
collected_response <- content(response, as = "text", encoding = "UTF-8")
sup_link <- xml2::xml_text(xml2::read_xml(collected_response))
if(grepl("No result can be found", sup_link)) {
warning("No relevant supplements found for PMID: ", pmid)
return(NULL)
}
supplement_links <- stringr::str_extract_all(sup_link, "https?://[^\\s]+")
supplement_links <- unlist(supplement_links)
# print("Supplement links found:")
# print(supplement_links)
return(supplement_links)
} else {
warning("Failed to retrieve supplements for PMID: ", pmid,
". HTTP status: ", status_code(response))
return(NULL)
}
}
sup_links <- purrr::map(all_pmids,
get_fair_smart_links)
names(sup_links) <- all_pmids
download_relevant_sup <- function(sup_link,
out_dir,
pmid) {
if(is.null(sup_link)) {
return(NULL)
}
out_path <- here::here(paste0(out_dir,"/",
pmid,
"_supplement_", basename(sup_link))
)
resp <- GET(sup_link,
write_disk(out_path, overwrite = TRUE))
if (status_code(resp) != 200) {
warning("Failed to download supplement from ", sup_link,
". HTTP status: ", status_code(resp))
return(NULL)
}
message("Downloaded supplement from ", sup_link, " to ", out_path)
return(out_path)
}
purrr::iwalk(sup_links,
~download_relevant_sup(.x,
out_dir = here::here("output/supplement/ncbi"),
pmid = .y)
)
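The FAIR/SMART links can point at any file type, so a pre-filter on extension would avoid downloading figure images; a minimal sketch (the helper name and extension set are assumptions):

```r
# Hypothetical pre-filter: keep only links whose file extension is relevant
filter_relevant_links <- function(links,
                                  exts = c("pdf", "doc", "docx",
                                           "xls", "xlsx", "csv", "txt")) {
  if (is.null(links)) return(NULL)
  links[tolower(tools::file_ext(links)) %in% exts]
}

filter_relevant_links(c("https://example.org/a.pdf",
                        "https://example.org/b.gif"))
#> [1] "https://example.org/a.pdf"
```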
elseiver_xmls <- list.files(here::here("output/fulltexts/elsevier/elsevier_xml/"), full.names = TRUE)
# example: inspect the structure of one Elsevier XML file
xml <- xml2::read_xml(here::here("output/fulltexts/elsevier/elsevier_xml/17223258.xml"))
use_local <- function(x) {
  # local-name() ignores namespace prefixes, so strip any "ns:" prefix
  # from the query (e.g. "xocs:doc" matches on local name "doc")
  x <- sub("^[^:]*:", "", x)
  paste0(".//*[local-name()='", x, "']")
}
get_elsevier_supplement_links <- function(xml_file_path,
api_key = Sys.getenv("ELSEVIER_API_KEY"),
out_dir = here::here("output/supplement/elsevier")) {
# Build auth headers
headers <- c("X-ELS-APIKey" = api_key)
xml <- xml2::read_xml(xml_file_path)
doc <- xml_find_first(xml, use_local("xocs:doc"))
meta <- xml_find_first(doc, use_local("xocs:meta"))
attachments <- xml_find_all(meta,
use_local("xocs:attachments")
)
supplement_eids <- xml_find_all(attachments,
use_local("xocs:attachment-eid")
) |>
xml_text()
supplement_eids <- supplement_eids[!grepl("main",
supplement_eids,
ignore.case = TRUE)]
# track download results for all files
results <- list()
if(length(supplement_eids) > 0) {
out_dir <- paste0(out_dir,
                  "/",
                  tools::file_path_sans_ext(basename(xml_file_path)))
# make directory for this article
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
for(eid in supplement_eids) {
message("Supplement link: ", eid)
out_path <- paste0(out_dir, "/", basename(eid))
api_url <- paste0("https://api.elsevier.com/content/object/eid/", eid)
file_resp <- GET(api_url,
add_headers(.headers = headers), # <-- auth headers
write_disk(out_path,
overwrite = TRUE)
)
http_status <- status_code(file_resp)
file_size_kb <- file.size(out_path) / 1024
is_html_error <- tryCatch({
first_bytes <- readLines(out_path, n = 1, warn = FALSE)
grepl("^<!DOCTYPE|^<html", first_bytes, ignore.case = TRUE)
}, error = function(e) FALSE)
status <- dplyr::case_when(
http_status == 403 ~ "permission_denied",
http_status == 401 ~ "unauthorized",
http_status != 200 ~ paste0("http_error_", http_status),
is_html_error ~ "permission_denied_html_response",
file_size_kb < 5 ~ "suspiciously_small",
TRUE ~ "success"
)
if (status != "success") {
warning(sprintf(" [%s] %s (%.1f KB, HTTP %d)",
status, basename(eid), file_size_kb, http_status))
# Remove the bad file so it doesn't look like a successful download
file.remove(out_path)
} else {
message(sprintf(" [OK] %s (%.1f KB)", basename(eid), file_size_kb))
}
      results[[basename(eid)]] <- list(
        eid = eid,
        out_path = out_path,
        http_status = http_status,
        file_size_kb = round(file_size_kb, 2),
        status = status
      )
    }
  } else {
    message("No supplement links found in ", xml_file_path)
  }
  invisible(results)
}
purrr::map(elseiver_xmls, get_elsevier_supplement_links)
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.17.8 dplyr_1.1.4 here_1.0.1 stringr_1.6.0
[5] xml2_1.4.0 httr_1.4.7 workflowr_1.7.2
loaded via a namespace (and not attached):
[1] jsonlite_2.0.0 compiler_4.3.1 BiocManager_1.30.26
[4] renv_1.1.8 promises_1.3.3 tidyselect_1.2.1
[7] Rcpp_1.1.0 git2r_0.36.2 callr_3.7.6
[10] later_1.4.4 jquerylib_0.1.4 yaml_2.3.10
[13] fastmap_1.2.0 R6_2.6.1 generics_0.1.4
[16] knitr_1.50 tibble_3.3.0 rprojroot_2.1.0
[19] bslib_0.9.0 pillar_1.11.1 rlang_1.1.6
[22] cachem_1.1.0 stringi_1.8.7 httpuv_1.6.16
[25] xfun_0.55 getPass_0.2-4 fs_1.6.6
[28] sass_0.4.10 cli_3.6.5 withr_3.0.2
[31] magrittr_2.0.4 ps_1.9.1 digest_0.6.37
[34] processx_3.8.6 rstudioapi_0.17.1 lifecycle_1.0.4
[37] vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0
[40] whisker_0.4.1 rmarkdown_2.30 tools_4.3.1
[43] pkgconfig_2.0.3 htmltools_0.5.8.1