Last updated: 2026-02-04
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
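As a minimal illustration (not part of the analysis), re-running with the same seed reproduces the same random draw:

```r
set.seed(20220216)
x1 <- sample(100, 5)
set.seed(20220216)
x2 <- sample(100, 5)
identical(x1, x2)  # TRUE: the same seed gives the same draw
```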
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
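workflowr records this automatically; in plain R the same information can be captured with, for example:

```r
# Report the R version, operating system, and attached package versions
sessionInfo()
```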
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version e7de25d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: Aus_School_Profile.xlsx
Ignored: SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: code/.DS_Store
Ignored: code/full_text_conversion/.DS_Store
Ignored: data/.DS_Store
Ignored: data/RCDCFundingSummary_01042026.xlsx
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/epmc/
Ignored: data/europe_pmc/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/gbd_2019_california_percent_deaths.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: figures/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/trait_ontology/
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: renv/
Ignored: spacyr_venv/
Ignored: test_37689528.xml
Untracked files:
Untracked: code/full_text_conversion/elsevier_to_jats_v2.R
Untracked: code/full_text_conversion/elsevier_to_jats_v3.R
Untracked: code/full_text_conversion/elsevier_to_jats_v4.R
Untracked: code/full_text_conversion/elsevier_to_jats_v5.R
Untracked: code/full_text_conversion/fix_elsevier_xml.py
Untracked: code/full_text_conversion/testing_fix_elsevier.R
Untracked: debug_elsevier.R
Untracked: schools.R
Untracked: testing.R
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/get_dbgap_ids.Rmd
Modified: analysis/index.Rmd
Modified: analysis/map_trait_to_icd10.Rmd
Modified: analysis/missing_cohort_info.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/specific_aims_stats.Rmd
Modified: analysis/text_for_cohort_labels.Rmd
Modified: code/full_text_conversion/elsevier_to_jats.R
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/get_full_text.Rmd) and
HTML (docs/get_full_text.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | e7de25d | IJbeasley | 2026-02-04 | Adding totals + to manually review |
| html | 5c9d397 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | 456acb1 | IJbeasley | 2026-02-04 | Fixing percentage downloaded stats |
| html | 55f6763 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | c0dc676 | IJbeasley | 2026-02-04 | Getting full text from publisher APIs |
| html | 1898c02 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | d214580 | IJbeasley | 2026-02-04 | Getting full text from publisher APIs |
| html | 6ba1e1f | IJbeasley | 2026-01-12 | Build site. |
| Rmd | b43e9a9 | IJbeasley | 2026-01-12 | Update getting full text |
| html | ac0d1a7 | IJbeasley | 2025-10-27 | Build site. |
| html | 8642872 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | da4d730 | IJbeasley | 2025-10-27 | Now run on all texts |
| html | fb5cfd9 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 8ed4c37 | IJbeasley | 2025-10-27 | Now run on all texts |
| html | 8610283 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 7d504e3 | IJbeasley | 2025-10-27 | More fixing of download full text |
| html | 16f4c19 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 3df4096 | IJbeasley | 2025-10-27 | Update + improve full text downloading - test run |
| html | 1439951 | IJbeasley | 2025-10-24 | Build site. |
| Rmd | 481aebe | IJbeasley | 2025-10-24 | Update code for getting full texts |
library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)
## Step 1:
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))
gwas_study_info = gwas_study_info |>
dplyr::rename_with(~ gsub(" ", "_", .x))
# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
dplyr::filter(!cause %in% c("HIV/AIDS",
"Tuberculosis",
"Malaria",
"Lower respiratory infections",
"Diarrhoeal diseases",
"Neonatal disorders",
"Tetanus",
"Diphtheria",
"Pertussis" ,
"Measles",
"Maternal disorders"))
# gwas_study_info <- gwas_study_info |>
# dplyr::filter(DISEASE_STUDY == TRUE)
print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
all_pmids <- unique(gwas_study_info$PUBMED_ID)
length(all_pmids)
[1] 821
# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
convert_pmid_df <- convert_pmid_df |>
dplyr::rename(pmcids = PMCID
) |>
dplyr::mutate(pmcids = ifelse(is.na(pmcids),
"",
pmcids
)
)
convert_pmid_df <-
convert_pmid_df |>
dplyr::filter(!is.na(PMID))
converted_ids =
convert_pmid_df |>
filter(PMID %in% all_pmids)
data.table::fwrite(converted_ids,
here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")
)
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))
print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
PMID pmcids DOI
<int> <char> <char>
1: 17223258 https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876 https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697 https://doi.org/10.1038/ng2043
5: 17463246 https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617 https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 821 3
length(all_pmids)
[1] 821
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <-
converted_ids |>
filter(pmcids == "") |>
pull(PMID)
print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
pmcids <-
converted_ids$pmcids |>
unique()
pmcids <- pmcids[pmcids != ""]
print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 648
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(all_pmids), digits = 2)
[1] 78.93
Downloading full-text XMLs from the Europe PMC RESTful API requires PMCIDs; thus, this step can only be applied to papers with a PMCID.
# Function to download full text xml from Europe PMC Restful API
download_pmc_text <- function(pmcid,
out_dir = here::here("output/fulltexts/europe_pmc/")
) {
# check if file already exists
if(file.exists(paste0(out_dir, pmcid, ".xml"))){
return(TRUE)
}
url_xml <- paste0("https://www.ebi.ac.uk/",
"europepmc/webservices/rest/",
pmcid,
"/fullTextXML"
)
resp <- GET(url_xml)
# ---- Fallback URL ----
if(status_code(resp) != 200){
url_xml <- paste0("https://europepmc.org/",
"oai.cgi?verb=GetRecord",
"&metadataPrefix=pmc",
"&identifier=oai:europepmc.org:",
pmcid)
resp <- GET(url_xml)
}
# ---- Fail if still bad ----
if(status_code(resp) != 200){
return(NULL)
}
# ---- Parse XML ----
xml_content <- read_xml(
content(resp,
as = "text",
encoding = "UTF-8")
)
article_node = xml_find_first(xml_content,
"//*[local-name() = 'article']"
)
if (is.na(article_node)) {
message("No <article> node found for ", pmcid)
return(NULL)
}
# --- Save ---
write_xml(article_node,
paste0(out_dir, pmcid, ".xml")
)
}
for(article in pmcids[pmcids != ""]){
download_pmc_text(article)
}
euro_pmcids <-list.files(here::here("output/fulltexts/europe_pmc/"),
pattern = "\\.xml$")
euro_pmcids <- gsub("\\.xml$",
"",
euro_pmcids
)
euro_pmcids <- pmcids[pmcids %in% euro_pmcids]
n_euro_pmc <- length(euro_pmcids)
print("Number of downloaded full text files from Europe PMC:")
[1] "Number of downloaded full text files from Europe PMC:"
print(n_euro_pmc)
[1] 428
print("Percentage of pmids with full text from Europe PMC:")
[1] "Percentage of pmids with full text from Europe PMC:"
round(100 * n_euro_pmc / length(all_pmids), digits = 2)
[1] 52.13
For the remaining PMCIDs / PMIDs without full text, try downloading via the NCBI Cloud Service.
# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text - non-commercial use
# oa_noncomm
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text, other licenses
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
europeanpmc_full_texts <-
list.files(here::here("output/fulltexts/europe_pmc"),
pattern = "\\.xml"
)
# get pmcids of these files
europeanpmc_full_texts <-
gsub("\\.xml$",
"",
europeanpmc_full_texts
)
left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]
print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 220
print("+ Number of pmids without pmcid mapping:")
[1] "+ Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))
oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))
oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))
author_manu_to_get <-
author_manu |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
print("Number of papers to download from the Author Manuscripts section:")
[1] "Number of papers to download from the Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 135
oa_noncomm_to_get =
oa_noncomm |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
# remove any overlaps between sections
oa_noncomm_to_get <-
oa_noncomm_to_get |>
dplyr::filter(!c(PMID %in% author_manu_to_get$PMID))
print("Number of additional papers to download from the Non-commercial Open Access PMC section:")
[1] "Number of additional papers to download from the Non-commercial Open Access PMC section:"
nrow(oa_noncomm_to_get)
[1] 0
oa_comm_to_get =
oa_comm |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
oa_comm_to_get <-
oa_comm_to_get |>
dplyr::filter(!c(PMID %in% author_manu_to_get$PMID)) |>
dplyr::filter(!c(PMID %in% oa_noncomm_to_get$PMID))
# remove any overlaps between sections
print("Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:")
[1] "Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:"
nrow(oa_comm_to_get)
[1] 4
file_paths =
c(oa_noncomm_to_get$Key,
oa_comm_to_get$Key,
author_manu_to_get$Key)
# swap the txt/ directory component and .txt extension for their xml counterparts
file_paths <- str_replace_all(file_paths,
pattern = "txt",
replacement = "xml")
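This substitution swaps both the txt/ directory component and the .txt extension for their xml counterparts, since the bucket mirrors the same layout for XML files. For a hypothetical file-list key:

```r
library(stringr)

# Hypothetical key; real keys come from the AWS file lists above
key <- "oa_comm/txt/all/PMC3214617.txt"
str_replace_all(key, pattern = "txt", replacement = "xml")
# [1] "oa_comm/xml/all/PMC3214617.xml"
```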
writeLines(
file_paths,
here::here("output/fulltexts/aws_locations/selected_paths.txt")
)
system(
paste(
"xargs -I {} aws s3 cp",
"s3://pmc-oa-opendata/{}",
shQuote(here::here("output/fulltexts/ncbi_cloud/")),
"--no-sign-request",
"<",
shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
)
)
# not_available = left_over_pmcids[!c(left_over_pmcids %in%
# c(oa_noncomm_to_get$AccessionID,
# oa_comm_to_get$AccessionID,
# author_manu_to_get$AccessionID)
# )]
# get list of pmcids already retrieved
ncbi_pmcids_retrieved <-
list.files(c(#here::here("output/fulltexts/europe_pmc"),
here::here("output/fulltexts/ncbi_cloud/")
),
pattern = "\\.xml$"
)
ncbi_pmcids_retrieved <-
gsub("\\.xml$",
"",
ncbi_pmcids_retrieved
)
pmids_retrieved <-
converted_ids |>
filter(pmcids %in% c(ncbi_pmcids_retrieved, euro_pmcids)) |>
pull(PMID)
not_available <- all_pmids[!c(all_pmids %in% pmids_retrieved)]
print("Percentage of pmids with full text from NCBI Cloud Service:")
[1] "Percentage of pmids with full text from NCBI Cloud Service:"
100 * (length(all_pmids) - n_euro_pmc - length(not_available)) / length(all_pmids)
[1] 16.56516
print("Percentage of pmids without full text from either Europe PMC or NCBI Cloud Service:")
[1] "Percentage of pmids without full text from either Europe PMC or NCBI Cloud Service:"
100 * length(not_available) / length(all_pmids)
[1] 31.30329
doi_information <-
converted_ids |>
filter(PMID %in% not_available)
library(rcrossref)
library(httr)
# Get download links from Crossref
get_crossref_links <- function(doi) {
# Query Crossref for the article
works <- cr_works(dois = doi)
# keep links for xml or text-mining
links <- works$data$link[[1]]
if(is.null(links)){
link_data <- data.frame(doi = doi,
URL = NA,
content.type = NA,
content.version = NA,
intended.application = NA)
return(link_data)
}
links <-
links |>
filter(intended.application == "text-mining" | content.type == "application/xml"
)
if(nrow(links) == 0){
link_data <- data.frame(doi = doi,
URL = NA,
content.type = NA,
content.version = NA,
intended.application = NA)
} else{
link_data <-
data.frame(doi = doi,
links)
}
return(link_data)
}
# elsevier dois:
elsevier_doi_patterns <- "10.1016|10.1053|10.1086|10.1194|10.1593|10.1097/jto."
elsevier_dois <- grep(elsevier_doi_patterns,
doi_information$DOI,
value = TRUE
)
print("Number of papers potentially available from Elsevier:")
[1] "Number of papers potentially available from Elsevier:"
length(elsevier_dois)
[1] 41
elsevier_api_key <- Sys.getenv("ELSEVIER_API_KEY")
elsevier_doi_info <- str_remove_all(pattern = "https://doi.org/",
string = elsevier_dois)
# get pmids for elsevier dois
pmids_elsevier <- doi_information |>
filter(DOI %in% elsevier_dois) |>
mutate(DOI = str_remove_all(DOI,
pattern = "https://doi.org/"
)
) |>
rename_with(~tolower(.x))
# get elsevier full text links from crossref
elsevier_link_df <- purrr::map(elsevier_doi_info,
~get_crossref_links(.x)
) |>
bind_rows()
print("Number of Elsevier links retrieved from Crossref:")
nrow(elsevier_link_df)
print("Number of xml links among the retrieved Elsevier links:")
elsevier_link_df |>
filter(content.type == "text/xml") |>
nrow()
elsevier_links <- elsevier_link_df |>
filter(!is.na(URL))
elsevier_links <- elsevier_links |>
left_join(pmids_elsevier,
by = c("doi")
)
# get only xml links
elsevier_links <-
elsevier_links |>
filter(content.type == "text/xml")
download_elsevier_text <- function(url,
api_key,
pmid,
out_dir = here::here("output/fulltexts/elsevier/elsevier_xml/")) {
# if(file.exists(paste0(out_dir, pmid, ".xml"))|file.exists(paste0(out_dir, pmid, ".txt"))
# ){
# return(TRUE)
# }
response <- GET(url,
add_headers("X-ELS-APIKey" = api_key)
)
# if (status_code(response) != 200) {
# message("Failed to fetch text for ", pmid)
# return(FALSE)
#
# }
ct <- headers(response)[["content-type"]]
#print(ct)
if(grepl("text/plain", ct)){
message("Received plain text for ", pmid,
" - skipping for now."
)
return(TRUE)
# text_content <- content(response, type = "text/plain")
#
# writeLines(text_content,
# paste0(out_dir, pmid, ".txt"),
# useBytes = TRUE)
} else {
xml_content <- content(response,
encoding = "UTF-8",
type = "text/xml")
article_node <- xml2::xml_find_first(
xml_content,
".//*[local-name()='originalText']"
)
xml2::write_xml(article_node,
file = paste0(out_dir, pmid, ".xml")
)
}
# writeLines(text_content,
# paste0(out_dir, pmid, ".txt"),
# useBytes = TRUE)
}
purrr::walk2(elsevier_links$URL,
elsevier_links$pmid,
~download_elsevier_text(url = .x,
api_key = elsevier_api_key,
pmid = .y)
)
Convert the Elsevier XMLs to JATS XML files:
mkdir -p output/fulltexts/elsevier/xml
for file in output/fulltexts/elsevier/elsevier_xml/*.xml; do
filename=$(basename "$file")
Rscript code/full_text_conversion/elsevier_to_jats_v4.R "$file" "output/fulltexts/elsevier/xml/${filename%.xml}.xml"
done
print("Number of downloaded full text files (xml) from Elsevier:")
[1] "Number of downloaded full text files (xml) from Elsevier:"
list.files(here::here("output/fulltexts/elsevier/elsevier_xml/"),
pattern = "\\.xml$"
) |>
length()
[1] 41
Policies:
sage_doi_patterns <- "10.1177|10.1089"
sage_links <-
grep(sage_doi_patterns,
doi_information$DOI,
value = TRUE)
sage_links <- str_remove_all(pattern = "https://doi.org/",
string = sage_links)
sage_link_df <- purrr::map(sage_links,
~get_crossref_links(.x)) |>
bind_rows()
# then had to download manually using the provided xml links
# with institutional login details
# http://www.liebertpub.com/doi/full-xml/10.1089/omi.2017.0019
# https://journals.sagepub.com/doi/full-xml/10.1177/00220345211051967
# https://journals.sagepub.com/doi/full-xml/10.1177/0271678X211066299
# these are in JATS .xml format
# saved to output/fulltexts/sage
length(sage_links)
print("Number of downloaded full text files (xml) from Sage:")
[1] "Number of downloaded full text files (xml) from Sage:"
length(list.files(here::here("output/fulltexts/sage"),
pattern = "\\.xml$"
)
)
[1] 3
springer_nature_links <-
grep("nature|10.1038/ng|10.1007/s0|10.1007|10.1038/ejhg|10.1038/tpj|10.1038/jhg|10\\.1038/|10\\.1007/",
doi_information$DOI,
value = TRUE)
springer_nature_links <- str_remove_all(pattern = "https://doi.org/",
string = springer_nature_links)
pmids <- doi_information %>%
filter(DOI %in% paste0("https://doi.org/",
springer_nature_links)) %>%
pull(PMID)
check_springer_oa <- function(doi,
api_key,
pmids,
out_dir = here::here("output/fulltexts/springer_nature/")) {
if(file.exists(paste0(out_dir, pmids, ".xml"))){
return(data.frame(doi = doi,
openaccess = TRUE)
)
}
url <- paste0("https://api.springernature.com/openaccess/jats?",
"api_key=", api_key, # use the function's api_key argument, not a global
"&q=", doi
)
response <- GET(url)
# if the request fails, return data.frame with doi and oa = F
if (status_code(response) != 200) {
return(data.frame(doi = doi,
openaccess = FALSE)
)
} else {
xml_content <- content(response)
article_node <- xml2::xml_find_all(xml_content, ".//records")
if (xml2::xml_text(article_node) == "") {
return(data.frame(doi = doi,
openaccess = FALSE)
)
}
}
xml2::write_xml(article_node,
paste0(out_dir, pmids, ".xml")
)
return(data.frame(doi = doi,
openaccess = TRUE)
)
}
oa_status <-
purrr::map2(springer_nature_links,
pmids,
~check_springer_oa(doi = .x,
api_key = oa_api_key,
pmids = .y)
)
oa_status_df <- oa_status |> bind_rows()
oa_status_df |> group_by(openaccess) |>
summarise(n = n())
oa_status_df |>
filter(openaccess == FALSE)
print("Number of downloaded full text files (xml) from Springer Nature:")
[1] "Number of downloaded full text files (xml) from Springer Nature:"
length(list.files(here::here("output/fulltexts/springer_nature"),
pattern = "\\.xml$"
)
)
[1] 67
print("Number of downloaded html files from Springer Nature:")
[1] "Number of downloaded html files from Springer Nature:"
length(list.files(here::here("output/fulltexts/springer_nature"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 6
Wiley Text & Data-mining Policy: https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining
These xmls are in Wiley’s proprietary XML format, not JATS.
wiley_dois <- grep("10\\.1002/|10\\.1111/",
doi_information$DOI,
value = TRUE)
wiley_dois <- str_remove_all(wiley_dois, "https://doi.org/")
print("Number of papers potentially available from Wiley:")
[1] "Number of papers potentially available from Wiley:"
length(wiley_dois)
[1] 24
pmids_wiley_dois <- doi_information %>%
filter(DOI %in% paste0("https://doi.org/",
wiley_dois)
) %>%
pull(PMID)
download_wiley_pdf<- function(doi,
api_key,
pmids,
output_dir = here::here("output/fulltexts/wiley/pdf/")){
# check files doesn't already exist
if(file.exists(paste0(output_dir, pmids, ".pdf"))){
return(NULL)
}
curl_command <- paste0('curl -L -H "Wiley-TDM-Client-Token:',
api_key, # use the function's api_key argument, not a global
'" https://api.wiley.com/onlinelibrary/tdm/v1/articles/',
doi,
' -o ', output_dir, pmids, '.pdf'
)
print(curl_command)
system(curl_command)
}
purrr::walk2(wiley_dois,
pmids_wiley_dois,
~ download_wiley_pdf(.x, wiley_api, .y)
)
# remove zero-byte files - these appear to be failed downloads (papers that are not open access)
system("find output/fulltexts/wiley/pdf -type f -size 0 -delete")
# xmls downloaded manually using https://onlinelibrary.wiley.com/doi/full-xml/[DOI]
# downloaded to fulltexts/wiley/wiley_xml
As these XML files are in Wiley's format, convert them to JATS XML (1.1) to be consistent with PubMed etc.
mkdir -p output/fulltexts/wiley/xml
for file in output/fulltexts/wiley/wiley_xml/*.xml; do
filename=$(basename "$file")
Rscript code/full_text_conversion/wiley_to_jats.R "$file" "output/fulltexts/wiley/xml/${filename%.xml}.xml"
done
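The conversion script itself (code/full_text_conversion/wiley_to_jats.R) is not shown here; as a rough sketch of the xml2 mechanics such a conversion relies on (the element names below are illustrative, not Wiley's actual schema):

```r
library(xml2)

# Illustrative only: rename a non-JATS section element to the JATS <sec> tag
doc <- read_xml("<article><section><title>Methods</title></section></article>")
for (node in xml_find_all(doc, "//section")) {
  xml_name(node) <- "sec"
}
as.character(doc)
# contains "<article><sec><title>Methods</title></sec></article>"
```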
# how many wiley full text xml downloaded
print("Number of downloaded full text files (xml) from Wiley:")
[1] "Number of downloaded full text files (xml) from Wiley:"
length(list.files(here::here("output/fulltexts/wiley/xml/"),
recursive = TRUE,
pattern = "\\.xml$"))
[1] 24
# how many wiley pdfs downloaded
print("Number of downloaded full text files (pdf) from Wiley:")
[1] "Number of downloaded full text files (pdf) from Wiley:"
length(list.files(here::here("output/fulltexts/wiley/"),
recursive = TRUE,
pattern = "\\.pdf$"))
[1] 8
TDM policy: https://bmjgroup.com/text-and-data-mining-tdm-policy/
bmj_doi_patterns <- "10.1136/gutjnl|10.1136/jmedgenet"
bmj_links <-
grep(bmj_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from BMJ:")
[1] "Number of papers potentially available from BMJ:"
length(bmj_links)
[1] 4
# download html content from webpage
# save to output/fulltexts/bmj
print("Number of downloaded full text files (html) from BMJ:")
[1] "Number of downloaded full text files (html) from BMJ:"
length(list.files(here::here("output/fulltexts/bmj"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 4
Policies: https://www.cambridge.org/core/services/open-research/text-and-data-mining
cambridge_doi_patterns <- "10.1017"
cambridge_links <-
grep(cambridge_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from Cambridge:")
[1] "Number of papers potentially available from Cambridge:"
length(cambridge_links)
[1] 1
# obtain html content from webpage
print("Number of downloaded full text files (html) from Cambridge:")
[1] "Number of downloaded full text files (html) from Cambridge:"
length(list.files(here::here("output/fulltexts/cambridge"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 1
print("Number of downloaded full text files (xml) from Cambridge:")
[1] "Number of downloaded full text files (xml) from Cambridge:"
length(list.files(here::here("output/fulltexts/cambridge"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 0
Oxford Academic TDM policy: https://academic.oup.com/pages/purchasing/rights-and-permissions/text-and-data-mining
*Should reach out to confirm UCSF rights / possibly get XML formats.
# go to doi pages, and download html manually
oxford_dois <- grep("10.1093|10.113/amiajnl|10.1210|10.1513",
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from Oxford Academic:")
[1] "Number of papers potentially available from Oxford Academic:"
length(oxford_dois)
[1] 38
# then had to download manually using institutional login
# then saved to output/fulltexts/oxford_academic/html
oxford_htmls <- list.files(here::here("output/fulltexts/oxford_academic/html/"),
pattern = "\\.html$"
)
print("Number of downloaded full text files (html) from Oxford Academic:")
[1] "Number of downloaded full text files (html) from Oxford Academic:"
length(oxford_htmls)
[1] 39
# convert html to txt
for(html_file in oxford_htmls){
html_path <- here::here("output/fulltexts/oxford_academic/html/",
html_file
)
html_content <- rvest::read_html(html_path)
text_content <- rvest::html_text2(html_content)
writeLines(text_content,
here::here("output/fulltexts/oxford_academic/txt/",
gsub("\\.html$", ".txt", html_file)
),
useBytes = TRUE)
}
print("Number of downloaded full text files (html) from Oxford Academic:")
[1] "Number of downloaded full text files (html) from Oxford Academic:"
length(list.files(here::here("output/fulltexts/oxford_academic/html"),
pattern = "\\.html$"
)
)
[1] 39
print("Number of downloaded full text files (txt) from Oxford Academic:")
[1] "Number of downloaded full text files (txt) from Oxford Academic:"
length(list.files(here::here("output/fulltexts/oxford_academic/txt"),
pattern = "\\.txt$"
)
)
[1] 35
TDM policy / information: https://taylorandfrancis.com/our-policies/textanddatamining/
taylor_francis_dois <- grep("10.1080|10.2217",
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from Taylor & Francis:")
length(taylor_francis_dois)
# then had to download manually using institutional login
# as html
# saved to output/fulltexts/taylor_and_francis/html
print("Number of downloaded full text files (html) from Taylor & Francis:")
[1] "Number of downloaded full text files (html) from Taylor & Francis:"
length(list.files(here::here("output/fulltexts/taylor_and_francis/html"),
pattern = "\\.html$"
)
)
[1] 3
print("Number of downloaded full text files (xml) from Taylor & Francis:")
[1] "Number of downloaded full text files (xml) from Taylor & Francis:"
length(list.files(here::here("output/fulltexts/taylor_and_francis"),
recursive = TRUE,
pattern = "\\.xml$")
)
[1] 0
American Physiological Society, doi: 10.1152
# check, how many papers:
aps_doi_patterns <- "10.1152"
aps_links <-
grep(aps_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from APS:")
[1] "Number of papers potentially available from APS:"
length(aps_links)
[1] 2
American Association for Cancer Research, doi: 10.1158
aacr_doi_patterns <- "10.1158"
aacr_links <-
grep(aacr_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from AACR:")
[1] "Number of papers potentially available from AACR:"
length(aacr_links)
[1] 4
AHA, doi: 10.1161
aha_doi_patterns <- "10.1161"
aha_links <-
grep(aha_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from AHA:")
[1] "Number of papers potentially available from AHA:"
length(aha_links)
[1] 4
? ATS: doi: 10.1164, 10.1165 (moving to Oxford Academic in March 2026)
ats_doi_patterns <- "10.1164|10.1165"
ats_links <-
grep(ats_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from ATS:")
[1] "Number of papers potentially available from ATS:"
length(ats_links)
[1] 12
ASH, doi: 10.1182
ash_doi_patterns <- "10.1182"
ash_links <-
grep(ash_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from ASH:")
[1] "Number of papers potentially available from ASH:"
length(ash_links)
[1] 1
ERS, doi: 10.1183
ers_doi_patterns <- "10.1183"
ers_links <-
grep(ers_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from ERS:")
[1] "Number of papers potentially available from ERS:"
length(ers_links)
[1] 1
ASCO, doi: 10.1200
asco_doi_patterns <- "10.1200"
asco_links <-
grep(asco_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from ASCO:")
[1] "Number of papers potentially available from ASCO:"
length(asco_links)
[1] 1
AAN, doi: 10.1212
aan_doi_patterns <- "10.1212"
aan_links <-
grep(aan_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from AAN:")
[1] "Number of papers potentially available from AAN:"
length(aan_links)
[1] 3
J-STAGE: doi: 10.1248
jstage_doi_patterns <- "10.1248"
jstage_links <-
grep(jstage_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from J-STAGE:")
[1] "Number of papers potentially available from J-STAGE:"
length(jstage_links)
[1] 1
JASN: doi: 10.1681
jasn_doi_patterns <- "10.1681"
jasn_links <-
grep(jasn_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially available from JASN:")
[1] "Number of papers potentially available from JASN:"
length(jasn_links)
[1] 4
(ADA) Diabetes, doi: 10.2337
diabetes_doi_patterns <- "10.2337"
diabetes_links <-
grep(diabetes_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers we can potentially get from Diabetes:")
[1] "Number of papers we can potentially get from Diabetes:"
length(diabetes_links)
[1] 12
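The per-publisher blocks above all repeat the same `grep()` call. One compact alternative is to collect the prefixes in a named vector and count every publisher in a single pass. This is a sketch with toy data; in the real analysis `doi_information` comes from the Europe PMC lookup, and only three of the publishers are shown here.

```r
# Toy stand-in for doi_information from the Europe PMC lookup.
doi_information <- data.frame(
  DOI = c("10.1164/rccm.0001", "10.1165/rcmb.0002",
          "10.1182/blood.0003", "10.2337/db21-0004"),
  stringsAsFactors = FALSE
)

# One named pattern per publisher (same prefixes as the blocks above,
# with the dots escaped so they match literally).
publisher_patterns <- c(
  ATS      = "10\\.1164|10\\.1165",
  ASH      = "10\\.1182",
  Diabetes = "10\\.2337"
)

# Count the matching DOIs for every publisher at once.
publisher_counts <- vapply(
  publisher_patterns,
  function(p) sum(grepl(p, doi_information$DOI)),
  integer(1)
)
publisher_counts
# ATS: 2, ASH: 1, Diabetes: 1
```

Keeping the patterns in one named vector also makes it harder for a publisher's count and its label to drift apart when prefixes change.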
full_text_files <-
  list.files(here::here("output/fulltexts"),
             recursive = TRUE,
             pattern = "\\.html$|\\.xml$")
full_text_files <- basename(full_text_files) |>
  stringr::str_remove_all("\\.html$|\\.xml$") |>
  unique()
# convert PMCIDs to PMIDs
converted_fulltext_pmcids <-
  converted_ids |>
  filter(pmcids %in% full_text_files) |>
  pull(PMID) |>
  unique()
full_text_files <- c(full_text_files,
                     converted_fulltext_pmcids)
full_text_pmids <- grep("PMC",
                        full_text_files,
                        invert = TRUE,
                        value = TRUE)
full_text_pmids <- unique(full_text_pmids)
print("Number of PMIDs with full texts downloaded:")
[1] "Number of PMIDs with full texts downloaded:"
sum(all_pmids %in% full_text_files)
[1] 754
print("% of total PMIDs with full texts downloaded:")
[1] "% of total PMIDs with full texts downloaded:"
100 * sum(all_pmids %in% full_text_files) / length(all_pmids)
[1] 91.83922
Papers whose full texts I cannot retrieve automatically (through Europe PMC, the NCBI Cloud Service, or publisher TDM APIs) will need manual review to identify their study cohorts.
print("Number of PMIDs without full texts downloaded (to manually review):")
[1] "Number of PMIDs without full texts downloaded (to manually review):"
n_manual_review = length(all_pmids) - sum(all_pmids %in% full_text_files)
n_manual_review
[1] 67
print("Assuming 10 minutes per paper to review, total time (hours):")
[1] "Assuming 10 minutes per paper to review, total time (hours):"
n_manual_review * 10 / 60
[1] 11.16667
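To carry the manual-review step forward, the unmatched PMIDs can be written out as a checklist. A minimal sketch with toy vectors standing in for `all_pmids` and `full_text_files` above; the output location is an assumption, not a path the project necessarily uses:

```r
# Toy stand-ins for all_pmids and full_text_files defined above.
all_pmids       <- c("111", "222", "333", "444")
full_text_files <- c("111", "444", "PMC999")

# PMIDs with no downloaded full text become the manual-review list.
manual_review_pmids <- all_pmids[!all_pmids %in% full_text_files]
manual_review_pmids
# "222" "333"

# Save the list for review (hypothetical location).
writeLines(manual_review_pmids,
           file.path(tempdir(), "manual_review_pmids.txt"))
```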
open_alex_wiley_urls <-
  open_alex_works |>
  filter(doi %in% paste0("https://doi.org/", remaining_doi_info)) |>
  filter(is_oa_anywhere == TRUE) |>
  filter(grepl("onlinelibrary.wiley.com", oa_url))
# OpenAlex returns DOIs as full URLs; strip the prefix so they
# match the bare DOIs in doi_information$DOI
open_alex_wiley_dois <-
  open_alex_wiley_urls |>
  pull(doi) |>
  stringr::str_remove("^https://doi.org/")
pmids_open_alex_wiley <-
  doi_information |>
  filter(DOI %in% open_alex_wiley_dois) |>
  pull(PMID)
purrr::walk2(open_alex_wiley_urls$oa_url,
             pmids_open_alex_wiley,
             ~ download_wiley_pdf(doi = .x,
                                  api_key = wiley_api,
                                  pmids = .y))
# old approach to getting DOIs, via NCBI Entrez summaries:
library(rentrez)
entrez_info <-
  entrez_summary(db = "pubmed",
                 id = not_convertable_pmids)
dois <-
  entrez_info |>
  purrr::map(function(x) {
    # articleids is a data frame of id types per record
    x$articleids |>
      filter(idtype == "doi") |>
      pull(value)
  })
library(openalexR)
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
not_convertable_pmids <- converted_ids |>
filter(pmcids == "") |>
pull(PMID)
doi_information <-
convert_pmid_df |>
filter(PMID %in% not_convertable_pmids)
# records with no DOI at all:
doi_information |>
  filter(DOI == "")
doi_information$PMID |> unique() |> length()
length(not_convertable_pmids)
# get OpenAlex works for the remaining DOIs
open_alex_works <- oa_fetch(
doi = unique(doi_information$DOI),
entity = "works",
options = list(select = c("doi",
"open_access"))
)
# no best open access location:
open_alex_works |>
filter(is.na(oa_url)) |>
nrow()
# pdf link available (escape the dot so ".pdf" matches literally):
open_alex_works |>
  filter(grepl("\\.pdf", oa_url)) |>
  nrow()
to_download_pdfs <-
  open_alex_works |>
  filter(grepl("\\.pdf", oa_url)) |>
  pull(oa_url)
writeLines(
to_download_pdfs,
here::here("output/fulltexts/pdfs/pdf_links_to_download.txt"))
cd output/fulltexts/pdfs
while read -r url; do
curl -fLO "$url"
done < pdf_links_to_download.txt
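The shell loop stops at the first network hiccup and can save error pages as PDFs. The same downloads can be driven from R with per-URL error handling so one bad link does not abort the run. A sketch; the helper name `safe_download` and the paths in the commented usage are assumptions:

```r
# Download one URL into dest_dir, returning TRUE on success and
# FALSE (with a message) on failure.
safe_download <- function(url, dest_dir) {
  dest <- file.path(dest_dir, basename(url))
  tryCatch({
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
    TRUE
  }, error = function(e) {
    message("Failed: ", url)
    FALSE
  })
}

# Usage against the link list written above (paths assumed):
# pdf_links <- readLines(here::here("output/fulltexts/pdfs/pdf_links_to_download.txt"))
# ok <- vapply(pdf_links, safe_download, logical(1),
#              dest_dir = here::here("output/fulltexts/pdfs"))
```

Returning a logical vector also makes it easy to retry only the failed URLs later.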
# not available via PMC (PMIDs with no PMCID):
not_convertable_pmids <- converted_ids |>
filter(pmcids == "") |>
pull(PMID)
all_tgz_links <- c()
for (article_id in "PMC2613843") {
  url <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=",
                article_id)
  resp <- GET(url)
  xml_data <- xml_child(content(resp), "records")
  tgz_link <- xml_find_first(xml_data,
                             ".//link[@format='tgz']/@href")
  tgz_link <- xml_text(tgz_link)
  if (is.na(tgz_link)) {
    print("No tar.gz link found.")
  } else {
    all_tgz_links <- c(all_tgz_links, tgz_link)
  }
}
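The loop above is wired to a single hardcoded PMC ID, and the network call and XML parsing are tangled together. Splitting the parsing into its own function makes it testable offline and easy to map over every ID. A sketch: `parse_tgz_link` is a hypothetical name, and the sample document below only mimics the shape of the OA service's `<records>` payload (the URL in it is made up).

```r
library(xml2)

# Extract the tar.gz link from one OA-service response document;
# returns NA if no link with format="tgz" is present.
parse_tgz_link <- function(doc) {
  link <- xml_find_first(doc, ".//link[@format='tgz']/@href")
  xml_text(link)
}

# Sample response shaped like the OA service's payload (fabricated URL).
sample_xml <- read_xml(
  '<OA><records>
     <record id="PMC2613843">
       <link format="tgz" href="ftp://example.org/pub/pmc/PMC2613843.tar.gz"/>
     </record>
   </records></OA>'
)
parse_tgz_link(sample_xml)
# "ftp://example.org/pub/pmc/PMC2613843.tar.gz"
```

With the parser separated, the real loop reduces to fetching each response and calling `parse_tgz_link()` on it, with a `Sys.sleep()` between requests to respect NCBI rate limits.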
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rcrossref_1.2.1 data.table_1.17.8 dplyr_1.1.4 here_1.0.1
[5] stringr_1.6.0 xml2_1.4.0 httr_1.4.7 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 renv_1.0.3
[4] stringi_1.8.7 httpcode_0.3.0 digest_0.6.37
[7] magrittr_2.0.4 evaluate_1.0.5 fastmap_1.2.0
[10] plyr_1.8.9 rprojroot_2.1.0 jsonlite_2.0.0
[13] processx_3.8.6 whisker_0.4.1 crul_1.6.0
[16] ps_1.9.1 promises_1.3.3 BiocManager_1.30.26
[19] jquerylib_0.1.4 cli_3.6.5 shiny_1.11.1
[22] rlang_1.1.6 withr_3.0.2 cachem_1.1.0
[25] yaml_2.3.10 tools_4.3.1 httpuv_1.6.16
[28] DT_0.34.0 curl_7.0.0 vctrs_0.6.5
[31] R6_2.6.1 mime_0.13 lifecycle_1.0.4
[34] git2r_0.36.2 fs_1.6.6 htmlwidgets_1.6.4
[37] miniUI_0.1.2 pkgconfig_2.0.3 callr_3.7.6
[40] pillar_1.11.1 bslib_0.9.0 later_1.4.4
[43] glue_1.8.0 Rcpp_1.1.0 xfun_0.55
[46] tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1
[49] knitr_1.50 xtable_1.8-4 htmltools_0.5.8.1
[52] rmarkdown_2.30 compiler_4.3.1 getPass_0.2-4