Last updated: 2026-02-04

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version c0dc676. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    Aus_School_Profile.xlsx
    Ignored:    SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacyr_venv/
    Ignored:    test_37689528.xml

Untracked files:
    Untracked:  code/full_text_conversion/elsevier_to_jats_v2.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v3.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v4.R
    Untracked:  code/full_text_conversion/elsevier_to_jats_v5.R
    Untracked:  code/full_text_conversion/fix_elsevier_xml.py
    Untracked:  code/full_text_conversion/testing_fix_elsevier.R
    Untracked:  debug_elsevier.R
    Untracked:  schools.R
    Untracked:  testing.R

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_dbgap_ids.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd
    Modified:   code/full_text_conversion/elsevier_to_jats.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c0dc676 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 1898c02 IJbeasley 2026-02-04 Build site.
Rmd d214580 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 6ba1e1f IJbeasley 2026-01-12 Build site.
Rmd b43e9a9 IJbeasley 2026-01-12 Update getting full text
html ac0d1a7 IJbeasley 2025-10-27 Build site.
html 8642872 IJbeasley 2025-10-27 Build site.
Rmd da4d730 IJbeasley 2025-10-27 Now run on all texts
html fb5cfd9 IJbeasley 2025-10-27 Build site.
Rmd 8ed4c37 IJbeasley 2025-10-27 Now run on all texts
html 8610283 IJbeasley 2025-10-27 Build site.
Rmd 7d504e3 IJbeasley 2025-10-27 More fixing of download full text
html 16f4c19 IJbeasley 2025-10-27 Build site.
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs

Get Pubmed ids from GWAS catalog

## Step 1: 
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

# replace spaces in column names with underscores
gwas_study_info <- gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
    dplyr::filter(!cause %in% c("HIV/AIDS",
                             "Tuberculosis",
                             "Malaria",
                             "Lower respiratory infections",
                             "Diarrhoeal diseases",
                             "Neonatal disorders",
                             "Tetanus",
                             "Diphtheria",
                             "Pertussis" ,
                             "Measles",
                             "Maternal disorders"))

# gwas_study_info <- gwas_study_info |>
#   dplyr::filter(DISEASE_STUDY == TRUE)

print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
pmids <- unique(gwas_study_info$PUBMED_ID)
length(pmids)
[1] 821

Convert Pubmed IDs to PMCIDs

# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

convert_pmid_df <- convert_pmid_df |>
  dplyr::rename(pmcids = PMCID
                ) |>
  dplyr::mutate(pmcids = ifelse(is.na(pmcids),
                                "",
                                pmcids
                                )
                )

convert_pmid_df <-
  convert_pmid_df |>
  dplyr::filter(!is.na(PMID))

converted_ids = 
  convert_pmid_df |>
  filter(PMID %in% pmids)

data.table::fwrite(converted_ids,
                   here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")
                   )
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))

print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
       PMID     pmcids                                           DOI
      <int>     <char>                                        <char>
1: 17223258             https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876                      https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697                           https://doi.org/10.1038/ng2043
5: 17463246                  https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617       https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 821   3
length(pmids)
[1] 821
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <-
converted_ids |>
  filter(pmcids == "")  |>
  pull(PMID)

print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
pmcids <-
converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 648
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(pmids), digits = 2)
[1] 78.93

Download full texts from European PMC

Downloading full-text XML from the Europe PMC RESTful API requires a PMCID, so this step can only be applied to papers with a PMCID mapping.

# Function to download full text xml from Europe PMC Restful API
download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europe_pmc/")
                              ) {


  url_xml <- paste0("https://www.ebi.ac.uk/",
                    "europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )
  
  resp <- GET(url_xml)
  
  # ---- Fallback URL ----
  if(status_code(resp) != 200){
    
    url_xml <- paste0("https://europepmc.org/",
                       "oai.cgi?verb=GetRecord",
                       "&metadataPrefix=pmc",
                       "&identifier=oai:europepmc.org:",
                       pmcid)
    
    resp <- GET(url_xml)
  
  }
  
  # ---- Fail if still bad ----
  if(status_code(resp) != 200){
    
  return(NULL)
    
  }
  
  # ---- Parse XML ----
  xml_content <- read_xml(
    content(resp, 
            as = "text", 
            encoding = "UTF-8")
  )
  
  # the OAI fallback wraps the record, so locate the <article> node
  # regardless of namespace
  article_node <- xml_find_first(xml_content, 
                                 "//*[local-name() = 'article']"
                                 )
  
  if (is.na(article_node)) {
    message("No <article> node found for ", pmcid)
    
    return(NULL)
  }
  
  # --- Save ---
  write_xml(article_node, 
            paste0(out_dir, pmcid, ".xml")
            )
  
} 


for (article in pmcids[pmcids != ""]) {

  download_pmc_text(article)

}
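The loop above makes one request per PMCID with no pause or retry, so a transient API failure silently drops that paper. A minimal retry wrapper could help; this is a sketch, and `with_retries` is a hypothetical helper, not part of the analysis. It treats a `NULL` result as failure, which matches `download_pmc_text` returning `NULL` on bad responses (worth verifying that the success path returns something non-NULL before relying on it).

```r
# Hypothetical helper: call `f` up to `times` times, pausing between
# attempts; `succeeded` decides whether a result counts as success.
with_retries <- function(f, times = 3, pause = 1,
                         succeeded = function(x) !is.null(x)) {
  result <- NULL
  for (i in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (succeeded(result)) break      # success: stop retrying
    if (i < times) Sys.sleep(pause)   # back off before the next attempt
  }
  result
}

# e.g. with_retries(function() download_pmc_text(article))
```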
print("Number of downloaded full text files from European PMC:")
[1] "Number of downloaded full text files from European PMC:"
n_euro_pmc <- length(list.files(here::here("output/fulltexts/europe_pmc/"),
                  pattern = "\\.xml$")
       )
print(n_euro_pmc)
[1] 524
print("Percentage of pmids with full text from European PMC:")
[1] "Percentage of pmids with full text from European PMC:"
round(100 * n_euro_pmc / length(pmids), digits = 2)
[1] 63.82

Download full texts from NCBI Cloud Service

For the remaining PMCIDs / PMIDs without full text, try downloading them from the NCBI Cloud Service (the PMC article datasets on AWS S3).


# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text - non-commercial use
# oa_noncomm 
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, other-license list
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

Identify full texts already downloaded through European PMC

europeanpmc_full_texts <- 
  list.files(here::here("output/fulltexts/europe_pmc"),
             pattern = "\\.xml$"
             )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.xml$", 
       "", 
       europeanpmc_full_texts
       ) 

Get NCBI download paths for remaining full texts (where available)

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 220
print("+ Number of pmids without pmcid mapping:")
[1] "+ Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 173
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))

oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 135
oa_noncomm_to_get = 
oa_noncomm |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

# remove any overlaps between sections
oa_noncomm_to_get <-
  oa_noncomm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID))

print("Number of additional papers to download in the Non-commercial Open Access PMC section:")
[1] "Number of additional papers to download in the Non-commercial Open Access PMC section:"
nrow(oa_noncomm_to_get)
[1] 0
oa_comm_to_get = 
oa_comm |>
  dplyr::filter(AccessionID %in% left_over_pmcids |
                PMID %in% not_converted_pmids)

oa_comm_to_get <-
  oa_comm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID)) |>
  dplyr::filter(!c(PMID %in% oa_noncomm_to_get$PMID))

# remove any overlaps between sections
print("Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:")
[1] "Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:"
nrow(oa_comm_to_get)
[1] 0
file_paths = 
c(oa_noncomm_to_get$Key,
  oa_comm_to_get$Key,
  author_manu_to_get$Key)

# the filelist Keys point at the .txt copies; switch the directory and
# extension to get the XML versions
file_paths <- str_replace_all(file_paths,
                              pattern = "txt",
                              replacement = "xml")
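As a sanity check on the substitution above: replacing every `txt` with `xml` switches both the subdirectory and the file extension of a filelist Key. The example Key below is hypothetical, following the `collection/txt/...` layout of the filelists.

```r
library(stringr)

# Hypothetical Key in the format used by the PMC filelists
example_key <- "oa_comm/txt/all/PMC1234567.txt"

str_replace_all(example_key, pattern = "txt", replacement = "xml")
# → "oa_comm/xml/all/PMC1234567.xml"
```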

Download remaining full texts from NCBI Cloud Service

writeLines(
  file_paths,
  here::here("output/fulltexts/aws_locations/selected_paths.txt")
)

system(
  paste(
    "xargs -I {} aws s3 cp",
    "s3://pmc-oa-opendata/{}",
    shQuote(here::here("output/fulltexts/ncbi_cloud/")),
    "--no-sign-request",
    "<",
    shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
  )
)
# not_available = left_over_pmcids[!c(left_over_pmcids %in% 
#                                           c(oa_noncomm_to_get$AccessionID, 
#                                             oa_comm_to_get$AccessionID,
#                                             author_manu_to_get$AccessionID)
#                                           )]

# get list of pmcids already retrieved
pmcids_retrieved <- 
list.files(c(here::here("output/fulltexts/europe_pmc"),
             here::here("output/fulltexts/ncbi_cloud/")
             ),
             pattern = "\\.xml$"
)

pmcids_retrieved <-
  gsub("\\.xml$", 
       "", 
       pmcids_retrieved
       )

pmids_retrieved <-
converted_ids  |>
  filter(pmcids %in% pmcids_retrieved) |>
  pull(PMID)

all_pmids <- unique(gwas_study_info$PUBMED_ID)

not_available <- all_pmids[!c(all_pmids %in% pmids_retrieved)] 

print("Percentage of pmids with full text from NCBI Cloud Service:")
[1] "Percentage of pmids with full text from NCBI Cloud Service:"
100 * (length(pmids_retrieved) - n_euro_pmc) / length(pmids)
[1] 4.872107
print("Percentage of pmids without full text from either European PMC or NCBI Cloud Service:")
[1] "Percentage of pmids without full text from either European PMC or NCBI Cloud Service:"
100 * length(not_available) / length(pmids)
[1] 31.30329

Download from publishers (uses DOIs)

Get DOIs for the remaining articles still missing full text

doi_information <-
converted_ids |>
  filter(PMID %in% not_available)
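The per-publisher counts below can be cross-checked in one pass by tallying DOI registrant prefixes. This is a sketch with hypothetical DOI strings; in the analysis the input would be `doi_information$DOI`, which stores DOIs as `https://doi.org/...` URLs.

```r
# Hypothetical DOIs in the format used by the Europe PMC mapping file
dois <- c("https://doi.org/10.1152/japplphysiol.0001",
          "https://doi.org/10.1152/ajpheart.0002",
          "https://doi.org/10.2337/db20-0001")

# Extract the "10.XXXX" registrant prefix and count papers per publisher
prefixes <- sub("^https://doi\\.org/(10\\.[0-9]+)/.*$", "\\1", dois)
sort(table(prefixes), decreasing = TRUE)
```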

Publishers to get text-mining license info from:

American Physiological Society, doi: 10.1152

# check, how many papers:
aps_doi_patterns <- "10.1152"

aps_links <-
  grep(aps_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from APS:")
[1] "Number of papers potentially available from APS:"
length(aps_links)
[1] 2

American Association for Cancer Research, doi: 10.1158

aacr_doi_patterns <- "10.1158"

aacr_links <-
  grep(aacr_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AACR:")
[1] "Number of papers potentially available from AACR:"
length(aacr_links)
[1] 4

AHA, doi: 10.1161

aha_doi_patterns <- "10.1161"

aha_links <-
  grep(aha_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AHA:")
[1] "Number of papers potentially available from AHA:"
length(aha_links)
[1] 4

? ATS: doi: 10.1164, 10.1165 (moving to Oxford Academic in March 2026)

ats_doi_patterns <- "10.1164|10.1165"

ats_links <-
  grep(ats_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ATS:")
[1] "Number of papers potentially available from ATS:"
length(ats_links)
[1] 12

ASH, doi: 10.1182

ash_doi_patterns <- "10.1182"

ash_links <-
  grep(ash_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASH:")
[1] "Number of papers potentially available from ASH:"
length(ash_links)
[1] 1

ERS, doi: 10.1183

ers_doi_patterns <- "10.1183"

ers_links <-
  grep(ers_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ERS:")
[1] "Number of papers potentially available from ERS:"
length(ers_links)
[1] 1

ASCO, doi: 10.1200

asco_doi_patterns <- "10.1200"

asco_links <-
  grep(asco_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASCO:")
[1] "Number of papers potentially available from ASCO:"
length(asco_links)
[1] 1

AAN, doi: 10.1212

aan_doi_patterns <- "10.1212"

aan_links <-
  grep(aan_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AAN:")
[1] "Number of papers potentially available from AAN:"
length(aan_links)
[1] 3

J-STAGE: doi: 10.1248

jstage_doi_patterns <- "10.1248"

jstage_links <-
  grep(jstage_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from J-STAGE:")
[1] "Number of papers potentially available from J-STAGE:"
length(jstage_links)
[1] 1

JASN: doi: 10.1681

jasn_doi_patterns <- "10.1681"

jasn_links <-
  grep(jasn_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from JASN:")
[1] "Number of papers potentially available from JASN:"
length(jasn_links)
[1] 4

(ADA) Diabetes, doi: 10.2337

diabetes_doi_patterns <- "10.2337"

diabetes_links <-
  grep(diabetes_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from Diabetes:")
[1] "Number of papers potentially available from Diabetes:"
length(diabetes_links)
[1] 12

Testing:

Download PDFs using Open Access information from OpenAlex

PMIDs that couldn’t be converted to PMCIDs

# old approach for getting dois (uses the rentrez package):
library(rentrez)

entrez_info <-
entrez_summary(db = "pubmed", 
               id = not_convertable_pmids)

dois <-
entrez_info |>
  purrr::map(function(x) {
    
    x$articleids |> 
      filter(idtype == "doi") |> 
      pull(value)
  }
)
library(openalexR)

convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

doi_information <-
convert_pmid_df |>
  filter(PMID %in% not_convertable_pmids)

doi_information |>
  filter(DOI == "")

doi_information$PMID |> unique() |> length()

length(not_convertable_pmids)

# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information$DOI),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl("\\.pdf", oa_url)) |>
  pull(oa_url)

writeLines(
  to_download_pdfs,
  here::here("output/fulltexts/pdfs/pdf_links_to_download.txt")
)

cd output/fulltexts/pdfs

while read -r url; do
  curl -fLO "$url"
done < pdf_links_to_download.txt

PMCIDs not found in Author Manuscripts or Open Access sections

doi_information <-
convert_pmid_df |>
  # not_available holds PMIDs (computed above), so match on PMID
  filter(PMID %in% not_available)

doi_information |>
  filter(DOI == "")

doi_information <-
  doi_information |>
  filter(DOI != "")

# get open alex works for pmcids
open_alex_works <- oa_fetch(
  doi = doi_information$DOI,
  entity = "works",
  options = list(select = c(#"title",
                            "doi", 
                            "open_access"
                            ))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl("\\.pdf", oa_url)) |>
  pull(oa_url)

writeLines(
  to_download_pdfs,
  here::here("output/fulltexts/pdfs/pdf_links_to_download_pt2.txt")
)

cd output/fulltexts/pdfs

while read -r url; do
  curl -fLO "$url"
done < pdf_links_to_download_pt2.txt

Test: not used - Europe PMC Author Manuscripts


curl -s https://europepmc.org/ftp/manuscripts/ \
  | grep -o 'author_manuscript_txt[^"]*\.filelist\.txt' \
  | sort -u \
  | while read -r file; do
      curl -O "https://europepmc.org/ftp/manuscripts/$file"
    done
all_file_lists <- list.files(here::here("data/epmc"))

author_manu_epmc <- all_file_lists |>
                    purrr::map(function(file_name) {
                      
                      file_path = here::here("data/epmc",
                                             file_name
                                             )
                      
                      df <- fread(file_path)
                      
                      return(df)
                    }
                    ) |>
                    bind_rows()

author_manu_epmc |>
  filter(AccessionID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% not_available)

author_manu_epmc |>
  filter(PMID %in% pmids)

author_manu_epmc |>
  filter(PMID %in% not_convertable_pmids)

Test: not used - Download with ftp service, where available

# not_available
not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

all_tgz_links = c()

for(article_id in "PMC2613843"){

url <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=",
              article_id)
  
resp <- GET(url)

xml_data <- xml_child(content(resp), "records")

tgz_link <- xml_find_first(xml_data, 
                           ".//link[@format='tgz']/@href")
tgz_link <- xml_text(tgz_link)

if (is.na(tgz_link)) {
  
  print("No tar.gz link found.")
  
} else {
  
  all_tgz_links <- append(all_tgz_links, 
                          tgz_link)
  
}
}

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] rcrossref_1.2.1   data.table_1.17.8 dplyr_1.1.4       here_1.0.1       
[5] stringr_1.6.0     xml2_1.4.0        httr_1.4.7        workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] sass_0.4.10         generics_0.1.4      renv_1.0.3         
 [4] stringi_1.8.7       httpcode_0.3.0      digest_0.6.37      
 [7] magrittr_2.0.4      evaluate_1.0.5      fastmap_1.2.0      
[10] plyr_1.8.9          rprojroot_2.1.0     jsonlite_2.0.0     
[13] processx_3.8.6      whisker_0.4.1       crul_1.6.0         
[16] ps_1.9.1            promises_1.3.3      BiocManager_1.30.26
[19] jquerylib_0.1.4     cli_3.6.5           shiny_1.11.1       
[22] rlang_1.1.6         withr_3.0.2         cachem_1.1.0       
[25] yaml_2.3.10         tools_4.3.1         httpuv_1.6.16      
[28] DT_0.34.0           curl_7.0.0          vctrs_0.6.5        
[31] R6_2.6.1            mime_0.13           lifecycle_1.0.4    
[34] git2r_0.36.2        fs_1.6.6            htmlwidgets_1.6.4  
[37] miniUI_0.1.2        pkgconfig_2.0.3     callr_3.7.6        
[40] pillar_1.11.1       bslib_0.9.0         later_1.4.4        
[43] glue_1.8.0          Rcpp_1.1.0          xfun_0.55          
[46] tibble_3.3.0        tidyselect_1.2.1    rstudioapi_0.17.1  
[49] knitr_1.50          xtable_1.8-4        htmltools_0.5.8.1  
[52] rmarkdown_2.30      compiler_4.3.1      getPass_0.2-4