Last updated: 2025-10-20
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bdb7fae. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: IHME_GBD_2019_Level_2.csv
Ignored: PMC000XXXXX_json_ascii.tar.gz
Ignored: PMC120XXXXX_json_ascii.tar.gz
Ignored: PMC1790863.txt
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: data/.DS_Store
Ignored: data/cohort/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/~$IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/~$IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/who/
Ignored: full_texts/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/trait_ontology/
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: r-spacyr/
Ignored: renv/
Ignored: test_PMC1790863.xml
Ignored: venv/
Untracked files:
Untracked: analysis/text_for_cohort_labels.Rmd
Untracked: code/full_text_download.R
Untracked: code/get_dbgap_ids.py
Untracked: code/get_pmids_from_dbgap.py
Unstaged changes:
Modified: .gitignore
Modified: analysis/correcting_cohort_names.Rmd
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/gbd_data_plots.Rmd
Modified: analysis/group_cancer_diseases.Rmd
Modified: analysis/gwas_to_gbd.Rmd
Modified: analysis/level_1_disease_group_non_cancer.Rmd
Modified: analysis/level_2_disease_group.Rmd
Modified: analysis/map_trait_to_icd10.Rmd
Modified: analysis/trait_ontology_categorization.Rmd
Modified: code/pubmedbert_train_test.py
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/get_dbgap_ids.Rmd) and
HTML (docs/get_dbgap_ids.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | bdb7fae | IJbeasley | 2025-10-20 | Extracting all dbgap id info |
| html | 47e3b12 | IJbeasley | 2025-10-20 | Build site. |
| Rmd | 5909160 | IJbeasley | 2025-10-20 | Extracting dbgap ids |
library(dplyr)
library(data.table)
library(rentrez)
library(purrr)
library(stringr)
gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info = gwas_study_info |>
dplyr::rename_with(~ gsub(" ", "_", .x))
gwas_study_info =
gwas_study_info |>
dplyr::filter(DISEASE_STUDY == TRUE)
all_pmids <-
unique(gwas_study_info$PUBMED_ID)
print(length(all_pmids))
[1] 4511
get_internal_dbgap_ids = function(pubmed_id) {
dbgap_links <- entrez_link(
dbfrom = "pubmed",
db = "gap",
id = pubmed_id
)
dbgap_ids <- unlist(dbgap_links$links$pubmed_gap)
return(data.frame(PUBMED_ID = pubmed_id,
DBGAP_ID = str_flatten(dbgap_ids,
collapse = ",",
na.rm = TRUE)
)
)
}
pubmed_dbgap_mapping <-
purrr::map(all_pmids,
get_internal_dbgap_ids
) |>
dplyr::bind_rows()
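The loop above sends one `entrez_link` request per PubMed ID (4,511 requests in this run). A minimal sketch of a batched alternative follows. It assumes that `entrez_link` called with `by_id = TRUE` returns one link set per input ID in input order. The names `chunk_ids` and `link_batch` and the batch size of 100 are hypothetical and not part of the original analysis. Only the chunking helper is executed here; the network function is defined but never called.

```r
# Split a vector of IDs into consecutive batches of at most `size` elements.
chunk_ids <- function(ids, size = 100) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Hypothetical batched lookup: one request per batch instead of one per ID.
# Assumes rentrez is installed and that by_id = TRUE yields one link set
# per input ID, in input order.
link_batch <- function(pmids) {
  links <- rentrez::entrez_link(dbfrom = "pubmed", db = "gap",
                                id = pmids, by_id = TRUE)
  data.frame(
    PUBMED_ID = pmids,
    DBGAP_ID = vapply(links, function(x) {
      # PubMed IDs without a dbGaP link collapse to an empty string
      paste(unlist(x$links$pubmed_gap), collapse = ",")
    }, character(1))
  )
}

# Demonstrate the chunking only (no network access needed).
batches <- chunk_ids(1:250, size = 100)
lengths(batches)
```

If this route were taken, the per-batch data frames could be combined with `dplyr::bind_rows()` exactly as in the per-ID version.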
n_with_dbgap = pubmed_dbgap_mapping |>
dplyr::filter(DBGAP_ID != "") |>
nrow()
percent_with_dbgap = n_with_dbgap / nrow(pubmed_dbgap_mapping) * 100
percent_with_dbgap
[1] 8.778541
data.table::fwrite(pubmed_dbgap_mapping,
here::here("output/gwas_cohorts/gwas_study_dbgap_ids.csv")
)
dbgap_ids = pubmed_dbgap_mapping$DBGAP_ID
dbgap_ids = dbgap_ids |>
strsplit(",") |>
unlist() |>
unique()
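To make the splitting step concrete, here is a small base-R sketch with made-up IDs (not real dbGaP identifiers). It shows that empty strings, which mark PubMed IDs without any dbGaP link, contribute nothing after `strsplit`, and that repeated IDs collapse under `unique`.

```r
# Toy stand-in for pubmed_dbgap_mapping$DBGAP_ID (values are invented).
flattened <- c("111,222", "", "222", "333")

ids <- flattened |>
  strsplit(",") |>   # "" splits to character(0), so it drops out on unlist
  unlist() |>
  unique()           # "222" appears twice above but is kept once

ids
```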
get_dbgap_accession = function(internal_dbgap_id) {
summary = entrez_summary(db = "gap",
id = internal_dbgap_id)
accession = summary$d_study_results$d_study_id
return(data.frame(
INTERNAL_DBGAP_ID = internal_dbgap_id,
DBGAP_ACCESSION = str_flatten(accession,
collapse = ",",
na.rm = TRUE)
)
)
}
dbgap_accession_mapping <-
purrr::map(dbgap_ids,
get_dbgap_accession
) |>
dplyr::bind_rows()
Warning: ID 2256109 produced error 'cannot get document summary'
Warning: ID 2255562 produced error 'cannot get document summary'
Warning: ID 2132689 produced error 'cannot get document summary'
Warning: ID 1999847 produced error 'cannot get document summary'
Warning: ID 2244507 produced error 'cannot get document summary'
Warning: ID 2414967 produced error 'cannot get document summary'
Warning: ID 1813370 produced error 'cannot get document summary'
Warning: ID 1698191 produced error 'cannot get document summary'
Warning: ID 2293830 produced error 'cannot get document summary'
Warning: ID 2293410 produced error 'cannot get document summary'
Warning: ID 2132701 produced error 'cannot get document summary'
Warning: ID 2292915 produced error 'cannot get document summary'
Warning: ID 2410869 produced error 'cannot get document summary'
Warning: ID 1692051 produced error 'cannot get document summary'
Warning: ID 1913977 produced error 'cannot get document summary'
Warning: ID 2254636 produced error 'cannot get document summary'
Warning: ID 2099534 produced error 'cannot get document summary'
Warning: ID 2128305 produced error 'cannot get document summary'
Warning: ID 2250150 produced error 'cannot get document summary'
Warning: ID 2246763 produced error 'cannot get document summary'
Warning: ID 2006486 produced error 'cannot get document summary'
Warning: ID 2258817 produced error 'cannot get document summary'
Warning: ID 2252961 produced error 'cannot get document summary'
Warning: ID 2297187 produced error 'cannot get document summary'
Warning: ID 2410794 produced error 'cannot get document summary'
Warning: ID 2429214 produced error 'cannot get document summary'
Warning: ID 2224606 produced error 'cannot get document summary'
Warning: ID 2246536 produced error 'cannot get document summary'
Warning: ID 2128291 produced error 'cannot get document summary'
Warning: ID 1601604 produced error 'cannot get document summary'
Warning: ID 1914741 produced error 'cannot get document summary'
Warning: ID 2271001 produced error 'cannot get document summary'
Warning: ID 2307567 produced error 'cannot get document summary'
Warning: ID 2243505 produced error 'cannot get document summary'
Warning: ID 2294055 produced error 'cannot get document summary'
Warning: ID 2294054 produced error 'cannot get document summary'
Warning: ID 2423015 produced error 'cannot get document summary'
Warning: ID 1684016 produced error 'cannot get document summary'
Warning: ID 2246760 produced error 'cannot get document summary'
Warning: ID 2244583 produced error 'cannot get document summary'
Warning: ID 2134513 produced error 'cannot get document summary'
Warning: ID 1752357 produced error 'cannot get document summary'
Warning: ID 1716966 produced error 'cannot get document summary'
Warning: ID 2013166 produced error 'cannot get document summary'
Warning: ID 1909445 produced error 'cannot get document summary'
Warning: ID 1909443 produced error 'cannot get document summary'
Warning: ID 1978350 produced error 'cannot get document summary'
Warning: ID 2134525 produced error 'cannot get document summary'
Warning: ID 2408435 produced error 'cannot get document summary'
Warning: ID 2307609 produced error 'cannot get document summary'
Warning: ID 2297186 produced error 'cannot get document summary'
Warning: ID 2239380 produced error 'cannot get document summary'
Warning: ID 2239305 produced error 'cannot get document summary'
Warning: ID 2239580 produced error 'cannot get document summary'
Warning: ID 2246759 produced error 'cannot get document summary'
Warning: ID 2246755 produced error 'cannot get document summary'
Warning: ID 2134508 produced error 'cannot get document summary'
Warning: ID 2248299 produced error 'cannot get document summary'
Warning: ID 2136732 produced error 'cannot get document summary'
Warning: ID 2133385 produced error 'cannot get document summary'
Warning: ID 1958564 produced error 'cannot get document summary'
Warning: ID 1927385 produced error 'cannot get document summary'
Warning: ID 984835 produced error 'cannot get document summary'
Warning: ID 1428307 produced error 'cannot get document summary'
Warning: ID 2134504 produced error 'cannot get document summary'
Warning: ID 2251201 produced error 'cannot get document summary'
Warning: ID 2251836 produced error 'cannot get document summary'
Warning: ID 1913631 produced error 'cannot get document summary'
Warning: ID 2128293 produced error 'cannot get document summary'
Warning: ID 2247986 produced error 'cannot get document summary'
Warning: ID 2239578 produced error 'cannot get document summary'
Warning: ID 2136723 produced error 'cannot get document summary'
Warning: ID 2260344 produced error 'cannot get document summary'
Warning: ID 930640 produced error 'cannot get document summary'
Warning: ID 1585115 produced error 'cannot get document summary'
Warning: ID 1582545 produced error 'cannot get document summary'
Warning: ID 1582544 produced error 'cannot get document summary'
Warning: ID 2239577 produced error 'cannot get document summary'
Warning: ID 2254490 produced error 'cannot get document summary'
Warning: ID 2013313 produced error 'cannot get document summary'
Warning: ID 2013312 produced error 'cannot get document summary'
Warning: ID 1909444 produced error 'cannot get document summary'
Warning: ID 2134512 produced error 'cannot get document summary'
Warning: ID 2256101 produced error 'cannot get document summary'
Warning: ID 1601499 produced error 'cannot get document summary'
Warning: ID 2099548 produced error 'cannot get document summary'
Warning: ID 1954641 produced error 'cannot get document summary'
Warning: ID 1752360 produced error 'cannot get document summary'
Warning: ID 2133393 produced error 'cannot get document summary'
Warning: ID 1960329 produced error 'cannot get document summary'
Warning: ID 2410868 produced error 'cannot get document summary'
Warning: ID 1582340 produced error 'cannot get document summary'
Warning: ID 2100347 produced error 'cannot get document summary'
Warning: ID 2092375 produced error 'cannot get document summary'
Warning: ID 2108236 produced error 'cannot get document summary'
Warning: ID 2252700 produced error 'cannot get document summary'
Warning: ID 1926746 produced error 'cannot get document summary'
Warning: ID 1755085 produced error 'cannot get document summary'
Warning: ID 2247851 produced error 'cannot get document summary'
Warning: ID 2243583 produced error 'cannot get document summary'
Warning: ID 2136726 produced error 'cannot get document summary'
Warning: ID 2416381 produced error 'cannot get document summary'
Warning: ID 2423513 produced error 'cannot get document summary'
Warning: ID 2270763 produced error 'cannot get document summary'
Warning: ID 2128296 produced error 'cannot get document summary'
Warning: ID 2428913 produced error 'cannot get document summary'
Warning: ID 2222631 produced error 'cannot get document summary'
Warning: ID 2437938 produced error 'cannot get document summary'
Warning: ID 2423291 produced error 'cannot get document summary'
Warning: ID 2254632 produced error 'cannot get document summary'
Warning: ID 2254544 produced error 'cannot get document summary'
Warning: ID 2439610 produced error 'cannot get document summary'
Warning: ID 1926646 produced error 'cannot get document summary'
Warning: ID 2420445 produced error 'cannot get document summary'
Warning: ID 2408563 produced error 'cannot get document summary'
Warning: ID 2136752 produced error 'cannot get document summary'
Warning: ID 2085275 produced error 'cannot get document summary'
Warning: ID 2136724 produced error 'cannot get document summary'
Warning: ID 1775403 produced error 'cannot get document summary'
Warning: ID 2292432 produced error 'cannot get document summary'
Warning: ID 2254348 produced error 'cannot get document summary'
Warning: ID 2399899 produced error 'cannot get document summary'
Warning: ID 1925078 produced error 'cannot get document summary'
Warning: ID 2252675 produced error 'cannot get document summary'
Warning: ID 2259374 produced error 'cannot get document summary'
Warning: ID 2013263 produced error 'cannot get document summary'
Warning: ID 2414968 produced error 'cannot get document summary'
Warning: ID 2243787 produced error 'cannot get document summary'
Warning: ID 1975323 produced error 'cannot get document summary'
Warning: ID 2225448 produced error 'cannot get document summary'
Warning: ID 2294062 produced error 'cannot get document summary'
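# Expand comma-separated dbGaP IDs to one per row so that each ID can match
# its accession; a flattened "id1,id2" string would not match any single ID.
dbgap_accession_mapping =
pubmed_dbgap_mapping |>
tidyr::separate_longer_delim(cols = "DBGAP_ID", delim = ",") |>
left_join(dbgap_accession_mapping,
by = c("DBGAP_ID" = "INTERNAL_DBGAP_ID")
)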
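Because a single PubMed ID can link to several internal dbGaP IDs, the form of the join key matters. The sketch below uses base R `merge` rather than `left_join`, with made-up values throughout; the names `pubmed_map` and `accession_map` are hypothetical. It compares a comma-flattened key with a key that holds one ID per row.

```r
# Toy tables; all IDs and accessions are invented for illustration.
pubmed_map <- data.frame(PUBMED_ID = c(1, 2),
                         DBGAP_ID = c("111,222", "333"))
accession_map <- data.frame(INTERNAL_DBGAP_ID = c("111", "222", "333"),
                            DBGAP_ACCESSION = c("phsA", "phsB", "phsC"))

# Joining on the flattened key: "111,222" has no exact match.
flat_join <- merge(pubmed_map, accession_map,
                   by.x = "DBGAP_ID", by.y = "INTERNAL_DBGAP_ID",
                   all.x = TRUE)

# Expanding to one ID per row first lets every ID find its accession.
split_ids <- strsplit(pubmed_map$DBGAP_ID, ",")
long_map <- data.frame(
  PUBMED_ID = rep(pubmed_map$PUBMED_ID, lengths(split_ids)),
  DBGAP_ID = unlist(split_ids)
)
long_join <- merge(long_map, accession_map,
                   by.x = "DBGAP_ID", by.y = "INTERNAL_DBGAP_ID",
                   all.x = TRUE)

sum(is.na(flat_join$DBGAP_ACCESSION))  # unmatched rows with the flattened key
sum(is.na(long_join$DBGAP_ACCESSION))  # unmatched rows after splitting
```

Counting rows with a missing `DBGAP_ACCESSION` after the real join is a quick way to check how many links were lost to key mismatches rather than to failed summary lookups.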
data.table::fwrite(
dbgap_accession_mapping,
here::here("output/gwas_cohorts/gwas_study_dbgap_accessions.csv")
)
dbgap_to_pubmed_id <- function(dbgap_accession) {
res <- entrez_search(db = "gap",
term = paste0(dbgap_accession, "[STID]"))
links <- entrez_link(dbfrom = "gap",
db = "pubmed",
id = res$ids
)
pmids <- unlist(links$links$gap_pubmed)
return(data.frame(
DBGAP_ACCESSION = dbgap_accession,
DBGAP_ID = str_flatten(res$ids,
collapse = ",",
na.rm = TRUE),
PUBMED_ID = str_flatten(pmids,
collapse = ",",
na.rm = TRUE)
)
)
}
safe_dbgap_to_pubmed_id <- purrr::possibly(dbgap_to_pubmed_id,
otherwise = data.frame(
DBGAP_ACCESSION = NA,
DBGAP_ID = NA,
PUBMED_ID = NA
))
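`purrr::possibly` returns the `otherwise` value whenever the wrapped function throws an error, so one failed lookup does not stop the whole `map`. The sketch below is a base-R analogue built on `tryCatch`, not purrr's implementation; `with_fallback` and `flaky_lookup` are hypothetical names used only for illustration.

```r
# Base-R analogue of purrr::possibly(): wrap `f` so errors yield `otherwise`.
with_fallback <- function(f, otherwise) {
  force(f)
  force(otherwise)
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Toy lookup that fails for one input (stands in for a network call).
flaky_lookup <- function(x) {
  if (x == "bad") stop("lookup failed")
  data.frame(DBGAP_ACCESSION = x, PUBMED_ID = "123")
}

safe_lookup <- with_fallback(
  flaky_lookup,
  otherwise = data.frame(DBGAP_ACCESSION = NA_character_,
                         PUBMED_ID = NA_character_)
)

# The failing input produces an all-NA row instead of aborting the loop.
results <- do.call(rbind, lapply(c("phs1", "bad", "phs2"), safe_lookup))
results
```

Note that the fallback row does not record which input failed, which is why rows with a missing `PUBMED_ID` are filtered out afterwards.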
# get known dbgap accessions from the cohort description sheet and from the accession mapping built above
cohort_info <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
sheet = 1)
New names:
• `` -> `...10`
dbgap_accessions <-
c(cohort_info$dbGaP[cohort_info$dbGaP != "" & !is.na(cohort_info$dbGaP)],
dbgap_accession_mapping$DBGAP_ACCESSION[dbgap_accession_mapping$DBGAP_ACCESSION != ""]
)
dbgap_accessions <- unlist(strsplit(dbgap_accessions, ","))
dbgap_accessions <- unique(dbgap_accessions)
dbgap_to_pubmed_mapping <-
purrr::map(dbgap_accessions,
safe_dbgap_to_pubmed_id
) |>
dplyr::bind_rows()
dbgap_to_pubmed_mapping =
dbgap_to_pubmed_mapping |>
filter(!is.na(PUBMED_ID)) |>
tidyr::separate_longer_delim(cols = "PUBMED_ID", delim = ",")
# make sure the accession column is character before row-binding
dbgap_accession_mapping =
dbgap_accession_mapping |>
mutate(DBGAP_ACCESSION = as.character(DBGAP_ACCESSION))
dbgap_to_pubmed_mapping =
dbgap_to_pubmed_mapping |>
mutate(PUBMED_ID = as.numeric(PUBMED_ID))
combined_dbgap_pubmed_mapping =
bind_rows(
dbgap_accession_mapping,
dbgap_to_pubmed_mapping)
combined_dbgap_pubmed_mapping =
combined_dbgap_pubmed_mapping |>
distinct()
combined_dbgap_pubmed_mapping =
combined_dbgap_pubmed_mapping |>
filter(!(DBGAP_ID == "" & is.na(DBGAP_ACCESSION))
)
combined_dbgap_pubmed_mapping =
combined_dbgap_pubmed_mapping |>
filter(!(is.na(PUBMED_ID)))
# filter for pmids in gwas study info
combined_dbgap_pubmed_mapping =
combined_dbgap_pubmed_mapping |>
filter(PUBMED_ID %in% all_pmids)
combined_dbgap_pubmed_mapping =
combined_dbgap_pubmed_mapping |>
arrange(PUBMED_ID)
data.table::fwrite(
combined_dbgap_pubmed_mapping,
here::here("output/gwas_cohorts/gwas_study_combined_dbgap_pubmed_mapping.csv")
)
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] stringr_1.5.2 purrr_1.1.0 rentrez_1.2.4 data.table_1.17.8
[5] dplyr_1.1.4 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] jsonlite_2.0.0 compiler_4.3.1 renv_1.0.3 promises_1.3.3
[5] tidyselect_1.2.1 Rcpp_1.1.0 git2r_0.36.2 tidyr_1.3.1
[9] callr_3.7.6 later_1.4.4 jquerylib_0.1.4 readxl_1.4.5
[13] yaml_2.3.10 fastmap_1.2.0 here_1.0.1 R6_2.6.1
[17] generics_0.1.4 curl_7.0.0 knitr_1.50 XML_3.99-0.19
[21] tibble_3.3.0 rprojroot_2.1.0 bslib_0.9.0 pillar_1.11.1
[25] rlang_1.1.6 cachem_1.1.0 stringi_1.8.7 httpuv_1.6.16
[29] xfun_0.53 getPass_0.2-4 fs_1.6.6 sass_0.4.10
[33] cli_3.6.5 withr_3.0.2 magrittr_2.0.4 ps_1.9.1
[37] digest_0.6.37 processx_3.8.6 rstudioapi_0.17.1 lifecycle_1.0.4
[41] vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0 cellranger_1.1.0
[45] whisker_0.4.1 rmarkdown_2.30 httr_1.4.7 tools_4.3.1
[49] pkgconfig_2.0.3 htmltools_0.5.8.1