Last updated: 2025-09-10
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a7e2f7c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Ignored: data/gwas_catalog/
Ignored: output/gwas_cat/
Ignored: output/gwas_study_info_cohort_corrected.csv
Ignored: output/gwas_study_info_trait_corrected.csv
Ignored: output/gwas_study_info_trait_ontology_info.csv
Ignored: output/gwas_study_info_trait_ontology_info_l1.csv
Ignored: output/gwas_study_info_trait_ontology_info_l2.csv
Ignored: output/trait_ontology/
Ignored: renv/
Untracked files:
Untracked: analysis/disease_trait_terms_simplify.Rmd
Untracked: data/gbd/
Untracked: data/who/
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/index.Rmd
Modified: analysis/level_1_disease_group.Rmd
Modified: analysis/level_2_disease_group.Rmd
Deleted: analysis/non_ontology_trait_collapse.Rmd
Deleted: analysis/trait_ontology_collapse.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/trait_ontology_categorization.Rmd) and HTML (docs/trait_ontology_categorization.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | a7e2f7c | IJbeasley | 2025-09-10 | Fixing / re-formatting of initial trait categorization |
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE
)
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)
gwas_study_info <- fread(here::here("output/gwas_study_info_cohort_corrected.csv"))
# Fix trait names that contain a comma inside a single term
# (commas are otherwise used to separate multiple mapped terms)
# gwas_study_info |>
# mutate(n_commas_trait = stringr::str_count(MAPPED_TRAIT, ", "),
# n_commas_uri = stringr::str_count(MAPPED_TRAIT_URI, ",")) |>
# filter(n_commas_trait != n_commas_uri)
gwas_study_info =
gwas_study_info |>
rowwise() |>
mutate(MAPPED_TRAIT = case_when(
# osteoarthritis, hip ... http://www.ebi.ac.uk/efo/EFO_1000786
grepl("EFO_1000786", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, hip", "hip osteoarthritis"),
# osteoarthritis, hand ... http://www.ebi.ac.uk/efo/EFO_1000789
grepl("EFO_1000789", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, hand", "hand osteoarthritis"),
# osteoarthritis, spine ... http://www.ebi.ac.uk/efo/EFO_1000787
grepl("EFO_1000787", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, spine", "spine osteoarthritis"),
# Hepatitis, Alcoholic, http://www.ebi.ac.uk/efo/EFO_1001345
grepl("EFO_1001345", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Hepatitis, Alcoholic", "Alcoholic Hepatitis"),
# psoriasis 14, pustular http://purl.obolibrary.org/obo/MONDO_0013626
grepl("MONDO_0013626", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "psoriasis 14, pustular", "pustular psoriasis 14"),
# hypertension, pregnancy-induced http://purl.obolibrary.org/obo/MONDO_0024664
grepl("MONDO_0024664", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "hypertension, pregnancy-induced", "pregnancy-induced hypertension"),
# renal agenesis, unilateral http://purl.obolibrary.org/obo/MONDO_0019636
grepl("MONDO_0019636", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "renal agenesis, unilateral", "unilateral renal agenesis"),
# Cholecystitis, Acute http://www.ebi.ac.uk/efo/EFO_1001289
grepl("EFO_1001289", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Cholecystitis, Acute", "Acute Cholecystitis"),
# Genital neoplasm, female http://www.ebi.ac.uk/efo/EFO_1001331
grepl("EFO_1001331", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Genital neoplasm, female", "female reproductive organ cancer"),
# Anemia, Hemolytic, Autoimmune http://www.ebi.ac.uk/efo/EFO_1001264
grepl("EFO_1001264", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Anemia, Hemolytic, Autoimmune", "autoimmune haemolytic anemia"),
TRUE ~ MAPPED_TRAIT
)
) |>
ungroup()
# osteoarthritis, knee ... http://www.ebi.ac.uk/efo/EFO_0004616
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(
grepl("EFO_0004616", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, knee", "knee osteoarthritis"),
MAPPED_TRAIT
)
)
# rheumatoid factor-negative juvenile idiopathic arthritis
# polyarticular juvenile idiopathic arthritis, rheumatoid factor negative
# http://www.ebi.ac.uk/efo/EFO_1002020
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1002020", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = "polyarticular juvenile idiopathic arthritis, rheumatoid factor negative",
"rheumatoid factor-negative juvenile idiopathic arthritis"),
MAPPED_TRAIT)
)
# http://www.ebi.ac.uk/efo/EFO_0007294, hand, foot and mouth disease,
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("EFO_0007294", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = "hand, foot and mouth disease",
"hand foot and mouth disease"),
MAPPED_TRAIT)
)
# susceptibility to migraine without aura http://purl.obolibrary.org/obo/MONDO_0011847
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0011847", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT, pattern = "migraine without aura, susceptibility to, 4", "migraine without aura"),
MAPPED_TRAIT)
)
# neural tube defects, susceptibility to, http://purl.obolibrary.org/obo/MONDO_0020705
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0020705", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = "neural tube defects, susceptibility to",
"neural tube defects"),
MAPPED_TRAIT)
)
# infantile diarrhea
# Diarrhea, Infantile http://www.ebi.ac.uk/efo/EFO_1001306
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1001306", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = "Diarrhea, Infantile",
"infantile diarrhea"),
MAPPED_TRAIT)
)
# self reported traits
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("EFO_0009803|EFO_0009822|EFO_0009817|EFO_0009819|EFO_0009823|EFO_0009824", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = ", self-reported$",
" self-reported"),
MAPPED_TRAIT)
)
# Hodgkins lymphoma, mixed cellularity http://www.ebi.ac.uk/efo/EFO_1002031
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1002031", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT, pattern = "Hodgkins lymphoma, mixed cellularity", "Hodgkins lymphoma mixed cellularity"),
MAPPED_TRAIT)
)
# encephalopathy, acute, infection-induced, http://purl.obolibrary.org/obo/MONDO_0000166
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0000166", MAPPED_TRAIT_URI),
stringr::str_replace_all(MAPPED_TRAIT,
pattern = "encephalopathy, acute, infection-induced",
"encephalopathy acute infection-induced"),
MAPPED_TRAIT)
)
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT = stringr::str_replace_all(MAPPED_TRAIT,
"^level of .+, mitochondrial in blood$", "blood protein amount")
)
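The comma fixes above are applied one `mutate()` call at a time. A more compact alternative, sketched here under the assumption that every fix is a literal (non-regex) string swap keyed on an ontology ID, is to keep the fixes in a lookup table and loop over it. The names `comma_fixes` and `apply_comma_fixes`, the toy data frame and the placeholder URI are hypothetical; only the column names and the three example fixes are taken from the code above.

```r
# Hypothetical lookup table: one row per fix (ontology ID, old label, new label)
comma_fixes <- data.frame(
  uri_id = c("EFO_1000786", "EFO_0004616", "EFO_1001306"),
  old    = c("osteoarthritis, hip", "osteoarthritis, knee", "Diarrhea, Infantile"),
  new    = c("hip osteoarthritis", "knee osteoarthritis", "infantile diarrhea"),
  stringsAsFactors = FALSE
)

# Apply every fix in turn; fixed = TRUE treats IDs and labels as literal strings
apply_comma_fixes <- function(df, fixes) {
  for (i in seq_len(nrow(fixes))) {
    hit <- grepl(fixes$uri_id[i], df$MAPPED_TRAIT_URI, fixed = TRUE)
    df$MAPPED_TRAIT[hit] <- gsub(fixes$old[i], fixes$new[i],
                                 df$MAPPED_TRAIT[hit], fixed = TRUE)
  }
  df
}

# Toy input (not real GWAS Catalog rows; the second URI is a placeholder)
toy <- data.frame(
  MAPPED_TRAIT = c("osteoarthritis, hip, osteoarthritis, knee", "body height"),
  MAPPED_TRAIT_URI = c(
    "http://www.ebi.ac.uk/efo/EFO_1000786, http://www.ebi.ac.uk/efo/EFO_0004616",
    "http://example.org/PLACEHOLDER_0000001"
  ),
  stringsAsFactors = FALSE
)
fixed_toy <- apply_comma_fixes(toy, comma_fixes)
print(fixed_toy$MAPPED_TRAIT)
```

Keeping the fixes in one table makes duplicated or mistyped patterns easier to spot than when they are spread over many separate calls.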
all_gwas_terms = unique(gwas_study_info$MAPPED_TRAIT)
all_gwas_terms = stringr::str_trim(tolower(all_gwas_terms))
efo_descendants <- readLines(here::here("output/trait_ontology/efo_0000408_descendants.txt"))
mondo_descendants <- readLines(here::here("output/trait_ontology/mondo_0700096_descendants.txt"))
ncit_descendants <- readLines(here::here("output/trait_ontology/ncit_C2991_descendants.txt"))
orphanet_descendants <- readLines(here::here("output/trait_ontology/orphanet_557493_descendants.txt"))
age_of_onset_descendants <- readLines(here::here("output/trait_ontology/oba_2020000_descendants.txt"))
disease_measurement_terms <- readLines(here::here("output/trait_ontology/efo_0001444_disease_measurement_terms.txt"))
disease_typos = c("Alzheimer disease",
"late-onset Alzheimers disease",
"age of onset of Alzheimer disease",
"Chagas cardiomyopathy",
"Parkinson disease",
"Iron deficiency anemia",
"Churg-Strauss syndrome",
"Iridocyclitis",
"Phlebitis"
)
other <- c("Allergic disease",
"Lewy body dementia",
"Lewy body attribute",
"non-Hodgkins lymphoma",
"Ischemic Stroke",
"Lung disease",
"Respiratory System Disease",
"Iron deficiency anemia (disorder)",
"Alzheimer disease, APOE carrier status",
"Alzheimer's disease biomarker measurement",
"Genital neoplasm, female",
"HIV-associated neurocognitive disorder",
"encephalopathy acute infection-induced"
)
disease_terms = c(mondo_descendants,
efo_descendants,
ncit_descendants,
orphanet_descendants,
age_of_onset_descendants,
disease_measurement_terms,
disease_typos,
other) |>
unique()
disease_terms = stringr::str_trim(tolower(disease_terms))
print("Number of terms related to disease or disorder")
[1] "Number of terms related to disease or disorder"
length(disease_terms)
[1] 53397
# Find GWAS traits that fall within disease or disorder terms
simple_disease_terms = all_gwas_terms[all_gwas_terms %in% disease_terms]
# Also search for cases where there are multiple terms separated by commas
# and one of them is a disease term
not_simple_disease_terms = all_gwas_terms[!all_gwas_terms %in% disease_terms]
# sometimes there's multiple terms - check if any disease term is in these gwas terms
multiple_terms = grep(",", not_simple_disease_terms, value = TRUE)
disease_chunks <- split(disease_terms, ceiling(seq_along(disease_terms) / 100))
disease_chunks <- lapply(disease_chunks, function(x) paste0(x, collapse = "|"))
mask <- Reduce(`|`, lapply(disease_chunks, function(x) grepl(x, multiple_terms, ignore.case = TRUE)))
additional_disease_gwas <- multiple_terms[mask]
disease_gwas = c(all_gwas_terms[all_gwas_terms %in% disease_terms],
additional_disease_gwas)
not_disease_terms = not_simple_disease_terms[!not_simple_disease_terms %in% additional_disease_gwas]
print("Number of GWAS traits under disease or disorder terms")
[1] "Number of GWAS traits under disease or disorder terms"
length(all_gwas_terms) - length(not_disease_terms)
[1] 3501
print("Percentage of GWAS traits under disease or disorder terms")
[1] "Percentage of GWAS traits under disease or disorder terms"
round(100 * (length(all_gwas_terms) - length(not_disease_terms)) / length(all_gwas_terms),
digits = 1)
[1] 15.3
print("Percentage of GWAS traits not under disease or disorder terms")
[1] "Percentage of GWAS traits not under disease or disorder terms"
round(100 * length(not_disease_terms) / length(all_gwas_terms),
digits = 1)
[1] 84.7
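The chunked search above pastes ontology labels straight into a regular expression. Labels that contain regex metacharacters, such as the parentheses in "iron deficiency anemia (disorder)", are then read as regex syntax and not as literal text. The sketch below shows one way to escape labels before building the alternation. `escape_regex` and the toy vectors are hypothetical and are not part of the analysis.

```r
# Escape regex metacharacters so that each label is matched literally
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\]|\\\\)", "\\\\\\1", x)
}

# Toy vocabulary and toy multi-term traits (illustrative only)
disease_terms_toy <- c("iron deficiency anemia (disorder)", "asthma")
traits_toy <- c("iron deficiency anemia (disorder), hemoglobin measurement",
                "asthma, body mass index",
                "body height, body weight")

raw_pattern  <- paste0(disease_terms_toy, collapse = "|")
safe_pattern <- paste0(escape_regex(disease_terms_toy), collapse = "|")

# Unescaped, "(disorder)" is a capture group, so the first trait is missed
raw_hits  <- grepl(raw_pattern, traits_toy)
# Escaped, the parentheses are matched as literal characters
safe_hits <- grepl(safe_pattern, traits_toy)

print(data.frame(trait = traits_toy, raw_hits, safe_hits))
```

Both patterns still match substrings anywhere in the trait, so a short disease label can also match inside a longer, unrelated term. Whether that is acceptable depends on the vocabulary.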
not_accounted_for = not_disease_terms
pheno_abnorm <- readLines(here::here("output/trait_ontology/hp_0000118_descendants.txt"))
pheno_abnorm = stringr::str_trim(tolower(pheno_abnorm))
# Find terms where all comma-split pieces are phenotypic abnormality terms
pheno_abnorm_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts) # remove extra spaces
all(parts %in% pheno_abnorm)
})
]
additional_pheno_abnorm <- not_accounted_for[not_accounted_for %in% pheno_abnorm]
pheno_abnorm_gwas = c(pheno_abnorm_gwas, additional_pheno_abnorm) |> unique()
print("Percentage of GWAS traits under phenotype abnormality terms")
[1] "Percentage of GWAS traits under phenotype abnormality terms"
round(100 * length(pheno_abnorm_gwas) / length(all_gwas_terms),
digits = 1)
[1] 1.5
not_accounted_for = not_accounted_for[!not_accounted_for %in% pheno_abnorm_gwas]
print("Percentage of GWAS traits not accounted for so far")
[1] "Percentage of GWAS traits not accounted for so far"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 83.2
print("Number of GWAS traits not accounted for so far")
[1] "Number of GWAS traits not accounted for so far"
length(not_accounted_for)
[1] 18990
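The "all comma-split pieces are in a vocabulary" test is repeated for several ontology branches in this analysis. A minimal sketch of a reusable helper follows; `all_parts_in` and the toy vectors are hypothetical names. The helper also guards against empty strings, because `strsplit("", ", ")` returns a zero-length vector and `all()` of a zero-length vector is `TRUE`, so an empty trait would otherwise count as a match.

```r
# TRUE for each trait whose comma-separated pieces are all in `vocabulary`
all_parts_in <- function(traits, vocabulary) {
  pieces <- strsplit(traits, ", ", fixed = TRUE)
  vapply(pieces, function(parts) {
    parts <- trimws(parts)
    # An empty trait yields zero pieces; treat it as "no match"
    length(parts) > 0 && all(parts %in% vocabulary)
  }, logical(1))
}

# Toy vocabulary and traits (illustrative only)
vocab_toy  <- c("headache", "fatigue")
traits_toy <- c("headache", "headache, fatigue", "headache, body height", "")

matched <- traits_toy[all_parts_in(traits_toy, vocab_toy)]
print(matched)
```

A single-term trait is split into one piece, so the helper also covers the plain `%in%` lookup that is otherwise done separately.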
find_disease_terms <- function(MAPPED_TRAIT) {
# find all disease terms that appear in the trait
split_mapped_traits <- stringr::str_split(MAPPED_TRAIT, ", ") |>
unlist()
mapped_disease_terms <- split_mapped_traits[split_mapped_traits %in% disease_terms]
mapped_pheno_abnorm_terms <- split_mapped_traits[split_mapped_traits %in% pheno_abnorm]
mapped_disease_terms = unique(c(mapped_disease_terms,
mapped_pheno_abnorm_terms
)
)
return(paste0(mapped_disease_terms, collapse = ", ")) # combine multiple matches
}
gwas_study_info <-
gwas_study_info |>
dplyr::rowwise() |>
dplyr::mutate(
disease_terms =
ifelse(stringr::str_trim(tolower(MAPPED_TRAIT)) %in% c(disease_gwas,pheno_abnorm_gwas),
find_disease_terms(stringr::str_trim(tolower(MAPPED_TRAIT))),
NA)
)
gwas_study_info <-
gwas_study_info |>
rowwise() |>
dplyr::mutate(
disease_terms =
ifelse(MAPPED_TRAIT == "",
NA,
disease_terms)
) |>
ungroup()
measurement <- readLines(here::here("output/trait_ontology/efo_0001444_descendants.txt"))
total_choles <- readLines(here::here("output/trait_ontology/efo_0004574_descendants.txt"))
measurement <- c(total_choles,
measurement)
measurement <- unique(measurement)
measurement <- c("cerebrospinal fluid composition attribute",
"blood protein amount",
measurement)
measurement = stringr::str_trim(tolower(measurement))
# Find terms where all comma-split pieces are in measurement
measurement_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts)
all(parts %in% measurement)
})
]
additional_measurement <- not_accounted_for[not_accounted_for %in% measurement]
measurement_gwas = c(measurement_gwas, additional_measurement) |> unique()
print("Number of GWAS traits under measurement terms")
[1] "Number of GWAS traits under measurement terms"
length(measurement_gwas)
[1] 18068
print("Percentage of GWAS traits under measurement terms")
[1] "Percentage of GWAS traits under measurement terms"
round(100 * length(measurement_gwas) / length(all_gwas_terms),
digits = 1)
[1] 79.1
not_accounted_for = not_accounted_for[!not_accounted_for %in% measurement_gwas]
print("Percentage of GWAS traits not accounted for by disease, disorder or measurement terms")
[1] "Percentage of GWAS traits not accounted for by disease, disorder or measurement terms"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 4
print("Number of GWAS traits not accounted for by disease, disorder or measurement terms")
[1] "Number of GWAS traits not accounted for by disease, disorder or measurement terms"
length(not_accounted_for)
[1] 921
go_response = readLines(here::here("output/trait_ontology/go_0050896_descendants.txt"))
efo_response <- readLines(here::here("output/trait_ontology/efo_go_0050896_descendants.txt"))
response <- c(go_response,
efo_response,
"response to stimulus")
response <- unique(response)
response = stringr::str_trim(tolower(response))
# Find terms where all comma-split pieces are response terms
response_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts)
all(parts %in% response)
})
]
additional_response <- not_accounted_for[not_accounted_for %in% response]
response_gwas = c(response_gwas, additional_response) |> unique()
print("Percentage of GWAS traits under response terms")
[1] "Percentage of GWAS traits under response terms"
round(100 * length(response_gwas) / length(all_gwas_terms),
digits = 1)
[1] 0.7
not_accounted_for = not_accounted_for[!not_accounted_for %in% response_gwas]
print("Percentage of GWAS traits not accounted for by disease, measurement or response terms")
[1] "Percentage of GWAS traits not accounted for by disease, measurement or response terms"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 3.4
print("Number of GWAS traits not accounted for by disease, measurement or response terms")
[1] "Number of GWAS traits not accounted for by disease, measurement or response terms"
length(not_accounted_for)
[1] 768
mental <- readLines(here::here("output/trait_ontology/efo_0004323_descendants.txt"))
mental = stringr::str_trim(tolower(mental))
mental_gwas = not_accounted_for[not_accounted_for %in% mental]
print("Percentage of GWAS traits under mental process terms")
[1] "Percentage of GWAS traits under mental process terms"
round(100 * length(mental_gwas) / length(all_gwas_terms),
digits = 1)
[1] 0.1
not_accounted_for = not_accounted_for[!not_accounted_for %in% mental_gwas]
print("Percentage of GWAS traits not accounted for thus far")
[1] "Percentage of GWAS traits not accounted for thus far"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 3.3
print("Number of GWAS traits not accounted for thus far")
[1] "Number of GWAS traits not accounted for thus far"
length(not_accounted_for)
[1] 750
behavior <- readLines(here::here("output/trait_ontology/go_0007610_descendants.txt"))
behavior = stringr::str_trim(tolower(behavior))
behavior_gwas = not_accounted_for[not_accounted_for %in% behavior]
print("Percentage of GWAS traits under behavior terms")
[1] "Percentage of GWAS traits under behavior terms"
round(100 * length(behavior_gwas) / length(all_gwas_terms),
digits = 1)
[1] 0.1
not_accounted_for = not_accounted_for[!not_accounted_for %in% behavior_gwas]
print("Percentage of GWAS traits not accounted for so far")
[1] "Percentage of GWAS traits not accounted for so far"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 3.2
print("Number of GWAS traits not accounted for so far")
[1] "Number of GWAS traits not accounted for so far"
length(not_accounted_for)
[1] 727
injury <- readLines(here::here("output/trait_ontology/efo_0000546_descendants.txt"))
injury = stringr::str_trim(tolower(injury))
injury_gwas = not_accounted_for[not_accounted_for %in% injury]
print("Percentage of GWAS traits under injury terms")
[1] "Percentage of GWAS traits under injury terms"
round(100 * length(injury_gwas) / length(all_gwas_terms),
digits = 1)
[1] 0.1
not_accounted_for = not_accounted_for[!not_accounted_for %in% injury_gwas]
print("Percentage of GWAS traits not accounted for so far")
[1] "Percentage of GWAS traits not accounted for so far"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 3.1
print("Number of GWAS traits not accounted for so far")
[1] "Number of GWAS traits not accounted for so far"
length(not_accounted_for)
[1] 708
phenotype <- readLines(here::here("output/trait_ontology/efo_0000651_descendants.txt"))
phenotype = stringr::str_trim(tolower(phenotype))
phenotype_gwas = not_accounted_for[not_accounted_for %in% phenotype]
print("Percentage of GWAS traits under phenotype terms")
[1] "Percentage of GWAS traits under phenotype terms"
round(100 * length(phenotype_gwas) / length(all_gwas_terms),
digits = 1)
[1] 0.2
not_accounted_for = not_accounted_for[!not_accounted_for %in% phenotype_gwas]
print("Percentage of GWAS traits not accounted for so far")
[1] "Percentage of GWAS traits not accounted for so far"
round(100 * length(not_accounted_for) / length(all_gwas_terms),
digits = 1)
[1] 2.9
print("Number of GWAS traits not accounted for so far")
[1] "Number of GWAS traits not accounted for so far"
length(not_accounted_for)
[1] 661
gwas_study_info =
gwas_study_info |>
dplyr::mutate(MAPPED_TRAIT_CATEGORY = dplyr::case_when(is.na(MAPPED_TRAIT) ~ NA,
tolower(MAPPED_TRAIT) %in% disease_gwas ~ "Disease/Disorder",
tolower(MAPPED_TRAIT) %in% pheno_abnorm_gwas ~ "Phenotypic Abnormality",
tolower(MAPPED_TRAIT) %in% measurement_gwas ~ "Measurement",
tolower(MAPPED_TRAIT) %in% response_gwas ~ "Response",
tolower(MAPPED_TRAIT) %in% mental_gwas ~ "Mental Process",
tolower(MAPPED_TRAIT) %in% behavior_gwas ~ "Behavior",
tolower(MAPPED_TRAIT) %in% injury_gwas ~ "Injury",
tolower(MAPPED_TRAIT) %in% phenotype_gwas ~ "Phenotype",
TRUE ~ "Other"
)
)
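In the `case_when()` call above the first matching branch wins, so a trait that appears in more than one term set is labelled by whichever set is tested first. The base R sketch below reproduces that priority rule with an ordered named list. `assign_category`, `category_sets` and the toy traits are hypothetical, and the sketch trims whitespace before matching, which the call above does not do.

```r
# Ordered list of term sets: earlier entries take priority over later ones
category_sets <- list(
  "Disease/Disorder" = c("asthma"),
  "Measurement"      = c("asthma", "body height")
)

assign_category <- function(traits, sets, default = "Other") {
  key <- trimws(tolower(traits))
  out <- rep(default, length(traits))
  # Walk the sets from lowest to highest priority so earlier sets overwrite later ones
  for (label in rev(names(sets))) {
    out[key %in% sets[[label]]] <- label
  }
  out[is.na(traits)] <- NA   # keep missing traits as NA, as in the case_when() call
  out
}

toy_traits <- c("Asthma", "body height", "eye colour", NA)
toy_categories <- assign_category(toy_traits, category_sets)
print(toy_categories)
```

Here "Asthma" is in both toy sets and is labelled "Disease/Disorder" because that set comes first.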
gwas_background = unique(gwas_study_info$MAPPED_BACKGROUND_TRAIT)
gwas_background = stringr::str_trim(tolower(gwas_background))
length(gwas_background)
[1] 314
multiple_terms = grep(",", gwas_background, value = TRUE)
mask <- Reduce(`|`, lapply(disease_terms, function(x) grepl(x, multiple_terms)))
additional_disease_gwas <- multiple_terms[mask]
disease_gwas = c(gwas_background[gwas_background %in% disease_terms],
additional_disease_gwas)
print("Number of background GWAS traits under disease or disorder terms")
[1] "Number of background GWAS traits under disease or disorder terms"
length(disease_gwas)
[1] 229
print("Percentage of background GWAS traits under disease or disorder terms")
[1] "Percentage of background GWAS traits under disease or disorder terms"
round(100 * length(disease_gwas) / length(gwas_background),
digits = 1)
[1] 72.9
not_accounted_for = gwas_background[!gwas_background %in% disease_gwas]
gwas_study_info <-
gwas_study_info |>
rowwise() |>
dplyr::mutate(
background_disease_terms =
ifelse(stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% c(disease_gwas, pheno_abnorm_gwas),
find_disease_terms(stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT))),
NA)
) |>
ungroup()
gwas_study_info <-
gwas_study_info |>
rowwise() |>
dplyr::mutate(
background_disease_terms =
ifelse(MAPPED_BACKGROUND_TRAIT == "",
NA,
background_disease_terms)
) |>
ungroup()
# Find terms where all comma-split pieces are phenotypic abnormality terms
pheno_abnorm_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts) # remove extra spaces
all(parts %in% pheno_abnorm)
})
]
additional_pheno_abnorm <- not_accounted_for[not_accounted_for %in% pheno_abnorm]
pheno_abnorm_gwas = c(pheno_abnorm_gwas, additional_pheno_abnorm) |> unique()
print("Percentage of background GWAS traits under phenotype abnormality terms")
[1] "Percentage of background GWAS traits under phenotype abnormality terms"
round(100 * length(pheno_abnorm_gwas) / length(gwas_background),
digits = 1)
[1] 2.2
not_accounted_for = not_accounted_for[!not_accounted_for %in% pheno_abnorm_gwas]
print("Percentage of background GWAS traits not accounted for so far")
[1] "Percentage of background GWAS traits not accounted for so far"
round(100 * length(not_accounted_for) / length(gwas_background),
digits = 1)
[1] 25.2
print("Number of background GWAS traits not accounted for so far")
[1] "Number of background GWAS traits not accounted for so far"
length(not_accounted_for)
[1] 79
measurement_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts) # remove extra spaces
all(parts %in% measurement)
})
]
# Recompute additional_measurement for the background traits before combining;
# otherwise the stale vector from the mapped-trait analysis is reused
additional_measurement <- not_accounted_for[not_accounted_for %in% measurement]
measurement_gwas = c(measurement_gwas, additional_measurement) |> unique()
not_accounted_for = not_accounted_for[!not_accounted_for %in% measurement_gwas]
print("Percentage of background GWAS traits not accounted for by disease, disorder or measurement terms")
[1] "Percentage of background GWAS traits not accounted for by disease, disorder or measurement terms"
round(100 * length(not_accounted_for) / length(gwas_background),
digits = 1)
[1] 7.6
print("Number of background GWAS traits not accounted for by disease, disorder or measurement terms")
[1] "Number of background GWAS traits not accounted for by disease, disorder or measurement terms"
length(not_accounted_for)
[1] 24
# Find terms where all comma-split pieces are response terms
response_gwas <- not_accounted_for[
sapply(strsplit(not_accounted_for, ", "), function(parts) {
parts <- trimws(parts)
all(parts %in% response)
})
]
additional_response <- not_accounted_for[not_accounted_for %in% response]
response_gwas = c(response_gwas, additional_response) |> unique()
not_accounted_for = not_accounted_for[!not_accounted_for %in% response_gwas]
print("Percentage of background GWAS traits under response terms")
[1] "Percentage of background GWAS traits under response terms"
round(100 * length(response_gwas) / length(gwas_background),
digits = 1)
[1] 1.6
print("Number of background GWAS traits under response terms")
[1] "Number of background GWAS traits under response terms"
length(response_gwas)
[1] 5
print("Number of background GWAS traits not accounted for by disease, measurement or response terms")
[1] "Number of background GWAS traits not accounted for by disease, measurement or response terms"
length(not_accounted_for)
[1] 19
print("Percentage of background GWAS traits not accounted for by disease, measurement or response terms")
[1] "Percentage of background GWAS traits not accounted for by disease, measurement or response terms"
round(100 * length(not_accounted_for) / length(gwas_background),
digits = 1)
[1] 6.1
gwas_study_info =
gwas_study_info |>
dplyr::mutate(BACKGROUND_TRAIT_CATEGORY =
dplyr::case_when(
MAPPED_BACKGROUND_TRAIT == "" ~ NA,
stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% disease_gwas ~ "Disease/Disorder",
stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% pheno_abnorm_gwas ~ "Phenotypic Abnormality",
stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% measurement_gwas ~ "Measurement",
stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% response_gwas ~ "Response",
TRUE ~ "Other")
)
gwas_study_info |>
group_by(MAPPED_TRAIT_CATEGORY, BACKGROUND_TRAIT_CATEGORY) |>
summarise(n_studies = n()) |>
arrange(desc(n_studies))
# A tibble: 33 × 3
# Groups:   MAPPED_TRAIT_CATEGORY [9]
   MAPPED_TRAIT_CATEGORY  BACKGROUND_TRAIT_CATEGORY n_studies
   <chr>                  <chr>                         <int>
 1 Measurement            <NA>                         101836
 2 Disease/Disorder       <NA>                          20314
 3 Measurement            Disease/Disorder              12223
 4 Other                  <NA>                           2725
 5 Phenotypic Abnormality <NA>                           2328
 6 Measurement            Measurement                     969
 7 Disease/Disorder       Disease/Disorder                549
 8 Injury                 <NA>                            508
 9 Phenotype              <NA>                            340
10 Other                  Disease/Disorder                335
# ℹ 23 more rows
gwas_study_info =
gwas_study_info |>
dplyr::rowwise() |>
dplyr::mutate(DISEASE_STUDY =
ifelse(MAPPED_TRAIT_CATEGORY == "Disease/Disorder" |
MAPPED_TRAIT_CATEGORY == "Phenotypic Abnormality" |
BACKGROUND_TRAIT_CATEGORY == "Disease/Disorder" |
BACKGROUND_TRAIT_CATEGORY == "Phenotypic Abnormality",
TRUE, FALSE)
) |>
dplyr::ungroup()
gwas_study_info |>
filter(DISEASE_STUDY == T) |>
nrow()
[1] 36022
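The summary table above shows that `BACKGROUND_TRAIT_CATEGORY` is `NA` for most studies at this point. Comparing with `==` then gives `NA`, and `FALSE | NA` is `NA`, so the flag is missing rather than `FALSE` for non-disease studies without a background trait. `filter()` drops `NA` rows, so the printed count should not be affected, but the stored column is. A sketch of an NA-safe variant using `%in%` is shown below on toy vectors with hypothetical names.

```r
# Toy category vectors (illustrative only)
mapped_cat     <- c("Measurement", "Disease/Disorder", "Measurement")
background_cat <- c(NA, NA, "Disease/Disorder")
disease_cats   <- c("Disease/Disorder", "Phenotypic Abnormality")

# `==` propagates NA: FALSE | NA is NA, TRUE | NA is TRUE
flag_eq <- mapped_cat == "Disease/Disorder" | background_cat == "Disease/Disorder"

# `%in%` never returns NA, so the flag is always TRUE or FALSE
flag_in <- mapped_cat %in% disease_cats | background_cat %in% disease_cats

print(data.frame(mapped_cat, background_cat, flag_eq, flag_in))
```

With `%in%` there is also no need for `rowwise()` or `ifelse()`, since the comparison is already vectorised and returns a logical vector.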
combined_disease_terms = function(MAPPED_TRAIT_1, MAPPED_TRAIT_2){
MAPPED_TRAIT_1 = stringr::str_split(MAPPED_TRAIT_1, ", ") |> unlist()
MAPPED_TRAIT_2 = stringr::str_split(MAPPED_TRAIT_2, ", ") |> unlist()
all_mapped_disease_terms =
c(MAPPED_TRAIT_1, MAPPED_TRAIT_2) |>
unique()
combined_mapped_disease_terms = paste0(all_mapped_disease_terms,
collapse = ", ")
return(combined_mapped_disease_terms)
}
gwas_study_info <-
gwas_study_info |>
dplyr::rowwise() |>
dplyr::mutate(all_disease_terms =
case_when(is.na(background_disease_terms) & is.na(disease_terms) ~ NA,
is.na(background_disease_terms) & !is.na(disease_terms) ~ disease_terms,
!is.na(background_disease_terms) & is.na(disease_terms) ~ background_disease_terms,
!is.na(background_disease_terms) & !is.na(disease_terms) ~
combined_disease_terms(background_disease_terms,
disease_terms))
) |>
dplyr::ungroup()
# What studies are disease studies but have no collected disease terms?
gwas_study_info |>
filter(DISEASE_STUDY == T) |>
filter(all_disease_terms == "") |>
select(MAPPED_TRAIT, MAPPED_TRAIT_CATEGORY) |>
distinct() |>
nrow()
[1] 22
gwas_study_info = gwas_study_info |>
rowwise() |>
mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT == "",
"Other",
MAPPED_TRAIT_CATEGORY)) |>
mutate(BACKGROUND_TRAIT_CATEGORY = ifelse(MAPPED_BACKGROUND_TRAIT == "",
"Other",
BACKGROUND_TRAIT_CATEGORY))
# Fixing fractures, ununited as injury
gwas_study_info = gwas_study_info |>
mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT == "fractures, ununited",
"Injury",
MAPPED_TRAIT_CATEGORY)
)
# Fixing some response terms mislabelled as disease/disorder
other_response_terms = c("response to COVID-19 vaccine, localized superficial swelling, mass, or lump",
"response to COVID-19 vaccine, SARS-CoV-2 neutralizing antibody measurement",
"Anti-hepatitis B virus surface antigen IgG measurement, response to vaccine",
"anti-SARS-CoV-2 IgG measurement, response to COVID-19 vaccine",
"anti-tetanus toxoid IgG measurement, response to vaccine",
"height growth attribute, response to growth hormone",
"adrenal suppression measurement, response to corticosteroid",
"asthenia, response to COVID-19 vaccine",
"SARS-CoV-2 antibody measurement, response to COVID-19 vaccine",
"response to vaccine, anti-Haemophilus influenzae type b polyribosylribitol phosphate IgG measurement"
)
# Fixing some measurement terms mislabelled as disease/disorder
other_measurement_terms = c("amygdala volume, pallidum volume, nucleus accumbens volume, hippocampal volume, putamen volume, intracranial volume measurement, thalamus volume, caudate nucleus volume",
"lamina-associated polypeptide 2, isoforms beta/gamma measurement",
"apoptosis-inducing factor 1, mitochondrial measurement",
"central corneal thickness, intraocular pressure measurement",
"level of apoptosis-inducing factor 1, mitochondrial in blood serum",
"level of MHC class I polypeptide-related sequence A in blood, level of MHC class I polypeptide-related sequence B in blood",
"psychosocial stress measurement, intracranial volume measurement")
# Fixing some other terms mislabelled as disease/disorder
other_terms = c("insulin metabolic clearance rate measurement, disposition index measurement, insulin sensitivity measurement, glucose homeostasis trait, glucose effectiveness measurement, acute insulin response measurement",
"disease recurrence, response to allogeneic hematopoietic stem cell transplant",
"disease recurrence, response to allogeneic hematopoietic stem cell transplant, donor genotype effect measurement"
)
gwas_study_info =
gwas_study_info |>
mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_response_terms,
"Response",
MAPPED_TRAIT_CATEGORY)) |>
mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_measurement_terms,
"Measurement",
MAPPED_TRAIT_CATEGORY)) |>
mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_terms,
"Other",
MAPPED_TRAIT_CATEGORY))
# Prepend a term to an existing comma-separated term string, treating NA and ""
# as "no terms yet" so the result has no stray "NA" and no missing separator
prepend_term = function(term, existing) {
  ifelse(is.na(existing) | existing == "",
         term,
         paste(term, existing, sep = ", "))
}
# spine osteoarthritis was not being picked up as a disease term
gwas_study_info =
  gwas_study_info |>
  mutate(disease_terms = ifelse(grepl("spine osteoarthritis", MAPPED_TRAIT),
                                prepend_term("spine osteoarthritis", disease_terms),
                                disease_terms)
  ) |>
  mutate(all_disease_terms = ifelse(grepl("spine osteoarthritis", MAPPED_TRAIT),
                                    prepend_term("spine osteoarthritis", all_disease_terms),
                                    all_disease_terms)
  )
# liver disease biomarker was not being picked up as a disease term
gwas_study_info =
  gwas_study_info |>
  mutate(disease_terms = ifelse(grepl("liver disease biomarker", MAPPED_TRAIT),
                                prepend_term("liver disease", disease_terms),
                                disease_terms)
  ) |>
  mutate(all_disease_terms = ifelse(grepl("liver disease biomarker", MAPPED_TRAIT),
                                    prepend_term("liver disease", all_disease_terms),
                                    all_disease_terms)
  )
# growth delay was not being picked up as a disease term
gwas_study_info =
  gwas_study_info |>
  mutate(background_disease_terms = ifelse(grepl("Growth delay", MAPPED_BACKGROUND_TRAIT),
                                           prepend_term("growth delay", background_disease_terms),
                                           background_disease_terms)
  ) |>
  mutate(all_disease_terms = ifelse(grepl("Growth delay", MAPPED_BACKGROUND_TRAIT),
                                    prepend_term("growth delay", all_disease_terms),
                                    all_disease_terms)
  )
Now that the categorization mistakes have been corrected and the missing disease terms added, recompute the DISEASE_STUDY flag and confirm that no disease study is left without disease terms.
gwas_study_info =
gwas_study_info |>
dplyr::rowwise() |>
dplyr::mutate(DISEASE_STUDY =
ifelse(MAPPED_TRAIT_CATEGORY == "Disease/Disorder" |
MAPPED_TRAIT_CATEGORY == "Phenotypic Abnormality" |
BACKGROUND_TRAIT_CATEGORY == "Disease/Disorder" |
BACKGROUND_TRAIT_CATEGORY == "Phenotypic Abnormality",
TRUE, FALSE)
) |>
dplyr::ungroup()
gwas_study_info |>
filter(DISEASE_STUDY == T) |>
filter(all_disease_terms == "") |>
select(MAPPED_TRAIT, MAPPED_TRAIT_CATEGORY) |>
distinct() |>
nrow()
[1] 0
data.table::fwrite(gwas_study_info,
here::here("output/gwas_cat/gwas_study_info_trait_cat.csv"),
sep = ",")
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] stringr_1.5.1 ggplot2_3.5.2 dplyr_1.1.4 data.table_1.17.8
[5] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.3.1 renv_1.0.3
[5] promises_1.3.3 tidyselect_1.2.1 Rcpp_1.1.0 git2r_0.36.2
[9] callr_3.7.6 later_1.4.2 jquerylib_0.1.4 scales_1.4.0
[13] yaml_2.3.10 fastmap_1.2.0 here_1.0.1 R6_2.6.1
[17] generics_0.1.4 knitr_1.50 tibble_3.3.0 rprojroot_2.1.0
[21] RColorBrewer_1.1-3 bslib_0.9.0 pillar_1.11.0 rlang_1.1.6
[25] utf8_1.2.6 cachem_1.1.0 stringi_1.8.7 httpuv_1.6.16
[29] xfun_0.52 getPass_0.2-4 fs_1.6.6 sass_0.4.10
[33] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 ps_1.9.1
[37] grid_4.3.1 digest_0.6.37 processx_3.8.6 rstudioapi_0.17.1
[41] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.4 glue_1.8.0
[45] farver_2.1.2 whisker_0.4.1 rmarkdown_2.29 httr_1.4.7
[49] tools_4.3.1 pkgconfig_2.0.3 htmltools_0.5.8.1