Last updated: 2025-09-10

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20220216)

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
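
As a minimal illustration of the point above (toy draws, not part of the analysis), resetting the same seed before a random draw reproduces that draw exactly:

```r
# Toy illustration: the same seed gives the same pseudo-random draw.
set.seed(20220216)
first_draw <- sample(1:100, size = 5)

set.seed(20220216)
second_draw <- sample(1:100, size = 5)

# Both draws are identical because the seed was reset in between
identical(first_draw, second_draw)
```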

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: a7e2f7c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a7e2f7c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/gwas_catalog/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Untracked files:
    Untracked:  analysis/disease_trait_terms_simplify.Rmd
    Untracked:  data/gbd/
    Untracked:  data/who/

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/level_1_disease_group.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Deleted:    analysis/non_ontology_trait_collapse.Rmd
    Deleted:    analysis/trait_ontology_collapse.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/trait_ontology_categorization.Rmd) and HTML (docs/trait_ontology_categorization.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	a7e2f7c	IJbeasley	2025-09-10	Fixing / re-formatting of initial trait categorization

1 Set up

knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE, 
                      warning = FALSE
                      )

library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)

2 Overlap ontology terms and GWAS traits

gwas_study_info <- fread(here::here("output/gwas_study_info_cohort_corrected.csv"))

# Fix term names that themselves contain a comma, so they are not split into several terms later
# gwas_study_info |> 
#  mutate(n_commas_trait = stringr::str_count(MAPPED_TRAIT, ", "),
#         n_commas_uri = stringr::str_count(MAPPED_TRAIT_URI, ",")) |>
#   filter(n_commas_trait != n_commas_uri)


gwas_study_info =
gwas_study_info |> 
  rowwise() |>
  mutate(MAPPED_TRAIT = case_when(
                         # osteoarthritis, hip ... http://www.ebi.ac.uk/efo/EFO_1000786
                         grepl("EFO_1000786", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, hip", "hip osteoarthritis"),
                         #  osteoarthritis, hand ... http://www.ebi.ac.uk/efo/EFO_1000789
                         grepl("EFO_1000789", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, hand", "hand osteoarthritis"),
                         #  osteoarthritis, spine ... http://www.ebi.ac.uk/efo/EFO_1000787
                         grepl("EFO_1000787", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, spine", "spine osteoarthritis"),
                         #  Hepatitis, Alcoholic, http://www.ebi.ac.uk/efo/EFO_1001345
                         grepl("EFO_1001345", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Hepatitis, Alcoholic", "Alcoholic Hepatitis"),
                         # psoriasis 14, pustular http://purl.obolibrary.org/obo/MONDO_0013626
                         grepl("MONDO_0013626", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "psoriasis 14, pustular", "pustular psoriasis 14"),
                         # hypertension, pregnancy-induced http://purl.obolibrary.org/obo/MONDO_0024664
                         grepl("MONDO_0024664", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "hypertension, pregnancy-induced", "pregnancy-induced hypertension"),
                         # renal agenesis, unilateral http://purl.obolibrary.org/obo/MONDO_0019636
                         grepl("MONDO_0019636", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "renal agenesis, unilateral", "unilateral renal agenesis"),
                         #  Cholecystitis, Acute http://www.ebi.ac.uk/efo/EFO_1001289
                         grepl("EFO_1001289", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Cholecystitis, Acute", "Acute Cholecystitis"),
                         #  Genital neoplasm, female http://www.ebi.ac.uk/efo/EFO_1001331
                         grepl("EFO_1001331", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Genital neoplasm, female", "female reproductive organ cancer"),
                         #  Anemia, Hemolytic, Autoimmune http://www.ebi.ac.uk/efo/EFO_1001264
                          grepl("EFO_1001264", MAPPED_TRAIT_URI) ~ stringr::str_replace_all(MAPPED_TRAIT, pattern = "Anemia, Hemolytic, Autoimmune", "autoimmune haemolytic anemia"),
                         TRUE ~ MAPPED_TRAIT
                         )
  ) |>
  ungroup()

# osteoarthritis, knee ... http://www.ebi.ac.uk/efo/EFO_0004616
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(
                         grepl("EFO_0004616", MAPPED_TRAIT_URI),
                         stringr::str_replace_all(MAPPED_TRAIT, pattern = "osteoarthritis, knee", "knee osteoarthritis"),
                         MAPPED_TRAIT
  )
  ) 

# rheumatoid factor-negative juvenile idiopathic arthritis
# polyarticular juvenile idiopathic arthritis, rheumatoid factor negative
# http://www.ebi.ac.uk/efo/EFO_1002020
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1002020", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = "polyarticular juvenile idiopathic arthritis, rheumatoid factor negative", 
                                                        "rheumatoid factor-negative juvenile idiopathic arthritis"),
                               MAPPED_TRAIT)
         )

# http://www.ebi.ac.uk/efo/EFO_0007294, hand, foot and mouth disease,
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("EFO_0007294", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = "hand, foot and mouth disease", 
                                                        "hand foot and mouth disease"),
                               MAPPED_TRAIT)
         )

# susceptibility to migraine without aura http://purl.obolibrary.org/obo/MONDO_0011847 

gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0011847", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, pattern = "migraine without aura, susceptibility to, 4", "migraine without aura"),
                               MAPPED_TRAIT)
         )

#   neural tube defects, susceptibility to, http://purl.obolibrary.org/obo/MONDO_0020705
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0020705", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = "neural tube defects, susceptibility to", 
                                                        "neural tube defects"),
                               MAPPED_TRAIT)
         )

# infantile diarrhea
# Diarrhea, Infantile http://www.ebi.ac.uk/efo/EFO_1001306
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1001306", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = "Diarrhea, Infantile", 
                                                        "infantile diarrhea"),
                               MAPPED_TRAIT)
         )

# self reported traits 
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("EFO_0009803|EFO_0009822|EFO_0009817|EFO_0009819|EFO_0009823|EFO_0009824", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = ", self-reported$", 
                                                        " self-reported"),
                               MAPPED_TRAIT)
         )


# Hodgkins lymphoma, mixed cellularity http://www.ebi.ac.uk/efo/EFO_1002031
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("EFO_1002031", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, pattern = "Hodgkins lymphoma, mixed cellularity", "Hodgkins lymphoma mixed cellularity"),
                               MAPPED_TRAIT)
         )

# encephalopathy, acute, infection-induced, http://purl.obolibrary.org/obo/MONDO_0000166
gwas_study_info =
gwas_study_info |>
  mutate(MAPPED_TRAIT = ifelse(grepl("MONDO_0000166", MAPPED_TRAIT_URI),
                               stringr::str_replace_all(MAPPED_TRAIT, 
                                                        pattern = "encephalopathy, acute, infection-induced", 
                                                        "encephalopathy acute infection-induced"),
                               MAPPED_TRAIT)
         )


gwas_study_info = 
  gwas_study_info |>
  mutate(MAPPED_TRAIT = stringr::str_replace_all(MAPPED_TRAIT, 
                                                 "^level of .+, mitochondrial in blood$", "blood protein amount")
         )

all_gwas_terms = unique(gwas_study_info$MAPPED_TRAIT)

all_gwas_terms = stringr::str_trim(tolower(all_gwas_terms))
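
The renaming above matters because MAPPED_TRAIT can list several terms separated by ", ", so a comma inside a single term name would later be read as a separator. A minimal base R sketch with a made-up trait string (the string and the `rename_map` lookup are illustrative assumptions, not taken from the data):

```r
# Hypothetical trait string: two terms, one of which contains a comma
trait <- "osteoarthritis, hip, asthma"

# A naive split treats the comma inside the first term as a separator
naive_terms <- strsplit(trait, ", ", fixed = TRUE)[[1]]
naive_terms

# Illustrative lookup: comma-containing name -> comma-free name
rename_map <- c("osteoarthritis, hip" = "hip osteoarthritis")

fixed_trait <- trait
for (old in names(rename_map)) {
  fixed_trait <- gsub(old, rename_map[[old]], fixed_trait, fixed = TRUE)
}

# After renaming, splitting recovers the two intended terms
fixed_terms <- strsplit(fixed_trait, ", ", fixed = TRUE)[[1]]
fixed_terms
```

The naive split yields three pieces, while the renamed string splits into the two intended terms.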

2.1 Disease Overlap (How many GWAS traits fall within disease or disorder terms?)

2.1.1 Combine disease terms

efo_descendants <- readLines(here::here("output/trait_ontology/efo_0000408_descendants.txt"))

mondo_descendants <- readLines(here::here("output/trait_ontology/mondo_0700096_descendants.txt"))

ncit_descendants <- readLines(here::here("output/trait_ontology/ncit_C2991_descendants.txt"))

orphanet_descendants <- readLines(here::here("output/trait_ontology/orphanet_557493_descendants.txt"))

age_of_onset_descendants <- readLines(here::here("output/trait_ontology/oba_2020000_descendants.txt"))

disease_measurement_terms <- readLines(here::here("output/trait_ontology/efo_0001444_disease_measurement_terms.txt"))

disease_typos = c("Alzheimer disease",
                  "late-onset Alzheimers disease",
                  "age of onset of Alzheimer disease",
                  "Chagas cardiomyopathy",
                  "Parkinson disease",
                  "Iron deficiency anemia",
                  "Churg-Strauss syndrome",
                  "Iridocyclitis",
                  "Phlebitis" 
                  )

other <- c("Allergic disease", 
  "Lewy body dementia",
  "Lewy body attribute",
 "non-Hodgkins lymphoma",
           "Ischemic Stroke",
           "Lung disease",
           "Respiratory System Disease",
  "Iron deficiency anemia (disorder)",
  "Alzheimer disease, APOE carrier status",
  "Alzheimer's disease biomarker measurement",
 "Genital neoplasm, female",
 "HIV-associated neurocognitive disorder",
 "encephalopathy acute infection-induced"
           )

disease_terms = c(mondo_descendants,
                  efo_descendants,
                  ncit_descendants,
                  orphanet_descendants,
                  age_of_onset_descendants,
                  disease_measurement_terms,
                  disease_typos,
                  other) |>
                 unique()


disease_terms = stringr::str_trim(tolower(disease_terms))

print("Number of terms related to disease or disorder")

[1] "Number of terms related to disease or disorder"

length(disease_terms)

[1] 53397
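
Note that `unique()` is applied above before lower-casing and trimming, so terms that differ only in case or surrounding whitespace stay as separate entries, and the count may include entries that are duplicates after normalisation. Membership tests with `%in%` are not affected. A small sketch, with a temporary file standing in for one of the descendant lists (the file contents are invented):

```r
# Temporary file standing in for a *_descendants.txt list (invented contents)
toy_file <- tempfile(fileext = ".txt")
writeLines(c("Asthma", "asthma ", "Psoriasis"), toy_file)

toy_terms <- readLines(toy_file)

# unique() first, then normalise: both spellings of asthma survive
before <- trimws(tolower(unique(toy_terms)))
length(before)

# normalise first, then unique(): the duplicates collapse
after <- unique(trimws(tolower(toy_terms)))
length(after)

unlink(toy_file)
```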

# Find GWAS traits that fall within disease or disorder terms
simple_disease_terms = all_gwas_terms[all_gwas_terms %in% disease_terms]

# Also search for cases where there are multiple terms separated by commas
# and one of them is a disease term
not_simple_disease_terms = all_gwas_terms[!all_gwas_terms %in% disease_terms]

# sometimes there's multiple terms - check if any disease term is in these gwas terms
multiple_terms = grep(",", not_simple_disease_terms, value = TRUE)

# Combine the disease terms into regex alternations of 100 terms each
disease_chunks <- split(disease_terms, ceiling(seq_along(disease_terms) / 100))
disease_chunks  <- lapply(disease_chunks, function(x) paste0(x, collapse = "|"))
mask <- Reduce(`|`, lapply(disease_chunks, function(x) grepl(x, multiple_terms, ignore.case = TRUE)))
additional_disease_gwas <- multiple_terms[mask]

disease_gwas = c(all_gwas_terms[all_gwas_terms %in% disease_terms],
                 additional_disease_gwas)

not_disease_terms = not_simple_disease_terms[!not_simple_disease_terms %in% additional_disease_gwas]
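
The chunked search above can be sketched on toy data. The vectors below are invented for illustration, and `chunk_size` is 2 rather than 100 so that more than one chunk is formed. Each term is used as a regular expression, so a term containing characters such as `(` is interpreted as a pattern rather than literal text. The `escape_regex` helper at the end is one possible way to handle that; it is an assumption for illustration and is not something the analysis above does.

```r
toy_terms  <- c("asthma", "type 2 diabetes mellitus", "psoriasis")
toy_traits <- c("asthma, body mass index",
                "body height, heel bone mineral density",
                "psoriasis, c-reactive protein measurement")

# Group the terms into chunks and build one alternation pattern per chunk
chunk_size <- 2
chunks   <- split(toy_terms, ceiling(seq_along(toy_terms) / chunk_size))
patterns <- vapply(chunks, paste0, character(1), collapse = "|")

# A trait is kept if any chunk pattern matches it
toy_mask <- Reduce(`|`, lapply(patterns, function(p) {
  grepl(p, toy_traits, ignore.case = TRUE)
}))
toy_traits[toy_mask]

# Hypothetical helper: backslash-escape regex metacharacters in a term
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

raw_term   <- "iron deficiency anemia (disorder)"
trait_text <- "iron deficiency anemia (disorder), ferritin measurement"

# Unescaped, the parentheses form a regex group and the literal text is missed
unescaped_hit <- grepl(raw_term, trait_text)
# Escaped, the term matches itself literally
escaped_hit <- grepl(escape_regex(raw_term), trait_text)
```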

print("Number of GWAS traits under disease or disorder terms")

[1] "Number of GWAS traits under disease or disorder terms"

length(all_gwas_terms) - length(not_disease_terms)

[1] 3501

print("Percentage of GWAS traits under disease or disorder terms")

[1] "Percentage of GWAS traits under disease or disorder terms"

round(100 * (length(all_gwas_terms) - length(not_disease_terms)) / length(all_gwas_terms),
      digits = 1)

[1] 15.3

print("Percentage of GWAS traits not under disease or disorder terms")

[1] "Percentage of GWAS traits not under disease or disorder terms"

round(100 * length(not_disease_terms) / length(all_gwas_terms),
      digits = 1)

[1] 84.7

not_accounted_for = not_disease_terms

2.2 Phenotype abnormality overlap

pheno_abnorm <- readLines(here::here("output/trait_ontology/hp_0000118_descendants.txt"))
pheno_abnorm = stringr::str_trim(tolower(pheno_abnorm))

# Find terms where all comma-split pieces are phenotype abnormality terms
pheno_abnorm_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts) # remove extra spaces
    all(parts %in% pheno_abnorm)
  })
]

additional_pheno_abnorm <- not_accounted_for[not_accounted_for %in% pheno_abnorm]

pheno_abnorm_gwas  = c(pheno_abnorm_gwas, additional_pheno_abnorm) |> unique()
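
The "all comma-split pieces" rule used here can be written as a small helper. This is a sketch on invented vectors; `all_parts_in` is a hypothetical name, not a function defined in this analysis.

```r
# TRUE for each trait whose comma-separated pieces are all in `term_set`
all_parts_in <- function(traits, term_set) {
  vapply(strsplit(traits, ", ", fixed = TRUE),
         function(parts) all(trimws(parts) %in% term_set),
         logical(1))
}

toy_set    <- c("seizure", "hearing impairment")
toy_traits <- c("seizure",
                "seizure, hearing impairment",
                "seizure, body height")

toy_hits <- toy_traits[all_parts_in(toy_traits, toy_set)]
toy_hits

# Edge case: an empty string splits into zero pieces, and all() of an
# empty logical vector is TRUE, so "" passes the test
empty_passes <- all_parts_in("", toy_set)
```

Because of that edge case, an empty trait string counts as a match under this rule; filtering out empty strings beforehand avoids it if that is not intended.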

print("Percentage of GWAS traits under phenotype abnormality terms")

[1] "Percentage of GWAS traits under phenotype abnormality terms"

round(100 * length(pheno_abnorm_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 1.5

not_accounted_for = not_accounted_for[!not_accounted_for %in% pheno_abnorm_gwas]

print("Percentage of GWAS traits not accounted for so far")

[1] "Percentage of GWAS traits not accounted for so far"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 83.2

print("Number of GWAS traits not accounted for so far")

[1] "Number of GWAS traits not accounted for so far"

length(not_accounted_for)

[1] 18990

2.3 Add disease & phenotypic abnormality terms to GWAS study info dataset

find_disease_terms  <-   function(MAPPED_TRAIT) {
        # find all disease terms that appear in the trait
        split_mapped_traits <- stringr::str_split(MAPPED_TRAIT, ", ") |> 
                               unlist()
        
        mapped_disease_terms <- split_mapped_traits[split_mapped_traits %in% disease_terms]
        mapped_pheno_abnorm_terms <- split_mapped_traits[split_mapped_traits %in% pheno_abnorm]
        
        mapped_disease_terms = unique(c(mapped_disease_terms, 
                                        mapped_pheno_abnorm_terms
                                        )
        )
                                      
        return(paste0(mapped_disease_terms, collapse = ", "))  # combine multiple matches
        
    }
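
For illustration, the same logic can be exercised on toy term sets. `find_terms_sketch` is a hypothetical stand-alone variant that takes the term sets as arguments, the two toy vectors are invented stand-ins for the ontology-derived sets, and base `strsplit()` replaces `stringr::str_split()` so the sketch has no package dependencies.

```r
find_terms_sketch <- function(trait, disease_set, abnormality_set) {
  # split the trait string into its comma-separated pieces
  parts <- strsplit(trait, ", ", fixed = TRUE)[[1]]
  # keep pieces found in either set: disease matches first, then abnormalities
  matched <- unique(c(parts[parts %in% disease_set],
                      parts[parts %in% abnormality_set]))
  paste0(matched, collapse = ", ")
}

toy_disease     <- c("asthma", "psoriasis")
toy_abnormality <- c("seizure")

with_matches <- find_terms_sketch("asthma, body height, seizure",
                                  toy_disease, toy_abnormality)
no_matches   <- find_terms_sketch("body height",
                                  toy_disease, toy_abnormality)
with_matches
no_matches
```

When no piece matches, the result is an empty string rather than NA, which is worth keeping in mind when the column is filtered later.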

gwas_study_info <- 
  gwas_study_info |> 
  dplyr::rowwise() |>
  dplyr::mutate(
    disease_terms = 
      ifelse(stringr::str_trim(tolower(MAPPED_TRAIT)) %in% c(disease_gwas,pheno_abnorm_gwas),
             find_disease_terms(stringr::str_trim(tolower(MAPPED_TRAIT))),
             NA)
  )

gwas_study_info <- 
  gwas_study_info |>
  rowwise() |>
  dplyr::mutate(
   disease_terms = 
      ifelse(MAPPED_TRAIT == "",
             NA,
            disease_terms)
  ) |>
  ungroup()

2.4 Measurement Overlap (how many GWAS traits fall within measurement terms?)

2.4.1 Combine measurement

measurement <- readLines(here::here("output/trait_ontology/efo_0001444_descendants.txt"))
total_choles <- readLines(here::here("output/trait_ontology/efo_0004574_descendants.txt"))

measurement <- c(total_choles,
                 measurement) 

measurement <- unique(measurement)

measurement <- c("cerebrospinal fluid composition attribute",
                 "blood protein amount",
                 measurement)

measurement = stringr::str_trim(tolower(measurement))

# Find terms where all comma-split pieces are in measurement
measurement_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts)
    all(parts %in% measurement)
  })
]
additional_measurement <- not_accounted_for[not_accounted_for %in% measurement]

measurement_gwas  = c(measurement_gwas, additional_measurement) |> unique()

print("Number of GWAS traits under measurement terms")

[1] "Number of GWAS traits under measurement terms"

length(measurement_gwas)

[1] 18068

print("Percentage of GWAS traits under measurement terms")

[1] "Percentage of GWAS traits under measurement terms"

round(100 * length(measurement_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 79.1

not_accounted_for = not_accounted_for[!not_accounted_for %in% measurement_gwas]

print("Percentage of GWAS traits not accounted for by disease, disorder or measurement terms")

[1] "Percentage of GWAS traits not accounted for by disease, disorder or measurement terms"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 4

print("Number of GWAS traits not accounted for by disease, disorder or measurement terms")

[1] "Number of GWAS traits not accounted for by disease, disorder or measurement terms"

length(not_accounted_for)

[1] 921

2.5 Response to stimulus

2.5.1 Combine response terms

go_response = readLines(here::here("output/trait_ontology/go_0050896_descendants.txt"))

efo_response <- readLines(here::here("output/trait_ontology/efo_go_0050896_descendants.txt"))

response <- c(go_response,
              efo_response,
              "response to stimulus")

response <- unique(response)

response = stringr::str_trim(tolower(response))

# Find terms where all comma-split pieces are response terms
response_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts)
    all(parts %in% response)
  })
]
additional_response <- not_accounted_for[not_accounted_for %in% response]

response_gwas  = c(response_gwas, additional_response) |> unique()

print("Percentage of GWAS traits under response terms")

[1] "Percentage of GWAS traits under response terms"

round(100 * length(response_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 0.7

not_accounted_for = not_accounted_for[!not_accounted_for %in% response_gwas]

print("Percentage of GWAS traits not accounted for by disease, measurement or response terms")

[1] "Percentage of GWAS traits not accounted for by disease, measurement or response terms"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 3.4

print("Number of GWAS traits not accounted for by disease, measurement or response terms")

[1] "Number of GWAS traits not accounted for by disease, measurement or response terms"

length(not_accounted_for)

[1] 768

2.6 Mental process

mental <- readLines(here::here("output/trait_ontology/efo_0004323_descendants.txt"))
mental = stringr::str_trim(tolower(mental))

mental_gwas = not_accounted_for[not_accounted_for %in% mental]

print("Percentage of GWAS traits under mental process terms")

[1] "Percentage of GWAS traits under mental process terms"

round(100 * length(mental_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 0.1

not_accounted_for = not_accounted_for[!not_accounted_for %in% mental_gwas]

print("Percentage of GWAS traits not accounted for thus far")

[1] "Percentage of GWAS traits not accounted for thus far"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 3.3

print("Number of GWAS traits not accounted for thus far")

[1] "Number of GWAS traits not accounted for thus far"

length(not_accounted_for)

[1] 750

2.7 Behavior

behavior <- readLines(here::here("output/trait_ontology/go_0007610_descendants.txt"))
behavior = stringr::str_trim(tolower(behavior))

behavior_gwas = not_accounted_for[not_accounted_for %in% behavior]

print("Percentage of GWAS traits under behavior terms")

[1] "Percentage of GWAS traits under behavior terms"

round(100 * length(behavior_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 0.1

not_accounted_for = not_accounted_for[!not_accounted_for %in% behavior_gwas]

print("Percentage of GWAS traits not accounted for so far")

[1] "Percentage of GWAS traits not accounted for so far"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 3.2

print("Number of GWAS traits not accounted for so far")

[1] "Number of GWAS traits not accounted for so far"

length(not_accounted_for)

[1] 727

2.8 Injury

injury <- readLines(here::here("output/trait_ontology/efo_0000546_descendants.txt"))

injury = stringr::str_trim(tolower(injury))

injury_gwas = not_accounted_for[not_accounted_for %in% injury]

print("Percentage of GWAS traits under injury terms")

[1] "Percentage of GWAS traits under injury terms"

round(100 * length(injury_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 0.1

not_accounted_for = not_accounted_for[!not_accounted_for %in% injury_gwas]

print("Percentage of GWAS traits not accounted for so far")

[1] "Percentage of GWAS traits not accounted for so far"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 3.1

print("Number of GWAS traits not accounted for so far")

[1] "Number of GWAS traits not accounted for so far"

length(not_accounted_for)

[1] 708

2.9 Phenotype

phenotype <- readLines(here::here("output/trait_ontology/efo_0000651_descendants.txt"))

phenotype = stringr::str_trim(tolower(phenotype))

phenotype_gwas = not_accounted_for[not_accounted_for %in% phenotype]

print("Percentage of GWAS traits under phenotype terms")

[1] "Percentage of GWAS traits under phenotype terms"

round(100 * length(phenotype_gwas) / length(all_gwas_terms),
      digits = 1)

[1] 0.2

not_accounted_for = not_accounted_for[!not_accounted_for %in% phenotype_gwas]

print("Percentage of GWAS traits not accounted for so far")

[1] "Percentage of GWAS traits not accounted for so far"

round(100 * length(not_accounted_for) / length(all_gwas_terms),
      digits = 1)

[1] 2.9

print("Number of GWAS traits not accounted for so far")

[1] "Number of GWAS traits not accounted for so far"

length(not_accounted_for)

[1] 661

3 Add Categories to GWAS Info

gwas_study_info = 
gwas_study_info |>
  dplyr::mutate(MAPPED_TRAIT_CATEGORY = dplyr::case_when(is.na(MAPPED_TRAIT) ~ NA,
                                                         tolower(MAPPED_TRAIT) %in% disease_gwas ~ "Disease/Disorder",
                                                         tolower(MAPPED_TRAIT) %in% pheno_abnorm_gwas ~ "Phenotypic Abnormality",
                                                         tolower(MAPPED_TRAIT) %in% measurement_gwas ~ "Measurement",
                                                         tolower(MAPPED_TRAIT) %in% response_gwas ~ "Response",
                                                         tolower(MAPPED_TRAIT) %in% mental_gwas ~ "Mental Process",
                                                         tolower(MAPPED_TRAIT) %in% behavior_gwas ~ "Behavior",
                                                         tolower(MAPPED_TRAIT) %in% injury_gwas ~ "Injury",
                                                         tolower(MAPPED_TRAIT) %in% phenotype_gwas ~ "Phenotype",
                                                          TRUE ~ "Other"
                                                          )
                )
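
Because `case_when()` returns the label of the first condition that is TRUE, a trait present in more than one term set takes the earliest category listed. The base R sketch below imitates that first-match rule on invented sets; `first_category` is a hypothetical helper and is not how the column above is built.

```r
# Return the name of the first set that contains `trait`, else `default`
first_category <- function(trait, category_sets, default = "Other") {
  for (category in names(category_sets)) {
    if (trait %in% category_sets[[category]]) return(category)
  }
  default
}

# Invented sets: "asthma" deliberately appears in both
toy_sets <- list(
  "Disease/Disorder" = c("asthma"),
  "Measurement"      = c("body height", "asthma")
)

toy_labels <- vapply(c("asthma", "body height", "handedness"),
                     first_category, character(1),
                     category_sets = toy_sets, USE.NAMES = FALSE)
toy_labels
```

Here "asthma" is labelled "Disease/Disorder" even though it also sits in the measurement set, because that category is checked first.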

4 Background traits

gwas_background = unique(gwas_study_info$MAPPED_BACKGROUND_TRAIT)

gwas_background = stringr::str_trim(tolower(gwas_background))

length(gwas_background)

[1] 314

4.1 Overlap with disease/disorder traits

multiple_terms = grep(",", gwas_background, value = TRUE)
mask <- Reduce(`|`, lapply(disease_terms, function(x) grepl(x, multiple_terms)))
additional_disease_gwas <- multiple_terms[mask]

disease_gwas = c(gwas_background[gwas_background %in% disease_terms],
                 additional_disease_gwas)

print("Number of background GWAS traits under disease or disorder terms")

[1] "Number of background GWAS traits under disease or disorder terms"

length(disease_gwas)

[1] 229

print("Percentage of background GWAS traits under disease or disorder terms")

[1] "Percentage of background GWAS traits under disease or disorder terms"

round(100 * length(disease_gwas) / length(gwas_background),
      digits = 1)

[1] 72.9

not_accounted_for = gwas_background[!gwas_background %in% disease_gwas]

gwas_study_info <- 
  gwas_study_info |>
  rowwise() |>
  dplyr::mutate(
    background_disease_terms = 
      ifelse(stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% c(disease_gwas, pheno_abnorm_gwas),
             find_disease_terms(stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT))),
             NA)
  ) |>
  ungroup()


gwas_study_info <- 
  gwas_study_info |>
  rowwise() |>
  dplyr::mutate(
    background_disease_terms = 
      ifelse(MAPPED_BACKGROUND_TRAIT == "",
             NA,
             background_disease_terms)
  ) |>
  ungroup()

4.2 Phenotype abnormality overlap

# Find terms where all comma-split pieces are phenotype abnormality terms
pheno_abnorm_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts) # remove extra spaces
    all(parts %in% pheno_abnorm)
  })
]

additional_pheno_abnorm <- not_accounted_for[not_accounted_for %in% pheno_abnorm]

pheno_abnorm_gwas  = c(pheno_abnorm_gwas, additional_pheno_abnorm) |> unique()

print("Percentage of background GWAS traits under phenotype abnormality terms")

[1] "Percentage of background GWAS traits under phenotype abnormality terms"

round(100 * length(pheno_abnorm_gwas) / length(gwas_background),
      digits = 1)

[1] 2.2

not_accounted_for = not_accounted_for[!not_accounted_for %in% pheno_abnorm_gwas]

print("Percentage of background GWAS traits not accounted for so far")

[1] "Percentage of background GWAS traits not accounted for so far"

round(100 * length(not_accounted_for) / length(gwas_background),
      digits = 1)

[1] 25.2

print("Number of background GWAS traits not accounted for so far")

[1] "Number of background GWAS traits not accounted for so far"

length(not_accounted_for)

[1] 79

4.3 Measurement traits

measurement_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts) # remove extra spaces
    all(parts %in% measurement)
  })
]

additional_measurement <- not_accounted_for[not_accounted_for %in% measurement]

measurement_gwas  = c(measurement_gwas, additional_measurement) |> unique()

not_accounted_for = not_accounted_for[!not_accounted_for %in% measurement_gwas]

print("Percentage of background GWAS traits not accounted for by disease, disorder or measurement terms")

[1] "Percentage of background GWAS traits not accounted for by disease, disorder or measurement terms"

round(100 * length(not_accounted_for) / length(gwas_background),
      digits = 1)

[1] 7.6

print("Number of background GWAS traits not accounted for by disease, disorder or measurement terms")

[1] "Number of background GWAS traits not accounted for by disease, disorder or measurement terms"

length(not_accounted_for)

[1] 24

4.4 Response traits

# Find terms where all comma-split pieces are response terms
response_gwas <- not_accounted_for[
  sapply(strsplit(not_accounted_for, ", "), function(parts) {
    parts <- trimws(parts)
    all(parts %in% response)
  })
]
additional_response <- not_accounted_for[not_accounted_for %in% response]

response_gwas  = c(response_gwas, additional_response) |> unique()

not_accounted_for = not_accounted_for[!not_accounted_for %in% response_gwas]

print("Percentage of background GWAS traits under response terms")

[1] "Percentage of background GWAS traits under response terms"

round(100 * length(response_gwas) / length(gwas_background),
      digits = 1)

[1] 1.6

print("Number of background GWAS traits under response terms")

[1] "Number of background GWAS traits under response terms"

length(response_gwas)

[1] 5

print("Number of background GWAS traits not accounted for by disease, measurement or response terms")

[1] "Number of background GWAS traits not accounted for by disease, measurement or response terms"

length(not_accounted_for)

[1] 19

print("Percentage of background GWAS traits not accounted for by disease, measurement or response terms")

[1] "Percentage of background GWAS traits not accounted for by disease, measurement or response terms"

round(100 * length(not_accounted_for) / length(gwas_background),
      digits = 1)

[1] 6.1

4.5 Background trait categories

gwas_study_info = 
gwas_study_info |>
  dplyr::mutate(BACKGROUND_TRAIT_CATEGORY = 
                   dplyr::case_when(
                                      MAPPED_BACKGROUND_TRAIT == "" ~ NA,
                                      stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% disease_gwas ~ "Disease/Disorder",
                                      stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% pheno_abnorm_gwas ~ "Phenotypic Abnormality",
                                      stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% measurement_gwas ~ "Measurement",
                                      stringr::str_trim(tolower(MAPPED_BACKGROUND_TRAIT)) %in% response_gwas ~ "Response",
                                      TRUE ~ "Other")
                )
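Because `case_when()` returns the value of the first condition that matches, the order of the branches above decides the category of any term that appears in more than one list. The sketch below shows this on hypothetical vectors (`toy_disease`, `toy_measurement` and `toy_background` are made up for illustration) and assumes dplyr is installed. It uses `trimws()` in place of `stringr::str_trim()` to keep dependencies small.

```r
# Toy term lists (hypothetical); "asthma" deliberately appears in both
toy_disease <- c("asthma")
toy_measurement <- c("asthma", "body height")
toy_background <- c("", " Asthma", "body height", "smoking status")

# Normalise case and whitespace before matching, as in the chunk above
key <- trimws(tolower(toy_background))

toy_category <- dplyr::case_when(
  toy_background == ""     ~ NA_character_,      # no background trait recorded
  key %in% toy_disease     ~ "Disease/Disorder", # checked first, so it wins
  key %in% toy_measurement ~ "Measurement",
  TRUE                     ~ "Other"
)

toy_category
```

Here " Asthma" is labelled "Disease/Disorder" even though it is also in the measurement list, because the disease branch is evaluated first.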

5 Summary of number of disease studies (and studies of each kind of trait)

gwas_study_info |>
  group_by(MAPPED_TRAIT_CATEGORY, BACKGROUND_TRAIT_CATEGORY) |>
  summarise(n_studies = n()) |> 
  arrange(desc(n_studies))

# A tibble: 33 × 3
# Groups:   MAPPED_TRAIT_CATEGORY [9]
   MAPPED_TRAIT_CATEGORY  BACKGROUND_TRAIT_CATEGORY n_studies
   <chr>                  <chr>                         <int>
 1 Measurement            <NA>                         101836
 2 Disease/Disorder       <NA>                          20314
 3 Measurement            Disease/Disorder              12223
 4 Other                  <NA>                           2725
 5 Phenotypic Abnormality <NA>                           2328
 6 Measurement            Measurement                     969
 7 Disease/Disorder       Disease/Disorder                549
 8 Injury                 <NA>                            508
 9 Phenotype              <NA>                            340
10 Other                  Disease/Disorder                335
# ℹ 23 more rows

gwas_study_info = 
gwas_study_info |>
  dplyr::rowwise() |>
  dplyr::mutate(DISEASE_STUDY = 
                   ifelse(MAPPED_TRAIT_CATEGORY == "Disease/Disorder" | 
                          MAPPED_TRAIT_CATEGORY == "Phenotypic Abnormality" |  
                          BACKGROUND_TRAIT_CATEGORY == "Disease/Disorder" | 
                          BACKGROUND_TRAIT_CATEGORY == "Phenotypic Abnormality",
                          T, F )
                ) |>
  dplyr::ungroup()
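One detail of the flag above is worth noting: `BACKGROUND_TRAIT_CATEGORY` is `NA` for studies without a background trait, and in R `FALSE | NA` evaluates to `NA` rather than `FALSE`. The sketch below uses hypothetical toy vectors and only two of the four conditions to show the effect. Since `filter(DISEASE_STUDY == T)` drops rows where the condition is `NA`, the count of disease studies should be unaffected, but the column itself can hold `NA` at this stage.

```r
# Toy category columns (hypothetical values)
toy_mapped <- c("Disease/Disorder", "Measurement", "Measurement")
toy_background <- c(NA, NA, "Disease/Disorder")

# TRUE | NA is TRUE, but FALSE | NA is NA, so the second row is NA
toy_flag <- ifelse(toy_mapped == "Disease/Disorder" |
                     toy_background == "Disease/Disorder",
                   TRUE, FALSE)

toy_flag

# Counting only definite TRUE values ignores the NA row
sum(toy_flag %in% TRUE)
```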

5.1 Number of disease studies

gwas_study_info |>
  filter(DISEASE_STUDY == T) |>
  nrow()

[1] 36022

6 Creating disease labels column of just disease or phenotype abnormality terms for each study - so that we can see what diseases are being studied

6.1 Make disease label column - combining disease terms from both mapped trait and background trait

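# Merge two comma-separated strings of disease terms into a single
# comma-separated string of unique terms
combined_disease_terms = function(MAPPED_TRAIT_1, MAPPED_TRAIT_2){
  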
  MAPPED_TRAIT_1 = stringr::str_split(MAPPED_TRAIT_1, ", ") |> unlist()
  MAPPED_TRAIT_2  = stringr::str_split(MAPPED_TRAIT_2, ", ") |> unlist()
  
  all_mapped_disease_terms = 
    c(MAPPED_TRAIT_1, MAPPED_TRAIT_2) |>
    unique()
  
  combined_mapped_disease_terms = paste0(all_mapped_disease_terms, 
                                         collapse = ", ")
  
  return(combined_mapped_disease_terms)
  
}
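As a quick illustration of what this function does, the sketch below re-implements the same split, de-duplicate and re-join steps in base R on a hypothetical pair of inputs. `combine_terms_sketch` is an illustrative name, not part of the analysis, and `strsplit()` stands in for `stringr::str_split()`.

```r
# Base-R sketch of the same logic (hypothetical helper name)
combine_terms_sketch <- function(terms_1, terms_2) {
  # Split each comma-separated string into its individual terms
  parts <- c(unlist(strsplit(terms_1, ", ", fixed = TRUE)),
             unlist(strsplit(terms_2, ", ", fixed = TRUE)))
  # Drop repeated terms, keeping first-seen order, and re-join
  paste0(unique(parts), collapse = ", ")
}

combined <- combine_terms_sketch("asthma, eczema",
                                 "eczema, type 2 diabetes mellitus")
combined
```

Because the terms are pooled with `unlist()`, the function is meant to receive one study's strings at a time, which is why it is called under `rowwise()`.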


gwas_study_info <- 
  gwas_study_info |>
  dplyr::rowwise() |>
  dplyr::mutate(all_disease_terms = 
                case_when(is.na(background_disease_terms) & is.na(disease_terms) ~ NA,
                          is.na(background_disease_terms) & !is.na(disease_terms) ~ disease_terms,
                          !is.na(background_disease_terms) & is.na(disease_terms) ~ background_disease_terms,
                          !is.na(background_disease_terms) & !is.na(disease_terms) ~
                            combined_disease_terms(background_disease_terms,
                                                   disease_terms)) 

  ) |>
  dplyr::ungroup()

6.2 Minor fixes of trait categorisation and returning traits

# What studies are disease studies but have no collected disease terms?
gwas_study_info |> 
  filter(DISEASE_STUDY == T) |> 
  filter(all_disease_terms == "")  |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_CATEGORY) |> 
  distinct() |>
  nrow()

[1] 22

6.2.1 Fix bug where MAPPED_TRAIT/MAPPED_BACKGROUND_TRAIT is an empty string but the trait category is listed as disease/phenotypic abnormality

gwas_study_info = gwas_study_info |>
  rowwise() |>
  mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT == "",
                                        "Other",
                                        MAPPED_TRAIT_CATEGORY)) |>
  mutate(BACKGROUND_TRAIT_CATEGORY = ifelse(MAPPED_BACKGROUND_TRAIT == "",
                                        "Other",
                                        BACKGROUND_TRAIT_CATEGORY))

6.2.2 Fix some specific mis-categorizations

# Recategorise "fractures, ununited" as an injury
gwas_study_info = gwas_study_info |>
  mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT == "fractures, ununited",
                                        "Injury",
                                        MAPPED_TRAIT_CATEGORY)
         )   

# Fixing some response terms mislabelled as disease/disorder
other_response_terms = c("response to COVID-19 vaccine, localized superficial swelling, mass, or lump",
                         "response to COVID-19 vaccine, SARS-CoV-2 neutralizing antibody measurement",
                         "Anti-hepatitis B virus surface antigen IgG measurement, response to vaccine",
                         "anti-SARS-CoV-2 IgG measurement, response to COVID-19 vaccine",
                         "anti-tetanus toxoid IgG measurement, response to vaccine",
                         "height growth attribute, response to growth hormone",
                         "adrenal suppression measurement, response to corticosteroid",
                         "asthenia, response to COVID-19 vaccine",
                         "SARS-CoV-2 antibody measurement, response to COVID-19 vaccine",
                         "response to vaccine, anti-Haemophilus influenzae type b polyribosylribitol phosphate IgG measurement"
                )
# Fixing some measurement terms mislabelled as disease/disorder
other_measurement_terms = c("amygdala volume, pallidum volume, nucleus accumbens volume, hippocampal volume, putamen volume, intracranial volume measurement, thalamus volume, caudate nucleus volume",
                            "lamina-associated polypeptide 2, isoforms beta/gamma measurement",
                            "apoptosis-inducing factor 1, mitochondrial measurement",
                            "central corneal thickness, intraocular pressure measurement",
                            "level of apoptosis-inducing factor 1, mitochondrial in blood serum",
                            "level of MHC class I polypeptide-related sequence A in blood, level of MHC class I polypeptide-related sequence B in blood",
                            "psychosocial stress measurement, intracranial volume measurement")

# Fixing some other terms mislabelled as disease/disorder
other_terms = c("insulin metabolic clearance rate measurement, disposition index measurement, insulin sensitivity measurement, glucose homeostasis trait, glucose effectiveness measurement, acute insulin response measurement",
                "disease recurrence, response to allogeneic hematopoietic stem cell transplant",
                "disease recurrence, response to allogeneic hematopoietic stem cell transplant, donor genotype effect measurement"
)

gwas_study_info =
  gwas_study_info |>
    mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_response_terms,
                                        "Response",
                                        MAPPED_TRAIT_CATEGORY)) |>
    mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_measurement_terms,
                                        "Measurement",
                                        MAPPED_TRAIT_CATEGORY)) |>
    mutate(MAPPED_TRAIT_CATEGORY = ifelse(MAPPED_TRAIT %in% other_terms,
                                        "Other",
                                        MAPPED_TRAIT_CATEGORY))

6.2.3 Fixing some specific missing disease terms

# Prepend `term` to a comma-separated string of disease terms,
# inserting ", " only when other terms are already present
prepend_disease_term = function(term, existing){
  ifelse(is.na(existing) | existing == "",
         term,
         paste(term, existing, sep = ", "))
}

# spine osteoarthritis was not being picked up as a disease term
gwas_study_info = 
  gwas_study_info |>
    mutate(disease_terms = ifelse(grepl("spine osteoarthritis", MAPPED_TRAIT),
                                  prepend_disease_term("spine osteoarthritis", disease_terms),
                                  disease_terms)
    ) |>
    mutate(all_disease_terms = ifelse(grepl("spine osteoarthritis", MAPPED_TRAIT),
                                      prepend_disease_term("spine osteoarthritis", all_disease_terms),
                                      all_disease_terms)
    )

# liver disease biomarker was not being picked up as a disease term 
gwas_study_info = 
  gwas_study_info |>
    mutate(disease_terms = ifelse(grepl("liver disease biomarker", MAPPED_TRAIT),
                                  prepend_disease_term("liver disease", disease_terms),
                                  disease_terms)
    ) |>
    mutate(all_disease_terms = ifelse(grepl("liver disease biomarker", MAPPED_TRAIT),
                                      prepend_disease_term("liver disease", all_disease_terms),
                                      all_disease_terms)
    )

# growth delay was not being picked up as a disease term
gwas_study_info = 
  gwas_study_info |>
    mutate(background_disease_terms = ifelse(grepl("Growth delay", MAPPED_BACKGROUND_TRAIT),
                                             prepend_disease_term("growth delay", background_disease_terms),
                                             background_disease_terms)
    ) |> 
    mutate(all_disease_terms = ifelse(grepl("Growth delay", MAPPED_BACKGROUND_TRAIT),
                                      prepend_disease_term("growth delay", all_disease_terms),
                                      all_disease_terms)
    )

6.2.4 Recalculate disease study flag (is disease study or not?)

Now that the categorisation mistakes have been corrected and the missing disease terms added, the disease study flag is recalculated.

gwas_study_info = 
gwas_study_info |>
  dplyr::rowwise() |>
  dplyr::mutate(DISEASE_STUDY = 
                   ifelse(MAPPED_TRAIT_CATEGORY == "Disease/Disorder" | 
                          MAPPED_TRAIT_CATEGORY == "Phenotypic Abnormality" |  
                          BACKGROUND_TRAIT_CATEGORY == "Disease/Disorder" | 
                          BACKGROUND_TRAIT_CATEGORY == "Phenotypic Abnormality",
                          T, F )
                ) |>
  dplyr::ungroup() 


gwas_study_info |> 
  filter(DISEASE_STUDY == T) |> 
  filter(all_disease_terms == "")  |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_CATEGORY) |> 
  distinct() |>
  nrow()

[1] 0

7 Saving:

data.table::fwrite(gwas_study_info,
                  here::here("output/gwas_cat/gwas_study_info_trait_cat.csv"), 
                  sep = ",")

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringr_1.5.1     ggplot2_3.5.2     dplyr_1.1.4       data.table_1.17.8
[5] workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.3.1     renv_1.0.3        
 [5] promises_1.3.3     tidyselect_1.2.1   Rcpp_1.1.0         git2r_0.36.2      
 [9] callr_3.7.6        later_1.4.2        jquerylib_0.1.4    scales_1.4.0      
[13] yaml_2.3.10        fastmap_1.2.0      here_1.0.1         R6_2.6.1          
[17] generics_0.1.4     knitr_1.50         tibble_3.3.0       rprojroot_2.1.0   
[21] RColorBrewer_1.1-3 bslib_0.9.0        pillar_1.11.0      rlang_1.1.6       
[25] utf8_1.2.6         cachem_1.1.0       stringi_1.8.7      httpuv_1.6.16     
[29] xfun_0.52          getPass_0.2-4      fs_1.6.6           sass_0.4.10       
[33] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     ps_1.9.1          
[37] grid_4.3.1         digest_0.6.37      processx_3.8.6     rstudioapi_0.17.1 
[41] lifecycle_1.0.4    vctrs_0.6.5        evaluate_1.0.4     glue_1.8.0        
[45] farver_2.1.2       whisker_0.4.1      rmarkdown_2.29     httr_1.4.7        
[49] tools_4.3.1        pkgconfig_2.0.3    htmltools_0.5.8.1

GWAS Trait Categorisation

Isobel Beasley

2025-08-24