Last updated: 2025-09-22

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version fd6c194. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/who/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Unstaged changes:
    Modified:   analysis/exclude_infectious_diseases.Rmd
    Modified:   analysis/level_1_disease_group_cancer.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   code/get_term_descendants.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/disease_trait_terms_simplify.Rmd) and HTML (docs/disease_trait_terms_simplify.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   fd6c194  IJbeasley  2025-09-22  More typo + structure fixing …
html  87f9ba3  IJbeasley  2025-09-22  Build site.
Rmd   9de036a  IJbeasley  2025-09-22  More typo + structure fixing …
html  ffeabc1  IJbeasley  2025-09-22  Build site.
Rmd   cd768cb  IJbeasley  2025-09-22  …maybe fixing typos
html  224e1c5  IJbeasley  2025-09-22  Build site.
Rmd   85452c2  IJbeasley  2025-09-22  Updating regex for disease trait capturing
html  a2fae4e  IJbeasley  2025-09-16  Build site.
Rmd   6da2f7d  IJbeasley  2025-09-16  Improving cancer grouping
html  7412343  IJbeasley  2025-09-16  Build site.
html  fe7efac  IJbeasley  2025-09-15  Build site.
html  cb08805  IJbeasley  2025-09-15  Build site.
Rmd   096f434  IJbeasley  2025-09-15  More disease term grouping
html  6a82fc8  IJbeasley  2025-09-15  Build site.
Rmd   069a54b  IJbeasley  2025-09-15  Dealing with disease progression
html  51c6c29  IJbeasley  2025-09-15  Build site.
Rmd   0597bb5  IJbeasley  2025-09-15  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  a27c404  IJbeasley  2025-09-10  Build site.
Rmd   49e052c  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  527497c  IJbeasley  2025-09-10  Build site.
Rmd   ed265ce  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  27ea97b  IJbeasley  2025-09-10  Build site.
Rmd   8b77275  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  932ca3b  IJbeasley  2025-09-10  Build site.
Rmd   142ada8  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  813a3ad  IJbeasley  2025-09-10  Build site.
Rmd   ac15860  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  0cd6e22  IJbeasley  2025-09-10  Build site.
Rmd   174ec54  IJbeasley  2025-09-10  Fixing + improving format of initial disease trait name harmonization

1 Set up

library(dplyr)
library(data.table)
library(stringr)
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_trait_cat.csv"))

2 Initial summary

2.1 Objectives of this analysis:

  • Collapsing and standardizing disease trait terms in GWAS study metadata without relying on ontology mappings.
  • Initial harmonization of disease trait terms (fixing spelling and collapsing terms from studies that examine different aspects of the same disease).
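Most of the harmonization steps in this analysis rely on the same regular-expression idiom: a lookbehind and a lookahead that tie a match to one whole element of a comma-separated term list. The sketch below illustrates that idiom on a few made-up strings (the `terms` vector is hypothetical and not taken from the GWAS Catalog); the pattern itself is the one used later for "anxiety measurement".

```r
library(stringr)

# Hypothetical comma-separated trait strings, for illustration only
terms <- c("asthma, anxiety measurement",
           "social anxiety measurement",
           "anxiety measurement")

# (?<=^|, ) : the match must begin at the start of the string or right after ", "
# (?=,|$)   : the match must be followed by a comma or by the end of the string
anchored <- str_replace_all(terms,
                            "(?<=^|, )anxiety measurement(?=,|$)",
                            "anxiety")

anchored
```

Only terms that are exactly "anxiety measurement" are rewritten; "social anxiety measurement" is left alone because the match would start in the middle of a term.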

3 Number of studies per trait - before any grouping

Let’s look at the number of studies associated with each unique disease/trait term before any grouping or collapsing.

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))

head(n_studies_trait)
# A tibble: 6 × 2
  all_disease_terms         n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        144
2 major depressive disorder       108
3 alzheimer disease               104
4 asthma                           99
5 schizophrenia                    99
6 breast carcinoma                 98
dim(n_studies_trait)
[1] 3202    2
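To make the counting logic concrete, here is a minimal sketch on a toy table (the `toy` data frame and its values are invented for illustration). It shows why the term and PUBMED_ID pairs are de-duplicated before counting: a publication that contributes several rows for the same term should count once.

```r
library(dplyr)

# Hypothetical study table: PUBMED_ID 1 contributes two rows for asthma
toy <- data.frame(
  all_disease_terms = c("asthma", "asthma", "asthma", "schizophrenia"),
  PUBMED_ID = c(1, 1, 2, 3)
)

n_per_term <- toy |>
  dplyr::distinct(all_disease_terms, PUBMED_ID) |>   # one row per term/publication pair
  dplyr::count(all_disease_terms, name = "n_studies", sort = TRUE)

n_per_term
```

Here asthma is counted twice (publications 1 and 2), not three times. `dplyr::count()` is used as a compact equivalent of the `group_by()` / `summarise()` / `arrange()` sequence above.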

4 Disease trait collapsing and cleaning

# Basic cleaning of disease terms
# remove a trailing comma
gwas_study_info$all_disease_terms = sub(",$", "", gwas_study_info$all_disease_terms)

# trims whitespace
gwas_study_info$all_disease_terms = stringr::str_trim(gwas_study_info$all_disease_terms)

# remove apostrophes
gwas_study_info$all_disease_terms = stringr::str_remove_all(gwas_study_info$all_disease_terms, "'|’")

# initialize a new column (collected_all_disease_terms) that will hold the
# standardized, collapsed disease terms
gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = all_disease_terms) 
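The effect of the three cleaning steps can be checked on a small example. The sketch below applies them, in the same order as above, to a hypothetical `raw` vector (these strings are made up for illustration).

```r
library(stringr)

# Hypothetical raw terms, for illustration only
raw <- c("crohns disease,", "  asthma ", "alzheimer's disease", "asthma, ")

cleaned <- raw |>
  sub(pattern = ",$", replacement = "") |>  # drop a comma at the very end
  str_trim() |>                             # drop leading/trailing whitespace
  str_remove_all("'|\u2019")                # drop straight and curly apostrophes

cleaned
```

One caveat this makes visible: because the comma is removed before the whitespace is trimmed, a string such as "asthma, " (comma followed by a space) still ends in a comma afterwards. Whether such strings occur in the real data is not known from this page; if they do, trimming first would be one way to handle them.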

4.1 Where a study investigates a given aspect of a disease, reduce this to a study of this disease

4.1.1 Disease susceptibility

4.1.1.1 Susceptibility to X measurement

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to (.*?) measurement(?=,|$)", "\\1")

         ) |>
    mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to (.*?) (.*?) measurement(?=,|$)", "\\1 \\2")

         )


gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_remove(collected_all_disease_terms, 
                   pattern = "(?<=^|, )susceptibility to "
         )
  )
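A short sketch of how these patterns behave, using made-up input strings (the vector `x` is hypothetical). It shows two things: the lazy group `(.*?)` already spans multi-word disease names, and `str_remove()` drops only the first match in each string while `str_remove_all()` drops every match.

```r
library(stringr)

# Hypothetical terms, for illustration only
x <- c("susceptibility to hepatitis b infection measurement",
       "susceptibility to mumps, susceptibility to rubella")

# The lazy group (.*?) can cover several words, so a single pattern handles
# both one-word and multi-word disease names.
step1 <- str_replace_all(x, "susceptibility to (.*?) measurement(?=,|$)", "\\1")

# str_remove() removes only the first match per string ...
first_only <- str_remove(step1, "(?<=^|, )susceptibility to ")

# ... whereas str_remove_all() removes every match.
all_matches <- str_remove_all(step1, "(?<=^|, )susceptibility to ")

first_only
all_matches
```

If a study lists more than one "susceptibility to" term, the `str_remove()` form leaves the later ones in place; whether that matters depends on whether such rows exist in the data.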

4.1.1.2 Decreased susceptibility to X

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_remove_all(collected_all_disease_terms,
                       pattern = "(?<=^|, )decreased susceptibility to ")
  )

4.1.1.3 Predisposition

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_remove_all(collected_all_disease_terms,
                       pattern = " predisposition measurement(?=,|$)")
  )

4.1.1.4 Specific examples / changes to make - bruisability

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "(?<=^|, )bruising susceptibility(?=,|$)",
                    "bruisability")
    )

4.1.2 Disease severity

4.1.2.1 Severity / symptom severity / exacerbation

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, " symptom severity measurement(?=,|$)")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, " severity measurement(?=,|$)")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, " exacerbation measurement(?=,|$)")
         ) 

4.1.2.2 Specific example: psoriasis area and severity index

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, ", psoriasis area and severity index(?=,|$)")
         )

4.1.2.3 Specific example: uveal melanoma disease severity

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
          stringr::str_replace_all(collected_all_disease_terms,
                           "(?<=^|, )uveal melanoma disease severity(?=,|$)",
                           "uveal melanoma"
          )
         )

4.1.3 Disease symptom measurement

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptom measurement(?=,|$)")
         )

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptoms measurement(?=,|$)")
         )


gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptom count(?=,|$)")
         )

4.1.4 General measurement

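# NOTE: the literal space after the lookahead means this pattern cannot match
# (the next character must be "," or the end of the string, never a space),
# so this step currently leaves every term unchanged.
gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "(.*?) (.*?) measurement(?=,|$) ", "\\1 \\2")
  )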
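The behaviour of the general pattern can be checked directly. In the sketch below (the input vector `x` is hypothetical), the pattern is tested exactly as written, with the space after the lookahead, and then, for comparison only, without that space. Whether the broader rule was meant to be active is not stated in this analysis, so the second pattern is an assumption and not the author's method.

```r
library(stringr)

# Hypothetical terms, for illustration only
x <- c("heart failure measurement", "heart failure measurement, asthma")

# Pattern as written: the lookahead requires "," or end of string,
# and then a literal space is required, which cannot both hold.
pattern_with_space <- "(.*?) (.*?) measurement(?=,|$) "
matches_as_written <- str_detect(x, pattern_with_space)

# Comparison only (assumption): same pattern without the trailing space.
pattern_no_space <- "(.*?) (.*?) measurement(?=,|$)"
stripped <- str_replace_all(x, pattern_no_space, "\\1 \\2")

matches_as_written
stripped
```

Without the trailing space the rule would strip " measurement" from any term of three or more words, which is much broader than the targeted rules that follow, so enabling it would likely change the downstream counts.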

4.1.4.1 Dependence / addiction / withdrawal measurement

# dependence measurement -> dependence
gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bdependence\\s+measurement(?=,|$)",
                           "dependence")) 

# addiction measurement -> addiction
gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\baddiction\\s+measurement(?=,|$)",
                           "addiction"))

# withdrawal measurement -> withdrawal
gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bwithdrawal\\s+measurement(?=,|$)",
                           "withdrawal"))

4.1.4.2 Syndrome / disorder measurement

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bsyndrome\\s+measurement(?=,|$)",
                           "syndrome"))

gwas_study_info = gwas_study_info |> 
      mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bdisorder\\s+measurement(?=,|$)",
                           "disorder")) 

4.1.4.3 Allergy measurement

# allergy measurement -> allergy
gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\ballergy\\s+measurement(?=,|$)",
                           "allergy")) 

4.1.5 Symptom measurement

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptom measurement(?=,|$)")
         )

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptoms measurement(?=,|$)")
         )


gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_remove_all(collected_all_disease_terms,
                         " symptom count(?=,|$)")
         )

4.1.5.1 Specific example: autism

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "(?<=^|, )autism spectrum disorder symptom(?=,|$)",
                            "autism")
         ) 

4.1.5.2 Specific example: respiratory symptom change

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                          ", respiratory symptom change measurement(?=,|$)"
         ))

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "respiratory symptom change(?=,|$)",
                          "respiratory symptom" 
         ))

4.2 Age of onset of

4.2.0.1 Non-specific correction

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                              pattern = "(?<=^|, )age of onset of ")
         )

4.2.0.2 Specific correction: febrile seizure

# febrile seizure (within the age range of 3 months to 6 years)
gwas_study_info = gwas_study_info |> 
 mutate(collected_all_disease_terms  = 
          stringr::str_replace_all(collected_all_disease_terms ,
                                  pattern = "febrile seizure \\(within the age range of 3 months to 6 years\\)",
                                   "febrile seizure"
                          )  
        )
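The parentheses in this term have to be escaped because `(` and `)` are regex metacharacters. As a small sketch (the string `x` is simply the term from the comment above), the escaped pattern and stringr's `fixed()` helper, which treats the pattern as a literal string, give the same result here.

```r
library(stringr)

x <- "febrile seizure (within the age range of 3 months to 6 years)"

# Escaped regular expression, as used in the chunk above
escaped <- str_replace_all(
  x,
  "febrile seizure \\(within the age range of 3 months to 6 years\\)",
  "febrile seizure"
)

# Literal matching with fixed(): no escaping needed
literal <- str_replace_all(
  x,
  fixed("febrile seizure (within the age range of 3 months to 6 years)"),
  "febrile seizure"
)

identical(escaped, literal)
```

`fixed()` cannot be combined with lookarounds such as `(?<=^|, )`, so it only suits replacements that do not need term anchoring.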

4.3 Family history of

4.3.0.1 Non-specific example

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                         pattern = "(?<=^|, )family history of "))

4.4 Disease progression

4.4.1 Disease progression measurement

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                          "(?<=^|, )disease progression measurement(?=,|$)"
         ))

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                          "disease progression measurement"
         ))

4.5 Time to remission

4.5.1 COVID

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )time to remission of covid-19 symptoms(?=,|$)",
                          "covid-19"
         ))

4.6 Alzheimer’s disease - additional examples

4.6.1 Neuropathological change

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )alzheimers disease neuropathologic change(?=,|$)",
                          "alzheimers disease"
         ))

4.6.2 Biomarker measurement

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )alzheimers disease biomarker measurement(?=,|$)",
                          "alzheimers disease"
         ))

4.7 Cancer aggressiveness measurement

gwas_study_info = 
gwas_study_info |> 
 mutate(collected_all_disease_terms  = 
          stringr::str_remove_all(collected_all_disease_terms,
                                  pattern = "(?<=^|, )cancer aggressiveness measurement(?=,|$)"
                          )
 ) 

4.8 Other specific / non-defined measurement

4.8.1 Anxiety

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )anxiety measurement(?=,|$)",
                          "anxiety"
         ))

4.8.2 Coronary atherosclerosis

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )coronary atherosclerosis measurement(?=,|$)",
                          "coronary atherosclerosis"
         ))

4.9 Cutaneous psoriasis

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )cutaneous psoriasis measurement(?=,|$)",
                          "cutaneous psoriasis"
         ))

4.9.1 Depressive episode

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )depressive episode measurement(?=,|$)",
                          "depressive episode"
         ))

4.9.2 Insomnia

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )insomnia measurement(?=,|$)",
                          "insomnia"
         ))

4.9.3 Lewy body dementia

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "lewy body dementia measurement(?=,|$)",
                          "lewy body dementia"
         ))

4.9.4 Sleep apnea

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )sleep apnea measurement(?=,|$)",
                          "sleep apnea"
         ))

# during rem sleep
gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )sleep apnea measurement during rem sleep(?=,|$)",
                          "sleep apnea"
         ))

# during non-rem sleep
gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )sleep apnea measurement during non-rem sleep(?=,|$)",
                          "sleep apnea"
         ))

4.9.5 Substance abuse measurement

4.9.6 Alcohol

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )alcohol withdrawal delirium(?=,|$)",
                          "alcohol withdrawal"
         ))


alcohol_use_disorder_terms <- c("(?<=^|, )alcohol dependence(?=,|$)",
                             "(?<=^|, )alcohol withdrawal(?=,|$)",
                             "(?<=^|, )alcohol abuse(?=,|$)",
                             "(?<=^|, )alcohol use disorder(?=,|$)",
                             "(?<=^|, )addictive alcohol use(?=,|$)"
                             )

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            paste0(alcohol_use_disorder_terms, collapse = "|"),
                            "alcohol-related disorders"
           ))
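Collapsing several terms into one label can leave the same label twice within a single study's term string. The sketch below shows this on made-up input (the vector `x` and the shortened `alcohol_terms` list are hypothetical), followed by one possible way to drop such repeats. The de-duplication step is an assumption added for illustration and is not part of the analysis above.

```r
library(stringr)

# Shortened, illustrative version of the pattern list
alcohol_terms <- c("(?<=^|, )alcohol dependence(?=,|$)",
                   "(?<=^|, )alcohol abuse(?=,|$)")

# Join the alternatives into a single regular expression
pattern <- paste0(alcohol_terms, collapse = "|")

# Hypothetical term strings
x <- c("alcohol dependence, alcohol abuse", "alcohol abuse, asthma")

collapsed <- str_replace_all(x, pattern, "alcohol-related disorders")

# Optional (assumption): keep each term once per string
dedup <- vapply(
  str_split(collapsed, ", "),
  function(v) paste(unique(v), collapse = ", "),
  character(1)
)

collapsed
dedup
```

Without de-duplication, the first string contributes "alcohol-related disorders" twice when terms are later split and tabulated.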

4.9.7 Cocaine

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "cocaine use disorder(?=,|$)|cocaine dependence(?=,|$)",
                          "cocaine-related disorders"
         ))

4.9.8 Nicotine

nicotine_use_disorder_terms <- c("(?<=^|, )nicotine dependence(?=,|$)",
                             "(?<=^|, )nicotine withdrawal(?=,|$)",
                             "(?<=^|, )nicotine addiction(?=,|$)"
                             )

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          paste0(nicotine_use_disorder_terms, collapse = "|"),
                          "nicotine-related disorders"
         ))

4.10 Methamphetamine use disorders

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )methamphetamine dependence(?=,|$)|(?<=^|, )methamphetamine-induced psychosis(?=,|$)",
                          "methamphetamine use disorders"
         ))

4.11 Deal with basic synonyms

4.11.1 ADHD = attention deficit hyperactivity disorder

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "(?<=^|, )attention deficit hyperactivity disorder(?=,|$)",
                            "adhd"
         ))

4.11.2 Anemia (phenotype) -> anemia

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )anemia \\(phenotype\\)(?=,|$)",
                          "anemia"
         ))

4.11.3 Autism = Autism Spectrum Disorder = Asperger Syndrome

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "(?<=^|, )autism spectrum disorder(?=,|$)",
                         "autism")
        ) |>
    mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "(?<=^|, )asperger syndrome(?=,|$)",
                         "autism")
        )

4.11.4 Carcinoma = Cancer

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "\\bcarcinoma",
                            "cancer")
         )
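The word boundary `\\b` restricts this replacement to terms where "carcinoma" starts a word. A minimal sketch on hypothetical input:

```r
library(stringr)

# Hypothetical terms, for illustration only
x <- c("breast carcinoma", "lung adenocarcinoma")

renamed <- str_replace_all(x, "\\bcarcinoma", "cancer")

renamed
```

"breast carcinoma" becomes "breast cancer", while "lung adenocarcinoma" is unchanged because there is no word boundary before "carcinoma" inside "adenocarcinoma".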

4.11.5 Coronary artery disease = coronary atherosclerosis

https://www.ebi.ac.uk/ols4/ontologies/efo/classes/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0001645

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )coronary atherosclerosis(?=,|$)",
                          "coronary artery disease"
         ))

4.11.6 Head and neck cancer = head and neck malignant neoplasia

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )head and neck malignant neoplasia(?=,|$)",
                          "head and neck cancer"
         ))

4.11.7 Hepatitis virus infection = hepatitis infection

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )hepatitis a virus infection(?=,|$)",
                          "hepatitis a infection"
         )) |>
      mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )hepatitis b virus infection(?=,|$)",
                          "hepatitis b infection"
         )) |>
        mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )hepatitis c virus infection(?=,|$)",
                          "hepatitis c infection"
         ))

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )chronic hepatitis c virus infection(?=,|$)",
                          "chronic hepatitis, hepatitis c infection"
         )) |>
      mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )chronic hepatitis b virus infection(?=,|$)",
                          "chronic hepatitis, hepatitis b infection"
         ))

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )hepatitis virus-related liver cancer(?=,|$)",
                          "hepatitis, liver cancer"
         ))

4.11.8 Narcolepsy = Narcolepsy without cataplexy

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )narcolepsy without cataplexy(?=,|$)",
                          "narcolepsy"
         ))

4.11.9 Mumps virus infectious disease = mumps

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )mumps virus infectious disease(?=,|$)",
                          "mumps"
         ))

4.11.10 Renal cancer = kidney cancer

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )renal cancer(?=,|$)",
                          "kidney cancer"
         ))

4.11.11 Rubella infection = rubella

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )rubella infection(?=,|$)",
                          "rubella"
         ))

4.11.12 Type 1 diabetes = type 1 diabetes mellitus

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "(?<=^|, )type 1 diabetes(?=,|$)",
                         "type 1 diabetes mellitus")
        )

4.11.13 Type 1 diabetes mellitus nephropathy = type 1 diabetes nephropathy

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "(?<=^|, )type 1 diabetes mellitus nephropathy(?=,|$)",
                         "type 1 diabetes nephropathy")
        )

4.11.14 Von willebrand disease (hereditary or acquired) = von willebrand disease

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )von willebrand disease \\(hereditary or acquired\\)(?=,|$)",
                          "von willebrand disease"
         ))

4.12 Typos

4.12.1 Alzheimer -> alzheimers

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )alzheimer disease(?=,|$)",
                          "alzheimers disease"
         ))

4.12.2 Parkinson -> parkinsons

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "(?<=^|, )parkinson disease(?=,|$)",
                          "parkinsons disease"
         ))

5 Putting it all together

5.1 Final summary - number of unique diseases studied

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$collected_all_disease_terms[gwas_study_info$collected_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()

length(unique(diseases))
[1] 2187
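The split-and-trim step can be followed on a small example. In this sketch the `collected` vector is hypothetical; it includes an empty string and a term preceded by two spaces, to show why both the `!= ""` filter and `str_trim()` are needed.

```r
library(stringr)

# Hypothetical harmonized term strings, for illustration only
collected <- c("asthma, type 2 diabetes mellitus", "asthma", "", "gout,  asthma")

diseases <- collected[collected != ""] |>  # drop studies with no term
  str_split(", ") |>                       # one element per term
  unlist() |>
  str_trim()                               # " asthma" and "asthma" become one term

n_unique <- length(unique(diseases))
n_unique
```

Note that `collected != ""` returns `NA` for missing values, so if the real column can contain `NA`, a filter such as `!is.na(collected) & collected != ""` would avoid carrying `NA` into the term list.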

5.2 Frequency of unique disease terms after harmonization

# make frequency table
freq <- table(as.factor(diseases))

# sort in decreasing order
freq_sorted <- sort(freq, decreasing = TRUE)

# show the 5 most frequent terms
print("Top 5 most frequent disease terms after harmonization:")
[1] "Top 5 most frequent disease terms after harmonization:"
head(freq_sorted, 5)

   chronic kidney disease              hypertension  type 2 diabetes mellitus 
                    10828                      6988                       922 
major depressive disorder   coronary artery disease 
                      471                       456 
print("Number of disease terms that appear only once after harmonization:")
[1] "Number of disease terms that appear only once after harmonization:"
sum(freq[freq == 1])
[1] 409
print("Average number of studies per disease term after harmonization:")
[1] "Average number of studies per disease term after harmonization:"
mean(freq)
[1] 20.7316
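The summaries in this section measure different things, which a toy frequency table makes easy to see. The `diseases` vector below is invented for illustration.

```r
# Hypothetical split term vector, for illustration only
diseases <- c("asthma", "asthma", "asthma", "gout", "mumps")

freq <- table(diseases)

n_unique      <- length(freq)             # number of distinct terms
n_singletons  <- sum(freq == 1)           # terms that occur exactly once
mean_per_term <- as.numeric(mean(freq))   # average occurrences per term

c(n_unique = n_unique, n_singletons = n_singletons, mean_per_term = mean_per_term)
```

Here there are 3 distinct terms, 2 of which occur once. `sum(freq == 1)` gives the same value as `sum(freq[freq == 1])`, since every selected count equals 1. Because `freq` counts rows of the study table, the mean is an average number of rows per term, which equals the number of publications only if each publication appears once per term.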

5.3 Save the updated gwas_study_info with repaired / harmonized disease terms

data.table::fwrite(
  gwas_study_info,
   here::here("output/gwas_cat/gwas_study_info_disease_trait_simplified.csv")
  )

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringr_1.5.1     data.table_1.17.8 dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0    compiler_4.3.1    renv_1.0.3        promises_1.3.3   
 [5] tidyselect_1.2.1  Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6      
 [9] later_1.4.2       jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0    
[13] here_1.0.1        R6_2.6.1          generics_0.1.4    knitr_1.50       
[17] tibble_3.3.0      rprojroot_2.1.0   bslib_0.9.0       pillar_1.11.0    
[21] rlang_1.1.6       utf8_1.2.6        cachem_1.1.0      stringi_1.8.7    
[25] httpuv_1.6.16     xfun_0.52         getPass_0.2-4     fs_1.6.6         
[29] sass_0.4.10       cli_3.6.5         withr_3.0.2       magrittr_2.0.3   
[33] ps_1.9.1          digest_0.6.37     processx_3.8.6    rstudioapi_0.17.1
[37] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.4    glue_1.8.0       
[41] whisker_0.4.1     rmarkdown_2.29    httr_1.4.7        tools_4.3.1      
[45] pkgconfig_2.0.3   htmltools_0.5.8.1