Last updated: 2025-09-15

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 096f434. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/who/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Unstaged changes:
    Modified:   analysis/level_1_disease_group_cancer.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   code/get_term_descendants.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/disease_trait_terms_simplify.Rmd) and HTML (docs/disease_trait_terms_simplify.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   096f434  IJbeasley  2025-09-15  More disease term grouping
html  6a82fc8  IJbeasley  2025-09-15  Build site.
Rmd   069a54b  IJbeasley  2025-09-15  Dealing with disease progression
html  51c6c29  IJbeasley  2025-09-15  Build site.
Rmd   0597bb5  IJbeasley  2025-09-15  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  a27c404  IJbeasley  2025-09-10  Build site.
Rmd   49e052c  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  527497c  IJbeasley  2025-09-10  Build site.
Rmd   ed265ce  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  27ea97b  IJbeasley  2025-09-10  Build site.
Rmd   8b77275  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  932ca3b  IJbeasley  2025-09-10  Build site.
Rmd   142ada8  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  813a3ad  IJbeasley  2025-09-10  Build site.
Rmd   ac15860  IJbeasley  2025-09-10  workflowr::wflow_publish("analysis/disease_trait_terms_simplify.Rmd")
html  0cd6e22  IJbeasley  2025-09-10  Build site.
Rmd   174ec54  IJbeasley  2025-09-10  Fixing + improving format of initial disease trait name harmonization

1 Set up

library(dplyr)
library(data.table)
library(stringr)
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_trait_cat.csv"))

2 Initial summary

2.1 Objectives of this analysis:

  • Collapsing and standardizing disease trait terms in GWAS study metadata without relying on ontology mappings.
  • Initial harmonization of disease trait terms (fixing spelling, collapsing terms that describe different aspects of the same disease)

3 Number of studies per trait - before any grouping

Let’s look at the number of studies associated with each unique disease/trait term before any grouping or collapsing.

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))

head(n_studies_trait)
# A tibble: 6 × 2
  all_disease_terms         n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        144
2 major depressive disorder       108
3 alzheimer disease               104
4 asthma                           99
5 schizophrenia                    99
6 breast carcinoma                 98
dim(n_studies_trait)
[1] 3202    2
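
The distinct() step matters because one publication can contribute several rows for the same term. A minimal sketch on made-up data shows the same kind of count written with dplyr::count(); the `toy` data frame and its values are hypothetical, and only the column names follow the chunk above.

```r
library(dplyr)

# Hypothetical rows: publication 1 reports asthma twice; row 4 is not a disease study
toy <- data.frame(
  DISEASE_STUDY     = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  all_disease_terms = c("asthma", "asthma", "asthma", "body height", "eczema"),
  PUBMED_ID         = c(1, 1, 2, 3, 4)
)

toy_counts <- toy |>
  filter(DISEASE_STUDY) |>
  distinct(all_disease_terms, PUBMED_ID) |>   # one row per (term, publication) pair
  count(all_disease_terms, name = "n_studies", sort = TRUE)

toy_counts
```

In this toy data asthma is counted for two publications (1 and 2) rather than for its three rows.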

4 Disease trait collapsing and cleaning

# Basic cleaning of disease terms
# remove a trailing comma
gwas_study_info$all_disease_terms = sub(",$", "", gwas_study_info$all_disease_terms)

# trim leading and trailing whitespace
gwas_study_info$all_disease_terms = stringr::str_trim(gwas_study_info$all_disease_terms)

# remove straight and curly apostrophes
gwas_study_info$all_disease_terms = stringr::str_remove_all(gwas_study_info$all_disease_terms, "'|’")

# initialize a new column (collected_all_disease_terms) that will hold the
# standardized, collapsed disease terms
gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = all_disease_terms) 
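
To make the effect of the three cleaning steps concrete, here is a small sketch on a hypothetical vector (`raw_terms` is made up and not taken from the GWAS Catalog data). It applies the steps in the same order as the chunk above.

```r
library(stringr)

# Hypothetical raw terms for illustration only
raw_terms <- c("asthma,", "  crohn's disease ", "alzheimer’s disease, asthma", "asthma, ")

cleaned <- sub(",$", "", raw_terms)        # drop a comma at the very end of the string
cleaned <- str_trim(cleaned)               # trim leading and trailing whitespace
cleaned <- str_remove_all(cleaned, "'|’")  # drop straight and curly apostrophes

cleaned
```

Because the comma is removed before the whitespace is trimmed, a term that ends in a comma followed by a space (the last element here) keeps its comma. Trimming first would avoid that, if such inputs occur.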

4.1 Where a study investigates a given aspect of a disease, reduce this to a study of that disease

4.1.1 Susceptibility to X measurement

4.1.1.1 Non-specific examples

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to (.*?) measurement(?=,|$)", "\\1")

         ) |>
    mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to (.*?) (.*?) measurement(?=,|$)", "\\1 \\2")

         ) |>
      mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to (.*?) (.*?) infection(?=,|$)", "\\1 \\2")

         ) 
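
The lookahead `(?=,|$)` in these patterns only allows a match when the keyword is followed by a comma or the end of the string. A short sketch on hypothetical terms (the `terms` vector is made up) illustrates this for the first and third patterns.

```r
library(stringr)

# Hypothetical terms for illustration only
terms <- c(
  "susceptibility to malaria measurement",
  "susceptibility to hepatitis c infection, asthma",
  "susceptibility to measles measurement result"
)

step1 <- str_replace_all(terms,
                         "susceptibility to (.*?) measurement(?=,|$)", "\\1")
step2 <- str_replace_all(step1,
                         "susceptibility to (.*?) (.*?) infection(?=,|$)", "\\1 \\2")
step2
```

The first term reduces to "malaria" and the second to "hepatitis c, asthma"; note that the third pattern drops the word "infection" as well as the prefix. The last term is left unchanged because "measurement" is followed by another word, so the lookahead fails.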

4.1.2 Specific examples / changes to make

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to viral and mycobacterial infections",
                    "viral and mycobacterial infections")

         ) |>
    mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "susceptibility to strep throat",
                    "strep throat")
         ) |>    
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "decreased susceptibility to bacterial infection",
                    "bacterial infection")
  ) |> 
    mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "bruising susceptibility",
                    "bruisability")
    )

4.1.3 Severity measurement / symptom severity measurement / exacerbation measurement

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, " symptom severity measurement")
         ) |>
  mutate(collected_all_disease_terms = 
          stringr::str_remove(collected_all_disease_terms, " severity measurement| exacerbation measurement")
         ) 

4.1.4 Other kinds of measurement

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        str_replace_all(collected_all_disease_terms, 
                    "(.*?) (.*?) measurement(?=,|$)", "\\1 \\2")
  )
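
The generic pattern above needs at least two space-separated pieces before " measurement", and `.` can run across commas. The sketch below, on hypothetical terms, shows what that means in practice and why single-word terms such as "allergy measurement" need a targeted pattern like the `\\ballergy\\s+measurement(?=,|$)` rule used in this section.

```r
library(stringr)

# Hypothetical terms for illustration only
terms <- c(
  "chronic pain intensity measurement, asthma",
  "allergy measurement",
  "asthma, allergy measurement"
)

generic <- str_replace_all(terms, "(.*?) (.*?) measurement(?=,|$)", "\\1 \\2")
generic

# The single-word case is only caught by a targeted pattern
targeted <- str_replace_all(generic, "\\ballergy\\s+measurement(?=,|$)", "allergy")
targeted
```

The second term is untouched by the generic pattern, while the third is rewritten only because the match starts in the preceding term ("asthma,") and spans the comma.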

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bdependence\\s+measurement(?=,|$)",
                           "dependence")) |> 
  mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\ballergy\\s+measurement(?=,|$)",
                           "allergy")) |> 
  mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\baddiction\\s+measurement(?=,|$)",
                           "addiction")) |> 
      mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bsyndrome\\s+measurement(?=,|$)",
                           "syndrome")) |> 
      mutate(collected_all_disease_terms =
           str_replace_all(collected_all_disease_terms,
                           "\\bdisorder\\s+measurement(?=,|$)",
                           "disorder")) 

4.1.5 Age of onset of

4.1.5.1 Non-specific correction

# str_remove() drops only the first "age of onset of " in each string
gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
         stringr::str_remove(collected_all_disease_terms,
                              pattern = "age of onset of ")
         )
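
Since several chunks in this analysis use str_remove() rather than str_remove_all(), it is worth seeing the difference on a string that contains the prefix twice. The input `x` below is hypothetical.

```r
library(stringr)

# Hypothetical term list in which the prefix occurs twice
x <- "age of onset of asthma, age of onset of type 2 diabetes mellitus"

first_only <- str_remove(x, "age of onset of ")      # removes the first match only
all_hits   <- str_remove_all(x, "age of onset of ")  # removes every match

first_only
all_hits
```

With str_remove() the second "age of onset of " survives, which is one reason a later, more specific replacement can still find something to match.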

4.1.5.2 Specific corrections

gwas_study_info = 
gwas_study_info |> 
 mutate(collected_all_disease_terms  = 
          stringr::str_replace_all(collected_all_disease_terms ,
                                  pattern = "age of onset of type 2 diabetes mellitus",
                                   "type 2 diabetes mellitus"
                          )  
        )


gwas_study_info = 
gwas_study_info |> 
 mutate(collected_all_disease_terms  = 
          stringr::str_replace_all(collected_all_disease_terms ,
                                  pattern = "age of onset of alzheimer disease",
                                   "alzheimers disease"
                          )  
        )


# febrile seizure (within the age range of 3 months to 6 years)
gwas_study_info = gwas_study_info |> 
 mutate(collected_all_disease_terms  = 
          stringr::str_replace_all(collected_all_disease_terms ,
                                  pattern = "febrile seizure \\(within the age range of 3 months to 6 years\\)",
                                   "febrile seizure"
                          )  
        )

4.1.6 Family history of

4.1.6.1 Non-specific example

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
         stringr::str_remove(collected_all_disease_terms,
                         pattern = "family history of "))

4.1.6.2 Specific example: Alzheimer’s disease

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          # apostrophes were already stripped during basic cleaning
                          "family history of alzheimers disease",
                          "alzheimers disease"
         ))

4.1.7 Disease progression

gwas_study_info = gwas_study_info |>
    # drop "disease progression" when it is the first term in the list
    mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                          "^disease progression, "
         )) |> 
    # drop "disease progression" when it is the last term in the list
    mutate(collected_all_disease_terms = 
         stringr::str_remove_all(collected_all_disease_terms,
                          ", disease progression$"
         ))   


gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          ", disease progression, ",
                          ", "
         ))
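
The three replacements in this section cover "disease progression" at the start, at the end, and in the middle of a comma-separated list. A sketch on hypothetical inputs (the `terms` vector is made up, and the patterns are written without any empty alternative):

```r
library(stringr)

# Hypothetical term lists for illustration only
terms <- c(
  "disease progression, asthma",
  "asthma, disease progression",
  "asthma, disease progression, eczema",
  "disease progression"
)

out <- terms |>
  str_remove_all("^disease progression, ") |>          # first term in the list
  str_remove_all(", disease progression$") |>          # last term in the list
  str_replace_all(", disease progression, ", ", ")     # term in the middle

out
```

A row whose only term is "disease progression" is left as it is, since none of the three patterns matches a lone term.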

4.1.8 Time to remission

4.1.8.1 COVID

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "time to remission of covid-19 symptoms",
                          "covid-19"
         ))

4.1.9 Symptom measurement

4.1.9.1 ADHD

gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "adhd symptom",
                            "adhd")
         )

4.1.9.2 Agoraphobia

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "agoraphobia symptom",
                          "agoraphobia"
         ))

4.1.9.3 Autism

gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "autism spectrum disorders symptom",
                            "autism")
         ) |>
    mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "autism spectrum disorder symptom",
                            "autism")
         ) 


gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "autism symptom",
                            "autism")
         ) 


4.1.9.4 Agoraphobia (symptom measurement)

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "agoraphobia symptom measurement",
                          "agoraphobia"
         ))

4.1.9.5 Alzheimer’s disease - neuropathological change

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "alzheimer disease neuropathological change",
                          "alzheimers disease"
         ))


gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "alzheimers disease neuropathologic change",
                          "alzheimers disease"
         ))

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "alzheimers disease biomarker",
                          "alzheimers disease"
         ))

4.1.9.6 Asthma

gwas_study_info = gwas_study_info |> 
  mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "asthma symptoms",
                            "asthma"
           )) 

4.1.9.7 Covid-19

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "covid-19 symptoms",
                          "covid-19"
         ))

4.1.9.8 Irritable bowel syndrome

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "irritable bowel syndrome symptom",
                          "irritable bowel syndrome"
         ))

4.1.9.9 Multiple sclerosis

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "multiple sclerosis symptom",
                          "multiple sclerosis"
         ))

4.1.9.10 Obsessive-compulsive disorder

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "obsessive-compulsive symptom",
                          "obsessive-compulsive disorder"
         ))

4.1.9.11 Parkinson’s disease

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "parkinsons disease symptom",
                          "parkinsons disease"
         ))

4.1.9.12 PTSD

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "post-traumatic stress disorder symptom",
                          "post-traumatic stress disorder"
         ))

4.1.9.13 Respiratory symptom change

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "respiratory symptom change",
                          "respiratory symptom" 
         ))
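
Many of the replacements in this section are one-to-one substitutions. As a possible alternative layout, not the approach used above, stringr::str_replace_all() also accepts a named character vector of `pattern = replacement` pairs. The sketch below uses a subset of the pairs from this section; `symptom_map` and `terms` are hypothetical names and values.

```r
library(stringr)

# Subset of the symptom replacements, written as pattern = replacement.
# The names are still interpreted as regular expressions.
symptom_map <- c(
  "adhd symptom"                     = "adhd",
  "asthma symptoms"                  = "asthma",
  "covid-19 symptoms"                = "covid-19",
  "irritable bowel syndrome symptom" = "irritable bowel syndrome"
)

# Hypothetical term lists for illustration only
terms <- c("adhd symptom, asthma symptoms", "covid-19 symptoms", "eczema")

mapped <- str_replace_all(terms, symptom_map)
mapped
```

Keeping the pairs in one vector makes it easier to spot duplicated or conflicting rules, at the cost of losing the per-disease headings.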

4.1.10 Measurement

4.1.10.1 Anxiety

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "anxiety measurement",
                          "anxiety"
         ))

4.1.10.2 Insomnia

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "insomnia measurement",
                          "insomnia"
         ))

4.1.10.3 Lewy body dementia

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "lewy body dementia measurement",
                          "lewy body dementia"
         ))

4.1.11 Substance abuse measurement

4.1.11.1 Cocaine

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "cocaine use disorder|cocaine dependence",
                          "cocaine-related disorders"
         ))

4.1.11.2 Nicotine

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "nicotine dependence symptom count",
                          "nicotine dependence"
         ))

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "nicotine withdrawal symptom count|nicotine withdrawal measurement",
                          "nicotine withdrawal"
         ))

4.2 Deal with basic synonyms

4.2.1 ADHD = attention deficit hyperactivity disorder

gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "attention deficit hyperactivity disorder",
                            "adhd"
         ))

4.2.2 Anemia (phenotype) -> anemia

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "anemia \\(phenotype\\)",
                          "anemia"
         ))

4.2.3 Autism = Autism Spectrum Disorder = Asperger Syndrome

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "autism spectrum disorder",
                         "autism")
        ) |>
    mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "asperger syndrome",
                         "autism")
        )

4.2.4 Carcinoma = Cancer

gwas_study_info = gwas_study_info |>
      mutate(collected_all_disease_terms = 
           stringr::str_replace_all(collected_all_disease_terms,
                            "\\bcarcinoma",
                            "cancer")
         )
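
The leading `\\b` restricts the replacement to "carcinoma" at the start of a word, so compound names are left alone. A brief sketch on hypothetical terms:

```r
library(stringr)

# Hypothetical terms for illustration only
terms <- c("breast carcinoma", "adenocarcinoma", "squamous cell carcinomas")

renamed <- str_replace_all(terms, "\\bcarcinoma", "cancer")
renamed
```

There is no boundary at the end of the pattern, so a plural such as "carcinomas" becomes "cancers", while "adenocarcinoma" is unchanged.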

4.2.5 Coronary artery disease = coronary atherosclerosis

https://www.ebi.ac.uk/ols4/ontologies/efo/classes/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0001645

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "coronary atherosclerosis",
                          "coronary artery disease"
         ))

4.2.6 Head and neck cancer = head and neck malignant neoplasia

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "head and neck malignant neoplasia",
                          "head and neck cancer"
         ))

4.2.7 Hepatitis virus infection = hepatitis infection

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "hepatitis a virus infection",
                          "hepatitis a infection"
         )) |>
      mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "hepatitis b virus infection",
                          "hepatitis b infection"
         )) |>
        mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "hepatitis c virus infection",
                          "hepatitis c infection"
         ))

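# Note: the chunk above has already rewritten every "hepatitis b/c virus infection",
# so these longer patterns no longer match at this point. If the split into
# "chronic hepatitis" plus the infection term is intended, this chunk needs to run first.
gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "chronic hepatitis c virus infection",
                          "chronic hepatitis, hepatitis c infection"
         )) |>
      mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "chronic hepatitis b virus infection",
                          "chronic hepatitis, hepatitis b infection"
         ))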
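
The order of overlapping replacements changes the result. The sketch below applies the general and the chronic-specific hepatitis c rules to one hypothetical term in both orders.

```r
library(stringr)

x <- "chronic hepatitis c virus infection"  # hypothetical input

# General rule first, then the chronic-specific rule (the order used above)
general_first <- x |>
  str_replace_all("hepatitis c virus infection", "hepatitis c infection") |>
  str_replace_all("chronic hepatitis c virus infection",
                  "chronic hepatitis, hepatitis c infection")

# Chronic-specific rule first, then the general rule
specific_first <- x |>
  str_replace_all("chronic hepatitis c virus infection",
                  "chronic hepatitis, hepatitis c infection") |>
  str_replace_all("hepatitis c virus infection", "hepatitis c infection")

general_first
specific_first
```

With the general rule first, the specific pattern has nothing left to match and the term stays a single entry; with the specific rule first it is split into two terms.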

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "hepatitis virus-related liver cancer",
                          "hepatitis, liver cancer"
         ))

4.2.8 Narcolepsy = Narcolepsy without cataplexy

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "narcolepsy without cataplexy",
                          "narcolepsy"
         ))

4.2.9 Mumps virus infection = mumps

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "mumps virus infectious disease",
                          "mumps"
         ))

4.2.10 Renal cancer = kidney cancer

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "renal cancer",
                          "kidney cancer"
         ))

4.2.11 Rubella infection = rubella

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "rubella infection",
                          "rubella"
         ))

# multiple sclerosis symptom measurement -> multiple sclerosis
gwas_study_info = gwas_study_info |> 
    mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "multiple sclerosis symptom measurement",
                          "multiple sclerosis"
         ))  

4.2.12 Type 1 diabetes = type 1 diabetes mellitus

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "type 1 diabetes(?=,|$)",
                         "type 1 diabetes mellitus")
        )
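
The lookahead keeps this rule from touching terms that already continue after "type 1 diabetes", which also makes it safe to run more than once. A small check on hypothetical terms:

```r
library(stringr)

# Hypothetical terms for illustration only
terms <- c(
  "type 1 diabetes",
  "type 1 diabetes, asthma",
  "type 1 diabetes mellitus",
  "type 1 diabetes nephropathy"
)

rule <- function(x) {
  str_replace_all(x, "type 1 diabetes(?=,|$)", "type 1 diabetes mellitus")
}

once  <- rule(terms)
twice <- rule(once)   # applying the rule again changes nothing

once
```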

4.2.13 Type 1 diabetes mellitus nephropathy = type 1 diabetes nephropathy

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
        stringr::str_replace_all(collected_all_disease_terms,
                         "type 1 diabetes mellitus nephropathy",
                         "type 1 diabetes nephropathy")
        )

4.2.14 Von willebrand disease (hereditary or acquired) = von willebrand disease

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "von willebrand disease \\(hereditary or acquired\\)",
                          "von willebrand disease"
         ))

4.3 Typos

4.3.1 Alzheimer -> alzheimers

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "alzheimer disease",
                          "alzheimers disease"
         ))

4.3.2 Parkinson -> parkinsons

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
         stringr::str_replace_all(collected_all_disease_terms,
                          "parkinson disease",
                          "parkinsons disease"
         ))

5 Other

5.1 Decreased susceptibility to hepatitis c infection -> decreased hepatitis c -> hepatitis c

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms = 
         stringr::str_replace_all(collected_all_disease_terms,
                          "decreased hepatitis c",
                          "hepatitis c"
         ))
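
The chain in the heading can be traced on a single string. This sketch replays only the two rules involved (the susceptibility rule from section 4.1.1 and the rule above) on a hypothetical input.

```r
library(stringr)

x <- "decreased susceptibility to hepatitis c infection"  # hypothetical input

# Step 1: the generic susceptibility rule strips the prefix and "infection"
step1 <- str_replace_all(x,
                         "susceptibility to (.*?) (.*?) infection(?=,|$)", "\\1 \\2")

# Step 2: the rule from this section drops the leftover "decreased"
step2 <- str_replace_all(step1, "decreased hepatitis c", "hepatitis c")

step1
step2
```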

6 Putting it all together

gwas_study_info = gwas_study_info |>
  rowwise() |>
  mutate(
    collected_all_disease_terms = paste0(
      unique(unlist(stringr::str_split(collected_all_disease_terms, ", "))),
      collapse = ", "
    )
  ) |>
  ungroup()

           
# final tidy-up: trim whitespace, then drop a trailing comma and a leading ", "
gwas_study_info$collected_all_disease_terms = stringr::str_trim(gwas_study_info$collected_all_disease_terms)
gwas_study_info$collected_all_disease_terms = sub(",$", "", gwas_study_info$collected_all_disease_terms)
gwas_study_info$collected_all_disease_terms = stringr::str_replace_all(gwas_study_info$collected_all_disease_terms, "^, ", "")
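
After harmonization, different original terms in one row can collapse to the same name, which is why the terms are de-duplicated per row. The sketch below shows the same split, unique, and paste logic as a small vectorised helper; `dedup_terms` is a hypothetical name and is not part of the analysis code.

```r
library(stringr)

# Hypothetical helper: de-duplicate comma-separated terms within each string
dedup_terms <- function(x) {
  parts <- str_split(x, ", ")
  vapply(parts, function(p) paste(unique(p), collapse = ", "), character(1))
}

# Hypothetical term lists for illustration only
terms <- c("asthma, asthma, eczema", "alzheimers disease, alzheimers disease", "covid-19")

deduped <- dedup_terms(terms)
deduped
```

This avoids rowwise(), which can be slow on large tables, but the result for each row should be the same as in the chunk above.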

6.1 Final summary - number of studies per trait - after harmonization

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(collected_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(collected_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))

head(n_studies_trait)
# A tibble: 6 × 2
  collected_all_disease_terms n_studies
  <chr>                           <int>
1 type 2 diabetes mellitus          145
2 alzheimers disease                116
3 breast cancer                     112
4 asthma                            110
5 major depressive disorder         108
6 schizophrenia                     103
dim(n_studies_trait)
[1] 3065    2

6.2 Number of unique disease terms after harmonization

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$collected_all_disease_terms[gwas_study_info$collected_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()

length(unique(diseases))
[1] 2194
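
The `!= ""` filter matters because splitting an empty string returns an empty string, which would otherwise be counted as a term. A sketch on hypothetical values:

```r
library(stringr)

# Hypothetical term lists; the third row has no disease term
terms <- c("asthma, eczema", "asthma", "", "eczema, covid-19")

with_filter    <- str_trim(unlist(str_split(terms[terms != ""], ", ")))
without_filter <- str_trim(unlist(str_split(terms, ", ")))

length(unique(with_filter))
length(unique(without_filter))
```

Without the filter, the empty string adds one spurious entry to the count of unique terms.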

6.3 Frequency of unique disease terms after harmonization

# make frequency table
freq <- table(as.factor(diseases))

# sort in decreasing order
freq_sorted <- sort(freq, decreasing = TRUE)

# show the 10 most frequent terms
head(freq_sorted, 10)

   chronic kidney disease              hypertension  type 2 diabetes mellitus 
                    10828                      6991                       922 
major depressive disorder   coronary artery disease        alzheimers disease 
                      471                       456                       406 
            schizophrenia                  covid-19                    asthma 
                      356                       305                       283 
            breast cancer 
                      270 
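
As far as the code shows, these frequencies count rows of gwas_study_info rather than distinct publications, which would explain why they are much larger than the per-trait study counts in section 6.1. The sketch below contrasts the two ways of counting on a hypothetical table; `toy` and its values are made up.

```r
# Hypothetical rows: publication 1 contributes three rows for the same term
toy <- data.frame(
  PUBMED_ID = c(1, 1, 1, 2, 3),
  collected_all_disease_terms = c("asthma", "asthma", "asthma", "asthma, eczema", "eczema")
)

split_terms <- strsplit(toy$collected_all_disease_terms, ", ", fixed = TRUE)
diseases <- unlist(split_terms)

# Row-level frequency, as in the table above
row_freq <- sort(table(diseases), decreasing = TRUE)

# Publication-level frequency: count each (publication, term) pair once
pairs <- unique(data.frame(
  PUBMED_ID = rep(toy$PUBMED_ID, lengths(split_terms)),
  term      = diseases
))
pub_freq <- sort(table(pairs$term), decreasing = TRUE)

row_freq
pub_freq
```

In the toy data asthma appears in four rows but in only two publications.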

6.4 Save the updated gwas_study_info with repaired / harmonized disease terms

fwrite(gwas_study_info,
       here::here("output/gwas_cat/gwas_study_info_disease_trait_simplified.csv"))

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringr_1.5.1     data.table_1.17.8 dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0    compiler_4.3.1    renv_1.0.3        promises_1.3.3   
 [5] tidyselect_1.2.1  Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6      
 [9] later_1.4.2       jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0    
[13] here_1.0.1        R6_2.6.1          generics_0.1.4    knitr_1.50       
[17] tibble_3.3.0      rprojroot_2.1.0   bslib_0.9.0       pillar_1.11.0    
[21] rlang_1.1.6       utf8_1.2.6        cachem_1.1.0      stringi_1.8.7    
[25] httpuv_1.6.16     xfun_0.52         getPass_0.2-4     fs_1.6.6         
[29] sass_0.4.10       cli_3.6.5         withr_3.0.2       magrittr_2.0.3   
[33] ps_1.9.1          digest_0.6.37     processx_3.8.6    rstudioapi_0.17.1
[37] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.4    glue_1.8.0       
[41] whisker_0.4.1     rmarkdown_2.29    httr_1.4.7        tools_4.3.1      
[45] pkgconfig_2.0.3   htmltools_0.5.8.1