Last updated: 2025-09-10

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 50ef69d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/gwas_catalog/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Untracked files:
    Untracked:  code/get_term_descendants.R
    Untracked:  data/gbd/
    Untracked:  data/who/

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/index.Rmd
    Deleted:    analysis/level_1_disease_group.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Deleted:    analysis/non_ontology_trait_collapse.Rmd
    Deleted:    analysis/trait_ontology_collapse.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/level_1_disease_group_cancer.Rmd) and HTML (docs/level_1_disease_group_cancer.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 50ef69d IJbeasley 2025-09-10 Update cancer grouping

library(dplyr)
library(data.table)
library(stringr)

0.1 Ontology help - for getting disease subtypes

source(here::here("code/get_term_descendants.R"))
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_group_l1.csv"))

1 Initial summary - number of unique study terms

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)
# A tibble: 6 × 2
  l1_all_disease_terms      n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        145
2 asthma                          131
3 alzheimers disease              124
4 breast cancer                   112
5 major depressive disorder       108
6 schizophrenia                   103
dim(n_studies_trait)
[1] 2901    2
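
The count above de-duplicates (term, PUBMED_ID) pairs before tallying, so a publication that contributes several rows for the same term is counted once. A minimal base R sketch of the same logic, assuming a made-up data frame (the `toy` object and its values are illustrative only, not taken from the GWAS Catalog):

```r
# Toy stand-in for gwas_study_info; values are invented for illustration
toy <- data.frame(
  l1_all_disease_terms = c("asthma", "asthma", "asthma", "breast cancer"),
  PUBMED_ID = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep one row per (term, publication) pair, then count rows per term
pairs <- unique(toy[, c("l1_all_disease_terms", "PUBMED_ID")])
toy_counts <- as.data.frame(
  table(l1_all_disease_terms = pairs$l1_all_disease_terms),
  responseName = "n_studies",
  stringsAsFactors = FALSE
)

# Most-studied terms first, mirroring arrange(desc(n_studies))
toy_counts <- toy_counts[order(-toy_counts$n_studies), ]
print(toy_counts)
```

Without the `unique()` step, the two rows for asthma in publication 1 would both be counted.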

1.1 When separating studies with multiple terms

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


length(unique(diseases))
[1] 2029
test <- data.frame(trait = unique(diseases))
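
As a small illustration of the splitting step, the sketch below applies the same idea (drop empty strings, split on ", ", trim, de-duplicate) to an invented vector. It uses base R `strsplit()` and `trimws()` in place of the stringr calls to stay dependency-free; `toy_terms` is hypothetical.

```r
# Invented comma-separated term strings, including an empty entry
toy_terms <- c("asthma, breast cancer", "breast cancer", "", "asthma , lung cancer")

# Drop empty strings, split each entry into single terms, trim stray spaces
non_empty <- toy_terms[toy_terms != ""]
toy_diseases <- trimws(unlist(strsplit(non_empty, ", ", fixed = TRUE)))

# Unique single terms across all entries
toy_unique <- unique(toy_diseases)
print(toy_unique)
```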

2 Disease subtype grouping (cancer)

2.0.1 Bone cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002129/descendants"

bone_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 232
[1] "\n Some example terms"
[1] "cancer affecting bone of limb skeleton"
[2] "bone marrow cancer"                    
[3] "bone sarcoma"                          
[4] "primary bone lymphoma"                 
[5] "adult extraskeletal osteosarcoma"      
bone_cancer_terms = stringr::str_replace_all(bone_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

bone_cancer_terms = c("malignant bone neoplasm",
                      bone_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(bone_cancer_terms, collapse = "|"),
                                   "bone cancer"
                          )  
        )
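
The descendant URLs used throughout this page embed the term IRI percent-encoded twice (for example, `:` appears as `%253A`). A minimal sketch of assembling such a URL from its parts is shown below. `ols_descendants_url` is a hypothetical helper, not something defined in `code/get_term_descendants.R`, and the URL layout is simply copied from the URLs on this page.

```r
# Hypothetical helper: build a descendants URL with a double-encoded IRI
ols_descendants_url <- function(ontology, iri) {
  # First pass: encode reserved characters such as ':' and '/'
  once <- utils::URLencode(iri, reserved = TRUE)
  # Second pass: repeated = TRUE is needed because URLencode() otherwise
  # returns input that already contains %XX escapes unchanged
  twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)
  paste0("http://www.ebi.ac.uk/ols4/api/ontologies/", ontology,
         "/terms/", twice, "/descendants")
}

bone_url <- ols_descendants_url("mondo",
                                "http://purl.obolibrary.org/obo/MONDO_0002129")
print(bone_url)
```

Building the URL from the ontology name and the plain IRI makes it easier to check the term ID by eye than reading the encoded string.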

2.0.2 Bladder cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4007/descendants"

# maybe do urinary bladder cancer instead

bladder_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 33
[1] "\n Some example terms"
[1] "superficial urinary bladder cancer"                 
[2] "jewett-marshall bladder cancer"                     
[3] "urinary bladder small cell neuroendocrine carcinoma"
[4] "bladder urothelial carcinoma"                       
[5] "bladder squamous cell carcinoma"                    
bladder_cancer_terms = stringr::str_replace_all(bladder_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(bladder_cancer_terms, collapse = "|"),
                                   "bladder cancer"
                          )  
        )
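
The `\\bcarcinoma` replacement used in each section only rewrites "carcinoma" when it starts a word, so compound words such as "adenocarcinoma" are left alone. A short sketch on invented input, using base R `gsub()` with `perl = TRUE` as a stand-in for `stringr::str_replace_all()` (the two are expected to agree for a pattern this simple):

```r
# Invented example terms
toy_terms <- c("bladder urothelial carcinoma",
               "esophageal adenocarcinoma",
               "carcinoma of liver")

# \b requires a word boundary immediately before "carcinoma"
normalised <- gsub("\\bcarcinoma", "cancer", toy_terms, perl = TRUE)
print(normalised)
```

This is why terms such as "esophageal adenocarcinoma" need their own explicit entries later on this page.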

2.0.3 Breast cancer

breast_cancer_terms <- grep("breast cancer", unique(diseases), value = T)

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(breast_cancer_terms, collapse = "|"),
                                   "breast cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "breast cancer in situ",
                                   "breast cancer"
                          )  
        ) |>
     mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "invasive lobular cancer",
                                   "breast cancer"
                          )  
        )

2.0.4 Benign neoplasms

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "benign neoplasm of (.*?)(?=,|$)|benign neoplasm of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "benign (.*?) neoplasm(?=,|$)|benign (.*?) (.*?) neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    # both alternatives require "benign", so other "... neoplasm" terms are left untouched
                    "(.*?) benign neoplasm(?=,|$)|(.*?) (.*?) benign neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "polyp of (.*?)(?=,|$)|polyp of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    # stray space after the second "polyp" removed so the lookahead can match
                    "(?=,|^)(.*?) polyp(?=,|$)|(?=,|^)(.*?) (.*?) polyp(?=,|$)", 
                    "benign neoplasm")
         ) 

# https://my.clevelandclinic.org/health/diseases/21477-adenomas - benign
other_benign_neoplasms = c("adenomatous colon polyp",
                           "colorectal adenoma",
                           "pituitary gland adenoma",
                           "aldosterone-producing adenoma",
                           "female genital tract polyp",
                           "polyp"
                           )

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    paste0(other_benign_neoplasms, collapse = "|"), 
                    "benign neoplasm")
         )
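
The patterns in this section rely on a lazy `(.*?)` followed by the lookahead `(?=,|$)`. The sketch below shows how that behaves on invented comma-separated strings, and how a prefix-style pattern can run across a comma. The second pattern is one possible term-bounded alternative, offered as an assumption rather than as the method used on this page. Base R `gsub()` with `perl = TRUE` stands in for stringr here.

```r
# Suffix-style pattern: the lazy group stops at the first comma or end of string
x1 <- "benign neoplasm of colon, asthma"
suffix_result <- gsub("benign neoplasm of (.*?)(?=,|$)", "benign neoplasm",
                      x1, perl = TRUE)

# Prefix-style pattern: '.' also matches commas, so the match can start
# at the beginning of the string and swallow an unrelated earlier term
x2 <- "asthma, colon polyp"
prefix_result <- gsub("(.*?) polyp(?=,|$)", "benign neoplasm",
                      x2, perl = TRUE)

# Possible alternative (assumption): only allow non-comma characters and
# require the match to start at the beginning of a term
bounded_result <- gsub("(?:^|(?<=, ))[^,]* polyp(?=,|$)", "benign neoplasm",
                       x2, perl = TRUE)

print(suffix_result)
print(prefix_result)
print(bounded_result)
```

In this toy case the prefix-style pattern drops "asthma", while the bounded variant keeps it and rewrites only the polyp term.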

2.0.5 Central nervous system cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000326/descendants"

cns_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 173
[1] "\n Some example terms"
[1] "malignant jugulotympanic paraganglioma"   
[2] "malignant adrenal gland pheochromocytoma" 
[3] "central nervous system lymphoma"          
[4] "central nervous system embryonal neoplasm"
[5] "oligoastrocytoma"                         
cns_cancer_terms = stringr::str_replace_all(cns_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
pattern = paste0(cns_cancer_terms, collapse = "|")

gwas_study_info = gwas_study_info |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = pattern,
                                   "central nervous system cancer"
                          )  
        )

2.0.6 Cervical cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4362/descendants"

cervical_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 37
[1] "\n Some example terms"
[1] "cervix endometrial stromal tumor"    "cervix melanoma"                    
[3] "cervical alveolar soft part sarcoma" "epithelioid trophoblastic tumor"    
[5] "cervical adenosquamous carcinoma"   
cervical_cancer_terms = stringr::str_replace_all(cervical_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

cervical_cancer_terms = c("cervical intraepithelial neoplasia grade 2/3",
                          "uterine cervical cancer in situ",
                          cervical_cancer_terms)

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "uterine cervical cancer in situ",
                                   "cervical cancer"
                          )  
        )
gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(cervical_cancer_terms, collapse = "|"),
                                   "cervical cancer"
                          )  
        )

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "uterine cervical cancer in situ",
                                   "cervical cancer"
                          )  
        )

2.0.7 Colorectal cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0005575/descendants"

colorectal_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 106
[1] "\n Some example terms"
[1] "colorectal lymphoma"        "colorectal carcinoma"      
[3] "familial colorectal cancer" "malignant colon neoplasm"  
[5] "rectal cancer"             
colorectal_cancer_terms = stringr::str_replace_all(colorectal_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

colorectal_cancer_terms= c("metastatic colorectal cancer",
                             "rectum cancer",
                          colorectal_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(colorectal_cancer_terms, collapse = "|"),
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "colorectal mucinous adenocarcinoma",
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "metachronous colorectal adenoma",
                                   "colorectal cancer"
                          )  
        )

2.0.8 Endometrial cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0011962/descendants"

endometrial_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 23
[1] "\n Some example terms"
[1] "endometrioid stromal sarcoma"              
[2] "endometrial carcinoma"                     
[3] "endometrioid stromal sarcoma of the cervix"
[4] "uterine corpus endometrial stromal sarcoma"
[5] "endometrioid stromal sarcoma of the vagina"
# also: http://www.ebi.ac.uk/efo/EFO_1001514: endometrial endometrioid carcinoma
endometrial_cancer_terms = c("endometrial endometrioid carcinoma",
                             endometrial_cancer_terms)

endometrial_cancer_terms = stringr::str_replace_all(endometrial_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(endometrial_cancer_terms, collapse = "|"),
                                   "endometrial cancer"
                          )  
        )

2.0.9 Esophageal cancer

esophageal_cancer_terms <- c("esophageal adenocarcinoma",
                             "esophageal squamous cell cancer")


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(esophageal_cancer_terms, collapse = "|"),
                                   "esophageal cancer"
                          )  
        )

2.0.10 Eye cancer (to add)

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002236/descendants"

ocular_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 75
[1] "\n Some example terms"
[1] "metastatic malignant neoplasm in the eye"
[2] "eyelid cancer"                           
[3] "ocular melanoma"                         
[4] "eye lymphoma"                            
[5] "cornea cancer"                           
ocular_cancer_terms = stringr::str_replace_all(ocular_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(ocular_cancer_terms, collapse = "|"),
                                   "ocular cancer"
                          )  
        )

2.0.11 Gallbladder and biliary tract cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                         "cancer of gallbladder and extrahepatic biliary tract",
                         "gallbladder and bilary tract cancer"
         )
         )

2.0.12 Hodgkin lymphoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          "nodular sclerosis hodgkin lymphoma",
                          "hodgkins lymphoma"
         ))

2.0.13 Head and neck cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          "head and neck squamous cell cancer",
                          "head and neck cancer, squamous cell cancer"
         ))

2.0.14 Intestinal cancer (non-colorectal)

# "small intestine cancer" was listed twice; the duplicate is dropped
intestinal_cancer_terms <- c("small intestine cancer",
                           "small bowel cancer")

2.0.15 Kidney cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_263/descendants"


kidney_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 51
[1] "\n Some example terms"
[1] "malignant cystic nephroma"       "kidney liposarcoma"             
[3] "renal pelvis carcinoma"          "congenital mesoblastic nephroma"
[5] "renal carcinoma"                
kidney_cancer_terms = c("renal cell carcinoma",
                       "clear cell renal carcinoma",
                       "clear cell renal cell carcinoma",
                       kidney_cancer_terms)

kidney_cancer_terms = stringr::str_replace_all(kidney_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

kidney_cancer_terms =  stringr::str_replace_all(kidney_cancer_terms,
                          "renal cancer",
                          "kidney cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(kidney_cancer_terms, collapse = "|"),
                                   "kidney cancer"
                          )  
        )

2.0.16 Laryngeal cancer

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "laryngeal squamous cell cancer|laryngeal cancer",
                                   "larynx cancer"
                          )  
        )

2.0.17 Leukemia

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000565/descendants"

leukemia_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 133
[1] "\n Some example terms"
[1] "chronic eosinophilic leukemia, not otherwise specified"
[2] "acute leukemia"                                        
[3] "mast-cell leukemia"                                    
[4] "lymphoid leukemia"                                     
[5] "myeloid leukemia"                                      
leukemia_terms = stringr::str_replace_all(leukemia_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(leukemia_terms, collapse = "|"),
                                   "leukemia"
                          )  
        )

2.0.18 Lip and oral cavity cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005570/descendants"

lip_oral_cavity_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 51
[1] "\n Some example terms"
[1] "squamous odontogenic tumor" "lip cancer"                
[3] "gum cancer"                 "oral cavity carcinoma"     
[5] "vestibule of mouth cancer" 
lip_oral_cavity_cancer_terms = stringr::str_replace_all(lip_oral_cavity_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(lip_oral_cavity_cancer_terms, collapse = "|"),
                                   "lip and oral cavity cancer"
                          )  
        )

2.0.19 Liver cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002691/descendants"

liver_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 76
[1] "\n Some example terms"
[1] "carcinoma of liver and intrahepatic biliary tract"      
[2] "calcifying nested epithelial stromal tumor of the liver"
[3] "liver lymphoma"                                         
[4] "biliary tract cancer"                                   
[5] "liver sarcoma"                                          
liver_cancer_terms = stringr::str_replace_all(liver_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")


liver_cancer_terms = c("hepatitis virus-related liver cancer",
                       liver_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(liver_cancer_terms, collapse = "|"),
                                   "liver cancer"
                          )  
        )

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "hepatitis virus-related liver cancer",
                                   "liver cancer"
                          )  
        )

2.0.20 Lung cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008903/descendants"

lung_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 63
[1] "\n Some example terms"
[1] "graham-boyle-troxell syndrome"      "malignant superior sulcus neoplasm"
[3] "lung carcinoma"                     "lung hilum cancer"                 
[5] "lung lymphoma"                     
lung_cancer_terms = stringr::str_replace_all(lung_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(lung_cancer_terms, collapse = "|"),
                                   "lung cancer"
                          )  
        )

2.0.21 Ovarian cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008170/descendants"

ovarian_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 77
[1] "\n Some example terms"
[1] "malignant non-epithelial tumor of ovary" 
[2] "malignant epithelial tumor of ovary"     
[3] "familial ovarian cancer"                 
[4] "ovarian endometrioid adenocarcinofibroma"
[5] "ovarian neuroendocrine neoplasm"         
ovarian_cancer_terms = stringr::str_replace_all(ovarian_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

ovarian_cancer_terms = c("high grade serous ovarian cancer",
                         "high grade ovarian cancer",
                         "high grade ovarian cancers",
                         "ovarian endometrioid cancer", # http://www.ebi.ac.uk/efo/EFO_1001515 - ovarian endometrioid carcinoma
                         "ovarian serous cancer", # http://www.ebi.ac.uk/efo/EFO_1001516 - ovarian serous carcinoma
                         ovarian_cancer_terms
                         )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(ovarian_cancer_terms, collapse = "|"),
                                   "ovarian cancer"
                  
        )
 )
        
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "high grade ovarian cancer",
                                   "ovarian cancer"
                          )  
        )
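
Several sections on this page, including this one, re-apply a replacement after the main `paste0(terms, collapse = "|")` pattern. One likely reason is that regex alternation tries alternatives in order, so a shorter term listed first can match inside a longer one and leave residue behind. The sketch below shows that effect on invented input and one way to avoid it: escape metacharacters, put longer terms first, and match whole comma-separated terms only. `build_term_pattern` and `collapse_terms` are hypothetical helpers, and base R `gsub()` with `perl = TRUE` stands in for stringr.

```r
# Hypothetical helper: one regex that matches any whole term in a ", " list
build_term_pattern <- function(terms) {
  terms <- unique(terms[nzchar(terms)])
  # Longer terms first, so "x cancers" is tried before "x cancer"
  terms <- terms[order(nchar(terms), decreasing = TRUE)]
  # Escape regex metacharacters that may occur in ontology labels
  escaped <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", terms, perl = TRUE)
  paste0("(?:^|(?<=, ))(?:", paste0(escaped, collapse = "|"), ")(?=,|$)")
}

# Hypothetical helper: replace every matching whole term with a group label
collapse_terms <- function(x, terms, label) {
  gsub(build_term_pattern(terms), label, x, perl = TRUE)
}

toy_terms <- c("ovarian cancer", "high grade ovarian cancer",
               "high grade ovarian cancers")
x <- c("high grade ovarian cancers, asthma", "ovarian cancer screening")

# Naive alternation: the shorter alternative wins and leaves a trailing "s"
naive <- gsub(paste0(toy_terms, collapse = "|"), "ovarian cancer", x, perl = TRUE)

# Length-ordered, whole-term matching
bounded <- collapse_terms(x, toy_terms, "ovarian cancer")

print(naive)
print(bounded)
```

With whole-term matching, the follow-up replacement for leftover fragments should no longer be needed for inputs like these, and unrelated terms that merely contain a listed term are left unchanged.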

2.0.22 Pancreatic cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0009831/descendants"

pancreatic_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 37
[1] "\n Some example terms"
[1] "pancreatic endocrine carcinoma"                
[2] "pancreas sarcoma"                              
[3] "malignant exocrine pancreas neoplasm"          
[4] "pancreas lymphoma"                             
[5] "pancreatic small cell neuroendocrine carcinoma"
pancreatic_cancer_terms = stringr::str_replace_all(pancreatic_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(pancreatic_cancer_terms, collapse = "|"),
                                   "pancreatic cancer"
                          )  
        )

2.0.23 Prostate cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10283/descendants"

prostate_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 21
[1] "\n Some example terms"
[1] "prostate small cell carcinoma"    "adenosquamous prostate carcinoma"
[3] "prostate sarcoma"                 "prostate neuroendocrine neoplasm"
[5] "prostate lymphoma"               
prostate_cancer_terms = stringr::str_replace_all(prostate_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

prostate_cancer_terms = c("grade iii prostatic intraepithelial neoplasia",
                          "metastatic prostate cancer",
                          prostate_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(prostate_cancer_terms, collapse = "|"),
                                   "prostate cancer"
                          )  
        )

2.0.24 Non-Hodgkins lymphoma

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005952/descendants"

nhl_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 93
[1] "\n Some example terms"
[1] "b-cell non-hodgkins lymphoma" "lymphoma, aids-related"      
[3] "sezary's disease"             "acute lymphoblastic leukemia"
[5] "gastric non-hodgkin lymphoma"
nhl_terms = stringr::str_replace_all(nhl_terms, # was leukemia_terms, which overwrote the lymphoma descendants
                            "\\bcarcinoma",
                            "cancer")

nhl_terms = c("central nervous system non-hodgkin lymphoma",
              "lymphoblastic lymphoma",
              "extranodal nasal nk/t cell lymphoma", # https://www.ebi.ac.uk/ols4/ontologies/ordo/classes/http%253A%252F%252Fwww.orpha.net%252FORDO%252FOrphanet_86879
              "follicular lymphoma", # http://purl.obolibrary.org/obo/DOID_0050873
              "marginal zone b-cell lymphoma",
              "diffuse large b-cell lymphoma",
              nhl_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(nhl_terms, collapse = "|"),
                                   "non-hodgkins lymphoma"
                          )  
        )
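
The example output above lists "acute lymphoblastic leukemia" among the lymphoma descendants. If the leukemia descendant list also contains that term (an assumption, since only five example terms are printed per section), then the order of the grouping steps decides which label it receives. A toy sketch of checking for shared terms and of the order effect, using invented term vectors:

```r
# Invented stand-ins for the two descendant lists
toy_leukemia <- c("acute lymphoblastic leukemia", "myeloid leukemia")
toy_nhl <- c("follicular lymphoma", "acute lymphoblastic leukemia")

# Terms claimed by both groups
shared <- intersect(toy_leukemia, toy_nhl)

# Apply the groupings in the same order as this page: leukemia first
x <- "acute lymphoblastic leukemia, follicular lymphoma"
step1 <- gsub(paste0(toy_leukemia, collapse = "|"), "leukemia", x)
step2 <- gsub(paste0(toy_nhl, collapse = "|"), "non-hodgkins lymphoma", step1)

print(shared)
print(step2)
```

Running `intersect()` on the real term vectors before applying the replacements would show which terms are affected by the ordering.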

2.0.25 Non-melanoma skin cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0009260/descendants"

non_melanoma_skin_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 11
[1] "\n Some example terms"
[1] "keratinocyte carcinoma"                  
[2] "basal cell carcinoma"                    
[3] "skin basal cell carcinoma"               
[4] "skin basosquamous cell carcinoma"        
[5] "salivary gland basal cell adenocarcinoma"
non_melanoma_skin_cancer_terms = stringr::str_replace_all(non_melanoma_skin_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(non_melanoma_skin_cancer_terms, collapse = "|"),
                                   "non-melanoma skin cancer"
                          )  
        )

2.0.26 Other pharynx cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_0060119/descendants"

other_pharynx_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 16
[1] "\n Some example terms"
[1] "nasopharynx carcinoma"           "oropharynx cancer"              
[3] "hypopharynx cancer"              "pharynx squamous cell carcinoma"
[5] "tonsillar fossa cancer"         
other_pharynx_cancer_terms = stringr::str_replace_all(other_pharynx_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

other_pharynx_cancer_terms = c("hypopharyngeal cancer",
                               other_pharynx_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(other_pharynx_cancer_terms, collapse = "|"),
                                   "other pharynx cancer"
                          )  
        )

2.0.27 Stomach cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10534/descendants"

stomach_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 32
[1] "\n Some example terms"
[1] "gastric liposarcoma"               "gastric gastrinoma"               
[3] "gastric teratoma"                  "stomach carcinoma"                
[5] "malignant gastric germ cell tumor"
stomach_cancer_terms = stringr::str_replace_all(stomach_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

stomach_cancer_terms = c(
                          "diffuse stomach cancer",
                          "gastric cancer",
                          stomach_cancer_terms
                          )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(stomach_cancer_terms, collapse = "|"),
                                   "stomach cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "diffuse stomach cancer",
                                   "stomach cancer"
                          )  
        )
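
The second replacement above looks redundant at first, because "diffuse stomach cancer" is already in the term list. One way a leftover could still arise is a label that is not in the list but contains a listed term. The toy example below illustrates this with a made-up label, "diffuse gastric cancer"; whether such a label occurs in the real data is an assumption.

```r
library(stringr)

# Toy terms and labels, made up for illustration
toy_terms <- c("diffuse stomach cancer", "gastric cancer", "stomach carcinoma")
toy_input <- c("gastric cancer", "diffuse gastric cancer")

# First pass: only the "gastric cancer" part of the second label is replaced,
# because "diffuse gastric cancer" itself is not in the term list
first_pass <- str_replace_all(toy_input,
                              paste0(toy_terms, collapse = "|"),
                              "stomach cancer")

# Second pass: collapse the leftover qualifier
second_pass <- str_replace_all(first_pass,
                               "diffuse stomach cancer",
                               "stomach cancer")

print(first_pass)
print(second_pass)
```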

2.0.28 Squamous cell carcinoma

gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "cutaneous squamous cell cancer",
                                   "squamous cell cancer"
                          )  
        )

2.0.29 Testicular cancer

2.0.30 Thyroid cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_1781/descendants"

thyroid_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 24
[1] "\n Some example terms"
[1] "thyroid sarcoma"                       
[2] "thyroid gland carcinoma"               
[3] "thyroid lymphoma"                      
[4] "thyroid angiosarcoma"                  
[5] "thyroid gland mucoepidermoid carcinoma"
thyroid_cancer_terms = stringr::str_replace_all(thyroid_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

thyroid_cancer_terms = c("differentiated thyroid cancer",
                         thyroid_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(thyroid_cancer_terms, collapse = "|"),
                                   "thyroid cancer"
                          )  
        )

2.0.31 Uterine cancer

uterine_cancer_terms <- c("uterine corpus cancer",
                          "uterine adnexa cancer")

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(uterine_cancer_terms, collapse = "|"),
                                   "uterine cancer"
                          )  
        )

2.1 Ocular Melanoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "uveal melanoma|uveal melanoma disease severity|epithelioid cell uveal melanoma|choroidal melanoma",
                          "ocular melanoma"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "ocular melanoma disease severity",
                          "ocular melanoma"
         ))

3 Final summary - number of unique study terms

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)
# A tibble: 6 × 2
  l1_all_disease_terms      n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        145
2 asthma                          131
3 alzheimers disease              124
4 breast cancer                   122
5 major depressive disorder       108
6 colorectal cancer               104
dim(n_studies_trait)
[1] 2722    2

3.0.1 When studies with multiple terms are separated

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


test <- data.frame(trait = unique(diseases))

length(unique(diseases))
[1] 1875
# make frequency table
freq <- table(as.factor(diseases))

# sort in decreasing order
freq_sorted <- sort(freq, decreasing = TRUE)

# show top N, e.g. top 10
head(freq_sorted, 10)

   chronic kidney disease              hypertension  type 2 diabetes mellitus 
                    10835                      7093                       922 
  coronary artery disease major depressive disorder           benign neoplasm 
                      514                       471                       430 
       alzheimers disease             breast cancer                    asthma 
                      422                       404                       357 
            schizophrenia 
                      356 
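
The frequency table above counts rows of gwas_study_info, whereas n_studies_trait counts distinct PUBMED_ID values, which is why the two sets of numbers differ. A minimal sketch of a per-publication count after splitting multi-term entries is given below. The toy data frame and the names long and per_publication are hypothetical; the column names follow those used above.

```r
library(dplyr)
library(stringr)

# Toy stand-in for gwas_study_info (values are made up for illustration)
toy <- data.frame(
  PUBMED_ID = c(1, 1, 2, 3),
  l1_all_disease_terms = c("asthma, hypertension", "asthma", "hypertension", ""),
  stringsAsFactors = FALSE
)

# Drop empty entries, then split each entry into its individual terms
toy <- toy[toy$l1_all_disease_terms != "", ]
term_list <- str_split(toy$l1_all_disease_terms, pattern = ", ")

# One row per (publication, term) pair
long <- data.frame(
  PUBMED_ID = rep(toy$PUBMED_ID, lengths(term_list)),
  trait = str_trim(unlist(term_list))
)

# Count each term once per publication instead of once per row
per_publication <- long |>
  distinct() |>
  count(trait, name = "n_publications", sort = TRUE)

print(per_publication)
```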

3.0.2 Save the updated gwas_study_info with harmonized disease terms

fwrite(gwas_study_info,
        here::here("output/gwas_cat/gwas_study_info_group_l1_v2.csv")
         )

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] jsonlite_2.0.0    httr_1.4.7        stringr_1.5.1     data.table_1.17.8
[5] dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] compiler_4.3.1    renv_1.0.3        promises_1.3.3    tidyselect_1.2.1 
 [5] Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6       later_1.4.2      
 [9] jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0     here_1.0.1       
[13] R6_2.6.1          generics_0.1.4    curl_6.4.0        knitr_1.50       
[17] tibble_3.3.0      rprojroot_2.1.0   bslib_0.9.0       pillar_1.11.0    
[21] rlang_1.1.6       utf8_1.2.6        cachem_1.1.0      stringi_1.8.7    
[25] httpuv_1.6.16     xfun_0.52         getPass_0.2-4     fs_1.6.6         
[29] sass_0.4.10       cli_3.6.5         withr_3.0.2       magrittr_2.0.3   
[33] ps_1.9.1          digest_0.6.37     processx_3.8.6    rstudioapi_0.17.1
[37] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.4    glue_1.8.0       
[41] whisker_0.4.1     rmarkdown_2.29    tools_4.3.1       pkgconfig_2.0.3  
[45] htmltools_0.5.8.1