Last updated: 2025-09-17

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 41b1b7c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/who/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Unstaged changes:
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   code/get_term_descendants.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/level_1_disease_group_cancer.Rmd) and HTML (docs/level_1_disease_group_cancer.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 41b1b7c IJbeasley 2025-09-17 Better grouping of cardiovascular disease
html fa95c62 IJbeasley 2025-09-17 Build site.
Rmd 57e46da IJbeasley 2025-09-17 More typo fixing
html cfd2ef8 IJbeasley 2025-09-17 Build site.
Rmd 7df4726 IJbeasley 2025-09-17 Dealing with non-specific cancer labels
html 2aa6027 IJbeasley 2025-09-17 Build site.
Rmd b6f20c4 IJbeasley 2025-09-17 Adding more benign neoplasm
html 83152bd IJbeasley 2025-09-16 Build site.
html d7db734 IJbeasley 2025-09-16 Build site.
Rmd 53bf24e IJbeasley 2025-09-16 More cancer typos
html b0f0ff5 IJbeasley 2025-09-16 Build site.
Rmd e8fb82c IJbeasley 2025-09-16 Correcting some cancer grouping
html de1a740 IJbeasley 2025-09-16 Build site.
Rmd 69d6255 IJbeasley 2025-09-16 Improving cancer grouping
html da4e2cc IJbeasley 2025-09-16 Build site.
Rmd 0196914 IJbeasley 2025-09-16 More disease grouping
html 937b460 IJbeasley 2025-09-16 Build site.
Rmd 3ac50bd IJbeasley 2025-09-16 Even more disease term grouping
html 7c6dee8 IJbeasley 2025-09-15 Build site.
Rmd 4451421 IJbeasley 2025-09-15 Grouping more neoplasms
html 2e145f8 IJbeasley 2025-09-15 Build site.
Rmd 2702dc1 IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 7fe9a06 IJbeasley 2025-09-15 Build site.
Rmd 81a1d22 IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 1f89b20 IJbeasley 2025-09-15 Build site.
Rmd fdd60ed IJbeasley 2025-09-15 More disease term grouping
html bf45a69 IJbeasley 2025-09-15 Build site.
Rmd 1414cad IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 3c8309c IJbeasley 2025-09-15 Build site.
Rmd 17a16b0 IJbeasley 2025-09-15 Further grouping of disease terms
html 778ac1e IJbeasley 2025-09-15 Build site.
Rmd bb5431c IJbeasley 2025-09-15 Dealing with duplicate disease terms
html 9f69979 IJbeasley 2025-09-10 Build site.
html 9ca183a IJbeasley 2025-09-10 Build site.
Rmd 50ef69d IJbeasley 2025-09-10 Update cancer grouping

library(dplyr)
library(data.table)
library(stringr)

0.1 Ontology help - for getting disease subtypes

source(here::here("code/get_term_descendants.R"))
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_group_l1.csv"))

1 Initial summary - number of unique study terms

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)
# A tibble: 6 × 2
  l1_all_disease_terms      n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        145
2 asthma                          134
3 alzheimers disease              124
4 breast cancer                   112
5 ischemic heart disease          109
6 major depressive disorder       108
dim(n_studies_trait)
[1] 2690    2
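To illustrate the counting logic above, here is a minimal sketch on made-up data. It uses base R rather than dplyr so that it runs without extra packages. The `toy` data frame and its values are assumptions for illustration only; the column names mirror those in `gwas_study_info`.

```r
# Toy stand-in for gwas_study_info (values are made up for illustration).
toy <- data.frame(
  l1_all_disease_terms = c("asthma", "asthma", "asthma", "breast cancer"),
  PUBMED_ID = c(1L, 1L, 2L, 3L),
  DISEASE_STUDY = c(TRUE, TRUE, TRUE, TRUE)
)

# Keep disease studies, then drop repeated (term, publication) pairs so that a
# publication with several rows is counted once per term string.
pairs <- unique(toy[toy$DISEASE_STUDY, c("l1_all_disease_terms", "PUBMED_ID")])

# Count publications per term string and sort from most to least studied.
n_studies <- sort(table(pairs$l1_all_disease_terms), decreasing = TRUE)
print(n_studies)
```

Note that the grouping key is the whole `l1_all_disease_terms` string, so a study annotated with two comma-separated terms is counted under the combined string rather than under each term.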

1.1 When separating studies with multiple terms

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


length(unique(diseases))
[1] 1878
test <- data.frame(trait = unique(diseases))
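The count of 1878 is lower than the 2690 rows reported earlier because the earlier table treats each comma-separated combination as its own term, whereas here every combination is split into individual terms first. A minimal base R sketch of that split, on made-up strings (`toy_terms` is an assumption for illustration):

```r
# Made-up term strings; the empty string stands for a study with no disease term.
toy_terms <- c("asthma, breast cancer",
               "",
               "breast cancer",
               "asthma, type 2 diabetes mellitus")

# Drop empty strings, split on the ", " separator, and flatten to one vector.
split_terms <- unlist(strsplit(toy_terms[toy_terms != ""], ", ", fixed = TRUE))
split_terms <- trimws(split_terms)

# Three distinct combinations collapse to three distinct single terms here,
# but in general the two counts differ.
unique_terms <- unique(split_terms)
print(unique_terms)
```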

2 Cancer aggressiveness measurement

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_remove_all(l1_all_disease_terms,
                                  pattern = "^cancer aggressiveness"
                          )
 ) 
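Because the pattern above is anchored with `^`, only an occurrence at the very start of the string is removed, and any separator that follows it stays in place. The sketch below shows this on made-up strings, using base R `gsub()` in place of `stringr::str_remove_all()`. The final tidy-up step is an assumption about what one might want, not something the analysis does.

```r
# Made-up term strings for illustration.
x <- c("cancer aggressiveness, prostate cancer",
       "prostate cancer, cancer aggressiveness",
       "cancer aggressiveness")

# Same anchored pattern as in the analysis: only a leading occurrence matches.
removed <- gsub("^cancer aggressiveness", "", x, perl = TRUE)

# Optional tidy-up (an assumption, not part of the analysis): strip a
# separator left dangling at the start or end of the string.
tidied <- gsub("^, |, $", "", removed, perl = TRUE)

print(removed)
print(tidied)
```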

3 Disease subtype grouping (cancer)

3.0.1 Astrocytoma

gwas_study_info |>
    filter(grepl("astrocytoma", l1_all_disease_terms)) |> 
    pull(STUDY) |> 
    unique()
[1] "Multi-ancestry genome-wide association study of 4,069 children with glioma identifies 9p21.3 risk locus."
# all matches come from a single glioma study, so group them as central nervous system cancer

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  =
          ifelse(PUBMED_ID == "36810956",
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "astrocytoma",
                                   "central nervous system cancer"
                          ),
          l1_all_disease_terms
        )
 )

3.0.2 Bone cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002129/descendants"

bone_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 232
[1] "\n Some example terms"
[1] "acute myeloid leukemia with abnormal bone marrow eosinophils inv(16)(p13q22) or t(16;16)(p13;q22)"
[2] "acute myeloid leukemia and myelodysplastic syndromes related to topoisomerase type 2 inhibitor"   
[3] "acute myeloid leukemia and myelodysplastic syndromes related to alkylating agent"                 
[4] "acute myeloid leukemia and myelodysplastic syndromes related to radiation"                        
[5] "therapy related acute myeloid leukemia and myelodysplastic syndrome"                              
bone_cancer_terms = stringr::str_replace_all(bone_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

bone_cancer_terms = c("malignant bone neoplasm",
                      bone_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(bone_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "bone cancer"
                          )  
        )
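The ontology URLs in this analysis, such as the MONDO_0002129 one used for bone cancer, embed the term IRI in double URL-encoded form (`%253A` rather than `%3A`). As a sketch of how such a URL could be assembled rather than pasted by hand, the helper below encodes the IRI twice with `utils::URLencode()`. The function name `build_descendants_url()` is hypothetical and is not part of the analysis code.

```r
# Hypothetical helper: build an OLS4 "descendants" URL from an ontology id
# and a term IRI. Not part of the analysis code.
build_descendants_url <- function(ontology, iri) {
  # First pass: percent-encode reserved characters such as ":" and "/".
  once <- utils::URLencode(iri, reserved = TRUE)
  # Second pass: repeated = TRUE forces encoding of the already-encoded
  # string, turning each "%" into "%25".
  twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)
  paste0("http://www.ebi.ac.uk/ols4/api/ontologies/", ontology,
         "/terms/", twice, "/descendants")
}

url <- build_descendants_url("mondo",
                             "http://purl.obolibrary.org/obo/MONDO_0002129")
print(url)
```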

3.0.3 Bladder cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4007/descendants"

# maybe do urinary bladder cancer instead

bladder_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 33
[1] "\n Some example terms"
[1] "micropapillary variant infiltrating bladder urothelial carcinoma"
[2] "lymphoma-like variant infiltrating bladder urothelial carcinoma" 
[3] "plasmacytoid variant infiltrating bladder urothelial carcinoma"  
[4] "microcystic variant infiltrating bladder urothelial carcinoma"   
[5] "infiltrating bladder urothelial carcinoma sarcomatoid variant"   
bladder_cancer_terms = stringr::str_replace_all(bladder_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(bladder_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "bladder cancer"
                          )  
        )
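The replacement steps in this analysis build one regex by joining the terms with `"(?=,|$)|\\b"`. Because the separator only appears between terms, the first term gets no leading `\\b` and the last term gets no trailing look-ahead. The sketch below shows what that can mean for the last term and compares it with a variant that wraps every term. `build_term_pattern()` is a hypothetical helper, the strings are made up, and base R `gsub(perl = TRUE)` stands in for `stringr::str_replace_all()`. Whether the difference matters depends on the terms actually present in the data.

```r
# Two example terms, longest first.
terms <- c("infiltrating bladder urothelial cancer", "bladder urothelial cancer")

# Pattern as built in the analysis: separator only between terms.
collapsed <- paste0(terms, collapse = "(?=,|$)|\\b")

# Hypothetical alternative: give every term both the word boundary and the
# "followed by a comma or end of string" look-ahead.
build_term_pattern <- function(terms) {
  paste0("\\b", terms, "(?=,|$)", collapse = "|")
}
wrapped <- build_term_pattern(terms)

# Made-up string in which the last term is only a prefix of a longer term.
x <- "bladder urothelial cancer recurrence, asthma"

loose  <- gsub(collapsed, "bladder cancer", x, perl = TRUE)
strict <- gsub(wrapped,   "bladder cancer", x, perl = TRUE)

# A full-term match is still replaced by the wrapped pattern.
full <- gsub(wrapped, "bladder cancer",
             "asthma, infiltrating bladder urothelial cancer", perl = TRUE)

print(c(loose = loose, strict = strict, full = full))
```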

3.0.4 Breast cancer

breast_cancer_terms <- grep("breast cancer", unique(diseases), value = TRUE)

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(breast_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "breast cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "breast cancer in situ",
                                   "breast cancer"
                          )  
        ) |>
     mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "invasive lobular cancer",
                                   "breast cancer"
                          )  
        )

3.0.5 Benign neoplasms

3.0.5.1 Benign neoplasm of blood vessel

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/snomed/terms/http%253A%252F%252Fsnomed.info%252Fid%252F92017000/descendants"

benign_blood_vessel_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 87
[1] "\n Some example terms"
[1] "abnormal vision due to superficial infantile hemangioma of periorbital region (disorder)"
[2] "cavernous hemangiomas of face and supraumbilical midline raphe (disorder)"               
[3] "superficial infantile hemangioma of periorbital region (disorder)"                       
[4] "capillary hemangioma of bilateral orbital regions (disorder)"                            
[5] "capillary hemangioma of right orbit region (disorder)"                                   
benign_blood_vessel_terms <- stringr::str_replace_all(benign_blood_vessel_terms,
                            "\\bcarcinoma",
                            "cancer")


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms = 
        stringr::str_replace_all(l1_all_disease_terms, 
                    paste0(benign_blood_vessel_terms, collapse = "(?=,|$)|\\b"), 
                    "benign neoplasm")
         )

3.0.6 Labelled benign neoplasms

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "benign neoplasm of (.*?)(?=,|$)|benign neoplasm of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "benign (.*?) neoplasm(?=,|$)|benign (.*?) (.*?) neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "(.*?) benign neoplasm(?=,|$)|(.*?) (.*?) neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "polyp of (.*?)(?=,|$)|polyp of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "(?=,|^)(.*?) polyp(?=,|$)|(?=,|^)(.*?) (.*?) polyp (?=,|$)", 
                    "benign neoplasm")
         ) 
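One property of the patterns above is worth illustrating: `(.*?)` can match commas, so a match may start at an earlier term in the same string and replace it as well. The sketch below contrasts that with a comma-bounded variant using `[^,]*?`. It assumes terms are separated by `", "`, uses a made-up string, and uses base R `gsub(perl = TRUE)` in place of `str_replace_all()`. Whether any row in the data is affected is not checked here.

```r
# Made-up string: an unrelated term followed by a benign neoplasm term.
x <- "asthma, skin benign neoplasm"

# Pattern shape used in the analysis: the lazy group may span the comma,
# so the match starts at the beginning of the string.
spanning <- gsub("(.*?) benign neoplasm(?=,|$)", "benign neoplasm",
                 x, perl = TRUE)

# Comma-bounded variant (an assumption about the intent): the prefix may not
# contain a comma and must start at a word boundary.
bounded <- gsub("\\b[^,]*? benign neoplasm(?=,|$)", "benign neoplasm",
                x, perl = TRUE)

print(c(spanning = spanning, bounded = bounded))
```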

4 Other benign neoplasms

# https://my.clevelandclinic.org/health/diseases/21477-adenomas - benign
other_benign_neoplasms = c("adenomatous colon polyp",
                           "colorectal adenoma",
                           "pituitary gland adenoma",
                           "aldosterone-producing adenoma",
                           "metachronous colorectal adenoma",
                           "adenomatous colon polyp",
                           "female genital tract polyp",
                           "\\bpolyp\\b",
                           "uterine leiomyoma",
                           "hepatic hemangioma",
                           "lobular capilliary hemangioma", 
                           "hemangioma of subcutaneous tissue",
                           "benign prostatic hyperplasia",
                           "melanocytic nevus",
                           "hemangioma",
                           "lymphangioma",
                           "vestibular schwannoma",
                           "schwannoma",
                           "skin lipoma", # likely benign ... 
                           "lipoma",
                           "hamartoma",
                           "meningioma" # most are benign (80%)
                           )

other_benign_neoplasms = str_length_sort(other_benign_neoplasms)
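`str_length_sort()` is not defined on this page; it presumably comes from `code/get_term_descendants.R`. The sketch below assumes it orders terms from longest to shortest, which would also be consistent with the example terms printed by `get_descendants()` appearing longest first. `sort_by_length_desc()` is a hypothetical stand-in, and the real helper may differ. Ordering matters because a regex alternation takes the first alternative that matches at a given position.

```r
# Hypothetical stand-in for str_length_sort(): unique terms, longest first.
sort_by_length_desc <- function(x) {
  x <- unique(x)
  x[order(nchar(x), decreasing = TRUE)]
}

terms <- c("hemangioma", "hemangioma of subcutaneous tissue")
x <- "hemangioma of subcutaneous tissue"

# Short term first: the alternation stops at "hemangioma" and leaves the rest.
short_first <- gsub(paste0(terms, collapse = "|"),
                    "benign neoplasm", x, perl = TRUE)

# Longest term first: the whole term is replaced.
long_first <- gsub(paste0(sort_by_length_desc(terms), collapse = "|"),
                   "benign neoplasm", x, perl = TRUE)

print(c(short_first = short_first, long_first = long_first))
```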

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    paste0(other_benign_neoplasms, collapse = "(?=,|$)|\\b"), 
                    "benign neoplasm")
         )

4.0.1 Central nervous system cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000326/descendants"

cns_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 173
[1] "\n Some example terms"
[1] "central nervous system ewing sarcoma/peripheral primitive neuroectodermal tumor"
[2] "malignant central nervous system mesenchymal, non-meningothelial neoplasm"      
[3] "malignant peripheral nerve sheath tumor with mesenchymal differentiation"       
[4] "diffuse pediatric-type high-grade glioma, h3-wildtype and idh-wildtype"         
[5] "childhood central nervous system primitive neuroectodermal neoplasm"            
cns_cancer_terms = stringr::str_replace_all(cns_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
pattern = paste0(cns_cancer_terms, collapse = "(?=,|$)|\\b")

gwas_study_info = gwas_study_info |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = pattern,
                                   "central nervous system cancer"
                          )  
        )

4.0.2 Cervical cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4362/descendants"

cervical_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 37
[1] "\n Some example terms"
[1] "signet ring cell variant cervical mucinous adenocarcinoma"
[2] "villoglandular variant cervical mucinous adenocarcinoma"  
[3] "glassy cell variant cervical adenosquamous carcinoma"     
[4] "intestinal variant cervical mucinous adenocarcinoma"      
[5] "endocervical type cervical mucinous adenocarcinoma"       
cervical_cancer_terms = stringr::str_replace_all(cervical_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

cervical_cancer_terms = c("cervical intraepithelial neoplasia grade 2/3",
                          "uterine cervical cancer in situ",
                          cervical_cancer_terms)

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "uterine cervical cancer in situ",
                                   "cervical cancer"
                          )  
        )
gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(cervical_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "cervical cancer"
                          )  
        )

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "uterine cervical cancer in situ",
                                   "cervical cancer"
                          )  
        )

4.0.3 Colorectal cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0005575/descendants"

colorectal_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 106
[1] "\n Some example terms"
[1] "pold1-related polyposis and colorectal cancer syndrome"
[2] "pole-related polyposis and colorectal cancer syndrome" 
[3] "colorectal cancer, hereditary nonpolyposis, type 7"    
[4] "colorectal cancer, hereditary nonpolyposis, type 6"    
[5] "colon mucosa-associated lymphoid tissue lymphoma"      
colorectal_cancer_terms = stringr::str_replace_all(colorectal_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

colorectal_cancer_terms= c("metastatic colorectal cancer",
                             "rectum cancer",
                          colorectal_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(colorectal_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "colorectal mucinous adenocarcinoma",
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "metachronous colorectal adenoma",
                                   "colorectal cancer"
                          )  
        )

4.0.4 Endometrial cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0011962/descendants"

endometrial_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 23
[1] "\n Some example terms"
[1] "endometrial endometrioid adenocarcinoma with spindled epithelial cells"
[2] "ovarian endometrioid adenocarcinoma with squamous differentiation"     
[3] "villoglandular endometrial endometrioid adenocarcinoma"                
[4] "secretory uterine corpus endometrioid adenocarcinoma"                  
[5] "mucin-rich endometrial endometrioid adenocarcinoma"                    
# also: http://www.ebi.ac.uk/efo/EFO_1001514: endometrial endometrioid carcinoma
endometrial_cancer_terms = c("endometrial endometrioid carcinoma",
                             endometrial_cancer_terms)

endometrial_cancer_terms = stringr::str_replace_all(endometrial_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(endometrial_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "endometrial cancer"
                          )  
        )

4.0.5 Esophageal cancer

esophageal_cancer_terms <- c("esophageal adenocarcinoma",
                             "esophageal squamous cell cancer")


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(esophageal_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "esophageal cancer"
                          )  
        )

4.0.6 Eye cancer (to add)

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002236/descendants"

ocular_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 75
[1] "\n Some example terms"
[1] "lacrimal gland carcinoma ex pleomorphic adenoma"
[2] "intermediate cell type ciliary body melanoma"   
[3] "ocular melanoma with extraocular extension"     
[4] "medium/large size posterior uveal melanoma"     
[5] "metastatic malignant neoplasm in the eye"       
ocular_cancer_terms = stringr::str_replace_all(ocular_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(ocular_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "ocular cancer"
                          )  
        )

4.0.7 Gallbladder and biliary tract cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                         "cancer of gallbladder and extrahepatic biliary tract",
                         "gallbladder and biliary tract cancer"
         )
         )

4.0.8 Hodgkin lymphoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          "nodular sclerosis hodgkin lymphoma",
                          "hodgkins lymphoma"
         ))

4.0.9 Head and neck cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          "head and neck squamous cell cancer",
                          "head and neck cancer, squamous cell cancer"
         ))

4.0.10 Intestinal cancer (non-colorectal)

intestinal_cancer_terms <- c("small intestine cancer",
                           "small bowel cancer",
                           "small intestine cancer")

4.0.11 Kidney cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_263/descendants"


kidney_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 51
[1] "\n Some example terms"
[1] "eosinophilic variant of chromophobe renal cell carcinoma"
[2] "childhood renal cell carcinoma with mit translocations"  
[3] "kidney pelvis sarcomatoid transitional cell carcinoma"   
[4] "infiltrating renal pelvis transitional cell carcinoma"   
[5] "classic variant of chromophobe renal cell carcinoma"     
kidney_cancer_terms = c("renal cell carcinoma",
                       "clear cell renal carcinoma",
                       "clear cell renal cell carcinoma",
                       kidney_cancer_terms)

kidney_cancer_terms = stringr::str_replace_all(kidney_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

kidney_cancer_terms =  stringr::str_replace_all(kidney_cancer_terms,
                          "renal cancer",
                          "kidney cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(kidney_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "kidney cancer"
                          )  
        )
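The two normalisation steps applied to the kidney terms (`\\bcarcinoma` to `cancer`, then `renal cancer` to `kidney cancer`) can be traced on a few strings. The sketch below uses base R `gsub()` in place of `stringr::str_replace_all()`. The term "renal carcinoma" is made up for illustration, and "esophageal adenocarcinoma" is included only to show the effect of the word boundary.

```r
kidney_terms <- c("renal cell carcinoma",
                  "clear cell renal carcinoma",
                  "renal carcinoma")  # last term is made up for illustration

# Step 1: "carcinoma" as a whole word becomes "cancer".
step1 <- gsub("\\bcarcinoma", "cancer", kidney_terms, perl = TRUE)

# Step 2: the literal phrase "renal cancer" becomes "kidney cancer".
# "renal cell cancer" does not contain that phrase, so it is left unchanged.
step2 <- gsub("renal cancer", "kidney cancer", step1, fixed = TRUE)

# The \\b boundary means "carcinoma" inside "adenocarcinoma" is not touched.
adeno <- gsub("\\bcarcinoma", "cancer", "esophageal adenocarcinoma", perl = TRUE)

print(step2)
print(adeno)
```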

4.0.12 Laryngeal cancer

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "laryngeal squamous cell cancer|laryngeal cancer",
                                   "larynx cancer"
                          )  
        )

4.0.13 Leukemia

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000565/descendants"

leukemia_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 133
[1] "\n Some example terms"
[1] "b-cell acute lymphoblastic leukemia with t(1;19)(q23;p13.3); e2a-pbx1 (tcf3-pbx1)"
[2] "therapy related acute myeloid leukemia and myelodysplastic syndrome"              
[3] "acute myeloid leukemia, flt3 tyrosine kinase domain point mutation"               
[4] "acute myeloid leukemia, non-kmt2a mllt10 rearrangement positive"                  
[5] "blast phase chronic myelogenous leukemia, bcr-abl1 positive"                      
leukemia_terms <- c("b-cell acute lymphoblastic leukemia with t\\(1;19\\)\\(q23;p13.3\\); e2a-pbx1 \\(tcf3-pbx1\\)",
                    leukemia_terms)
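The hand-escaped term above is needed because ontology labels are used directly as regex patterns, and characters such as `(`, `)` and `.` have special meaning in a regex. As a sketch of a more general approach, the hypothetical helper `escape_regex()` below escapes the common metacharacters in any label. It is an illustration under that assumption, not part of the analysis code.

```r
# Hypothetical helper: backslash-escape regex metacharacters in a label so
# that it matches itself literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

label <- "t(1;19)(q23;p13.3)"

# Used unescaped, the parentheses are read as groups, so the pattern does
# not match the label's own text.
matches_raw <- grepl(label, label, perl = TRUE)

# After escaping, the pattern matches the literal label.
escaped <- escape_regex(label)
matches_escaped <- grepl(escaped, label, perl = TRUE)

print(escaped)
print(c(raw = matches_raw, escaped = matches_escaped))
```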

leukemia_terms = stringr::str_replace_all(leukemia_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(leukemia_terms, collapse = "(?=,|$)|\\b"),
                                   "leukemia"
                          )  
        )

4.0.14 Lip and oral cavity cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005570/descendants"

lip_oral_cavity_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 51
[1] "\n Some example terms"
[1] "major salivary gland carcinoma ex pleomorphic adenoma"
[2] "salivary gland carcinoma ex pleomorphic adenoma"      
[3] "mucoepidermoid carcinoma of submandibular gland"      
[4] "parotid gland carcinoma ex pleomorphic adenoma"       
[5] "major salivary gland mucoepidermoid carcinoma"        
lip_oral_cavity_cancer_terms = stringr::str_replace_all(lip_oral_cavity_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(lip_oral_cavity_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "lip and oral cavity cancer"
                          )  
        )

4.0.15 Liver cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002691/descendants"

liver_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 76
[1] "\n Some example terms"
[1] "squamous cell carcinoma of gallbladder and extrahepatic biliary tract"
[2] "undifferentiated carcinoma of liver and intrahepatic biliary tract"   
[3] "squamous cell carcinoma of liver and intrahepatic biliary tract"      
[4] "adenocarcinoma of gallbladder and extrahepatic biliary tract"         
[5] "combined hepatocellular carcinoma and cholangiocarcinoma"             
liver_cancer_terms = stringr::str_replace_all(liver_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")


liver_cancer_terms = c("hepatitis virus-related liver cancer",
                       liver_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(liver_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "liver cancer"
                          )  
        )

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "hepatitis virus-related liver cancer",
                                   "liver cancer"
                          )  
        )

4.0.16 Lung cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008903/descendants"

lung_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 63
[1] "\n Some example terms"
[1] "mixed mucinous and nonmucinous bronchioloalveolar adenocarcinoma"
[2] "well-differentiated fetal adenocarcinoma of the lung"            
[3] "lung combined large cell neuroendocrine carcinoma"               
[4] "primary pulmonary diffuse large b-cell lymphoma"                 
[5] "large cell lung carcinoma, clear cell variant"                   
lung_cancer_terms = stringr::str_replace_all(lung_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(lung_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "lung cancer"
                          )  
        )

4.0.17 Ovarian cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008170/descendants"

ovarian_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 77
[1] "\n Some example terms"
[1] "theca steroid-producing cell malignant tumor of ovary, not further specified"
[2] "ovarian endometrioid adenocarcinoma with squamous differentiation"           
[3] "malignant non-dysgerminomatous germ cell tumor of ovary"                     
[4] "ovarian yolk sac tumor, polyvesicular vitelline pattern"                     
[5] "malignant dysgerminomatous germ cell tumor of ovary"                         
ovarian_cancer_terms = stringr::str_replace_all(ovarian_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

ovarian_cancer_terms = c("high grade serous ovarian cancer",
                         "high grade ovarian cancer",
                         "high grade ovarian cancers",
                         "ovarian endometrioid cancer", # http://www.ebi.ac.uk/efo/EFO_1001515 - ovarian endometrioid carcinoma
                         
                         "ovarian serous cancer", # http://www.ebi.ac.uk/efo/EFO_1001516 - ovarian serous carcinoma
                       ovarian_cancer_terms
                       )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(ovarian_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "ovarian cancer"
                          )  
        )
        
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = "high grade ovarian cancer",
                                   "ovarian cancer"
                          )  
        )

4.0.18 Pancreatic cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0009831/descendants"

pancreatic_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 37
[1] "\n Some example terms"
[1] "pancreatic intraductal papillary-mucinous neoplasm with an associated invasive carcinoma"
[2] "pancreatic intraductal papillary-mucinous neoplasm with high grade dysplasia"            
[3] "pancreatic intraductal papillary-mucinous neoplasm with low grade dysplasia"             
[4] "pancreatic mucinous-cystic neoplasm with an associated invasive carcinoma"               
[5] "undifferentiated pancreatic carcinoma with osteoclast-like giant cells"                  
pancreatic_cancer_terms = stringr::str_replace_all(pancreatic_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(pancreatic_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "pancreatic cancer"
                          )  
        )
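
The `"\\bcarcinoma"` replacement applied to each term list only rewrites `carcinoma` where it starts a word, so compound words such as `adenocarcinoma` are left alone. A minimal sketch with illustrative labels (not taken from `pancreatic_cancer_terms`), using base R `gsub` in place of `stringr::str_replace_all`:

```r
# Illustrative labels only
labels <- c("pancreatic carcinoma",
            "pancreatic adenocarcinoma",
            "carcinoma of pancreas")

# \\b requires a word boundary before "carcinoma",
# so the "carcinoma" inside "adenocarcinoma" does not match
relabelled <- gsub("\\bcarcinoma", "cancer", labels, perl = TRUE)
print(relabelled)
```

This keeps the ontology labels in line with the "cancer" wording used in `l1_all_disease_terms`, while adenocarcinoma subtypes keep their original spelling.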

4.1 Peripheral nervous system cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0021089/descendants"

peripheral_nervous_system_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 42
[1] "\n Some example terms"
[1] "malignant melanocytic peripheral nerve sheath tumor of mediastinum"
[2] "melanotic psammomatous malignant peripheral nerve sheath tumor"    
[3] "malignant melanocytic neoplasm of the peripheral nerve sheath"     
[4] "peripheral primitive neuroectodermal tumor of soft tissues"        
[5] "malignant glandular tumor of peripheral nerve sheath"              
peripheral_nervous_system_cancer_terms = stringr::str_replace_all(peripheral_nervous_system_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(peripheral_nervous_system_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "peripheral nervous system cancer"
                          )  
        )

4.1.1 Prostate cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10283/descendants"

prostate_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 21
[1] "\n Some example terms"
[1] "lymphoepithelioma-like acinar prostate adenocarcinoma"
[2] "prostate signet ring cell adenocarcinoma"             
[3] "castration-resistant prostate carcinoma"              
[4] "prostate transitional cell carcinoma"                 
[5] "prostate embryonal rhabdomyosarcoma"                  
prostate_cancer_terms = stringr::str_replace_all(prostate_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

prostate_cancer_terms = c("grade iii prostatic intraepithelial neoplasia",
                          "metastatic prostate cancer",
                          prostate_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = paste0(prostate_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "prostate cancer"
                          )  
        )
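
The OLS4 URLs used throughout contain the term IRI URL-encoded twice (`:` becomes `%3A`, then `%253A`). The helper below is a hypothetical convenience, not part of this analysis, that sketches how such a URL could be assembled from an ontology ID and an IRI with base R's `utils::URLencode`. The base URL is copied from the calls above.

```r
# Hypothetical helper: build an OLS4 "descendants" URL from an ontology id and a term IRI
ols_descendants_url <- function(ontology, iri) {
  # First pass encodes reserved characters such as ":" and "/"
  once <- utils::URLencode(iri, reserved = TRUE)
  # Second pass must set repeated = TRUE, otherwise URLencode returns
  # already-encoded input unchanged
  twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)
  paste0("http://www.ebi.ac.uk/ols4/api/ontologies/", ontology,
         "/terms/", twice, "/descendants")
}

prostate_url <- ols_descendants_url("doid", "http://purl.obolibrary.org/obo/DOID_10283")
print(prostate_url)
```

The result can be compared against the hard-coded prostate cancer URL above as a quick sanity check.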

4.1.2 Mesothelioma

mesothelioma_terms = c("pleural mesothelioma",
                       "malignant pleural mesothelioma")

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(mesothelioma_terms, collapse = "(?=,|$)|\\b"),
                                   "mesothelioma"
                          )  
        )

4.1.3 Neuroendocrine tumor

neuroendo_terms <- c("pulmonary neuroendocrine tumor",
                     "small intestine neuroendocrine tumor",
                     "pancreatic neuroendocrine tumor",
                     "carcinoid tumor" #http://www.ebi.ac.uk/efo/EFO_0004243
)

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(neuroendo_terms, collapse = "(?=,|$)|\\b"),
                                   "neuroendocrine tumor"
                          )  
        )

4.1.4 Non-Hodgkins lymphoma

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005952/descendants"

nhl_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 93
[1] "\n Some example terms"
[1] "b-cell acute lymphoblastic leukemia with t(1;19)(q23;p13.3); e2a-pbx1 (tcf3-pbx1)"
[2] "epstein-barr virus-positive diffuse large b-cell lymphoma of the elderly"         
[3] "diffuse large b-cell lymphoma of the central nervous system"                      
[4] "small intestinal mucosa-associated lymphoid tissue lymphoma"                      
[5] "primary cutaneous diffuse large b-cell lymphoma, leg type"                        
nhl_terms = stringr::str_replace_all(nhl_terms,
                            "\\bcarcinoma",
                            "cancer")



nhl_terms = c("central nervous system non-hodgkin lymphoma",
              "lymphoblastic lymphoma",
              "extranodal nasal nk/t cell lymphoma", # https://www.ebi.ac.uk/ols4/ontologies/ordo/classes/http%253A%252F%252Fwww.orpha.net%252FORDO%252FOrphanet_86879
              "follicular lymphoma", # http://purl.obolibrary.org/obo/DOID_0050873
              "marginal zone b-cell lymphoma",
              "diffuse large b-cell lymphoma",
              nhl_terms)

# reticulum cell sarcoma is also likely to be NHL
# see: http://www.ebi.ac.uk/efo/EFO_0005287
# https://pubmed.ncbi.nlm.nih.gov/6328875/
nhl_terms = c("reticulum cell sarcoma",
              nhl_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(nhl_terms, collapse = "(?=,|$)|\\b"),
                                   "non-hodgkins lymphoma"
                          )  
        )
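
Some ontology labels contain regex metacharacters. In the first example term printed above, the parentheses are read as groups and `.` as a wildcard, so the label used as a pattern does not match itself. One possible safeguard, sketched here with base R and not applied in this analysis, is to wrap each term in `\\Q...\\E` so that it is matched literally.

```r
# Label copied from the example output of get_descendants() above
label <- "b-cell acute lymphoblastic leukemia with t(1;19)(q23;p13.3); e2a-pbx1 (tcf3-pbx1)"

# Used directly as a regex, the parentheses become groups and the literal "(" is never matched
matches_raw <- grepl(label, label, perl = TRUE)

# Quoting the whole term makes every character literal
quoted <- paste0("\\Q", label, "\\E")
matches_quoted <- grepl(quoted, label, perl = TRUE)

print(c(raw = matches_raw, quoted = matches_quoted))
```

If the quoted terms were combined into one pattern, the boundary and lookahead would still need to be added outside the `\\Q...\\E` span.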

4.1.5 Non-melanoma skin cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0009260/descendants"

non_melanoma_skin_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 11
[1] "\n Some example terms"
[1] "salivary gland basal cell adenocarcinoma"
[2] "skin adenoid basal cell carcinoma"       
[3] "external ear basal cell carcinoma"       
[4] "skin basosquamous cell carcinoma"        
[5] "cervical adenoid basal carcinoma"        
non_melanoma_skin_cancer_terms = stringr::str_replace_all(non_melanoma_skin_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(non_melanoma_skin_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "non-melanoma skin cancer"
                          )  
        )

4.1.6 Other pharynx cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_0060119/descendants"

other_pharynx_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 16
[1] "\n Some example terms"
[1] "pharynx squamous cell carcinoma" "adenoid squamous cell carcinoma"
[3] "tonsil squamous cell carcinoma"  "adenoid basal cell carcinoma"   
[5] "aryepiglottic fold cancer"      
other_pharynx_cancer_terms = stringr::str_replace_all(other_pharynx_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

other_pharynx_cancer_terms = c("hypopharyngeal cancer",
                               other_pharynx_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(other_pharynx_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "other pharynx cancer"
                          )  
        )

4.1.7 Soft tissue sarcoma

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_1001968/descendants"

soft_tissue_sarcoma_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 191
[1] "\n Some example terms"
[1] "malignant peripheral nerve sheath tumor with mesenchymal differentiation"
[2] "rhabdomyosarcoma with mixed embryonal and alveolar features"             
[3] "undifferentiated pleomorphic sarcoma, inflammatory variant"              
[4] "low grade fibromyxoid sarcoma with giant collagen rosettes"              
[5] "epithelioid malignant peripheral nerve sheath tumor"                     
soft_tissue_sarcoma_terms = stringr::str_replace_all(soft_tissue_sarcoma_terms,
                            "\\bcarcinoma",
                            "cancer")

soft_tissue_sarcoma_terms = c("kaposis sarcoma",
                              "iatrogenic kaposis sarcoma",
                              soft_tissue_sarcoma_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(soft_tissue_sarcoma_terms, collapse = "(?=,|$)|\\b"),
                                   "soft tissue sarcoma"
                          )  
        )

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "sarcoma, soft tissue sarcoma",
                                   "soft tissue sarcoma"
                          )
 )
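
The literal replacement of `"sarcoma, soft tissue sarcoma"` only applies when the two terms are adjacent and in that order. An order-independent alternative is sketched below. `drop_generic_term` is a hypothetical helper, not part of this analysis, and it assumes terms are separated by a comma and optional whitespace.

```r
# Hypothetical helper: drop a generic term when a more specific one
# is present in the same comma-separated field
drop_generic_term <- function(x, generic, specific) {
  vapply(strsplit(x, ",\\s*"), function(parts) {
    if (anyNA(parts)) return(NA_character_)   # keep missing values as NA
    if (generic %in% parts && specific %in% parts) {
      parts <- setdiff(parts, generic)
    }
    paste(unique(parts), collapse = ", ")
  }, character(1))
}

x <- c("sarcoma, soft tissue sarcoma",
       "sarcoma",
       "asthma, soft tissue sarcoma, sarcoma")
cleaned <- drop_generic_term(x, generic = "sarcoma", specific = "soft tissue sarcoma")
print(cleaned)
```

A lone `"sarcoma"` is kept, since there is no more specific term to fall back on.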

4.1.7.1 Ewing sarcoma

# Ewing sarcoma can be either a bone or a soft tissue sarcoma
# it is hard to tell which from these studies:
gwas_study_info |> 
  filter(grepl("ewing", l1_all_disease_terms))  |> 
  select(PUBMED_ID, `DISEASE/TRAIT`, COHORT, STUDY)
   PUBMED_ID DISEASE/TRAIT COHORT
       <int>        <char> <char>
1:  22327514 Ewing sarcoma       
2:  32881892 Ewing sarcoma       
3:  30093639 Ewing sarcoma       
                                                                                                         STUDY
                                                                                                        <char>
1:                   Common variants near TARDBP and EGR2 are associated with susceptibility to Ewing sarcoma.
2: Low-frequency variation near common germline susceptibility loci are associated with risk of Ewing sarcoma.
3:    Genome-wide association study identifies multiple new loci associated with Ewing sarcoma susceptibility.

4.1.8 Stomach cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10534/descendants"

stomach_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 32
[1] "\n Some example terms"
[1] "gastric signet ring cell adenocarcinoma"
[2] "malignant gastric granular cell tumor"  
[3] "malignant gastric germ cell tumor"      
[4] "hereditary diffuse gastric cancer"      
[5] "gastric papillary adenocarcinoma"       
stomach_cancer_terms = stringr::str_replace_all(stomach_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

stomach_cancer_terms = c(
                          "diffuse stomach cancer",
                          "gastric cancer",
                          "gastric intestinal type adenocarcinoma",
                          "gastric cardia cancer",
                          "cardia cancer",
                          stomach_cancer_terms
                          )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(stomach_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "stomach cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "diffuse stomach cancer",
                                   "stomach cancer"
                          )  
        )
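
One plausible reason for the second pass on `"diffuse stomach cancer"` (an assumption; the source does not state it) is that the first pass can itself create that string. If `"diffuse gastric cancer"` is not matched as a whole, only its `"gastric cancer"` part is replaced. A minimal sketch with a two-term list and base R `gsub` standing in for `stringr::str_replace_all`:

```r
# Reduced, illustrative term list
terms <- c("diffuse stomach cancer", "gastric cancer")
pattern <- paste0(terms, collapse = "(?=,|$)|\\b")

x <- "diffuse gastric cancer"

# Pass 1: only the "gastric cancer" part matches, leaving the "diffuse" prefix
pass1 <- gsub(pattern, "stomach cancer", x, perl = TRUE)

# Pass 2: the literal clean-up used in the analysis removes the prefix
pass2 <- gsub("diffuse stomach cancer", "stomach cancer", pass1, fixed = TRUE)

print(c(pass1 = pass1, pass2 = pass2))
```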

4.1.9 Squamous cell carcinoma

gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = "cutaneous squamous cell cancer",
                                   "squamous cell cancer"
                          )  
        )

4.1.10 Testicular cancer

4.1.11 Thyroid cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_1781/descendants"

thyroid_cancer_terms <- get_descendants(url)
[1] "Number of terms collected:"
[1] 24
[1] "\n Some example terms"
[1] "thyroid gland mixed medullary and follicular cell-derived carcinoma"
[2] "thyroid gland spindle epithelial tumor with thymus-like elements"   
[3] "spindle epithelial tumor with thymus-like differentiation tumor"    
[4] "diffuse sclerosing papillary thyroid carcinoma"                     
[5] "differentiated high-grade thyroid carcinoma"                        
thyroid_cancer_terms = stringr::str_replace_all(thyroid_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

thyroid_cancer_terms = c("differentiated thyroid cancer",
                         thyroid_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(thyroid_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "thyroid cancer"
                          )  
        )

4.1.12 Uterine cancer

uterine_cancer_terms <- c("uterine corpus cancer",
                          "uterine adnexa cancer")

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = paste0(uterine_cancer_terms, collapse = "(?=,|$)|\\b"),
                                   "uterine cancer"
                          )  
        )

5 Reducing non-specific cancer terms

5.0.1 Brain neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "brain neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique() 
 [1] "ICD10 C71: Malignant neoplasm of brain"                                                               
 [2] "Cancer code, self-reported: brain cancer / primary malignant brain tumour (UKB data field 20001_1032)"
 [3] "ICD10 C71.1: Malignant neoplasm of brain, frontal lobe"                                               
 [4] "ICD10 C71.2: Malignant neoplasm of brain, temporal lobe"                                              
 [5] "ICD10 C71.9: Malignant neoplasm of brain, unspecified"                                                
 [6] "Cancer of brain (PheCode 191.11)"                                                                     
 [7] "Overall survival in brain cancer"                                                                     
 [8] "Brain cancer specific survival"                                                                       
 [9] "Malignant neoplasm of brain (UKB data field 40006) (Gene-based burden)"                               
[10] "Malignant neoplasm of brain (UKB data field 40006)"                                                   
[11] "ICD10 C71: Malignant neoplasm of brain (Gene-based burden)"                                           
[12] "ICD10 C71.9: Malignant neoplasm of brain, unspecified (Gene-based burden)"                            
[13] "Brain tumor"                                                                                          
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "brain neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "central nervous system cancer",
                 l1_all_disease_terms
         )
         )

# this still leaves one study (DISEASE/TRAIT "Brain tumor")

gwas_study_info |> 
  filter(l1_all_disease_terms == "brain neoplasm") |> 
  pull(PUBMED_ID) |>
  unique() 
[1] 34594039
# from the paper's supplementary tables, the ICD-10 code for the brain tumor term is C71 (malignant neoplasm of brain)
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "brain neoplasm" & 
                 PUBMED_ID == 34594039,
                 "central nervous system cancer",
                 l1_all_disease_terms
         )
         )
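
The recoding steps in this section all follow the same shape: a row is relabelled only if its current term equals the non-specific label and its `DISEASE/TRAIT` text indicates malignancy. The toy example below sketches that logic in base R on a three-row data frame. The data frame is made up, with traits borrowed from the output above.

```r
# Toy data: two "brain neoplasm" rows and one unrelated row
toy <- data.frame(
  l1_all_disease_terms = c("brain neoplasm", "brain neoplasm", "asthma"),
  `DISEASE/TRAIT` = c("ICD10 C71: Malignant neoplasm of brain", "Brain tumor", "Asthma"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Both conditions must hold for a row to be recoded
is_malignant <- toy$l1_all_disease_terms == "brain neoplasm" &
  grepl("malignant|cancer|carcinoma", toy[["DISEASE/TRAIT"]], ignore.case = TRUE)

toy$l1_all_disease_terms <- ifelse(is_malignant,
                                   "central nervous system cancer",
                                   toy$l1_all_disease_terms)
print(toy$l1_all_disease_terms)
```

The "Brain tumor" row is left as `"brain neoplasm"`, which is why it needs the separate, PUBMED_ID-based step.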

5.0.2 Breast neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "breast neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
 [1] "Neoplasm of uncertain behavior of breast (PheCode 174.3)"                                       
 [2] "Lump or mass in breast (PheCode 611.3)"                                                         
 [3] "ICD10 C50.1: Malignant neoplasm of central portion of breast"                                   
 [4] "ICD10 C50.2: Malignant neoplasm of upper-inner quadrant of breast"                              
 [5] "ICD10 C50.3: Malignant neoplasm of lower-inner quadrant of breast"                              
 [6] "ICD10 C50.4: Malignant neoplasm of upper-outer quadrant of breast"                              
 [7] "ICD10 C50.5: Malignant neoplasm of lower-outer quadrant of breast"                              
 [8] "ICD10 C50.8: Malignant neoplasm of overlapping sites of breast"                                 
 [9] "ICD10 C50.9: Malignant neoplasm of breast of unspecified site"                                  
[10] "Malignant neoplasm of female breast (PheCode 174.11)"                                           
[11] "Malignant neoplasm of overlapping sites of breast (UKB data field 40006)"                       
[12] "Malignant neoplasm of upper-inner quadrant of breast (UKB data field 40006)"                    
[13] "Malignant neoplasm of upper-outer quadrant of breast (UKB data field 40006)"                    
[14] "Malignant neoplasm of upper-inner quadrant of breast (UKB data field 40006) (Gene-based burden)"
[15] "Malignant neoplasm of upper-outer quadrant of breast (UKB data field 40006) (Gene-based burden)"
[16] "Malignant neoplasm of lower-inner quadrant of breast (UKB data field 40006) (Gene-based burden)"
[17] "Malignant neoplasm of lower-outer quadrant of breast (UKB data field 40006) (Gene-based burden)"
[18] "Malignant neoplasm of overlapping sites of breast (UKB data field 40006) (Gene-based burden)"   
[19] "Malignant neoplasm of lower-inner quadrant of breast (UKB data field 40006)"                    
[20] "Malignant neoplasm of lower-outer quadrant of breast (UKB data field 40006)"                    
[21] "ICD10 C50.2: Malignant neoplasm of upper-inner quadrant of breast (Gene-based burden)"          
[22] "ICD10 C50.4: Malignant neoplasm of upper-outer quadrant of breast (Gene-based burden)"          
[23] "ICD10 C50.8: Malignant neoplasm of overlapping sites of breast (Gene-based burden)"             
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "breast neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "breast cancer",
                 l1_all_disease_terms
         )
         )

5.0.3 Bone neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "bone neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Benign neoplasm of bone and articular cartilage (PheCode 213)"              
[2] "Cancer code, self-reported: primary bone cancer (UKB data field 20001_1063)"
[3] "Bone cancer (PheCode 170.1)"                                                
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "bone neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "bone cancer",
                 l1_all_disease_terms
         )
         )


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "bone neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

5.0.4 Cecal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "cecal neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Malignant neoplasm of cecum (UKB data field 40006) (Gene-based burden)"
[2] "Malignant neoplasm of cecum (UKB data field 40006)"                    
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "cecal neoplasm" & 
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "colorectal cancer",
                 l1_all_disease_terms
         )
         )

5.0.5 Colonic neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "colonic neoplasm") |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()
                                                 MAPPED_TRAIT
                                                       <char>
 1:                                          colonic neoplasm
 2: disease free survival, colonic neoplasm, overall survival
 3:                                          colonic neoplasm
 4:                                          colonic neoplasm
 5:                                          colonic neoplasm
 6:                                          colonic neoplasm
 7:                                          colonic neoplasm
 8:                                          colonic neoplasm
 9:                                          colonic neoplasm
10:                                          colonic neoplasm
11:                                          colonic neoplasm
                                                                DISEASE/TRAIT
                                                                       <char>
 1:                                                              Colon cancer
 2:                                                  Survival in colon cancer
 3:                        ICD10 C18.2: Malignant neoplasm of ascending colon
 4:                     ICD10 C18.9: Malignant neoplasm of colon, unspecified
 5:                                                        Right colon cancer
 6:                                                         Left colon cancer
 7:                                   Left colon cancer vs right colon cancer
 8:                     ICD10 C18.9: Malignant neoplasm of colon, unspecified
 9:                        ICD10 C18.2: Malignant neoplasm of ascending colon
10: ICD10 C18.9: Malignant neoplasm of colon, unspecified (Gene-based burden)
11:    ICD10 C18.2: Malignant neoplasm of ascending colon (Gene-based burden)
    STUDY_ACCESSION
             <char>
 1:      GCST004167
 2:      GCST002822
 3:    GCST90043849
 4:    GCST90043855
 5:    GCST90162553
 6:    GCST90162554
 7:    GCST90179121
 8:    GCST90079580
 9:    GCST90079578
10:    GCST90083566
11:    GCST90083564
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("malignant|cancer", `DISEASE/TRAIT`, ignore.case = TRUE) &
                l1_all_disease_terms == "colonic neoplasm",
                stringr::str_replace_all(l1_all_disease_terms,
                          "colonic neoplasm",
                          "colorectal cancer"),
                l1_all_disease_terms
         )
  )

# there is also one study comparing rectal cancer vs colon cancer, where "colonic neoplasm" is only one of the listed terms

gwas_study_info |> 
  filter(grepl("colonic neoplasm", l1_all_disease_terms)) |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()
                      MAPPED_TRAIT                 DISEASE/TRAIT
                            <char>                        <char>
1: rectum cancer, colonic neoplasm Rectal cancer vs colon cancer
   STUDY_ACCESSION
            <char>
1:    GCST90179122
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90179122",
                stringr::str_replace_all(l1_all_disease_terms,
                          "colonic neoplasm",
                          "colorectal cancer"),
                l1_all_disease_terms
         )
  )
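
The rectal vs colon cancer study needs its own step because `l1_all_disease_terms` can hold several comma-separated terms, and an exact `==` comparison only catches rows where the non-specific label is the whole string. A small base R sketch with illustrative values:

```r
# Illustrative values of l1_all_disease_terms
terms <- c("colonic neoplasm",
           "rectum cancer, colonic neoplasm",
           "asthma")

# Exact comparison misses the multi-term row
exact_hit <- terms == "colonic neoplasm"

# Substring search finds it
any_hit <- grepl("colonic neoplasm", terms, fixed = TRUE)

# Replacing the substring keeps the other terms in the row intact
recoded <- gsub("colonic neoplasm", "colorectal cancer", terms, fixed = TRUE)

print(data.frame(terms, exact_hit, any_hit, recoded))
```

Filtering with `grepl()` after each `==`-based recode, as done above, is a quick way to surface such leftover rows.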

5.0.6 Endometrial neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "endometrial neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Endometrial cancer"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "endometrial neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "endometrial cancer",
                 l1_all_disease_terms
         )
         )

5.0.7 Esophageal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neoplasm of esophagus") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C15: Malignant neoplasm of esophagus"                                     
[2] "Cancer of esophagus (PheCode 150)"                                              
[3] "ICD10 C15.5: Malignant neoplasm of lower third of esophagus"                    
[4] "ICD10 C15.9: Malignant neoplasm of esophagus, unspecified"                      
[5] "Malignant neoplasm of esophagus (UKB data field 40006) (Gene-based burden)"     
[6] "Malignant neoplasm of esophagus (UKB data field 40006)"                         
[7] "ICD10 C15.5: Malignant neoplasm of lower third of esophagus (Gene-based burden)"
[8] "ICD10 C15.9: Malignant neoplasm of esophagus, unspecified (Gene-based burden)"  
[9] "ICD10 C15: Malignant neoplasm of esophagus (Gene-based burden)"                 
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neoplasm of esophagus" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "esophageal cancer",
                 l1_all_disease_terms
         )
         )

5.0.8 Eye neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "eye neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Cancer code, self-reported: eye and/or adnexal cancer (UKB data field 20001_1030)"
[2] "Cancer of eye (PheCode 190)"                                                      
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "eye neoplasm" & 
                 grepl("cancer", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "ocular cancer",
                 l1_all_disease_terms
         )
         )

5.0.9 Gallbladder neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "gallbladder neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Gallbladder cancer"            "Gallbladder cancer"           
[3] "Gallbladder cancer"            "ICD10 C23: Gallbladder cancer"
[5] "Gallbladder adenomyomatosis"  
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "gallbladder neoplasm" & 
                 grepl("cancer", `DISEASE/TRAIT`, ignore.case = TRUE),
                 "gallbladder and biliary tract cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "gallbladder neoplasm" & 
                `DISEASE/TRAIT` == "Gallbladder adenomyomatosis",
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

# there is also one study of gallbladder carcinoma in primary sclerosing cholangitis, where "gallbladder neoplasm" is only one of the listed terms
gwas_study_info |>
  filter(grepl("gallbladder neoplasm", l1_all_disease_terms)) |>
  select(STUDY_ACCESSION, `DISEASE/TRAIT`)
   STUDY_ACCESSION                                           DISEASE/TRAIT
            <char>                                                  <char>
1:      GCST005857 Gallbladder carcinoma in primary sclerosing cholangitis
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST005857",
                stringr::str_replace_all(l1_all_disease_terms,
                          "gallbladder neoplasm",
                          "gallbladder and biliary tract cancer"),
                l1_all_disease_terms
         ))

5.0.10 Glioma (can be benign or malignant)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "central nervous system cancer, glioma",
                          "central nervous system cancer"
         ))


gwas_study_info |> 
  filter(l1_all_disease_terms == "glioma") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`) |> 
  distinct()
                         MAPPED_TRAIT
                               <char>
 1:                            glioma
 2:                            glioma
 3:                            glioma
 4:                            glioma
 5:                            glioma
 6:          overall survival, glioma
 7: progression free survival, glioma
 8:                            glioma
 9:                            glioma
10:                            glioma
11:                            glioma
12:                            glioma
13:                            glioma
14:                            glioma
15:                            glioma
16:                            glioma
17:                            glioma
                                                              MAPPED_TRAIT_URI
                                                                        <char>
 1:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 2:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 3:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 4:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 5:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 6: http://www.ebi.ac.uk/efo/EFO_0000638, http://www.ebi.ac.uk/efo/EFO_0005543
 7: http://www.ebi.ac.uk/efo/EFO_0004920, http://www.ebi.ac.uk/efo/EFO_0005543
 8:                                       http://www.ebi.ac.uk/efo/EFO_0005543
 9:                                       http://www.ebi.ac.uk/efo/EFO_0005543
10:                                       http://www.ebi.ac.uk/efo/EFO_0005543
11:                                       http://www.ebi.ac.uk/efo/EFO_0005543
12:                                       http://www.ebi.ac.uk/efo/EFO_0005543
13:                                       http://www.ebi.ac.uk/efo/EFO_0005543
14:                                       http://www.ebi.ac.uk/efo/EFO_0005543
15:                                       http://www.ebi.ac.uk/efo/EFO_0005543
16:                                       http://www.ebi.ac.uk/efo/EFO_0005543
17:                                       http://www.ebi.ac.uk/efo/EFO_0005543
                                                                         DISEASE/TRAIT
                                                                                <char>
 1:                                                                             Glioma
 2:                                                                       Glioblastoma
 3:                                                            Non-glioblastoma glioma
 4:                                                                Glioma (high-grade)
 5:                                                                 Glioma (low-grade)
 6:                                                         Overall survival in glioma
 7:                                                Progression-free survival in glioma
 8:                                                Adult diffuse glioma (IDH mutation)
 9:                                                Adult diffuse glioma (IDH wildtype)
10:                             Adult diffuse glioma (IDH mutation, 1p/19q codeletion)
11:                          Adult diffuse glioma (IDH mutation, 1p/19q non-codeleted)
12:    Adult diffuse triple positive glioma (IDH and TERT mutations, 1p19q codeletion)
13:                                      Adult diffuse glioma (IDH and TERT mutations)
14:                                           Adult diffuse glioma (IDH mutation only)
15:                                          Adult diffuse glioma (TERT mutation only)
16: Adult diffuse triple negative glioma (IDH and TERT wildtype, 1p/19q non-codeleted)
17:                                                     Glioma (pediatric/youth onset)
# assume that studies measuring survival in glioma concern cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("survival", MAPPED_TRAIT, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

# assume that studies reporting a glioma grade concern cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("grade", `DISEASE/TRAIT`, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

# Adult diffuse glioma - assume malignant
# https://pmc.ncbi.nlm.nih.gov/articles/PMC9245936/
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("adult diffuse", `DISEASE/TRAIT`, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

gwas_study_info |>
  filter(`DISEASE/TRAIT` == "Glioma (pediatric/youth onset)") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()
   MAPPED_TRAIT                     MAPPED_TRAIT_URI
         <char>                               <char>
1:       glioma http://www.ebi.ac.uk/efo/EFO_0005543
                    DISEASE/TRAIT STUDY_ACCESSION
                           <char>          <char>
1: Glioma (pediatric/youth onset)      GCST008912
# from paper; seems malignant
# https://pubmed.ncbi.nlm.nih.gov/31040135/

gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(STUDY_ACCESSION == "GCST008912",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms
         ))


gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(`DISEASE/TRAIT` == "Glioblastoma" & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms
         ))


# for pubmed id: 22886559
# majority (~90%) are graded glioma, glioblastoma and oligodendroglioma
# so assume malignant
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 22886559 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )


# for pubmed id: 29743610
# majority (~60%) are glioblastoma
# so assume malignant

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 29743610 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# for pubmed id: 36810956
# seems to primarily include high grade glioma
# so assume malignant
gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 36810956 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# pubmed id: 30714141 
# considers glioma cancer
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 30714141 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# pubmed id: 34319593
# considers glioma a malignant tumor
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 34319593 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )
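
The per-publication blocks above apply the same replacement and differ only in the PubMed ID. A minimal sketch of one way to collapse them into a single step with `%in%`, shown on a toy data frame. The names `toy` and `malignant_glioma_pmids` are illustrative and not objects from this analysis, and the ID 11111111 is a made-up control row:

```r
library(dplyr)

# PubMed IDs judged above to describe malignant glioma
malignant_glioma_pmids <- c(22886559, 29743610, 36810956, 30714141, 34319593)

# Toy stand-in for gwas_study_info (illustrative values only;
# 11111111 is a hypothetical ID that should stay unchanged)
toy <- data.frame(
  PUBMED_ID = c(22886559, 34319593, 11111111),
  l1_all_disease_terms = c("glioma", "glioma", "glioma"),
  stringsAsFactors = FALSE
)

# One conditional replacement covers every listed publication
toy <- toy |>
  mutate(l1_all_disease_terms =
         ifelse(PUBMED_ID %in% malignant_glioma_pmids &
                  l1_all_disease_terms == "glioma",
                "central nervous system cancer",
                l1_all_disease_terms))

toy$l1_all_disease_terms
```

Keeping the IDs in one vector also leaves a single place to record the evidence for each publication as a comment.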

5.0.11 Glottis neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "glottis neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C32.0: Malignant neoplasm of glottis"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "glottis neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "larynx cancer",
                 l1_all_disease_terms
         )
         )
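
The same conditional relabelling recurs for most site-specific neoplasm terms in this section. A minimal sketch of a helper that captures the pattern, assuming the column names used in this analysis. `recode_term_if()` is a hypothetical function, not part of the original code, and the "Glottis polyp" row is an invented control value:

```r
library(dplyr)

# Hypothetical helper: relabel rows whose term equals `from` and whose
# reported trait matches `pattern` (case-insensitive); leave others unchanged.
recode_term_if <- function(data, from, to, pattern) {
  data |>
    mutate(l1_all_disease_terms =
           ifelse(l1_all_disease_terms == from &
                    grepl(pattern, `DISEASE/TRAIT`, ignore.case = TRUE),
                  to,
                  l1_all_disease_terms))
}

# Toy stand-in for gwas_study_info (second row is illustrative only)
toy <- data.frame(
  l1_all_disease_terms = c("glottis neoplasm", "glottis neoplasm"),
  `DISEASE/TRAIT` = c("ICD10 C32.0: Malignant neoplasm of glottis",
                      "Glottis polyp"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

result <- recode_term_if(toy, "glottis neoplasm", "larynx cancer",
                         "malignant|cancer|carcinoma")
result$l1_all_disease_terms
```

Only the row whose trait mentions a malignancy is relabelled; the other keeps its original term for manual review.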

5.0.12 Kidney neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "kidney neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Cancer code, self-reported: kidney/renal cell cancer (UKB data field 20001_1034)"            
[2] "ICD10 C64: Malignant neoplasm of kidney, except renal pelvis"                                
[3] "Malignant neoplasm of kidney, except pelvis (PheCode 189.11)"                                
[4] "Kidney / renal cell cancer (UKB data field 20001) (Gene-based burden)"                       
[5] "Kidney / renal cell cancer (UKB data field 20001)"                                           
[6] "Malignant neoplasm of kidney, except renal pelvis (UKB data field 40006) (Gene-based burden)"
[7] "Malignant neoplasm of kidney, except renal pelvis (UKB data field 40006)"                    
[8] "ICD10 C64: Malignant neoplasm of kidney, except renal pelvis"                                
[9] "ICD10 C64: Malignant neoplasm of kidney, except renal pelvis (Gene-based burden)"            
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "kidney neoplasm" & 
                 grepl("malignant|cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "kidney cancer",
                 l1_all_disease_terms
         )
         )

5.0.13 Laryngeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "laryngeal neoplasm") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()
         MAPPED_TRAIT                     MAPPED_TRAIT_URI
               <char>                               <char>
1: laryngeal neoplasm http://www.ebi.ac.uk/efo/EFO_0003817
                                                                  DISEASE/TRAIT
                                                                         <char>
1: Cancer code, self-reported: larynx/throat cancer (UKB data field 20001_1006)
   STUDY_ACCESSION
            <char>
1:    GCST90041889
# one study: self-reported larynx/throat cancer, so recode as larynx cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90041889",
                stringr::str_replace_all(l1_all_disease_terms,
                          "laryngeal neoplasm",
                          "larynx cancer"),
                l1_all_disease_terms
         ))

5.0.14 Liver neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "liver neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C22: Malignant neoplasm of liver and intrahepatic bile ducts"                    
[2] "Cancer of liver and intrahepatic bile duct (PheCode 155)"                              
[3] "ICD10 C22: Malignant neoplasm of liver and intrahepatic bile ducts (Gene-based burden)"
[4] "Hepatic cancer"                                                                        
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "liver neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "liver cancer",
                 l1_all_disease_terms
         )
         )

5.0.15 Lung neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "lung neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Malignant neoplasm of unspecified part of bronchus or lung (UKB data field 40006) (Gene-based burden)"
[2] "Malignant neoplasm of upper lobe, bronchus or lung (UKB data field 40006) (Gene-based burden)"        
[3] "Malignant neoplasm of lower lobe, bronchus or lung (UKB data field 40006) (Gene-based burden)"        
[4] "Malignant neoplasm of lower lobe, bronchus or lung (UKB data field 40006)"                            
[5] "Malignant neoplasm of unspecified part of bronchus or lung (UKB data field 40006)"                    
[6] "Malignant neoplasm of upper lobe, bronchus or lung (UKB data field 40006)"                            
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "lung neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lung cancer",
                 l1_all_disease_terms
         )
         )

5.0.16 Lymphoid neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "lymphoid neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Cancer code, self-reported: malignant lymph node, unspecified (UKB data field 20001_1070)"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "lymphoid neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "malignant lymphoid tumor",
                 l1_all_disease_terms
         )
         )

5.0.17 Meningeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "meningeal neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 D32.0: Benign neoplasm of cerebral meninges"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "meningeal neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

5.0.18 Mature b-cell neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neoplasm of mature b-cells") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Follicular lymphoma"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neoplasm of mature b-cells" & 
                 `DISEASE/TRAIT` == "Follicular lymphoma",
                 "non-hodgkins lymphoma",
                 l1_all_disease_terms
         )
         )

5.0.19 Mouth neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "mouth neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Oral cancers (chewing tobacco related)"
[2] "Cancer of mouth (PheCode 145)"         
[3] "Oral cancer"                           
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "mouth neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lip and oral cavity cancer",
                 l1_all_disease_terms
         )
         )

5.0.20 Myeloid neoplasm

gwas_study_info |> 
  filter(grepl("myeloid neoplasm", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C92: Myeloid leukemia (Gene-based burden)"
[2] "ICD10 C92: Myeloid leukemia"                    
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "myeloid neoplasm" & 
                 grepl("Myeloid leukemia", `DISEASE/TRAIT`, ignore.case = T),
                 "leukemia",
                 l1_all_disease_terms
         )
         )

5.0.21 Neuroendocrine neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neuroendocrine neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Neuroendocrine tumor"                "Neuroendocrine tumors (PheCode 209)"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neuroendocrine neoplasm" & 
                 grepl("PheCode 209", `DISEASE/TRAIT`, ignore.case = T),
                 "neuroendocrine tumor",
                 l1_all_disease_terms
         )
         )

# TODO: double-check that neuroendocrine tumor is malignant
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neuroendocrine neoplasm" & 
                 grepl("neuroendocrine tumor", `DISEASE/TRAIT`, ignore.case = T),
                 "neuroendocrine tumor",
                 l1_all_disease_terms
         )
         )

5.0.22 Ovarian neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "ovarian neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Malignant neoplasm of ovary (PheCode 184.11)"                          
[2] "ICD10 C56: Malignant neoplasm of ovary"                                
[3] "ICD10 C56: Malignant neoplasm of ovary"                                
[4] "Malignant neoplasm of ovary (UKB data field 40006) (Gene-based burden)"
[5] "ICD10 C56: Malignant neoplasm of ovary"                                
[6] "Malignant neoplasm of ovary (UKB data field 40006)"                    
[7] "ICD10 C56: Malignant neoplasm of ovary (Gene-based burden)"            
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "ovarian neoplasm" &
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "ovarian cancer",
                 l1_all_disease_terms
         )
         )

5.0.23 Nasopharyngeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "nasopharyngeal neoplasm") |> 
  pull(`DISEASE/TRAIT`)
 [1] "Nasopharyngeal carcinoma"                                                            
 [2] "Nasopharyngeal carcinoma"                                                            
 [3] "Nasopharyngeal carcinoma"                                                            
 [4] "Nasopharyngeal carcinoma"                                                            
 [5] "Nasopharyngeal carcinoma"                                                            
 [6] "Nasopharyngeal carcinoma"                                                            
 [7] "Nasopharyngeal carcinoma (SNP x SNP interaction)"                                    
 [8] "Nasopharyngeal carcinoma"                                                            
 [9] "Malignant neoplasm of nasopharynx (Union C11)"                                       
[10] "Nasopharyngeal carcinoma"                                                            
[11] "Response to radiotherapy in nasopharyngeal carcinoma (primary lesion efficacy)"      
[12] "Response to radiotherapy in nasopharyngeal carcinoma (positive lymph nodes efficacy)"
[13] "Cancer of nasopharynx (PheCode 149.2)"                                               
[14] "Dysphagia in nasopharyngeal carcinoma treated with radiotherapy"                     
[15] "Myelosuppression in nasopharyngeal carcinoma treated with radiotherapy"              
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "nasopharyngeal neoplasm" & 
                 grepl("carcinoma|cancer|malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "nasopharyngeal cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info |> 
  filter(grepl("nasopharyngeal neoplasm", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`)
[1] "Response to radiotherapy in nasopharyngeal carcinoma (acute oral mucositis)"  
[2] "Radiation-induced brain injury in nasopharyngeal carcinoma"                   
[3] "Skin reaction in nasopharyngeal carcinoma treated with radiotherapy"          
[4] "Oral mucositis in nasopharyngeal carcinoma treated with radiotherapy"         
[5] "Salivary gland toxicity in nasopharyngeal carcinoma treated with radiotherapy"
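# these rows carry additional terms, so match within the term list
# rather than by exact equality
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("nasopharyngeal neoplasm", l1_all_disease_terms) & 
                 grepl("nasopharyngeal carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 stringr::str_replace_all(l1_all_disease_terms,
                           "nasopharyngeal neoplasm",
                           "nasopharyngeal cancer"),
                 l1_all_disease_terms
         )
         )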
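
When a row holds several comma-separated terms, an exact `==` comparison and a substring replacement behave differently. A minimal sketch of the difference on toy values (the second string is an illustrative multi-term entry, not taken from the data):

```r
library(stringr)

# Toy values: one single-term row, one multi-term row (illustrative)
terms <- c("nasopharyngeal neoplasm",
           "nasopharyngeal neoplasm, oral mucositis")

# Exact comparison selects only the single-term row
is_exact <- terms == "nasopharyngeal neoplasm"
is_exact

# Substring replacement updates the term wherever it appears in the list
replaced <- str_replace_all(terms,
                            "nasopharyngeal neoplasm",
                            "nasopharyngeal cancer")
replaced
```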

5.0.24 Pancreatic neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "pancreatic neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "Intraductal papillary mucinous neoplasm of the pancreas"
# Intraductal papillary mucinous neoplasm of the pancreas is a benign precursor lesion

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
         ifelse(l1_all_disease_terms == "pancreatic neoplasm" & 
                STUDY_ACCESSION == "GCST90104145",
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

5.0.25 Sigmoid neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "sigmoid neoplasm") |> 
  pull(`DISEASE/TRAIT`)
[1] "ICD10 C18.7: Malignant neoplasm of sigmoid colon"                              
[2] "ICD10 C18.7: Malignant neoplasm of sigmoid colon"                              
[3] "Malignant neoplasm of sigmoid colon (UKB data field 40006) (Gene-based burden)"
[4] "Malignant neoplasm of sigmoid colon (UKB data field 40006)"                    
[5] "ICD10 C18.7: Malignant neoplasm of sigmoid colon (Gene-based burden)"          
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "sigmoid neoplasm" & 
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "colorectal cancer",
                 l1_all_disease_terms
         )
         )

5.0.26 Skin neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "skin neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
 [1] "Other non-epithelial cancer of skin (PheCode 172.2)"                                                                     
 [2] "Neoplasm of uncertain behavior of skin (PheCode 173)"                                                                    
 [3] "ICD10 C44.8: Overlapping lesion of skin"                                                                                 
 [4] "ICD10 C44.9:  Malignant neoplasm of skin, unspecified"                                                                   
 [5] "ICD10 C44.0:  Other and unspecified malignant neoplasm of skin of lip"                                                   
 [6] "ICD10 C44.1: Other and unspecified malignant neoplasm of skin of eyelid, including canthus"                              
 [7] "ICD10 C44.2: Other and unspecified malignant neoplasm of skin of ear and external auricular canal"                       
 [8] "ICD10 C44.3: Other and unspecified malignant neoplasm of skin of other and unspecified parts of face"                    
 [9] "ICD10 C44.4: Other and unspecified malignant neoplasm of skin of scalp and neck"                                         
[10] "ICD10 C44.5: Other and unspecified malignant neoplasm of skin of trunk"                                                  
[11] "ICD10 C44.6: Other and unspecified malignant neoplasm of skin of upper limb, including shoulder"                         
[12] "ICD10 C44.7: Other and unspecified malignant neoplasm of skin of lower limb, including hip"                              
[13] "Carcinoma in situ of skin (PheCode 172.3)"                                                                               
[14] "ICD10 D04: Carcinoma in situ of skin"                                                                                    
[15] "Skin cancer (UKB data field 20001)"                                                                                      
[16] "Carcinoma in situ of skin (UKB data field 40006) (Gene-based burden)"                                                    
[17] "Malignant neoplasm of skin of eyelid, including canthus (UKB data field 40006) (Gene-based burden)"                      
[18] "Malignant neoplasm of skin of lower limb, including hip (UKB data field 40006) (Gene-based burden)"                      
[19] "Malignant neoplasm of skin of parts of face (UKB data field 40006) (Gene-based burden)"                                  
[20] "Malignant neoplasm of skin of scalp and neck (UKB data field 40006) (Gene-based burden)"                                 
[21] "Malignant neoplasm of skin of trunk (UKB data field 40006) (Gene-based burden)"                                          
[22] "Malignant neoplasm of skin of upper limb, including shoulder (UKB data field 40006) (Gene-based burden)"                 
[23] "Malignant neoplasm of skin, unspecified (UKB data field 40006) (Gene-based burden)"                                      
[24] "Malignant neoplasm of skin (UKB data field 40006) (Gene-based burden)"                                                   
[25] "Malignant neoplasm of skin of ear and external auricular canal (UKB data field 40006) (Gene-based burden)"               
[26] "Skin cancer (UKB data field 20001) (Gene-based burden)"                                                                  
[27] "Malignant neoplasm of skin of eyelid, including canthus (UKB data field 40006)"                                          
[28] "Malignant neoplasm of skin of lower limb, including hip (UKB data field 40006)"                                          
[29] "Malignant neoplasm of skin of parts of face (UKB data field 40006)"                                                      
[30] "Malignant neoplasm of skin of scalp and neck (UKB data field 40006)"                                                     
[31] "Malignant neoplasm of skin of trunk (UKB data field 40006)"                                                              
[32] "Malignant neoplasm of skin of upper limb, including shoulder (UKB data field 40006)"                                     
[33] "Malignant neoplasm of skin, unspecified (UKB data field 40006)"                                                          
[34] "Malignant neoplasm of skin (UKB data field 40006)"                                                                       
[35] "Malignant neoplasm of skin of ear and external auricular canal (UKB data field 40006)"                                   
[36] "Carcinoma in situ of skin (UKB data field 40006)"                                                                        
[37] "Skin cancer"                                                                                                             
[38] "ICD10 C44: Other and unspecified malignant neoplasm of skin"                                                             
[39] "ICD10 D04: Carcinoma in situ of skin (Gene-based burden)"                                                                
[40] "ICD10 C44: Other and unspecified malignant neoplasm of skin (Gene-based burden)"                                         
[41] "ICD10 C44.1: Other and unspecified malignant neoplasm of skin of eyelid, including canthus (Gene-based burden)"          
[42] "ICD10 C44.2: Other and unspecified malignant neoplasm of skin of ear and external auricular canal (Gene-based burden)"   
[43] "ICD10 C44.3: Other and unspecified malignant neoplasm of skin of other and unspecified parts of face (Gene-based burden)"
[44] "ICD10 C44.4: Other and unspecified malignant neoplasm of skin of scalp and neck (Gene-based burden)"                     
[45] "ICD10 C44.5: Other and unspecified malignant neoplasm of skin of trunk (Gene-based burden)"                              
[46] "ICD10 C44.6: Other and unspecified malignant neoplasm of skin of upper limb, including shoulder (Gene-based burden)"     
[47] "ICD10 C44.7: Other and unspecified malignant neoplasm of skin of lower limb, including hip (Gene-based burden)"          
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "skin neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "non-melanoma skin cancer",
                 l1_all_disease_terms
         )
         )

5.0.27 Stomach neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "stomach neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C16: Malignant neoplasm of stomach"                              
[2] "Cancer code, self-reported: stomach cancer (UKB data field 20001_1018)"
[3] "Cancer of stomach (PheCode 151)"                                       
[4] "ICD10 C16: Malignant neoplasm of stomach (Gene-based burden)"          
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "stomach neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "stomach cancer",
                 l1_all_disease_terms
         )
         )

5.0.28 Testicular neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "testicular neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C62.1: Malignant neoplasm of descended testis"   
[2] "ICD10 C62.9: Malignant neoplasm of testis, unspecified"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
        ifelse(l1_all_disease_terms == "testicular neoplasm" &
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "testicular cancer",
                 l1_all_disease_terms
         )
         )

5.0.29 Tongue neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "tongue neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Cancer of tongue (PheCode 145.2)"                                     
[2] "Cancer code, self-reported: tongue cancer (UKB data field 20001_1011)"
[3] "ICD10 C01: Malignant neoplasm of base of tongue"                      
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "tongue neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lip and oral cavity cancer",
                 l1_all_disease_terms
         )
         )

5.0.30 Uterine neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "uterine neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Other benign neoplasms of uterus (Union D26)"   
[2] "Other benign neoplasm of uterus (PheCode 218.2)"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "uterine neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

5.0.31 Urogenital neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "urogenital neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Benign neoplasm of kidney and other urinary organs (PheCode 223)"                                    
[2] "ICD10 D07: Carcinoma in situ of other and unspecified genital organs"                                
[3] "Carcinoma in situ of other and unspecified genital organs (UKB data field 40006) (Gene-based burden)"
[4] "Carcinoma in situ of other and unspecified genital organs (UKB data field 40006)"                    
[5] "ICD10 D07: Carcinoma in situ of other and unspecified genital organs (Gene-based burden)"            
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "urogenital neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "urogenital cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "urogenital neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

5.0.31.1 Vulvar neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "vulvar neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "ICD10 C51.9: Malignant neoplasm of vulva, unspecified"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "vulvar neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "vulvar cancer",
                 l1_all_disease_terms
         )
         )
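
After a series of term-by-term recodes like the ones above, it can help to list any labels that still mention "neoplasm". A minimal sketch in base R on a toy vector. `l1_terms` stands in for `gwas_study_info$l1_all_disease_terms`, and its values are illustrative:

```r
# Toy term column; values are illustrative only
l1_terms <- c("lung cancer", "benign neoplasm",
              "kidney neoplasm, hypertension", "vulvar cancer")

# Split comma-separated lists into individual, trimmed terms
all_terms <- unique(trimws(unlist(strsplit(l1_terms, ","))))

# Keep labels that still mention "neoplasm", excluding the intended
# "benign neoplasm" category
remaining <- all_terms[grepl("neoplasm", all_terms) &
                         all_terms != "benign neoplasm"]
remaining
```

Any label returned here has not yet been assigned to a cancer or benign category.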

5.0.32 Ocular Melanoma

ocular_melanoma_terms <- c("uveal melanoma",
                           "uveal melanoma disease severity",
                           "epithelioid cell uveal melanoma",
                           "choroidal melanoma",
                           "ocular melanoma disease severity"
                           )

ocular_melanoma_terms = str_length_sort(ocular_melanoma_terms)

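# wrap every term in its own boundaries so that the first and last
# alternatives are anchored like the rest
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          paste0("\\b", ocular_melanoma_terms, "(?=,|$)", collapse = "|"),
                          "ocular melanoma"
         ))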

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "ocular melanoma disease severity",
                          "ocular melanoma"
         ))
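
The pattern above is built by joining the terms with "(?=,|$)|\\b", which leaves the first term without a leading word boundary and the last term without the trailing lookahead. A minimal sketch of one way to anchor every term identically is shown here. `build_term_pattern` and the toy vectors are hypothetical, and the sketch uses base R `gsub` rather than stringr so that it runs without extra packages; it also assumes the intended ordering is longest term first.

```r
# Hypothetical helper (not part of the original analysis): wrap all terms in
# one group so each alternative gets the same leading \b and trailing lookahead.
build_term_pattern <- function(terms) {
  # longest first, so longer labels are tried before their shorter prefixes
  terms <- terms[order(nchar(terms), decreasing = TRUE)]
  paste0("\\b(?:", paste0(terms, collapse = "|"), ")(?=,|$)")
}

toy_terms <- c("uveal melanoma",
               "uveal melanoma disease severity",
               "choroidal melanoma")

toy_labels <- c("uveal melanoma disease severity, asthma",
                "choroidal melanoma")

# perl = TRUE is required for the lookahead in base R
harmonised <- gsub(build_term_pattern(toy_terms),
                   "ocular melanoma",
                   toy_labels,
                   perl = TRUE)
print(harmonised)
```

With every term anchored the same way, a separate clean-up pass for a leftover "disease severity" suffix should not be needed, although that would have to be confirmed on the real data.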

6 Other …

6.0.1 Benign neoplasm, colorectal cancer

gwas_study_info |>
  filter(l1_all_disease_terms == "benign neoplasm, colorectal cancer") |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()
                            MAPPED_TRAIT                         DISEASE/TRAIT
                                  <char>                                <char>
1: colorectal cancer, colorectal adenoma Colorectal cancer or advanced adenoma
   STUDY_ACCESSION
            <char>
1:      GCST007856
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90093303",
                stringr::str_replace_all(l1_all_disease_terms,
                          "benign neoplasm, colorectal cancer",
                          "colorectal cancer"),
                l1_all_disease_terms
         ))

6.0.2 More other pharynx cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "larynx cancer, pharynx cancer",
                          "larynx cancer, other pharynx cancer"
         ))

6.0.3 Malignant melanoma of skin

6.0.3.1 Cutaneous melanoma to malignant melanoma of skin

gwas_study_info =
  gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "cutaneous melanoma",
                          "malignant melanoma of skin"
         ))

6.0.3.2 Dealing with studies just labelled as “melanoma”

gwas_study_info |>
 filter(l1_all_disease_terms == "melanoma") |>
  pull(`DISEASE/TRAIT`) |>
  unique()
 [1] "Adverse response to dabrafenib or dabrafenib-trametinib treatment in melanoma (pyrexia)"        
 [2] "Melanoma"                                                                                       
 [3] "Survival in melanoma"                                                                           
 [4] "ICD10 D03.3: Melanoma in situ of other and unspecified parts of face"                           
 [5] "ICD10 D03.5: Melanoma in situ of trunk"                                                         
 [6] "ICD10 D03.6: Melanoma in situ of upper limb, including shoulder"                                
 [7] "ICD10 D03.7: Melanoma in situ of lower limb, including hip"                                     
 [8] "ICD10 C43.3: Malignant melanoma of other and unspecified parts of face"                         
 [9] "ICD10 C43.5: Malignant melanoma of trunk"                                                       
[10] "ICD10 C43.6: Malignant melanoma of upper limb, including shoulder"                              
[11] "ICD10 C43.7: Malignant melanoma of lower limb, including hip"                                   
[12] "ICD10 C43.9: Malignant melanoma of skin, unspecified"                                           
[13] "Cancer code, self-reported: malignant melanoma (UKB data field 20001_1059)"                     
[14] "Melanoma in situ"                                                                               
[15] "Invasive melanoma"                                                                              
[16] "Melanoma (in situ vs invasive)"                                                                 
[17] "Malignant melanoma"                                                                             
[18] "Overall survival in skin melanoma"                                                              
[19] "Skin melanoma specific survival"                                                                
[20] "Malignant melanoma (Gene-based burden)"                                                         
[21] "ICD10 D03: Melanoma in situ"                                                                    
[22] "Malignant melanoma (UKB data field 20001) (Gene-based burden)"                                  
[23] "Malignant melanoma (UKB data field 20001)"                                                      
[24] "Melanoma in situ (UKB data field 40006) (Gene-based burden)"                                    
[25] "Malignant melanoma of skin (UKB data field 40006) (Gene-based burden)"                          
[26] "Malignant melanoma of trunk (UKB data field 40006) (Gene-based burden)"                         
[27] "Malignant melanoma of upper limb, including shoulder (UKB data field 40006) (Gene-based burden)"
[28] "Malignant melanoma of lower limb, including hip (UKB data field 40006) (Gene-based burden)"     
[29] "Melanoma in situ (UKB data field 40006)"                                                        
[30] "Malignant melanoma of lower limb, including hip (UKB data field 40006)"                         
[31] "Malignant melanoma of skin (UKB data field 40006)"                                              
[32] "Malignant melanoma of trunk (UKB data field 40006)"                                             
[33] "Malignant melanoma of upper limb, including shoulder (UKB data field 40006)"                    
[34] "Melanoma x citrus consumption interaction (2df)"                                                
[35] "Melanoma specific survival"                                                                     
[36] "ICD10 D03: Melanoma in situ (Gene-based burden)"                                                
[37] "ICD10 C43.6: Malignant melanoma of upper limb, including shoulder (Gene-based burden)"          
[38] "ICD10 C43.7: Malignant melanoma of lower limb, including hip (Gene-based burden)"               
# checked UKB data field 40006 (ICD10 codes) 
# https://biobank.ctsu.ox.ac.uk/ukb/field.cgi?id=40006
# malignant melanoma of skin includes: 
# Malignant melanoma of trunk 
# Malignant melanoma of upper limb, including shoulder
# Malignant melanoma of lower limb, including hip

# checked UKB data field 20001 
# https://biobank.ctsu.ox.ac.uk/ukb/field.cgi?id=20001
# malignant melanoma is a subcategory of skin cancer


malignant_skin_melanoma <- c("ICD10 C43",
                             "survival in skin melanoma",
                             "Skin melanoma specific survival",
                             "malignant melanoma of skin",
                             "malignant melanoma of trunk",
                             "Malignant melanoma \\(UKB data field 20001\\)",
                             "malignant melanoma \\(UKB data field 20001_1059\\)",
                             "malignant melanoma of upper limb, including shoulder",
                             "malignant melanoma of lower limb, including hip",
                             "ICD10 D03", # skin melanoma in situ 
                             "Melanoma in situ \\(UKB data field 40006\\)"
)


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" &
                grepl(paste0(malignant_skin_melanoma, collapse = "|\\b"),
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma",
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))

# UKBB malignant melanoma of skin
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" &
                grepl("UKBB", COHORT, ignore.case = T) &
                grepl("malignant melanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma",
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))

# cutaneous melanoma in title
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" & 
                grepl("\\bcutaneous melanoma", STUDY, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma", 
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))
                  

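# check for rows still labelled just "melanoma"
# note: `==` is a literal comparison, so the regex-style string "\\bmelanoma"
# cannot equal any term; this check and the next one are empty by construction.
# Comparing against "melanoma" would list the rows that are still unassigned.
gwas_study_info |>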
 filter(l1_all_disease_terms == "\\bmelanoma") |>
  pull(`DISEASE/TRAIT`) |>
  unique()
character(0)
gwas_study_info |>
 filter(l1_all_disease_terms == "\\bmelanoma") |>
  pull(PUBMED_ID) |>
  unique()
integer(0)
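
The difference between a literal comparison and a regex match in the two checks above can be seen on a toy vector. This is only a sketch with made-up labels, not output from the real data.

```r
toy_terms <- c("melanoma", "malignant melanoma of skin", "ocular melanoma")

# `==` compares whole strings literally; "\\bmelanoma" is the string
# backslash-b-melanoma, which no label equals
literal_hits <- toy_terms == "\\bmelanoma"

# exact match on the plain label: finds only the unassigned rows
exact_hits <- toy_terms == "melanoma"

# regex match: broader, also hits labels that merely contain the word
regex_hits <- grepl("\\bmelanoma", toy_terms)

print(data.frame(toy_terms, literal_hits, exact_hits, regex_hits))
```

For a "what is still unassigned" check, the exact comparison is probably the one wanted, since the regex also matches labels that were already harmonised.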
# Checking the clinical trials that make up pubmed id 27023328
# https://clinicaltrials.gov/study/NCT01153763 - cutaneous melanoma
# https://clinicaltrials.gov/study/NCT01266967 - not specified, likely skin melanoma by MeSH terms
# https://clinicaltrials.gov/study/NCT01227889 - not specified, likely skin melanoma by MeSH terms
# https://clinicaltrials.gov/study/NCT01584648 - cutaneous melanoma
# https://clinicaltrials.gov/study/NCT01597908 - cutaneous melanoma
# thus, likely malignant melanoma of skin
                             
# for pubmed id: 21983785
# seems likely malignant melanoma of skin,
# as they test in situ vs invasive and use non-skin cancer controls
# https://pmc.ncbi.nlm.nih.gov/articles/PMC3227560/#SM

# for pubmed id: 23455637 - uses one of the same cohorts as 21983785 (genoMEL)
# thus likely malignant melanoma of skin
# https://pubmed.ncbi.nlm.nih.gov/23455637/

# for pubmed id: 21706340
# Cutaneous malignant melanoma
# therefore, malignant melanoma of skin

# pubmed id: 19578364 also uses GenoMEL consortium
# therefore, malignant melanoma of skin

# pubmed id: 18488026
# cutaneous malignant melanoma
# therefore, malignant melanoma of skin

# pubmed id: 21983787
# also uses GenoMEL consortium
# therefore, malignant melanoma of skin

# pubmed id: 28212542
# cutaneous melanoma
# therefore, malignant melanoma of skin

# pubmed id: 24980573
# skin cancer melanoma discussion 
# therefore, malignant melanoma of skin

# pubmed id: 35626014
# not entirely clear, but likely malignant melanoma of skin

# pubmed id: 34724200
# "current study focused on melanomas of the skin"
# hence, malignant melanoma of skin

# pubmed id: 34290314
# lists ICD10 codes as C43 (malignant melanoma of skin) 
# for melanoma - thus malignant melanoma of skin

# pubmed id: 36064556
# lists cutaneous melanoma

malignant_skin_melanoma_studies <- c(27023328,
                                     21983785,
                                     23455637,
                                     21706340,
                                     19578364,
                                     18488026,
                                     21983787,
                                     28212542,
                                     24980573,
                                     35626014,
                                     34724200,
                                     34290314,
                                     36064556)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID %in% malignant_skin_melanoma_studies &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma",
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))


# not sure about pubmed id: 32887889
# a guess, but likely malignant melanoma of skin
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID == 32887889 &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma",
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))


# also not sure about pubmed id: 33409738
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID == 33409738 &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "\\bmelanoma",
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))

6.0.4 In situ cancer

gwas_study_info |> 
  filter(grepl("in situ", l1_all_disease_terms)) |> 
  pull(l1_all_disease_terms) |>
  unique()
[1] "skin cancer in situ"           "uterine cervix cancer in situ"
[3] "in situ cancer"               
# cervical cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "uterine cervix cancer in situ",
                          "cervical cancer"
         ))


gwas_study_info |> 
  filter(grepl("in situ", l1_all_disease_terms))  |>
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Carcinoma in situ of skin (PheCode 172.3)"                                                
[2] "ICD10 D04.3: Carcinoma in situ of skin of other and unspecified parts of face"            
[3] "ICD10 D04.5: Carcinoma in situ of skin of trunk"                                          
[4] "ICD10 D04.6: Carcinoma in situ of skin of upper limb, including shoulder"                 
[5] "ICD10 D04.7: Carcinoma in situ of skin of lower limb, including hip"                      
[6] "ICD10 D04.9: Carcinoma in situ of skin, unspecified"                                      
[7] "Behaviour of cancer tumour: carcinoma in situ, PHESANT recoding (UKB data field 40012_2)" 
[8] "Behaviour of cancer tumour - Carcinoma in situ (UKB data field 40012) (Gene-based burden)"
[9] "Behaviour of cancer tumour - Carcinoma in situ (UKB data field 40012)"                    
# strictly speaking, this is unspecified skin cancer
# but ICD10 D04 is likely non-melanoma skin cancer
# PheCode 172.3 maps to D04.9
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("in situ", l1_all_disease_terms) & 
                grepl("ICD10 D04|PheCode 172.3", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "skin cancer in situ",
                          "non-melanoma skin cancer"),
                l1_all_disease_terms
         )
  )

# in situ cancer -> cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "in situ cancer",
                stringr::str_replace_all(l1_all_disease_terms,
                          "in situ cancer",
                          "cancer"),
                l1_all_disease_terms
         )
  )

6.0.5 Non-specific cancer terms

gwas_study_info |> 
  filter(l1_all_disease_terms=="cancer")  |> 
  select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
  head()
                                           DISEASE/TRAIT l1_all_disease_terms
                                                  <char>               <char>
1:                                                Cancer               cancer
2:      Response to Pazopanib in cancer (hepatotoxicity)               cancer
3:          Body mass index (change over time) in cancer               cancer
4:                                                Cancer               cancer
5:                                                Cancer               cancer
6: Reported occurrences of cancer (UKB data field 40009)               cancer
# ICD10 Z85.4: Personal history of malignant neoplasm of genital organs
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.4", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "cancer",
                          "urogenital cancer"),
                l1_all_disease_terms
         )
  )

# ICD10 Z85.1: Personal history of malignant neoplasm of trachea, bronchus and lung 
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.1", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "cancer",
                           "tracheal bronchus and lung cancer"),
                l1_all_disease_terms
         )
  )

# ICD10 Z85.0: Personal history of malignant neoplasm of digestive organs
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.0", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "cancer",
                          "digestive system cancer"),
                l1_all_disease_terms
         )
  )

# Cancer of intrathoracic organs (PheCode 164)
# Malignant neoplasm of retroperitoneum and peritoneum (PheCode 159.4)
# (both codes are matched by the pattern below and receive the same label)
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("PheCode 164|PheCode 159.4", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "cancer",
                          "peritoneum cancer, retroperitoneal cancer"),
                l1_all_disease_terms
         )
  )
# Malignant neoplasm of other and ill-defined sites within the digestive organs and peritoneum (PheCode 159)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                # trait labels end with ")", so anchor on it rather than on "$"
                grepl("PheCode 159\\)", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "cancer",
                          "digestive system cancer, peritoneum cancer"),
                l1_all_disease_terms
         )
  )
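
Trait labels that carry a PheCode on this page end with a closing parenthesis, e.g. "Skin cancer (PheCode 172)", so the choice of anchor decides whether PheCode 159 can be separated from PheCode 159.4. A minimal sketch with made-up labels:

```r
# toy labels (hypothetical), following the "(PheCode ...)" suffix seen in the data
toy_traits <- c("Hypothetical trait A (PheCode 159)",
                "Hypothetical trait B (PheCode 159.4)")

# "$" requires the code to be the very last characters, but the label ends in ")"
end_anchored <- grepl("PheCode 159$", toy_traits)

# anchoring on the escaped ")" matches 159 and still excludes 159.4
paren_anchored <- grepl("PheCode 159\\)", toy_traits)

print(rbind(end_anchored, paren_anchored))
```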

6.0.6 peritoneum cancer -> peritoneal cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "\\bperitoneum cancer\\b",
                          "peritoneal cancer"
         ))

6.0.7 Bladder tumor

gwas_study_info |>
  filter(l1_all_disease_terms == "bladder tumor") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()
[1] "Neoplasm of uncertain behavior of bladder (UKB data field 40006) (Gene-based burden)"
[2] "Neoplasm of uncertain behavior of bladder (UKB data field 40006)"                    

6.0.8 Squamous cell cancer

gwas_study_info |>
    filter(grepl("squamous cell cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()
                                                                      DISEASE/TRAIT
                                                                             <char>
 1:                                                   Multiple keratinocyte cancers
 2:                                                         Squamous cell carcinoma
 3:                                               Cutaneous squamous cell carcinoma
 4:                                               Esophageal cancer (squamous cell)
 5:                            Esophageal squamous cell cancer (length of survival)
 6:                                                                     Lung cancer
 7:                     Pre-treatment pain in head and neck squamous cell carcinoma
 8:                                           Head and neck squamous cell carcinoma
 9:     Multiple cancers (lung cancer, gastric cancer, and squamous cell carcinoma)
10:                                                     Pan-squamous cell carcinoma
11: Cancer code, self-reported: squamous cell carcinoma (UKB data field 20001_1062)
12:                                        Squamous cell carcinoma (PheCode 172.22)
13:                                  Squamous cell carcinoma (UKB data field 20001)
14:              Squamous cell carcinoma (UKB data field 20001) (Gene-based burden)
15:    Head and neck squamous cell carcinoma (adjusted for environmental exposures)
16:                                                  Squamous cell carcinoma (MTAG)
                                 l1_all_disease_terms
                                               <char>
 1:    non-melanoma skin cancer, squamous cell cancer
 2:                              squamous cell cancer
 3:                              squamous cell cancer
 4:           esophageal cancer, squamous cell cancer
 5:           esophageal cancer, squamous cell cancer
 6:    lung cancer, lung cancer, squamous cell cancer
 7:  head and neck cancer, squamous cell cancer, pain
 8:        head and neck cancer, squamous cell cancer
 9: stomach cancer, lung cancer, squamous cell cancer
10:                              squamous cell cancer
11:                              squamous cell cancer
12:                              squamous cell cancer
13:                              squamous cell cancer
14:                              squamous cell cancer
15:        head and neck cancer, squamous cell cancer
16:                              squamous cell cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "lung cancer, squamous cell cancer",
                          "lung cancer"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "esophageal cancer, squamous cell cancer",
                          "esophageal cancer"
         ))

# handle the pain study first: the generic head and neck replacement below
# would otherwise consume "squamous cell cancer" and leave plain "pain"
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "head and neck cancer, squamous cell cancer, pain",
                          "head and neck cancer, cancer pain"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "head and neck cancer, squamous cell cancer",
                          "head and neck cancer"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "non-melanoma skin cancer, squamous cell cancer",
                          "non-melanoma skin cancer"
         ))
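
The replacements in this section each depend on the exact order of terms inside the label. A possible alternative is to split each label on commas and drop the unwanted term wherever it sits. The sketch below is an assumption about how that could look, not the method used in this analysis; `drop_term` and the toy labels are hypothetical.

```r
# Hypothetical helper: remove one term from a comma-separated label,
# independent of its position, and drop duplicated terms.
# A label that consists only of the term is left unchanged.
drop_term <- function(labels, term) {
  vapply(strsplit(labels, ",\\s*"), function(parts) {
    kept <- unique(parts[parts != term])
    if (length(kept) == 0) term else paste(kept, collapse = ", ")
  }, character(1), USE.NAMES = FALSE)
}

toy_labels <- c("lung cancer, lung cancer, squamous cell cancer",
                "head and neck cancer, squamous cell cancer, pain",
                "squamous cell cancer")

cleaned <- drop_term(toy_labels, "squamous cell cancer")
print(cleaned)
```

This only removes the term; renaming "pain" to "cancer pain" would still need its own step.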

6.0.9 Female reproductive organ

gwas_study_info |>
    filter(grepl("female reproductive organ cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()
                                                                        DISEASE/TRAIT
                                                                               <char>
1:                       Benign neoplasm of other female genital organs (PheCode 221)
2:                                Cancer of other female genital organs (PheCode 184)
3: Cancer of other female genital organs (excluding uterus and ovary) (PheCode 184.2)
               l1_all_disease_terms
                             <char>
1: female reproductive organ cancer
2: female reproductive organ cancer
3: female reproductive organ cancer
gwas_study_info = gwas_study_info |>
    mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "female reproductive organ cancer" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )


gwas_study_info |>
    filter(grepl("\\breproductive system cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()
                DISEASE/TRAIT       l1_all_disease_terms
                       <char>                     <char>
1: Female reproductive cancer reproductive system cancer
gwas_study_info = gwas_study_info |>
    mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "reproductive system cancer" & 
                 grepl("female reproductive cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "female reproductive organ cancer",
                 l1_all_disease_terms
         )
         )

6.0.10 Male reproductive organ

gwas_study_info |>
    filter(grepl("\\bmale reproductive organ cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()
                                                           DISEASE/TRAIT
                                                                  <char>
1:                                              Male reproductive cancer
2: Neoplasm of uncertain behavior of male genital organs (PheCode 187.8)
3:                     Cancer of other male genital organs (PheCode 187)
4:  Malignant neoplasm of unspecified male genial organs (PheCode 187.1)
             l1_all_disease_terms
                           <char>
1: male reproductive organ cancer
2: male reproductive organ cancer
3: male reproductive organ cancer
4: male reproductive organ cancer

6.0.11 Small cell cancer

gwas_study_info |>
    filter(grepl("\\bsmall cell cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()
            DISEASE/TRAIT l1_all_disease_terms
                   <char>               <char>
1: Small-cell lung cancer    small cell cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(`DISEASE/TRAIT` == "Small-cell lung cancer",
         stringr::str_replace_all(l1_all_disease_terms,
                          "small cell cancer",
                          "lung cancer"),
         l1_all_disease_terms
         )
)

6.0.12 Central nervous system cancer, nervous system cancer

gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "central nervous system cancer, nervous system cancer" ~ "central nervous system cancer",
                   TRUE ~ l1_all_disease_terms)
         ) 

6.0.13 Fix non-melanoma skin cancer, woopsy

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "non-malignant skin melanoma skin cancer" ~ "non-melanoma skin cancer",
                   l1_all_disease_terms == "non-malignant melanoma of skin skin cancer" ~ "non-melanoma skin cancer",
                   TRUE ~ l1_all_disease_terms)
         ) 

# 
# gwas_study_info =
# gwas_study_info |>
#   mutate(l1_all_disease_terms =
#          stringr::str_replace_all(l1_all_disease_terms,
#                            "non-malignant skin melanoma skin cancer",
#                           "non-melanoma skin cancer"
#          )
#          ) 
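
The mangled labels fixed above are what a "\\bmelanoma" replacement produces when it meets "non-melanoma skin cancer": a word boundary also sits between "-" and "m". The sketch below reproduces this on toy labels and shows one assumed way to guard against it with a negative lookbehind. It uses base R `gsub` with `perl = TRUE`; the same pattern has not been checked against the real data.

```r
toy_labels <- c("melanoma",
                "non-melanoma skin cancer",
                "melanoma, asthma")

# \\b matches between "-" and "m", so the "melanoma" inside
# "non-melanoma skin cancer" is rewritten as well
naive <- gsub("\\bmelanoma",
              "malignant melanoma of skin",
              toy_labels,
              perl = TRUE)

# assumed alternative: skip any "melanoma" that directly follows a hyphen
guarded <- gsub("(?<!-)\\bmelanoma",
                "malignant melanoma of skin",
                toy_labels,
                perl = TRUE)

print(naive)
print(guarded)
```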

6.0.14 Unspecified skin cancer

gwas_study_info |> 
  filter(grepl("skin cancer", l1_all_disease_terms) & 
         !grepl("non-melanoma", l1_all_disease_terms)
         )  |> 
  select(STUDY, 
         `DISEASE/TRAIT`, 
         all_disease_terms, 
         l1_all_disease_terms) |> 
  distinct()
                                                                                                                   STUDY
                                                                                                                  <char>
1: Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.
2:                                             A generalized linear mixed model association tool for biobank-scale data.
3:                           Diversity and scale: Genetic architecture of 2068 traits in the VA Million Veteran Program.
4:                           Diversity and scale: Genetic architecture of 2068 traits in the VA Million Veteran Program.
                                                         DISEASE/TRAIT
                                                                <char>
1:                                           Skin cancer (PheCode 172)
2: Cancer code, self-reported: skin cancer (UKB data field 20001_1003)
3:                                                         Skin cancer
4:                                   Takes medication for skin cancer?
   all_disease_terms l1_all_disease_terms
              <char>               <char>
1:       skin cancer          skin cancer
2:    skin carcinoma          skin cancer
3:       skin cancer          skin cancer
4:       skin cancer          skin cancer
# list them under both malignant melanoma of skin and non-melanoma skin cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("skin cancer", l1_all_disease_terms) & 
                !grepl("melanoma", l1_all_disease_terms),
                stringr::str_replace_all(l1_all_disease_terms,
                          "skin cancer",
                          "malignant melanoma of skin, non-melanoma skin cancer"),
                l1_all_disease_terms
         )
         )

6.1 Unspecified lymphoma

gwas_study_info |> 
  filter(grepl("lymphoma", l1_all_disease_terms) & 
          !grepl("hodgkin", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`) |> 
  unique()
 [1] "Lymphoma"                                                          
 [2] "B cell non-Hodgkin lymphoma"                                       
 [3] "Large cell lymphoma (PheCode 202.24)"                              
 [4] "ICD10 C83.8: Other non-follicular lymphoma"                        
 [5] "Cancer code, self-reported: lymphoma (UKB data field 20001_1047)"  
 [6] "Lymphosarcoma (PheCode 202.23)"                                    
 [7] "Non-follicular lymphoma (UKB data field 40006) (Gene-based burden)"
 [8] "ICD10 C85.1: Unspecified B-cell lymphoma"                          
 [9] "Non-follicular lymphoma (UKB data field 40006)"                    
[10] "Asthma in lymphoma"                                                
[11] "ICD10 C83: Non-follicular lymphoma"                                
[12] "ICD10 C83: Non-follicular lymphoma (Gene-based burden)"            
[13] "ICD10 C85.1: Unspecified B-cell lymphoma (Gene-based burden)"      
[14] "Malignant lymphoma"                                                
# PheCode 202.23 maps to ICD-9 200.1 (Lymphosarcoma),
# which, per http://snomed.info/id/188498009, is a form of non-Hodgkin's lymphoma

# PheCode 202.24 maps to ICD-9 200.6 (Anaplastic large cell lymphoma), a form of non-Hodgkin's lymphoma

# all ICD10 C83 codes are non-Hodgkin's lymphoma

# ICD10 C85.1 maps to PheCode 202.2 (Non-Hodgkin's lymphoma)

nhl_terms <- c("B cell non-Hodgkin lymphoma",
               "PheCode 202.23",
               "PheCode 202.24",
               "ICD10 C83",
               "ICD10 C85.1"
               )


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl(paste0(nhl_terms, collapse = "|"), 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )


# Non-follicular lymphoma (UKB data field 40006) likely non-hodgkin lymphoma
# as ICD10 C83: Non-follicular lymphoma is non-hodgkin lymphoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl("Non-follicular lymphoma \\(UKB data field 40006\\)", 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )


# Cancer code, self-reported: lymphoma (UKB data field 20001_1047)
# includes both hodgkin and non-hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl("Cancer code, self-reported: lymphoma \\(UKB data field 20001_1047\\)", 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# PubMed ID 34594039, supplementary table 1:
# "Malignant lymphoma" (Malignant_Lymphoma) is defined as PheCodes 201/202, CD2_NONFOLLICULAR_LYMPHOMA
# and thus includes both Hodgkin and non-Hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 34594039,
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# PubMed ID 23349640:
# includes both Hodgkin and non-Hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 23349640,
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# PubMed ID 36344522: classification uncertain and needs a closer reading of the paper,
# but it appears to include both Hodgkin and non-Hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 36344522,
                stringr::str_replace_all(l1_all_disease_terms,
                          "^lymphoma",
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )
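
The lymphoma relabelling steps above all share one pattern: a row is changed only if its terms mention "lymphoma", do not already mention "hodgkin", and satisfy one extra condition (a trait pattern or a PubMed ID). A minimal sketch of that pattern as a reusable function follows. The helper name `relabel_lymphoma()` and the toy data are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper (an assumption, not in the original analysis):
# relabel a leading "lymphoma" term for rows selected by `condition`.
relabel_lymphoma <- function(df, condition, replacement) {
  df |>
    mutate(l1_all_disease_terms =
             ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = TRUE) &
                      !grepl("hodgkin", l1_all_disease_terms, ignore.case = TRUE) &
                      {{ condition }},
                    stringr::str_replace_all(l1_all_disease_terms,
                                             "^lymphoma",
                                             replacement),
                    l1_all_disease_terms))
}

# Toy data, invented for illustration only
toy <- tibble(
  PUBMED_ID = c(1, 2, 3),
  `DISEASE/TRAIT` = c("ICD10 C83: Non-follicular lymphoma",
                      "Hodgkin lymphoma",
                      "Malignant lymphoma"),
  l1_all_disease_terms = c("lymphoma", "hodgkin lymphoma", "lymphoma")
)

toy_out <- toy |>
  relabel_lymphoma(grepl("ICD10 C83", `DISEASE/TRAIT`, ignore.case = TRUE),
                   "non-hodgkin lymphoma") |>
  relabel_lymphoma(PUBMED_ID == 3,
                   "non-hodgkin lymphoma, hodgkin lymphoma")

toy_out$l1_all_disease_terms
```

Because the replacement pattern is anchored with `^`, only a term string that begins with "lymphoma" is rewritten. A row that passes all three conditions but lists "lymphoma" later in its comma-separated terms is left unchanged, both here and in the steps above.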

6.1.1 breast cancer, cancer, colon and rectum cancer, tracheal bronchus and lung cancer, ovarian cancer, prostate cancer

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "breast cancer, cancer, colorectal cancer, lung cancer, ovarian cancer, prostate cancer" ~ 
                     "breast cancer, colorectal cancer, lung cancer, ovarian cancer, prostate cancer",
                   TRUE ~ l1_all_disease_terms)
         )

7 Final summary - number of unique study terms

7.1 Deal with duplicate terms created during grouping

gwas_study_info = 
  gwas_study_info |>
  rowwise() |>
  mutate(l1_all_disease_terms = paste0(sort(unique(unlist(strsplit(l1_all_disease_terms, ", ")))),
                                      collapse = ", ")
         ) |>
  ungroup()
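
To illustrate what the split, de-duplicate, sort, and re-join step does, here is a small sketch on invented strings. It uses `vapply()` over the split terms rather than `rowwise()`, which is an assumed alternative and not the approach used above. The function name `dedup_terms()` is hypothetical.

```r
# Toy strings, invented for illustration only
terms <- c("lymphoma, asthma, lymphoma",
           "non-hodgkin lymphoma, hodgkin lymphoma",
           "")

# Hypothetical helper: split on ", ", drop duplicates, sort, and re-join
dedup_terms <- function(x) {
  vapply(strsplit(x, ", ", fixed = TRUE),
         function(parts) paste0(sort(unique(parts)), collapse = ", "),
         character(1))
}

dedup_terms(terms)
```

The sort also gives each combination of terms a single canonical ordering, so "a, b" and "b, a" collapse to the same string before counting.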

7.2 Deal with hanging commas and spaces

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = stringr::str_remove_all(l1_all_disease_terms, "^,|,$")
         ) |>
  mutate(l1_all_disease_terms = stringr::str_trim(l1_all_disease_terms)
         ) 
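
The pattern `"^,|,$"` only removes a comma that is the very first or very last character, and the whitespace trim runs afterwards. The sketch below compares it, on invented strings, with an assumed more defensive pattern that strips any run of commas and whitespace at either end. Whether strings such as `"asthma, "` actually occur in `l1_all_disease_terms` is not known from this page.

```r
library(stringr)

# Toy strings, invented for illustration only
x <- c(", asthma", "asthma, ", "asthma,", " asthma, lymphoma ")

# Same two steps as above: strip a leading/trailing comma, then trim whitespace
original <- str_trim(str_remove_all(x, "^,|,$"))

# Assumed alternative: strip any mix of commas and whitespace at both ends
defensive <- str_remove_all(x, "^[,\\s]+|[,\\s]+$")

data.frame(x, original, defensive)
```

In this toy input the two versions differ only for `"asthma, "`, where the trailing comma is followed by a space and so is not matched by `,$`.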

7.3 Final summary - number of unique study-term pairs

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)
# A tibble: 6 × 2
  l1_all_disease_terms      n_studies
  <chr>                         <int>
1 type 2 diabetes mellitus        145
2 asthma                          134
3 breast cancer                   125
4 alzheimers disease              124
5 ischemic heart disease          109
6 major depressive disorder       108
dim(n_studies_trait)
[1] 2428    2

7.3.1 When studies with multiple terms are separated

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


test <- data.frame(trait = unique(diseases))

length(unique(diseases))
[1] 1665
# make frequency table
freq <- table(as.factor(diseases))

# sort in decreasing order
freq_sorted <- sort(freq, decreasing = TRUE)

# show top N, e.g. top 10
head(freq_sorted, 10)

           kidney disease              hypertension  type 2 diabetes mellitus 
                    10915                      7091                       922 
   ischemic heart disease           benign neoplasm major depressive disorder 
                      570                       559                       471 
       alzheimers disease             breast cancer             schizophrenia 
                      422                       379                       368 
                   asthma 
                      348 
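
The frequency table above counts rows of `gwas_study_info` per individual term, without the `DISEASE_STUDY` filter or the `distinct()` on `PUBMED_ID` used in 7.3, so its counts are not directly comparable with `n_studies`. A minimal sketch of counting distinct PubMed IDs per individual term follows, on invented toy data; the object names `long` and `n_studies_per_term` are hypothetical.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  PUBMED_ID = c(1, 1, 2, 3),
  l1_all_disease_terms = c("asthma, kidney disease",
                           "kidney disease",
                           "asthma",
                           "")
)

# Split each row's terms and pair every term with that row's PUBMED_ID;
# rows with an empty term string contribute no pairs
split_terms <- strsplit(toy$l1_all_disease_terms, ", ", fixed = TRUE)
long <- tibble(
  PUBMED_ID = rep(toy$PUBMED_ID, lengths(split_terms)),
  term = unlist(split_terms)
)

# Count distinct publications per term rather than rows per term
n_studies_per_term <- long |>
  distinct(term, PUBMED_ID) |>
  count(term, name = "n_studies") |>
  arrange(desc(n_studies), term)

n_studies_per_term
```

In the toy data "kidney disease" appears in two rows but only one publication, so a row-based table would report 2 where this count reports 1.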

7.3.2 Save the updated gwas_study_info with harmonized disease terms

fwrite(gwas_study_info,
        here::here("output/gwas_cat/gwas_study_info_group_l1_v2.csv")
         )

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] jsonlite_2.0.0    httr_1.4.7        stringr_1.5.1     data.table_1.17.8
[5] dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] compiler_4.3.1    renv_1.0.3        promises_1.3.3    tidyselect_1.2.1 
 [5] Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6       later_1.4.2      
 [9] jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0     here_1.0.1       
[13] R6_2.6.1          generics_0.1.4    curl_6.4.0        knitr_1.50       
[17] tibble_3.3.0      rprojroot_2.1.0   bslib_0.9.0       pillar_1.11.0    
[21] rlang_1.1.6       utf8_1.2.6        cachem_1.1.0      stringi_1.8.7    
[25] httpuv_1.6.16     xfun_0.52         getPass_0.2-4     fs_1.6.6         
[29] sass_0.4.10       cli_3.6.5         withr_3.0.2       magrittr_2.0.3   
[33] ps_1.9.1          digest_0.6.37     processx_3.8.6    rstudioapi_0.17.1
[37] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.4    glue_1.8.0       
[41] whisker_0.4.1     rmarkdown_2.29    tools_4.3.1       pkgconfig_2.0.3  
[45] htmltools_0.5.8.1