Last updated: 2025-12-29

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
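
As a small illustration of what the seed buys you, the sketch below draws the same sample twice after resetting the seed. It uses only base R; the object names are made up for this example.

```r
# Resetting the seed before each draw makes the "random" sample repeatable.
set.seed(20220216)
first_draw <- sample(1:100, 5)

set.seed(20220216)
second_draw <- sample(1:100, 5)

# Both draws contain the same values in the same order.
identical(first_draw, second_draw)
```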

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 66f7b4e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    human_dictionary/
    Ignored:    igsr_populations.tsv
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/
    Ignored:    visualization.Rdata

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/level_1_disease_group_cancer.Rmd) and HTML (docs/level_1_disease_group_cancer.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 66f7b4e IJbeasley 2025-12-29 Archiving old GWAS trait conversion
html 8072a8e IJbeasley 2025-09-24 Build site.
Rmd bcd91ff IJbeasley 2025-09-24 More fixing diseases
html 137d326 IJbeasley 2025-09-23 Build site.
Rmd 03f1751 IJbeasley 2025-09-23 More icd codes
html 0fd4287 IJbeasley 2025-09-23 Build site.
html 3b88f25 IJbeasley 2025-09-22 Build site.
Rmd df0281b IJbeasley 2025-09-22 More typo + structure fixing …
html 02263dd IJbeasley 2025-09-22 Build site.
Rmd 72709e9 IJbeasley 2025-09-22 …maybe fixing typos
html f7ea257 IJbeasley 2025-09-22 Build site.
html a13c272 IJbeasley 2025-09-17 Build site.
Rmd 003c226 IJbeasley 2025-09-17 More fixing up of disease grouping
html 3d701f2 IJbeasley 2025-09-17 Build site.
html 4f70a33 IJbeasley 2025-09-17 Build site.
Rmd 41b1b7c IJbeasley 2025-09-17 Better grouping of cardiovascular disease
html fa95c62 IJbeasley 2025-09-17 Build site.
Rmd 57e46da IJbeasley 2025-09-17 More typo fixing
html cfd2ef8 IJbeasley 2025-09-17 Build site.
Rmd 7df4726 IJbeasley 2025-09-17 Dealing with non-specific cancer labels
html 2aa6027 IJbeasley 2025-09-17 Build site.
Rmd b6f20c4 IJbeasley 2025-09-17 Adding more benign neoplasm
html 83152bd IJbeasley 2025-09-16 Build site.
html d7db734 IJbeasley 2025-09-16 Build site.
Rmd 53bf24e IJbeasley 2025-09-16 More cancer typos
html b0f0ff5 IJbeasley 2025-09-16 Build site.
Rmd e8fb82c IJbeasley 2025-09-16 Correcting some cancer grouping
html de1a740 IJbeasley 2025-09-16 Build site.
Rmd 69d6255 IJbeasley 2025-09-16 Improving cancer grouping
html da4e2cc IJbeasley 2025-09-16 Build site.
Rmd 0196914 IJbeasley 2025-09-16 More disease grouping
html 937b460 IJbeasley 2025-09-16 Build site.
Rmd 3ac50bd IJbeasley 2025-09-16 Even more disease term grouping
html 7c6dee8 IJbeasley 2025-09-15 Build site.
Rmd 4451421 IJbeasley 2025-09-15 Grouping more neoplasms
html 2e145f8 IJbeasley 2025-09-15 Build site.
Rmd 2702dc1 IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 7fe9a06 IJbeasley 2025-09-15 Build site.
Rmd 81a1d22 IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 1f89b20 IJbeasley 2025-09-15 Build site.
Rmd fdd60ed IJbeasley 2025-09-15 More disease term grouping
html bf45a69 IJbeasley 2025-09-15 Build site.
Rmd 1414cad IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/level_1_disease_group_cancer.Rmd")
html 3c8309c IJbeasley 2025-09-15 Build site.
Rmd 17a16b0 IJbeasley 2025-09-15 Further grouping of disease terms
html 778ac1e IJbeasley 2025-09-15 Build site.
Rmd bb5431c IJbeasley 2025-09-15 Dealing with duplicate disease terms
html 9f69979 IJbeasley 2025-09-10 Build site.
html 9ca183a IJbeasley 2025-09-10 Build site.
Rmd 50ef69d IJbeasley 2025-09-10 Update cancer grouping

library(dplyr)
library(data.table)
library(stringr)

0.1 Ontology help - for getting disease subtypes

source(here::here("code/get_term_descendants.R"))
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_group_l1.csv"))
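
The helpers used throughout this page (get_descendants(), vec_to_grep_pattern(), str_length_sort()) come from code/get_term_descendants.R, which is not shown here. Purely as an illustration of the kind of relabelling they support, the sketch below builds one alternation pattern that only matches whole comma-separated terms. The names sketch_terms_to_pattern and sketch_relabel and the toy data are hypothetical, and the real helper may work differently.

```r
# Hypothetical sketch: build a pattern that matches any of `terms`, but only
# when the term fills a whole comma-separated slot in the cell.
sketch_terms_to_pattern <- function(terms) {
  paste0("(^|, )(", paste(unique(terms), collapse = "|"), ")(?=,|$)")
}

# Replace every matching term with `label`, keeping the captured separator.
sketch_relabel <- function(x, terms, label) {
  gsub(sketch_terms_to_pattern(terms), paste0("\\1", label), x, perl = TRUE)
}

toy_terms <- c("osteosarcoma, asthma", "ewing sarcoma", "asthma, osteosarcoma")
relabelled <- sketch_relabel(toy_terms,
                             c("osteosarcoma", "ewing sarcoma"),
                             "bone cancer")
relabelled
```

Terms are pasted into the pattern as-is, so any regex metacharacters in a term would need escaping by the caller in this sketch.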

1 Initial summary - number of unique study terms

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)

dim(n_studies_trait)

1.1 Separating studies with multiple terms

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


length(unique(diseases))


test <- data.frame(trait = unique(diseases))
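
To make the splitting step concrete, here is the same idea on a made-up vector, written in base R so it runs without extra packages. The toy values are assumptions for illustration, not rows from the GWAS Catalog.

```r
# Toy cells: comma-separated terms, one empty cell, one stray double space.
toy_terms <- c("breast cancer, asthma", "", "asthma", "lung cancer,  breast cancer")

# Drop empty cells, split on ", ", flatten, and trim leftover whitespace.
non_empty <- toy_terms[toy_terms != ""]
toy_diseases <- trimws(unlist(strsplit(non_empty, ", ", fixed = TRUE)))

# Without trimming, " breast cancer" would count as a separate term.
n_unique_terms <- length(unique(toy_diseases))
n_unique_terms
```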

2 Disease subtype grouping (cancer)

2.0.1 Astrocytoma

gwas_study_info |>
    filter(grepl("astrocytoma", l1_all_disease_terms)) |> 
    pull(STUDY) |> 
    unique()

# all matches come from a single cancer study, so relabel as central nervous system cancer

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  =
          ifelse(PUBMED_ID == "36810956",
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("astrocytoma"),
                                   "central nervous system cancer"
                          ),
          l1_all_disease_terms
        )
 )

2.0.2 Bone cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002129/descendants"

bone_cancer_terms <- get_descendants(url)

bone_cancer_terms = stringr::str_replace_all(bone_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

bone_cancer_terms = c("malignant bone neoplasm",
                      bone_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern(bone_cancer_terms),
                                   "bone cancer"
                          )  
        )
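
get_descendants() is defined in code/get_term_descendants.R and is not shown on this page. As a rough sketch only, and assuming the OLS descendants endpoint returns a parsed JSON page with terms listed under `_embedded` and `terms`, each carrying a `label` field (an assumption about the response shape that should be checked against the OLS API documentation), pulling lower-case labels out of one page could look like this. sketch_extract_labels and the toy page are hypothetical.

```r
# Hypothetical sketch: extract lower-case term labels from one parsed page.
# Assumes the page is a nested list shaped like list(`_embedded` = list(terms = ...)).
sketch_extract_labels <- function(parsed_page) {
  terms <- parsed_page[["_embedded"]][["terms"]]
  if (is.null(terms)) {
    return(character(0))
  }
  tolower(vapply(terms, function(term) term[["label"]], character(1)))
}

# Toy stand-in for a parsed response; no network call is made here.
toy_page <- list(`_embedded` = list(terms = list(
  list(label = "Osteosarcoma"),
  list(label = "Ewing sarcoma")
)))
toy_labels <- sketch_extract_labels(toy_page)
toy_labels
```

Lower-casing is included because the disease terms on this page are lower case; whether the real helper does this is not known from this page.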

2.0.3 Bladder cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4007/descendants"

# maybe do urinary bladder cancer instead

bladder_cancer_terms <- get_descendants(url)

bladder_cancer_terms = stringr::str_replace_all(bladder_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(bladder_cancer_terms),
                                   "bladder cancer"
                          )  
        )

2.0.4 Breast cancer

breast_cancer_terms <- grep("breast cancer", unique(diseases), value = T)

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(breast_cancer_terms),
                                   "breast cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("breast cancer in situ"),
                                   "breast cancer"
                          )  
        ) |>
     mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern("invasive lobular cancer"),
                                   "breast cancer"
                          )  
        )

2.0.5 Benign neoplasms

2.0.5.1 Benign neoplasm of blood vessel

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/snomed/terms/http%253A%252F%252Fsnomed.info%252Fid%252F92017000/descendants"

benign_blood_vessel_terms <- get_descendants(url)

benign_blood_vessel_terms <- stringr::str_replace_all(benign_blood_vessel_terms,
                            "\\bcarcinoma",
                            "cancer")


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms = 
        stringr::str_replace_all(l1_all_disease_terms, 
                    vec_to_grep_pattern(benign_blood_vessel_terms),
                    "benign neoplasm")
         )

2.0.6 Labelled benign neoplasms

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "(?<=^|, )benign neoplasm of (.*?)(?=,|$)|(?<=^|, )benign neoplasm of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "(?<=^|, )benign (.*?) neoplasm(?=,|$)|(?<=^|, )benign (.*?) (.*?) neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    # [^,]+? keeps each match inside one comma-separated term
                    "(?<=^|, )[^,]+? benign neoplasm(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    "(?<=^|, )polyp of (.*?)(?=,|$)|(?<=^|, )polyp of (.*?) (.*?)(?=,|$)", 
                    "benign neoplasm")
         ) 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    # no stray spaces around the term; [^,]+? stays within one term
                    "(?<=^|, )[^,]+? polyp(?=,|$)", 
                    "benign neoplasm")
         ) 
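
One thing to keep in mind with patterns of this kind: `.` also matches commas, so a lazy `.*?` that starts at the beginning of a cell can run across term separators and swallow unrelated terms. The toy comparison below uses base R with a captured separator instead of a lookbehind; the toy cell is made up for illustration.

```r
toy_cell <- "asthma, skin benign neoplasm"

# Lazy dot: the match starts at the beginning of the cell and crosses the comma.
lazy_dot <- gsub("(^|, ).*? benign neoplasm(?=,|$)",
                 "\\1benign neoplasm", toy_cell, perl = TRUE)

# Comma-excluding class: the match cannot leave the current term.
no_comma <- gsub("(^|, )[^,]+? benign neoplasm(?=,|$)",
                 "\\1benign neoplasm", toy_cell, perl = TRUE)

lazy_dot  # "benign neoplasm" (the asthma term is lost)
no_comma  # "asthma, benign neoplasm"
```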

3 Other benign neoplasms

# https://my.clevelandclinic.org/health/diseases/21477-adenomas - benign
other_benign_neoplasms = c("adenomatous colon polyp",
                           "colorectal adenoma",
                           "pituitary gland adenoma",
                           "aldosterone-producing adenoma",
                           "metachronous colorectal adenoma",
                           "female genital tract polyp",
                           "\\bpolyp\\b",
                           "uterine leiomyoma",
                           "hepatic hemangioma",
                           "lobular capilliary hemangioma", 
                           "hemangioma of subcutaneous tissue",
                           "benign prostatic hyperplasia",
                           "melanocytic nevus",
                           "hemangioma",
                           "lymphangioma",
                           "vestibular schwannoma",
                           "schwannoma",
                           "skin lipoma", # likely benign ... 
                           "lipoma",
                           "hamartoma",
                           "seborrheic keratosis",
                           "actinic keratosis",
                           "keratosis",
                           "meningioma", # most are benign (80%)
                           "common wart",
                           "plantar wart",
                           "penile Fibromatosis"
                           )

other_benign_neoplasms = str_length_sort(other_benign_neoplasms)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
        str_replace_all(l1_all_disease_terms, 
                    vec_to_grep_pattern(other_benign_neoplasms),
                    "benign neoplasm")
         )
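
str_length_sort() is not shown on this page; it presumably orders terms by length. The sketch below shows why ordering longer terms first can matter when terms are joined into a plain alternation: with backtracking regex engines, the first alternative that matches wins, so a short term that is a prefix of a longer one can leave a fragment behind. sketch_sort_by_length and the toy values are hypothetical.

```r
# Hypothetical sketch: order terms from longest to shortest.
sketch_sort_by_length <- function(terms) {
  terms[order(nchar(terms), decreasing = TRUE)]
}

toy_terms <- c("hemangioma", "hemangioma of subcutaneous tissue")
toy_cell <- "hemangioma of subcutaneous tissue"

unsorted_pattern <- paste(toy_terms, collapse = "|")
sorted_pattern <- paste(sketch_sort_by_length(toy_terms), collapse = "|")

# Short term listed first: only the prefix is replaced.
unsorted_result <- gsub(unsorted_pattern, "benign neoplasm", toy_cell, perl = TRUE)

# Long term listed first: the whole term is replaced.
sorted_result <- gsub(sorted_pattern, "benign neoplasm", toy_cell, perl = TRUE)

unsorted_result  # "benign neoplasm of subcutaneous tissue"
sorted_result    # "benign neoplasm"
```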

3.0.1 Central nervous system cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000326/descendants"

cns_cancer_terms <- get_descendants(url)

cns_cancer_terms = stringr::str_replace_all(cns_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
pattern = vec_to_grep_pattern(cns_cancer_terms)

gwas_study_info = gwas_study_info |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = pattern,
                                   "central nervous system cancer"
                          )  
        )

3.0.2 Cervical cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_4362/descendants"

cervical_cancer_terms <- get_descendants(url)

cervical_cancer_terms = stringr::str_replace_all(cervical_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

cervical_cancer_terms = c("cervical intraepithelial neoplasia grade 2/3",
                          "uterine cervical cancer in situ",
                          cervical_cancer_terms)

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern("uterine cervical cancer in situ"),
                                   "cervical cancer"
                          )  
        )


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern(cervical_cancer_terms),
                                   "cervical cancer"
                          )  
        )

3.0.3 Colorectal cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0005575/descendants"

colorectal_cancer_terms <- get_descendants(url)

colorectal_cancer_terms = stringr::str_replace_all(colorectal_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

colorectal_cancer_terms = c("metastatic colorectal cancer",
                            "rectum cancer",
                            colorectal_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern(colorectal_cancer_terms),
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern("colorectal mucinous adenocarcinoma"),
                                   "colorectal cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern("metachronous colorectal adenoma"),
                                   "colorectal cancer"
                          )  
        )

3.0.4 Endometrial cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0011962/descendants"

endometrial_cancer_terms <- get_descendants(url)

# also: http://www.ebi.ac.uk/efo/EFO_1001514: endometrial endometrioid carcinoma
endometrial_cancer_terms = c("endometrial endometrioid carcinoma",
                             endometrial_cancer_terms)

endometrial_cancer_terms = stringr::str_replace_all(endometrial_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(endometrial_cancer_terms),
                                   "endometrial cancer"
                          )  
        )

3.0.5 Esophageal cancer

esophageal_cancer_terms <- c("esophageal adenocarcinoma",
                             "esophageal squamous cell cancer")


gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(esophageal_cancer_terms),
                                   "esophageal cancer"
                          )  
        )

3.0.6 Eye cancer (to add)

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002236/descendants"

ocular_cancer_terms <- get_descendants(url)

ocular_cancer_terms = stringr::str_replace_all(ocular_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(ocular_cancer_terms),
                                   "ocular cancer"
                          )  
        )

3.0.7 Gallbladder and biliary tract cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                         vec_to_grep_pattern("cancer of gallbladder and extrahepatic biliary tract"),
                         "gallbladder and biliary tract cancer"
         )
         )

3.0.8 Hodgkin lymphoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          vec_to_grep_pattern("nodular sclerosis hodgkin lymphoma"),
                          "hodgkins lymphoma"
         ))

3.0.9 Head and neck cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms ,
                          vec_to_grep_pattern("head and neck squamous cell cancer"),
                          "head and neck cancer, squamous cell cancer"
         ))

3.0.10 Intestinal cancer (non-colorectal)

intestinal_cancer_terms <- c("small intestine cancer",
                           "small bowel cancer")

3.0.11 Kidney cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_263/descendants"


kidney_cancer_terms <- get_descendants(url)

kidney_cancer_terms = c("renal cell carcinoma",
                       "clear cell renal carcinoma",
                       "clear cell renal cell carcinoma",
                       kidney_cancer_terms)

kidney_cancer_terms = stringr::str_replace_all(kidney_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

kidney_cancer_terms =  stringr::str_replace_all(kidney_cancer_terms,
                          vec_to_grep_pattern("renal cancer"),
                          "kidney cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(kidney_cancer_terms),
                                   "kidney cancer"
                          )  
        )

3.0.12 Laryngeal cancer

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = 
                                    vec_to_grep_pattern(
                                      c("laryngeal squamous cell cancer",
                                      "laryngeal cancer")
                                      ),
                                   "larynx cancer"
                          )  
        )

3.0.13 Leukemia

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0000565/descendants"

leukemia_terms <- get_descendants(url)

leukemia_terms <- c("b-cell acute lymphoblastic leukemia with t\\(1;19\\)\\(q23;p13.3\\); e2a-pbx1 \\(tcf3-pbx1\\)",
                    leukemia_terms)

leukemia_terms = stringr::str_replace_all(leukemia_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(leukemia_terms),
                                   "leukemia"
                          )  
        )

3.0.14 Lip and oral cavity cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005570/descendants"

lip_oral_cavity_cancer_terms <- get_descendants(url)

lip_oral_cavity_cancer_terms = stringr::str_replace_all(lip_oral_cavity_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(lip_oral_cavity_cancer_terms),
                                   "lip and oral cavity cancer"
                          )  
        )

3.0.15 Liver cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0002691/descendants"

liver_cancer_terms <- get_descendants(url)

liver_cancer_terms = stringr::str_replace_all(liver_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")


liver_cancer_terms = c("hepatitis virus-related liver cancer",
                       liver_cancer_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(liver_cancer_terms),
                                   "liver cancer"
                          )  
        )

gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("hepatitis virus-related liver cancer"),
                                   "liver cancer"
                          )  
        )

3.0.16 Lung cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008903/descendants"

lung_cancer_terms <- get_descendants(url)

lung_cancer_terms = stringr::str_replace_all(lung_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(lung_cancer_terms),
                                   "lung cancer"
                          )  
        )

3.0.17 Ovarian cancer

url <-  "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0008170/descendants"

ovarian_cancer_terms <- get_descendants(url)

ovarian_cancer_terms = stringr::str_replace_all(ovarian_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

ovarian_cancer_terms = c("high grade serous ovarian cancer",
                         "high grade ovarian serous adenocarcinoma",
                         "high grade ovarian cancer",
                         "high grade ovarian cancers",
                         "ovarian endometrioid cancer", # http://www.ebi.ac.uk/efo/EFO_1001515 - ovarian endometrioid carcinoma
                         "ovarian serous cancer", # http://www.ebi.ac.uk/efo/EFO_1001516 - ovarian serous carcinoma
                         ovarian_cancer_terms
                         )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern(ovarian_cancer_terms),
                                   "ovarian cancer"
                          )  
        )
        
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern("high grade ovarian cancer"),
                                   "ovarian cancer"
                          )  
        )

3.0.18 Pancreatic cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0009831/descendants"

pancreatic_cancer_terms <- get_descendants(url)

pancreatic_cancer_terms = stringr::str_replace_all(pancreatic_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(pancreatic_cancer_terms),
                                   "pancreatic cancer"
                          )  
        )

3.1 Peripheral nervous system cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mondo/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMONDO_0021089/descendants"

peripheral_nervous_system_cancer_terms <- get_descendants(url)

peripheral_nervous_system_cancer_terms = stringr::str_replace_all(peripheral_nervous_system_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(peripheral_nervous_system_cancer_terms),
                                   "peripheral nervous system cancer"
                          )  
        )

3.1.1 Prostate cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10283/descendants"

prostate_cancer_terms <- get_descendants(url)

prostate_cancer_terms = stringr::str_replace_all(prostate_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

prostate_cancer_terms = c("grade iii prostatic intraepithelial neoplasia",
                          "metastatic prostate cancer",
                          prostate_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms,
                                  pattern = vec_to_grep_pattern(prostate_cancer_terms),
                                   "prostate cancer"
                          )  
        )

3.1.2 Mesothelioma

mesothelioma_terms = c("pleural mesothelioma",
                       "malignant pleural mesothelioma")

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(mesothelioma_terms),
                                   "mesothelioma"
                          )  
        )

3.1.3 Neuroendocrine tumor

neuroendo_terms <- c("pulmonary neuroendocrine tumor",
                     "small intestine neuroendocrine tumor",
                     "pancreatic neuroendocrine tumor",
                     "carcinoid tumor" #http://www.ebi.ac.uk/efo/EFO_0004243
)

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                    pattern = vec_to_grep_pattern(neuroendo_terms),
                                   "neuroendocrine tumor"
                          )  
        )

3.1.4 Non-Hodgkins lymphoma

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0005952/descendants"

nhl_terms <- get_descendants(url)

nhl_terms = stringr::str_replace_all(nhl_terms,
                            "\\bcarcinoma",
                            "cancer")



nhl_terms = c("central nervous system non-hodgkin lymphoma",
              "lymphoblastic lymphoma",
              "extranodal nasal nk/t cell lymphoma", # https://www.ebi.ac.uk/ols4/ontologies/ordo/classes/http%253A%252F%252Fwww.orpha.net%252FORDO%252FOrphanet_86879
              "follicular lymphoma", # http://purl.obolibrary.org/obo/DOID_0050873
              "marginal zone b-cell lymphoma",
              "diffuse large b-cell lymphoma",
              nhl_terms)

# reticulum cell sarcoma is also likely to be NHL
# see: http://www.ebi.ac.uk/efo/EFO_0005287
# https://pubmed.ncbi.nlm.nih.gov/6328875/
nhl_terms = c("reticulum cell sarcoma",
              nhl_terms)
gwas_study_info = 
gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(nhl_terms),
                                   "non-hodgkins lymphoma"
                          )  
        )

3.1.5 Non-melanoma skin cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0009260/descendants"

non_melanoma_skin_cancer_terms <- get_descendants(url)

non_melanoma_skin_cancer_terms = stringr::str_replace_all(non_melanoma_skin_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(non_melanoma_skin_cancer_terms),
                                   "non-melanoma skin cancer"
                          )  
        )

3.1.6 Other pharynx cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_0060119/descendants"

other_pharynx_cancer_terms <- get_descendants(url)

other_pharynx_cancer_terms = stringr::str_replace_all(other_pharynx_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

other_pharynx_cancer_terms = c("hypopharyngeal cancer",
                               other_pharynx_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(other_pharynx_cancer_terms),
                                   "other pharynx cancer"
                          )  
        )

3.1.7 Soft tissue sarcoma

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_1001968/descendants"

soft_tissue_sarcoma_terms <- get_descendants(url)

soft_tissue_sarcoma_terms = stringr::str_replace_all(soft_tissue_sarcoma_terms,
                            "\\bcarcinoma",
                            "cancer")

soft_tissue_sarcoma_terms = c("kaposis sarcoma",
                              "iatrogenic kaposis sarcoma",
                              soft_tissue_sarcoma_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(soft_tissue_sarcoma_terms),
                                   "soft tissue sarcoma"
                          )  
        )

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("sarcoma, soft tissue sarcoma"),
                                   "soft tissue sarcoma"
                          )
 )

3.1.7.1 Ewing sarcoma

# Ewing sarcoma can be either a bone or a soft tissue sarcoma;
# it is hard to tell which from these studies:
gwas_study_info |> 
  filter(grepl("ewing", l1_all_disease_terms))  |> 
  select(PUBMED_ID, `DISEASE/TRAIT`, COHORT, STUDY)

3.1.8 Stomach cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_10534/descendants"

stomach_cancer_terms <- get_descendants(url)

stomach_cancer_terms = stringr::str_replace_all(stomach_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

stomach_cancer_terms = c(
                          "diffuse stomach cancer",
                          "gastric cancer",
                          "gastric intestinal type adenocarcinoma",
                          "gastric cardia cancer",
                          "cardia cancer",
                          stomach_cancer_terms
                          )
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(stomach_cancer_terms),
                                   "stomach cancer"
                          )  
        ) |>
   mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("diffuse stomach cancer"),
                                   "stomach cancer"
                          )  
        )

3.1.9 Squamous cell carcinoma

gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern("cutaneous squamous cell cancer"),
                                   "non-melanoma skin cancer"
                          )  
        )

3.1.10 Testicular cancer

3.1.11 Thyroid cancer

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/doid/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FDOID_1781/descendants"

thyroid_cancer_terms <- get_descendants(url)

thyroid_cancer_terms = stringr::str_replace_all(thyroid_cancer_terms,
                            "\\bcarcinoma",
                            "cancer")

thyroid_cancer_terms = c("differentiated thyroid cancer",
                         thyroid_cancer_terms)
gwas_study_info = gwas_study_info |> 
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(thyroid_cancer_terms),
                                   "thyroid cancer"
                          )  
        )

3.1.12 Uterine cancer

uterine_cancer_terms <- c("uterine corpus cancer",
                          "uterine adnexa cancer")

gwas_study_info = gwas_study_info |>
 mutate(l1_all_disease_terms  = 
          stringr::str_replace_all(l1_all_disease_terms ,
                                  pattern = vec_to_grep_pattern(uterine_cancer_terms),
                                   "uterine cancer"
                          )  
        )

4 Reducing non-specific cancer terms

4.0.1 Brain neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "brain neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique() 

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "brain neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "central nervous system cancer",
                 l1_all_disease_terms
         )
         )

# this still leaves one study (trait: Brain Tumor)

gwas_study_info |> 
  filter(l1_all_disease_terms == "brain neoplasm") |> 
  pull(PUBMED_ID) |>
  unique() 

# from the paper's supplementary tables, the ICD-10 code for the brain tumor term is C71 (malignant neoplasm of brain)
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "brain neoplasm" & 
                 PUBMED_ID == 34594039,
                 "central nervous system cancer",
                 l1_all_disease_terms
         )
         )

4.0.2 Breast neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "breast neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "breast neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "breast cancer",
                 l1_all_disease_terms
         )
         )
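
The relabelling step in this section follows one recurring shape: rows whose `l1_all_disease_terms` equals a non-specific neoplasm label are reassigned when `DISEASE/TRAIT` mentions malignancy. A minimal sketch of how that shape could be factored into a helper is given below. `relabel_neoplasm()` and the toy data frame are hypothetical and not part of the analysis, and base R is used only to keep the sketch self-contained.

```r
# Hypothetical helper, not part of the original analysis.
relabel_neoplasm <- function(df, from, to,
                             trait_pattern = "malignant|cancer|carcinoma") {
  # Rows that carry the non-specific label AND whose reported trait
  # matches the malignancy pattern (case-insensitive)
  hit <- df$l1_all_disease_terms == from &
    grepl(trait_pattern, df[["DISEASE/TRAIT"]], ignore.case = TRUE)
  # which() drops NA, so rows with a missing label are left untouched
  df$l1_all_disease_terms[which(hit)] <- to
  df
}

# Toy data for illustration only
toy <- data.frame(
  l1_all_disease_terms = c("breast neoplasm", "breast neoplasm", "asthma"),
  `DISEASE/TRAIT` = c("Malignant neoplasm of breast", "Breast lump", "Asthma"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy <- relabel_neoplasm(toy, from = "breast neoplasm", to = "breast cancer")
```

Only the first toy row is relabelled; the second keeps "breast neoplasm" because its trait does not mention malignancy, which mirrors how the chunks in this section leave ambiguous studies for manual review.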

4.0.3 Bone neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "bone neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "bone neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "bone cancer",
                 l1_all_disease_terms
         )
         )


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "bone neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

4.0.4 Cecal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "cecal neoplasm") |> 
  pull(`DISEASE/TRAIT`)


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "cecal neoplasm" & 
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "colorectal cancer",
                 l1_all_disease_terms
         )
         )

4.0.5 Colonic neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "colonic neoplasm") |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("malignant|cancer", `DISEASE/TRAIT`, ignore.case = T) &
                l1_all_disease_terms == "colonic neoplasm",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("colonic neoplasm"),
                          "colorectal cancer"),
                l1_all_disease_terms
         )
  )

# a specific case remains: a study of rectal cancer vs colon cancer

gwas_study_info |> 
  filter(grepl("colonic neoplasm", l1_all_disease_terms)) |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90179122",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("colonic neoplasm"),
                          "colorectal cancer"),
                l1_all_disease_terms
         )
  )

4.0.6 Endometrial neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "endometrial neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "endometrial neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "endometrial cancer",
                 l1_all_disease_terms
         )
         )

4.0.7 Esophageal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neoplasm of esophagus") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neoplasm of esophagus" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "esophageal cancer",
                 l1_all_disease_terms
         )
         )

4.0.8 Eye neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "eye neoplasm") |> 
  pull(`DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "eye neoplasm" & 
                 grepl("cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "ocular cancer",
                 l1_all_disease_terms
         )
         )

4.0.9 Gallbladder neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "gallbladder neoplasm") |> 
  pull(`DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "gallbladder neoplasm" & 
                 grepl("cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "gallbladder and biliary tract cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "gallbladder neoplasm" & 
                `DISEASE/TRAIT` == "Gallbladder adenomyomatosis",
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

# a specific case remains: a study of sclerosing cholangitis & gallbladder cancer
gwas_study_info |>
  filter(grepl("gallbladder neoplasm", l1_all_disease_terms)) |>
  select(STUDY_ACCESSION, `DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST005857",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("gallbladder neoplasm"),
                          "gallbladder and biliary tract cancer"),
                l1_all_disease_terms
         ))

4.0.10 Glioma (can be benign or malignant)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          "central nervous system cancer, glioma",
                          "central nervous system cancer"
         ))


gwas_study_info |> 
  filter(l1_all_disease_terms == "glioma") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`) |> 
  distinct()

# where the mapped trait measures survival alongside glioma, assume it is cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("survival", MAPPED_TRAIT, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

# where the trait reports a grade, assume it is cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("grade", `DISEASE/TRAIT`, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

# Adult diffuse glioma - assume malignant
# https://pmc.ncbi.nlm.nih.gov/articles/PMC9245936/
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("adult diffuse", `DISEASE/TRAIT`, ignore.case = T),
         stringr::str_replace_all(l1_all_disease_terms,
                          "glioma",
                          "central nervous system cancer"),
         l1_all_disease_terms
         )
  )

gwas_study_info |>
  filter(`DISEASE/TRAIT` == "Glioma (pediatric/youth onset)") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()

# from paper; seems malignant
# https://pubmed.ncbi.nlm.nih.gov/31040135/

gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(STUDY_ACCESSION == "GCST008912",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms
         ))


gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(`DISEASE/TRAIT` == "Glioblastoma" & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms
         ))


# for pubmed id: 22886559
# majority (~90%) are graded glioma, glioblastoma and oligodendroglioma
# so assume malignant
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 22886559 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )


# for pubmed id: 29743610
# majority (~60%) are glioblastoma
# so assume malignant

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 29743610 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# for pubmed id: 36810956
# seems to primarily include high grade glioma
# so assume malignant
gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 36810956 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# pubmed id: 30714141 
# considers glioma cancer
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 30714141 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )

# pubmed id: 34319593
# considers glioma a malignant tumor
gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms  =
         ifelse(PUBMED_ID == 34319593 & l1_all_disease_terms == "glioma",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("glioma"),
                          "central nervous system cancer"),
                l1_all_disease_terms)
  )
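
The per-publication overrides above all apply the same rule, so they could also be expressed as a single lookup. The sketch below is a hedged alternative, not the author's code: the PubMed IDs are the ones named in the comments above, while `malignant_glioma_pmids`, `relabel_glioma_by_pmid()` and the toy data frame (including the placeholder ID 1) are introduced purely for illustration.

```r
# PubMed IDs whose glioma studies are assumed malignant (from the comments above)
malignant_glioma_pmids <- c(22886559, 29743610, 36810956, 30714141, 34319593)

# Hypothetical helper: relabel rows that are exactly "glioma" and belong
# to one of the listed publications.
relabel_glioma_by_pmid <- function(df, pmids) {
  hit <- df$PUBMED_ID %in% pmids & df$l1_all_disease_terms == "glioma"
  # which() drops NA so rows with a missing label are left untouched
  df$l1_all_disease_terms[which(hit)] <- "central nervous system cancer"
  df
}

# Toy data for illustration only; PUBMED_ID 1 is a placeholder
toy <- data.frame(
  PUBMED_ID = c(22886559, 22886559, 1),
  l1_all_disease_terms = c("glioma", "asthma", "glioma"),
  stringsAsFactors = FALSE
)

toy <- relabel_glioma_by_pmid(toy, malignant_glioma_pmids)
```

Keeping the IDs in one vector makes the list of manual decisions easier to audit, at the cost of separating each ID from the comment that justifies it.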

4.0.11 Glottis neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "glottis neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "glottis neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "larynx cancer",
                 l1_all_disease_terms
         )
         )

4.0.12 Kidney neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "kidney neoplasm") |> 
  pull(`DISEASE/TRAIT`)


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "kidney neoplasm" & 
                 grepl("malignant|cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "kidney cancer",
                 l1_all_disease_terms
         )
         )

4.0.13 Laryngeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "laryngeal neoplasm") |> 
  select(MAPPED_TRAIT, MAPPED_TRAIT_URI, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()

# one study (GCST90041889); relabel it as larynx cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90041889",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("laryngeal neoplasm"),
                          "larynx cancer"),
                l1_all_disease_terms
         ))

4.0.14 Liver neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "liver neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "liver neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "liver cancer",
                 l1_all_disease_terms
         )
         )

4.0.15 Lung neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "lung neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "lung neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lung cancer",
                 l1_all_disease_terms
         )
         )

4.0.16 Lymphoid neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "lymphoid neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "lymphoid neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "malignant lymphoid tumor",
                 l1_all_disease_terms
         )
         )

4.0.17 Meningeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "meningeal neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "meningeal neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

4.0.18 Mature b-cell neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neoplasm of mature b-cells") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neoplasm of mature b-cells" & 
                 `DISEASE/TRAIT` == "Follicular lymphoma",
                 "non-hodgkins lymphoma",
                 l1_all_disease_terms
         )
         )

4.0.19 Mouth neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "mouth neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "mouth neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lip and oral cavity cancer",
                 l1_all_disease_terms
         )
         )

4.0.20 Myeloid neoplasm

gwas_study_info |> 
  filter(grepl("myeloid neoplasm", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "myeloid neoplasm" & 
                 grepl("Myeloid leukemia", `DISEASE/TRAIT`, ignore.case = T),
                 "leukemia",
                 l1_all_disease_terms
         )
         )

4.0.21 Neuroendocrine neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "neuroendocrine neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neuroendocrine neoplasm" & 
                 grepl("PheCode 209", `DISEASE/TRAIT`, ignore.case = T),
                 "neuroendocrine tumor",
                 l1_all_disease_terms
         )
         )

# TODO: double-check that neuroendocrine tumor is malignant
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "neuroendocrine neoplasm" & 
                 grepl("neuroendocrine tumor", `DISEASE/TRAIT`, ignore.case = T),
                 "neuroendocrine tumor",
                 l1_all_disease_terms
         )
         )

4.0.22 Ovarian neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "ovarian neoplasm") |> 
  pull(`DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "ovarian neoplasm" &
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "ovarian cancer",
                 l1_all_disease_terms
         )
         )

4.0.23 Nasopharyngeal neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "nasopharyngeal neoplasm") |> 
  pull(`DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "nasopharyngeal neoplasm" & 
                 grepl("carcinoma|cancer|malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "nasopharyngeal cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info |> 
  filter(grepl("nasopharyngeal neoplasm", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "nasopharyngeal neoplasm" & 
                 grepl("nasopharyngeal carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "nasopharyngeal cancer",
                 l1_all_disease_terms
         )
         )

4.0.24 Pancreatic neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "pancreatic neoplasm") |> 
  pull(`DISEASE/TRAIT`)

# Intraductal papillary mucinous neoplasm of the pancreas is a benign precursor lesion

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = 
         ifelse(l1_all_disease_terms == "pancreatic neoplasm" & 
                STUDY_ACCESSION == "GCST90104145",
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

4.0.25 Sigmoid neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "sigmoid neoplasm") |> 
  pull(`DISEASE/TRAIT`)


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "sigmoid neoplasm" & 
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "colorectal cancer",
                 l1_all_disease_terms
         )
         )

4.0.26 Skin neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "skin neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "skin neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "non-melanoma skin cancer",
                 l1_all_disease_terms
         )
         )

4.0.27 Stomach neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "stomach neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "stomach neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "stomach cancer",
                 l1_all_disease_terms
         )
         )

4.0.28 Testicular neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "testicular neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
        ifelse(l1_all_disease_terms == "testicular neoplasm" &
                 grepl("malignant", `DISEASE/TRAIT`, ignore.case = T),
                 "testicular cancer",
                 l1_all_disease_terms
         )
         )

4.0.29 Tongue neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "tongue neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "tongue neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "lip and oral cavity cancer",
                 l1_all_disease_terms
         )
         )

4.0.30 Uterine neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "uterine neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "uterine neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

4.0.31 Urogenital neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "urogenital neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "urogenital neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "urogenital cancer",
                 l1_all_disease_terms
         )
         )

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "urogenital neoplasm" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )

4.0.31.1 Vulvar neoplasm

gwas_study_info |> 
  filter(l1_all_disease_terms == "vulvar neoplasm") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "vulvar neoplasm" & 
                 grepl("malignant|cancer|carcinoma", `DISEASE/TRAIT`, ignore.case = T),
                 "vulvar cancer",
                 l1_all_disease_terms
         )
         )

4.0.32 Ocular Melanoma

ocular_melanoma_terms <- c("uveal melanoma",
                           "uveal melanoma disease severity",
                           "epithelioid cell uveal melanoma",
                           "choroidal melanoma",
                           "ocular melanoma disease severity"
                           )

ocular_melanoma_terms = str_length_sort(ocular_melanoma_terms)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern(ocular_melanoma_terms),
                          "ocular melanoma"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("ocular melanoma disease severity"),
                          "ocular melanoma"
         ))
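
`str_length_sort()` is defined elsewhere; the sketch below assumes it orders terms from longest to shortest and shows why that ordering matters. With PCRE- or ICU-style alternation the first matching alternative wins, so a short term listed before a longer one that contains it leaves a residue. `sort_longest_first()` is a hypothetical stand-in, and base R `gsub(perl = TRUE)` is used in place of `stringr::str_replace_all()`.

```r
# Hypothetical stand-in for str_length_sort(); the real helper may differ.
sort_longest_first <- function(terms) {
  terms[order(nchar(terms), decreasing = TRUE)]
}

terms <- c("uveal melanoma", "uveal melanoma disease severity")
x <- "uveal melanoma disease severity"

# Alternation in the original order: the shorter term is tried first
unsorted_pattern <- paste(terms, collapse = "|")
# Alternation with the longest term first
sorted_pattern <- paste(sort_longest_first(terms), collapse = "|")

unsorted_result <- gsub(unsorted_pattern, "ocular melanoma", x, perl = TRUE)
sorted_result <- gsub(sorted_pattern, "ocular melanoma", x, perl = TRUE)
```

Here `unsorted_result` keeps a trailing " disease severity" while `sorted_result` is the clean label, which may be the kind of leftover that the second replacement in this subsection is there to catch.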

5 Other …

5.0.1 Benign neoplasm, colorectal cancer

gwas_study_info |>
  filter(l1_all_disease_terms == "benign neoplasm, colorectal cancer") |> 
  select(MAPPED_TRAIT, `DISEASE/TRAIT`, STUDY_ACCESSION) |> 
  distinct()

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(STUDY_ACCESSION == "GCST90093303",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("benign neoplasm, colorectal cancer"),
                          "colorectal cancer"),
                l1_all_disease_terms
         ))

5.0.2 More other pharynx cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("larynx cancer, pharynx cancer"),
                          "larynx cancer, other pharynx cancer"
         ))

5.0.3 Malignant melanoma of skin

5.0.3.1 Cutaneous melanoma to malignant melanoma of skin

gwas_study_info =
  gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cutaneous melanoma"),
                          "malignant melanoma of skin"
         ))

5.0.3.2 Dealing with studies just labelled as “melanoma”

gwas_study_info |>
 filter(l1_all_disease_terms == "melanoma") |>
  pull(`DISEASE/TRAIT`) |>
  unique()

# checked UKB data field 40006 (ICD10 codes) 
# https://biobank.ctsu.ox.ac.uk/ukb/field.cgi?id=40006
# malignant melanoma of skin includes: 
# Malignant melanoma of trunk 
# Malignant melanoma of upper limb, including shoulder
# Malignant melanoma of lower limb, including hip

# checked UKB data field 20001 
# https://biobank.ctsu.ox.ac.uk/ukb/field.cgi?id=20001
# malignant melanoma is a subcategory of skin cancer


malignant_skin_melanoma <- c("ICD10 C43",
                             "survival in skin melanoma",
                             "Skin melanoma specific survival",
                             "malignant melanoma of skin",
                             "malignant melanoma of trunk",
                             "Malignant melanoma \\(UKB data field 20001\\)",
                             "malignant melanoma \\(UKB data field 20001_1059\\)",
                             "malignant melanoma of upper limb, including shoulder",
                             "malignant melanoma of lower limb, including hip",
                             "ICD10 D03", # skin melanoma in situ 
                             "Melanoma in situ \\(UKB data field 40006\\)"
)


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" &
                grepl(paste0("\\b", malignant_skin_melanoma, collapse = "|"),
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"),
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))
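
The matching step above relies on collapsing a vector of terms into one alternation pattern. A minimal sketch of that idea, using made-up trait strings and only two of the terms; it puts a word boundary in front of every term, whereas using collapse = "|\\b" on its own would leave the first term without one:

```r
# Illustrative subset of terms (taken from malignant_skin_melanoma above)
terms <- c("ICD10 C43", "malignant melanoma of skin")

# Prefix each term with a word boundary, then join with "|"
pattern <- paste0("\\b", terms, collapse = "|")

# Hypothetical trait strings, for illustration only
traits <- c("Malignant melanoma of skin",
            "ICD10 C43.5: Malignant melanoma of trunk",
            "Uveal melanoma")

hits <- grepl(pattern, traits, ignore.case = TRUE)
print(data.frame(trait = traits, matched = hits))
```

Only the first two toy traits match, because the third contains neither term.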

# UKBB malignant melanoma of skin
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" &
                grepl("UKBB", COHORT, ignore.case = T) &
                grepl("malignant melanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"),
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))

# cutaneous melanoma in title
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "melanoma" & 
                grepl("\\bcutaneous melanoma", STUDY, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"), 
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))
                  

gwas_study_info |>
  filter(l1_all_disease_terms == "melanoma") |>
  pull(`DISEASE/TRAIT`) |>
  unique()

gwas_study_info |>
  filter(l1_all_disease_terms == "melanoma") |>
  pull(PUBMED_ID) |>
  unique()

# Checking the clinical trials that make up pubmed id 27023328
# https://clinicaltrials.gov/study/NCT01153763 - cutaneous melanoma
# https://clinicaltrials.gov/study/NCT01266967 - not specified, likely skin melanoma by MeSH terms
# https://clinicaltrials.gov/study/NCT01227889 - not specified, likely skin melanoma by MeSH terms
# https://clinicaltrials.gov/study/NCT01584648 - cutaneous melanoma
# https://clinicaltrials.gov/study/NCT01597908 - cutaneous melanoma
# thus, likely malignant melanoma of skin
                             
# for pubmed ID 21983785, 
# seems likely malignant melanoma of skin
# as they test in situ vs invasive, and use non-skin cancer controls
# https://pmc.ncbi.nlm.nih.gov/articles/PMC3227560/#SM

# for pubmed id: 23455637 - uses one of the same cohorts as 21983785 (genoMEL)
# thus likely malignant melanoma of skin
# https://pubmed.ncbi.nlm.nih.gov/23455637/

# for pubmed id: 21706340
# Cutaneous malignant melanoma
# therefore, malignant melanoma of skin

# pubmed id: 19578364 also uses GenoMEL consortium
# therefore, malignant melanoma of skin

# pubmed id: 18488026
# cutaneous malignant melanoma
# therefore, malignant melanoma of skin

# pubmed id: 21983787
# also uses GenoMEL consortium
# therefore, malignant melanoma of skin

# pubmed id: 28212542
# cutaneous melanoma
# therefore, malignant melanoma of skin

# pubmed id: 24980573
# skin cancer melanoma discussion 
# therefore, malignant melanoma of skin

# pubmed id: 35626014
# not entirely clear, but likely malignant melanoma of skin

# pubmed id: 34724200
# "current study focused on melanomas of the skin"
# hence, malignant melanoma of skin

# pubmed id: 34290314
# lists ICD10 codes as C43 (malignant melanoma of skin) 
# for melanoma - thus malignant melanoma of skin

# pubmed id: 36064556
# lists cutaneous melanoma

malignant_skin_melanoma_studies <- c(27023328,
                                     21983785,
                                     23455637,
                                     21706340,
                                     19578364,
                                     18488026,
                                     21983787,
                                     28212542,
                                     24980573,
                                     35626014,
                                     34724200,
                                     34290314,
                                     36064556)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID %in% malignant_skin_melanoma_studies &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"),
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))


# not sure about pubmed id: 32887889
# a guess, but likely malignant melanoma of skin
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID == 32887889 &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"),
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))


# also not sure of pubmed id: 33409738
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(PUBMED_ID == 33409738 &
                grepl("\\bmelanoma", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("melanoma"),
                          "malignant melanoma of skin"),
                l1_all_disease_terms
         ))

5.0.4 In situ cancer

gwas_study_info |> 
  filter(grepl("in situ", l1_all_disease_terms)) |> 
  pull(l1_all_disease_terms) |>
  unique()

# cervical cancer
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("uterine cervix cancer in situ"),
                          "cervical cancer"
         ))


gwas_study_info |> 
  filter(grepl("in situ", l1_all_disease_terms))  |>
  pull(`DISEASE/TRAIT`) |>
  unique()

# strictly speaking, this is unspecified skin cancer
# likely ICD10 D04 is non-melanoma skin cancer
# PheCode 172.3 - maps to D04.9
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(grepl("in situ", l1_all_disease_terms) & 
                grepl("ICD10 D04|PheCode 172.3", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("skin cancer in situ"),
                          "non-melanoma skin cancer"),
                l1_all_disease_terms
         )
  )

# in situ cancer -> cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "in situ cancer",
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("in situ cancer"),
                          "cancer"),
                l1_all_disease_terms
         )
  )

5.0.5 Non-specific cancer terms

gwas_study_info |> 
  filter(l1_all_disease_terms=="cancer")  |> 
  select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
  head()

# ICD10 Z85.4: Personal history of malignant neoplasm of genital organs
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.4", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cancer"),
                          "urogenital cancer"),
                l1_all_disease_terms
         )
  )

# ICD10 Z85.1: Personal history of malignant neoplasm of trachea, bronchus and lung 
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.1", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cancer"),
                           "tracheal bronchus and lung cancer"),
                l1_all_disease_terms
         )
  )

# ICD10 Z85.0: Personal history of malignant neoplasm of digestive organs
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("ICD10 Z85.0", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cancer"),
                          "digestive system cancer"),
                l1_all_disease_terms
         )
  )

# Cancer of intrathoracic organs (PheCode 164)

# Malignant neoplasm of retroperitoneum and peritoneum (PheCode 159.4)
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                grepl("PheCode 164|PheCode 159.4", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cancer"),
                          "peritoneum cancer, retroperitoneal cancer"),
                l1_all_disease_terms
         )
  )
# Malignant neoplasm of other and ill-defined sites within the digestive organs and peritoneum (PheCode 159)

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(l1_all_disease_terms == "cancer" & 
                # allow an optional closing parenthesis, e.g. "(PheCode 159)"
                grepl("PheCode 159\\)?$", `DISEASE/TRAIT`, ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("cancer"),
                          "digestive system cancer, peritoneum cancer"),
                l1_all_disease_terms
         )
  )

5.0.6 peritoneum cancer -> peritoneal cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("peritoneum cancer"),
                          "peritoneal cancer"
         ))

5.0.7 Bladder tumor

gwas_study_info |>
  filter(l1_all_disease_terms == "bladder tumor") |> 
  pull(`DISEASE/TRAIT`) |>
  unique()

5.0.8 Squamous cell cancer

gwas_study_info |>
    filter(grepl("squamous cell cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lung cancer, squamous cell cancer"),
                          "lung cancer"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("esophageal cancer, squamous cell cancer"),
                          "esophageal cancer"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("head and neck cancer, squamous cell cancer"),
                          "head and neck cancer"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("head and neck cancer, pain, squamous cell cancer"),
                          "head and neck cancer, cancer pain"
         ))

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("non-melanoma skin cancer, squamous cell cancer"),
                          "non-melanoma skin cancer"
         ))

5.0.9 Female reproductive organ

gwas_study_info |>
    filter(grepl("female reproductive organ cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()


gwas_study_info = gwas_study_info |>
    mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "female reproductive organ cancer" & 
                 grepl("benign", `DISEASE/TRAIT`, ignore.case = T),
                 "benign neoplasm",
                 l1_all_disease_terms
         )
         )


gwas_study_info |>
    filter(grepl("\\breproductive system cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()

gwas_study_info = gwas_study_info |>
    mutate(l1_all_disease_terms =
         ifelse(l1_all_disease_terms == "reproductive system cancer" & 
                 grepl("female reproductive cancer", `DISEASE/TRAIT`, ignore.case = T),
                 "female reproductive organ cancer",
                 l1_all_disease_terms
         )
         )

5.0.10 Male reproductive organ

gwas_study_info |>
    filter(grepl("\\bmale reproductive organ cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()

5.0.11 Small cell cancer

gwas_study_info |>
    filter(grepl("\\bsmall cell cancer", l1_all_disease_terms)) |>
    select(`DISEASE/TRAIT`, l1_all_disease_terms) |>
    distinct()


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms  = 
         ifelse(`DISEASE/TRAIT` == "Small-cell lung cancer",
         stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("small cell cancer"),
                          "lung cancer"),
         l1_all_disease_terms
         )
)

5.0.12 Central nervous system cancer, nervous system cancer

gwas_study_info = 
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "central nervous system cancer, nervous system cancer" ~ "central nervous system cancer",
                   TRUE ~ l1_all_disease_terms)
         ) 

5.0.13 Fix non-melanoma skin cancer, woopsy

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "non-malignant skin melanoma skin cancer" ~ "non-melanoma skin cancer",
                   l1_all_disease_terms == "non-malignant melanoma of skin skin cancer" ~ "non-melanoma skin cancer",
                   TRUE ~ l1_all_disease_terms)
         ) 

# 
# gwas_study_info =
# gwas_study_info |>
#   mutate(l1_all_disease_terms =
#          stringr::str_replace_all(l1_all_disease_terms,
#                            "non-malignant skin melanoma skin cancer",
#                           "non-melanoma skin cancer"
#          )
#          ) 

5.0.14 Unspecified skin cancer

gwas_study_info |> 
  filter(grepl("skin cancer", l1_all_disease_terms) & 
         !grepl("non-melanoma", l1_all_disease_terms)
         )  |> 
  select(STUDY, 
         `DISEASE/TRAIT`, 
         all_disease_terms, 
         l1_all_disease_terms) |> 
  distinct()

# list them under both malignant melanoma of skin and non-melanoma skin cancer

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("skin cancer", l1_all_disease_terms) & 
                !grepl("melanoma", l1_all_disease_terms),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("skin cancer"),
                          "malignant melanoma of skin, non-melanoma skin cancer"),
                l1_all_disease_terms
         )
         )

5.1 Unspecified lymphoma

gwas_study_info |> 
  filter(grepl("lymphoma", l1_all_disease_terms) & 
          !grepl("hodgkin", l1_all_disease_terms)) |> 
  pull(`DISEASE/TRAIT`) |> 
  unique()

# PheCode 202.23 maps to ICD-9 200.1 (lymphosarcoma),
# which, per http://snomed.info/id/188498009, is a form of non-Hodgkin's lymphoma

# PheCode 202.24 maps to ICD-9 200.6 (anaplastic large cell lymphoma), a form of non-Hodgkin's lymphoma

# all ICD10 C83 codes are non-Hodgkin's lymphoma

# ICD10 C85.1 maps to PheCode 202.2 Non-Hodgkins lymphoma

nhl_terms <- c("B cell non-Hodgkin lymphoma",
               "PheCode 202.23",
               "PheCode 202.24",
               "ICD10 C83",
               "ICD10 C85.1"
               )


gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl(paste0(nhl_terms, collapse = "|"), 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )


# Non-follicular lymphoma (UKB data field 40006) likely non-hodgkin lymphoma
# as ICD10 C83: Non-follicular lymphoma is non-hodgkin lymphoma

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl("Non-follicular lymphoma \\(UKB data field 40006\\)", 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )


# Cancer code, self-reported: lymphoma (UKB data field 20001_1047)
# includes both hodgkin and non-hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                grepl("Cancer code, self-reported: lymphoma \\(UKB data field 20001_1047\\)", 
                      `DISEASE/TRAIT`, 
                      ignore.case = T),
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# for pubmed id: 34594039
# from sup table 1; 
# Malignant lymphoma    Malignant_Lymphoma  is defined PheCodes 201/202 CD2_NONFOLLICULAR_LYMPHOMA
# thus includes both hodgkin and non-hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 34594039,
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# pubmed id: 23349640
# includes both hodgkin and non-hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 23349640,
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )

# not entirely sure for pubmed id: 36344522
# perhaps need to read further in, but seems like it is both hodgkin and non-hodgkin lymphoma
gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms =
         ifelse(grepl("lymphoma", l1_all_disease_terms, ignore.case = T) & 
                !grepl("hodgkin", l1_all_disease_terms, ignore.case = T) & 
                PUBMED_ID == 36344522,
                stringr::str_replace_all(l1_all_disease_terms,
                          vec_to_grep_pattern("lymphoma"),
                          "non-hodgkin lymphoma, hodgkin lymphoma"),
                l1_all_disease_terms
         )
         )
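
The three PubMed-ID blocks above apply the same rule, so they could be expressed once with %in%. A minimal sketch on a toy data frame, under some assumptions: the toy rows are invented (PUBMED_ID 11111111 is hypothetical), and a plain word-boundary regex stands in for vec_to_grep_pattern(), whose exact output is not shown in this section.

```r
# Toy stand-in for gwas_study_info; values are illustrative only
toy <- data.frame(
  PUBMED_ID = c(34594039, 23349640, 36344522, 11111111),
  l1_all_disease_terms = c("lymphoma", "lymphoma",
                           "hodgkin lymphoma", "lymphoma"),
  stringsAsFactors = FALSE
)

# Studies judged above to include both hodgkin and non-hodgkin lymphoma
both_lymphoma_studies <- c(34594039, 23349640, 36344522)

# Same three conditions as the blocks above, evaluated once
is_target <- grepl("lymphoma", toy$l1_all_disease_terms, ignore.case = TRUE) &
  !grepl("hodgkin", toy$l1_all_disease_terms, ignore.case = TRUE) &
  toy$PUBMED_ID %in% both_lymphoma_studies

# Assumption: "\\blymphoma\\b" approximates vec_to_grep_pattern("lymphoma")
toy$l1_all_disease_terms <- ifelse(
  is_target,
  gsub("\\blymphoma\\b", "non-hodgkin lymphoma, hodgkin lymphoma",
       toy$l1_all_disease_terms, perl = TRUE),
  toy$l1_all_disease_terms
)

print(toy)
```

Rows already labelled with "hodgkin" and rows from other studies are left untouched.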

5.1.1 breast cancer, cancer, colon and rectum cancer, tracheal bronchus and lung cancer, ovarian cancer, prostate cancer

gwas_study_info =
gwas_study_info |>
  mutate(l1_all_disease_terms =
         case_when(l1_all_disease_terms == "breast cancer, cancer, colorectal cancer, lung cancer, ovarian cancer, prostate cancer" ~ 
                     "breast cancer, colorectal cancer, lung cancer, ovarian cancer, prostate cancer",
                   TRUE ~ l1_all_disease_terms)
         )

6 Final summary - number of unique study terms

6.1 Deal with duplicate terms created during grouping

gwas_study_info = 
  gwas_study_info |>
  rowwise() |>
  mutate(l1_all_disease_terms = paste0(sort(unique(unlist(strsplit(l1_all_disease_terms, ", ")))),
                                      collapse = ", ")
         ) |>
  ungroup()
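
To illustrate what the step above does to a single value, here is a small base-R sketch on toy strings. The helper name dedupe_terms is hypothetical and not part of the analysis; it splits on ", ", drops repeated terms, sorts them, and joins them back, without needing rowwise().

```r
# Hypothetical helper: de-duplicate and sort comma-separated terms
dedupe_terms <- function(x) {
  vapply(
    strsplit(x, ", ", fixed = TRUE),
    function(parts) paste(sort(unique(parts)), collapse = ", "),
    character(1)
  )
}

# Toy inputs, for illustration only
terms <- c("lung cancer, breast cancer, lung cancer", "cancer")
deduped <- dedupe_terms(terms)
print(deduped)
```

The first toy value loses its repeated "lung cancer" and comes back in alphabetical order.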

6.2 Deal with hanging commas and spaces

gwas_study_info = gwas_study_info |>
  mutate(l1_all_disease_terms = stringr::str_remove_all(l1_all_disease_terms, "^,|,$")
         ) |>
  mutate(l1_all_disease_terms = stringr::str_trim(l1_all_disease_terms)
         ) 
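
A quick check of the clean-up logic on toy strings, sketched with base-R equivalents (gsub and trimws in place of str_remove_all and str_trim). The pattern "^,|,$" removes a comma only at the very start or very end of a value; the trim then removes any whitespace left behind.

```r
# Toy values with hanging commas, for illustration only
x <- c(", lung cancer,", "breast cancer", ",melanoma")

# Remove a leading or trailing comma, then strip surrounding whitespace
cleaned <- trimws(gsub("^,|,$", "", x))
print(cleaned)
```

Commas between terms are untouched, since the pattern is anchored to the ends of the string.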

6.3 Final summary - number of unique study-term pairs

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l1_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(l1_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))


head(n_studies_trait)

dim(n_studies_trait)

6.3.1 When studies with multiple terms are split into separate terms

diseases <- stringr::str_split(pattern = ", ", 
                               gwas_study_info$l1_all_disease_terms[gwas_study_info$l1_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()


test <- data.frame(trait = unique(diseases))

length(unique(diseases))

# make frequency table
freq <- table(as.factor(diseases))

# sort in decreasing order
freq_sorted <- sort(freq, decreasing = TRUE)

# show top N, e.g. top 10
head(freq_sorted, 10)
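
The counting above treats each term in a multi-term value as its own entry. A minimal sketch of the same split-then-count idea on toy values (invented for illustration), using base R only:

```r
# Toy values; the empty string mimics a study with no disease term
terms <- c("breast cancer, lung cancer", "lung cancer", "")

# Drop empty values, split on ", ", and flatten to one term per element
toy_diseases <- trimws(unlist(strsplit(terms[terms != ""], ", ", fixed = TRUE)))

# Count each term and sort from most to least frequent
toy_freq_sorted <- sort(table(toy_diseases), decreasing = TRUE)
print(toy_freq_sorted)
```

Here "lung cancer" is counted twice because it appears in two toy values, once alone and once alongside "breast cancer".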

6.3.2 Save the updated gwas_study_info with harmonized disease terms

fwrite(gwas_study_info,
        here::here("output/gwas_cat/gwas_study_info_group_l1_v2.csv")
         )

sessionInfo()