Last updated: 2025-12-29

Checks: 7 0

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
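
As a minimal sketch of what the seed provides (the `sample()` call is only an illustration and is not part of this analysis), resetting the same seed reproduces the same pseudo-random draw:

```r
# Illustration only: the same seed gives the same pseudo-random draw.
set.seed(20220216)
first_draw <- sample(1:100, size = 5)

set.seed(20220216)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)
```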

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ba69411. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    figures/
    Ignored:    human_dictionary/
    Ignored:    igsr_populations.tsv
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/
    Ignored:    visualization.Rdata

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/map_trait_to_icd10.Rmd) and HTML (docs/map_trait_to_icd10.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   ba69411  IJbeasley  2025-12-29  Fix some UMLS ICD10 Codes
html  6bf9d47  IJbeasley  2025-12-29  Build site.
Rmd   f597620  IJbeasley  2025-12-29  Update mapping to ICD-10 codes (to keep year)
html  1f555b6  IJbeasley  2025-12-29  Build site.
Rmd   b4527b8  IJbeasley  2025-12-29  Update mapping to ICD-10 codes
html  757b4b4  IJbeasley  2025-10-09  Build site.
Rmd   6019c96  IJbeasley  2025-10-09  Even more correcting of icd 10 codes
html  0feea16  IJbeasley  2025-10-08  Build site.
Rmd   a8f1628  IJbeasley  2025-10-08  Include study accession in icd 10 map
html  50ebebc  IJbeasley  2025-10-08  Build site.
Rmd   9bbe0dd  IJbeasley  2025-10-08  Updating icd 10 mapping
html  ec027a3  IJbeasley  2025-10-08  Build site.
Rmd   cb8a570  IJbeasley  2025-10-08  Updating disease icd code mapping
html  41d6fe5  IJbeasley  2025-09-28  Build site.
Rmd   97d340d  IJbeasley  2025-09-28  workflowr::wflow_publish("analysis/map_trait_to_icd10.Rmd")

title: "Mapping GWAS traits to ICD 10"
author: "Isobel Beasley"
date: "2025-09-26"
output: html_document

Set up

library(dplyr)
library(stringr)
library(data.table)

Ontology help - for getting disease subtypes

source(here::here("code/get_term_descendants.R"))

Load Data

# gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_group_v2.csv"))

gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_all_group.csv"))

gwas_study_info = gwas_study_info |>
    mutate(collected_all_disease_terms = stringi::stri_trans_general(collected_all_disease_terms, "Latin-ASCII")
           )
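
As a small illustration of the transliteration step above, the sketch below runs `stringi::stri_trans_general()` with the `"Latin-ASCII"` transform on a few made-up disease terms. The terms are hypothetical and are written with `\u` escapes so the snippet itself stays ASCII.

```r
# Hypothetical disease terms containing accented Latin characters.
terms <- c("Sj\u00f6gren syndrome", "M\u00e9ni\u00e8re disease", "asthma")

# Latin-ASCII replaces accented letters with their closest ASCII equivalents,
# so later string matching does not depend on how a term was typed.
ascii_terms <- stringi::stri_trans_general(terms, "Latin-ASCII")
ascii_terms
```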

# study accessions GCST003209 and GCST003208:
# replace benign neoplasm with colorectal cancer, endometrial cancer

gwas_study_info = 
  gwas_study_info |>
  mutate(collected_all_disease_terms = ifelse(STUDY_ACCESSION == "GCST003209" |
                                              STUDY_ACCESSION == "GCST003208" ,
                                              "colorectal cancer, endometrial cancer",
                                              collected_all_disease_terms))

# for study accession = GCST90133383
# replace benign neoplasm with testicular cancer, hearing loss

Automatic mapping of GWAS traits to ICD 10

Create a mapping table

disease_mapping <- gwas_study_info |>
  filter(DISEASE_STUDY == TRUE) |>
  tidyr::separate_longer_delim(cols = collected_all_disease_terms, 
                               delim = ", ") |>
  select(`DISEASE/TRAIT`, 
         collected_all_disease_terms, 
         PUBMED_ID,
         YEAR,
         STUDY_ACCESSION) |>
  distinct()

disease_mapping =
  disease_mapping |>
  filter(collected_all_disease_terms != "")
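
To make the reshaping above concrete, here is a minimal sketch on a toy table. The accessions and terms are invented; only the column names follow the ones used in this analysis. A comma-separated term string becomes one row per term, and empty terms are dropped afterwards.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-study table; the second study has no disease terms.
toy_studies <- tibble(
  STUDY_ACCESSION = c("GCST_A", "GCST_B"),
  collected_all_disease_terms = c("asthma, eczema", "")
)

toy_mapping <- toy_studies |>
  # one row per term: "asthma, eczema" becomes two rows
  separate_longer_delim(cols = collected_all_disease_terms, delim = ", ") |>
  distinct() |>
  # drop studies without any disease term
  filter(collected_all_disease_terms != "")

toy_mapping
```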

print("Number of unique disease trait & study pairs")
[1] "Number of unique disease trait & study pairs"
nrow(disease_mapping)
[1] 46316
diseases <- stringr::str_split(pattern = ", ",
                               gwas_study_info$collected_all_disease_terms[gwas_study_info$collected_all_disease_terms != ""])  |>
  unlist() |>
  stringr::str_trim()

diseases <- unique(diseases)

print("Number of unique disease terms")
[1] "Number of unique disease terms"
print(length(diseases))
[1] 1993

Get ICD10 codes from author provided DISEASE/TRAIT column

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(grepl("ICD10", `DISEASE/TRAIT`),
                             str_extract(`DISEASE/TRAIT`, 
                                         "(?<=ICD10 )[^:]+(?=:)|(?<=ICD10 )[^ ]+"),
                             NA),
         icd10_code_origin = ifelse(grepl("ICD10", `DISEASE/TRAIT`),
                                    "Study Provided",
                                    NA)
)
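
The extraction pattern above has two alternatives: the first takes everything between `ICD10 ` and a following colon, the second falls back to the first space-delimited token after `ICD10 `. A minimal sketch on made-up trait strings (the strings are assumptions about the layouts, not rows from the data):

```r
library(stringr)

icd10_pattern <- "(?<=ICD10 )[^:]+(?=:)|(?<=ICD10 )[^ ]+"

# Hypothetical trait strings in the two layouts the pattern targets.
traits <- c(
  "ICD10 K76: other diseases of liver",  # code followed by a colon
  "ICD10 E11 type 2 diabetes",           # code followed by a space
  "Asthma"                               # no code at all
)

extracted <- str_extract(traits, icd10_pattern)
extracted
```

Traits without the `ICD10` prefix come back as `NA`, which is what lets the later steps fill them from other sources.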

# fix malformed ICD10 codes: R11.103.11 -> R11
# fixed() matches the dots literally rather than as regex wildcards
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = stringr::str_replace(icd10_code, 
                                         pattern = stringr::fixed("R11.103.11"),
                                         replacement = "R11")
         )

# fix malformed ICD10 codes: D63.165.8 -> D63, N18.3-N18.9
# fixed() matches the dots literally rather than as regex wildcards
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = stringr::str_replace(icd10_code, 
                                         pattern = stringr::fixed("D63.165.8"),
                                         replacement = "D63, N18.3-N18.9")
         ) 
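
The codes being repaired contain dots, and in a regular expression an unescaped `.` matches any character. The sketch below shows the difference between a regex pattern and a literal `fixed()` pattern; the second input string is made up purely to trigger the over-match.

```r
library(stringr)

# "R11X103Y11" is a made-up string that is NOT the malformed code.
codes <- c("R11.103.11", "R11X103Y11")

# As a regex, each "." matches any character, so both strings are replaced.
regex_result <- str_replace(codes, "R11.103.11", "R11")

# As a literal pattern, only the exact malformed code is replaced.
fixed_result <- str_replace(codes, fixed("R11.103.11"), "R11")

regex_result
fixed_result
```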

# Additionally, some studies provide ICD10 codes in the DISEASE/TRAIT column without the "ICD10" prefix
# e.g. Source of report of K76 (other diseases of liver) (UKB data field 131671)
# let's try to capture these too
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(is.na(icd10_code) &
                             grepl("Source of report of", `DISEASE/TRAIT`),
                             str_extract(`DISEASE/TRAIT`,
                                         "(?<=Source of report of\\s)[A-Z][0-9]{1,2}(?:\\.[0-9A-Z]+)?\\b"),
                             icd10_code),
         icd10_code_origin = ifelse(grepl("Source of report of", 
                                          `DISEASE/TRAIT`),
                                    "Study Provided",
                                    icd10_code_origin)
) |>
  mutate(icd10_code = str_remove_all(icd10_code, 
                                     pattern = "Source of report of ")
         )

# Another example where  some studies provide ICD10 codes in the DISEASE/TRAIT column without the "ICD10" prefix
# e.g. Insulin-dependent diabetes mellitus (Union E10)
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(is.na(icd10_code) &
                             grepl("Union", `DISEASE/TRAIT`),
                             str_extract(`DISEASE/TRAIT`, 
                                         "Union [A-Z][0-9]{1,2}(?:\\.[0-9A-Z]+)?\\b"),
                             icd10_code),
         icd10_code_origin = ifelse(grepl("Union", 
                                          `DISEASE/TRAIT`),
                                    "Study Provided",
                                    icd10_code_origin)
) |>
  mutate(icd10_code = str_remove_all(icd10_code, 
                                     pattern = "Union ")
  )
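
The two blocks above pull a code that follows a fixed phrase, in two slightly different styles: a lookbehind that never captures the phrase, and an extract-then-strip approach. A minimal sketch of both, using the example trait strings quoted in the comments above:

```r
library(stringr)

code_regex <- "[A-Z][0-9]{1,2}(?:\\.[0-9A-Z]+)?\\b"

# Style 1: lookbehind, the phrase is required but not part of the match.
ukb_trait <- "Source of report of K76 (other diseases of liver) (UKB data field 131671)"
ukb_code <- str_extract(ukb_trait, paste0("(?<=Source of report of\\s)", code_regex))

# Style 2: match the phrase together with the code, then strip the phrase.
union_trait <- "Insulin-dependent diabetes mellitus (Union E10)"
union_code <- str_extract(union_trait, paste0("Union ", code_regex)) |>
  str_remove_all("Union ")

ukb_code
union_code
```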

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(is.na(icd10_code) &
                             grepl("ICD [A-Z][0-9]|ICD-10 [A-Z][0-9]", 
                                   `DISEASE/TRAIT`),
                        str_match(`DISEASE/TRAIT`,
                                  "ICD(?:-10|10)?\\s*([A-Z][0-9]{1,2}(?:\\.[0-9A-Z]+)?)")[,2],
                             icd10_code),
         icd10_code_origin = ifelse(grepl("ICD [A-Z][0-9]|ICD-10 [A-Z][0-9]",
                                          `DISEASE/TRAIT`),
                                    "Study Provided",
                                    icd10_code_origin)
)
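
Unlike the lookbehind patterns earlier, this step uses `str_match()`, which returns a matrix whose first column is the full match and whose following columns are the capture groups. A small sketch on made-up trait strings (hypothetical, chosen to cover the `ICD` and `ICD-10` prefixes):

```r
library(stringr)

icd_pattern <- "ICD(?:-10|10)?\\s*([A-Z][0-9]{1,2}(?:\\.[0-9A-Z]+)?)"

# Hypothetical trait strings.
traits <- c("Asthma (ICD-10 J45)", "Gout (ICD M10)", "Asthma")

match_matrix <- str_match(traits, icd_pattern)

# Column 1 is the whole match (e.g. "ICD-10 J45"); column 2 is the code alone.
codes <- match_matrix[, 2]
codes
```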

n_study_provided = 
disease_mapping |>
  filter(icd10_code_origin == "Study Provided") |>
  nrow()

n_studies =
  disease_mapping |>
  nrow()

print("Percentage of studies that provide ICD10 codes:")
[1] "Percentage of studies that provide ICD10 codes:"
print(round(n_study_provided / n_studies * 100, 2))
[1] 8.55

Get ICD10 codes from author provided PheCodes

Get Phecodes for diseases

disease_mapping <- disease_mapping |>
  mutate(
    phecode = str_extract(`DISEASE/TRAIT`, "(?<=PheCode )[^)]+")
  ) |>
  mutate(phecode = as.numeric(phecode))
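
A minimal sketch of the PheCode extraction on made-up trait strings follows. One thing it illustrates: converting to numeric drops trailing zeros, so `"401.10"` and `"401.1"` become the same key. Whether the real trait strings ever carry trailing zeros is an assumption here, not something checked against the data.

```r
library(stringr)

# Hypothetical trait strings with a "(PheCode ...)" suffix.
traits <- c(
  "Type 2 diabetes (PheCode 250.2)",
  "Hypertension (PheCode 401.10)",
  "Asthma"
)

# Everything after "PheCode " up to (not including) the closing parenthesis.
phecode_chr <- str_extract(traits, "(?<=PheCode )[^)]+")

# Numeric keys: "401.10" and "401.1" are the same number.
phecode_num <- as.numeric(phecode_chr)

phecode_chr
phecode_num
```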

Convert Phecodes to ICD10

# phecode to ICD10 mapping from https://wei-lab.app.vumc.org/phecode-data/phecode_international_version

phecodes <- fread(here::here("data/icd/phecode_international_version_unrolled.csv"))

phecode_icd_map =
  phecodes |>
  select(icd10_code = ICD10, 
         phecode = PheCode
         ) |>
  filter(!is.na(phecode))

phecode_icd_map =
  phecode_icd_map |>
  filter(phecode %in% unique(disease_mapping$phecode))

# if more than one ICD10 code per phecode, collapse into a single row
phecode_icd_map =
  phecode_icd_map |>
group_by(phecode) |>
  summarise(icd10_code = 
            str_flatten(unique(icd10_code), 
                        collapse = ", ", 
                        na.rm = TRUE), 
            .groups = "drop")
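
To show what the collapsing step does, here is a sketch on a toy PheCode-to-ICD10 table (the rows are invented for illustration). Duplicate codes within a PheCode are removed before the codes are joined into one comma-separated string.

```r
library(dplyr)
library(stringr)

# Hypothetical mapping rows: PheCode 250.2 maps to two distinct ICD10 codes,
# one of which appears twice.
toy_map <- tibble(
  phecode = c(250.2, 250.2, 250.2, 401.1),
  icd10_code = c("E11", "E11.9", "E11", "I10")
)

collapsed <- toy_map |>
  group_by(phecode) |>
  summarise(
    icd10_code = str_flatten(unique(icd10_code), collapse = ", ", na.rm = TRUE),
    .groups = "drop"
  )

collapsed
```

One row per PheCode is what makes the table usable as a lookup in the patching step that follows.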

# label the origin of these ICD10 codes as "Study PheCode Mapping"
phecode_icd_map =
  phecode_icd_map |>
  mutate(icd10_code_origin = "Study PheCode Mapping")

disease_mapping =
  rows_patch(disease_mapping,
              phecode_icd_map,
              by = c("phecode"),
              unmatched = "ignore",
              )
# left_join(disease_mapping,
#           phecode_icd_map,
#           by = c("phecode","icd10_code_origin", "icd10_code"),
#           relationship = "many-to-one",
#           na_matches = "never")

# disease_mapping =
#   disease_mapping |>
#   mutate(icd10_code_origin = "Study PheCode")
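
The `rows_patch()` call above only fills values that are still `NA`; codes already obtained from the study itself are left alone. A minimal sketch on toy tables (all values are invented for illustration):

```r
library(dplyr)

# Toy version of disease_mapping: the first row still lacks an ICD10 code,
# the second already has a study-provided one.
toy_mapping <- tibble(
  phecode = c(250.2, 401.1),
  icd10_code = c(NA, "I10"),
  icd10_code_origin = c(NA, "Study Provided")
)

# Toy lookup table; PheCode 999 does not occur in toy_mapping.
toy_phecode_map <- tibble(
  phecode = c(250.2, 401.1, 999),
  icd10_code = c("E11", "I15", "Z00"),
  icd10_code_origin = "Study PheCode Mapping"
)

patched <- rows_patch(
  toy_mapping,
  toy_phecode_map,
  by = "phecode",
  unmatched = "ignore"  # lookup keys missing from toy_mapping are skipped
)

patched
```

In the result, the first row picks up `E11` with the origin `Study PheCode Mapping`, while the second keeps its study-provided `I10`.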

print("Number of ICD10 codes obtained")
[1] "Number of ICD10 codes obtained"
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
  group_by(icd10_code_origin) |> 
  summarise(n = n(), 
            percent = n()/n_studies *100)
# A tibble: 2 × 3
  icd10_code_origin         n percent
  <chr>                 <int>   <dbl>
1 Study PheCode Mapping  6780   14.6 
2 Study Provided         3961    8.55
disease_mapping |>
  filter(!is.na(icd10_code_origin)) |>
  nrow()
[1] 10741
disease_mapping |>
  filter(!is.na(icd10_code_origin)) |>
  head()
                                                                          DISEASE/TRAIT
1              Source of report of M45 (ankylosing spondylitis) (UKB data field 131913)
2         Source of report of I21 (acute myocardial infarction) (UKB data field 131299)
3                         Source of report of B07 (viral warts) (UKB data field 130189)
4                         Source of report of D86 (sarcoidosis) (UKB data field 130687)
5                Source of report of E03 (other hypothyroidism) (UKB data field 130697)
6 Source of report of E10 (insulin-dependent diabetes mellitus) (UKB data field 130707)
  collected_all_disease_terms PUBMED_ID YEAR STUDY_ACCESSION icd10_code
1      ankylosing spondylitis  36779085 2022    GCST90103423        M45
2 acute myocardial infarction  36779085 2022    GCST90103349        I21
3             benign neoplasm  36779085 2022    GCST90103384        B07
4                 sarcoidosis  36779085 2022    GCST90103387        D86
5              hypothyroidism  36779085 2022    GCST90103388        E03
6    type 1 diabetes mellitus  36779085 2022    GCST90103390        E10
  icd10_code_origin phecode
1    Study Provided      NA
2    Study Provided      NA
3    Study Provided      NA
4    Study Provided      NA
5    Study Provided      NA
6    Study Provided      NA
# per pubmed id, are there multiple icd10 code origins?
disease_mapping |>
  filter(!is.na(icd10_code_origin)) |>
  group_by(PUBMED_ID) |>
  summarise(n_icd10_code_origins = n_distinct(icd10_code_origin)) |>
  pull(n_icd10_code_origins) |>
  summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.059   1.000   2.000 
# which pubmed ids have multiple icd10 code origins?
disease_mapping |>
  filter(!is.na(icd10_code_origin)) |>
  group_by(PUBMED_ID) |>
  summarise(n_icd10_code_origins = n_distinct(icd10_code_origin)) |>
  filter(n_icd10_code_origins > 1)
# A tibble: 1 × 2
  PUBMED_ID n_icd10_code_origins
      <int>                <int>
1  34737426                    2
# per study accession, are there multiple icd10 code origins?
disease_mapping |>
  filter(!is.na(icd10_code_origin)) |>
  group_by(STUDY_ACCESSION) |>
  summarise(n_icd10_code_origins = n_distinct(icd10_code_origin)) |>
  pull(n_icd10_code_origins) |>
  summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

Complete PheCode to ICD10 mappings

Some PheCodes have no ICD10 mapping in the PheCode-to-ICD10 mapping file used above. We created a file with manual mappings for these missing PheCodes.

# how many missing ICD10 codes for PheCodes?
disease_mapping |>
    filter(is.na(icd10_code) & !is.na(phecode)) |> 
    nrow()
[1] 913
# load manual mapping file
phecode_icd_manual = 
  readxl::read_xlsx(here::here("data/icd/phecode_to_icd10_manual_mapping.xlsx"))

to_add = phecode_icd_manual |>
  mutate(phecode = as.numeric(phecode),
         icd10_code_origin = "Study PheCode (Manual Mapping)") |>
  select(-phenotype) |>
  distinct()

disease_mapping =
  rows_patch(disease_mapping,
             to_add,
             unmatched = "ignore")
Matching, by = "phecode"
# 
# nrow(phecode_icd_manual)
# 
# phecode = c("38.30",
#             "79.90",
#             "110.00",
#             "530.13",
#             "562.20",
#             "580.10",
#             "174.00",
#             "174.20",
#             "189.10",
#             "218.00",
#             "228.10",
#             "224.00",
#             "250.15",
#             "250.25",
#             "250.40",
#             "285.21",
#             "292.11",
#             "362.26",
#             "362.23",
#             "362.27",
#             "362.50",
#             "452.20",
#             "528.10",
#             "535.90",
#             "724.10",
#             "724.22",
#             "740.00",
#             "743.10",
#             "743.12",
#             "743.13",
#             "172.21",
#             "274.00",
#             "286.00",
#             "282.00",
#             "280.00",
#             "272.00",
#             "276.00",
#             "338.00",
#             "350.00",
#             "401.00",
#             "411.00",
#             "414.20",
#             "415.1",
#             "427.21",
#             "429.00",
#             "585.00",
#             "735.22",
#             "724.00",
#             "722.00",
#             "716.10",
#             "709.00",
#             "706.00",
#             "592.00",
#             "571.00",
#             "580.00",
#             "170.00",
#             "172.00",
#             "264.00",
#             "291.00",
#             "555.00",
#             "562.00",
#             "578.00",
#             "783.1",
#             "536.80",
#             "427.40",
#             "443.00",
#             "41.80",
#             "41.90",
#             "244.00",
#             "250.00",
#             "427.40",
#             "526.40",
#             "977.00",
#             "840.20",
#             "823.00",
#             "751.00",
#             "743.00",
#             "270.38",
#             "504.10",
#             "253.40",
#             "279.20",
#             "426.22",
#             "537.1",
#             "707.20",
#             "736.10",
#             "789.10",
#             "290.13",
#             "327.70",
#             "433.60",
#             "695.00",
#             "602.30",
#             "375.10",
#             "560.00",
#             "586.10",
#             "593.20",
#             "620.10",
#             "475.90",
#             "799.00",
#             "662.10",
#             "855.00",
#             "536.70",
#             "856.00",
#             "519.10",
#             "771.20",
#             "792.10",
#             "525.10",
#             "724.20",
#             "611.11",
#             "556.11",
#             "767.00",
#             "736.50",
#             "526.42")
# 
# icd10_code = c("A49.9",
#                "B34.9",
#                "B35, B36",
#                "K22.7",
#                "K57",
#                "N00, N01, N02, N03, N04, N05, N06, N07",
#                "Z85.3",
#                "C50",
#                "C64, C65",
#                "D25, D26",
#                "D18.0",
#                "E00, E00.0, E00.1, E00.2, E00.9, E01.8, E02, E03.0, E03.1, E03.2, E03.3, E03.8, E03.9, E89.0",
#                "E10.5",
#                "E11.5",
#                "R73.0",
#                "D63",
#                "R47.0",
#                "H35.3",
#                "H35.3",
#                "H35.3",
#                "H35.5",
#                "I80",
#                "K12.30",
#                "K29.7, K29.8, K29.9",
#                "M43.2",
#                "M21.5",
#                "M13.9, M15.0, M15.1, M15.2, M15.3, M15.4, M16, M16.0, M16.1, M16.3, M16.6, M16.7, M16.9, M17.1, M17.4, M17.5, M18.0, M18.1, M18.5, M18.9, M19.0, M19.2",
#                "M81",
#                "M81.8",
#                "M81.8",
#                "C44",
#                "M10, M10.0, M10.1, M10.2, M10.4, M10.9, M11.0, M11.1, M11.2, M11.8, M11.9, M67.9",
#                "D65, D66, D67, D68, D68, D68.0, D68.1, D68.2, D68.3, D68.4, D68.8, D68.9, O72.3, O99.1",
#                "D55, D55.0, D55.1, D55.2, D55.3, D55.8, D55.9, D56, D56.0, D56.1, D56.2, D56.3, D56.4, D56.8, D56.9, D57, D57.0, D57.1, D57.2, D57.3, D57.8, D58, D58.0, D58.1, D58.2, D58.8, D58.9, M90.4",
#                "D50, D50.0, D50.1, D50.8, D50.9",
#                "E78.0, E78.1, E78.2, E78.3, E78.4, E78.5, E78.9",
#                "E86, E87.0, E87.1, E87.2, E87.3, E87.4, E87.5, E87.6, E87.7, E87.8, R63.1",
#                "R52.0, R52.2, R52.9",
#                "R25, R25.0, R25.1, R25.2, R25.3, R25.8, R26, R26.0, R26.1, R26.8, R27, R27.0, R27.8, R29.0, R29.2, R43, R43.0, R43.1, R43.2",
#                "I10, I11, I11.0, I11.9, I12, I12.0, I12.9, I13, I13.0, I13.1, I13.2, I13.9, I15, I15.0, I15.1, I15.2, I15.8, I15.9, I67.4",
#                "I20, I20.0, I20.1, I20.8, I20.9, I21, I21.0, I21.1, I21.2, I21.3, I21.4, I21.9, I22, I22.0, I22.1, I22.8, I22.9, I23, I23.0, I23.1, I23.2, I23.3, I23.6, I23.8, I24, I24.0, I24.1, I24.8, I24.9, I25, I25.1, I25.2, I25.3, I25.4, I25.5, I25.6, I25.8, I25.9, I34.1, I51.0, I51.3, Z95.1, Z95.5",
#                "I25.10",
#                "I26, I26.0",
#                "I48",
#                "I51.8",
#                "N17, N17.0, N17.1, N17.2, N17.8, N17.9, N18, N18.0, N18.9, N19, Y60.2, Y61.2, Y62.0, Y84.1, Z49.1, Z49.2, Z99.2",
#                "M21.5",
#                "M40.2, M43.2, M43.8, M48.8, M49.8, M50.0, M99.6",
#                "G55.1, M46.4, M50, M50.0, M50.0, M50.1, M50.2, M50.3, M50.8, M50.9, M51.3, M51.4, M96.1",
#                "M13.0",
#                "L94.3, M33, M33.0, M33.1, M33.2, M33.9, M34, M34.0, M34.1, M34.2, M34.8, M34.9, M35.0, M35.1, M35.5, M35.8, M35.9, M36.0, M36.8, M65.3, N16.4",
#                "K09.8, L70, L70.0, L70.1, L70.2, L70.3, L70.4, L70.5, L70.8, L70.9, L72, L72.0, L72.1, L72.2, L72.8, L72.9, L73.0, L85.3",
#                "N30, N30.0, N30.1, N30.2, N30.3, N30.8, N30.9, N34, N34.0, N34.2, N34.3, N35.1, N37",
#                 "K70.4, K72, K72.1, K72.9, K74.0, K74.1, K74.2, K74.3, K74.4, K74.5, K74.6, K75.0, K75.1, K76.0, K76.6, K76.7",
#                "B52.0, N00.0, N00.1, N00.2, N00.3, N00.4, N00.5, N00.6, N00.7, N01, N01.0, N01.1, N01.2, N01.3, N01.4, N01.5, N01.6, N01.7, N01.9, N02.0, N02.1, N02.2, N02.3, N02.4, N02.5, N02.6, N02.7, N03, N03.0, N03.1, N03.2, N03.3, N03.4, N03.5, N03.6, N03.7, N03.9, N04, N04.0, N04.1, N04.2, N04.3, N04.4, N04.5, N04.6, N04.7, N04.8, N04.9, N05, N05.0, N05.1, N05.2, N05.3, N05.4, N05.5, N05.6, N05.7, N05.9, N06.0, N06.1, N06.2, N06.3, N06.4, N06.5, N06.6, N06.7, N07.0, N07.1, N07.2, N07.3, N07.4, N07.5, N07.6, N07.7, N08, N08.1, N08.2, N08.3, N08.4, N08.5, N08.8, N14, N14.0, N14.1, N14.2, N14.3, N14.4, N15.0, N15.8, N16.1, N16.2, N16.3, N16.4, N16.5",
#                "C40, C40.0, C40.1, C40.2, C40.3, C40.8, C40.9, C41, C41.0, C41.1, C41.2, C41.3, C41.4, C41.9, C47, C47.0, C47.1, C47.2, C47.3, C47.4, C47.5, C47.6, C47.8, C47.9, C49, C49.0, C49.1, C49.2, C49.3, C49.4, C49.5, C49.6, C49.8, C49.9",
# "C43, C43.0, C43.1, C43.2, C43.3, C43.4, C43.5, C43.6, C43.7, C43.8, C43.9, C44.0, C44.1, C44.2, C44.3, C44.4, C44.5, C44.6, C44.7, C44.8, C44.9, D03, D03.0, D03.1, D03.2, D03.3, D03.4, D03.5, D03.6, D03.7, D03.8, D03.9, D04, D04.0, D04.1, D04.2, D04.3, D04.4, D04.5, D04.6, D04.7, D04.8, D04.9",
# "R62.0, R62.8, R62.9",
# "F06, F06.1, F07.0, F07.1, F07.2, F07.8, F07.9, F23, F23.0, F23.1, F23.8, F23.9, G47.1, R40.0, R40.1",
# "K50, K50.0, K50.1, K50.8, K50.9, K51, K51.0, K51.1, K51.2, K51.3, K51.4, K51.5, K51.8, K51.9",
# "K57, K57.0, K57.1, K57.2, K57.3, K57.4, K57.5, K57.8, K57.9",
# "K62.5, K92.0, K92.1, K92.2",
# "R50.8",
# "K30",
# "I46, I46.0, I46.9, I49.0",
# "E10.5, E11.5, E14.5, I73, I73.0, I73.8, I73.9, I79.1, I79.2, I79.8",
# "B96.8",
# "U82, U83, U84",
# "E00, E00.0, E00.1, E00.2, E00.9, E01.8, E02, E03.0, E03.1, E03.2, E03.3, E03.8, E03.9, E89.0",
# "E10, E10.0, E10.1, E10.2, E10.3, E10.3, E10.3, E10.4, E10.4, E10.6, E10.7, E10.8, E10.9, E11, E11.0, E11.1, E11.2, E11.3, E11.4, E11.6, E11.7, E11.8, E11.9, E12.3, E13, E13.1, E13.3, E13.4, E13.5, E13.6, E13.7, E13.8, E13.9, E14.9, G59.0, G63.2, H36.0, R73.0, R73.9, R81, R82.4, Z96.4",
# "I46, I46.0, I46.9, I49.0",
# "K07.6",
# "Z88.9",
# "S43.4",
# "S82.1, S82.3, S82.8",
#  "Q50.0, Q50.1, Q50.2, Q50.3, Q50.4, Q50.5, Q50.6, Q51, Q51.0, Q51.1, Q51.2, Q51.3, Q51.4, Q51.5, Q51.6, Q51.7, Q51.8, Q51.9, Q52.0, Q52.1, Q52.2, Q52.3, Q52.4, Q52.5, Q52.6, Q52.7, Q52.8, Q52.9, Q53, Q53.0, Q53.1, Q53.2, Q53.9, Q54, Q54.0, Q54.1, Q54.2, Q54.3, Q54.4, Q54.8, Q54.9, Q55, Q55.0, Q55.1, Q55.2, Q55.3, Q55.4, Q55.5, Q55.6, Q55.8, Q55.9, Q56, Q56.0, Q56.1, Q56.2, Q56.3, Q56.4, Q60, Q60.0, Q60.1, Q60.2, Q60.3, Q60.4, Q60.5, Q60.6, Q61, Q61.0, Q61.1, Q61.2, Q61.3, Q61.4, Q61.5, Q61.8, Q61.9, Q62, Q62.0, Q62.1, Q62.3, Q62.4, Q62.5, Q62.6, Q62.7, Q62.8, Q63, Q63.0, Q63.1, Q63.2, Q63.3, Q63.8, Q63.9, Q64.0, Q64.1, Q64.2, Q64.3, Q64.4, Q64.5, Q64.6, Q64.7, Q64.8, Q64.9",
# "M48.4, M48.5, M80.5, M80.8, M81.6, M81.9, M84.4, M85.9, M89.9",
# "E88.0",
# "J84.1, J84.2",
# "E23.6",
# "D89.8",
# "I44.1",
# "K31.8, K31.9",
# "L97",
# "M21.0, M21.1, M21.9",
# "R11.10",
# "F03",
# "G47.6, G25.8",
# "G43.6",
# "L49",
# "N42.3",
# "H04.1",
# "K56.6",
# "N28.8",
# "R31.2",
# "N87",
# "R09.8",
# "R53",
# "T38.0, T50.0",
# "T85.1, T85.8",
# "K91.4, K91.8, Y83.3",
# "I97.8",
# "J95.0",
# "R25.2",
# "R87.6",
# "K08.1",
# "M53.2, M53.3",
# "R92",
# "K55.2",
# "M53.0, M53.1",
# "M21.1, M21.8",
# "K07.6")
# 
# writexl::write_xlsx(data.frame(phecode, icd10_code) |> arrange(phecode),
#                     here::here("data/icd/phecode_to_icd10_manual_mapping.xlsx")
#                     )
# 
# icd10_code_origin = rep("Study PheCode (Manual Mapping)", 
#                         length(phecode))
# 
# to_add = data.frame(phecode, 
#                     icd10_code, 
#                     icd10_code_origin)
# 
# to_add = to_add |> distinct()
# 
# to_add = 
#   to_add |>
#   mutate(phecode = as.numeric(phecode))
# 
# disease_mapping =
#   rows_patch(disease_mapping,
#              to_add,
#              unmatched = "ignore")

How many diseases are not mapped yet?

print("Number of ICD10 codes obtained")
[1] "Number of ICD10 codes obtained"
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
  group_by(icd10_code_origin) |> 
  summarise(n = n(), 
            percent = n()/n_studies *100)
# A tibble: 3 × 3
  icd10_code_origin                  n percent
  <chr>                          <int>   <dbl>
1 Study PheCode (Manual Mapping)   875    1.89
2 Study PheCode Mapping           6780   14.6 
3 Study Provided                  3961    8.55
# we were able to get PheCodes or ICD10 codes directly for roughly 25% of studies 

disease_mapping_matched =
  disease_mapping |>
  filter(!is.na(icd10_code) & icd10_code != "")

not_found_diseases <- diseases[!diseases %in% disease_mapping_matched$collected_all_disease_terms] 
not_found_diseases <- not_found_diseases[not_found_diseases != ""]

print("Number of disease terms not yet mapped to any ICD10 code:")
[1] "Number of disease terms not yet mapped to any ICD10 code:"
print(length(not_found_diseases))
[1] 653
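
The filter that defines the matched rows is meant to keep only rows with a usable ICD10 code. As a sketch of how `NA` and empty strings behave under `|` and `&` inside `filter()` (toy data, invented for illustration):

```r
library(dplyr)

# Toy rows: a real code, an empty string, and a missing value.
codes <- tibble(
  term = c("a", "b", "c"),
  icd10_code = c("E11", "", NA)
)

# With "|", the empty string passes because it is not NA.
kept_or <- codes |>
  filter(icd10_code != "" | !is.na(icd10_code))

# With "&", only rows that are both non-missing and non-empty pass.
kept_and <- codes |>
  filter(!is.na(icd10_code) & icd10_code != "")

kept_or$term
kept_and$term
```

`filter()` drops rows where the condition evaluates to `NA`, so the missing value is removed either way; the two versions differ only for empty strings.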

Get ICD10 codes by matching DISEASE/TRAIT terms

Tidy DISEASE/TRAIT Column to better match terms

Remove patient genotype effects

# removing genotype effect (e.g. (fetal genotype effect))
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = 
           str_remove_all(`DISEASE/TRAIT`, 
                          "\\s*\\([^)]*genotype effect\\)")
         ) |>
    mutate(`DISEASE/TRAIT` = 
           str_remove_all(`DISEASE/TRAIT`, 
                          "\\s*\\([^)]* effect\\)")
         )
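
A minimal sketch of the first clean-up pattern on made-up trait names (hypothetical, not taken from the data): any parenthetical that ends in `genotype effect` is removed together with the whitespace before it, while other parentheticals are left alone.

```r
library(stringr)

# Hypothetical trait names.
traits <- c(
  "Pre-eclampsia (maternal genotype effect)",
  "Pre-eclampsia (fetal genotype effect)",
  "Asthma (childhood onset)"
)

cleaned <- str_remove_all(traits, "\\s*\\([^)]*genotype effect\\)")
cleaned
```

After cleaning, the first two traits collapse to the same string, which is the point of the step: they can then be matched to one disease term.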

# remove '(adjusted for APOE e4 dosage)'
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\(adjusted for APOE e4 dosage\\)"))

# remove '(maternal):' & '(paternal):'
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "\\s*\\(maternal\\):")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "\\s*\\(paternal\\):"))


# remove 'Biological Grandparent '
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "Biological Grandparent "))

# remove 'Biological Father: ', 'Biological Sibling: ', 'Biological Mother: '
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "Biological Father: ")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "Biological Sibling: ")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, "Biological Mother: "))

Removing trait specific terms

# remove:
#   (apnea hypopnea index)
#   (average respiratory event duration)
#   (micro-arousal index)
#   (percentage of N3 sleep time during total sleep time)
#   (percentage of N3 sleep time during sleep period time)
#   (average oxyhemoglobin desaturation per event)
#   (average oxyhemoglobin saturation across sleep episode)
#   (percentage sleep with oxyhemoglobin saturation less than 90%)
#   (wake time during sleep period time)
#   (minimum oxyhemoglobin saturation across sleep episode)
#   (oxygen desaturation index)
#   (average oxygen saturation during sleep)
#   (apnea hypopnea index, change over time)

disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\s*\\(apnea hypopnea index\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\s*\\(average respiratory event duration\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(micro-arousal index\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(percentage of N3 sleep time during total sleep time\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(percentage of N3 sleep time during sleep period time\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(average oxyhemoglobin desaturation per event\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(average oxyhemoglobin saturation across sleep episode\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(percentage sleep with oxyhemoglobin saturation less than 90%\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(wake time during sleep period time\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(minimum oxyhemoglobin saturation across sleep episode\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(oxygen desaturation index\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(average oxygen saturation during sleep\\)")) |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`,
                                          "\\s*\\(apnea hypopnea index, change over time\\)"))

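The removals above all share one shape: optional whitespace followed by a parenthesised term. A minimal sketch of the same clean-up driven by a character vector follows. The names `sleep_terms` and `strip_parenthetical()` are hypothetical, the term list is abbreviated to three entries, the example trait strings are made up, and the sketch assumes the terms contain no regex metacharacters (true for the list in the comments above).

```r
library(stringr)

# Hypothetical, abbreviated term list; the full list is in the comments above.
sleep_terms <- c(
  "apnea hypopnea index",
  "oxygen desaturation index",
  "apnea hypopnea index, change over time"
)

# Build one alternation: optional whitespace, "(", any listed term, ")".
strip_parenthetical <- function(x, terms) {
  pattern <- paste0("\\s*\\((?:", paste(terms, collapse = "|"), ")\\)")
  str_remove_all(x, pattern)
}

# Made-up trait strings, for illustration only.
example_traits <- c(
  "Obstructive sleep apnea (apnea hypopnea index)",
  "Sleep apnea (apnea hypopnea index, change over time)",
  "Height"
)
cleaned_traits <- strip_parenthetical(example_traits, sleep_terms)
```

Because the closing parenthesis is part of the pattern, a shorter term cannot partially consume a longer one that starts with the same words.
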
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          " during REM sleep$| during non-REM sleep$"))


# ends in  levels in coronary artery disease, make it just coronary artery disease
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = 
           ifelse(str_detect(`DISEASE/TRAIT`,
                             "(?i)\\s*levels in coronary artery disease$"),
                  "coronary artery disease",
                  `DISEASE/TRAIT`))

# ends in levels in chronic kidney disease, make it just chronic kidney disease
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = 
           ifelse(str_detect(`DISEASE/TRAIT`,
                             "(?i)\\s*levels in chronic kidney disease$"),
                  "chronic kidney disease",
                  `DISEASE/TRAIT`))

# ends in levels in type 2 diabetes, make it just type 2 diabetes
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = 
           ifelse(str_detect(`DISEASE/TRAIT`,
                             "(?i)\\s*levels in type 2 diabetes$"),
                  "type 2 diabetes",
                  `DISEASE/TRAIT`))

# ends in levels in prediabetes, make it just prediabetes
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = 
           ifelse(str_detect(`DISEASE/TRAIT`,
                             "(?i)\\s*levels in prediabetes$"),
                  "prediabetes",
                  `DISEASE/TRAIT`))
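The four "levels in <disease>" blocks above apply the same rule to different diseases. One way to express that rule once is sketched here in base R. `collapse_levels_in()` and the example trait strings are hypothetical, and the disease names are assumed to contain no regex metacharacters.

```r
# Replace any trait ending in "levels in <disease>" with the disease name.
collapse_levels_in <- function(x, diseases) {
  for (d in diseases) {
    # grepl() returns FALSE for NA, so missing traits are left untouched
    hit <- grepl(paste0("\\s*levels in ", d, "$"), x, ignore.case = TRUE)
    x[hit] <- d
  }
  x
}

diseases_with_levels <- c(
  "coronary artery disease",
  "chronic kidney disease",
  "type 2 diabetes",
  "prediabetes"
)

# Made-up trait strings, for illustration only.
example_traits <- c(
  "Protein levels in Type 2 diabetes",
  "CRP levels in coronary artery disease",
  "Height",
  NA
)
collapsed_traits <- collapse_levels_in(example_traits, diseases_with_levels)
```
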


# for "Takes medication for <condition>", keep only the first word of the condition
# (the look-behind is case-insensitive so that it agrees with str_detect)
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` =
           ifelse(str_detect(`DISEASE/TRAIT`,
                             "(?i)Takes medication for \\w+"),
                  str_extract(`DISEASE/TRAIT`,
                              "(?i)(?<=Takes medication for )\\w+"),
                  `DISEASE/TRAIT`))
                                          

# remove BMI adjustments 
# '(BMI adjusted)', or '(adjusted for BMI)'
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\s*\\(BMI adjusted\\)|\\s*\\(adjusted for BMI\\)"))

# remove a trailing 'trait'
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\s*trait$"))


# remove  (slight), (severe), (generalised), (localised)
disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_remove_all(`DISEASE/TRAIT`, 
                                          "\\s*\\(slight\\)|\\s*\\(severe\\)|\\s*\\(generalised\\)|\\s*\\(localised\\)"))

Remove white space

disease_mapping =
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = str_squish(`DISEASE/TRAIT`))

Match disease/trait to ICD 10 descriptions

matched_icd10_desc  = 
phecodes |>
    mutate(collected_all_disease_terms = tolower(iconv(ICD_DESCRIPTION, 
                                                       to = "UTF-8"))
           ) |>
    select(collected_all_disease_terms,    
           icd10_code = ICD10) 

# filter missing strings
matched_icd10_desc = 
  matched_icd10_desc |>
  filter(!is.na(collected_all_disease_terms) & 
         collected_all_disease_terms != "") |>
  filter(!is.na(icd10_code) & 
         icd10_code != "")


matched_icd10_desc =
  matched_icd10_desc |>
  group_by(collected_all_disease_terms) |>
  summarise(icd10_code = str_flatten(unique(icd10_code), 
                                     collapse = ", ", 
                                     na.rm = T), 
            .groups = "drop") 
  

matched_icd10_desc = 
  matched_icd10_desc |>
    mutate(icd10_code_origin = "ICD Description Match (DISEASE/TRAIT)")

# match by DISEASE/TRAIT
matched_icd10_desc =
  matched_icd10_desc |>
  rename(`DISEASE/TRAIT` = collected_all_disease_terms)

disease_mapping = 
disease_mapping |>
rows_patch(matched_icd10_desc,
           unmatched = "ignore")
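The repeated `rows_patch()` calls in this analysis rely on two behaviours: only `NA` cells in the target table are filled, and with `unmatched = "ignore"` keys that are absent from the target are skipped instead of raising an error. A small sketch on made-up data illustrates both. The column name `trait` and all values are invented, and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Toy target table: one trait already mapped, two still unmapped.
x <- tibble(
  trait = c("asthma", "gout", "height"),
  icd10_code = c("J45", NA, NA),
  icd10_code_origin = c("Study Provided", NA, NA)
)

# Toy lookup table: one key ("psoriasis") does not occur in x.
y <- tibble(
  trait = c("asthma", "gout", "psoriasis"),
  icd10_code = c("J45.9", "M10", "L40"),
  icd10_code_origin = "ICD Description Match (DISEASE/TRAIT)"
)

patched <- rows_patch(x, y, by = "trait", unmatched = "ignore")
# "asthma" keeps its existing code and origin, "gout" is filled in,
# "height" stays NA, and "psoriasis" is not added.
```

This is why the order of the matching steps matters: an earlier source is never overwritten by a later one.
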

print("Number of ICD10 codes obtained")
[1] "Number of ICD10 codes obtained"
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
  group_by(icd10_code_origin) |> 
  summarise(n = n(), 
            percent = n()/n_studies *100)
# A tibble: 4 × 3
  icd10_code_origin                         n percent
  <chr>                                 <int>   <dbl>
1 ICD Description Match (DISEASE/TRAIT)  3945    8.52
2 Study PheCode (Manual Mapping)          875    1.89
3 Study PheCode Mapping                  6780   14.6 
4 Study Provided                         3961    8.55

Match disease/trait to phenotypes (corresponding to PheCodes)

phenotype_icd_map =
  phecodes |>
group_by(Phenotype) |>
  summarise(icd10_code = 
            str_flatten(ICD10, 
                        collapse = ", ", 
                        na.rm = T), 
            .groups = "drop")

matched_phenotypes =
phenotype_icd_map #|>
#filter(tolower(Phenotype) %in% not_found_diseases)

matched_phenotypes =
matched_phenotypes |>
  mutate(collected_all_disease_terms = tolower(Phenotype)) |>
  select(collected_all_disease_terms, icd10_code) |>
  mutate(icd10_code_origin = "Phecode Phenotype Match (DISEASE/TRAIT)")

# match by DISEASE/TRAIT
matched_phenotypes =
  matched_phenotypes |>
  rename(`DISEASE/TRAIT` = collected_all_disease_terms)

disease_mapping =
disease_mapping |>
rows_patch(matched_phenotypes,
           unmatched = "ignore")

disease_mapping |>
  filter(icd10_code_origin == "Phecode Phenotype Match (DISEASE/TRAIT)") |>
  nrow()
[1] 580
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
  group_by(icd10_code_origin) |> 
  summarise(n = n(), 
            percent = n()/n_studies *100)
# A tibble: 5 × 3
  icd10_code_origin                           n percent
  <chr>                                   <int>   <dbl>
1 ICD Description Match (DISEASE/TRAIT)    3945    8.52
2 Phecode Phenotype Match (DISEASE/TRAIT)   580    1.25
3 Study PheCode (Manual Mapping)            875    1.89
4 Study PheCode Mapping                    6780   14.6 
5 Study Provided                           3961    8.55

How many diseases are not mapped yet?

matched =
  disease_mapping |>
  filter(icd10_code != "") |>
  pull(collected_all_disease_terms)

not_found_diseases <- diseases[!diseases %in% 
                               matched
                               ] 

not_found_diseases <- not_found_diseases[not_found_diseases != ""]

print(length(not_found_diseases))
[1] 652

Match disease/trait to UMLS

unmapped_disease_terms <-
  disease_mapping |>
  filter(is.na(icd10_code)) |>
  pull(`DISEASE/TRAIT`) |>
  unique() |>
  tolower()

# get UMLS CUIs with ICD10 mappings
umls_data <-
  data.table::fread(here::here("data/icd/2025AA/META/MRCONSO.RRF"),
                    sep = "|",
                    header = FALSE,
                    quote = "",
                    fill = TRUE,
                    na.strings = c("", "NA")
  )

colnames(umls_data)[1:18] <- c(
  "CUI","LAT","TS","LUI","STT","SUI","ISPREF",
  "AUI","SAUI","SCUI","SDUI","SAB","TTY","CODE",
  "STR","SRL","SUPPRESS","CVF"
)

umls_cuis_icd10 <-
  umls_data |>
  filter(SAB %in% c("ICD10", "ICD10CM")) |>
  select(CODE, CUI) |>
  group_by(CUI) |>
  summarise(CODE = str_flatten(unique(CODE),
                               collapse = ", ",
                               na.rm = T),
            .groups = "drop"
            ) |>
  rename(icd10_code = CODE) 



umls_icd10 <-
  umls_data |>
  filter(CUI %in% umls_cuis_icd10$CUI)

umls_icd10 =
  umls_icd10 |>
  left_join(umls_cuis_icd10,
            by = "CUI")

# overlap with umls terms
disease_trait_umls  <-
  umls_icd10 |>
  filter(tolower(STR) %in% unmapped_disease_terms) |>
  mutate(`DISEASE/TRAIT` = tolower(STR)) |>
  # mutate(CODE = ifelse(SAB %in% c("ICD10", "ICD10CM"),
  #                      CODE,
  #                      NA)) |>
  select(`DISEASE/TRAIT`,
         icd10_code) |>
  group_by(`DISEASE/TRAIT`) |>
  summarise(icd10_code = str_flatten(unique(icd10_code),
                               collapse = ", ",
                               na.rm = T),
            .groups = "drop"
            ) |>
  distinct() |>
  mutate(icd10_code_origin = "UMLS term match")

disease_mapping <-
  disease_mapping |>
  mutate(`DISEASE/TRAIT` = tolower(`DISEASE/TRAIT`)) |>
  rows_patch(disease_trait_umls,
             by = "DISEASE/TRAIT")
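The UMLS step works through concepts: every string that shares a CUI with an ICD10 or ICD10CM entry inherits that entry's codes, so synonyms from other vocabularies can be matched. A minimal sketch on a toy table with the same column names as MRCONSO follows. The CUIs, codes from other vocabularies, and terms are placeholders, not real UMLS content.

```r
library(dplyr)

# Toy stand-in for MRCONSO.RRF; all rows are invented placeholders.
mrconso <- tibble(
  CUI  = c("C001", "C001", "C002", "C003"),
  SAB  = c("ICD10", "MSH", "ICD10CM", "MSH"),
  CODE = c("J45", "X1", "M10", "X2"),
  STR  = c("Asthma", "Bronchial asthma", "Gout", "Unrelated term")
)

# 1. Collect the ICD-10 codes attached to each concept.
cui_codes <- mrconso |>
  filter(SAB %in% c("ICD10", "ICD10CM")) |>
  group_by(CUI) |>
  summarise(icd10_code = paste(unique(CODE), collapse = ", "),
            .groups = "drop")

# 2. Give every string of those concepts the concept's codes.
term_codes <- mrconso |>
  inner_join(cui_codes, by = "CUI") |>
  mutate(term = tolower(STR)) |>
  distinct(term, icd10_code)

# 3. Look up unmapped terms by exact lower-case string match.
lookup <- term_codes |>
  filter(term %in% c("bronchial asthma", "gout"))
```

In the toy data "bronchial asthma" has no ICD-10 row of its own but receives J45 through its shared concept.
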

Get ICD10 codes by matching collected_all_disease_terms terms

Match collected_all_disease_terms to ICD10 descriptions

matched_icd10_desc = 
  matched_icd10_desc |>
  rename(collected_all_disease_terms = `DISEASE/TRAIT`)

matched_icd10_desc = 
  matched_icd10_desc |>
  mutate(icd10_code_origin = "ICD Description Match (collected_all_disease_terms)")

disease_mapping =
  rows_patch(disease_mapping,
             matched_icd10_desc,
             unmatched = "ignore")
Matching, by = "collected_all_disease_terms"

How many diseases are not mapped yet?

print("Number of ICD10 codes obtained")
[1] "Number of ICD10 codes obtained"
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
    group_by(icd10_code_origin) |> 
    summarise(n = n(), 
              percent = n()/n_studies *100)
# A tibble: 7 × 3
  icd10_code_origin                                       n percent
  <chr>                                               <int>   <dbl>
1 ICD Description Match (DISEASE/TRAIT)                3945    8.52
2 ICD Description Match (collected_all_disease_terms)  9575   20.7 
3 Phecode Phenotype Match (DISEASE/TRAIT)               580    1.25
4 Study PheCode (Manual Mapping)                        875    1.89
5 Study PheCode Mapping                                6780   14.6 
6 Study Provided                                       3961    8.55
7 UMLS term match                                      4820   10.4 
matched =
  disease_mapping |>
  filter(icd10_code != "") |>
  pull(collected_all_disease_terms)

not_found_diseases <- diseases[!diseases %in% 
                               matched
                               ] 

not_found_diseases <- not_found_diseases[not_found_diseases != ""]

print(length(not_found_diseases))
[1] 455

Match collected_all_disease_terms to ICD10 codes (by phenotypes)

matched_phenotypes = 
  matched_phenotypes |>
  rename(collected_all_disease_terms = `DISEASE/TRAIT`)

matched_phenotypes = 
  matched_phenotypes |>
  mutate(icd10_code_origin = "Phecode Phenotype Match (collected_all_disease_terms)")

# match by collected_all_disease_terms
disease_mapping = 
disease_mapping |>
rows_patch(matched_phenotypes,
           unmatched = "ignore")
Matching, by = "collected_all_disease_terms"

How many diseases are not mapped yet?

print("Number of ICD10 codes obtained")
[1] "Number of ICD10 codes obtained"
disease_mapping |>
    filter(!is.na(icd10_code)) |> 
    group_by(icd10_code_origin) |> 
    summarise(n = n(), 
              percent = n()/n_studies *100)
# A tibble: 8 × 3
  icd10_code_origin                                         n percent
  <chr>                                                 <int>   <dbl>
1 ICD Description Match (DISEASE/TRAIT)                  3945    8.52
2 ICD Description Match (collected_all_disease_terms)    9575   20.7 
3 Phecode Phenotype Match (DISEASE/TRAIT)                 580    1.25
4 Phecode Phenotype Match (collected_all_disease_terms)  2194    4.74
5 Study PheCode (Manual Mapping)                          875    1.89
6 Study PheCode Mapping                                  6780   14.6 
7 Study Provided                                         3961    8.55
8 UMLS term match                                        4820   10.4 
matched =
  disease_mapping |>
  filter(icd10_code != "") |>
  pull(collected_all_disease_terms)

not_found_diseases <- diseases[!diseases %in% 
                               matched
                               ] 

not_found_diseases <- not_found_diseases[not_found_diseases != ""]

print(length(not_found_diseases))
[1] 442

Match collected_all_disease_terms to UMLS

unmapped_terms <-
  disease_mapping |>
  filter(is.na(icd10_code)) |>
  pull(collected_all_disease_terms) |>
  unique()

# overlap with umls terms
collected_trait_umls  <-
  umls_icd10 |>
  filter(tolower(STR) %in% unmapped_terms) |>
  mutate(collected_all_disease_terms = tolower(STR)) |>
  # mutate(CODE = ifelse(SAB %in% c("ICD10", "ICD10CM"),
  #                      CODE,
  #                      NA)) |>
  select(collected_all_disease_terms,
         icd10_code) |>
  group_by(collected_all_disease_terms) |>
  summarise(icd10_code = str_flatten(unique(icd10_code),
                               collapse = ", ",
                               na.rm = T),
            .groups = "drop"
            ) |>
  distinct() |>
  mutate(icd10_code_origin = "UMLS term match (collected_all_disease_terms)")

disease_mapping <-
  disease_mapping |>
  rows_patch(collected_trait_umls,
             by = "collected_all_disease_terms")

Manual mapping of GWAS traits to ICD-10

manual_icd10_map <-
  readxl::read_xlsx(here::here("data/icd/manual_disease_icd10_mappings.xlsx"))

manual_icd10_map =
  manual_icd10_map |>
  select(collected_all_disease_terms = mapped_trait, 
         icd10_code) |>
  mutate(collected_all_disease_terms = stringr::str_squish(tolower(collected_all_disease_terms))) |>
  mutate(icd10_code_origin = "Manual Mapping (collected_all_disease_terms)")

# disease_mapping =
#   bind_rows(disease_mapping, to_add) |>
#   distinct()
disease_mapping =
  rows_patch(disease_mapping,
             manual_icd10_map,
             unmatched = "ignore")
Matching, by = "collected_all_disease_terms"
disease_mapping |>
  filter(icd10_code_origin == "Manual Mapping (collected_all_disease_terms)") |>
  nrow()
[1] 1172
# repeat for `DISEASE/TRAIT`
manual_icd10_map =
  manual_icd10_map |>
  select(`DISEASE/TRAIT` = collected_all_disease_terms,
         icd10_code) |>
  mutate(icd10_code_origin = "Manual Mapping (DISEASE/TRAIT)")

disease_mapping =
  rows_patch(disease_mapping |> mutate(`DISEASE/TRAIT` = tolower(`DISEASE/TRAIT`)),
             manual_icd10_map,
             unmatched = "ignore")
Matching, by = "DISEASE/TRAIT"

Additional manual mapping

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "toxicity" & 
                            grepl("Anthracycline-induced cardiotoxicity|Trastuzumab-induced cardiotoxicity", `DISEASE/TRAIT`, ignore.case = T),
                            "I42.7, T45.1",
                             icd10_code)) |>
mutate(icd10_code_origin = 
         ifelse(collected_all_disease_terms == "toxicity" & 
                  grepl("Anthracycline-induced cardiotoxicity|Trastuzumab-induced cardiotoxicity", `DISEASE/TRAIT`, ignore.case = T),
                "Manual Mapping (from DISEASE/TRAIT)",
                icd10_code_origin)
         )

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "toxicity" & 
                            grepl("Induration at the injection site after COVID-19", `DISEASE/TRAIT`, ignore.case = T),
                            "R23.4",
                             icd10_code)) |>
  mutate(collected_all_disease_terms = 
           ifelse(collected_all_disease_terms == "toxicity" & 
                    grepl("Induration at the injection site after COVID-19", `DISEASE/TRAIT`, ignore.case = T),
                  "induration of skin",
                  collected_all_disease_terms)) |>
  mutate(icd10_code_origin =
           ifelse(collected_all_disease_terms == "induration of skin",
                  "Manual Mapping (from DISEASE/TRAIT)",
                  icd10_code_origin)
         )


# N64.5
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "toxicity" & 
                            grepl("Induration \\(>= grade 2\\) in breast cancer treated with radiotherapy", 
                                  `DISEASE/TRAIT`, ignore.case = T),
                            "N64.5",
                             icd10_code)) |>
  mutate(collected_all_disease_terms = 
           ifelse(collected_all_disease_terms == "toxicity" & 
                    grepl("Induration \\(>= grade 2\\) in breast cancer treated with radiotherapy", 
                          `DISEASE/TRAIT`, ignore.case = T),
                  "induration of breast",
                  collected_all_disease_terms)) |>
  mutate(icd10_code_origin =
           ifelse(collected_all_disease_terms == "induration of breast",
                  "Manual Mapping (from DISEASE/TRAIT)",
                  icd10_code_origin)
         )


# T45.1
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "toxicity" & 
                            grepl("Nivolumab-induced immune-related adverse events in cancer|Response to immune checkpoint inhibitors in melanoma", 
                                  `DISEASE/TRAIT`, ignore.case = T),
                            "T45.1",
                             icd10_code)) |>
  mutate(icd10_code_origin =
           ifelse(collected_all_disease_terms == "toxicity" & 
                    grepl("Nivolumab-induced immune-related adverse events in cancer|Response to immune checkpoint inhibitors in melanoma", 
                          `DISEASE/TRAIT`, ignore.case = T),
                  "Manual Mapping (from DISEASE/TRAIT)",
                  icd10_code_origin)
         )


disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "toxicity" & 
                            grepl("Methotrexate-related central neurotoxicity in children treated for acute lymphoblastic leukemia", 
                                  `DISEASE/TRAIT`, ignore.case = T),
                            "G92, T45.1",
                             icd10_code)) |>
  mutate(icd10_code_origin =
           ifelse(collected_all_disease_terms == "toxicity" & 
                    grepl("Methotrexate-related central neurotoxicity in children treated for acute lymphoblastic leukemia", 
                          `DISEASE/TRAIT`, ignore.case = T),
                  "Manual Mapping (from DISEASE/TRAIT)",
                  icd10_code_origin)
         )

## if DISEASE/TRAIT == Abnormalities of forces of labour
## and collected_all_disease_terms == abnormal delivery
## if icd10_code is missing, map to O62
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "abnormal delivery" & 
                            grepl("^Abnormalities of forces of labour$", 
                                  `DISEASE/TRAIT`, 
                                  ignore.case = T) &
                            icd10_code == "",
                            "O62",
                             icd10_code)) |>
  mutate(icd10_code_origin = 
         ifelse(collected_all_disease_terms == "abnormal delivery" & 
                  grepl("^Abnormalities of forces of labour$", 
                        `DISEASE/TRAIT`, 
                        ignore.case = T),
                "Manual Mapping (from DISEASE/TRAIT)",
                icd10_code_origin)
         )

## if DISEASE/TRAIT == Ischemic heart disease
## and collected_all_disease_terms == heart disease
## if icd10_code is missing, map to "I20-I21.6, I21.9-I25.9, Z82.4-Z82.49"
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "heart disease" & 
                            grepl("^Ischemic heart disease$", 
                                  `DISEASE/TRAIT`, 
                                  ignore.case = T) &
                            icd10_code == "",
                            "I20-I21.6, I21.9-I25.9, Z82.4-Z82.49",
                             icd10_code)) |>
  mutate(icd10_code_origin = 
         ifelse(collected_all_disease_terms == "heart disease" & 
                  grepl("^Ischemic heart disease$", 
                        `DISEASE/TRAIT`, 
                        ignore.case = T),
                "Manual Mapping (from DISEASE/TRAIT)",
                icd10_code_origin)
         )


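# if collected_all_disease_terms == "heart disease"
# and icd10_code is missing, map to C34
# NOTE: C34 is the ICD-10 category for malignant neoplasm of bronchus and lung,
# so this fallback should be checked against the intended heart disease code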
disease_mapping =
  disease_mapping |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "heart disease" &
                            icd10_code == "",
                            "C34",
                             icd10_code)) |>
  mutate(icd10_code_origin = 
         ifelse(collected_all_disease_terms == "heart disease" &
                  icd10_code == "C34",
                "Manual Mapping (collected_all_disease_terms)",
                icd10_code_origin)
         )

How many diseases are not mapped yet?

# disease_mapping =
#   disease_mapping |>
#   filter(icd10_code != "")

matched <- c(disease_mapping_matched$collected_all_disease_terms,
             matched_phenotypes$collected_all_disease_terms,
             to_add$collected_all_disease_terms,
             manual_icd10_map$collected_all_disease_terms)
Warning: Unknown or uninitialised column: `collected_all_disease_terms`.
Unknown or uninitialised column: `collected_all_disease_terms`.
not_found_diseases <- diseases[!diseases %in% matched] 
not_found_diseases <- not_found_diseases[not_found_diseases != ""]

print(length(not_found_diseases))
[1] 617

Inferring ICD10 codes from similar study ICD10 codes

# similar studies:
study_icd_map =
disease_mapping |>
filter(!is.na(icd10_code)) |>
filter(icd10_code_origin == "Study Provided" | icd10_code_origin == "Study PheCode Mapping") 

study_icd_map =
  study_icd_map |>
  select(collected_all_disease_terms, icd10_code) |>
  distinct()

# remove Z95.1 and Z95.5 from coronary artery disease
study_icd_map =
  study_icd_map |>
  mutate(icd10_code = ifelse(collected_all_disease_terms == "coronary artery disease",
                             str_replace_all(icd10_code, "Z95\\.1|Z95\\.5", ""),
                             icd10_code))

# assume ovarian cancer terms more specific than 
# Malignant neoplasm of ovary and other uterine adnexa (PheCode 184.1)
# study_icd_map =
#   study_icd_map |>
#   filter(!( collected_all_disease_terms == "ovarian cancer" &&phecode == "184.10")
#   )

study_icd_map = 
  study_icd_map |>
  mutate(collected_all_disease_terms = str_trim(collected_all_disease_terms)) |>
  mutate(collected_all_disease_terms = str_remove_all(collected_all_disease_terms, "^, ")) 
  

study_icd_map =
  study_icd_map |>
  filter(icd10_code != "" & !is.na(icd10_code))


study_icd_map =
  study_icd_map |>
  tidyr::separate_longer_delim(icd10_code, delim = ", ") |>
  tidyr::separate_longer_delim(icd10_code, delim = ",")

study_icd_map =
  study_icd_map |>
  group_by(collected_all_disease_terms) |>
  summarise(icd10_code = str_flatten(unique(sort(icd10_code)),
                                     collapse = ", ", 
                                     na.rm = T), 
            .groups = "drop")
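Blanking out a code inside a comma-separated string can leave empty elements behind once the string is split, and inconsistent spacing after commas produces near-duplicates. A base R sketch of a normalisation step that trims, drops empty pieces, de-duplicates, and sorts is given here. `normalise_codes()` is a hypothetical helper and the input string is made up.

```r
# Normalise one comma-separated code string.
normalise_codes <- function(code_string) {
  parts <- trimws(strsplit(code_string, ",", fixed = TRUE)[[1]])
  parts <- parts[!is.na(parts) & parts != ""]  # drop empty leftovers
  paste(sort(unique(parts)), collapse = ", ")
}

# Made-up input with an empty element, a missing space, and a duplicate.
normalised <- normalise_codes("I25, , Z95.1,I25")
```
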

study_icd_map = 
  study_icd_map |>
  mutate(icd10_code_origin = "Inferred from similar studies")

disease_mapping = 
  rows_patch(disease_mapping,
             study_icd_map,
             unmatched = "ignore")
Matching, by = "collected_all_disease_terms"
disease_mapping |>
  filter(is.na(icd10_code)) |>
  nrow()
[1] 495

How were the ICD-10 codes obtained?

disease_mapping  |> 
group_by(icd10_code_origin) |> 
summarise(n = n()) |> 
arrange(desc(n))
# A tibble: 13 × 2
   icd10_code_origin                                         n
   <chr>                                                 <int>
 1 UMLS term match (collected_all_disease_terms)         10995
 2 ICD Description Match (collected_all_disease_terms)    9575
 3 Study PheCode Mapping                                  6780
 4 UMLS term match                                        4820
 5 Study Provided                                         3961
 6 ICD Description Match (DISEASE/TRAIT)                  3945
 7 Phecode Phenotype Match (collected_all_disease_terms)  2194
 8 Manual Mapping (collected_all_disease_terms)           1172
 9 Inferred from similar studies                           899
10 Study PheCode (Manual Mapping)                          875
11 Phecode Phenotype Match (DISEASE/TRAIT)                 580
12 <NA>                                                    495
13 Manual Mapping (DISEASE/TRAIT)                           25

Saving disease mapping

Add description of ICD10 codes to mapping table

Prepare / fix disease mappings

Fixing cases where a single code is written as a range:

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = sub("^([^-]+)-\\1$", "\\1", icd10_code))
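The pattern in the `sub()` call uses a backreference: `\\1` must repeat exactly what the first group captured, so only a range whose two ends are identical is collapsed. A short illustration on made-up inputs:

```r
# "I10-I10" is a degenerate range; "I20-I25" is a real one; "C50" has no hyphen.
collapsed <- sub("^([^-]+)-\\1$", "\\1", c("I10-I10", "I20-I25", "C50"))
```

Only the first element changes (to `I10`); genuine ranges and single codes pass through untouched.
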

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "C50-C50.9",
    replacement = "C50"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F30-F39.9",
    replacement = "F30"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F99-F99.9",
    replacement = "F99"
    )
  )

Fixing multiple ICD10 codes missing commas

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F40 F41 F42",
    replacement = "F40, F41, F42"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G47.8. G47.9",
    replacement = "G47.8, G47.9"
    )
  )

Correcting for letters after numbers in ICD-10 codes

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "J09.X",
    replacement = "J09"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G40.A",
    replacement = "G40"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G20.A1",
    replacement = "G20"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G20.C",
    replacement = "G20"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F32.A",
    replacement = "F32"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "S06.0X",
    replacement = "S06.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "S06.0XA",
    replacement = "S06.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "S06.0A",
    replacement = "S06.0"
    )
    ) 
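The suffix corrections above can also be kept in one lookup table, since `str_replace_all()` accepts a named character vector of pattern = replacement pairs and applies them in order. The sketch below is one possible arrangement, not the code used to produce the results. `cm_suffix_fixes` is a hypothetical name, the dots are escaped so they match literally, and the longer `S06.0XA` pattern is listed before `S06.0X` so it is tried first.

```r
library(stringr)

# Pattern (regex, dots escaped) = replacement; applied top to bottom.
cm_suffix_fixes <- c(
  "J09\\.X"   = "J09",
  "G40\\.A"   = "G40",
  "G20\\.A1"  = "G20",
  "G20\\.C"   = "G20",
  "F32\\.A"   = "F32",
  "S06\\.0XA" = "S06.0",
  "S06\\.0X"  = "S06.0",
  "S06\\.0A"  = "S06.0"
)

fixed_codes <- str_replace_all(
  c("J09.X", "S06.0XA", "G20.A1, G20.C"),
  cm_suffix_fixes
)
```
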

Correcting ranges created by UMLS codes:

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "A15-A19.9|A15-A19",
    replacement = paste0("A", 15:19, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "A50-A64.9|A50-A64",
    replacement = paste0("A", 50:64, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "B20-B24.9",
    replacement = paste0("B", 20:24, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "C60-C63.9|C60-C63",
    replacement = paste0("C", 60:63, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "D80-D89",
    replacement = paste0("D", 80:89, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E00-E07.9|E00-E07",
    replacement = paste0("E0", 0:7, collapse = ", ")
    )
    ) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E08-E13",
    replacement = "E08, E09, E10, E11, E12, E13"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F30-F39",
    replacement = paste0("F", 30:39, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G00-G99.9|G00-G99",
    # build one vector of all codes first, then collapse once,
    # so that G09 and G10 are separated like every other pair
    replacement = paste(c(paste0("G0", 0:9),
                          paste0("G", 10:99)),
                        collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "I20-I25.9|I20-I25",
    replacement = paste0("I", 20:25, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "I60-I69.9|I60-I69",
    replacement = paste0("I", 60:69, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "K40-K46.9|K40-K46",
    replacement = paste0("K", 40:46, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "L00-L08.9|L00-L08",
    replacement = paste0("L0", 0:8, collapse = ", ")
    )
    ) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "L00-L99.9|L00-L99",
    # build one vector of all codes first, then collapse once,
    # so that L09 and L10 are separated like every other pair
    replacement = paste(c(paste0("L0", 0:9),
                          paste0("L", 10:99)),
                        collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "N20-N23.9|N20-N23",
    replacement = paste0("N", 20:23, collapse = ", ")
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "X71-X83",
    replacement = paste0("X", 71:83, collapse = ", ")
    )
    ) 
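Each replacement above expands a range of three-character categories into an explicit list. A small helper can generate these lists with zero-padding handled by `sprintf()`. `expand_icd_range()` is a hypothetical name, and the sketch assumes both ends of the range share the same letter.

```r
# Expand e.g. ("E", 0, 7) into "E00, E01, ..., E07".
expand_icd_range <- function(letter, from, to) {
  paste(sprintf("%s%02d", letter, from:to), collapse = ", ")
}

a_range <- expand_icd_range("A", 15, 19)
e_range <- expand_icd_range("E", 0, 2)
g_range <- expand_icd_range("G", 0, 99)
```

Because the codes are formatted as one vector and collapsed once, single-digit and double-digit categories need no separate handling.
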

Converting ICD-10-CM codes to WHO ICD-10

disease_mapping =
  disease_mapping |>
  mutate(icd10_code = str_split(icd10_code, ",\\s*")) |>
  tidyr::unnest(icd10_code) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "B37.49",
    replacement = "B37.4"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E05.90",
    replacement = "E05.9"
    )
    ) 
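In these patterns the `.` is a regex wildcard, so it matches any character, not only a literal dot. The base R sketch below shows the difference on one real-looking code and one deliberately malformed string (`E05990` is invented for the illustration); `fixed = TRUE` treats the pattern as a literal string.

```r
codes <- c("E05.90", "E05990")

# Regex: "." matches the "9" in the malformed string as well.
regex_result <- sub("E05.90", "E05.9", codes)

# Literal: only the string that really contains "E05.90" is changed.
literal_result <- sub("E05.90", "E05.9", codes, fixed = TRUE)
```

With stringr the literal form is `fixed("E05.90")`; escaping the dot as `"E05\\.90"` has the same effect for a regex pattern.
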

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E11.319|E11.31|E11.329.9|E11.329.|E11.32|E11.3.",
    replacement = "E11.3"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E11.3.",
    replacement = "E11.3"
    )
  )
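Most replacements in this section shorten an ICD-10-CM code to its first digit after the decimal point. A generic version of that rule is sketched here as an alternative to listing codes one by one. It is not the method used for the results above, `to_who_icd10()` is a hypothetical name, and it assumes the target is a letter, two digits, and at most one further digit. It does not reproduce every explicit rule in this section (for example, it cannot insert the missing decimal point in `F430`).

```r
# Keep the three-character category plus, if present, one digit after the dot.
to_who_icd10 <- function(code) {
  sub("^([A-Z][0-9]{2}(\\.[0-9])?).*$", "\\1", code, perl = TRUE)
}

truncated <- to_who_icd10(c("E11.319", "S06.0XA", "G20.A1", "E05.9", "C50"))
```

Codes that do not start with a letter and two digits are returned unchanged, so the explicit corrections would still be needed for malformed entries.
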

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E13.621.|E13.62",
    replacement = "E11.3"
    )
  )

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "E78.00",
    replacement = "E78.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F10.10",
    replacement = "F10.1"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F17.201.|F17.20",
    replacement = "F17.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "F430",
    replacement = "F43.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "G47.00",
    replacement = "G47.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H01.00|H01.09.",
    replacement = "H01.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H01.09.",
    replacement = "H01.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H029",
    replacement = "H02.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H60.90",
    replacement = "H60.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H91.8X9.",
    replacement = "H91.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H91.90",
    replacement = "H91.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "H93.299.|H93.29",
    replacement = "H93.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "I70.20",
    replacement = "I70.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "I82.409.|I82.40|I82.4",
    replacement = "I82"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "J30.9",
    replacement = "J30.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "J45.998.|J45.909.|J45.901.|J45.99|J45.90",
    replacement = "J45.9"
    )
  )

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "J98.457.6",
    replacement = "J98.4"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "K05.30-31",
    replacement = "K05.3"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "K29.70",
    replacement = "K29.7"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "K59.00",
    replacement = "K59.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "L08.89",
    replacement = "L08.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M06.99|M06.90",
    replacement = "M06.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M10.99",
    replacement = "M10.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M13.90|M13.94|M13.96|M13.97|M13.99",
    replacement = "M13.9"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M19.07",
    replacement = "M19.0"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M19.90|M19.91|M19.94|M19.97|M19.99",
    replacement = "M19.9"
    )
  )



disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M25.50|M25.51|M25.55|M25.569.|M25.56|M25.571.|M25.57",
    replacement = "M25.5"
    )
  )

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M25.76|M25.77",
    replacement = "M25.7"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M43.16",
    replacement = "M43.1"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M47.80|M47.82|M47.86",
    replacement = "M47.8"
    )
  )

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M47.92|M47.96",
    replacement = "M47.9"
    )
  )

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M48.02|M48.06",
    replacement = "M48.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M54.22",
    replacement = "M54.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M54.30|M54.39",
    replacement = "M54.3"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M54.56|M54.57|M54.59",
    replacement = "M54.5"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M54.99",
    replacement = "M54.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M65.34",
    replacement = "M65.3"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M65.96",
    replacement = "M65.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M72.04",
    replacement = "M72.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M79.09",
    replacement = "M79.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M79.65|M79.66|M79.67",
    replacement = "M79.6"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M79.79",
    replacement = "M79.7"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M79.86",
    replacement = "M79.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "M81.99",
    replacement = "M81.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "N390",
    replacement = "N39.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "N50.89",
    replacement = "N50.8"
    )
    ) 

# "N52" also matches inside "N52.9", so one pass rewrites both N52 and N52.9
disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "N52",
    replacement = "F52"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "N814",
    replacement = "N81.4"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "P29.12",
    replacement = "P29.1"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R06.00|R06.09",
    replacement = "R06.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R06.83",
    replacement = "R06.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R07.89",
    replacement = "R07.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R07.8|R07.9",
    replacement = "R07"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R09.89",
    replacement = "R09.8"
    )
    ) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R10.30",
    replacement = "R10.3"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R10.8|R10.9",
    replacement = "R10"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R11.0",
    replacement = "R11"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R13.10",
    replacement = "R13"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R13.1",
    replacement = "R13"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R14.0",
    replacement = "R14"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R19.7",
    replacement = "R19"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R29.898.|R29.89",
    replacement = "R29.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R31.29",
    replacement = "R31.2"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R31.2|R31.9",
    replacement = "R31"
    )
    ) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R53.82|R53.83|R5382",
    replacement = "R53.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R53.8",
    replacement = "R53"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R56.9",
    replacement = "R56"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R73.09|R73.02",
    replacement = "R73.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R80.9",
    replacement = "R80"
    )
    ) 
 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R82.998.|R82.99",
    replacement = "R82.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R871",
    replacement = "R87.1"
    )
    ) 

# Collapse R87.61 and R87.6 to R87 in one pass. The longer alternative comes
# first so that "R87.61" is not clipped to "R871" by the shorter "R87.6" rule.
disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R87.61|R87.6",
    replacement = "R87"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T50.905.",
    replacement = "T50.9"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T78.40",
    replacement = "T78.4"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T780",
    replacement = "T78.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T81.149.88",
    replacement = "T81.1"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T81.815.013.",
    replacement = "T81.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T84.84",
    replacement = "T84.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "T887",
    replacement = "T88.7"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "U80",
    replacement = "U82"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "W44.9",
    replacement = "W44"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z86.79",
    replacement = "Z86.7"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z87.09",
    replacement = "Z87.0"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z87.39",
    replacement = "Z87.3"
    )
    ) 


disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z87.42",
    replacement = "Z87.4"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z87.828.",
    replacement = "Z87.8"
    )
    ) 

disease_mapping = 
  disease_mapping |> 
    mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "Z87.891.",
    replacement = "Z87.8"
    )
    ) 




disease_mapping =
  disease_mapping |>
  distinct()
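
The long chain of single replacements above can also be written as one lookup table. This is a sketch under stated assumptions rather than the method used here: `icd10cm_to_who` is a hypothetical name, it holds only a few of the rules, and the dots are escaped so that `.` matches a literal dot instead of any character. `str_replace_all()` accepts a named vector of `pattern = replacement` pairs and applies them in order, so more specific patterns should come first.

```r
library(stringr)

# Illustrative subset of the rules, written as pattern = replacement.
# "R87\\.61" is listed before "R87\\.6" so the longer code is handled first.
icd10cm_to_who <- c(
  "B37\\.49" = "B37.4",
  "E05\\.90" = "E05.9",
  "R87\\.61" = "R87",
  "R87\\.6"  = "R87"
)

codes     <- c("B37.49", "E05.90", "R87.61", "R87.6", "I10")
converted <- str_replace_all(codes, icd10cm_to_who)
converted

# Ordering pitfall: with the shorter pattern first, "R87.61" loses only its
# "R87.6" prefix and the trailing "1" is left behind
clipped <- str_replace_all("R87.61", c("R87\\.6" = "R87", "R87\\.61" = "R87.6"))
clipped  # "R871"
```

Keeping the rules in one vector makes duplicates and ordering problems easier to spot than in separate `mutate()` calls.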

Add ICD10 descriptions

Other ICD10 descriptions to add

other_icd10_desc = 
  data.frame(
    icd10_code = c("A09.9", 
                  "A41",
                  "B95",
                  "B96",
                  "B97",
                  "B98",
                  "C44",
                  "C79",
                  "C80.0",
                  "C80.9",
                  "C90",
                  "D07",
                  "D35",
                  "D37",
                  "D47",
                  "D48",
                  "E03",
                  "E04",
                  "E21",
                  "E23",
                  "E27",
                  "E61",
                  "E80",
                  "E87",
                  "E89",
                  "F05",
                  "F17",
                  "F41",
                  "F43.0",
                  "F53",
                  "G31",
                  "G45",
                  "G46",
                  "G62",
                  "G93",
                  "G99",
                  "H47",
                  "H50",
                  "H57",
                  "H83",
                  "H91",
                  "H92",
                  "H93",
                  "I08",
                  "I27",
                  "I27.2",
                  "I44",
                  "I45",
                  "I48.0",
                  "I49",
                  "I51",
                  "I62",
                  "I71",
                  "J15",
                  "J16",
                  "J34",
                  "J38",
                  "J69",
                  "J84",
                  "J95",
                  "J96",
                  "K05",
                  "K07",
                  "K08",
                  "K09",
                  "K10",
                  "K12",
                  "K13",
                  "K22",
                  "K22.7",
                  "K52",
                  "K56",
                  "K64",
                  "K64.0",
                  "K64.9",
                  "K74",
                  "K75",
                  "K85.9",
                  "K86",
                  "K91",
                  "K91.8",
                  "K92",
                  "L02",
                  "L13",
                  "L30",
                  "L65",
                  "L73",
                  "L85",
                  "L89.1",
                  "L98",
                  "M05",
                  "M06",
                  "M07",
                  "M11",
                  "M13",
                  "M18",
                  "M19",
                  "M24",
                  "M25",
                  "M31",
                  "M35",
                  "M43",
                  "M48",
                  "M53",
                  "M62",
                  "M66",
                  "M67",
                  "M71",
                  "M77",
                  "M79",
                  "M79.7",
                  "M80",
                  "M81",
                  "M85",
                  "M89",
                  "M96",
                  "N02",
                  "N18.3",
                  "N18.4",
                  "N28",
                  "N39",
                  "N48",
                  "N73",
                  "N76",
                  "N88",
                  "N89",
                  "N91",
                  "N92",
                  "N93",
                  "N94",
                  "N99",
                  "O14",
                  "O04",
                  "O26",
                  "O32",
                  "O34",
                  "O36",
                  "O68",
                  "O75",
                  "O99",
                  "R03",
                  "R07",
                  "R09",
                  "R10",
                  "R19",
                  "R29",
                  "R29.6",
                  "R39",
                  "R40",
                  "R47",
                  "R57",
                  "T95.8",
                  "W44",
                  "Y95",
                  "Z86.3",
                  "Z87.3",
                  "Z87.4",
                  "Z87.7",
                  "Z87.8",
                  "T85.8",
                  "Z91.0",
                  "Z88.9",
                  "Z88.8",
                  "Z88",
                  "Z92.6",
                  "N90",
                  "U82"
                  ),
    icd10_description = c("Gastroenteritis and colitis of unspecified origin",
                         "Other sepsis",
                         "Streptococcus and staphylococcus as the cause of diseases classified to other chapters",
                         "Other specified bacterial agents as the cause of diseases classified to other chapters",
                         "Viral agents as the cause of diseases classified to other chapters",
                         "Other specified infectious agents as the cause of diseases classified to other chapters",
                         "Other malignant neoplasms of skin",
                         "Secondary malignant neoplasm of other and unspecified sites",
                         "Malignant neoplasm, primary site unknown, so stated",
                         "Malignant neoplasm, primary site unspecified",
                         "Multiple myeloma and malignant plasma cell neoplasms",
                         "Carcinoma in situ of other and unspecified genital organs",
                         "Benign neoplasm of other and unspecified endocrine glands",
                         "Neoplasm of uncertain or unknown behaviour of oral cavity and digestive organs",
                         "Other neoplasms of uncertain or unknown behaviour of lymphoid, haematopoietic and related tissue",
                         "Neoplasm of uncertain or unknown behaviour of other and unspecified sites",
                         "Other hypothyroidism",
                         "Other nontoxic goitre",
                         "Hyperparathyroidism and other disorders of parathyroid gland",
                         "Hypofunction and other disorders of pituitary gland",
                         "Other disorders of adrenal gland",
                         "Deficiency of other nutrient elements",
                         "Disorders of porphyrin and bilirubin metabolism",
                         "Other disorders of fluid, electrolyte and acid-base balance",
                         "Postprocedural endocrine and metabolic disorders, not elsewhere classified",
                         "Delirium, not induced by alcohol and other psychoactive substances",
                         "Mental and behavioural disorders due to use of tobacco",
                         "Other anxiety disorders",
                         "Acute stress reaction",
                         "Mental and behavioural disorders associated with the puerperium, not elsewhere classified",
                         "Other degenerative diseases of nervous system, not elsewhere classified",
                         "Transient cerebral ischaemic attacks and related syndromes",
                         "Vascular syndromes of brain in cerebrovascular diseases",
                         "Other polyneuropathies",
                         "Other disorders of brain",
                         "Other disorders of nervous system in diseases classified elsewhere",
                         "Other disorders of optic [2nd] nerve and visual pathways",
                         "Other strabismus",
                         "Other disorders of eye and adnexa",
                         "Other diseases of inner ear",
                         "Other hearing loss",
                         "Otalgia and effusion of ear",
                         "Other disorders of ear, not elsewhere classified",
                         "Multiple valve diseases",
                         "Other pulmonary heart diseases",
                         "Other secondary pulmonary hypertension",
                         "Atrioventricular and left bundle-branch block",
                         "Other conduction disorders",
                         "Paroxysmal atrial fibrillation",
                         "Other cardiac arrhythmias",
                         "Complications and ill-defined descriptions of heart disease",
                         "Other nontraumatic intracranial haemorrhage",
                         "Aortic aneurysm and dissection",
                         "Bacterial pneumonia, not elsewhere classified",
                         "Pneumonia due to other infectious organisms, not elsewhere classified",
                         "Other disorders of nose and nasal sinuses",
                         "Diseases of vocal cords and larynx, not elsewhere classified",
                         "Pneumonitis due to solids and liquids",
                         "Other interstitial pulmonary diseases",
                         "Postprocedural respiratory disorders, not elsewhere classified",
                         "Respiratory failure, not elsewhere classified",
                         "Gingivitis and periodontal diseases",
                         "Dentofacial anomalies [including malocclusion]",
                         "Other disorders of teeth and supporting structures",
                         "Cysts of oral region, not elsewhere classified",
                         "Other diseases of jaws",
                         "Stomatitis and related lesions",
                         "Other diseases of lip and oral mucosa",
                         "Other diseases of oesophagus",
                         "Barrett oesophagus",
                         "Other noninfective gastroenteritis and colitis",
                         "Paralytic ileus and intestinal obstruction without hernia",
                         "Haemorrhoids and perianal venous thrombosis",
                         "First degree haemorrhoids",
                         "Haemorrhoids, unspecified",
                         "Fibrosis and cirrhosis of liver",
                         "Other inflammatory liver diseases",
                         "Acute pancreatitis, unspecified",
                         "Other diseases of pancreas",
                         "Postprocedural disorders of digestive system, not elsewhere classified",
                         "Other postprocedural disorders of digestive system, not elsewhere classified",
                         "Other diseases of digestive system",
                         "Cutaneous abscess, furuncle and carbuncle",
                         "Other bullous disorders",
                         "Other dermatitis",
                         "Other nonscarring hair loss",
                         "Other follicular disorders",
                         "Other epidermal thickening",
                         "Stage II decubitus ulcer",
                         "Other disorders of skin and subcutaneous tissue, not elsewhere classified",
                         "Seropositive rheumatoid arthritis",
                         "Other rheumatoid arthritis",
                         "Psoriatic and enteropathic arthropathies",
                         "Other crystal arthropathies",
                         "Other arthritis",
                         "Arthrosis of first carpometacarpal joint",
                         "Other arthrosis",
                         "Other specific joint derangements",
                         "Other joint disorders, not elsewhere classified",
                         "Other necrotizing vasculopathies",
                         "Other systemic involvement of connective tissue",
                         "Other deforming dorsopathies",
                         "Other spondylopathies",
                         "Other dorsopathies, not elsewhere classified",
                         "Other disorders of muscle",
                         "Spontaneous rupture of synovium and tendon",
                         "Other disorders of synovium and tendon",
                         "Other bursopathies",
                         "Other enthesopathies",
                         "Other soft tissue disorders, not elsewhere classified",
                         "Fibromyalgia",
                         "Osteoporosis with pathological fracture",
                         "Osteoporosis without pathological fracture",
                         "Other disorders of bone density and structure",
                         "Other disorders of bone",
                         "Postprocedural musculoskeletal disorders, not elsewhere classified",
                         "Recurrent and persistent haematuria",
                         "Chronic kidney disease, stage 3",
                         "Chronic kidney disease, stage 4",
                         "Other disorders of kidney and ureter, not elsewhere classified",
                         "Other disorders of urinary system",
                         "Other disorders of penis",
                         "Other female pelvic inflammatory diseases",
                         "Other inflammation of vagina and vulva",
                         "Other noninflammatory disorders of cervix uteri",
                         "Other noninflammatory disorders of vagina",
                         "Absent, scanty and rare menstruation",
                         "Excessive, frequent and irregular menstruation",
                         "Other abnormal uterine and vaginal bleeding",
                         "Pain and other conditions associated with female genital organs and menstrual cycle",
                         "Postprocedural disorders of genitourinary system, not elsewhere classified",
                         "Pre-eclampsia",
                         "Medical abortion",
                         "Maternal care for other conditions predominantly related to pregnancy",
                         "Maternal care for known or suspected malpresentation of fetus",
                         "Maternal care for known or suspected abnormality of pelvic organs",
                         "Maternal care for other known or suspected fetal problems",
                         "Labour and delivery complicated by fetal stress [distress]",
                         "Other complications of labour and delivery, not elsewhere classified",
                         "Other maternal diseases classifiable elsewhere but complicating pregnancy, childbirth and the puerperium",
                         "Abnormal blood-pressure reading, without diagnosis",
                         "Pain in throat and chest",
                         "Other symptoms and signs involving the circulatory and respiratory systems",
                         "Abdominal and pelvic pain",
                         "Other symptoms and signs involving the digestive system and abdomen",
                         "Other symptoms and signs involving the nervous and musculoskeletal systems",
                         "Tendency to fall, not elsewhere classified",
                         "Other symptoms and signs involving the urinary system",
                         "Somnolence, stupor and coma",
                         "Speech disturbances, not elsewhere classified",
                         "Shock, not elsewhere classified",
                         "Sequelae of other specified burn, corrosion and frostbite",
                         "Foreign body entering into or through eye or natural orifice",
                         "Nosocomial condition",
                         "Personal history of endocrine, nutritional and metabolic diseases",
                         "Personal history of diseases of the musculoskeletal system and connective tissue",
                         "Personal history of diseases of the genitourinary system",
                         "Personal history of congenital malformations, deformations and chromosomal abnormalities",
                         "Personal history of other specified conditions",
                         "Other complications of internal prosthetic devices, implants and grafts, not elsewhere classified",
                         "Personal history of allergy, other than to drugs and biological substances",
                         "Personal history of allergy to unspecified drugs, medicaments and biological substances",
                         "Personal history of allergy to other drugs, medicaments and biological substances",
                         "Personal history of allergy to drugs, medicaments and biological substances",
                         "Personal history of chemotherapy for neoplastic disease",
                         "Other noninflammatory disorders of vulva and perineum",
                         "Resistance to betalactam antibiotics")
  )
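
Two long parallel vectors are easy to misalign when an entry is added or removed in only one of them. One possible alternative, shown here as a sketch on a handful of rows, is to keep each code next to its description with `tibble::tribble()`. The object name `other_icd10_desc_sketch` is made up for this example and the checks at the end are suggestions, not part of the original workflow.

```r
library(tibble)

# Same kind of information as the two parallel vectors, for a few rows only.
# Each code sits on the same line as its description.
other_icd10_desc_sketch <- tribble(
  ~icd10_code, ~icd10_description,
  "A09.9",     "Gastroenteritis and colitis of unspecified origin",
  "A41",       "Other sepsis",
  "N90",       "Other noninflammatory disorders of vulva and perineum",
  "U82",       "Resistance to betalactam antibiotics"
)

# Guard against duplicated codes and stray leading/trailing whitespace,
# including newlines accidentally embedded in a description
stopifnot(
  !anyDuplicated(other_icd10_desc_sketch$icd10_code),
  !any(grepl("^\\s|\\s$", other_icd10_desc_sketch$icd10_description))
)
```

With this layout a mismatch between a code and its description is visible on a single line, and `stopifnot()` halts the run if a code is entered twice.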
manual_icd10_map <-
  readxl::read_xlsx(here::here("data/icd/manual_disease_icd10_mappings.xlsx"))

icd10_descriptions =
  phecodes |>
  select(icd10_code = ICD10, 
         icd10_description = ICD_DESCRIPTION
         ) |>
  distinct()

# Expand multiple ICD codes into rows
to_add_expanded <- manual_icd10_map |>
  mutate(icd10_code = str_split(icd10_code, ",\\s*")) |>
  tidyr::unnest(icd10_code)

icd10_descriptions = 
  bind_rows(
    icd10_descriptions,
    to_add_expanded |>
      select(icd10_code, 
             icd10_description = icd10_desc),
    other_icd10_desc
  ) 

icd10_descriptions = icd10_descriptions |> distinct()

icd10_descriptions = 
  icd10_descriptions |>
  group_by(icd10_code) |>
  summarise(icd10_description = 
            str_flatten(unique(icd10_description), collapse = "; ", na.rm = TRUE), 
            .groups = "drop"
            )

disease_mapping =
  left_join(disease_mapping,
            icd10_descriptions,
            by = "icd10_code",
            relationship = "many-to-one",
            na_matches = "never"
            ) 


# Recode R871 to its parent category R87
disease_mapping = 
  disease_mapping |> 
  mutate(icd10_code = str_replace_all(
    icd10_code,
    pattern = "R871",
    replacement = "R87"
    )
  ) 

# Set the R87 description by hand (the join above ran before the recode)
disease_mapping = 
  disease_mapping |> 
  mutate(icd10_description = 
         ifelse(icd10_code == "R87",
                "Abnormal findings in specimens from female genital organs",
                icd10_description)
  ) 
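
The description lookup above combines three steps: splitting comma-separated codes into rows, collapsing duplicate descriptions to one row per code, and a many-to-one join. The following is a minimal sketch of that pattern on made-up data; the `toy_*` tables and their contents are hypothetical stand-ins, not the real `phecodes` or manual mapping tables.

```r
library(dplyr)
library(stringr)

# Hypothetical toy inputs (illustration only).
toy_manual <- tibble::tibble(
  icd10_code = c("C50, C34", "F05"),
  icd10_desc = c("Malignant neoplasm", "Delirium")
)
toy_existing <- tibble::tibble(
  icd10_code = c("F05", "I10"),
  icd10_description = c("Delirium, not induced by alcohol", "Essential hypertension")
)

# One row per code: split on commas (optional trailing spaces), then unnest.
toy_expanded <- toy_manual |>
  mutate(icd10_code = str_split(icd10_code, ",\\s*")) |>
  tidyr::unnest(icd10_code)

# Stack both sources and keep a single collapsed description per code,
# so the lookup is unique on icd10_code.
toy_descriptions <- bind_rows(
  toy_existing,
  toy_expanded |> select(icd10_code, icd10_description = icd10_desc)
) |>
  distinct() |>
  group_by(icd10_code) |>
  summarise(
    icd10_description = str_flatten(unique(icd10_description),
                                    collapse = "; ", na.rm = TRUE),
    .groups = "drop"
  )

# Many-to-one join; rows with a missing code are never matched.
toy_mapping <- tibble::tibble(
  trait = c("breast cancer", "delirium", "unmapped trait"),
  icd10_code = c("C50", "F05", NA)
)
toy_joined <- left_join(toy_mapping, toy_descriptions,
                        by = "icd10_code",
                        relationship = "many-to-one",
                        na_matches = "never")
```

Because the lookup is collapsed to one row per code before joining, `relationship = "many-to-one"` acts as a guard: the join errors if a code ever appears twice in the lookup instead of silently duplicating study rows.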

Missing descriptions for ICD10 codes

# check if any ICD10 codes are missing descriptions
disease_mapping |> 
  filter(is.na(icd10_description)) |>
  head()
# A tibble: 6 × 9
  `DISEASE/TRAIT`         collected_all_diseas…¹ PUBMED_ID  YEAR STUDY_ACCESSION
  <chr>                   <chr>                      <int> <int> <chr>          
1 smoking interaction in… lung cancer             29059373  2017 GCST005910     
2 smoking interaction in… lung cancer             29059373  2017 GCST005909     
3 smoking interaction in… lung cancer             29059373  2017 GCST005911     
4 breast cancer (estroge… breast cancer           29058716  2017 GCST005076     
5 breast cancer           breast cancer           29058716  2017 GCST005077     
6 breast cancer in brca1… breast cancer           29058716  2017 GCST005075     
# ℹ abbreviated name: ¹​collected_all_disease_terms
# ℹ 4 more variables: icd10_code <chr>, icd10_code_origin <chr>, phecode <dbl>,
#   icd10_description <chr>
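
`head()` shows only the first six affected rows. A possible follow-up, sketched here on a hypothetical toy table (`toy_mapping` is not the real `disease_mapping`), is to count the missing descriptions per disease term and code. This separates rows that have no ICD10 code at all from rows whose code simply has no description.

```r
library(dplyr)

# Hypothetical stand-in for disease_mapping (illustration only).
toy_mapping <- tibble::tibble(
  collected_all_disease_terms = c("lung cancer", "lung cancer",
                                  "breast cancer", "delirium"),
  icd10_code = c(NA, NA, "C50", "F05"),
  icd10_description = c(NA, NA, NA, "Delirium")
)

# Count rows without a description, grouped by term and code.
missing_summary <- toy_mapping |>
  filter(is.na(icd10_description)) |>
  count(collected_all_disease_terms, icd10_code, name = "n_studies") |>
  arrange(desc(n_studies))
```

In the toy data, `lung cancer` appears with a missing code (two rows) and `breast cancer` with code `C50` but no description (one row), which are two different problems to fix.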

Mistakes in MAPPED_TRAIT for some ICD10 codes

gwas_study_info |> 
  filter(grepl("ICD10 F05", 
               `DISEASE/TRAIT`)) |> 
  select(`DISEASE/TRAIT`, MAPPED_TRAIT, collected_all_disease_terms, PUBMED_ID)
                                                                  DISEASE/TRAIT
                                                                         <char>
1: ICD10 F05: Delirium due to known physiological condition (Gene-based burden)
2:                       ICD10 F05.9: Delirium, unspecified (Gene-based burden)
3:                                           ICD10 F05.9: Delirium, unspecified
4:                     ICD10 F05: Delirium due to known physiological condition
                  MAPPED_TRAIT collected_all_disease_terms PUBMED_ID
                        <char>                      <char>     <int>
1: alcohol withdrawal delirium   alcohol-related disorders  34662886
2:                    delirium                    delirium  34662886
3:                    delirium                    delirium  34662886
4: alcohol withdrawal delirium   alcohol-related disorders  34662886
disease_mapping |> 
  filter(icd10_code == "F05")
# A tibble: 3 × 9
  `DISEASE/TRAIT`         collected_all_diseas…¹ PUBMED_ID  YEAR STUDY_ACCESSION
  <chr>                   <chr>                      <int> <int> <chr>          
1 behavioral disturbance… atypical behavior       25897833  2015 GCST002863     
2 icd10 f05: delirium du… alcohol-related disor…  34662886  2021 GCST90083772   
3 icd10 f05: delirium du… alcohol-related disor…  34662886  2021 GCST90079786   
# ℹ abbreviated name: ¹​collected_all_disease_terms
# ℹ 4 more variables: icd10_code <chr>, icd10_code_origin <chr>, phecode <dbl>,
#   icd10_description <chr>
# ICD10 F05 is delirium not induced by alcohol or other psychoactive substances,
# yet the F05 studies above are mapped to "alcohol withdrawal delirium"
# (collected term "alcohol-related disorders"). Recode them as "delirium".
# %in% (rather than ==) keeps the existing term when icd10_code is NA.
disease_mapping = 
  disease_mapping |>
  mutate(collected_all_disease_terms = 
         ifelse(icd10_code %in% "F05",
                "delirium",
                collected_all_disease_terms)
         )

# Correct MAPPED_TRAIT for the ICD10 F05 studies only;
# all other rows keep their existing MAPPED_TRAIT
gwas_study_info =
  gwas_study_info |>
  mutate(MAPPED_TRAIT = 
         ifelse(grepl("ICD10 F05", `DISEASE/TRAIT`),
                str_replace_all(MAPPED_TRAIT,
                                pattern = "alcohol withdrawal delirium",
                                replacement = "delirium"),
                MAPPED_TRAIT)
         )
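
One detail worth checking in conditional recodes of this kind is how `ifelse()` treats a missing condition. With `==`, an `NA` code yields an `NA` condition, and `ifelse()` then returns `NA` instead of the fallback value. The toy vectors below are hypothetical and only illustrate the difference between `==` and `%in%`.

```r
# Hypothetical toy vectors (illustration only).
icd10_code <- c("F05", "C50", NA)
terms <- c("alcohol-related disorders", "breast cancer", "lung cancer")

# `==` propagates NA, so the third element of the result becomes NA.
with_equals <- ifelse(icd10_code == "F05", "delirium", terms)

# `%in%` treats NA as "no match", so the original term is kept.
with_in <- ifelse(icd10_code %in% "F05", "delirium", terms)
```

Here `with_equals` loses the `"lung cancer"` term for the row with no code, while `with_in` keeps it.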

Save disease map

disease_mapping = 
  disease_mapping |> 
  arrange(YEAR,
          PUBMED_ID,
          STUDY_ACCESSION,
          collected_all_disease_terms, 
          icd10_code)

fwrite(disease_mapping,
       here::here("output/icd_map/gwas_disease_to_icd10_mapping.csv")
       )

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] jsonlite_2.0.0    httr_1.4.7        data.table_1.17.8 stringr_1.5.2    
[5] dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] bit_4.6.0         compiler_4.3.1    renv_1.0.3        promises_1.3.3   
 [5] tidyselect_1.2.1  Rcpp_1.1.0        git2r_0.36.2      tidyr_1.3.1      
 [9] callr_3.7.6       later_1.4.4       jquerylib_0.1.4   readxl_1.4.5     
[13] yaml_2.3.10       fastmap_1.2.0     here_1.0.1        R6_2.6.1         
[17] generics_0.1.4    knitr_1.50        tibble_3.3.0      rprojroot_2.1.0  
[21] bslib_0.9.0       pillar_1.11.1     rlang_1.1.6       utf8_1.2.6       
[25] cachem_1.1.0      stringi_1.8.7     httpuv_1.6.16     xfun_0.53        
[29] getPass_0.2-4     fs_1.6.6          sass_0.4.10       bit64_4.6.0-1    
[33] cli_3.6.5         withr_3.0.2       magrittr_2.0.4    ps_1.9.1         
[37] digest_0.6.37     processx_3.8.6    rstudioapi_0.17.1 lifecycle_1.0.4  
[41] vctrs_0.6.5       evaluate_1.0.5    glue_1.8.0        cellranger_1.1.0 
[45] whisker_0.4.1     purrr_1.1.0       rmarkdown_2.30    tools_4.3.1      
[49] pkgconfig_2.0.3   htmltools_0.5.8.1