Last updated: 2025-12-29

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20220216)

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: f5eae61

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f5eae61. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    human_dictionary/
    Ignored:    igsr_populations.tsv
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/

Untracked files:
    Untracked:  code/extract_cdc_meta.R
    Untracked:  code/figure_4a.R
    Untracked:  code/poster_figures.R
    Untracked:  code/umls_ontology.R
    Untracked:  data/cdc/
    Untracked:  data/icd/2025AA/
    Untracked:  data/icd/umls-2025AA-mrconso.zip
    Untracked:  figures/
    Untracked:  visualization.Rdata

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/group_cancer_diseases.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   analysis/manual_trait_map_icd10.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd
    Modified:   analysis/trait_ontology_categorization.Rmd
    Modified:   code/custom_plotting.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/correcting_cohort_names.Rmd) and HTML (docs/correcting_cohort_names.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	f5eae61	IJbeasley	2025-12-29	Saving a per paper cohort meta-data
html	c7307bc	IJbeasley	2025-12-28	Build site.
Rmd	4edd22a	IJbeasley	2025-12-28	Adding more cohorts to cohort dictionary
html	bae17f0	IJbeasley	2025-12-28	Build site.
Rmd	9a3bb9b	IJbeasley	2025-12-28	Update fixing missing cohort information
html	aab4928	IJbeasley	2025-10-28	Build site.
Rmd	d088104	IJbeasley	2025-10-28	More cohort name correcting
html	1b07ce8	IJbeasley	2025-10-28	Build site.
Rmd	cd3b8d8	IJbeasley	2025-10-28	More cohort name correcting
html	0b43415	IJbeasley	2025-10-15	Build site.
Rmd	fcd0501	IJbeasley	2025-10-15	workflowr::wflow_publish("analysis/correcting_cohort_names.Rmd")
html	b33ca74	IJbeasley	2025-08-21	Build site.
Rmd	ac13d70	IJbeasley	2025-08-21	Updating correcting cohort labels
html	6c592b7	IJbeasley	2025-08-20	Build site.
Rmd	1969e6b	IJbeasley	2025-08-20	More corrections / harmonisation of cohort names in gwas catalog

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)

1 Load / pre-process GWAS Catalog data

# Load GWAS Catalog studies
gwas_study_info <- fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-studies-r2025-07-21.tsv"),
                         sep = "\t", quote = "")

# Standardize column names (remove spaces)
gwas_study_info <- gwas_study_info |>
  rename_all(~gsub(" ", "_", .x))

gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, " \\| ", "|")) |>
  mutate(COHORT = str_replace_all(COHORT, "\\| ", "|")) |>
  mutate(COHORT = str_replace_all(COHORT, " \\|", "|"))

# some use commas instead of | to designate multiple cohorts
gwas_study_info <- gwas_study_info |>
    mutate(COHORT = str_replace_all(COHORT, ", ", "|"))
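
The separator clean-up above could also be written as one small helper. This is only a sketch under assumptions: `normalise_separators()` and the toy values are hypothetical, and `\\s*` collapses any amount of whitespace around `|`, which is slightly broader than the single-space patterns used above.

```r
library(stringr)

# Toy input (hypothetical values, not taken from the GWAS Catalog)
cohort_raw <- c("UKB | FinnGen", "ALSPAC| Lifelines", "KORA |SHIP", "MESA, ARIC")

# Hypothetical helper: strip whitespace around "|" and treat ", " as a separator
normalise_separators <- function(x) {
  x |>
    str_replace_all("\\s*\\|\\s*", "|") |>
    str_replace_all(", ", "|")
}

cohort_clean <- normalise_separators(cohort_raw)
cohort_clean
```

Keeping the rule in one function makes it easier to apply the same normalisation to other delimited columns, if that is ever needed.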

2 Remove uninformative labels

# standardise the different spellings of the "multiple" label
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                   "(\\(Multiple cohorts\\))|(\\(multiple\\))|Multiple",
                                  "multiple")) 


# number of cohorts listed as 'multiple'
gwas_study_info |> 
  filter(grepl("multiple", COHORT)) |> 
  group_by(COHORT) |> 
  summarise(n = n())

# A tibble: 1 × 2
  COHORT       n
  <chr>    <int>
1 multiple   320

# remove cohorts listed as multiple
gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = stringr::str_remove_all(pattern = "multiple",
                                          string = COHORT)
         )


# standardise the different spellings of the "other" label
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Other", "other")) |> 
  mutate(COHORT = str_replace_all(COHORT, "OTHER", "other")) |>
  mutate(COHORT = str_replace_all(COHORT, "others", "other")) 

# number of cohorts listed as other
gwas_study_info |> 
  filter(grepl("other", COHORT)) |> 
  group_by(COHORT) |> 
  summarise(n = n()) |>
  arrange(desc(n)) |>
  head()

# A tibble: 6 × 2
  COHORT                                                                       n
  <chr>                                                                    <int>
1 other                                                                     8529
2 Lifelines|other                                                           2863
3 ALSPAC|other                                                              1048
4 UPENN|other                                                                 63
5 ADNI|ALS|BDC|BIG|BrainSCALE|Generation_R|IMAGEN|NCNG|NESDA|NeuroIMAGE|N…    56
6 other|CADD|DTR|FTC|MyCode|GS:SFHS|GENOA|HUNT|MOBA|NTR|ORCADES|QIMR|STR|…    48

gwas_study_info |>
  filter(grepl("other", COHORT)) |> 
  nrow()

[1] 13209

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = stringr::str_remove_all(pattern = "(^other$)|(^other\\|)|(\\|other$)",
                                          string = COHORT)
         ) |>
  mutate(COHORT = stringr::str_replace_all(pattern = "\\|other\\|",
                                          string = COHORT,
                                          replacement = "|")
         )

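# remove any remaining "other" substrings, e.g. left over from adjacent "other|other" tokens
# note: this also strips "other" from inside longer cohort names
gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = stringr::str_remove_all(pattern = "other",
                                          string = COHORT)
         )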

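# not reported (NR); grepl("NR", ...) is a substring match,
# so names that contain NR (e.g. FINRISK) are counted too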
gwas_study_info |> 
  filter(grepl("NR", COHORT)) |> 
  group_by(COHORT) |> 
  summarise(n = n()) |>
  arrange(desc(n)) |>
  head()

# A tibble: 6 × 2
  COHORT                                                             n
  <chr>                                                          <int>
1 Knight_ADRC|ADNI|Barcelona-1|GR@ACE|DIAN|NR|Stanford_ADRC|PPMI  3608
2 Knight_ADRC|ADNI|Barcelona-1|GR@ACE|DIAN|NR                     1725
3 NR                                                              1533
4 Knight_ADRC|ADNI|Barcelona-1|GR@ACE|DIAN|NR|Stanford_ADRC        963
5 Knight_ADRC|ADNI|Barcelona-1|GR@ACE|DIAN|NR|PPMI                 559
6 FINRISK                                                          476

gwas_study_info |> 
  filter(grepl("NR", COHORT)) |>
  nrow()

[1] 9261
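
The count above includes names such as FINRISK because `grepl("NR", COHORT)` matches substrings. A minimal sketch of a whole-token match is shown here for comparison; the toy rows are hypothetical (SUNRISE is an invented name), and the pattern assumes cohorts are separated by `|` with no surrounding spaces.

```r
# Toy input (hypothetical rows; SUNRISE is an invented cohort name)
cohort <- c("NR", "FINRISK", "ADNI|NR|PPMI", "NR|UKB", "KORA|NR", "SUNRISE|KORA")

# Substring match, as in grepl("NR", COHORT): also hits FINRISK and SUNRISE
substring_hit <- grepl("NR", cohort)

# Whole-token match: NR must sit between the start/end of the string or a "|"
token_hit <- grepl("(^|\\|)NR(\\||$)", cohort)

data.frame(cohort, substring_hit, token_hit)
```

Only the whole-token version restricts the tally to entries where NR is itself listed as a cohort.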

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = stringr::str_remove_all(pattern = "(^NR$)|(^NR\\|)|(\\|NR$)",
                                          string = COHORT)) |>
  mutate(COHORT = stringr::str_replace_all(pattern = "(\\|NR\\|)",
                                           string = COHORT,
                                           replacement = "|")
         )

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_remove_all(pattern = "\\|$",
                                 string = COHORT))
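
An alternative to removing placeholder labels with regular expressions is to work on the individual cohort tokens. The sketch below is not the method used in this analysis; `drop_placeholder_tokens()` is a hypothetical helper, and it assumes the three labels handled in this section (`multiple`, `other`, `NR`) have already been brought to a single spelling.

```r
# Labels treated as uninformative in this section
placeholder_labels <- c("multiple", "other", "NR")

# Hypothetical helper: split on "|", drop placeholder and empty tokens, re-join
drop_placeholder_tokens <- function(x, drop = placeholder_labels) {
  cleaned <- vapply(strsplit(x, "|", fixed = TRUE), function(tok) {
    keep <- !is.na(tok) & !(tok %in% drop) & tok != ""
    paste(tok[keep], collapse = "|")
  }, character(1))
  cleaned[is.na(x)] <- NA_character_  # keep missing values missing
  cleaned
}

# Toy input (hypothetical rows)
cohort <- c("other", "Lifelines|other", "ADNI|NR|PPMI", "FINRISK", "other|other|HUNT")
cleaned <- drop_placeholder_tokens(cohort)
cleaned
```

Because whole tokens are compared, names that merely contain a label (FINRISK contains NR) are left intact, and no stray `|` characters remain to be trimmed afterwards.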

3 Accounting for some discrepancies in cohort names across studies

3.1 Check in: how many unique cohorts (at the start)?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1173
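
This check-in is repeated after each correction step, so it could be wrapped in a small function. A minimal sketch follows; the name `count_unique_cohorts()` is hypothetical, and it assumes `|` is the only separator left in `COHORT`.

```r
# Hypothetical helper: number of distinct, non-empty cohort labels
count_unique_cohorts <- function(cohort_strings) {
  tokens <- unlist(strsplit(cohort_strings, "|", fixed = TRUE))
  tokens <- tokens[!is.na(tokens) & tokens != ""]
  length(unique(tokens))
}

# Toy input (hypothetical rows): UKB, FinnGen and EB are the distinct labels
n_cohorts <- count_unique_cohorts(c("UKB|FinnGen", "FinnGen||EB", "", "UKB"))
n_cohorts
```

In this analysis the call would be `count_unique_cohorts(gwas_study_info$COHORT)`.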

3.2 Accounting for some discrepancies in cohort names across studies

# Correct for discrepancies within same paper
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AWI-Gen", "AWI-GEN")) |> # PUBMED ID :40229280
  mutate(COHORT = str_replace_all(COHORT, "AddHealth", "Add Health")) |> # PUBMED ID: 37494057
  mutate(COHORT = str_replace_all(COHORT, fixed("EB|FinnGen|UKBB"), "EB|FinnGen|UKB")) |> # 39067062
  mutate(COHORT = str_replace_all(COHORT, "Estonian Biobank", "EB")) |> # PUBMED ID: 39500877
  mutate(COHORT = str_replace_all(COHORT, "AWIGEN", "AWI-GEN"))  # 40229280


# Makes TwinsUK consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "TWINS-UK|TWINSUK", "TwinsUK")) 

# Make epic norfolk consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "EPIC-Norfolk cohort", "EPIC-Norfolk")) 

# Make emerge consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "EMERGE", "eMERGE")) 

# Make twingene consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "TWINGENE", "TwinGene"))

# Make QSkin consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "QSkin|Qskin", "QSKIN")) 

# Make 23andme consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "23ANDME", "23andMe")) 

# Make PopGen consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "PopGen", "POPGEN")) 

# Make decode consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "DECODE|deCode|DeCODE", "deCODE"))

# Make FinnGen consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Finngen|FINNGEN", "FinnGen"))

gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "genomicc", "GenOMICC")) |>
  mutate(COHORT = str_replace_all(COHORT, "IPSYCH", "iPSYCH")) |>
  mutate(COHORT = str_replace_all(COHORT, "SIMES", "SiMES")) |>
  mutate(COHORT = str_replace_all(COHORT, "HELIX", "Helix")) 

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "FINLAND", "Finland"))
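
The spelling corrections above could also be kept in a single lookup, since `str_replace_all()` accepts a named character vector of `pattern = replacement` pairs and applies them in order. The sketch below copies a subset of the pairs from this section; the name `cohort_spelling_fixes` and the toy input are hypothetical.

```r
library(stringr)

# Pattern = replacement pairs copied from the corrections above (subset)
cohort_spelling_fixes <- c(
  "TWINS-UK|TWINSUK"     = "TwinsUK",
  "EMERGE"               = "eMERGE",
  "QSkin|Qskin"          = "QSKIN",
  "DECODE|deCode|DeCODE" = "deCODE",
  "Finngen|FINNGEN"      = "FinnGen"
)

# Toy input (hypothetical rows)
cohort <- c("TWINSUK|DECODE", "Finngen|EMERGE|Qskin", "KORA")
cohort_fixed <- str_replace_all(cohort, cohort_spelling_fixes)
cohort_fixed
```

Names in the vector are treated as regular expressions, so entries that should match literally (like the `fixed()` pattern used above) would still need a separate call.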

3.3 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1149

4 Correcting for cardiogram cohort meta-analyses

# CARDIoGRAMplusC4D cohort includes both CARDIoGRAM and C4D cohorts
# see: https://cardiogramplusc4d.org/data-downloads/
# for coding, therefore, we change this to CARDIoGRAM|C4D
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "CARDIoGRAMplusC4D", "CARDIoGRAM|C4D"))
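
One possible side effect of expanding CARDIoGRAMplusC4D is a repeated label when a study already lists CARDIoGRAM or C4D separately. Whether any study does so has not been checked here, so the following is only a sketch with hypothetical toy rows and a hypothetical helper, `dedupe_cohort_tokens()`.

```r
# Hypothetical helper: drop repeated cohort labels within each "|"-separated string
dedupe_cohort_tokens <- function(x) {
  out <- vapply(strsplit(x, "|", fixed = TRUE),
                function(tok) paste(unique(tok), collapse = "|"),
                character(1))
  out[is.na(x)] <- NA_character_  # keep missing values missing
  out
}

# Toy input (hypothetical rows)
cohort <- c("CARDIoGRAM|CARDIoGRAMplusC4D", "CARDIoGRAMplusC4D|UKB")
expanded <- gsub("CARDIoGRAMplusC4D", "CARDIoGRAM|C4D", cohort, fixed = TRUE)
deduped <- dedupe_cohort_tokens(expanded)
deduped
```

The unique-cohort check-ins are unaffected either way, because they already call `unique()` on the split labels.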

5 Correcting for UK Biobank naming differences …

all_cohorts[grep("ukb", tolower(all_cohorts))] |> unique()

[1] "UKB"                "UKBB"               "UKBB White British"
[4] "UKBS"               "UKB-PPP"

gwas_study_info |>
  filter(grepl("UKBS", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  37653029
2:  34127860
                                                                                                                                                                                        COHORT
                                                                                                                                                                                        <char>
1:                                                                                                                                      GenEPA|CHOP|EPICURE|HBCS|KORA|ILM|PoBI|POPGEN|TSS|UKBS
2: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

# for PUBMED_ID: 37653029
# UKBS seems to be UK Biobank

# for pubmed id: 34127860
# UKBS is UK Blood Service (UKBS)
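
Because UKBS means different things in the two papers, any recoding would have to be restricted to one PUBMED_ID. The sketch below shows one way that could look if the UK Biobank reading for 37653029 is confirmed; it uses shortened toy COHORT strings and is not a step taken in this analysis.

```r
library(dplyr)
library(stringr)

# Toy rows with shortened COHORT strings for the two papers discussed above
studies <- tibble(
  PUBMED_ID = c(37653029L, 34127860L),
  COHORT    = c("KORA|POPGEN|UKBS", "BC58|UKBS|HapMap")
)

# Recode UKBS only for the paper where it appears to mean UK Biobank
studies <- studies |>
  mutate(COHORT = if_else(PUBMED_ID == 37653029L,
                          str_replace_all(COHORT, "\\bUKBS\\b", "UKB"),
                          COHORT))
studies
```

The second row keeps UKBS, which in that paper refers to the UK Blood Service.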

gwas_study_info |>
  filter(grepl("UKB-PPP", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID  COHORT
       <int>  <char>
1:  37794183 UKB-PPP

# PUBMED_ID 37794183 uses the UK Biobank Pharma Proteomics Project (UKB-PPP),
# so it is recoded to UK Biobank
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = ifelse(COHORT == "UKB-PPP", "UKB", COHORT)) |>
  mutate(COHORT = str_replace_all(COHORT, "UKBB White British", "UKB")) |>
  mutate(COHORT = gsub("\\bUKB\\b", "UKBB", COHORT))
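
The order of the steps above matters because `\\b` treats a hyphen as a word boundary. The toy example below, using the labels found in this section plus one hypothetical combined string, shows what the final `gsub()` does to each label on its own.

```r
# Labels from this section, plus one hypothetical combined string
cohort <- c("UKB", "UKBB", "UKBS", "UKB-PPP", "ALSPAC|UKB|CHARGE")

# Same substitution as above, applied without the earlier recoding steps
renamed <- gsub("\\bUKB\\b", "UKBB", cohort)
renamed
```

UKBB and UKBS are untouched, but an unrecoded UKB-PPP would become UKBB-PPP, which is why it is mapped to UKB before the word-boundary substitution runs.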

5.1 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1146

6 Correcting for cohort naming differences across studies (confirmed by checking papers)

6.1 NIHR BioResource

# seems NIHR Cambridge BioResource & NIHR BIORESOURCE are the same
# https://www.cambridgebioresource.group.cam.ac.uk/ 

gwas_study_info |>
  filter(grepl("NIHR Cambridge BioResource", COHORT)) |>
  select(PUBMED_ID, COHORT)

   PUBMED_ID
       <int>
1:  34127860
2:  34127860
                                                                                                                                                                                        COHORT
                                                                                                                                                                                        <char>
1: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap
2: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

gwas_study_info |>
    filter(grepl("NIHR BIORESOURCE", COHORT)) |>
    select(PUBMED_ID, COHORT) |> 
    distinct()

   PUBMED_ID
       <int>
1:  39891803
2:  40205036
                                                                                                                                                                                                                                                     COHORT
                                                                                                                                                                                                                                                     <char>
1:                                                                                                                                                                                                                      UKBB|CHARGE|ALSPAC|NIHR BIORESOURCE
2: arcOGEN|ARGO|UKHLS|China Kadoorie Biobank|deCODE|CHB|DBDS|eMERGE|EB|FinnGen|MyCode|GS:SFHS|HRS|HKDDDPC|HUNT|Bunkyo|HerediGene|RIKEN|Shimane-CoHRE|JOCO|LifeLines|NEO|NHS|MGBB|QIMR|RS|SHIP|SIMPLER|ToMMo|TwinsUK|UKBB|BioMe|G&H|NIHR BIORESOURCE|MVP|OAI

gwas_study_info |>
    filter(grepl(tolower("BIORESOURCE"), tolower(COHORT))) |>
    select(PUBMED_ID, COHORT) |>
    distinct()

   PUBMED_ID
       <int>
1:  39891803
2:  40205036
3:  34127860
                                                                                                                                                                                                                                                     COHORT
                                                                                                                                                                                                                                                     <char>
1:                                                                                                                                                                                                                      UKBB|CHARGE|ALSPAC|NIHR BIORESOURCE
2: arcOGEN|ARGO|UKHLS|China Kadoorie Biobank|deCODE|CHB|DBDS|eMERGE|EB|FinnGen|MyCode|GS:SFHS|HRS|HKDDDPC|HUNT|Bunkyo|HerediGene|RIKEN|Shimane-CoHRE|JOCO|LifeLines|NEO|NHS|MGBB|QIMR|RS|SHIP|SIMPLER|ToMMo|TwinsUK|UKBB|BioMe|G&H|NIHR BIORESOURCE|MVP|OAI
3:                                                              BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "NIHR Cambridge BioResource|NIHR BIORESOURCE" , "NIHR BioResource"))
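
Spelling variants like these can be listed directly, rather than by inspecting full COHORT strings, by splitting on `|` and matching the labels case-insensitively. A minimal sketch follows; `find_cohort_variants()` is a hypothetical helper and the toy rows are shortened versions of the strings shown above.

```r
library(stringr)

# Hypothetical helper: distinct cohort labels matching a pattern, ignoring case
find_cohort_variants <- function(cohort_strings, pattern) {
  tokens <- unlist(str_split(cohort_strings, fixed("|")))
  tokens <- unique(tokens[!is.na(tokens) & tokens != ""])
  tokens[str_detect(tokens, regex(pattern, ignore_case = TRUE))]
}

# Toy rows, shortened from the output above
cohort <- c("UKBB|CHARGE|ALSPAC|NIHR BIORESOURCE",
            "BC58|BDA|NIHR Cambridge BioResource|GRID",
            "TwinsUK|UKBB|NIHR BIORESOURCE|MVP")
variants <- find_cohort_variants(cohort, "bioresource")
variants
```

The same call with another pattern, for example `"sardinia"`, would list that cohort's spellings in one step.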

6.2 Living biobank typo

# Leivin Biobank appears to be a typo for Living Biobank
# see PUBMED ID 34059833; https://pmc.ncbi.nlm.nih.gov/articles/PMC7610958/#SD1
gwas_study_info |> filter(grepl("Leivin Biobank", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE   JOURNAL
                  <IDat>     <int>       <char>     <IDat>    <char>
1:            2021-06-10  34059833       Chen J 2021-05-31 Nat Genet
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/34059833
                                                          STUDY   DISEASE/TRAIT
                                                         <char>          <char>
1: The trans-ancestral genomic architecture of glycemic traits. Fasting glucose
                      INITIAL_SAMPLE_SIZE REPLICATION_SAMPLE_SIZE
                                   <char>                  <char>
1: 35,619 East Asian ancestry individuals                    <NA>
                  PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                      <char>             <int>
1: Affymetrix, Illumina [15438438] (imputed)                15
          MAPPED_TRAIT                     MAPPED_TRAIT_URI STUDY_ACCESSION
                <char>                               <char>          <char>
1: glucose measurement http://www.ebi.ac.uk/efo/EFO_0004468    GCST90002231
                                                                               GENOTYPING_TECHNOLOGY
                                                                                              <char>
1: Genome-wide genotyping array, Targeted genotyping array [Genome-wide genotyping array|Metabochip]
   SUBMISSION_DATE STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
            <lgcl>            <lgcl>           <lgcl>                  <char>
1:              NA                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                  COHORT
                                                                                                                  <char>
1: AASC|BES|CAGE-GWAS1|CAGE|CLHNS|CHNS|KARE|Leivin Biobank|MESA|Nagahama Study|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|SBCS|SMHS
   FULL_SUMMARY_STATISTICS
                    <char>
1:                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90002001-GCST90003000/GCST90002231
      GXE
   <char>
1:     no

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Leivin Biobank", "Living Biobank"))

6.3 Ghana Prostate

gwas_study_info |>
  filter(grepl("Ghana", COHORT)) |>
  select(PUBMED_ID,COHORT) |>
  distinct()

   PUBMED_ID                                              COHORT
       <int>                                              <char>
1:  39358599                     MADCaP|Ghana_Prostate|PRACTICAL
2:  36872133 AAPC|ELLIPSE|Ghana|eMERGE|BioVU|BioMe|MVP|ProHealth

# looking at the papers, they appear to refer to the same cohort:
# PUBMED_ID: 36872133 https://pmc.ncbi.nlm.nih.gov/articles/PMC10424812/#S9
# PUBMED_ID: 39358599 https://www.nature.com/articles/s41588-024-01931-3#Sec12

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Ghana_Prostate", "Ghana"))

6.4 Sardinia

# from reading sup table: https://pmc.ncbi.nlm.nih.gov/articles/instance/7611832/bin/EMS136340-supplement-Supplementary_Information.pdf
# for pubmed 34349265
# seems SARDINIA should be combined into SardiNIA
gwas_study_info |>
  filter(grepl("SARDINIA", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  34349265
                                                                                                                                                                                                                             COHORT
                                                                                                                                                                                                                             <char>
1: ALSPAC|ARIC|CHS|CILENTO|COLAUS|EGCUT|EPIC-Norfolk|FHS|INGI-FVG|GS:SFHS|HealthABC|HRS|INCHIANTI|InterAct|KORA|LifeLines|NEO|NHS|NTR|ORCADES|QIMR|RS|SARDINIA|SHIP|SHIP-TREND|TwinGene|TwinsUK|INGI-Val_Borbera|WGHS|WHI|BCAC|UKBB

gwas_study_info |>
  filter(grepl("SardiNIA", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

    PUBMED_ID
        <int>
 1:  36477530
 2:  36477530
 3:  36477530
 4:  36477530
 5:  36477530
 6:  36477530
 7:  36477530
 8:  36477530
 9:  36477530
10:  36477530
11:  36376304
12:  36050321
13:  36050321
14:  34718232
15:  32929287
                                                                                                                                                                                                                                COHORT
                                                                                                                                                                                                                                <char>
 1:                                                                      23andMe|ALSPAC|ARIC|CADD|deCODE|EGCUT|eMERGE|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|NINDS|FINRISK|AMISH|GeneSTAR|GOLDN|CHS|HVH|JHS|WGHS|WHI|GFG
 2:                                                                                  23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|Harvard|HRS|HUNT|METSIM|NTR|QIMR|SardiNIA|UKBB|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|WHI
 3:                                                     23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|PAGE|QIMR|SardiNIA|UKBB|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|CHS|HCHS|SOL|WHI
 4:                                                  23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|eMERGE|Harvard|HUNT|MCTFR|METSIM|NTR|SardiNIA|UKBB|NINDS|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|CHS|HCHS|SOL|HVH|WGHS|WHI
 5:                                                                                                            23andMe|ALSPAC|ARIC|CADD|COGEND|deCODE|EGCUT|GERA|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|QIMR|SardiNIA|UKBB|FINRISK|WHI
 6:                23andMe|ALSPAC|ARIC|BLTS|CADD|deCODE|EGCUT|eMERGE|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|WHI|FINRISK|NINDS|BBJ|CKB|AMISH|CFS|CHS|GENSalt|GOLDN|HCHS|SOL|HVH|HyperGEN|JHS|GeneSTAR|GENOA|SARP|WGHS
 7:                                23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|GFG|Harvard|HRS|HUNT|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|BBJ|CKB|AMISH|CFS|ECLIPSE|GENSalt|GOLDN|HyperGEN|JHS|GeneSTAR|GENOA
 8:        23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|PAGE|BBJ|CKB|AMISH|CFS|CHS|ECLIPSE|GENSalt|GOLDN|HCHS|SOL|HyperGEN|JHS|GeneSTAR|GENOA
 9: 23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|eMERGE|GFG|Harvard|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|WHI|FINRISK|NINDS|BBJ|CKB|AMISH|CFS|CHS|ECLIPSE|GENSalt|GOLDN|HCHS|SOL|HVH|HyperGEN|JHS|GeneSTAR|GENOA|WGHS
10:                                                                                               23andMe|ALSPAC|ARIC|CADD|COGEND|deCODE|EGCUT|GERA|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|BBJ|CKB
11:                                                                        23andMe|ALSPAC|ARIC|BLS|CADD|COGEND|COPDGene|deCODE|EGCUT|FHS|FTC|GERA|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NESCOG|FTC|NAG-FIN|NTR|QIMR|SardiNIA|UKBB|WHI
12:                       ARIC|BioMe|BRIGHT|CHRIS|CHS|ERF|FINCAVAS|GAPP|HCHS|SOL|HealthABC|INGI-Carlantino|INGI-FVG|Inter99|JHS|KORA|LifeLines|MESA|NEO|OOA|ORCADES|PIVUS|PREVEND|PROSPER|RS|SardiNIA|SHIP|TwinsUK|UKBB|VIKING|WHI|YFS
13:                                    ARIC|BioMe|BRIGHT|CHRIS|CHS|ERF|FINCAVAS|GAPP|HealthABC|INGI-Carlantino|INGI-FVG|Inter99|KORA|LifeLines|MESA|NEO|OOA|ORCADES|PIVUS|PREVEND|PROSPER|RS|SardiNIA|SHIP|TwinsUK|UKBB|VIKING|WHI|YFS
14:                                                                                                                                                                                                                           SardiNIA
15:                                                                                                                                                                                                                           SardiNIA

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SARDINIA", "SardiNIA"))

gwas_study_info |>
  filter(grepl("Sardinia", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  33830302
                                                                                                                                                                     COHORT
                                                                                                                                                                     <char>
1: GRID|British 1958 birth cohort|National blood service|WTCCC - Bipolar disease cases|Oxford Regional Prospective Study of Childhood Diabetes (ORPS)|Sardinia case-control

# not sure whether "Sardinia case-control" is the same cohort as SardiNIA, so it is left unchanged
# see the second supplementary table from https://pmc.ncbi.nlm.nih.gov/articles/PMC8099827/#_ad93_

6.5 Odd naming convention in PUBMED_ID 32949544

The COHORT entry for this study seems to list ancestry groups (e.g. "Dutch controls") rather than cohort names, even though named cohorts such as UKBB are used in the study.

See the cohort information here: https://pmc.ncbi.nlm.nih.gov/articles/instance/8220892/bin/NIHMS1709432-supplement-Supp_Materials.pdf

gwas_study_info |>
  dplyr::filter(PUBMED_ID == 32949544)

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE       JOURNAL
                  <IDat>     <int>       <char>     <IDat>        <char>
1:            2020-10-01  32949544      Jones E 2020-09-16 Lancet Neurol
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/32949544
                                                                                                                            STUDY
                                                                                                                           <char>
1: Identification of novel risk loci and causal insights for sporadic Creutzfeldt-Jakob disease: a genome-wide association study.
                          DISEASE/TRAIT
                                 <char>
1: Creutzfeldt-Jakob disease (sporadic)
                                                INITIAL_SAMPLE_SIZE
                                                             <char>
1: 4,110 European ancestry cases, 13,569 European ancestry controls
                                              REPLICATION_SAMPLE_SIZE
                                                               <char>
1: 1,098 European ancestry cases, 498 ,016 European ancestry controls
                 PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                     <char>             <int>
1: Affymetrix, Illumina [6314492] (imputed)                 4
                        MAPPED_TRAIT                     MAPPED_TRAIT_URI
                              <char>                               <char>
1: sporadic Creutzfeld Jacob disease http://www.ebi.ac.uk/efo/EFO_1000656
   STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
            <char>                       <char>          <lgcl>
1:    GCST90001389 Genome-wide genotyping array              NA
   STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
              <lgcl>           <lgcl>                  <char>
1:                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                                                               COHORT
                                                                                                                                                               <char>
1: Dutch controls|French controls|German controls|Italian controls|Spanish controls|UK controls|US controls|UK sCJD cases|US sCJD cases|German sCJD cases| sCJD cases
   FULL_SUMMARY_STATISTICS
                    <char>
1:                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90001001-GCST90002000/GCST90001389
      GXE
   <char>
1:     no

# The listed groups are not cohort names, so set COHORT to "multiple" for this study
gwas_study_info = 
    rows_update(gwas_study_info, tibble(PUBMED_ID = 32949544, COHORT = "multiple"), unmatched = "ignore")
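
To illustrate what `rows_update()` does with `unmatched = "ignore"`, here is a minimal sketch on made-up data. The `toy` and `patch` tables and their values are hypothetical, `by = "PUBMED_ID"` is spelled out here although the call above relies on the default key, and the sketch assumes dplyr 1.1.0 or later, where the `unmatched` argument is available.

```r
library(dplyr)

# Hypothetical stand-in for gwas_study_info: one study spans two rows
toy <- tibble(
  PUBMED_ID = c(1L, 1L, 2L),
  COHORT    = c("Dutch controls|UK controls", "UK sCJD cases", "GIANT")
)

# Hypothetical patch: key 1 exists in toy, key 9 does not
patch <- tibble(
  PUBMED_ID = c(1L, 9L),
  COHORT    = c("multiple", "unused")
)

# Rows of toy whose key is in patch are overwritten;
# the unmatched key 9 is skipped instead of raising an error
updated <- rows_update(toy, patch, by = "PUBMED_ID", unmatched = "ignore")
updated$COHORT
```

Under these assumptions both rows of study 1 receive the new label and no row is added for the unmatched key.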

6.6 Odd naming convention in PUBMED_ID 33649486

gwas_study_info |>
  filter(grepl("Multiethnic samples from the UK", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                          COHORT
       <int>                          <char>
1:  33649486 Multiethnic samples from the UK

gwas_study_info |>
    filter(PUBMED_ID == 33649486)

   DATE_ADDED_TO_CATALOG PUBMED_ID  FIRST_AUTHOR       DATE     JOURNAL
                  <IDat>     <int>        <char>     <IDat>      <char>
1:            2021-03-22  33649486 Hardcastle AJ 2021-03-01 Commun Biol
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/33649486
                                                                                                                                 STUDY
                                                                                                                                <char>
1: A multi-ethnic genome-wide association study implicates collagen matrix integrity and cell differentiation pathways in keratoconus.
   DISEASE/TRAIT
          <char>
1:   Keratoconus
                                                INITIAL_SAMPLE_SIZE
                                                             <char>
1: 2,116 European ancestry cases, 24,626 European ancestry controls
                                                                                                                                                                               REPLICATION_SAMPLE_SIZE
                                                                                                                                                                                                <char>
1: 1, 389 European ancestry cases, 79,727 European ancestry controls, 759 South Asian ancestry cases, 8,009 South Asian ancestry controls, 405 African ancestry cases, 4,185 African ancestry controls
       PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT MAPPED_TRAIT
                           <char>             <int>       <char>
1: Affymetrix [7701190] (imputed)                36  keratoconus
                               MAPPED_TRAIT_URI STUDY_ACCESSION
                                         <char>          <char>
1: http://purl.obolibrary.org/obo/MONDO_0015486    GCST90013442
          GENOTYPING_TECHNOLOGY SUBMISSION_DATE STATISTICAL_MODEL
                         <char>          <lgcl>            <lgcl>
1: Genome-wide genotyping array              NA                NA
   BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT_URI
             <lgcl>                  <char>                      <char>
1:               NA                                                    
                            COHORT FULL_SUMMARY_STATISTICS
                            <char>                  <char>
1: Multiethnic samples from the UK                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90013001-GCST90014000/GCST90013442
      GXE
   <char>
1:     no

# Looking at this study: discovery controls come from UKBB, while cases were
# recruited from various places across the UK - so record the cohort as "UKBB|other"

gwas_study_info = 
  rows_update(gwas_study_info, tibble(PUBMED_ID = 33649486, COHORT = "UKBB|other"), 
              unmatched = "ignore")

6.7 GAINT to GIANT

# GAINT appears to be a typo 
# see PUBMED_ID: 36376304 (https://pmc.ncbi.nlm.nih.gov/articles/PMC9663411/)

gwas_study_info |>
  filter(grepl("GAINT", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID     COHORT
       <int>     <char>
1:  36376304 UKBB|GAINT

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GAINT", "GIANT"))

6.8 1982 Pelotas (Brazil) Birth Cohort Study

unique(all_cohorts)[grepl("1982", unique(all_cohorts))]

[1] "1982 PELOTAS"                            
[2] "1982 Pelotas (Brazil) Birth Cohort Study"

gwas_study_info |>
  filter(grepl("1982", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  40537477
2:  39885687
3:  35399580
4:  35399580
                                                                                                                                                                                                                  COHORT
                                                                                                                                                                                                                  <char>
1:                     ARIC|CARDIA|CHS|GENOA|HABC|HANDLS|JHS|MESA|WHI|SP2|BAEPENDI|1982 PELOTAS|AGES|ERF|FHS|HyperGEN|NEO|RS|WHI-GARNET|GeneSTAR|HRS|SMHS|SWHS|CoLaus|KORA|LBC|Lifelines|NESDA|SHIP-Trend|TRAILS|YFS|SOL
2:                                                                                                                                                       ZOE2.0|SLS|BioVU|MyCode|VFA|SOLYouth|1982 PELOTAS|CCHC|EGG|MOBA
3:             BioMe|Baependi|CANDELA|NC-BCFR|SFBCS|FIND|HCHS|SOL|Los Angeles Latino Eye Study|MEC|MESA|Mexico City 1|Mexico City 2|MHS|1982 Pelotas (Brazil) Birth Cohort Study|SAFS|STARR COUNTY|T2D SIGMA Studies|WHI
4: BioMe|Baependi|CANDELA|NC-BCFR|SFBCS|FIND|HCHS|SOL|Los Angeles Latino Eye Study|MEC|MESA|Mexico City 1|Mexico City 2|MHS|1982 Pelotas (Brazil) Birth Cohort Study|SAFS|STARR COUNTY|T2D SIGMA Studies|WHI|AAAGC|GIANT

# can confirm: 39885687 (https://pmc.ncbi.nlm.nih.gov/articles/PMC11875162/) 
# 1982 PELOTAS refers to 1982 Pelotas (Brazil) Birth Cohort Study

# can confirm: 40537477 (https://pmc.ncbi.nlm.nih.gov/articles/PMC12179276/#MOESM2)
# 1982 PELOTAS refers to 1982 Pelotas (Brazil) Birth Cohort Study

# fixed() matches the name literally; as a plain regex the unescaped
# parentheses would form a group and the pattern would never match
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, fixed("1982 Pelotas (Brazil) Birth Cohort Study"), "1982 PELOTAS"))

gwas_study_info |>
  filter(grepl("\\bPELOTAS\\b", COHORT)) |>
  filter(!grepl("1982", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                           COHORT
       <int>                           <char>
1:  34059833 BioMe|IRAS|MESA|PELOTAS|HCHS|SOL

# can confirm: 34059833 
# PELOTAS refers to the 1982 PELOTAS study
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "(?<!1982 )PELOTAS", "1982 PELOTAS"))
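
The pattern `(?<!1982 )PELOTAS` uses a negative lookbehind so that only a bare `PELOTAS` gets the `1982 ` prefix, while entries already written as `1982 PELOTAS` are left alone. A minimal sketch on two cohort strings taken from the output above (the vector name `x` is only for illustration):

```r
library(stringr)

x <- c(
  "BioMe|IRAS|MESA|PELOTAS|HCHS|SOL",                                # bare name
  "ZOE2.0|SLS|BioVU|MyCode|VFA|SOLYouth|1982 PELOTAS|CCHC|EGG|MOBA"  # already prefixed
)

# (?<!1982 ) asserts that the match is NOT preceded by "1982 "
fixed_x <- str_replace_all(x, "(?<!1982 )PELOTAS", "1982 PELOTAS")
fixed_x
```

Without the lookbehind, the second string would end up containing `1982 1982 PELOTAS`.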

6.9 GR@ACE

# Some studies list the cohort as GR@CE and others as GR@ACE -
# perhaps these are the same?

# Checking GR@CE first, as fewer studies list it ... 
gwas_study_info |>
  filter(grepl("GR@CE", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  35379992
2:  39046104
3:  39046104
                                                                                           COHORT
                                                                                           <char>
1:                                        EADB|GR@CE|EADI|GERAD|PERADES|DemGene|Bonn|RS|CCHS|UKBB
2: 3C|AGES|ARIC|ASPREE|CHS|FVG|FHS|GR@CE|Apulia|HKOS|HUNT|MEMENTO|MYHAT|ROSMAP|RS|ADGC|UKBB|SALSA
3:       3C|AGES|ARIC|ASPREE|CHS|FVG|FHS|GR@CE|Apulia|HKOS|HUNT|MEMENTO|MYHAT|ROSMAP|RS|ADGC|UKBB

# 35379992 - GR@CE appears to be a typo, should be: GR@ACE
# https://pmc.ncbi.nlm.nih.gov/articles/PMC9005347/#Sec8

# 39046104 - GR@CE also appears to be a typo, should be: GR@ACE
# https://pmc.ncbi.nlm.nih.gov/articles/PMC11497727/#alz14115-sec-0080

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GR@CE", "GR@ACE"))

6.10 Tohoku Medical Megabank

gwas_study_info |> 
  filter(grepl("tohoku", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                          COHORT
       <int>                          <char>
1:  40226751         Tohoku Medical Megabank
2:  34782693 Tohoku Medical Megabank Project

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Tohoku Medical Megabank Project", "Tohoku Medical Megabank"))

6.11 Steno Diabetes

gwas_study_info |> 
  filter(grepl("steno", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT, `DISEASE/TRAIT`) |>
  distinct()

   PUBMED_ID
       <int>
1:  34127860
2:  35627254
                                                                                                                                                                              COHORT
                                                                                                                                                                              <char>
1: BC58|BDA|NIHR BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap
2:                                                                                                                                                                             Steno
                                           DISEASE/TRAIT
                                                  <char>
1:                                       Type 1 diabetes
2: Neuropeptide Y autoantibody levels in type 1 diabetes

all_cohorts[grep("steno", tolower(all_cohorts))] |> unique()

[1] "Steno Diabetes Center" "Steno"

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                  "Steno Diabetes Center", 
                                  "Steno")
         )

6.12 Nagahama

gwas_study_info |>
  filter(grepl("nagahama", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  35551307
2:  34059833
3:  34059833
4:  34887591
5:  40181193
6:  38277453
                                                                                                                                                                                                                             COHORT
                                                                                                                                                                                                                             <char>
1:                                                                                                                     AASC|BBJ|BES|CAGE|CHNS|CKB|CLHNS|DC|SP2|HKDR|KARE|MESA|Nagahama Study|SBCS|SWHS|SCES|SCHS|SiMES|TAICHI|TWT2D
2:                                                                                                            AASC|BES|CAGE-GWAS1|CAGE|CLHNS|CHNS|KARE|Living Biobank|MESA|Nagahama Study|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|SBCS|SMHS
3:                                                                                                                                  CAGE-GWAS1|CAGE|CHNS|KARE|LivingBiobank|MESA|NagahamaStudy|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|TWSC
4:                                                                                            BAS|BBJ|BES|CAGE|CAS|CHNS|CKB|SDCS|JPDSC|KARE|Living-biobank|MESA|Nagahama Study|NHAPC|SBCS|SCES|SCHS|SiMES|SINDI|SP2|SWHS|TUDR|TWT2D
5: AGES|ALSPAC|ARIC|BHS_b|CARDIA|CCHC|CFS|CHS|COLAUS|DIACORE|DRS_EXTRA|EPIC-Norfolk|EB|FHS|Fenland|GAPP|GENSALT|HANDLS|HCS|IRASFS|JHS|KOGES|LBC|LifeLines|LLFS|MESA|MVP|Nagahama_Study|NEO|NESDA|SHIP|SOL|SWAN|TwinsUK|UKBB|WHI|YFS
6:                                                                                                                                                                                           HERPACC|J-MICC|JPHC|ToMMo|Nagahama|BBJ

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Nagahama_Study|NagahamaStudy", "Nagahama Study")) |> 
  # TODO: confirm that "Nagahama" and "Nagahama Study" refer to the same cohort
  mutate(COHORT = str_replace_all(COHORT, "Nagahama Study", "Nagahama"))

6.13 WTCCC - Bipolar disease cases

gwas_study_info |>
  filter(grepl("WTCCC - Bipolar disease cases", COHORT)) |>
  select(1:5)

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE      JOURNAL
                  <IDat>     <int>       <char>     <IDat>       <char>
1:            2021-04-23  33830302   Inshaw JRJ 2021-04-08 Diabetologia

gwas_study_info |>
  filter(PUBMED_ID == 33830302) |>
  select(PUBMED_ID, COHORT)

   PUBMED_ID
       <int>
1:  33830302
2:  33830302
                                                                                                                                                                     COHORT
                                                                                                                                                                     <char>
1: GRID|British 1958 birth cohort|National blood service|WTCCC - Bipolar disease cases|Oxford Regional Prospective Study of Childhood Diabetes (ORPS)|Sardinia case-control
2:

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "WTCCC - Bipolar disease cases", "WTCCC"))

6.14 Qatar Genome Program

gwas_study_info |>
  filter(grepl("QGP", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                     COHORT
       <int>                     <char>
1:  33623009 Qatar Genome Program (QGP)
2:  36168886                        QGP

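# Checked 36168886 - QGP is the Qatar Genome Program

# so use the acronym everywhere; fixed() matches the parentheses literally
# (as a plain regex, "(QGP)" would be a group and the pattern would not match)
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, fixed("Qatar Genome Program (QGP)"), "QGP"))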
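
Because `str_replace_all()` treats its pattern as a regular expression by default, parentheses in a cohort name need care. The sketch below compares three ways of writing the pattern on a single string from the output above (the variable names are only for illustration):

```r
library(stringr)

x <- "Qatar Genome Program (QGP)"

# Unescaped: "(QGP)" is a capture group, so the regex looks for
# "Qatar Genome Program QGP" and leaves x unchanged
as_regex <- str_replace_all(x, "Qatar Genome Program (QGP)", "QGP")

# Escaped parentheses match the literal characters
escaped <- str_replace_all(x, "Qatar Genome Program \\(QGP\\)", "QGP")

# fixed() switches off regex interpretation altogether
literal <- str_replace_all(x, fixed("Qatar Genome Program (QGP)"), "QGP")

c(as_regex = as_regex, escaped = escaped, literal = literal)
```

Only the escaped and `fixed()` variants shorten the name to `QGP`.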

6.15 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1123
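
Since this count is recomputed several times, the three steps above could be wrapped in a small helper. This is only a sketch: the function name `count_unique_cohorts` is hypothetical, and dropping `NA` entries is an extra assumption that the original steps do not make.

```r
# Count distinct cohort labels in a "|"-separated COHORT column
count_unique_cohorts <- function(cohort_column) {
  # fixed = TRUE splits on the literal "|" without regex escaping
  parts <- unlist(strsplit(cohort_column, "|", fixed = TRUE))
  # drop empty and missing labels (NA handling is an added assumption)
  parts <- parts[!is.na(parts) & parts != ""]
  length(unique(parts))
}

# Toy input: two distinct labels, UKBB and GIANT
n_toy <- count_unique_cohorts(c("UKBB|GIANT", "", "UKBB", NA))
n_toy
```

On the real data the call would be `count_unique_cohorts(gwas_study_info$COHORT)`.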

7 Discrepancies corrected across papers (only quick manual review - could have errors):

7.1 1982 Pelotas (Brazil) Birth Cohort Study -> 1982 PELOTAS

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  '1982 Pelotas \\(Brazil\\) Birth Cohort Study',
                                  '1982 PELOTAS'))

7.2 ABCD study

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                  "ABCD study", 
                                  "ABCD"))

7.3 canSCAD

# Two variants: "canSCAD" and "CanSCAD cases and MGI controls" 
gwas_study_info |>
  filter(grepl("CanSCAD cases and MGI controls", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                         COHORT
       <int>                         <char>
1:  32887874 CanSCAD cases and MGI controls

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "CanSCAD cases and MGI controls", "canSCAD|MGI"))

7.4 Potentially simple check: similar names (differing only in capitalisation)

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]

unique_cohort_names = unique(all_cohorts) 

# Convert to lowercase and check duplicates
dup_groups <- tapply(unique_cohort_names, 
                     tolower(unique_cohort_names), 
                     I)

# Keep only groups with >1 element (i.e., capitalization differences)
dup_groups[lengths(dup_groups) > 1]

$airwave
[1] "Airwave" "AIRWAVE"

$allofus
[1] "AllofUs" "AllOfUs"

$baependi
[1] "BAEPENDI" "Baependi"

$biome
[1] "BioMe" "BioME" "BIOME"

$biovu
[1] "BioVU" "BioVu" "BIOVU"

$cilento
[1] "CILENTO" "Cilento"

$colaus
[1] "CoLaus" "COLAUS"

$`croatia-korcula`
[1] "CROATIA-KORCULA" "CROATIA-Korcula"

$famhs
[1] "FamHS" "FAMHS"

$fenland
[1] "Fenland" "FENLAND"

$gel
[1] "GEL" "GeL"

$genestar
[1] "GeneSTAR" "GENESTAR" "GeneStar"

$gensalt
[1] "GENSalt" "GENSALT" "GenSalt"

$godarts
[1] "GoDARTS" "GODARTS"

$hypergen
[1] "HyperGEN" "HyperGen" "HYPERGEN"

$inchianti
[1] "InCHIANTI" "INCHIANTI"

$inter99
[1] "Inter99" "INTER99"

$koges
[1] "KoGES" "KOGES"

$`life-heart`
[1] "LIFE-HEART" "LIFE-Heart"

$lifelines
[1] "LifeLines" "Lifelines"

$`mayo-vdb`
[1] "MAYO-VDB" "Mayo-VDB"

$moba
[1] "MOBA" "MoBa"

$nugene
[1] "Nugene" "NUGENE"

$orcades
[1] "ORCADES" "Orcades"

$panscan
[1] "PANSCAN" "PanScan"

$raine
[1] "RAINE" "Raine"

$`ship-trend`
[1] "SHIP-TREND" "SHIP-Trend"

$sign
[1] "SiGN" "SIGN"

$viva
[1] "Viva" "VIVA"

7.5 Potentially simple check: similar names (differing only in spaces and _)

# Normalize by removing spaces and underscores
normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized value
dup_groups <- tapply(sort(unique_cohort_names), normalized, I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

$DRSEXTRA
[1] "DRS_EXTRA" "DRSEXTRA" 

$GALAII
[1] "GALA II" "GALA_II"

$Health2000
[1] "Health 2000" "Health2000" 

$HealthABC
[1] "Health ABC" "HealthABC" 

$`INGI-ValBorbera`
[1] "INGI-Val Borbera" "INGI-Val_Borbera"

$LivingBiobank
[1] "Living Biobank" "LivingBiobank"
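
The two checks above (case, then spaces and underscores) could also be combined into a single normalisation key. The sketch below is one possible way to do that, not the approach used in this analysis: the helper name `find_name_variants` is hypothetical, and also stripping hyphens is an added assumption, made so that a variant such as `Living-biobank` is grouped with the others.

```r
# Group cohort names that are identical once case, spaces,
# underscores and hyphens are ignored
find_name_variants <- function(cohort_names) {
  key <- tolower(gsub("[ _-]", "", cohort_names))
  groups <- split(cohort_names, key)
  # keep only keys with more than one spelling
  groups[lengths(groups) > 1]
}

variants <- find_name_variants(
  c("Living Biobank", "LivingBiobank", "Living-biobank",
    "GALA II", "GALA_II", "UKBB")
)
variants
```

Stripping hyphens can also merge names that are genuinely different, so any group found this way still needs a manual look.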

7.6 Airwave

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AIRWAVE", "Airwave"))

7.7 AllOfUs

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AllOfUs", "AllofUs"))

7.8 Baependi

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("Baependi"), "Baependi"))

7.9 BioMe

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "BIOME|BioME", "BioMe"))

7.10 BioVU

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "BIOVU|BioVu", "BioVU"))
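
The one-line replacements in this section all follow the same pattern, so they could alternatively be driven by a lookup table that is applied to whole cohort labels rather than to substrings. The following is a sketch of that idea only: `canonical` and `recode_cohorts` are hypothetical names, the lookup holds just a few of the mappings from this section, and the function assumes the input contains no `NA` values.

```r
# Named vector: variant spelling -> canonical spelling
canonical <- c(
  AIRWAVE = "Airwave",
  AllOfUs = "AllofUs",
  BIOME   = "BioMe",
  BioME   = "BioMe",
  BIOVU   = "BioVU",
  BioVu   = "BioVU"
)

# Recode each "|"-separated label only when the whole label is in the lookup
recode_cohorts <- function(cohort, lookup) {
  vapply(strsplit(cohort, "|", fixed = TRUE), function(tokens) {
    hit <- tokens %in% names(lookup)
    tokens[hit] <- unname(lookup[tokens[hit]])
    paste(tokens, collapse = "|")
  }, character(1))
}

recoded <- recode_cohorts(c("AIRWAVE|BIOME|UKBB", "", "BioVu"), canonical)
recoded
```

Matching whole labels means a longer name that merely contains one of the variants is left untouched, which substring replacement does not guarantee.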

7.11 British 1958 birth cohort

7.11.1 British 1958 birth cohort -> B58C

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "British 1958 birth cohort", "B58C"))

7.11.2 BC58 -> B58C

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                  "BC58", 
                                  "B58C")
         )

7.12 China Kadoorie Biobank

# CKB is the acronym for the China Kadoorie Biobank
# (see PUBMED_ID 36777997: https://pmc.ncbi.nlm.nih.gov/articles/PMC9903787/#tbl1)

gwas_study_info |>
  filter(grepl("\\bCKB\\b", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct() |>
  tail()

   PUBMED_ID                                                          COHORT
       <int>                                                          <char>
1:  34586374                                            CKB|23andMe|WHI|UKBB
2:  34586374                                                             CKB
3:  34586374                                                        CKB|UKBB
4:  34586374                                                         CKB|WHI
5:  33766948                                                             CKB
6:  36777997 BBJ|BioMe|BioVU|CCPM|CKB|EB|FinnGen|G&H|HUNT|MGBB|MGI|UCLA|UKBB

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                  "China Kadoorie Biobank", 
                                  "CKB")
         )

7.13 Cilento

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("Cilento"), "Cilento"))

7.14 CoLaus

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "COLAUS", "CoLaus"))

7.15 CROATIA-Korcula

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "CROATIA-KORCULA", "CROATIA-Korcula"))

7.16 DRS_EXTRA

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "DRSEXTRA", "DRS_EXTRA"))

7.17 FamHS

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("FamHS"), "FamHS"))

7.18 Fenland

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "FENLAND", "Fenland"))

7.19 GALA II

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GALA_II", "GALA II"))

7.20 GEL

gwas_study_info |>
  filter(grepl("GeL", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID COHORT
       <int> <char>
1:  36124557    GeL

# Only one study uses GeL (36124557); see
# https://pmc.ncbi.nlm.nih.gov/articles/PMC9512401/#s4 
# It appears to be a typo for Genomics England (GEL)
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GeL", "GEL"))

7.21 GeneSTAR

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "GENESTAR|GeneStar", "GeneSTAR"))

7.22 GENSalt

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "GENSALT|GenSalt", "GENSalt"))

7.23 GoDARTS

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("godarts"), "GoDARTS"))

7.24 InCHIANTI

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("InCHIANTI"), "InCHIANTI"))

7.25 Inter99

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Inter99"), "Inter99"))

7.26 Health ABC

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Health ABC", "HealthABC"))

7.27 Health 2000

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Health 2000", "Health2000"))

7.28 HyperGen

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "HyperGEN|HYPERGEN", "HyperGen")) 


7.29 INGI-Val Borbera

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "INGI-Val_Borbera", "INGI-Val Borbera")) 


7.30 KoGES

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("KoGES"), "KoGES"))

7.31 LIFE-Heart

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "LIFE-HEART", "LIFE-Heart"))

7.32 LifeLines

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Lifelines", "LifeLines")) 

# ? LifeLines Deep

7.33 Living Biobank

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Living-biobank|LivingBiobank", "Living Biobank"))

7.34 Mayo-VDB

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Mayo-VDB"), "Mayo-VDB"))

7.35 A Multiethnic Genome-wide Scan of Prostate Cancer -> MEC

This probably refers to the following dbGaP study (not confirmed): https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000306.v4.p1

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'A Multiethnic Genome-wide Scan of Prostate Cancer',
                                  'MEC')) |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'Multiethnic Genome-wide Scan of Prostate Cancer',
                                  'MEC'))

7.36 MoBa

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("MoBa"), "MoBa"))

7.37 Nugene

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Nugene", "NUGENE"))

7.38 Orcades

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Orcades"), "Orcades"))

7.39 Oxford Regional Prospective Study of Childhood Diabetes (ORPS) -> ORPS

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'Oxford Regional Prospective Study of Childhood Diabetes \\(ORPS\\)',
                                  'ORPS'))

7.40 Qatar Genome Program (QGP) -> QGP

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'Qatar Genome Program \\(QGP\\)',
                                  'QGP'))

7.41 PanScan

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("PanScan"), "PanScan"))

7.42 Raine

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Raine"), "Raine"))

7.43 ROSMAP

all_cohorts[grep("rosmap", tolower(all_cohorts))] |> unique()

[1] "ROSMAP"   "ROSMAP 1" "ROSMAP 2"

gwas_study_info |>
  filter(grepl("ROSMAP 1|ROSMAP 2", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  33510174
                                                                                                                                                                     COHORT
                                                                                                                                                                     <char>
1: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP 1|ROSMAP 2|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "ROSMAP 1|ROSMAP 2", "ROSMAP"))

7.44 SHIP-TREND

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SHIP-Trend", "SHIP-TREND"))

# ? "SHIPNATREND"  - comes from one study
gwas_study_info |>
 filter(grepl("SHIPNATREND", COHORT)) |>
 select(PUBMED_ID, COHORT) |>
 distinct()

    PUBMED_ID
        <int>
 1:  32888493
 2:  32888493
 3:  32888493
 4:  32888493
 5:  32888493
 6:  32888493
 7:  32888493
 8:  32888493
 9:  32888493
10:  32888493
11:  32888493
12:  32888493
13:  32888493
14:  32888493
15:  32888493
16:  32888493
17:  32888493
18:  32888493
19:  32888493
20:  32888493
21:  32888493
22:  32888493
    PUBMED_ID
                                                                                                                                                                                                                 COHORT
                                                                                                                                                                                                                 <char>
 1:                                                                                     Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIPNATREND|UKBB|WHI
 2:                                                                      Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIPNATREND|UKBB|WHI
 3:                                                                                           Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 4:                                                                            Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 5:                                                                                       Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 6:                                                                    Airwave|BBJ|BioMe|CaPS|CHS|CHS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 7:                                                                            Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 8:                                                             Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 9:                                                                                                              Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
10:                                                                                               Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
11:                                                                        Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
12:                                                         Airwave|BBJ|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
13:                                       Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|Health2006|Health2008|Health2010|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
14:                   Airwave|BBJ|BioMe|CaPS|CHNS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|Health2006|Health2008|Health2010|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
15:                                                                                                                     Airwave|BioMe|CaPS|Estonia|FHS|INTERVAL|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI
16:                                                                               Airwave|BioMe|BioMe|BioMe|CaPS|Estonia|FHS|HANDLS|INTERVAL|JHS|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
17:                                                                                               Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
18:                                 Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
19: Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|CHNS|CHS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI|YFS
20:         Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|CHNS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI|YFS
21:                                                                                                              Airwave|BioMe|CaPS|FHS|GERA|GERA|GERA|INTERVAL|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI
22:                                                              Airwave|BioMe|BioMe|BioMe|CaPS|FHS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
                                                                                                                                                                                                                 COHORT

# From the supplementary table, SHIPNATREND appears to be SHIP-TREND:
# https://pmc.ncbi.nlm.nih.gov/articles/PMC7480402/#SD1

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SHIPNATREND", "SHIP-TREND"))

7.45 SIGN -> SiGN

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SIGN", "SiGN"))
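Because `str_replace_all()` matches substrings, a short pattern such as "SIGN" would also rewrite any longer label that happens to contain it. A minimal sketch of an exact-token alternative follows. It assumes labels are separated by "|", as in the COHORT column. The helper name `rename_cohort_token` and the label "DESIGN" are hypothetical and used only for illustration.

```r
# Hypothetical helper: rename a cohort label only when a whole
# "|"-separated token equals `from`, leaving longer labels untouched.
rename_cohort_token <- function(cohort, from, to) {
  tokens <- strsplit(cohort, "|", fixed = TRUE)
  vapply(tokens, function(x) {
    # Keep missing COHORT values missing instead of pasting "NA"
    if (length(x) == 1 && is.na(x)) return(NA_character_)
    x[x %in% from] <- to
    paste(x, collapse = "|")
  }, character(1))
}

# Toy input; "DESIGN" is an invented label that contains "SIGN"
toy_cohort <- c("ARIC|SIGN|UKBB", "DESIGN|SIGN", "UKBB", NA)
renamed <- rename_cohort_token(toy_cohort, "SIGN", "SiGN")
renamed
```

Whether any real label in the catalog contains "SIGN" as a substring has not been checked here; the sketch only shows how to rule that case out.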

7.46 Taiwan -> TWB

# PUBMED_IDs of the studies in which "Taiwan" is renamed to "TWB"
twb_pubmed_ids <-  c("34026292",
                     "36329257",
                     "36009466",
                     "35046404",
                     "34934334",
                     "34834521",
                     "34404248",
                     "36778051",
                     "34522458"
                    )

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = ifelse(PUBMED_ID %in% twb_pubmed_ids,
                         str_replace_all(COHORT, "Taiwan", "TWB"),
                       COHORT
  )
  )
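In the printed tables PUBMED_ID is an integer column, while `twb_pubmed_ids` is a character vector. `%in%` coerces both sides to a common type, so the filter above still matches. The sketch below illustrates this on toy values; the ID 12345678 is invented for the example.

```r
# Integer IDs, as in the PUBMED_ID column (12345678 is a made-up ID)
pubmed_id_int <- c(34026292L, 12345678L)

# Character IDs, as in twb_pubmed_ids
twb_ids_chr <- c("34026292", "36329257")

# %in% is built on match(), which coerces both sides to a common type
in_chr <- pubmed_id_int %in% twb_ids_chr

# Converting the lookup vector first makes the comparison type explicit
in_int <- pubmed_id_int %in% as.integer(twb_ids_chr)

in_chr
in_int
```

Both comparisons give the same result. Storing the IDs as integers is simply a way to make the intent explicit.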

7.47 T2D SIGMA Studies -> SIGMA T2D

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'T2D SIGMA Studies',
                                  'SIGMA T2D'))

7.48 Tohoku Medical Megabank -> TMM

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'Tohoku Medical Megabank',
                                  'TMM'))

7.49 Understanding Society -> UnderstandingSociety

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Understanding Society", "UnderstandingSociety"))

7.50 Viva -> VIVA

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Viva", "VIVA"))
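The global renames in the preceding sections all follow the same pattern, so they could be collected in one named vector and applied in a single call. The sketch below runs on toy data rather than `gwas_study_info`. The object name `cohort_renames` is an assumption, and the names of the vector are treated as regular expressions by stringr.

```r
library(stringr)

# pattern = replacement pairs taken from the renames above
cohort_renames <- c(
  "SHIPNATREND"             = "SHIP-TREND",
  "T2D SIGMA Studies"       = "SIGMA T2D",
  "Tohoku Medical Megabank" = "TMM",
  "Understanding Society"   = "UnderstandingSociety"
)

# Toy COHORT values for illustration only
toy_cohort <- c("SHIP|SHIPNATREND|UKBB",
                "Tohoku Medical Megabank|BBJ",
                "Understanding Society")

renamed <- str_replace_all(toy_cohort, cohort_renames)
renamed
```

The replacements are applied one after another, so a later pattern can match text produced by an earlier one. That is harmless for these four pairs but worth checking if the list grows.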

7.51 Rotterdam -> RS

gwas_study_info |>
  filter(grepl("Rotterdam", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE   JOURNAL
                  <IDat>     <int>       <char>     <IDat>    <char>
1:            2025-03-11  40050429    Roselli C 2025-03-06 Nat Genet
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/40050429
                                                                                                                         STUDY
                                                                                                                        <char>
1: Meta-analysis of genome-wide associations and polygenic risk prediction for atrial fibrillation in more than 180,000 cases.
         DISEASE/TRAIT
                <char>
1: Atrial fibrillation
                                                                                                                                                                                                                                                                                                                                                                                      INITIAL_SAMPLE_SIZE
                                                                                                                                                                                                                                                                                                                                                                                                   <char>
1: 1,782 Admix African and African American cases, 9,356 Admix African and African American controls, 11,350 East Asian ancestry cases, 137,515 East Asian ancestry controls, 166,322 European ancestry cases, 1,313,950 European ancestry controls, 1,774 Hispanic or Latin American cases, 7,665 Hispanic or Latin American controls, 218 South Asian ancestry cases, 413 South Asian ancestry controls
   REPLICATION_SAMPLE_SIZE                PLATFORM_[SNPS_PASSING_QC]
                    <char>                                    <char>
1:                    <NA> Affymetrix, Illumina [29789980] (imputed)
   ASSOCIATION_COUNT        MAPPED_TRAIT                     MAPPED_TRAIT_URI
               <int>              <char>                               <char>
1:               355 atrial fibrillation http://www.ebi.ac.uk/efo/EFO_0000275
   STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
            <char>                       <char>          <lgcl>
1:    GCST90559230 Genome-wide genotyping array              NA
   STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
              <lgcl>           <lgcl>                  <char>
1:                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                                                                                                                           COHORT
                                                                                                                                                                                                                           <char>
1: AGES|ARIC|BioMe|Broad CVDi|BBJ|CHS|MESA|SiGN|ENGAGE_AF-TIMI_48|SPHFC|CCAF|CHB|MyCode|EGCUT|FHS|GAPP|GS:SFHS|HRS|LURIC|HUNT|MGI|PHB|PIVUS|PREVEND|PROSPER|Rotterdam|SHIP|SiGN|TwinGene|ULSAM|Vanderbilt|WGHS|WTCCC|FinnGen|UKBB
   FULL_SUMMARY_STATISTICS SUMMARY_STATS_LOCATION    GXE
                    <char>                 <char> <char>
1:                      no                   <NA>     no

# Rotterdam study is typically listed as "RS"

# see e.g. 36568030 https://pmc.ncbi.nlm.nih.gov/articles/PMC9772568/
gwas_study_info |>
  filter(grepl("\\bRS\\b", COHORT))

     DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE
                    <IDat>     <int>       <char>     <IDat>
  1:            2023-03-21  36662418     Faber BG 2023-01-20
  2:            2023-05-12  36918541     Young WJ 2023-03-14
  3:            2023-05-12  36918541     Young WJ 2023-03-14
  4:            2023-05-12  36918541     Young WJ 2023-03-14
  5:            2023-05-12  36918541     Young WJ 2023-03-14
 ---                                                        
332:            2023-01-31  36568030     Young KL 2022-11-25
333:            2023-01-31  36568030     Young KL 2022-11-25
334:            2023-01-31  36568030     Young KL 2022-11-25
335:            2023-01-31  36568030     Young KL 2022-11-25
336:            2023-01-31  36568030     Young KL 2022-11-25
                 JOURNAL                                 LINK
                  <char>                               <char>
  1: Arthritis Rheumatol www.ncbi.nlm.nih.gov/pubmed/36662418
  2:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  3:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  4:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  5:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
 ---                                                         
332:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
333:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
334:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
335:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
336:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
                                                                                                                                 STUDY
                                                                                                                                <char>
  1: A GWAS meta-analysis of alpha angle suggests cam-type morphology may be a specific feature of hip osteoarthritis in older adults.
  2:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  3:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  4:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  5:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
 ---                                                                                                                                  
332:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
333:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
334:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
335:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
336:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
           DISEASE/TRAIT
                  <char>
  1:         Alpha angle
  2: Frontal QRS-T angle
  3: Spatial QRS-T angle
  4: Spatial QRS-T angle
  5: Frontal QRS-T angle
 ---                    
332:     Waist-hip ratio
333:     Waist-hip ratio
334:     Waist-hip ratio
335:     Waist-hip ratio
336:     Waist-hip ratio
                                                                     INITIAL_SAMPLE_SIZE
                                                                                  <char>
  1:                                                44,214 European ancestry individuals
  2: 159,715 European ancestry, African ancestry, Hispanic or Latin American individuals
  3:                                                96,562 European ancestry individuals
  4: 118,780 European ancestry, African ancestry, Hispanic or Latin American individuals
  5:                                               134,567 European ancestry individuals
 ---                                                                                    
332:                                                15,503 European ancestry individuals
333:                                                       8,678 European ancestry women
334:                                                         6,825 European ancestry men
335:                         2,987 African ancestry women, 8,678 European ancestry women
336:                             1,307 African ancestry men, 6,825 European ancestry men
                                       REPLICATION_SAMPLE_SIZE
                                                        <char>
  1:                                                      <NA>
  2:                                                      <NA>
  3:                                                      <NA>
  4:                                                      <NA>
  5:                                                      <NA>
 ---                                                          
332:                       1,229 European ancestry individuals
333:                               771 European ancestry women
334:                                 758 European ancestry men
335: 771 European ancestry women, 2,308 African American women
336:     758 European ancestry men, 1,239 African American men
                   PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                       <char>             <int>
  1: Affymetrix, Illumina [9134976] (imputed)                 8
  2: Affymetrix, Illumina [8299259] (imputed)                11
  3: Affymetrix, Illumina [8603009] (imputed)                51
  4: Affymetrix, Illumina [9052360] (imputed)                61
  5: Affymetrix, Illumina [7954211] (imputed)                 9
 ---                                                           
332:                               NR [67633]                 0
333:                               NR [67633]                 0
334:                               NR [67633]                 0
335:                               NR [67633]                 0
336:                               NR [67633]                 0
                MAPPED_TRAIT                     MAPPED_TRAIT_URI
                      <char>                               <char>
  1: alpha angle measurement http://www.ebi.ac.uk/efo/EFO_0020071
  2:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  3:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  4:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  5:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
 ---                                                             
332:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
333:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
334:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
335:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
336:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
     STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
              <char>                       <char>          <lgcl>
  1:    GCST90129635 Genome-wide genotyping array              NA
  2:    GCST90246319 Genome-wide genotyping array              NA
  3:    GCST90246320 Genome-wide genotyping array              NA
  4:    GCST90246318 Genome-wide genotyping array              NA
  5:    GCST90246321 Genome-wide genotyping array              NA
 ---                                                             
332:    GCST90245813        Exome-wide sequencing              NA
333:    GCST90245814        Exome-wide sequencing              NA
334:    GCST90245815        Exome-wide sequencing              NA
335:    GCST90245816        Exome-wide sequencing              NA
336:    GCST90245817        Exome-wide sequencing              NA
     STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
                <lgcl>           <lgcl>                  <char>
  1:                NA               NA                        
  2:                NA               NA                        
  3:                NA               NA                        
  4:                NA               NA                        
  5:                NA               NA                        
 ---                                                           
332:                NA               NA                        
333:                NA               NA                        
334:                NA               NA                        
335:                NA               NA                        
336:                NA               NA                        
     MAPPED_BACKGROUND_TRAIT_URI
                          <char>
  1:                            
  2:                            
  3:                            
  4:                            
  5:                            
 ---                            
332:                            
333:                            
334:                            
335:                            
336:                            
                                                                                                                   COHORT
                                                                                                                   <char>
  1:                                                                                                              UKBB|RS
  2: ARIC|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  3: ARIC|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  4: ARIC|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  5: ARIC|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
 ---                                                                                                                     
332:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS
333:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS
334:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS
335:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS
336:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS
     FULL_SUMMARY_STATISTICS
                      <char>
  1:                     yes
  2:                     yes
  3:                     yes
  4:                     yes
  5:                     yes
 ---                        
332:                      no
333:                      no
334:                      no
335:                      no
336:                      no
                                                                                SUMMARY_STATS_LOCATION
                                                                                                <char>
  1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90129001-GCST90130000/GCST90129635
  2: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246319
  3: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246320
  4: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246318
  5: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246321
 ---                                                                                                  
332:                                                                                              <NA>
333:                                                                                              <NA>
334:                                                                                              <NA>
335:                                                                                              <NA>
336:                                                                                              <NA>
        GXE
     <char>
  1:     no
  2:     no
  3:     no
  4:     no
  5:     no
 ---       
332:     no
333:     no
334:     no
335:     no
336:     no

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Rotterdam", "RS"))

7.52 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1069

7.53 Check in: how many cohorts are listed only once (possibly indicating a misnaming error)?

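# Count how often each cohort label occurs across all study rows,
# then keep the labels that occur exactly once
single_use_cohorts =
  data.frame(cohort = all_cohorts) |>
  group_by(cohort) |>
  summarise(n_studies = n()) |>
  filter(n_studies == 1) |>
  pull(cohort)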

length(single_use_cohorts)

[1] 209
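The count above tallies how often a label occurs across study rows, so a label used in several rows of a single publication is not counted as single-use. A minimal sketch of the per-publication variant follows, on toy data with invented PUBMED_IDs. The object names `toy`, `long` and `single_pubmed_cohorts` are assumptions.

```r
library(dplyr)

# Toy study table; the PUBMED_IDs are invented for illustration
toy <- data.frame(
  PUBMED_ID = c(1L, 1L, 2L, 3L),
  COHORT    = c("UKBB|RS", "UKBB|RS", "UKBB|ERF", "Rotterdam")
)

# One row per (PUBMED_ID, cohort label)
tokens <- strsplit(toy$COHORT, "|", fixed = TRUE)
long <- data.frame(
  PUBMED_ID = rep(toy$PUBMED_ID, lengths(tokens)),
  cohort    = unlist(tokens)
)
long <- long[long$cohort != "", ]

# Labels that occur in exactly one distinct PUBMED_ID
single_pubmed_cohorts <- long |>
  distinct(PUBMED_ID, cohort) |>
  count(cohort, name = "n_pubmed_ids") |>
  filter(n_pubmed_ids == 1) |>
  pull(cohort)

single_pubmed_cohorts
```

In the toy data "RS" appears in two rows but only one publication, so it is flagged here although a row-based count would skip it.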

7.54 Check in: have we corrected the simple changes we sought to correct?

unique_cohort_names = unique(all_cohorts) 

# Convert to lowercase and check duplicates
dup_groups <- tapply(unique_cohort_names, tolower(unique_cohort_names), I)

# Keep only groups with >1 element (i.e., capitalization differences)
dup_groups[lengths(dup_groups) > 1]

named character(0)

normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized value
dup_groups <- tapply(sort(unique_cohort_names), normalized, I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

named character(0)

7.55 Additional checks:

normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized value
dup_groups <- tapply(sort(unique_cohort_names), tolower(normalized), I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

named character(0)
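The checks above ignore case, spaces and underscores, but not hyphens, so a pair such as B-PROOF and BPROOF passes them unnoticed. The sketch below wraps the same grouping idea in one function with a configurable set of characters to strip. The function name `find_name_variants` and its `strip` argument are assumptions, and the toy names are for illustration only.

```r
# Hypothetical helper: group labels that become identical after
# lowercasing and removing the characters matched by `strip`.
find_name_variants <- function(cohort_names, strip = "[ _-]") {
  cohort_names <- sort(unique(cohort_names))
  key <- tolower(gsub(strip, "", cohort_names))
  groups <- split(cohort_names, key)
  # Keep only keys shared by more than one spelling
  groups[lengths(groups) > 1]
}

# Toy labels; "LASA_I" is an invented spelling variant
toy_names <- c("B-PROOF", "BPROOF", "UKBB", "LASA I", "LASA_I", "RS")
variants <- find_name_variants(toy_names)
variants
```

Stripping hyphens can also merge labels that are genuinely different, so the output is best read as a list of candidates for manual review.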

7.56 Checking: Any fuzzy names to check?

library(stringdist)
library(dplyr)

# Identify pairs with small distance (e.g., <=2 edits)
small_dist_pairs <- function(threshold, 
                             dist_matrix_method,
                             all_cohorts) {
  
  # Create a vector of unique cohort names
  cohorts <- unique(all_cohorts)
  
  single_use_cohorts  =   data.frame(cohort = all_cohorts) |>
                          group_by(cohort) |>
                          summarise(n_studies = n()) |>
                          filter(n_studies == 1) |>
                          pull(cohort)
  
  # Compute pairwise string distances using dist_matrix_method
  # (e.g. "lv" for Levenshtein, "lcs" for longest common subsequence)
  dist_matrix <- stringdistmatrix(single_use_cohorts, 
                                  cohorts, 
                                  method = dist_matrix_method)
  
  matches <- which(dist_matrix > 0 & dist_matrix <= threshold, 
                   arr.ind = TRUE)
  
  matches <- data.frame(
    cohort1 = single_use_cohorts[matches[,1]],
    cohort2 = cohorts[matches[,2]],
    distance = dist_matrix[matches]
  )
  
  matches <- matches[matches$cohort1 != matches$cohort2, ]
  matches <- unique(matches)
  
  return(matches)
  
}

small_dist_pairs(threshold = 2, 
                 dist_matrix_method = "lv",
                 all_cohorts = all_cohorts) |>
  arrange(distance) |>
  head()

  cohort1 cohort2 distance
1    CHIP    SHIP        1
2   SpBCS   SEBCS        1
3    DCHS   DACHS        1
4     HIS     HAS        1
5    MACS    MCCS        1
6    MHCS    MCCS        1

small_dist_pairs(threshold = 2,
                 dist_matrix_method = "lcs",
                 all_cohorts = all_cohorts) |>
  arrange(distance) |>
  head()

  cohort1 cohort2 distance
1    DCHS   DACHS        1
2    DNHS     NHS        1
3    CCHS     CHS        1
4    DCHS     CHS        1
5     BLS    BLTS        1
6     EBB      EB        1
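The two listings differ because the methods count edits differently: "lv" allows substitutions, whereas "lcs" only allows insertions and deletions, so a single substituted letter costs 2. A small sketch using two pairs from the output above:

```r
library(stringdist)

# Two pairs taken from the listings above
pairs <- data.frame(
  cohort1 = c("CHIP", "DCHS"),
  cohort2 = c("SHIP", "DACHS")
)

# Levenshtein: substitution, insertion and deletion each cost 1
pairs$lv <- stringdist(pairs$cohort1, pairs$cohort2, method = "lv")

# LCS distance: only insertion and deletion, so a substitution costs 2
pairs$lcs <- stringdist(pairs$cohort1, pairs$cohort2, method = "lcs")

pairs
```

This would explain why CHIP/SHIP leads the "lv" listing but is missing from the first rows of the "lcs" listing: it still falls within the threshold of 2 but sorts below the distance-1 pairs.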

# COHRA1 vs COHRA
# COHRA2 vs COHRA

# EGLE vs EAGLE
# GHS-II vs GHS-I

# B-PROOF vs BPROOF
gwas_study_info |>
  filter(grepl("\\bB-PROOF\\b", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE    JOURNAL
                  <IDat>     <int>       <char>     <IDat>     <char>
1:            2024-10-07  39103364       Went M 2024-08-05 Nat Commun
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/39103364
                                                                            STUDY
                                                                           <char>
1: Deciphering the genetics and mechanisms of predisposition to multiple myeloma.
      DISEASE/TRAIT
             <char>
1: Multiple myeloma
                                                  INITIAL_SAMPLE_SIZE
                                                               <char>
1: 10,906 European ancestry cases, 366,221 European ancestry controls
   REPLICATION_SAMPLE_SIZE PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                    <char>                     <char>             <int>
1:                    <NA>     NR [8100000] (imputed)                35
       MAPPED_TRAIT                     MAPPED_TRAIT_URI STUDY_ACCESSION
             <char>                               <char>          <char>
1: multiple myeloma http://www.ebi.ac.uk/efo/EFO_0001378    GCST90451657
          GENOTYPING_TECHNOLOGY SUBMISSION_DATE STATISTICAL_MODEL
                         <char>          <lgcl>            <lgcl>
1: Genome-wide genotyping array              NA                NA
   BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT_URI
             <lgcl>                  <char>                      <char>
1:               NA                                                    
                                                                      COHORT
                                                                      <char>
1: SNMB|B58C|NBBS|GMMG|HNR|DBDS|MRC|PRACTICAL|BCAC|CGEMS|deCODE|UKBB|B-PROOF
   FULL_SUMMARY_STATISTICS SUMMARY_STATS_LOCATION    GXE
                    <char>                 <char> <char>
1:                      no                   <NA>     no

gwas_study_info |>
  filter(grepl("\\bBPROOF\\b", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE    JOURNAL
                  <IDat>     <int>       <char>     <IDat>     <char>
1:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
2:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
3:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
4:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
5:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
6:            2021-02-18  33510174      Jones G 2021-01-28 Nat Commun
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/33510174
2: www.ncbi.nlm.nih.gov/pubmed/33510174
3: www.ncbi.nlm.nih.gov/pubmed/33510174
4: www.ncbi.nlm.nih.gov/pubmed/33510174
5: www.ncbi.nlm.nih.gov/pubmed/33510174
6: www.ncbi.nlm.nih.gov/pubmed/33510174
                                                                                                    STUDY
                                                                                                   <char>
1: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
2: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
3: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
4: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
5: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
6: Genome-wide meta-analysis of muscle weakness identifies 15 susceptibility loci in older men and women.
                                          DISEASE/TRAIT
                                                 <char>
1: Low hand grip strength (60 years and older) (EWGSOP)
2: Low hand grip strength (60 years and older) (EWGSOP)
3: Low hand grip strength (60 years and older) (EWGSOP)
4:   Low hand grip strength (60 years and older) (FNIH)
5:   Low hand grip strength (60 years and older) (FNIH)
6:   Low hand grip strength (60 years and older) (FNIH)
                                                                INITIAL_SAMPLE_SIZE
                                                                             <char>
1:               48,596 European ancestry cases, 207,927 European ancestry controls
2: 34,589 European ancestry female cases, 100,879 European ancestry female controls
3:     14,007 European ancestry male cases, 107,048 European ancestry male controls
4:               20,335 European ancestry cases, 236,188 European ancestry controls
5: 13,601 European ancestry female cases, 121,867 European ancestry female controls
6:      6,734 European ancestry male cases, 114,321 European ancestry male controls
   REPLICATION_SAMPLE_SIZE               PLATFORM_[SNPS_PASSING_QC]
                    <char>                                   <char>
1:                    <NA> Affymetrix, Illumina [9457422] (imputed)
2:                    <NA> Affymetrix, Illumina [9449805] (imputed)
3:                    <NA> Affymetrix, Illumina [9464541] (imputed)
4:                    <NA> Affymetrix, Illumina [9465622] (imputed)
5:                    <NA> Affymetrix, Illumina [9431325] (imputed)
6:                    <NA> Affymetrix, Illumina [9471905] (imputed)
   ASSOCIATION_COUNT              MAPPED_TRAIT
               <int>                    <char>
1:                15 grip strength measurement
2:                 8 grip strength measurement
3:                 3 grip strength measurement
4:                 5 grip strength measurement
5:                 0 grip strength measurement
6:                 0 grip strength measurement
                       MAPPED_TRAIT_URI STUDY_ACCESSION
                                 <char>          <char>
1: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007526
2: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007527
3: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007528
4: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007529
5: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007530
6: http://www.ebi.ac.uk/efo/EFO_0006941    GCST90007531
          GENOTYPING_TECHNOLOGY SUBMISSION_DATE STATISTICAL_MODEL
                         <char>          <lgcl>            <lgcl>
1: Genome-wide genotyping array              NA                NA
2: Genome-wide genotyping array              NA                NA
3: Genome-wide genotyping array              NA                NA
4: Genome-wide genotyping array              NA                NA
5: Genome-wide genotyping array              NA                NA
6: Genome-wide genotyping array              NA                NA
   BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT_URI
             <lgcl>                  <char>                      <char>
1:               NA                                                    
2:               NA                                                    
3:               NA                                                    
4:               NA                                                    
5:               NA                                                    
6:               NA                                                    
                                                                                                                                                                 COHORT
                                                                                                                                                                 <char>
1: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
2: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
3: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
4: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
5: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
6: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP|ROSMAP|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS
   FULL_SUMMARY_STATISTICS
                    <char>
1:                     yes
2:                     yes
3:                     yes
4:                     yes
5:                     yes
6:                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007526
2: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007527
3: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007528
4: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007529
5: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007530
6: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007531
      GXE
   <char>
1:     no
2:     no
3:     no
4:     no
5:     no
6:     no

# WAMHS WASHS

# CALGB 40502 CALGB 40503

# matches <- which(dist_matrix > 0 & dist_matrix <= threshold, arr.ind = TRUE)
# matches <- data.frame(
#   cohort1 = single_use_cohorts[matches[,1]],
#   cohort2 = cohorts[matches[,2]],
#   distance = dist_matrix[matches]
# )
# 
# 
# matches |>
#   arrange(distance) |>
#   head()
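
The commented-out matching step above presupposes a `dist_matrix` with one row per single-use cohort and one column per cohort. A minimal sketch of how such a matrix could be built and filtered follows. It assumes an edit (Levenshtein) distance and uses base R `adist()` as a stand-in, since the distance actually used is not shown here. The toy labels and the `threshold` value are illustrative only.

```r
# Toy labels, loosely based on the pairs noted in the comments above
single_use_cohorts <- c("WAMHS", "CALGB 40502")
cohorts <- c("WAMHS", "WASHS", "CALGB 40502", "CALGB 40503", "UKB")

# Assumption: generalized Levenshtein distance via base R (utils::adist);
# rows follow single_use_cohorts, columns follow cohorts
dist_matrix <- adist(single_use_cohorts, cohorts)

threshold <- 2  # hypothetical cutoff, not taken from the analysis

# Keep near matches only: distance 0 (identical labels) is excluded
matches <- which(dist_matrix > 0 & dist_matrix <= threshold, arr.ind = TRUE)
matches <- data.frame(
  cohort1 = single_use_cohorts[matches[, 1]],
  cohort2 = cohorts[matches[, 2]],
  distance = dist_matrix[matches]
)

# Closest candidates first
matches[order(matches$distance), ]
```

Both toy pairs come back at distance 1, which is why near matches found this way are candidates for manual review rather than automatic merges.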

8 Specific studies

8.1 Mount Sinai -> BioMe (34604815)

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = ifelse(PUBMED_ID == 34604815,
                         str_replace_all(COHORT, "Mount Sinai", "BioMe"),
                         COHORT
  )
  )

8.2 BHS_b -> BHS (40181193)

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = ifelse(PUBMED_ID == 40181193,
                         str_replace_all(COHORT, "BHS_b", "BHS"),
                         COHORT
  )
  )

8.3 B-PROOF (39103364) -> BPROOF

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "B-PROOF", "BPROOF")
  )

8.4 other -> AGORA (Pubmed id: 36551779)

# replace other with: AGORA
# for PUBMED_ID 36551779

gwas_study_info = 
  gwas_study_info |>
  mutate(COHORT = ifelse(PUBMED_ID == 36551779,
                         str_replace_all(COHORT, "other", "AGORA"),
                         COHORT
  )
  )

8.5 Estonian Biobank (34791234)

# if the first author is the COVID-19 Host Genetics Initiative,
# replace Estonia with: EB
gwas_study_info = 
  gwas_study_info |>
  mutate(COHORT = ifelse(FIRST_AUTHOR == "COVID-19 Host Genetics Initiative",
                         str_replace_all(COHORT, "Estonia", "EB"),
                         COHORT
  )
  )


# then pubmed id: 34791234
# replace Estonia with EB
gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = ifelse(PUBMED_ID == 34791234,
                         str_replace_all(COHORT, "Estonia", "EB"),
                         COHORT
  )
  )
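
One caveat with the replacements in this section: `str_replace_all()` treats the pattern as a regular expression and matches substrings, so a short label such as "Estonia" or "other" would also be rewritten inside a longer cohort name. A minimal sketch of a token-wise alternative is below. `replace_cohort_token()` is a hypothetical helper and the toy labels are made up; base R `gsub()` is used only to show the same substring behaviour.

```r
# Hypothetical helper: swap a cohort label only when it equals a whole
# "|"-delimited token, never when it is part of a longer name.
replace_cohort_token <- function(cohort, old, new) {
  vapply(
    strsplit(cohort, "|", fixed = TRUE),
    function(tokens) {
      tokens[tokens %in% old] <- new
      paste(tokens, collapse = "|")
    },
    character(1)
  )
}

# Toy input (illustrative, not taken from the GWAS Catalog)
toy <- c("Estonia|UKB", "Estonian Biobank|FinnGen", "")

# Substring replacement also rewrites part of "Estonian Biobank"
substring_result <- gsub("Estonia", "EB", toy)

# Token-wise replacement leaves longer names untouched
token_result <- replace_cohort_token(toy, "Estonia", "EB")

substring_result
token_result
```

Here the substring version yields "EBn Biobank|FinnGen" for the second label, while the token-wise version leaves it unchanged.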

8.6 The ""European NAFLD Registry"" Metacohort -> European NAFLD Registry Metacohort (32298765)

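# PUBMED_ID = 32298765 (European NAFLD Registry Metacohort)
# rows with an empty COHORT are given the cohort list of the "initial" STAGE:
# COHORT = European NAFLD Registry Metacohort|WTCCC|HYPERGENES|KORA|Understanding Society
# (the condition below tests only COHORT and PUBMED_ID, not STAGE)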
gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = ifelse(COHORT == "" & PUBMED_ID == "32298765",
                       "European NAFLD Registry Metacohort|WTCCC|HYPERGENES|KORA|Understanding Society",
                       COHORT
  )
  )

# PUBMED_ID = 32298765
# if STAGE = "replication"
# then set:
# COHORT = European NAFLD Registry Metacohort
# gwas_study_info =
#   gwas_study_info |>
#   mutate(COHORT = ifelse(COHORT == "" &
#                          PUBMED_ID == "32298765" &
#                          STAGE == "replication",
#                        "European NAFLD Registry Metacohort",
#                        COHORT
#   )
#   )

gwas_study_info =
  gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'The ""European NAFLD Registry"" Metacohort',
                                  'European NAFLD Registry Metacohort')) |>
  mutate(COHORT = str_replace_all(COHORT,
                                  'The "European NAFLD Registry" Metacohort',
                                  'European NAFLD Registry Metacohort'))
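
The two replacements above differ only in whether the quotation marks are doubled. As a sketch, a single regular expression with `"+` (one or more quote characters) could cover both spellings. Base R `gsub()` is used here so the example runs without extra packages, and the toy labels are illustrative.

```r
# Toy labels showing both quoting variants plus an unrelated cohort
labels <- c(
  'The ""European NAFLD Registry"" Metacohort|WTCCC',
  'The "European NAFLD Registry" Metacohort',
  'KORA'
)

# "+ matches one or more double quotes, so single and doubled quotes
# are handled by the same pattern
cleaned <- gsub(
  'The "+European NAFLD Registry"+ Metacohort',
  'European NAFLD Registry Metacohort',
  labels
)

cleaned
```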

8.7 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
all_cohorts = all_cohorts[all_cohorts != ""]
unique(all_cohorts) |> length()

[1] 1068

9 Any cohorts missing from data-dictionary?

cohort_names_df <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx")) |>
  mutate(across(everything(), 
                ~stringr::str_replace_all(.x,
                                          pattern = "\u00A0",
                                          replacement = " "))) 

# all cohort names:
all_cohort_names <- 
  c(cohort_names_df$cohort,
    cohort_names_df$full_name,
    cohort_names_df$synyoms)

all_cohort_names <- unique(all_cohort_names)

length(all_cohort_names)

[1] 1545

gwas_cat_names <- unique(all_cohorts)

not_found_names <- gwas_cat_names[!(gwas_cat_names %in% all_cohort_names)]

checked_all_cohort_names <- tolower(stringr::str_trim(all_cohort_names))

found_with_edits <- gwas_cat_names[(tolower(gwas_cat_names) %in% checked_all_cohort_names)]

# names that match only when case (and whitespace padding in the dictionary) is ignored:
not_found_names[not_found_names %in% found_with_edits]

[1] "Croatia"  "Nagahama"
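
To see which dictionary spelling each of these near misses corresponds to, the normalised forms can be matched back to the original entries. The following is a minimal sketch on toy data. The dictionary entries shown are made up for illustration, and `normalise()` is a hypothetical helper.

```r
# Toy data (made up; not the real data dictionary)
dictionary_names <- c("CROATIA", "Nagahama ", "UKB")
catalog_names <- c("Croatia", "nagahama", "UKB", "WRAP")

# Hypothetical helper: trim surrounding whitespace, then lower-case
normalise <- function(x) tolower(trimws(x))

# match() gives the position of the first dictionary entry with the same
# normalised form, or NA when no entry matches
idx <- match(normalise(catalog_names), normalise(dictionary_names))

lookup <- data.frame(
  catalog_name = catalog_names,
  dictionary_name = dictionary_names[idx],
  exact_match = catalog_names %in% dictionary_names
)

lookup
```

Rows with a non-missing `dictionary_name` but `exact_match == FALSE` are the ones that differ only by case or padding.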

print("Number of cohorts not (yet) included in data-dictionary")

[1] "Number of cohorts not (yet) included in data-dictionary"

length(not_found_names)

[1] 627

print("Most used cohorts that are not included in the data-dictionary")

[1] "Most used cohorts that are not included in the data-dictionary"

data.frame(cohort_name = all_cohorts[all_cohorts %in% not_found_names]) |>
  group_by(cohort_name) |>
  summarise(n = n()) |>
  arrange(desc(n)) |>
  head()

# A tibble: 6 × 2
  cohort_name     n
  <chr>       <int>
1 WRAP          441
2 AMISH         406
3 LBC           142
4 HBCS          138
5 RBC-Omics     131
6 HELIOS        125

10 Saving:

# cohorts used per paper:


gwas_study_info =
  gwas_study_info |>
  select(PUBMED_ID,
         DATE,
         COHORT) |>
  distinct()

gwas_study_info =
  gwas_study_info |>
  group_by(PUBMED_ID) |>
  # split every "|"-delimited label within a paper, drop empty strings,
  # and keep each cohort once (setdiff() also de-duplicates)
  summarise(COHORT = str_flatten(
              setdiff(unlist(str_split(string = COHORT,
                                       pattern = "\\|")),
                      ""),
              collapse = "|",
              na.rm = TRUE),
            DATE = unique(DATE))
  
data.table::fwrite(gwas_study_info,
                  here::here("output/gwas_cohorts/gwas_cohort_name_corrected.csv"), 
                  sep = ",")
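
A toy illustration of the intended per-paper result may help when checking the saved file: every "|"-delimited label across a paper's rows is split, empty strings are dropped, and each cohort is kept once. The sketch below uses base R only; `collapse_cohorts()` is a hypothetical helper and the data are made up.

```r
# Toy study table (made up for illustration)
toy <- data.frame(
  PUBMED_ID = c(1, 1, 2, 3),
  COHORT = c("UKB|FHS", "FHS|ARIC", "", "UKB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: union of all cohort labels in a vector of
# "|"-delimited strings, returned as one "|"-delimited string
collapse_cohorts <- function(x) {
  parts <- unlist(strsplit(x, "|", fixed = TRUE))
  paste(unique(parts[parts != ""]), collapse = "|")
}

# One collapsed label per PUBMED_ID
per_paper <- vapply(
  split(toy$COHORT, toy$PUBMED_ID),
  collapse_cohorts,
  character(1)
)

per_paper
```

Paper 1 ends up with "UKB|FHS|ARIC", paper 2 with an empty string, and paper 3 with "UKB".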

11 Others to look into:

# in the study below, the unlisted cohort is a combination of two cohorts
gwas_study_info |>
  filter(PUBMED_ID  == 32605384) |>
  select(PUBMED_ID, 
         COHORT, 
         STUDY_ACCESSION, 
         "DISEASE/TRAIT", 
         "INITIAL_SAMPLE_SIZE", 
         "REPLICATION_SAMPLE_SIZE")


gwas_study_info |>
  filter(PUBMED_ID == 30510241) |>
  select(PUBMED_ID, 
         COHORT, 
         STUDY_ACCESSION, 
         "DISEASE/TRAIT", 
         "INITIAL_SAMPLE_SIZE", 
         "REPLICATION_SAMPLE_SIZE")
# the supplement shows this is made up of many studies; I believe it includes all the other subsamples


gwas_study_info |>
  filter(PUBMED_ID == 33307546) |>
  select(PUBMED_ID, 
         COHORT, 
         STUDY_ACCESSION, 
         "DISEASE/TRAIT", 
         "INITIAL_SAMPLE_SIZE", 
         "REPLICATION_SAMPLE_SIZE")

# COVID-19 Host Genetics Initiative (HGI): I believe this refers to the Hispanic individuals;
# European ancestry individuals are from the ‘broad respiratory phenotype’ study of 23andMe
# See replication section of https://www.nature.com/articles/s41586-020-03065-y#Sec4
gwas_study_info |>
  filter(PUBMED_ID == 38184787) |> 
  select(PUBMED_ID, COHORT, STUDY_ACCESSION, 
         "DISEASE/TRAIT", 
         "INITIAL_SAMPLE_SIZE", 
         "REPLICATION_SAMPLE_SIZE")

# cohorts listed are for

11.1 MAYO

Mayo Clinic Bipolar Biobank (STUDY_ACCESSION: GCST90554822)

MAYO-Clinic RGC Project Generation. (PUBMED_ID: 37949852)

Mayo Clinic (PUBMED_ID: 40050615)

Mayo-VDB|

# Stanford_ADRC

# CROATIA

# Raine Study -- ? Raine

# Penn - UPenn etc.

# ?CALGB  
# "SIGNET-REGARDS"  >? "SIGNET"  



# "RISC" & "RISK" appear to be different
# Relationship Between Insulin Sensitivity and Cardiovascular Disease Risk (RISC)

# Risk Stratification and Identification of Immunogenetic and Microbial Markers of Rapid Disease Progression in Children with Crohn’s Disease (RISK) 



 "CKB" 

 [231] "COHRA"
 [232] "COHRA1"
 [233] "COHRA2"

# UK Blood Service (UKBS)

 [294] "DiscovEHR"
 [295] "DISCOVeRY-BMT"

 [330] "ELSA"
 [331] "ELSA-Brasil"

 [340] "EPIC"
 [341] "EPIC_CAD"
 [342] "EPIC_Obs"
 [343] "EPIC-Norfolk"
 [344] "EPICURE"

 [372] "FinnTwin"
 [373] "FinnTwin12"

 [463] "GOCS"
 [464] "GOCS_Chilean"

 [480] "GRAAD"
 [481] "GRaD"

# Colo2&3

 [513] "HELIC"
 [514] "HELIC-MANOLIS"
 [515] "HELIC-Pomak"

# ? QTR == QTR_Qindao

# ? is "other|UKB" == "UKB|other"

# ? is UK|NR == UKB|NR

# ? CF_TSS == TSS
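
The question of whether "other|UKB" and "UKB|other" denote the same cohort set can be checked mechanically by putting the tokens into a canonical order before comparing. A minimal sketch follows; `canonical_cohort_set()` is a hypothetical helper, and `method = "radix"` is used so the ordering does not depend on the session locale.

```r
# Hypothetical helper: sort and de-duplicate the "|"-delimited tokens so
# that labels listing the same cohorts in a different order become identical
canonical_cohort_set <- function(cohort) {
  vapply(
    strsplit(cohort, "|", fixed = TRUE),
    function(tokens) {
      paste(sort(unique(tokens), method = "radix"), collapse = "|")
    },
    character(1)
  )
}

canon <- canonical_cohort_set(c("other|UKB", "UKB|other", "UK|NR"))

# TRUE: same cohorts, different order
canon[1] == canon[2]

# FALSE: "UK" and "UKB" remain different tokens and still need manual review
canon[1] == canon[3]
```

This settles ordering differences only. Whether "UK" is a truncated "UKB" remains a curation decision.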

gwas_study_info |>
  filter(grepl("ORPS", COHORT))

gwas_study_info |>
  filter(PUBMED_ID == 39749473) |>
  select(COHORT)

# PAGE vs PAGES

# PUBMED ID: 35754128 - should be PAGES
# see sup table 1. https://pmc.ncbi.nlm.nih.gov/articles/PMC9671132/

# COGEND COGENT

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringdist_0.9.15 stringr_1.5.2     ggplot2_3.5.2     dplyr_1.1.4      
[5] data.table_1.17.8 workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] sass_0.4.10        utf8_1.2.6         generics_0.1.4     renv_1.0.3        
 [5] stringi_1.8.7      digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.3.1         RColorBrewer_1.1-3 fastmap_1.2.0      cellranger_1.1.0  
[13] rprojroot_2.1.0    jsonlite_2.0.0     processx_3.8.6     whisker_0.4.1     
[17] ps_1.9.1           promises_1.3.3     httr_1.4.7         scales_1.4.0      
[21] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        withr_3.0.2       
[25] cachem_1.1.0       yaml_2.3.10        tools_4.3.1        parallel_4.3.1    
[29] httpuv_1.6.16      here_1.0.1         vctrs_0.6.5        R6_2.6.1          
[33] lifecycle_1.0.4    git2r_0.36.2       fs_1.6.6           pkgconfig_2.0.3   
[37] callr_3.7.6        pillar_1.11.1      bslib_0.9.0        later_1.4.4       
[41] gtable_0.3.6       glue_1.8.0         Rcpp_1.1.0         xfun_0.53         
[45] tibble_3.3.0       tidyselect_1.2.1   rstudioapi_0.17.1  knitr_1.50        
[49] farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.30     compiler_4.3.1    
[53] getPass_0.2-4      readxl_1.4.5

Harmonizing Cohort Labels

Isobel Beasley