Last updated: 2025-08-21

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20220216)

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ac13d70

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ac13d70. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/gwas_catalog/
    Ignored:    output/gwas_study_info_cohort_corrected.csv

Untracked files:
    Untracked:  analysis/cohort_dist.Rmd
    Untracked:  analysis/collapse_traits.Rmd
    Untracked:  analysis/missing_cohort_info.Rmd
    Untracked:  data/.DS_Store
    Untracked:  renv/

Unstaged changes:
    Modified:   .Rprofile
    Modified:   analysis/collapse_cohorts.Rmd
    Modified:   code/collapse_diseases.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/correcting_cohort_names.Rmd) and HTML (docs/correcting_cohort_names.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	ac13d70	IJbeasley	2025-08-21	Updating correcting cohort labels
html	6c592b7	IJbeasley	2025-08-20	Build site.
Rmd	1969e6b	IJbeasley	2025-08-20	More corrections / harmonisation of cohort names in gwas catalog

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)

1 Load / pre-process GWAS Catalog data

# Load GWAS Catalog studies
gwas_study_info <- fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-studies-r2025-07-21.tsv"),
                         sep = "\t", quote = "")

# Standardize column names (remove spaces)
gwas_study_info <- gwas_study_info |>
  rename_all(~gsub(" ", "_", .x))

# Remove spaces around the "|" separator between cohort names
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, " \\| ", "|")) |>
  mutate(COHORT = str_replace_all(COHORT, "\\| ", "|")) |>
  mutate(COHORT = str_replace_all(COHORT, " \\|", "|"))
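
The three replacements above strip a space on either side of the pipe separator. As a sketch only (not the code that produced these results), the same normalisation could be written as one pattern. The helper name normalise_pipes is hypothetical, and `\\s*` is slightly broader than the original because it also removes tabs and repeated spaces.

```r
library(stringr)

# Hypothetical helper: remove any whitespace around "|" separators.
normalise_pipes <- function(cohort) {
  str_replace_all(cohort, "\\s*\\|\\s*", "|")
}

normalise_pipes(c("UKB | FinnGen", "EB| deCODE", "HUNT |MVP"))
# [1] "UKB|FinnGen" "EB|deCODE"   "HUNT|MVP"
```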

2 Accounting for some discrepancies in cohort names across studies

2.1 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique(all_cohorts) |> length()

[1] 1183
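
This check is repeated after each round of corrections, so it could be wrapped in a small function. The sketch below is an illustration under assumptions: count_unique_cohorts is a hypothetical name, and unlike the chunk above it drops NA and empty entries before counting, so its result may differ from the reported counts if COHORT contains missing values.

```r
# Hypothetical helper: count distinct cohort names in a "|"-separated column.
count_unique_cohorts <- function(cohort) {
  cohorts <- unlist(strsplit(cohort, "|", fixed = TRUE))
  # Assumption: missing and empty entries should not count as cohorts
  cohorts <- cohorts[!is.na(cohorts) & cohorts != ""]
  length(unique(cohorts))
}

count_unique_cohorts(c("UKB|FinnGen", "FinnGen|EB", NA, ""))
# [1] 3
```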

2.2 Accounting for some discrepancies in cohort names across studies

# Correct for discrepancies within same paper
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AWI-Gen", "AWI-GEN")) |> # PUBMED ID :40229280
  mutate(COHORT = str_replace_all(COHORT, "AddHealth", "Add Health")) |> # PUBMED ID: 37494057
  mutate(COHORT = str_replace_all(COHORT, fixed("EB|FinnGen|UKBB"), "EB|FinnGen|UKB")) |> # 39067062
  mutate(COHORT = str_replace_all(COHORT, "Estonian Biobank", "EB")) |> # PUBMED ID: 39500877
  mutate(COHORT = str_replace_all(COHORT, "AWIGEN", "AWI-GEN"))  # 40229280

# Make "other" consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Other", "other")) |> 
  mutate(COHORT = str_replace_all(COHORT, "OTHER", "other")) |>
  mutate(COHORT = str_replace_all(COHORT, "others", "other")) 

# Make the "multiple" designation consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, 
                                   "(\\(Multiple cohorts\\))|(\\(multiple\\))|Multiple",
                                  "multiple")) 


# some use commas instead of | to designate multiple cohorts
gwas_study_info <- gwas_study_info |>
    mutate(COHORT = str_replace_all(COHORT, ", ", "|")) 

# Makes TwinsUK consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "TWINS-UK|TWINSUK", "TwinsUK")) 

# Make epic norfolk consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "EPIC-Norfolk cohort", "EPIC-Norfolk")) 

# Make emerge consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "EMERGE", "eMERGE")) 

# Make twingene consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "TWINGENE", "TwinGene"))

# Make QSkin consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "QSkin|Qskin", "QSKIN")) 

# Make 23andme consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "23ANDME", "23andMe")) 

# Make PopGen consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "PopGen", "POPGEN")) 

# Make decode consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "DECODE|deCode|DeCODE", "deCODE"))

# Make FinnGen consistent
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Finngen|FINNGEN", "FinnGen"))

gwas_study_info <- gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "genomicc", "GenOMICC")) |>
  mutate(COHORT = str_replace_all(COHORT, "IPSYCH", "iPSYCH")) |>
  mutate(COHORT = str_replace_all(COHORT, "SIMES", "SiMES")) |>
  mutate(COHORT = str_replace_all(COHORT, "HELIX", "Helix")) 

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "FINLAND", "Finland"))
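
The one-rule-per-chunk corrections above could also be kept in a single lookup. The sketch below relies on str_replace_all() accepting a named character vector, where each name is a regular expression and each value its replacement, applied in order. The vector name cohort_fixes and the toy input are illustrative only, and only a few of the rules above are reproduced.

```r
library(stringr)

# Illustrative subset of the rules above: names are regex patterns,
# values are the harmonised spellings.
cohort_fixes <- c(
  "TWINS-UK|TWINSUK" = "TwinsUK",
  "EMERGE"           = "eMERGE",
  "Finngen|FINNGEN"  = "FinnGen"
)

x <- c("TWINSUK|FINNGEN", "EMERGE|TwinsUK")
str_replace_all(x, cohort_fixes)
# [1] "TwinsUK|FinnGen" "eMERGE|TwinsUK"
```

Keeping the rules in one vector makes it easier to review them together, at the cost of losing the per-rule PUBMED ID comments unless they are kept alongside the vector.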

2.3 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique(all_cohorts) |> length()

[1] 1153
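
Most of the corrections so far concern names that differ only in capitalisation. A possible way to list any remaining candidates is to group the cohort names by their lower-cased form. This is a sketch: find_case_variants is a hypothetical helper, and it would be called on all_cohorts as built above.

```r
# Hypothetical helper: return groups of names that are identical
# except for capitalisation.
find_case_variants <- function(cohorts) {
  cohorts <- unique(cohorts[!is.na(cohorts)])
  groups <- split(cohorts, tolower(cohorts))
  groups[lengths(groups) > 1]
}

variants <- find_case_variants(
  c("FinnGen", "FINNGEN", "UKB", "deCODE", "DECODE", "Finngen")
)
names(variants)
# [1] "decode"  "finngen"
```

Each returned group still needs checking against the papers, since two cohorts can share a spelling without being the same study.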

3 Correcting for cardiogram cohort meta-analyses

# CARDIoGRAMplusC4D cohort includes both CARDIoGRAM and C4D cohorts
# see: https://cardiogramplusc4d.org/data-downloads/
# for coding, therefore, we change this to CARDIoGRAM|C4D
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "CARDIoGRAMplusC4D", "CARDIoGRAM|C4D"))
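
Splitting one label into two can leave a cohort listed twice in the same study if CARDIoGRAM or C4D was already present. Whether that happens in this dataset has not been checked here; the sketch below only shows how duplicates within a "|"-separated entry could be removed. The helper name dedupe_cohorts is hypothetical.

```r
# Hypothetical helper: drop repeated cohort names within each entry,
# keeping the first occurrence and preserving NA.
dedupe_cohorts <- function(cohort) {
  vapply(strsplit(cohort, "|", fixed = TRUE), function(tokens) {
    if (anyNA(tokens)) return(NA_character_)
    paste(unique(tokens), collapse = "|")
  }, character(1))
}

dedupe_cohorts(c("CARDIoGRAM|CARDIoGRAM|C4D", "UKB", NA))
# [1] "CARDIoGRAM|C4D" "UKB"            NA
```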

4 Correcting for UK Biobank naming differences …

all_cohorts[grep("ukb", tolower(all_cohorts))] |> unique()

[1] "UKB"                "UKBB"               "UKBB White British"
[4] "UKBS"               "UKB-PPP"

gwas_study_info |>
  filter(grepl("UKBS", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  37653029
2:  34127860
                                                                                                                                                                                        COHORT
                                                                                                                                                                                        <char>
1:                                                                                                                                other|GenEPA|CHOP|EPICURE|HBCS|KORA|ILM|PoBI|POPGEN|TSS|UKBS
2: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

# for PUBMED_ID: 37653029
# UKBS seems to be UK Biobank

# for pubmed id: 34127860
# UKBS is UK Blood Service (UKBS)

gwas_study_info |>
  filter(grepl("UKB-PPP", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID  COHORT
       <int>  <char>
1:  37794183 UKB-PPP

# PUBMED_ID 37794183 is the UK Biobank Pharma Proteomics Project (UKB-PPP)
 
gwas_study_info <- gwas_study_info |>
  mutate(COHORT = ifelse(COHORT == "UKB-PPP", "UKB", COHORT)) |>
  mutate(COHORT = str_replace_all(COHORT, "UKBB White British", "UKB")) |>
  mutate(COHORT = gsub("\\bUKB\\b", "UKBB", COHORT))
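
The chunk above mixes a whole-field comparison (COHORT == "UKB-PPP") with substring replacements. An alternative, sketched here under assumptions, is to recode at the level of individual cohort names: split each entry on "|", look each name up in a table, and rejoin. The helper recode_cohort_tokens and the lookup ukb_lookup are hypothetical, and the lookup only covers the UK Biobank spellings discussed above. UKBS is deliberately left alone because it is ambiguous.

```r
# Hypothetical helper: replace whole cohort names using a named lookup vector.
recode_cohort_tokens <- function(cohort, lookup) {
  vapply(strsplit(cohort, "|", fixed = TRUE), function(tokens) {
    if (anyNA(tokens)) return(NA_character_)
    hit <- tokens %in% names(lookup)
    tokens[hit] <- unname(lookup[tokens[hit]])
    paste(tokens, collapse = "|")
  }, character(1))
}

# Assumed mapping, following the corrections above
ukb_lookup <- c(
  "UKB"                = "UKBB",
  "UKB-PPP"            = "UKBB",
  "UKBB White British" = "UKBB"
)

recode_cohort_tokens(c("UKB|FinnGen", "UKB-PPP", "UKBS|UKB", NA), ukb_lookup)
# [1] "UKBB|FinnGen" "UKBB"         "UKBS|UKBB"    NA
```

Matching whole names avoids accidental edits inside longer names, for example a word-boundary pattern for UKB also matching the start of UKB-PPP.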

4.1 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique(all_cohorts) |> length()

[1] 1150

5 Correcting for cohort naming differences across studies (Confirmed by checking papers)

5.1 NIHR BioResource

# seems NIHR Cambridge BioResource & NIHR BIORESOURCE are the same
# https://www.cambridgebioresource.group.cam.ac.uk/ 

gwas_study_info |>
  filter(grepl("NIHR Cambridge BioResource", COHORT)) |>
  select(PUBMED_ID, COHORT)

   PUBMED_ID
       <int>
1:  34127860
2:  34127860
                                                                                                                                                                                        COHORT
                                                                                                                                                                                        <char>
1: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap
2: BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

gwas_study_info |>
    filter(grepl("NIHR BIORESOURCE", COHORT)) |>
    select(PUBMED_ID, COHORT) |> 
    distinct()

   PUBMED_ID
       <int>
1:  39891803
2:  40205036
                                                                                                                                                                                                                                                     COHORT
                                                                                                                                                                                                                                                     <char>
1:                                                                                                                                                                                                                      UKBB|CHARGE|ALSPAC|NIHR BIORESOURCE
2: arcOGEN|ARGO|UKHLS|China Kadoorie Biobank|deCODE|CHB|DBDS|eMERGE|EB|FinnGen|MyCode|GS:SFHS|HRS|HKDDDPC|HUNT|Bunkyo|HerediGene|RIKEN|Shimane-CoHRE|JOCO|LifeLines|NEO|NHS|MGBB|QIMR|RS|SHIP|SIMPLER|ToMMo|TwinsUK|UKBB|BioMe|G&H|NIHR BIORESOURCE|MVP|OAI

gwas_study_info |>
    filter(grepl(tolower("BIORESOURCE"), tolower(COHORT))) |>
    select(PUBMED_ID, COHORT) |>
    distinct()

   PUBMED_ID
       <int>
1:  39891803
2:  40205036
3:  34127860
                                                                                                                                                                                                                                                     COHORT
                                                                                                                                                                                                                                                     <char>
1:                                                                                                                                                                                                                      UKBB|CHARGE|ALSPAC|NIHR BIORESOURCE
2: arcOGEN|ARGO|UKHLS|China Kadoorie Biobank|deCODE|CHB|DBDS|eMERGE|EB|FinnGen|MyCode|GS:SFHS|HRS|HKDDDPC|HUNT|Bunkyo|HerediGene|RIKEN|Shimane-CoHRE|JOCO|LifeLines|NEO|NHS|MGBB|QIMR|RS|SHIP|SIMPLER|ToMMo|TwinsUK|UKBB|BioMe|G&H|NIHR BIORESOURCE|MVP|OAI
3:                                                              BC58|BDA|NIHR Cambridge BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "NIHR Cambridge BioResource|NIHR BIORESOURCE" , "NIHR BioResource"))
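
The two spellings differ in capitalisation and in the optional word "Cambridge", so they could also be matched by one case-insensitive pattern. This is a sketch on toy input, not the code used for the results, and it is broader than the replacement above because it matches any capitalisation.

```r
library(stringr)

x <- c("BDA|NIHR Cambridge BioResource|GRID", "ALSPAC|NIHR BIORESOURCE")

# regex(..., ignore_case = TRUE) makes the whole pattern case-insensitive;
# "(Cambridge )?" makes that word optional.
str_replace_all(
  x,
  regex("NIHR (Cambridge )?BioResource", ignore_case = TRUE),
  "NIHR BioResource"
)
# [1] "BDA|NIHR BioResource|GRID" "ALSPAC|NIHR BioResource"
```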

5.2 Living biobank typo

# Leivin Biobank appears to be a typo for Living Biobank
# see PUBMED ID 34059833; https://pmc.ncbi.nlm.nih.gov/articles/PMC7610958/#SD1
gwas_study_info |> filter(grepl("Leivin Biobank", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE   JOURNAL
                  <IDat>     <int>       <char>     <IDat>    <char>
1:            2021-06-10  34059833       Chen J 2021-05-31 Nat Genet
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/34059833
                                                          STUDY   DISEASE/TRAIT
                                                         <char>          <char>
1: The trans-ancestral genomic architecture of glycemic traits. Fasting glucose
                      INITIAL_SAMPLE_SIZE REPLICATION_SAMPLE_SIZE
                                   <char>                  <char>
1: 35,619 East Asian ancestry individuals                    <NA>
                  PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                      <char>             <int>
1: Affymetrix, Illumina [15438438] (imputed)                15
          MAPPED_TRAIT                     MAPPED_TRAIT_URI STUDY_ACCESSION
                <char>                               <char>          <char>
1: glucose measurement http://www.ebi.ac.uk/efo/EFO_0004468    GCST90002231
                                                                               GENOTYPING_TECHNOLOGY
                                                                                              <char>
1: Genome-wide genotyping array, Targeted genotyping array [Genome-wide genotyping array|Metabochip]
   SUBMISSION_DATE STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
            <lgcl>            <lgcl>           <lgcl>                  <char>
1:              NA                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                  COHORT
                                                                                                                  <char>
1: AASC|BES|CAGE-GWAS1|CAGE|CLHNS|CHNS|KARE|Leivin Biobank|MESA|Nagahama Study|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|SBCS|SMHS
   FULL_SUMMARY_STATISTICS
                    <char>
1:                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90002001-GCST90003000/GCST90002231
      GXE
   <char>
1:     no

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Leivin Biobank", "Living Biobank"))

5.3 Ghana Prostate

gwas_study_info |>
  filter(grepl("Ghana", COHORT)) |>
  select(PUBMED_ID,COHORT) |>
  distinct()

   PUBMED_ID                                                    COHORT
       <int>                                                    <char>
1:  39358599                           MADCaP|Ghana_Prostate|PRACTICAL
2:  36872133 AAPC|ELLIPSE|Ghana|other|eMERGE|BioVU|BioMe|MVP|ProHealth

# looking at the papers, they refer to the same cohort:
# PUBMED_ID: 36872133 https://pmc.ncbi.nlm.nih.gov/articles/PMC10424812/#S9
# PUBMED_ID: 39358599 https://www.nature.com/articles/s41588-024-01931-3#Sec12

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Ghana_Prostate", "Ghana"))

5.4 Sardinia

# from reading sup table: https://pmc.ncbi.nlm.nih.gov/articles/instance/7611832/bin/EMS136340-supplement-Supplementary_Information.pdf
# for pubmed 34349265
# seems SARDINIA should be combined into SardiNIA
gwas_study_info |>
  filter(grepl("SARDINIA", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  34349265
                                                                                                                                                                                                                                   COHORT
                                                                                                                                                                                                                                   <char>
1: ALSPAC|ARIC|other|CHS|CILENTO|COLAUS|EGCUT|EPIC-Norfolk|FHS|INGI-FVG|GS:SFHS|HealthABC|HRS|INCHIANTI|InterAct|KORA|LifeLines|NEO|NHS|NTR|ORCADES|QIMR|RS|SARDINIA|SHIP|SHIP-TREND|TwinGene|TwinsUK|INGI-Val_Borbera|WGHS|WHI|BCAC|UKBB

gwas_study_info |>
  filter(grepl("SardiNIA", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

    PUBMED_ID
        <int>
 1:  36477530
 2:  36477530
 3:  36477530
 4:  36477530
 5:  36477530
 6:  36477530
 7:  36477530
 8:  36477530
 9:  36477530
10:  36477530
11:  36376304
12:  36050321
13:  36050321
14:  34718232
15:  32929287
                                                                                                                                                                                                                                      COHORT
                                                                                                                                                                                                                                      <char>
 1:                                                                      23andMe|ALSPAC|ARIC|CADD|deCODE|EGCUT|eMERGE|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|NINDS|FINRISK|AMISH|GeneSTAR|GOLDN|CHS|HVH|JHS|WGHS|WHI|GFG|other
 2:                                                                                  23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|Harvard|HRS|HUNT|METSIM|NTR|QIMR|SardiNIA|UKBB|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|WHI|other
 3:                                                     23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|PAGE|QIMR|SardiNIA|UKBB|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|CHS|HCHS|SOL|WHI|other
 4:                                                  23andMe|ALSPAC|ARIC|CADD|COGEND|COPDGene|deCODE|EGCUT|eMERGE|Harvard|HUNT|MCTFR|METSIM|NTR|SardiNIA|UKBB|NINDS|FINRISK|AMISH|CFS|ECLIPSE|GeneSTAR|GOLDN|CHS|HCHS|SOL|HVH|WGHS|WHI|other
 5:                                                                                                            23andMe|ALSPAC|ARIC|CADD|COGEND|deCODE|EGCUT|GERA|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|QIMR|SardiNIA|UKBB|FINRISK|WHI|other
 6:                23andMe|ALSPAC|ARIC|BLTS|CADD|deCODE|EGCUT|eMERGE|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|WHI|FINRISK|NINDS|BBJ|CKB|AMISH|CFS|CHS|GENSalt|GOLDN|HCHS|SOL|HVH|HyperGEN|JHS|GeneSTAR|GENOA|SARP|WGHS|other
 7:                                23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|GFG|Harvard|HRS|HUNT|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|BBJ|CKB|AMISH|CFS|ECLIPSE|GENSalt|GOLDN|HyperGEN|JHS|GeneSTAR|GENOA|other
 8:        23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|PAGE|BBJ|CKB|AMISH|CFS|CHS|ECLIPSE|GENSalt|GOLDN|HCHS|SOL|HyperGEN|JHS|GeneSTAR|GENOA|other
 9: 23andMe|ALSPAC|ARIC|BLTS|CADD|COGEND|COPDGene|deCODE|EGCUT|eMERGE|GFG|Harvard|HUNT|MCTFR|MESA|METSIM|NTR|SardiNIA|UKBB|WHI|FINRISK|NINDS|BBJ|CKB|AMISH|CFS|CHS|ECLIPSE|GENSalt|GOLDN|HCHS|SOL|HVH|HyperGEN|JHS|GeneSTAR|GENOA|WGHS|other
10:                                                                                               23andMe|ALSPAC|ARIC|CADD|COGEND|deCODE|EGCUT|GERA|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NTR|OZALC|SardiNIA|UKBB|WHI|FINRISK|BBJ|CKB|other
11:                                                                              23andMe|ALSPAC|ARIC|BLS|CADD|COGEND|COPDGene|deCODE|EGCUT|FHS|FTC|GERA|GFG|Harvard|HRS|HUNT|MCTFR|MESA|METSIM|NESCOG|FTC|NAG-FIN|NTR|QIMR|SardiNIA|UKBB|WHI
12:                       ARIC|other|BioMe|BRIGHT|CHRIS|CHS|ERF|FINCAVAS|GAPP|HCHS|SOL|HealthABC|INGI-Carlantino|INGI-FVG|Inter99|JHS|KORA|LifeLines|MESA|NEO|OOA|ORCADES|PIVUS|PREVEND|PROSPER|RS|SardiNIA|SHIP|TwinsUK|UKBB|VIKING|WHI|YFS
13:                                    ARIC|BioMe|BRIGHT|other|CHRIS|CHS|ERF|FINCAVAS|GAPP|HealthABC|INGI-Carlantino|INGI-FVG|Inter99|KORA|LifeLines|MESA|NEO|OOA|ORCADES|PIVUS|PREVEND|PROSPER|RS|SardiNIA|SHIP|TwinsUK|UKBB|VIKING|WHI|YFS
14:                                                                                                                                                                                                                                 SardiNIA
15:                                                                                                                                                                                                                                 SardiNIA

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SARDINIA", "SardiNIA"))

gwas_study_info |>
  filter(grepl("Sardinia", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  33830302
                                                                                                                                                                     COHORT
                                                                                                                                                                     <char>
1: GRID|British 1958 birth cohort|National blood service|WTCCC - Bipolar disease cases|Oxford Regional Prospective Study of Childhood Diabetes (ORPS)|Sardinia case-control

# not sure about case control Sardinia ... 
# see second sup table from https://pmc.ncbi.nlm.nih.gov/articles/PMC8099827/#_ad93_
# Sardinia

5.5 Odd naming convention in PUBMED_ID 32949544

The COHORT field for this study seems to list ancestry groups rather than cohorts (e.g. UKBB is used in this study but is not named).

see cohort information here: https://pmc.ncbi.nlm.nih.gov/articles/instance/8220892/bin/NIHMS1709432-supplement-Supp_Materials.pdf

gwas_study_info |>
  dplyr::filter(PUBMED_ID == 32949544)

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE       JOURNAL
                  <IDat>     <int>       <char>     <IDat>        <char>
1:            2020-10-01  32949544      Jones E 2020-09-16 Lancet Neurol
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/32949544
                                                                                                                            STUDY
                                                                                                                           <char>
1: Identification of novel risk loci and causal insights for sporadic Creutzfeldt-Jakob disease: a genome-wide association study.
                          DISEASE/TRAIT
                                 <char>
1: Creutzfeldt-Jakob disease (sporadic)
                                                INITIAL_SAMPLE_SIZE
                                                             <char>
1: 4,110 European ancestry cases, 13,569 European ancestry controls
                                              REPLICATION_SAMPLE_SIZE
                                                               <char>
1: 1,098 European ancestry cases, 498 ,016 European ancestry controls
                 PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                     <char>             <int>
1: Affymetrix, Illumina [6314492] (imputed)                 4
                        MAPPED_TRAIT                     MAPPED_TRAIT_URI
                              <char>                               <char>
1: sporadic Creutzfeld Jacob disease http://www.ebi.ac.uk/efo/EFO_1000656
   STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
            <char>                       <char>          <lgcl>
1:    GCST90001389 Genome-wide genotyping array              NA
   STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
              <lgcl>           <lgcl>                  <char>
1:                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                                                                    COHORT
                                                                                                                                                                    <char>
1: Dutch controls|French controls|German controls|Italian controls|Spanish controls|UK controls|US controls|UK sCJD cases|US sCJD cases|German sCJD cases|other sCJD cases
   FULL_SUMMARY_STATISTICS
                    <char>
1:                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90001001-GCST90002000/GCST90001389
      GXE
   <char>
1:     no

gwas_study_info = 
  rows_update(gwas_study_info, tibble(PUBMED_ID = 32949544L, COHORT = "multiple"),
              by = "PUBMED_ID", unmatched = "ignore")

5.6 Odd naming convention in PUBMED ID 33649486

gwas_study_info |>
  filter(grepl("Multiethnic samples from the UK", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                          COHORT
       <int>                          <char>
1:  33649486 Multiethnic samples from the UK

gwas_study_info |>
    filter(PUBMED_ID == 33649486)

   DATE_ADDED_TO_CATALOG PUBMED_ID  FIRST_AUTHOR       DATE     JOURNAL
                  <IDat>     <int>        <char>     <IDat>      <char>
1:            2021-03-22  33649486 Hardcastle AJ 2021-03-01 Commun Biol
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/33649486
                                                                                                                                 STUDY
                                                                                                                                <char>
1: A multi-ethnic genome-wide association study implicates collagen matrix integrity and cell differentiation pathways in keratoconus.
   DISEASE/TRAIT
          <char>
1:   Keratoconus
                                                INITIAL_SAMPLE_SIZE
                                                             <char>
1: 2,116 European ancestry cases, 24,626 European ancestry controls
                                                                                                                                                                               REPLICATION_SAMPLE_SIZE
                                                                                                                                                                                                <char>
1: 1, 389 European ancestry cases, 79,727 European ancestry controls, 759 South Asian ancestry cases, 8,009 South Asian ancestry controls, 405 African ancestry cases, 4,185 African ancestry controls
       PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT MAPPED_TRAIT
                           <char>             <int>       <char>
1: Affymetrix [7701190] (imputed)                36  keratoconus
                               MAPPED_TRAIT_URI STUDY_ACCESSION
                                         <char>          <char>
1: http://purl.obolibrary.org/obo/MONDO_0015486    GCST90013442
          GENOTYPING_TECHNOLOGY SUBMISSION_DATE STATISTICAL_MODEL
                         <char>          <lgcl>            <lgcl>
1: Genome-wide genotyping array              NA                NA
   BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT_URI
             <lgcl>                  <char>                      <char>
1:               NA                                                    
                            COHORT FULL_SUMMARY_STATISTICS
                            <char>                  <char>
1: Multiethnic samples from the UK                     yes
                                                                              SUMMARY_STATS_LOCATION
                                                                                              <char>
1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90013001-GCST90014000/GCST90013442
      GXE
   <char>
1:     no

# looking at this study, the discovery
# controls come from UKBB and the
# cases were recruited from various places across the UK, so code as UKBB|other

gwas_study_info = 
  rows_update(gwas_study_info, tibble(PUBMED_ID = 33649486L, COHORT = "UKBB|other"),
              by = "PUBMED_ID", unmatched = "ignore")
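
The rows_update() calls in sections 5.5 and 5.6 overwrite COHORT for every row whose PUBMED_ID matches the patch. The toy example below illustrates that behaviour, assuming dplyr 1.1.0 or later (where the unmatched argument is available and keys in x need not be unique). The data and the PUBMED_ID values are made up for illustration.

```r
library(dplyr)

# Made-up data: study 1 has two rows, study 2 has one
studies <- tibble(
  PUBMED_ID = c(1L, 1L, 2L),
  COHORT    = c("UK controls", "UK cases", "UKBB")
)

# The patch also contains a key (99) that is absent from `studies`;
# unmatched = "ignore" skips it instead of raising an error.
patch <- tibble(
  PUBMED_ID = c(1L, 99L),
  COHORT    = c("multiple", "never used")
)

updated <- rows_update(studies, patch, by = "PUBMED_ID", unmatched = "ignore")
updated$COHORT
# [1] "multiple" "multiple" "UKBB"
```

For a single study, mutate(COHORT = if_else(PUBMED_ID == 1L, "multiple", COHORT)) gives the same result; rows_update() becomes more convenient once several studies are patched from one table.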

5.7 GAINT to GIANT

# GAINT appears to be a typo 
# see PUBMED_ID:    36376304 (https://pmc.ncbi.nlm.nih.gov/articles/PMC9663411/)

gwas_study_info |>
  filter(grepl("GAINT", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID     COHORT
       <int>     <char>
1:  36376304 UKBB|GAINT

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GAINT", "GIANT"))

5.8 1982 Pelotas (Brazil) Birth Cohort Study

unique(all_cohorts)[grepl("1982", unique(all_cohorts))]

[1] "1982 PELOTAS"                            
[2] "1982 Pelotas (Brazil) Birth Cohort Study"

gwas_study_info |>
  filter(grepl("1982", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  40537477
2:  39885687
3:  35399580
4:  35399580
                                                                                                                                                                                                                  COHORT
                                                                                                                                                                                                                  <char>
1:               ARIC|CARDIA|CHS|GENOA|HABC|HANDLS|JHS|MESA|WHI|SP2|other|BAEPENDI|1982 PELOTAS|AGES|ERF|FHS|HyperGEN|NEO|RS|WHI-GARNET|GeneSTAR|HRS|SMHS|SWHS|CoLaus|KORA|LBC|Lifelines|NESDA|SHIP-Trend|TRAILS|YFS|SOL
2:                                                                                                                                                       ZOE2.0|SLS|BioVU|MyCode|VFA|SOLYouth|1982 PELOTAS|CCHC|EGG|MOBA
3:             BioMe|Baependi|CANDELA|NC-BCFR|SFBCS|FIND|HCHS|SOL|Los Angeles Latino Eye Study|MEC|MESA|Mexico City 1|Mexico City 2|MHS|1982 Pelotas (Brazil) Birth Cohort Study|SAFS|STARR COUNTY|T2D SIGMA Studies|WHI
4: BioMe|Baependi|CANDELA|NC-BCFR|SFBCS|FIND|HCHS|SOL|Los Angeles Latino Eye Study|MEC|MESA|Mexico City 1|Mexico City 2|MHS|1982 Pelotas (Brazil) Birth Cohort Study|SAFS|STARR COUNTY|T2D SIGMA Studies|WHI|AAAGC|GIANT

# can confirm, 39885687 (https://pmc.ncbi.nlm.nih.gov/articles/PMC11875162/) 
# 1982 PELOTAS refers to 1982 Pelotas (Brazil) Birth Cohort Study

# can confirm: 40537477 (https://pmc.ncbi.nlm.nih.gov/articles/PMC12179276/#MOESM2)
# 1982 PELOTAS refers to 1982 Pelotas (Brazil) Birth Cohort Study
# fixed() is needed: the name contains parentheses, which a regex would treat as a group
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, fixed("1982 Pelotas (Brazil) Birth Cohort Study"), "1982 PELOTAS"))

gwas_study_info |>
  filter(grepl("\\bPELOTAS\\b", COHORT)) |>
  filter(!grepl("1982", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                           COHORT
       <int>                           <char>
1:  34059833 BioMe|IRAS|MESA|PELOTAS|HCHS|SOL

# can confirm: 34059833 
# PELOTAS refers to the 1982 PELOTAS study
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "(?<!1982 )PELOTAS", "1982 PELOTAS"))
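
Patterns passed to `str_replace_all()` are read as regular expressions by default, so parentheses in a cohort name form a capture group instead of matching literally. The sketch below, on toy strings modelled on the values in this section, contrasts the regex and `fixed()` readings and then applies the negative lookbehind used for bare `PELOTAS`. The vector `x` and the intermediate names are illustrative only.

```r
library(stringr)

# Toy COHORT strings modelled on the values shown in this section
x <- c(
  "MHS|1982 Pelotas (Brazil) Birth Cohort Study|SAFS",
  "MESA|PELOTAS|HCHS",
  "MyCode|1982 PELOTAS|CCHC"
)
long_name <- "1982 Pelotas (Brazil) Birth Cohort Study"

# As a regex, "(Brazil)" is a capture group, so the literal parentheses never match
as_regex <- str_replace_all(x, long_name, "1982 PELOTAS")

# fixed() compares the pattern character by character instead
as_fixed <- str_replace_all(x, fixed(long_name), "1982 PELOTAS")

# Negative lookbehind: prefix only those PELOTAS not already preceded by "1982 "
harmonised <- str_replace_all(as_fixed, "(?<!1982 )PELOTAS", "1982 PELOTAS")
harmonised
```

The regex version returns its input unchanged without any warning, which is why this kind of mismatch is easy to miss.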

5.9 GR@ACE

# I notice there are studies with the cohort listed as
# GR@CE and as GR@ACE - perhaps these are the same?

# Checking GR@CE first, as there are fewer of these studies listed ...
gwas_study_info |>
  filter(grepl("GR@CE", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  35379992
2:  39046104
3:  39046104
                                                                                                 COHORT
                                                                                                 <char>
1:                                        EADB|GR@CE|EADI|GERAD|PERADES|DemGene|Bonn|RS|CCHS|UKBB|other
2: 3C|AGES|ARIC|ASPREE|CHS|FVG|FHS|GR@CE|Apulia|HKOS|HUNT|MEMENTO|MYHAT|ROSMAP|RS|ADGC|UKBB|other|SALSA
3:             3C|AGES|ARIC|ASPREE|CHS|FVG|FHS|GR@CE|Apulia|HKOS|HUNT|MEMENTO|MYHAT|ROSMAP|RS|ADGC|UKBB

# 35379992 - GR@CE appears to be a typo; it should be GR@ACE
# (https://pmc.ncbi.nlm.nih.gov/articles/PMC9005347/#Sec8)

# 39046104 - GR@CE also appears to be a typo; it should be GR@ACE
# (https://pmc.ncbi.nlm.nih.gov/articles/PMC11497727/#alz14115-sec-0080)

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GR@CE", "GR@ACE"))

5.10 Tohoku Medical Megabank

gwas_study_info |> 
  filter(grepl("tohoku", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                          COHORT
       <int>                          <char>
1:  40226751         Tohoku Medical Megabank
2:  34782693 Tohoku Medical Megabank Project

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Tohoku Medical Megabank Project", "Tohoku Medical Megabank"))

5.11 Steno Diabetes

gwas_study_info |> 
  filter(grepl("steno", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT, `DISEASE/TRAIT`) |>
  distinct()

   PUBMED_ID
       <int>
1:  34127860
2:  35627254
                                                                                                                                                                              COHORT
                                                                                                                                                                              <char>
1: BC58|BDA|NIHR BioResource|GRID|UKBS|BRI|CLEAR|EDIC|GoKinD|NYCP|NIMH|SEARCH|TrialNet|T1DGC|UAB|UC|UCSF|IDDMGEN|T1DGEN|MCW|GRID-NI|Young Hearts-NI|Steno Diabetes Center|HSG|HapMap
2:                                                                                                                                                                       Steno|other
                                           DISEASE/TRAIT
                                                  <char>
1:                                       Type 1 diabetes
2: Neuropeptide Y autoantibody levels in type 1 diabetes

all_cohorts[grep("steno", tolower(all_cohorts))] |> unique()

[1] "Steno Diabetes Center" "Steno"

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Steno Diabetes Center", "Steno"))

5.12 Nagahama

gwas_study_info |>
  filter(grepl("nagahama", tolower(COHORT))) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  35551307
2:  34059833
3:  34059833
4:  34887591
5:  40181193
6:  38277453
                                                                                                                                                                                                                             COHORT
                                                                                                                                                                                                                             <char>
1:                                                                                                               AASC|BBJ|BES|CAGE|CHNS|CKB|CLHNS|DC|SP2|HKDR|KARE|other|MESA|Nagahama Study|SBCS|SWHS|SCES|SCHS|SiMES|TAICHI|TWT2D
2:                                                                                                            AASC|BES|CAGE-GWAS1|CAGE|CLHNS|CHNS|KARE|Living Biobank|MESA|Nagahama Study|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|SBCS|SMHS
3:                                                                                                                                  CAGE-GWAS1|CAGE|CHNS|KARE|LivingBiobank|MESA|NagahamaStudy|NHAPC|SCES|SiMES|SP2|TAICHI|CRC|TWSC
4:                                                                                      BAS|BBJ|BES|CAGE|CAS|CHNS|CKB|SDCS|JPDSC|KARE|Living-biobank|MESA|Nagahama Study|NHAPC|SBCS|SCES|SCHS|SiMES|SINDI|SP2|SWHS|TUDR|TWT2D|other
5: AGES|ALSPAC|ARIC|BHS_b|CARDIA|CCHC|CFS|CHS|COLAUS|DIACORE|DRS_EXTRA|EPIC-Norfolk|EB|FHS|Fenland|GAPP|GENSALT|HANDLS|HCS|IRASFS|JHS|KOGES|LBC|LifeLines|LLFS|MESA|MVP|Nagahama_Study|NEO|NESDA|SHIP|SOL|SWAN|TwinsUK|UKBB|WHI|YFS
6:                                                                                                                                                                                           HERPACC|J-MICC|JPHC|ToMMo|Nagahama|BBJ

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Nagahama_Study|NagahamaStudy", "Nagahama Study")) |> 
  # TODO: confirm that "Nagahama" and "Nagahama Study" refer to the same cohort
  mutate(COHORT = str_replace_all(COHORT, "Nagahama Study", "Nagahama"))

5.13 WTCCC - Bipolar disease cases

gwas_study_info |>
  filter(grepl("WTCCC - Bipolar disease cases", COHORT)) |>
  select(1:5)

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE      JOURNAL
                  <IDat>     <int>       <char>     <IDat>       <char>
1:            2021-04-23  33830302   Inshaw JRJ 2021-04-08 Diabetologia

gwas_study_info |>
  filter(PUBMED_ID == 33830302) |>
  select(PUBMED_ID, COHORT)

   PUBMED_ID
       <int>
1:  33830302
2:  33830302
                                                                                                                                                                     COHORT
                                                                                                                                                                     <char>
1: GRID|British 1958 birth cohort|National blood service|WTCCC - Bipolar disease cases|Oxford Regional Prospective Study of Childhood Diabetes (ORPS)|Sardinia case-control
2:

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "WTCCC - Bipolar disease cases", "WTCCC"))

5.14 Qatar Genome Project

gwas_study_info |>
  filter(grepl("QGP", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                     COHORT
       <int>                     <char>
1:  33623009 Qatar Genome Program (QGP)
2:  36168886                        QGP

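# Checked 36168886: QGP is the Qatar Genome Project,
# so collapse the long form to the abbreviation.
# fixed() is needed because the name contains literal parentheses
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, fixed("Qatar Genome Program (QGP)"), "QGP"))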

5.15 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique(all_cohorts) |> length()

[1] 1125
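
Since this count is repeated after each round of fixes, it can be wrapped in a small function. The sketch below is a hypothetical helper, not part of the analysis. Unlike the check above, it also drops missing and empty labels, so its result may differ slightly from the figure reported here.

```r
# Hypothetical helper: count distinct cohort labels in pipe-delimited strings
count_unique_cohorts <- function(cohort) {
  labels <- unlist(strsplit(cohort, "|", fixed = TRUE))
  labels <- labels[!is.na(labels) & nzchar(labels)]  # drop NA and empty labels
  length(unique(labels))
}

# Toy input: UKBB appears twice; the empty string and the NA contribute nothing
n_toy <- count_unique_cohorts(c("UKBB|GIANT", "UKBB|other", "", NA))
n_toy
```

With `fixed = TRUE` the separator is a literal `|`, so no regex escaping is needed.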

6 Discrepancies corrected across papers (Not checked but likely):

6.1 canSCAD example:

# "canSCAD" vs "CanSCAD cases and MGI controls"
gwas_study_info |>
  filter(grepl("CanSCAD cases and MGI controls", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID                         COHORT
       <int>                         <char>
1:  32887874 CanSCAD cases and MGI controls

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "CanSCAD cases and MGI controls", "canSCAD|MGI"))

6.2 Potentially simple: checking similar names (differing only in capitalisation)

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique_cohort_names = unique(all_cohorts) 
# Convert to lowercase and check duplicates
dup_groups <- tapply(unique_cohort_names, tolower(unique_cohort_names), I)

# Keep only groups with >1 element (i.e., capitalization differences)
dup_groups[lengths(dup_groups) > 1]

$airwave
[1] "Airwave" "AIRWAVE"

$allofus
[1] "AllofUs" "AllOfUs"

$baependi
[1] "BAEPENDI" "Baependi"

$biome
[1] "BioMe" "BioME" "BIOME"

$biovu
[1] "BioVU" "BioVu" "BIOVU"

$cilento
[1] "CILENTO" "Cilento"

$colaus
[1] "CoLaus" "COLAUS"

$`croatia-korcula`
[1] "CROATIA-KORCULA" "CROATIA-Korcula"

$famhs
[1] "FamHS" "FAMHS"

$fenland
[1] "Fenland" "FENLAND"

$gel
[1] "GEL" "GeL"

$genestar
[1] "GeneSTAR" "GENESTAR" "GeneStar"

$gensalt
[1] "GENSalt" "GENSALT" "GenSalt"

$godarts
[1] "GoDARTS" "GODARTS"

$hypergen
[1] "HyperGEN" "HyperGen" "HYPERGEN"

$inchianti
[1] "InCHIANTI" "INCHIANTI"

$inter99
[1] "Inter99" "INTER99"

$koges
[1] "KoGES" "KOGES"

$`life-heart`
[1] "LIFE-HEART" "LIFE-Heart"

$lifelines
[1] "LifeLines" "Lifelines"

$`mayo-vdb`
[1] "MAYO-VDB" "Mayo-VDB"

$moba
[1] "MOBA" "MoBa"

$nugene
[1] "Nugene" "NUGENE"

$orcades
[1] "ORCADES" "Orcades"

$panscan
[1] "PANSCAN" "PanScan"

$raine
[1] "RAINE" "Raine"

$`ship-trend`
[1] "SHIP-TREND" "SHIP-Trend"

$sign
[1] "SiGN" "SIGN"

$viva
[1] "Viva" "VIVA"
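
The listing above shows which spellings collide but not which one to keep. One possible heuristic, which is an assumption and not the rule used in this analysis, is to keep the spelling that occurs most often. A sketch on toy labels:

```r
# Toy occurrences of cohort labels (hypothetical frequencies)
all_labels <- c("Airwave", "AIRWAVE", "Airwave",
                "BioMe", "BIOME", "BioMe", "BioME", "UKBB")

# How often each exact spelling occurs
counts <- table(all_labels)

# Group spellings by lowercase form; keep groups with more than one spelling
groups <- split(names(counts), tolower(names(counts)))
groups <- groups[lengths(groups) > 1]

# Pick the most frequent spelling in each group as the candidate canonical form
canonical <- vapply(groups, function(g) g[which.max(counts[g])], character(1))
canonical
```

Ties are resolved by whichever spelling comes first in the table, so the candidates are best treated as suggestions to check against the source papers.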

6.3 Potentially simple: checking similar names (differing only in spaces and underscores)

# Normalize by removing spaces and underscores
normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized value
dup_groups <- tapply(sort(unique_cohort_names), normalized, I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

$DRSEXTRA
[1] "DRS_EXTRA" "DRSEXTRA" 

$GALAII
[1] "GALA II" "GALA_II"

$Health2000
[1] "Health 2000" "Health2000" 

$HealthABC
[1] "Health ABC" "HealthABC" 

$`INGI-ValBorbera`
[1] "INGI-Val Borbera" "INGI-Val_Borbera"

$LivingBiobank
[1] "Living Biobank" "LivingBiobank"
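
The two checks above look at capitalisation and at spacing separately, so a variant that differs in both (such as "Living-biobank" versus "Living Biobank", seen in the Nagahama listing) is not flagged by either. A sketch of a combined key follows. Stripping hyphens is an assumption that may merge genuinely distinct cohorts, so the groups still need manual review.

```r
# Toy labels taken from spellings that appear in this report
labels <- c("Living Biobank", "LivingBiobank", "Living-biobank",
            "Nagahama Study", "Nagahama_Study", "NagahamaStudy", "UKBB")

# Key: lowercase with spaces, underscores and hyphens removed
key <- tolower(gsub("[ _-]", "", labels))

# Group the original spellings by key; keep keys with more than one spelling
groups <- split(labels, key)
variants <- groups[lengths(groups) > 1]
variants
```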

6.4 Airwave

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AIRWAVE", "Airwave"))

6.5 AllOfUs

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "AllOfUs", "AllofUs"))

6.6 Baependi

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("Baependi"), "Baependi"))

7 BioMe

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "BIOME|BioME", "BioMe"))

7.1 BioVU

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "BIOVU|BioVu", "BioVU"))

7.2 Cilento

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("Cilento"), "Cilento"))

7.3 CoLaus

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "COLAUS", "CoLaus"))

7.4 CROATIA-Korcula

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "CROATIA-KORCULA", "CROATIA-Korcula"))

7.5 DRS_EXTRA

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "DRSEXTRA", "DRS_EXTRA"))

7.6 FamHS

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, toupper("FamHS"), "FamHS"))

7.7 Fenland

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "FENLAND", "Fenland"))

7.8 GALA II

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GALA_II", "GALA II"))

7.9 GEL

gwas_study_info |>
  filter(grepl("GeL", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID COHORT
       <int> <char>
1:  36124557    GeL

# Only one study uses GeL (36124557); per
# https://pmc.ncbi.nlm.nih.gov/articles/PMC9512401/#s4
# it appears to be a typo for Genomics England (GEL)
gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "GeL", "GEL"))

7.10 GeneSTAR

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "GENESTAR|GeneStar", "GeneSTAR"))

7.11 GENSalt

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, "GENSALT|GenSalt", "GENSalt"))

7.12 GoDARTS

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("godarts"), "GoDARTS"))

7.13 InCHIANTI

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("InCHIANTI"), "InCHIANTI"))

7.14 Inter99

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Inter99"), "Inter99"))

7.15 “Health ABC”

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Health ABC", "HealthABC"))

7.16 “Health 2000”

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Health 2000", "Health2000"))

7.17 HyperGen

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "HyperGEN|HYPERGEN", "HyperGen")) 

7.18 INGI-Val Borbera

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "INGI-Val_Borbera", "INGI-Val Borbera")) 

7.19 KoGES

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("KoGES"), "KoGES"))

7.20 LIFE-Heart

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "LIFE-HEART", "LIFE-Heart"))

7.21 LifeLines

gwas_study_info  = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Lifelines", "LifeLines")) 

# ? LifeLines Deep

7.22 Living Biobank

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Living-biobank|LivingBiobank", "Living Biobank"))

7.23 Mayo-VDB

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Mayo-VDB"), "Mayo-VDB"))

7.24 MoBa

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("MoBa"), "MoBa"))

7.25 Nugene

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Nugene", "NUGENE"))

7.26 Orcades

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Orcades"), "Orcades"))

7.27 PanScan

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("PanScan"), "PanScan"))

7.28 Raine

gwas_study_info = gwas_study_info |> 
  mutate(COHORT = str_replace_all(COHORT, toupper("Raine"), "Raine"))

7.29 ROSMAP

all_cohorts[grep("rosmap", tolower(all_cohorts))] |> unique()

[1] "ROSMAP"   "ROSMAP 1" "ROSMAP 2"

gwas_study_info |>
  filter(grepl("ROSMAP 1|ROSMAP 2", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct()

   PUBMED_ID
       <int>
1:  33510174
                                                                                                                                                                      COHORT
                                                                                                                                                                      <char>
1: ARIC|BASE-II|BPROOF|CHS|EPIC-Norfolk|FHS|HRS|InCHIANTI|LASA I|LASA II|Long Life Family Study|MrOS Gothenburg|MrOS Malmo|ROSMAP 1|ROSMAP 2|RS|RSI|RSII|SHIP|TSHA|UKBB|WLS|

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "ROSMAP 1|ROSMAP 2", "ROSMAP"))
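
Collapsing "ROSMAP 1" and "ROSMAP 2" to "ROSMAP" leaves the label repeated within the same COHORT string. If repeated labels are unwanted downstream (an assumption; the analysis may handle this elsewhere), they can be dropped per row. A sketch with a hypothetical helper:

```r
# Hypothetical helper: remove repeated labels within each pipe-delimited string
dedupe_cohorts <- function(cohort) {
  vapply(strsplit(cohort, "|", fixed = TRUE), function(labels) {
    # Keep missing values missing instead of turning them into the text "NA"
    if (length(labels) == 1 && is.na(labels)) return(NA_character_)
    paste(unique(labels), collapse = "|")
  }, character(1))
}

deduped <- dedupe_cohorts(c("MrOS Malmo|ROSMAP|ROSMAP|RS", "UKBB", NA))
deduped
```

`unique()` keeps the first occurrence of each label, so the original order of cohorts is preserved.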

7.30 SHIP-TREND

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SHIP-Trend", "SHIP-TREND"))

# ? "SHIPNATREND"  - comes from one study
gwas_study_info |>
 filter(grepl("SHIPNATREND", COHORT)) |>
 select(PUBMED_ID, COHORT) |>
 distinct()

    PUBMED_ID
        <int>
 1:  32888493
 2:  32888493
 3:  32888493
 4:  32888493
 5:  32888493
 6:  32888493
 7:  32888493
 8:  32888493
 9:  32888493
10:  32888493
11:  32888493
12:  32888493
13:  32888493
14:  32888493
15:  32888493
16:  32888493
17:  32888493
18:  32888493
19:  32888493
20:  32888493
21:  32888493
22:  32888493
    PUBMED_ID
                                                                                                                                                                                                                 COHORT
                                                                                                                                                                                                                 <char>
 1:                                                                                     Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIPNATREND|UKBB|WHI
 2:                                                                      Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIPNATREND|UKBB|WHI
 3:                                                                                           Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 4:                                                                            Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 5:                                                                                       Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 6:                                                                    Airwave|BBJ|BioMe|CaPS|CHS|CHS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 7:                                                                            Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 8:                                                             Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
 9:                                                                                                              Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|INTERVAL|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
10:                                                                                               Airwave|BBJ|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
11:                                                                        Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
12:                                                         Airwave|BBJ|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
13:                                       Airwave|BioMe|CaPS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|Health2006|Health2008|Health2010|INTERVAL|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
14:                   Airwave|BBJ|BioMe|CaPS|CHNS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|HANDLS|Health2006|Health2008|Health2010|INTERVAL|JHS|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI|YFS
15:                                                                                                                     Airwave|BioMe|CaPS|Estonia|FHS|INTERVAL|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI
16:                                                                               Airwave|BioMe|BioMe|BioMe|CaPS|Estonia|FHS|HANDLS|INTERVAL|JHS|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
17:                                                                                               Airwave|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|INTERVAL|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|WHI
18:                                 Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
19: Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|CHNS|CHS|CHS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI|YFS
20:         Airwave|BBJ|BioMe|BioMe|BioMe|CaPS|CHNS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI|YFS
21:                                                                                                              Airwave|BioMe|CaPS|FHS|GERA|GERA|GERA|INTERVAL|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|WHI
22:                                                              Airwave|BioMe|BioMe|BioMe|CaPS|FHS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MHIphase1|MHIphase2|RS|RS|RSI|SHIP|SHIPNATREND|UKBB|UKBB|UKBB|UKBB|WHI
                                                                                                                                                                                                                 COHORT

# from sup table, seems like SHIPNATREND is SHIP-TREND - 
# https://pmc.ncbi.nlm.nih.gov/articles/PMC7480402/#SD1

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SHIPNATREND", "SHIP-TREND"))

7.31 SiGN

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "SIGN", "SiGN"))
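
Because `str_replace_all()` matches substrings, a short pattern such as "SIGN" would also rewrite any longer cohort label that happens to contain those letters. A sketch of recoding whole labels only is given below. The helper and the "DESIGN" label are hypothetical, included only to show a label that must stay untouched.

```r
# Hypothetical helper: recode whole labels only, never substrings
recode_cohorts <- function(cohort, lookup) {
  vapply(strsplit(cohort, "|", fixed = TRUE), function(labels) {
    # Keep missing values missing instead of turning them into the text "NA"
    if (length(labels) == 1 && is.na(labels)) return(NA_character_)
    hit <- labels %in% names(lookup)        # labels with an exact entry in the lookup
    labels[hit] <- lookup[labels[hit]]
    paste(labels, collapse = "|")
  }, character(1))
}

# Names are the spellings to replace, values the spellings to keep
lookup <- c(SIGN = "SiGN", Viva = "VIVA")
recoded <- recode_cohorts(c("SIGN|UKBB", "DESIGN|Viva", NA), lookup)
recoded
```

One caveat: `strsplit()` drops a trailing empty field, so a COHORT value ending in `|` comes back without that final separator.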

7.32 VIVA

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Viva", "VIVA"))

7.33 Other - Rotterdam

gwas_study_info |>
  filter(grepl("Rotterdam", COHORT))

   DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE   JOURNAL
                  <IDat>     <int>       <char>     <IDat>    <char>
1:            2025-03-11  40050429    Roselli C 2025-03-06 Nat Genet
                                   LINK
                                 <char>
1: www.ncbi.nlm.nih.gov/pubmed/40050429
                                                                                                                         STUDY
                                                                                                                        <char>
1: Meta-analysis of genome-wide associations and polygenic risk prediction for atrial fibrillation in more than 180,000 cases.
         DISEASE/TRAIT
                <char>
1: Atrial fibrillation
                                                                                                                                                                                                                                                                                                                                                                                      INITIAL_SAMPLE_SIZE
                                                                                                                                                                                                                                                                                                                                                                                                   <char>
1: 1,782 Admix African and African American cases, 9,356 Admix African and African American controls, 11,350 East Asian ancestry cases, 137,515 East Asian ancestry controls, 166,322 European ancestry cases, 1,313,950 European ancestry controls, 1,774 Hispanic or Latin American cases, 7,665 Hispanic or Latin American controls, 218 South Asian ancestry cases, 413 South Asian ancestry controls
   REPLICATION_SAMPLE_SIZE                PLATFORM_[SNPS_PASSING_QC]
                    <char>                                    <char>
1:                    <NA> Affymetrix, Illumina [29789980] (imputed)
   ASSOCIATION_COUNT        MAPPED_TRAIT                     MAPPED_TRAIT_URI
               <int>              <char>                               <char>
1:               355 atrial fibrillation http://www.ebi.ac.uk/efo/EFO_0000275
   STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
            <char>                       <char>          <lgcl>
1:    GCST90559230 Genome-wide genotyping array              NA
   STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
              <lgcl>           <lgcl>                  <char>
1:                NA               NA                        
   MAPPED_BACKGROUND_TRAIT_URI
                        <char>
1:                            
                                                                                                                                                                                                                                 COHORT
                                                                                                                                                                                                                                 <char>
1: AGES|ARIC|BioMe|Broad CVDi|BBJ|CHS|MESA|SiGN|ENGAGE_AF-TIMI_48|SPHFC|CCAF|CHB|MyCode|EGCUT|FHS|GAPP|GS:SFHS|HRS|LURIC|HUNT|MGI|PHB|PIVUS|PREVEND|PROSPER|Rotterdam|SHIP|SiGN|TwinGene|ULSAM|Vanderbilt|WGHS|WTCCC|FinnGen|UKBB|other
   FULL_SUMMARY_STATISTICS SUMMARY_STATS_LOCATION    GXE
                    <char>                 <char> <char>
1:                      no                   <NA>     no

# Rotterdam study is typically listed as "RS"

# see e.g. 36568030 https://pmc.ncbi.nlm.nih.gov/articles/PMC9772568/
gwas_study_info |>
  filter(grepl("\\bRS\\b", COHORT))

     DATE_ADDED_TO_CATALOG PUBMED_ID FIRST_AUTHOR       DATE
                    <IDat>     <int>       <char>     <IDat>
  1:            2023-03-21  36662418     Faber BG 2023-01-20
  2:            2023-05-12  36918541     Young WJ 2023-03-14
  3:            2023-05-12  36918541     Young WJ 2023-03-14
  4:            2023-05-12  36918541     Young WJ 2023-03-14
  5:            2023-05-12  36918541     Young WJ 2023-03-14
 ---                                                        
332:            2023-01-31  36568030     Young KL 2022-11-25
333:            2023-01-31  36568030     Young KL 2022-11-25
334:            2023-01-31  36568030     Young KL 2022-11-25
335:            2023-01-31  36568030     Young KL 2022-11-25
336:            2023-01-31  36568030     Young KL 2022-11-25
                 JOURNAL                                 LINK
                  <char>                               <char>
  1: Arthritis Rheumatol www.ncbi.nlm.nih.gov/pubmed/36662418
  2:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  3:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  4:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
  5:          Nat Commun www.ncbi.nlm.nih.gov/pubmed/36918541
 ---                                                         
332:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
333:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
334:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
335:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
336:             HGG Adv www.ncbi.nlm.nih.gov/pubmed/36568030
                                                                                                                                 STUDY
                                                                                                                                <char>
  1: A GWAS meta-analysis of alpha angle suggests cam-type morphology may be a specific feature of hip osteoarthritis in older adults.
  2:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  3:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  4:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
  5:        Genetic architecture of spatial electrical biomarkers for cardiac arrhythmia and relationship with cardiovascular disease.
 ---                                                                                                                                  
332:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
333:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
334:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
335:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
336:    Whole-exome sequence analysis of anthropometric traits illustrates challenges in identifying effects of rare genetic variants.
           DISEASE/TRAIT
                  <char>
  1:         Alpha angle
  2: Frontal QRS-T angle
  3: Spatial QRS-T angle
  4: Spatial QRS-T angle
  5: Frontal QRS-T angle
 ---                    
332:     Waist-hip ratio
333:     Waist-hip ratio
334:     Waist-hip ratio
335:     Waist-hip ratio
336:     Waist-hip ratio
                                                                     INITIAL_SAMPLE_SIZE
                                                                                  <char>
  1:                                                44,214 European ancestry individuals
  2: 159,715 European ancestry, African ancestry, Hispanic or Latin American individuals
  3:                                                96,562 European ancestry individuals
  4: 118,780 European ancestry, African ancestry, Hispanic or Latin American individuals
  5:                                               134,567 European ancestry individuals
 ---                                                                                    
332:                                                15,503 European ancestry individuals
333:                                                       8,678 European ancestry women
334:                                                         6,825 European ancestry men
335:                         2,987 African ancestry women, 8,678 European ancestry women
336:                             1,307 African ancestry men, 6,825 European ancestry men
                                       REPLICATION_SAMPLE_SIZE
                                                        <char>
  1:                                                      <NA>
  2:                                                      <NA>
  3:                                                      <NA>
  4:                                                      <NA>
  5:                                                      <NA>
 ---                                                          
332:                       1,229 European ancestry individuals
333:                               771 European ancestry women
334:                                 758 European ancestry men
335: 771 European ancestry women, 2,308 African American women
336:     758 European ancestry men, 1,239 African American men
                   PLATFORM_[SNPS_PASSING_QC] ASSOCIATION_COUNT
                                       <char>             <int>
  1: Affymetrix, Illumina [9134976] (imputed)                 8
  2: Affymetrix, Illumina [8299259] (imputed)                11
  3: Affymetrix, Illumina [8603009] (imputed)                51
  4: Affymetrix, Illumina [9052360] (imputed)                61
  5: Affymetrix, Illumina [7954211] (imputed)                 9
 ---                                                           
332:                               NR [67633]                 0
333:                               NR [67633]                 0
334:                               NR [67633]                 0
335:                               NR [67633]                 0
336:                               NR [67633]                 0
                MAPPED_TRAIT                     MAPPED_TRAIT_URI
                      <char>                               <char>
  1: alpha angle measurement http://www.ebi.ac.uk/efo/EFO_0020071
  2:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  3:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  4:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
  5:             QRS-T angle http://www.ebi.ac.uk/efo/EFO_0020097
 ---                                                             
332:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
333:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
334:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
335:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
336:         waist-hip ratio http://www.ebi.ac.uk/efo/EFO_0004343
     STUDY_ACCESSION        GENOTYPING_TECHNOLOGY SUBMISSION_DATE
              <char>                       <char>          <lgcl>
  1:    GCST90129635 Genome-wide genotyping array              NA
  2:    GCST90246319 Genome-wide genotyping array              NA
  3:    GCST90246320 Genome-wide genotyping array              NA
  4:    GCST90246318 Genome-wide genotyping array              NA
  5:    GCST90246321 Genome-wide genotyping array              NA
 ---                                                             
332:    GCST90245813        Exome-wide sequencing              NA
333:    GCST90245814        Exome-wide sequencing              NA
334:    GCST90245815        Exome-wide sequencing              NA
335:    GCST90245816        Exome-wide sequencing              NA
336:    GCST90245817        Exome-wide sequencing              NA
     STATISTICAL_MODEL BACKGROUND_TRAIT MAPPED_BACKGROUND_TRAIT
                <lgcl>           <lgcl>                  <char>
  1:                NA               NA                        
  2:                NA               NA                        
  3:                NA               NA                        
  4:                NA               NA                        
  5:                NA               NA                        
 ---                                                           
332:                NA               NA                        
333:                NA               NA                        
334:                NA               NA                        
335:                NA               NA                        
336:                NA               NA                        
     MAPPED_BACKGROUND_TRAIT_URI
                          <char>
  1:                            
  2:                            
  3:                            
  4:                            
  5:                            
 ---                            
332:                            
333:                            
334:                            
335:                            
336:                            
                                                                                                                         COHORT
                                                                                                                         <char>
  1:                                                                                                                    UKBB|RS
  2: ARIC|other|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  3: ARIC|other|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  4: ARIC|other|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
  5: ARIC|other|BRIGHT|CHRIS|CHS|ERF|GS:SFHS|HCHS|SOL|Inter99|JHS|LifeLines|MESA|NEO|Orcades|PREVEND|PROSPER|RS|UKBB|VIKING|WHI
 ---                                                                                                                           
332:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS|other
333:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS|other
334:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS|other
335:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS|other
336:                                                                                            ARIC|CHS|ERF|FHS|GOLDN|RS|other
     FULL_SUMMARY_STATISTICS
                      <char>
  1:                     yes
  2:                     yes
  3:                     yes
  4:                     yes
  5:                     yes
 ---                        
332:                      no
333:                      no
334:                      no
335:                      no
336:                      no
                                                                                SUMMARY_STATS_LOCATION
                                                                                                <char>
  1: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90129001-GCST90130000/GCST90129635
  2: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246319
  3: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246320
  4: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246318
  5: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90246001-GCST90247000/GCST90246321
 ---                                                                                                  
332:                                                                                              <NA>
333:                                                                                              <NA>
334:                                                                                              <NA>
335:                                                                                              <NA>
336:                                                                                              <NA>
        GXE
     <char>
  1:     no
  2:     no
  3:     no
  4:     no
  5:     no
 ---       
332:     no
333:     no
334:     no
335:     no
336:     no

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "Rotterdam", "RS"))

7.34 China Kadoorie Biobank

# CKB is the acronym for the China Kadoorie Biobank (see PubMed ID 36777997: https://pmc.ncbi.nlm.nih.gov/articles/PMC9903787/#tbl1)

gwas_study_info |>
  filter(grepl("\\bCKB\\b", COHORT)) |>
  select(PUBMED_ID, COHORT) |>
  distinct() |>
  tail()

   PUBMED_ID                                                             COHORT
       <int>                                                             <char>
1:  34586374                                                          CKB|other
2:  34586374                                                     CKB|other|UKBB
3:  34586374                                                      CKB|WHI|other
4:  33766948                                                                CKB
5:  36777997    BBJ|BioMe|BioVU|CCPM|CKB|EB|FinnGen|G&H|HUNT|MGBB|MGI|UCLA|UKBB
6:  36777997 BBJ|BioMe|BioVU|CCPM|CKB|EB|FinnGen|G&H|HUNT|MGBB|MGI|UCLA|UKBB|NR

gwas_study_info = gwas_study_info |>
  mutate(COHORT = str_replace_all(COHORT, "China Kadoorie Biobank", "CKB"))
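The renames in this section replace a substring wherever it occurs, so a pattern would also alter any longer label that happens to contain it. A minimal sketch of a stricter alternative, assuming labels are separated by `|` as in the COHORT column: `rename_cohort_token()` is a hypothetical helper (not part of the original analysis) that renames a label only when it matches a whole token. It uses base R on toy data, and the "pilot" label is made up for illustration.

```r
# Hypothetical helper: rename a cohort only when it is an entire
# pipe-delimited token, leaving longer labels that contain it untouched.
rename_cohort_token <- function(cohort, from, to) {
  tokens <- strsplit(cohort, "|", fixed = TRUE)
  vapply(tokens, function(x) {
    # Preserve missing values instead of turning them into the string "NA"
    if (length(x) == 1 && is.na(x)) return(NA_character_)
    x[x == from] <- to
    paste(x, collapse = "|")
  }, character(1))
}

# Toy data; the second label is invented to show a partial match
toy <- c("China Kadoorie Biobank|UKBB",
         "China Kadoorie Biobank pilot|other",
         NA)
renamed <- rename_cohort_token(toy, "China Kadoorie Biobank", "CKB")
print(renamed)
```

Whether whole-token matching is preferable depends on the labels involved: substring replacement is the right tool when the full name appears embedded in a longer free-text label.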

7.35 Check in: how many unique cohorts now?

all_cohorts = gwas_study_info$COHORT
all_cohorts = unlist(strsplit(all_cohorts, "\\|"))
unique(all_cohorts) |> length()

[1] 1078

7.36 Check in: have we corrected the simple changes we sought to correct?

unique_cohort_names = unique(all_cohorts) 

# Convert to lowercase and check duplicates
dup_groups <- tapply(unique_cohort_names, tolower(unique_cohort_names), I)

# Keep only groups with >1 element (i.e., capitalization differences)
dup_groups[lengths(dup_groups) > 1]

named character(0)

normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized value
dup_groups <- tapply(sort(unique_cohort_names), normalized, I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

named character(0)

7.37 Additional check: case and space/underscore differences combined

normalized <- gsub("[ _]", "", sort(unique_cohort_names))

# Group by normalized, lowercased value
dup_groups <- tapply(sort(unique_cohort_names), tolower(normalized), I)

# Keep only groups with >1 element (i.e. variants)
dup_groups[lengths(dup_groups) > 1]

named character(0)
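The three checks above share one pattern: map every label to a normalized key, group labels by that key, and keep groups with more than one member. A minimal sketch of that pattern as a single function, run on toy labels; `find_label_variants()` and its `normalise` argument are hypothetical names introduced here, not part of the original analysis.

```r
# Hypothetical helper: return groups of labels that collide after normalisation.
# The default normaliser lowercases and removes spaces and underscores.
find_label_variants <- function(labels,
                                normalise = function(x) tolower(gsub("[ _]", "", x))) {
  labels <- sort(unique(labels))
  groups <- split(labels, normalise(labels))
  # Keep only keys shared by more than one distinct label
  groups[lengths(groups) > 1]
}

# Toy labels: three spellings of one cohort plus two distinct cohorts
toy_labels <- c("EPIC_CAD", "EPIC CAD", "epic_cad", "GOCS", "GOCS_Chilean")
variants <- find_label_variants(toy_labels)
print(variants)
```

Passing a different `normalise` function (for example, one that also strips hyphens) would extend the same check to other kinds of variant without rewriting the grouping step.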

8 Check in: how many cohorts are used only once (possibly indicating a misnaming error)?

# n() counts occurrences across the rows of gwas_study_info, not distinct PUBMED_IDs
single_use_cohorts =
  data.frame(cohort = all_cohorts) |>
  group_by(cohort) |>
  summarise(n_studies = n()) |>
  filter(n_studies == 1) |>
  pull(cohort)

length(single_use_cohorts)

[1] 208
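`single_use_cohorts` counts how often each label occurs in `all_cohorts`, which holds one entry per label per row of `gwas_study_info`. A paper with several study rows therefore contributes several occurrences. If the goal is cohorts tied to a single PubMed ID, the labels need to be paired with `PUBMED_ID` before counting. A minimal base R sketch on toy data; the PubMed IDs and the objects `toy_studies`, `cohort_by_paper`, and `single_paper_cohorts` are made up for illustration.

```r
# Toy stand-in for gwas_study_info (PubMed IDs are invented)
toy_studies <- data.frame(
  PUBMED_ID = c(1L, 1L, 2L, 3L),
  COHORT = c("UKBB|RS", "UKBB", "RS|CKB", "UKBB|other"),
  stringsAsFactors = FALSE
)

# One row per (paper, cohort label) occurrence
tokens <- strsplit(toy_studies$COHORT, "|", fixed = TRUE)
cohort_by_paper <- data.frame(
  PUBMED_ID = rep(toy_studies$PUBMED_ID, lengths(tokens)),
  cohort = unlist(tokens),
  stringsAsFactors = FALSE
)

# Count distinct papers per cohort, then keep cohorts seen in exactly one paper
pairs <- unique(cohort_by_paper)
n_papers <- table(pairs$cohort)
single_paper_cohorts <- names(n_papers)[as.vector(n_papers) == 1]
print(single_paper_cohorts)
```

In the toy data, "UKBB" occurs in three rows but only two papers, which is the distinction the two counting schemes disagree on.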

9 Fuzzy name look-up

library(stringdist)
library(dplyr)

# Create a vector of unique cohort names
cohorts <- unique(all_cohorts)

# Compute pairwise string distances (Levenshtein distance)
dist_matrix <- stringdistmatrix(single_use_cohorts, cohorts, method = "lv")

# Identify pairs with small distance (e.g., <=2 edits)
threshold <- 2
matches <- which(dist_matrix > 0 & dist_matrix <= threshold, arr.ind = TRUE)
matches <- data.frame(
  cohort1 = single_use_cohorts[matches[,1]],
  cohort2 = cohorts[matches[,2]],
  distance = dist_matrix[matches]
)
matches <- matches[matches$cohort1 != matches$cohort2, ]
matches <- unique(matches)

matches |>
  arrange(distance) |>
  head()

  cohort1 cohort2 distance
1    CHIP    SHIP        1
2   SpBCS   SEBCS        1
3      NZ      NR        1
4    DCHS   DACHS        1
5     HIS     HAS        1
6    MACS    MCCS        1
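The closest pairs above are all short acronyms, where a single edit can separate cohorts that are genuinely different (the notes further down reach that conclusion for "RISC" and "RISK"). One possible refinement, sketched here under stated assumptions, is to report only pairs whose shorter label has a minimum length. `near_matches()` and its `min_chars` cut-off are hypothetical, and base R's `utils::adist()` (generalized Levenshtein distance) is swapped in for `stringdistmatrix()` only so the sketch runs without extra packages.

```r
# Hypothetical helper: near-duplicate labels within max_dist edits,
# ignoring pairs where the shorter label has fewer than min_chars characters.
near_matches <- function(query, reference, max_dist = 2, min_chars = 5) {
  d <- utils::adist(query, reference)
  idx <- which(d > 0 & d <= max_dist, arr.ind = TRUE)
  out <- data.frame(
    cohort1 = query[idx[, 1]],
    cohort2 = reference[idx[, 2]],
    distance = d[idx],
    stringsAsFactors = FALSE
  )
  keep <- pmin(nchar(out$cohort1), nchar(out$cohort2)) >= min_chars
  out[keep, , drop = FALSE]
}

# Toy labels; "DiscovEHR_" is an invented misspelling
toy_single <- c("NZ", "DiscovEHR_")
toy_all <- c("NR", "NZ", "DiscovEHR", "DiscovEHR_", "EPIC")
flagged <- near_matches(toy_single, toy_all)
print(flagged)
```

The length filter trades recall for precision: it would hide a real typo in a short acronym, so the unfiltered table is still worth a manual look.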

10 Saving:

data.table::fwrite(gwas_study_info,
                  here::here("output/gwas_study_info_cohort_corrected.csv"), 
                  sep = ",")
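Cohort labels contain characters such as `|`, `&`, and possibly commas, so it can be worth confirming that the corrected column survives a write and read cycle unchanged. A minimal sketch of such a check on toy data, writing to a temporary file rather than the project's output path; the toy table and the label containing a comma are invented for illustration.

```r
library(data.table)

# Toy stand-in for the corrected table (values are invented)
toy <- data.table(
  PUBMED_ID = c(1L, 2L, 3L),
  COHORT = c("UKBB|RS", "G&H|CKB", "Cohort A, wave 2|other")
)

# Write to a temporary CSV and read it back with the same separator
tmp <- tempfile(fileext = ".csv")
fwrite(toy, tmp, sep = ",")
toy_back <- fread(tmp, sep = ",")

# TRUE when the labels are unchanged after the round trip
round_trip_ok <- identical(toy$COHORT, toy_back$COHORT)
print(round_trip_ok)

unlink(tmp)
```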

11 Others to look into:

# In the study below, the unlisted cohort is a combination of two cohorts
gwas_study_info |>
  filter(PUBMED_ID  == 32605384) |>
  select(PUBMED_ID, COHORT, STUDY_ACCESSION, "DISEASE/TRAIT", "INITIAL_SAMPLE_SIZE", "REPLICATION_SAMPLE_SIZE")


gwas_study_info |>
  filter(PUBMED_ID == 30510241) |>
  select(PUBMED_ID, COHORT, STUDY_ACCESSION, "DISEASE/TRAIT", "INITIAL_SAMPLE_SIZE", "REPLICATION_SAMPLE_SIZE")
# The supplement shows this is made up of a great many studies; I believe it includes all the other subsamples


gwas_study_info |>
  filter(PUBMED_ID == 33307546) |>
  select(PUBMED_ID, COHORT, STUDY_ACCESSION, "DISEASE/TRAIT", "INITIAL_SAMPLE_SIZE", "REPLICATION_SAMPLE_SIZE")
# I believe the COVID-19 Host Genetics Initiative (HGI) samples here are the Hispanic individuals
# The European ancestry individuals are from the ‘broad respiratory phenotype’ study of 23andMe
# See the replication section of https://www.nature.com/articles/s41586-020-03065-y#Sec4


gwas_study_info |>
  filter(PUBMED_ID == 38184787) |> 
  select(PUBMED_ID, COHORT, STUDY_ACCESSION, "DISEASE/TRAIT", "INITIAL_SAMPLE_SIZE", "REPLICATION_SAMPLE_SIZE")

# cohorts listed are for

# Raine Study -- ? Raine

# Penn - UPenn etc.

# ?CALGB  
# "SIGNET-REGARDS"  >? "SIGNET"  



# "RISC" & "RISK" appear to be different
# Relationship Between Insulin Sensitivity and Cardiovascular Disease Risk (RISC)

# Risk Stratification and Identification of Immunogenetic and Microbial Markers of Rapid Disease Progression in Children with Crohn’s Disease (RISK) 



 "CKB" 

 [231] "COHRA"
 [232] "COHRA1"
 [233] "COHRA2"

# UK Blood Service (UKBS)

 [294] "DiscovEHR"
 [295] "DISCOVeRY-BMT"

 [330] "ELSA"
 [331] "ELSA-Brasil"

 [340] "EPIC"
 [341] "EPIC_CAD"
 [342] "EPIC_Obs"
 [343] "EPIC-Norfolk"
 [344] "EPICURE"

 [372] "FinnTwin"
 [373] "FinnTwin12"

 [463] "GOCS"
 [464] "GOCS_Chilean"

 [480] "GRAAD"
 [481] "GRaD"

# Colo2&3

 [513] "HELIC"
 [514] "HELIC-MANOLIS"
 [515] "HELIC-Pomak"

# ? QTR == QTR_Qindao

# ? is "other|UKB" == "UKB|other"

# ? is UK|NR == UKB|NR

# ? CF_TSS == TSS
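The question about "other|UKB" versus "UKB|other" concerns token order rather than spelling. A minimal sketch, assuming order within a COHORT string carries no meaning: `canonical_cohort()` is a hypothetical helper that sorts and de-duplicates the pipe-delimited tokens so that reordered strings compare equal.

```r
# Hypothetical helper: put the tokens of each COHORT string in a fixed order.
canonical_cohort <- function(cohort) {
  tokens <- strsplit(cohort, "|", fixed = TRUE)
  vapply(tokens, function(x) {
    # Preserve missing values instead of turning them into the string "NA"
    if (length(x) == 1 && is.na(x)) return(NA_character_)
    # Radix sort gives a locale-independent order
    paste(sort(unique(x), method = "radix"), collapse = "|")
  }, character(1))
}

same_set <- canonical_cohort("other|UKB") == canonical_cohort("UKB|other")
print(same_set)
```

This only resolves ordering. Whether "UK" and "UKB", or "CF_TSS" and "TSS", refer to the same cohort still has to be checked against the source papers.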

gwas_study_info |>
  filter(grepl("ORPS", COHORT))

gwas_study_info |>
  filter(PUBMED_ID == 39749473) |>
  select(COHORT)

# PAGE vs PAGES

# PUBMED ID: 35754128 - should be PAGES
# see sup table 1. https://pmc.ncbi.nlm.nih.gov/articles/PMC9671132/
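A correction like this one applies to a single paper, so a global rename of "PAGE" would be too broad. A minimal sketch of a rename restricted to one PubMed ID, on toy data in base R; the second PubMed ID and both COHORT strings are invented, and the word-boundary pattern is an assumption about how the label appears.

```r
# Toy stand-in for gwas_study_info
toy <- data.frame(
  PUBMED_ID = c(35754128L, 99999999L),  # second ID is a made-up placeholder
  COHORT = c("PAGE|other", "PAGE|UKBB"),
  stringsAsFactors = FALSE
)

# Rename PAGE to PAGES only in rows from the targeted paper.
# \\b keeps an existing "PAGES" from becoming "PAGESS" on a re-run.
is_target <- toy$PUBMED_ID == 35754128L
toy$COHORT[is_target] <- gsub("\\bPAGE\\b", "PAGES", toy$COHORT[is_target])
print(toy)
```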

# COGEND COGENT

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringdist_0.9.15 stringr_1.5.1     ggplot2_3.5.2     dplyr_1.1.4      
[5] data.table_1.17.8 workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.3.1     renv_1.0.3        
 [5] promises_1.3.3     tidyselect_1.2.1   Rcpp_1.1.0         git2r_0.36.2      
 [9] parallel_4.3.1     callr_3.7.6        later_1.4.2        jquerylib_0.1.4   
[13] scales_1.4.0       yaml_2.3.10        fastmap_1.2.0      here_1.0.1        
[17] R6_2.6.1           generics_0.1.4     knitr_1.50         tibble_3.3.0      
[21] rprojroot_2.1.0    RColorBrewer_1.1-3 bslib_0.9.0        pillar_1.11.0     
[25] rlang_1.1.6        cachem_1.1.0       stringi_1.8.7      httpuv_1.6.16     
[29] xfun_0.52          getPass_0.2-4      fs_1.6.6           sass_0.4.10       
[33] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     ps_1.9.1          
[37] grid_4.3.1         digest_0.6.37      processx_3.8.6     rstudioapi_0.17.1 
[41] lifecycle_1.0.4    vctrs_0.6.5        evaluate_1.0.4     glue_1.8.0        
[45] farver_2.1.2       whisker_0.4.1      rmarkdown_2.29     httr_1.4.7        
[49] tools_4.3.1        pkgconfig_2.0.3    htmltools_0.5.8.1

Harmonizing Cohort Labels

Isobel Beasley