Last updated: 2025-09-17

Checks: 7 0

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version df34d71. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/figure/
    Ignored:    data/.DS_Store
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/who/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_study_info_cohort_corrected.csv
    Ignored:    output/gwas_study_info_trait_corrected.csv
    Ignored:    output/gwas_study_info_trait_ontology_info.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l1.csv
    Ignored:    output/gwas_study_info_trait_ontology_info_l2.csv
    Ignored:    output/trait_ontology/
    Ignored:    renv/

Unstaged changes:
    Modified:   code/get_term_descendants.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/disease_inves_by_ancest.Rmd) and HTML (docs/disease_inves_by_ancest.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd df34d71 IJbeasley 2025-09-17 More fixing up of disease grouping
html 79e26e5 IJbeasley 2025-09-11 Build site.
Rmd 7f2a8f7 IJbeasley 2025-09-11 Add more global burden vs. dalys plots
html fb089b4 IJbeasley 2025-09-11 Build site.
Rmd 708d5b3 IJbeasley 2025-09-11 Add GBD data to disease gwas ancestry investigation
html 437885b IJbeasley 2025-08-25 Build site.
Rmd 31e868c IJbeasley 2025-08-25 Update proportion euro invest for updated disease categories
html 3d94889 IJbeasley 2025-08-23 Build site.
Rmd 48dd80a IJbeasley 2025-08-23 Update proportion ancestry investigated by disease
html 42e854b IJbeasley 2025-08-21 Build site.
Rmd fa9a4da IJbeasley 2025-08-21 Starting test of relationship between proportion european and total sample size
html f5087d2 IJBeasley 2025-07-30 Build site.
Rmd 72172e3 IJBeasley 2025-07-30 Split page into disease by ancest
html 2fd5755 Isobel Beasley 2022-02-16 Build site.
Rmd 7347b5d Isobel Beasley 2022-02-16 Add initial plotting using gwas cat stats

1 Set up

library(dplyr)
library(data.table)
library(ggplot2)
source(here::here("code/custom_plotting.R"))

1.1 Load data

# gwas_study_info = data.table::fread("data/gwas_catalog/gwas-catalog-v1.0.3-studies-r2022-02-02.tsv", 
#                                     sep = "\t", 
#                                     quote = "")
 
# gwas_study_info <- fread(here::here("output/gwas_study_info_trait_corrected.csv"))

gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))

gwas_ancest_info <-  fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-ancestries-r2025-07-21.tsv"),
                         sep = "\t", 
                         quote = "")

1.2 Basic data cleaning

# fixing the column names
gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_ancest_info = gwas_ancest_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# making sure arranged by DATE (oldest at the top)
gwas_ancest_info = gwas_ancest_info |> 
                   dplyr::arrange(DATE)

gwas_study_info = gwas_study_info |> 
                  dplyr::arrange(DATE)

1.3 NA for number of individuals

# 44 studies / 44 rows
gwas_ancest_info |>
dplyr::filter(is.na(NUMBER_OF_INDIVIDUALS)) |>
nrow()
[1] 44
# from only 24 gwas papers
gwas_ancest_info |>
    dplyr::filter(is.na(NUMBER_OF_INDIVIDUALS)) |> 
    select(PUBMED_ID) |> 
    distinct() |>
    nrow()
[1] 24
gwas_ancest_info |>
  dplyr::filter(PUBMED_ID == 28679651) |>
  select(INITIAL_SAMPLE_DESCRIPTION, 
         REPLICATION_SAMPLE_DESCRIPTION, 
         BROAD_ANCESTRAL_CATEGORY) |>
  distinct()
   INITIAL_SAMPLE_DESCRIPTION REPLICATION_SAMPLE_DESCRIPTION
                       <char>                         <char>
1:        404 cases, controls                           <NA>
2:        194 cases, controls                           <NA>
3:        426 cases, controls                           <NA>
4:         85 cases, controls                           <NA>
5:        535 cases, controls                           <NA>
6:        345 cases, controls                           <NA>
7:        835 cases, controls                           <NA>
8:        844 cases, controls                           <NA>
9:        447 cases, controls                           <NA>
   BROAD_ANCESTRAL_CATEGORY
                     <char>
1:                       NR
2:                       NR
3:                       NR
4:                       NR
5:                       NR
6:                       NR
7:                       NR
8:                       NR
9:                       NR
# 28679651 - problem seems to be that number of controls per disease not specifically listed
# see https://pubmed.ncbi.nlm.nih.gov/28679651/


# although paper they cite as where data comes from (https://www.nature.com/articles/leu2016387#Tab1)
# discloses: 1229 AL amyloidosis patients from Germany, UK and Italy, and 7526 healthy local controls

1.3.1 Filter out NA number of individuals

gwas_ancest_info  =
gwas_ancest_info |>
dplyr::filter(!is.na(NUMBER_OF_INDIVIDUALS))

1.4 Set up - add trait information to ancestry information

gwas_ancest_info = 
left_join(
gwas_ancest_info,
gwas_study_info |> select(STUDY_ACCESSION, 
                          COHORT, 
                          MAPPED_TRAIT,
                          DISEASE_STUDY,
                          MAPPED_TRAIT_CATEGORY,
                          BACKGROUND_TRAIT_CATEGORY,
                          l2_all_disease_terms),
by = "STUDY_ACCESSION"
)

gwas_ancest_info = gwas_ancest_info |> filter(DISEASE_STUDY == T)

2 Top traits

2.1 Top traits by number of pubmed ids - including non-disease traits

The traits with the most number of pubmed ids are:

n_studies_trait = gwas_study_info |>
  dplyr::select(MAPPED_TRAIT, MAPPED_TRAIT_URI, PUBMED_ID) |>
  dplyr::mutate(MAPPED_TRAIT = stringr::str_split(MAPPED_TRAIT, ",\\s*")) |>
  tidyr::unnest_longer(MAPPED_TRAIT) |>
  dplyr::distinct() |>
  dplyr::group_by(MAPPED_TRAIT, MAPPED_TRAIT_URI) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))
`summarise()` has grouped output by 'MAPPED_TRAIT'. You can override using the
`.groups` argument.
head(n_studies_trait)
# A tibble: 6 × 3
# Groups:   MAPPED_TRAIT [6]
  MAPPED_TRAIT                                     MAPPED_TRAIT_URI    n_studies
  <chr>                                            <chr>                   <int>
1 high density lipoprotein cholesterol measurement http://www.ebi.ac.…       134
2 body mass index                                  http://www.ebi.ac.…       133
3 triglyceride measurement                         http://www.ebi.ac.…       129
4 low density lipoprotein cholesterol measurement  http://www.ebi.ac.…       119
5 type 2 diabetes mellitus                         http://purl.obolib…       118
6 total cholesterol measurement                    http://www.ebi.ac.…       103

2.2 Top traits by number of pubmed ids - disease traits only

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l2_all_disease_terms, PUBMED_ID) |>
  dplyr::mutate(l2_all_disease_terms = stringr::str_split(l2_all_disease_terms, ",\\s*")) |>
  tidyr::unnest_longer(l2_all_disease_terms) |>
  dplyr::distinct() |>
  dplyr::group_by(l2_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))

head(n_studies_trait)
# A tibble: 6 × 2
  l2_all_disease_terms                          n_studies
  <chr>                                             <int>
1 other mental disorders                              208
2 diabetes mellitus type 2                            192
3 other cardiovascular and circulatory diseases       170
4 alzheimer's disease and other dementias             162
5 schizophrenia                                       149
6 asthma                                              148
dim(n_studies_trait)
[1] 1485    2

3 Make ancestry groups

Here we make the column ‘ancestry_group’ in the gwas_study_info datasets, ‘ancestry_group’ defines the broad ancestry group (like in Martin et al. 2019, European, Greater Middle Eastern etc.) that each group of individuals belongs to.

grouped_ancest = vector()
broad_ancest_cat = unique(gwas_ancest_info$BROAD_ANCESTRAL_CATEGORY)

for(study_ancest in broad_ancest_cat){
  
grouped_ancest[study_ancest] = group_ancestry_fn(study_ancest)

}

grouped_ancest_map = data.frame(ancestry_group = grouped_ancest,
                                BROAD_ANCESTRAL_CATEGORY = broad_ancest_cat
                                )


print("See some example mappings between BROAD_ANCESTRAL_CATEGORY and ancestry_group")
[1] "See some example mappings between BROAD_ANCESTRAL_CATEGORY and ancestry_group"
print(dplyr::slice_sample(grouped_ancest_map, n = 5))
                                                                                                                                                                                             ancestry_group
European                                                                                                                                                                                           European
European, African unspecified                                                                                                                                                                      Multiple
European, Hispanic or Latin American, African unspecified, Asian unspecified                                                                                                                       Multiple
East Asian                                                                                                                                                                                            Asian
European, Asian unspecified, African American or Afro-Caribbean, Greater Middle Eastern (Middle Eastern, North African or Persian), Oceanian, Native American, Other, Other admixed ancestry       Multiple
                                                                                                                                                                                                                                                                                                                                                                 BROAD_ANCESTRAL_CATEGORY
European                                                                                                                                                                                                                                                                                                                                                                         European
European, African unspecified                                                                                                                                                                                                                                                                                                                               European, African unspecified
European, Hispanic or Latin American, African unspecified, Asian unspecified                                                                                                                                                                                                                                 European, Hispanic or Latin American, African unspecified, Asian unspecified
East Asian                                                                                                                                                                                                                                                                                                                                                                     East Asian
European, Asian unspecified, African American or Afro-Caribbean, Greater Middle Eastern (Middle Eastern, North African or Persian), Oceanian, Native American, Other, Other admixed ancestry European, Asian unspecified, African American or Afro-Caribbean, Greater Middle Eastern (Middle Eastern, North African or Persian), Oceanian, Native American, Other, Other admixed ancestry
gwas_ancest_info = dplyr::left_join(
            gwas_ancest_info,
            grouped_ancest_map,
            by = "BROAD_ANCESTRAL_CATEGORY")

gwas_ancest_info = gwas_ancest_info |>
                    dplyr::mutate(ancestry_group = factor(ancestry_group, levels = ancestry_levels))

3.1 Check: How many individuals in each ancestry group?

Expecting highest to be in European

total_gwas_n = 
gwas_ancest_info$NUMBER_OF_INDIVIDUALS |> sum(na.rm = T)

print("Total numbers (in millions) per ancestry group")
[1] "Total numbers (in millions) per ancestry group"
gwas_ancest_info |>
  dplyr::group_by(ancestry_group) |>
  dplyr::summarise(n = sum(NUMBER_OF_INDIVIDUALS, na.rm = TRUE)/10^6) |>
  dplyr::mutate(prop = n* 10^6/total_gwas_n) |>
  dplyr::arrange(desc(n)) 
# A tibble: 9 × 3
  ancestry_group                  n       prop
  <fct>                       <dbl>      <dbl>
1 European                5064.     0.865     
2 African                  316.     0.0539    
3 Asian                    150.     0.0256    
4 Hispanic/Latin American  135.     0.0231    
5 Not reported             118.     0.0201    
6 Multiple                  71.8    0.0123    
7 Other                      0.755  0.000129  
8 Middle Eastern             0.295  0.0000503 
9 Oceanic                    0.0388 0.00000662

3.2 Plot number of individuals per ancestry group over time

gwas_ancest_info |>
  dplyr::group_by(ancestry_group) |>
  dplyr::mutate(ancest_cumsum = cumsum(as.numeric(NUMBER_OF_INDIVIDUALS))) |>
  add_final_totals() |>
  # select(DATE, ancest_cumsum, ancestry_group, NUMBER_OF_INDIVIDUALS) |>
  ggplot(aes(x=DATE, 
             y=ancest_cumsum/(10^6), 
             fill = ancestry_group
             )
         ) + 
  geom_area(position = 'stack') + 
  scale_x_date(date_labels = '%Y', 
               date_breaks = "2 years"
               ) + 
  theme_classic() + 
  labs(x = "Year", 
       y = "Individuals in GWAS catalog (millions)") + 
  scale_fill_manual(values = ancestry_colors, name='Ancestry group')  

Version Author Date
fb089b4 IJbeasley 2025-09-11
437885b IJbeasley 2025-08-25
3d94889 IJbeasley 2025-08-23
42e854b IJbeasley 2025-08-21

4 Plot number of individuals per ancestry group for a single trait

4.1 Select trait

gwas_ancest_info_plot = 
gwas_ancest_info %>% 
  filter(!is.na(NUMBER_OF_INDIVIDUALS)) |>
  filter(MAPPED_TRAIT == 'high density lipoprotein cholesterol measurement')


print("Total numbers (in millions) per ancestry group - for high density lipoprotein cholesterol measurement")
[1] "Total numbers (in millions) per ancestry group - for high density lipoprotein cholesterol measurement"
gwas_ancest_info_plot %>% 
  group_by(ancestry_group) %>% 
  summarise(n = sum(NUMBER_OF_INDIVIDUALS, na.rm = TRUE)/10^6)
# A tibble: 4 × 2
  ancestry_group       n
  <fct>            <dbl>
1 European       0.00310
2 African        0.00319
3 Multiple       0.00255
4 Not reported   0.00104

4.2 Plot

gwas_ancest_info_plot = 
gwas_ancest_info_plot %>% 
  group_by(ancestry_group) %>% 
  mutate(ancest_cumsum = cumsum(as.numeric(NUMBER_OF_INDIVIDUALS))) 

gwas_ancest_info_plot = add_final_totals(gwas_ancest_info_plot)

gwas_ancest_info_plot |>
  ggplot(aes(x=DATE, y=ancest_cumsum/(10^6), fill = ancestry_group)) + 
  geom_area(position = 'stack') + 
  scale_x_date(date_labels = '%Y', date_breaks = "1 years") + 
  theme_classic() + 
  labs(x = "Year", y = "Individuals in GWAS catalog (millions)") + 
  scale_fill_manual(values = ancestry_colors, name='Ancestry group') 

Version Author Date
fb089b4 IJbeasley 2025-09-11
437885b IJbeasley 2025-08-25

5 Calculate Per Trait: Proportion European, Number of studies, Total number of individuals, Highest sample size in a single study

5.1 Proportion European overall

euro_n = gwas_ancest_info |>
         filter(ancestry_group == "European") |>
         pull(NUMBER_OF_INDIVIDUALS) |>
         sum(na.rm = T)

total_n = gwas_ancest_info |>
         pull(NUMBER_OF_INDIVIDUALS) |>
         sum(na.rm = T)

100 * euro_n / total_n
[1] 86.48042

5.2 Proportion European per trait

gwas_ancest_trait_info = gwas_ancest_info |> 
  dplyr::filter(DISEASE_STUDY == T) |>
  dplyr::select(l2_all_disease_terms, 
                PUBMED_ID, ancestry_group, NUMBER_OF_INDIVIDUALS) |>
  dplyr::mutate(l2_all_disease_terms = stringr::str_split(l2_all_disease_terms, ",\\s*")) |>
  tidyr::unnest_longer(l2_all_disease_terms) |>
  dplyr::distinct()

n_studies_trait = n_studies_trait |> 
                  dplyr::filter(n_studies > 2) |>
                  dplyr::filter(l2_all_disease_terms != "")

total_n_euro_vec = vector()
prop_euro_vec = vector()
med_prop_euro_vec = vector()
first_quartile_prop_euro_vec = vector()
total_n_vec = vector()
med_sample_size_vec = vector()
n_studies_vec = vector()
highest_sample_size_vec = vector()

for(trait in n_studies_trait$l2_all_disease_terms){
  
  # Calculate the number of European ancestry individuals (studied for this trait)
  total_euro_n = gwas_ancest_trait_info |>
                 filter(ancestry_group == "European") |>
                 filter(l2_all_disease_terms %in% trait) |>
                 pull(NUMBER_OF_INDIVIDUALS) |>
                 sum(na.rm = T)
  
  total_n_euro_vec[trait] = total_euro_n
  
  # Calculate the total number of individuals (studied for this trait)
  all_study_n = gwas_ancest_trait_info |>
                filter(l2_all_disease_terms %in% trait) |>
                pull(NUMBER_OF_INDIVIDUALS) 
  
  total_n  = all_study_n |>
             sum(na.rm = T)
  
  total_n_vec[trait] = total_n
  
  # Get the highest sample size in a single study (for this trait)
  highest_sample_size = max(all_study_n, na.rm = T)
  
  highest_sample_size_vec[trait] = highest_sample_size
  
  # Get the median sample size in a single study (for this trait)
  med_sample_size_vec[trait] = median(all_study_n, na.rm = T)
  
  # Calculate the proportion of European ancestry individuals (across all studies for this trait)
  prop_euro_vec[trait] = 100 * total_euro_n / total_n
  
  # Calculate the number of unique studies (pubmed ids) for this trait
  n_studies = gwas_ancest_trait_info |>
              filter(l2_all_disease_terms %in% trait) |>
              pull(PUBMED_ID) |>
              unique() |>
              length()
  
  n_studies_vec[trait] = n_studies
  
  # Calculate the proportion of European ancestry individuals (per study for this trait)
  euro_n_per_study = gwas_ancest_trait_info |>
            filter(ancestry_group == "European") |>
            filter(l2_all_disease_terms %in% trait) |>
            group_by(PUBMED_ID) |>
            summarise(n_euro = sum(NUMBER_OF_INDIVIDUALS, na.rm = T))
  
  
  total_n_per_study = gwas_ancest_trait_info |>
            filter(l2_all_disease_terms %in% trait) |>
            group_by(PUBMED_ID) |>
            summarise(n_total = sum(NUMBER_OF_INDIVIDUALS, na.rm = T))
  
  prop_euro_per_study = inner_join(euro_n_per_study,
                                   total_n_per_study,
                                   by = "PUBMED_ID") |>
                         mutate(prop_euro = 100 * n_euro / n_total) 
  
  med_prop_euro_vec[trait] = median(prop_euro_per_study$prop_euro, na.rm = T)
  
  first_quartile_prop_euro_vec[trait] = quantile(prop_euro_per_study$prop_euro, probs = 0.25, na.rm = T)
  
}

prop_euro_df = data.frame(trait = n_studies_trait$l2_all_disease_terms,
                          total_n = total_n_vec,
                          total_n_euro = total_n_euro_vec,
                         prop_euro = prop_euro_vec,
                          median_prop_euro = med_prop_euro_vec,
                          n_studies = n_studies_vec,
                          highest_sample_size = highest_sample_size_vec,
                          median_sample_size = med_sample_size_vec,
                         first_quartile_prop_euro = first_quartile_prop_euro_vec
                          )
prop_euro_df |> ungroup() |> dplyr::slice_min(prop_euro, n = 10)
                                                      trait total_n
leprosy                                             leprosy   97690
hyperuricemia                                 hyperuricemia   65979
rare dyslipidemia                         rare dyslipidemia  218111
thyrotoxic periodic paralysis thyrotoxic periodic paralysis   14935
amphetamine use disorders         amphetamine use disorders   10827
kashin-beck disease                     kashin-beck disease    5653
moyamoya disease                           moyamoya disease    7290
nasopharyngeal neoplasm             nasopharyngeal neoplasm    5493
hypertriglyceridemia                   hypertriglyceridemia   30727
hidradenitis suppurativa           hidradenitis suppurativa  841761
                              total_n_euro prop_euro median_prop_euro n_studies
leprosy                                  0  0.000000               NA         7
hyperuricemia                            0  0.000000               NA         4
rare dyslipidemia                        0  0.000000               NA         4
thyrotoxic periodic paralysis            0  0.000000               NA         4
amphetamine use disorders                0  0.000000               NA         3
kashin-beck disease                      0  0.000000               NA         3
moyamoya disease                         0  0.000000               NA         3
nasopharyngeal neoplasm                  0  0.000000               NA         3
hypertriglyceridemia                  1660  5.402415              100         4
hidradenitis suppurativa             70568  8.383377              100         3
                              highest_sample_size median_sample_size
leprosy                                     17450             5613.5
hyperuricemia                               24535             4774.0
rare dyslipidemia                           52449            18250.0
thyrotoxic periodic paralysis                3835             1451.0
amphetamine use disorders                    6155             2219.0
kashin-beck disease                          1717              863.0
moyamoya disease                             3767              512.0
nasopharyngeal neoplasm                      1860             1082.0
hypertriglyceridemia                        13890             3169.5
hidradenitis suppurativa                   758033             4185.0
                              first_quartile_prop_euro
leprosy                                             NA
hyperuricemia                                       NA
rare dyslipidemia                                   NA
thyrotoxic periodic paralysis                       NA
amphetamine use disorders                           NA
kashin-beck disease                                 NA
moyamoya disease                                    NA
nasopharyngeal neoplasm                             NA
hypertriglyceridemia                         100.00000
hidradenitis suppurativa                      54.08478
prop_euro_df |> ungroup() |> dplyr::slice_max(prop_euro, n = 10)
                                                                                        trait
autoimmune disease                                                         autoimmune disease
polymyalgia rheumatica                                                 polymyalgia rheumatica
temporal arteritis                                                         temporal arteritis
cholangitis                                                                       cholangitis
hip pain                                                                             hip pain
photosensitivity disease                                             photosensitivity disease
chronic cystitis                                                             chronic cystitis
common cold                                                                       common cold
exanthem                                                                             exanthem
gingival bleeding                                                           gingival bleeding
granulomatosis with polyangiitis                             granulomatosis with polyangiitis
infectious mononucleosis                                             infectious mononucleosis
knee pain                                                                           knee pain
language impairment                                                       language impairment
lyme disease                                                                     lyme disease
myelodysplastic syndrome                                             myelodysplastic syndrome
self-injurious behavior                                               self-injurious behavior
thrombotic disease                                                         thrombotic disease
abnormality of head or neck                                       abnormality of head or neck
abnormality of the cervical spine                           abnormality of the cervical spine
abnormality of the skeletal system                         abnormality of the skeletal system
acute kidney failure                                                     acute kidney failure
antepartum hemorrhage                                                   antepartum hemorrhage
anti-neutrophil antibody associated vasculitis anti-neutrophil antibody associated vasculitis
articular cartilage disorder                                     articular cartilage disorder
bartholin gland disease                                               bartholin gland disease
chickenpox                                                                         chickenpox
common variable immunodeficiency                             common variable immunodeficiency
dental pulp disease                                                       dental pulp disease
esophagitis                                                                       esophagitis
fecal incontinence                                                         fecal incontinence
frontal fibrosing alopecia                                         frontal fibrosing alopecia
functional laterality                                                   functional laterality
gallbladder and liver cancer                                     gallbladder and liver cancer
glossitis                                                                           glossitis
heart aneurysm                                                                 heart aneurysm
hypermobility syndrome                                                 hypermobility syndrome
hyperventilation                                                             hyperventilation
iridocyclitis                                                                   iridocyclitis
juvenile dermatomyositis                                             juvenile dermatomyositis
labyrinthitis                                                                   labyrinthitis
lower respiratory tract disease                               lower respiratory tract disease
mastitis                                                                             mastitis
mastoiditis                                                                       mastoiditis
mesothelioma                                                                     mesothelioma
multiple system atrophy                                               multiple system atrophy
neurofibromatosis                                                           neurofibromatosis
nystagmus                                                                           nystagmus
odontogenic cyst                                                             odontogenic cyst
osteochondritis dissecans                                           osteochondritis dissecans
peritonsillar abscess                                                   peritonsillar abscess
pregnancy disorder                                                         pregnancy disorder
radiation-induced disorder                                         radiation-induced disorder
self-injurious ideation                                               self-injurious ideation
shingles                                                                             shingles
shoulder impingement syndrome                                   shoulder impingement syndrome
toothache                                                                           toothache
vomiting                                                                             vomiting
                                               total_n total_n_euro prop_euro
autoimmune disease                             1951082      1951082       100
polymyalgia rheumatica                         3827751      3827751       100
temporal arteritis                             1732337      1732337       100
cholangitis                                    2016250      2016250       100
hip pain                                       2216824      2216824       100
photosensitivity disease                        450574       450574       100
chronic cystitis                               1674324      1674324       100
common cold                                     896545       896545       100
exanthem                                        855033       855033       100
gingival bleeding                              1094882      1094882       100
granulomatosis with polyangiitis               1312860      1312860       100
infectious mononucleosis                       1077967      1077967       100
knee pain                                      1980067      1980067       100
language impairment                              10185        10185       100
lyme disease                                   1070058      1070058       100
myelodysplastic syndrome                        476950       476950       100
self-injurious behavior                         615417       615417       100
thrombotic disease                             1313807      1313807       100
abnormality of head or neck                    1313403      1313403       100
abnormality of the cervical spine              1457960      1457960       100
abnormality of the skeletal system             4223610      4223610       100
acute kidney failure                           1675254      1675254       100
antepartum hemorrhage                          1144445      1144445       100
anti-neutrophil antibody associated vasculitis   28421        28421       100
articular cartilage disorder                   1308890      1308890       100
bartholin gland disease                         742865       742865       100
chickenpox                                     1187938      1187938       100
common variable immunodeficiency                 31849        31849       100
dental pulp disease                            1101239      1101239       100
esophagitis                                    1048652      1048652       100
fecal incontinence                              859430       859430       100
frontal fibrosing alopecia                       12251        12251       100
functional laterality                          1278981      1278981       100
gallbladder and liver cancer                   1301135      1301135       100
glossitis                                      1310001      1310001       100
heart aneurysm                                 1284432      1284432       100
hypermobility syndrome                         1285724      1285724       100
hyperventilation                               1314418      1314418       100
iridocyclitis                                  1013674      1013674       100
juvenile dermatomyositis                         40362        40362       100
labyrinthitis                                  1239907      1239907       100
lower respiratory tract disease                1477048      1477048       100
mastitis                                       1068118      1068118       100
mastoiditis                                    1311145      1311145       100
mesothelioma                                    462038       462038       100
multiple system atrophy                          21730        21730       100
neurofibromatosis                               821765       821765       100
nystagmus                                       854184       854184       100
odontogenic cyst                               1305471      1305471       100
osteochondritis dissecans                       844059       844059       100
peritonsillar abscess                          1347550      1347550       100
pregnancy disorder                             2483491      2483491       100
radiation-induced disorder                      408687       408687       100
self-injurious ideation                         338014       338014       100
shingles                                       1252017      1252017       100
shoulder impingement syndrome                  1231437      1231437       100
toothache                                      1090805      1090805       100
vomiting                                       1602869      1602869       100
                                               median_prop_euro n_studies
autoimmune disease                                          100         8
polymyalgia rheumatica                                      100         7
temporal arteritis                                          100         7
cholangitis                                                 100         5
hip pain                                                    100         5
photosensitivity disease                                    100         5
chronic cystitis                                            100         4
common cold                                                 100         4
exanthem                                                    100         4
gingival bleeding                                           100         4
granulomatosis with polyangiitis                            100         4
infectious mononucleosis                                    100         4
knee pain                                                   100         4
language impairment                                         100         4
lyme disease                                                100         4
myelodysplastic syndrome                                    100         4
self-injurious behavior                                     100         4
thrombotic disease                                          100         4
abnormality of head or neck                                 100         3
abnormality of the cervical spine                           100         3
abnormality of the skeletal system                          100         3
acute kidney failure                                        100         3
antepartum hemorrhage                                       100         3
anti-neutrophil antibody associated vasculitis              100         3
articular cartilage disorder                                100         3
bartholin gland disease                                     100         3
chickenpox                                                  100         3
common variable immunodeficiency                            100         3
dental pulp disease                                         100         3
esophagitis                                                 100         3
fecal incontinence                                          100         3
frontal fibrosing alopecia                                  100         3
functional laterality                                       100         3
gallbladder and liver cancer                                100         3
glossitis                                                   100         3
heart aneurysm                                              100         3
hypermobility syndrome                                      100         3
hyperventilation                                            100         3
iridocyclitis                                               100         3
juvenile dermatomyositis                                    100         3
labyrinthitis                                               100         3
lower respiratory tract disease                             100         3
mastitis                                                    100         3
mastoiditis                                                 100         3
mesothelioma                                                100         3
multiple system atrophy                                     100         3
neurofibromatosis                                           100         3
nystagmus                                                   100         3
odontogenic cyst                                            100         3
osteochondritis dissecans                                   100         3
peritonsillar abscess                                       100         3
pregnancy disorder                                          100         3
radiation-induced disorder                                  100         3
self-injurious ideation                                     100         3
shingles                                                    100         3
shoulder impingement syndrome                               100         3
toothache                                                   100         3
vomiting                                                    100         3
                                               highest_sample_size
autoimmune disease                                          469184
polymyalgia rheumatica                                      456348
temporal arteritis                                          456348
cholangitis                                                 456348
hip pain                                                    455272
photosensitivity disease                                    422570
chronic cystitis                                            456348
common cold                                                 456348
exanthem                                                    448303
gingival bleeding                                           461031
granulomatosis with polyangiitis                            456348
infectious mononucleosis                                    403384
knee pain                                                   455272
language impairment                                           4291
lyme disease                                                617731
myelodysplastic syndrome                                    456348
self-injurious behavior                                     156880
thrombotic disease                                          456348
abnormality of head or neck                                 456348
abnormality of the cervical spine                           402528
abnormality of the skeletal system                          424024
acute kidney failure                                        456348
antepartum hemorrhage                                       401812
anti-neutrophil antibody associated vasculitis                6173
articular cartilage disorder                                456348
bartholin gland disease                                     284356
chickenpox                                                  403381
common variable immunodeficiency                             16753
dental pulp disease                                         456348
esophagitis                                                 456348
fecal incontinence                                          456348
frontal fibrosing alopecia                                    6668
functional laterality                                       455963
gallbladder and liver cancer                                456348
glossitis                                                   456348
heart aneurysm                                              456348
hypermobility syndrome                                      456348
hyperventilation                                            456348
iridocyclitis                                               456348
juvenile dermatomyositis                                     16530
labyrinthitis                                               448383
lower respiratory tract disease                             486484
mastitis                                                    450635
mastoiditis                                                 456348
mesothelioma                                                456348
multiple system atrophy                                       8016
neurofibromatosis                                           450894
nystagmus                                                   450009
odontogenic cyst                                            456348
osteochondritis dissecans                                   450895
peritonsillar abscess                                       456348
pregnancy disorder                                          408961
radiation-induced disorder                                  377968
self-injurious ideation                                     156716
shingles                                                    456348
shoulder impingement syndrome                               624133
toothache                                                   461031
vomiting                                                    450874
                                               median_sample_size
autoimmune disease                                        63003.0
polymyalgia rheumatica                                   382165.5
temporal arteritis                                       376871.0
cholangitis                                              391784.0
hip pain                                                 407746.0
photosensitivity disease                                   4452.5
chronic cystitis                                         418574.0
common cold                                              219591.0
exanthem                                                    745.0
gingival bleeding                                        314887.0
granulomatosis with polyangiitis                         201371.5
infectious mononucleosis                                 329052.0
knee pain                                                372892.0
language impairment                                         557.5
lyme disease                                               1060.0
myelodysplastic syndrome                                   4190.5
self-injurious behavior                                  139310.0
thrombotic disease                                       258962.0
abnormality of head or neck                              448094.0
abnormality of the cervical spine                        363190.0
abnormality of the skeletal system                       394642.0
acute kidney failure                                     415554.0
antepartum hemorrhage                                    265344.5
anti-neutrophil antibody associated vasculitis             2036.0
articular cartilage disorder                             449409.0
bartholin gland disease                                  247540.0
chickenpox                                               330403.0
common variable immunodeficiency                           6936.5
dental pulp disease                                      384486.0
esophagitis                                              383972.0
fecal incontinence                                       387201.0
frontal fibrosing alopecia                                 5161.0
functional laterality                                    406946.0
gallbladder and liver cancer                             451219.0
glossitis                                                450030.0
heart aneurysm                                           450283.0
hypermobility syndrome                                   450249.0
hyperventilation                                         449443.0
iridocyclitis                                            386569.0
juvenile dermatomyositis                                  13064.0
labyrinthitis                                            403594.0
lower respiratory tract disease                          170757.0
mastitis                                                 407701.0
mastoiditis                                              449737.0
mesothelioma                                               1697.0
multiple system atrophy                                    4777.5
neurofibromatosis                                        185406.0
nystagmus                                                401460.0
odontogenic cyst                                         450658.0
osteochondritis dissecans                                391100.0
peritonsillar abscess                                    343565.5
pregnancy disorder                                       402583.0
radiation-induced disorder                                  862.0
self-injurious ideation                                   80398.0
shingles                                                 330403.0
shoulder impingement syndrome                            300026.0
toothache                                                454565.0
vomiting                                                 384302.0
                                               first_quartile_prop_euro
autoimmune disease                                                  100
polymyalgia rheumatica                                              100
temporal arteritis                                                  100
cholangitis                                                         100
hip pain                                                            100
photosensitivity disease                                            100
chronic cystitis                                                    100
common cold                                                         100
exanthem                                                            100
gingival bleeding                                                   100
granulomatosis with polyangiitis                                    100
infectious mononucleosis                                            100
knee pain                                                           100
language impairment                                                 100
lyme disease                                                        100
myelodysplastic syndrome                                            100
self-injurious behavior                                             100
thrombotic disease                                                  100
abnormality of head or neck                                         100
abnormality of the cervical spine                                   100
abnormality of the skeletal system                                  100
acute kidney failure                                                100
antepartum hemorrhage                                               100
anti-neutrophil antibody associated vasculitis                      100
articular cartilage disorder                                        100
bartholin gland disease                                             100
chickenpox                                                          100
common variable immunodeficiency                                    100
dental pulp disease                                                 100
esophagitis                                                         100
fecal incontinence                                                  100
frontal fibrosing alopecia                                          100
functional laterality                                               100
gallbladder and liver cancer                                        100
glossitis                                                           100
heart aneurysm                                                      100
hypermobility syndrome                                              100
hyperventilation                                                    100
iridocyclitis                                                       100
juvenile dermatomyositis                                            100
labyrinthitis                                                       100
lower respiratory tract disease                                     100
mastitis                                                            100
mastoiditis                                                         100
mesothelioma                                                        100
multiple system atrophy                                             100
neurofibromatosis                                                   100
nystagmus                                                           100
odontogenic cyst                                                    100
osteochondritis dissecans                                           100
peritonsillar abscess                                               100
pregnancy disorder                                                  100
radiation-induced disorder                                          100
self-injurious ideation                                             100
shingles                                                            100
shoulder impingement syndrome                                       100
toothache                                                           100
vomiting                                                            100
prop_euro_df |> ungroup() |> dplyr::slice_max(total_n, n = 5)
                                                                                      trait
covid-19                                                                           covid-19
other cardiovascular and circulatory diseases other cardiovascular and circulatory diseases
other mental disorders                                               other mental disorders
other neoplasms                                                             other neoplasms
depressive disorders                                                   depressive disorders
                                                total_n total_n_euro prop_euro
covid-19                                      139881329    121127570  86.59309
other cardiovascular and circulatory diseases 100693085     87325142  86.72407
other mental disorders                         86139258     73749857  85.61701
other neoplasms                                72603760     63762734  87.82291
depressive disorders                           58247799     48099960  82.57816
                                              median_prop_euro n_studies
covid-19                                                   100        59
other cardiovascular and circulatory diseases              100       170
other mental disorders                                     100       208
other neoplasms                                            100        38
depressive disorders                                       100       143
                                              highest_sample_size
covid-19                                                  5519491
other cardiovascular and circulatory diseases             1893921
other mental disorders                                    2365010
other neoplasms                                            719047
depressive disorders                                      1942727
                                              median_sample_size
covid-19                                                   13420
other cardiovascular and circulatory diseases              28655
other mental disorders                                     12557
other neoplasms                                           168302
depressive disorders                                       56637
                                              first_quartile_prop_euro
covid-19                                                      78.76279
other cardiovascular and circulatory diseases                 79.29569
other mental disorders                                        87.22791
other neoplasms                                              100.00000
depressive disorders                                         100.00000

6 Plot variation across disease traits:

6.1 Distribution plots

6.1.1 Number of studies and sample size

6.1.1.1 Averaege number of individuals per study (for each disease)

prop_euro_df = prop_euro_df |>
  dplyr::mutate(avg_n_per_study = total_n / n_studies)

print("Average number of individuals per study (for this trait) - in millions")
[1] "Average number of individuals per study (for this trait) - in millions"
c(prop_euro_df$avg_n_per_study / 10^6) |> summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000439 0.234135 0.438033 0.475204 0.571520 4.932637 
prop_euro_df |>
  ggplot(aes(x = avg_n_per_study)) + 
  geom_histogram() + 
  theme_bw() + 
  labs(x = "Average number of individuals per study (for this trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11
437885b IJbeasley 2025-08-25

6.1.2 Total number of individuals (per disease trait)

print("Total number of individuals (studied for each trait) - in millions")
[1] "Total number of individuals (studied for each trait) - in millions"
c(prop_euro_df$total_n / 10^6) |> summary()
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  0.00137   1.30114   1.98007   4.74875   4.16118 139.88133 
prop_euro_df |>
  ggplot(aes(x = total_n)) + 
  geom_histogram() + 
  theme_bw() + 
  scale_x_continuous(labels = scales::label_log()) + 
  labs(x = "Total number of GWAS participants for each trait")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11
437885b IJbeasley 2025-08-25
prop_euro_df |>
  mutate(total_n = log10(total_n)) |>
  ggplot(aes(x = total_n)) + 
  geom_histogram() + 
  theme_bw() + 
  labs(x = "log10(total number of GWAS participants for each trait)") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11

6.1.2.1 Number of European individuals (per disease trait)

print("Number of European ancestry individuals (studied for each trait) - in millions")
[1] "Number of European ancestry individuals (studied for each trait) - in millions"
c(prop_euro_df$total_n_euro / 10^6) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.196   1.725   4.092   3.684 121.128 
prop_euro_df |>
  ggplot(aes(x = total_n_euro)) + 
  geom_histogram() +
  theme_bw() +
  scale_x_continuous(labels = scales::label_log()) +
  labs(x = "Number of European ancestry individuals (studied for each trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11
437885b IJbeasley 2025-08-25

6.1.2.2 Highest sample size in a single study (per disease trait)

print("Highest sample size in a single study (for this trait) - in millions")
[1] "Highest sample size in a single study (for this trait) - in millions"
c(prop_euro_df$highest_sample_size / 10^6) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00028 0.44396 0.45635 0.56714 0.48460 5.51949 
prop_euro_df |>
  ggplot(aes(x = highest_sample_size)) + 
  geom_histogram() + 
  theme_bw() + 
  scale_x_continuous(labels = scales::label_log()) + 
  labs(x = "Highest sample size in a single study (for this trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11

6.1.3 Proportion european

6.1.3.1 Distribution of proportion european

print("Proportion European ancestry individuals (studied for each trait)")
[1] "Proportion European ancestry individuals (studied for each trait)"
prop_euro_df$prop_euro |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   82.72   88.15   85.49   92.88  100.00 
prop_euro_df |>
  ggplot(aes(x = prop_euro)) + 
  geom_histogram() + 
  theme_bw() + 
  labs(x = "Proportion of European ancestry individuals (studied for each trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11

6.1.3.2 Distribution of median proportion european (per study, per disease trait)

print("Median proportion European ancestry individuals (per study, for each trait)")
[1] "Median proportion European ancestry individuals (per study, for each trait)"
prop_euro_df$median_prop_euro |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  31.98  100.00  100.00   98.00  100.00  100.00       8 
prop_euro_df |>
  ggplot(aes(x = median_prop_euro)) + 
  geom_histogram() + 
  theme_bw() + 
  labs(x = "Median proportion of European ancestry individuals (per study, for each trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_bin()`).

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11

6.1.3.3 Distribution of first quartile proportion european (per study, per disease trait)

print("First quartile proportion European ancestry individuals (per study, for each trait)")
[1] "First quartile proportion European ancestry individuals (per study, for each trait)"
prop_euro_df$first_quartile_prop_euro |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  8.448  85.637  92.834  90.469 100.000 100.000       8 
prop_euro_df |>
  ggplot(aes(x = first_quartile_prop_euro)) + 
  geom_histogram() + 
  theme_bw() + 
  labs(x = "First quartile proportion of European ancestry individuals (per study, for each trait)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_bin()`).

Version Author Date
79e26e5 IJbeasley 2025-09-11
fb089b4 IJbeasley 2025-09-11

6.2 Scatter plots against number and proportion European

third_quartile_prop = quantile(prop_euro_df$prop_euro, probs = 0.75)

first_quartile_prop = quantile(prop_euro_df$prop_euro, probs = 0.25)

fifth_percentile_prop = quantile(prop_euro_df$prop_euro, probs = 0.05)

6.2.1 Proportion european - vs. total number of individuals

print("Proportion European vs. total number of individuals - spearman correlation")
[1] "Proportion European vs. total number of individuals - spearman correlation"
cor(prop_euro_df$prop_euro, prop_euro_df$total_n, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] -0.05665648
print("Proportion European vs. total number of individuals - spearman correlation - only traits with > 5 studies")
[1] "Proportion European vs. total number of individuals - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(prop_euro, total_n,   
                      method = "spearman",
    use = "pairwise.complete.obs"))
         cor
1 0.04654382
plot = 
prop_euro_df |>
  ggplot(aes(x = total_n, y = prop_euro, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  scale_x_log10(labels = scales::label_log(),
                limits = c(min(prop_euro_df$total_n), max(prop_euro_df$total_n) * 1.1)) +
  geom_hline(yintercept = third_quartile_prop, linetype="dashed", color = "red") +
  geom_hline(yintercept = first_quartile_prop, linetype="dashed", color = "blue") +
  geom_hline(yintercept = fifth_percentile_prop, linetype="dashed", color = "purple") +
  annotate("text", x = max(prop_euro_df$total_n) - 10^3, y = third_quartile_prop + 1,
           label = "3rd quart.", vjust = -0.5, hjust = 1, color = "red") +
  annotate("text", x = max(prop_euro_df$total_n) - 10^3, y = first_quartile_prop + 1,
           label = "1st quart.", vjust = -0.5, hjust = 1, color = "blue") +
  annotate("text", x = max(prop_euro_df$total_n) -10^3, y = fifth_percentile_prop + 1,
           label = "5th percent.", vjust = -0.5, hjust = 1, color = "purple") +
  labs(x = "Total number of individuals (studied for this trait)",
       y = "% European ancestry idividuals (studied for this trait)")


plotly::ggplotly(plot)
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

6.2.2 Number of European individuals - vs. total number of individuals

print("Number of European ancestry individuals vs. total number of individuals - spearman correlation")
[1] "Number of European ancestry individuals vs. total number of individuals - spearman correlation"
cor(prop_euro_df$total_n_euro, prop_euro_df$total_n, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.9903774
print("Number of European ancestry individuals vs. total number of individuals - spearman correlation - only traits with > 5 studies")
[1] "Number of European ancestry individuals vs. total number of individuals - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(total_n_euro, total_n,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
        cor
1 0.9932263
plot = 
prop_euro_df |>
  ggplot(aes(x = total_n, y = total_n_euro, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  scale_x_continuous(labels = scales::label_log()) +
  scale_y_continuous(labels = scales::label_log()) +
  labs(x = "Total number of individuals (studied for this trait)",
       y = "Number of European ancestry idividuals (studied for this trait)") + 
  geom_abline(slope = 1, intercept = 0, linetype="dashed", color = "red")

plotly::ggplotly(plot)
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

6.2.3 Proportion european - vs. number of studies

print("Proportion European vs. number of studies - spearman correlation")
[1] "Proportion European vs. number of studies - spearman correlation"
cor(prop_euro_df$prop_euro, prop_euro_df$n_studies, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] -0.1915013
print("Proportion European vs. number of studies - spearman correlation - only traits with > 5 studies")
[1] "Proportion European vs. number of studies - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(prop_euro, n_studies,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
         cor
1 -0.1771577
plot = 
prop_euro_df |>
  ggplot(aes(x = n_studies, y = prop_euro)) + 
  geom_point() + 
  theme_bw() + 
  scale_x_log10(labels = scales::label_log()) +
  labs(x = "Total number of unique PUBMED IDs for this trait",
       y = "Proportion of European ancestry idividuals (studied for this trait)")


plotly::ggplotly(plot)
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

6.2.4 Number of European individuals vs number of studies

print("Number of European ancestry individuals vs. number of studies - spearman correlation")
[1] "Number of European ancestry individuals vs. number of studies - spearman correlation"
cor(prop_euro_df$total_n_euro, prop_euro_df$n_studies, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.6067528
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(total_n_euro, n_studies,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
       cor
1 0.521536
plot = 
prop_euro_df |>
  ggplot(aes(x = total_n_euro, y = n_studies, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
#  scale_x_continuous(labels = scales::label_log()) +
  labs(x = "Number of European ancestry idividuals (studied for this trait)",
       y = "Total number of unique PUBMED IDs for this trait")


plotly::ggplotly(plot)

6.2.5 First quartile proportion european vs number of studies

cor(prop_euro_df$first_quartile_prop_euro, prop_euro_df$n_studies, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.1178953
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(first_quartile_prop_euro, n_studies,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
         cor
1 -0.1455219
plot = 
prop_euro_df |>
  ggplot(aes(x = n_studies, y = first_quartile_prop_euro, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  scale_x_log10(labels = scales::label_log()) +
  labs(x = "Total number of unique PUBMED IDs for this trait",
       y = "First quartile proportion of European ancestry idividuals (studied for this trait)")


plotly::ggplotly(plot)
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

6.2.6 First quartile proportion european vs total number of individuals

print("First quartile proportion European vs. total number of individuals - spearman correlation")
[1] "First quartile proportion European vs. total number of individuals - spearman correlation"
cor(prop_euro_df$first_quartile_prop_euro, prop_euro_df$total_n, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.04652889
print("First quartile proportion European vs. total number of individuals - spearman correlation - only traits with > 5 studies")
[1] "First quartile proportion European vs. total number of individuals - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(first_quartile_prop_euro, total_n,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
         cor
1 0.04519419
plot =
prop_euro_df |>
  ggplot(aes(x = total_n, y = first_quartile_prop_euro, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  scale_x_log10() + 
  labs(x = "Total number of individuals (studied for this trait)",
       y = "First quartile proportion of European ancestry idividuals (studied for this trait)")


plotly::ggplotly(plot)

6.2.7 Number of individuals vs number of studies

print("Total number of individuals vs. number of studies - spearman correlation")
[1] "Total number of individuals vs. number of studies - spearman correlation"
cor(prop_euro_df$total_n, prop_euro_df$n_studies, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.6236847
print("Total number of individuals vs. number of studies - spearman correlation - only traits with > 5 studies")
[1] "Total number of individuals vs. number of studies - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(total_n, n_studies,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
        cor
1 0.5410621
plot = 
prop_euro_df |>
  ggplot(aes(x = total_n, y = n_studies, diseae = trait)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Total number of individuals (studied for this trait)",
       y = "Total number of unique PUBMED IDs for this trait")


plotly::ggplotly(plot)

6.2.8 Proportion european - vs. average number of individuals per study

print("Proportion European vs. average number of individuals per study - spearman correlation")
[1] "Proportion European vs. average number of individuals per study - spearman correlation"
cor(prop_euro_df$prop_euro, prop_euro_df$avg_n_per_study, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] 0.09739902
print("Proportion European vs. average number of individuals per study - spearman correlation - only traits with > 5 studies")
[1] "Proportion European vs. average number of individuals per study - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(prop_euro, avg_n_per_study,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
        cor
1 0.2145069
plot = 
prop_euro_df |>
  ggplot(aes(x = avg_n_per_study, y = prop_euro, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Average number of individuals per study (for this trait)",
       y = "Proportion of European ancestry idividuals (studied for this trait)")

plotly::ggplotly(plot)

6.2.9 Number of studies vs average number of individuals per study

print("Total number of studies vs. average number of individuals per study - spearman correlation")
[1] "Total number of studies vs. average number of individuals per study - spearman correlation"
cor(prop_euro_df$n_studies, prop_euro_df$avg_n_per_study, 
    method = "spearman",
    use = "pairwise.complete.obs")
[1] -0.08061019
print("Total number of studies vs. average number of individuals per study - spearman correlation - only traits with > 5 studies")
[1] "Total number of studies vs. average number of individuals per study - spearman correlation - only traits with > 5 studies"
prop_euro_df |>
  filter(n_studies > 5) |>
  summarise(cor = cor(n_studies, avg_n_per_study,   
                      method = "spearman",
    use = "pairwise.complete.obs")
    )
        cor
1 -0.271537
plot = 
prop_euro_df |>
  ggplot(aes(x = n_studies, y = avg_n_per_study, disease = trait)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Total number of unique PUBMED IDs for this trait",
       y = "Average number of individuals per study (for this trait)")

plotly::ggplotly(plot)

7 Disease statistics GBD

gbd_data <- data.table::fread(here::here("data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv"))

7.1 Total number of individuals vs global DALYs

compare_stats =
left_join(prop_euro_df |> rename(cause = trait),
          gbd_data |> mutate(cause = tolower(cause))
)
Joining with `by = join_by(cause)`
cor(compare_stats$total_n, 
    compare_stats$val, 
    method = "spearman",
    use = "pairwise.complete.obs"
)
[1] 0.4724287
plot = 
compare_stats |>
  ggplot(aes(y = total_n, x = val, trait = cause)) + 
  geom_point() + 
  theme_bw() + 
  scale_y_log10(labels = scales::label_log())  +
  scale_x_log10(labels = scales::label_log()) +
  labs(y = "Total number of individuals (studied for this trait)",
       x = "Global DALYs (2019, GBD)")

plotly::ggplotly(plot)
Warning in scale_x_log10(labels = scales::label_log()): log-10 transformation
introduced infinite values.
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

7.2 Largest sample size vs global DALYs

cor(compare_stats$highest_sample_size, 
    compare_stats$val, 
    method = "spearman",
    use = "pairwise.complete.obs"
)
[1] 0.313261
plot = compare_stats |>
  ggplot(aes(y = highest_sample_size, 
             x = val, 
             trait = cause)) + 
  geom_point() + 
  scale_y_log10(labels = scales::label_log())  +
  scale_x_log10(labels = scales::label_log()) +
  theme_bw() + 
  labs(y = "Largest sample size in a single study (for this trait)",
       x = "Global DALYs (2019, GBD)")

plotly::ggplotly(plot)
Warning in scale_x_log10(labels = scales::label_log()): log-10 transformation
introduced infinite values.
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

7.3 Number of studies vs global DALYs

cor(compare_stats$n_studies, 
    compare_stats$val, 
    method = "spearman",
    use = "pairwise.complete.obs"
)
[1] 0.2779357
plot = compare_stats |>
  ggplot(aes(y = n_studies, 
             x = val, 
             trait = cause)) + 
  geom_point() + 
  scale_y_log10(labels = scales::label_log())  +
  scale_x_log10(labels = scales::label_log()) +
  theme_bw() + 
  labs(y = "Total number of studies (for this trait)",
       x = "Global DALYs (2019, GBD)")

plotly::ggplotly(plot)
Warning in scale_x_log10(labels = scales::label_log()): log-10 transformation
introduced infinite values.
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'
Warning in is.na(ticktext): is.na() applied to non-(list or vector) of type
'expression'

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] ggplot2_3.5.2     data.table_1.17.8 dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] plotly_4.11.0      sass_0.4.10        utf8_1.2.6         generics_0.1.4    
 [5] tidyr_1.3.1        renv_1.0.3         stringi_1.8.7      digest_0.6.37     
 [9] magrittr_2.0.3     evaluate_1.0.4     grid_4.3.1         RColorBrewer_1.1-3
[13] fastmap_1.2.0      rprojroot_2.1.0    jsonlite_2.0.0     processx_3.8.6    
[17] whisker_0.4.1      ps_1.9.1           promises_1.3.3     httr_1.4.7        
[21] purrr_1.1.0        crosstalk_1.2.1    viridisLite_0.4.2  scales_1.4.0      
[25] lazyeval_0.2.2     jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6       
[29] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.3.1       
[33] httpuv_1.6.16      here_1.0.1         vctrs_0.6.5        R6_2.6.1          
[37] lifecycle_1.0.4    git2r_0.36.2       stringr_1.5.1      htmlwidgets_1.6.4 
[41] fs_1.6.6           pkgconfig_2.0.3    callr_3.7.6        pillar_1.11.0     
[45] bslib_0.9.0        later_1.4.2        gtable_0.3.6       glue_1.8.0        
[49] Rcpp_1.1.0         xfun_0.52          tibble_3.3.0       tidyselect_1.2.1  
[53] rstudioapi_0.17.1  knitr_1.50         farver_2.1.2       htmltools_0.5.8.1 
[57] rmarkdown_2.29     labeling_0.4.3     compiler_4.3.1     getPass_0.2-4