GWAS for fitness in Drosophila

Last updated: 2023-03-15

Checks: 7 0

Knit directory: fitnessGWAS/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0.4). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20180914)

The command set.seed(20180914) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 2080e0c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 2080e0c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rapp.history
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .httr-oauth
    Ignored:    .pversion
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/correlations_SNP_effects_cache/
    Ignored:    code/.DS_Store
    Ignored:    code/Drosophila_GWAS.Rmd
    Ignored:    data/.DS_Store
    Ignored:    data/derived/
    Ignored:    data/input/.DS_Store
    Ignored:    data/input/.pversion
    Ignored:    data/input/dgrp.fb557.annot.txt
    Ignored:    data/input/dgrp2.bed
    Ignored:    data/input/dgrp2.bim
    Ignored:    data/input/dgrp2.fam
    Ignored:    data/input/huang_transcriptome/
    Ignored:    figures/.DS_Store

Untracked files:
    Untracked:  big_model.rds
    Untracked:  code/quant_gen_1.R
    Untracked:  data/input/genomic_relatedness_matrix.rds

Unstaged changes:
    Modified:   .gitignore
    Modified:   code/pdf_supp_material.Rmd
    Modified:   code/pdf_supp_material.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/perform_gwas.Rmd) and HTML (docs/perform_gwas.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	2080e0c	lukeholman	2023-03-15	wflow_publish("analysis/perform_gwas.Rmd")
Rmd	d94e22d	lukeholman	2023-03-14	New quant gen and other changes
html	0a415e3	lukeholman	2021-11-12	Build site.
html	7449a90	lukeholman	2021-10-01	Build site.
html	9f76189	lukeholman	2021-09-27	Build site.
Rmd	1f52aab	lukeholman	2021-09-27	Commit Sept 2021
html	8d14298	lukeholman	2021-09-26	Build site.
Rmd	af15dd6	lukeholman	2021-09-26	Commit Sept 2021
html	410fb73	lukeholman	2021-03-04	Build site.
html	871ae81	lukeholman	2021-03-04	Build site.
html	e112260	lukeholman	2021-03-04	Build site.
html	836a780	lukeholman	2021-03-04	Build site.
html	359ff37	lukeholman	2021-03-04	Build site.
html	5506c4b	lukeholman	2021-03-04	Build site.
html	f64a3e9	lukeholman	2021-03-04	Build site.
Rmd	fa1b9b0	lukeholman	2021-03-04	big first commit 2021
Rmd	8d54ea5	Luke Holman	2018-12-23	Initial commit
html	8d54ea5	Luke Holman	2018-12-23	Initial commit

library(dplyr)
library(ggplot2)
# library(ggdendro)
library(glue)
library(bigsnpr) # to install:   devtools::install_github("privefl/bigsnpr")
library(readr)
library(pander)
library(purrr)
library(future.apply) # for parallel code

# Load the predicted line means, as calculated by get_predicted_line_means
predicted_line_means <- read_csv("data/derived/predicted_line_means.csv")
plink <- paste(getwd(), "code/plink", sep = "/")

# # Load the wolbachia infection status data from the DGRP website
# wolbachia <- read_csv("data/input/wolbachia.csv") %>% 
#   mutate(line = paste("line", line, sep = ""))
# 
# # Load the chomosomal inversion data from the DGRP website
# # these are the 5 inversions that Huang et al. PNAS corrected for
# inversions <- read_csv("data/input/inversion genotypes.csv") %>%
#     mutate(line = paste("line", line, sep = "")) %>%
#     select(line, `In(2L)t`, `In(2R)NS`, `In(3R)P`, `In(3R)K`, `In(3R)Mo`) 

# helper function to pass commands to the terminal
# Note that we set `intern = TRUE`, and pass the result of `system()` to `cat()`, 
# ensuring that the Terminal output will be printed in this knitr report. 
run_command <- function(shell_command, wd = getwd(), path = ""){
  cat(system(glue("cd ", wd, path, "\n",shell_command), intern = TRUE), sep = '\n')
}

Perform SNP quality control and imputation

We cleaned up the DGRP’s .bed/.bim/.fam files (available from the Mackay lab website) as follows:

Remove any SNPs for which genotypes are missing for >10% of the DGRP lines. We then use the software Beagle to impute the remaining missing genotypes.
Remove SNPs with a minor allele frequency of less than 5%

Note that in the PLINK-formatted genotype files, lines fixed for the major allele are coded as 2, and lines fixed for the minor allele as 0. This means that in the association tests we calculate, negative effect sizes mean that the minor allele is associated with lower fitness, while positive effect sizes means that the minor allele is associated with higher fitness.

beagle <- bigsnpr::download_beagle()

perform_SNP_QC_and_imputation <- function(phenotypes){
  
  if("block" %in% names(phenotypes)) phenotypes <- phenotypes %>% select(-block)
  
  # Make a list of the lines in our sample and save as a text file for passing to PLINK
  lines_to_keep <- gsub("_", "", phenotypes$line) %>% cbind(.,.)
  write.table(lines_to_keep, row.names = FALSE, col.names = FALSE, file = "data/derived/lines_to_keep.txt", quote = FALSE)
  
  # Define a function to add our phenotype data to a .fam file, which is needed for GWAS analysis and to make sure PLINK includes these samples
  # The 'phenotypes' data frame needs to have a column called 'line'
  add_phenotypes_to_fam <- function(filepath, phenotypes){
    read_delim(filepath, col_names = FALSE, delim = " ") %>% 
      select(X1, X2, X3, X4, X5) %>% # Get all the non-phenotype columns
      left_join(phenotypes %>% 
                  mutate(line = gsub("_", "", line)), 
                by = c("X1" = "line")) %>%
      write.table(file = filepath, 
                  col.names = FALSE, row.names = FALSE, 
                  quote = FALSE, sep = " ")
  }
  
  # Use Plink to clean and subset the DGRP's SNP data as follows:
  # Only keep SNPs for which at least 90% of DGRP lines were successfully genotyped (--geno 0.1)
  # Only keep SNPs with a minor allele frequency of 0.05 or higher (--maf 0.05)
  # Finally, write the processed BIM/BED/FAM files to the data/derived directory
  run_command(glue("{plink} --bfile dgrp2",
                   " --geno 0.1 --maf 0.05 --allow-no-sex", 
                   " --make-bed --out ../derived/dgrp2_QC_all_lines"), path = "/data/input/")
  
  # Use the shell command 'sed' to remove underscores from the DGRP line names in the .fam file (e.g. 'line_120' becomes 'line120')
  # Otherwise, these underscores cause trouble when we need to convert from PLINK to vcf format (vcf format uses underscore as a separator)
  for(i in 1:2) run_command("sed -i '' 's/_//' dgrp2_QC_all_lines.fam", path = "/data/derived/")
  
  # Now impute the missing genotypes using Beagle
  # This part uses the data for the full DGRP panel of >200 lines, to infer missing genotypes as accurately as possible. 
  # This step uses a lot of memory (I set to 28MB max, and it used 26.5GB), but maybe it can also run on a less powerful computer?
  # The bigsnpr package provides a helpful wrapper for Beagle called snp_beagleImpute(): it translates to a VCF file and back again using PLINK
  snp_beagleImpute(beagle, plink, 
                   bedfile.in = "data/derived/dgrp2_QC_all_lines.bed", 
                   bedfile.out = "data/derived/dgrp2_QC_all_lines_imputed.bed",
                   ncores = 7, 
                   memory.max = 20)
  
  # assign a sex of 'female' to all the DGRP lines (Beagle removes the sex, and it seems PLINK needs individuals to have a sex, 
  # despite PLINK having a command called --allow-no-sex)
  run_command("sed -i '' 's/    0   0   0/  0   0   2/' dgrp2_QC_all_lines_imputed.fam", path = "/data/derived/")
  
  # Re-write the .bed file, to make sure the MAF threshold and minor/major allele designations are correctly assigned post-Beagle
  run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed",
                   " --geno 0.1 --maf 0.05 --allow-no-sex", 
                   " --make-bed --out dgrp2_QC_all_lines_imputed_correct"), path = "/data/derived/")
  #unlink(list.files("data/derived", pattern = "~", full.names = TRUE)) # delete the original files, which were given a ~ file name by PLINK
  
  # Use PLINK to get the allele IDs and calculate the MAFs across the whole DGRP, for all SNPs that survived QC
  # The file created is called data/derived/plink.frq
  run_command("{plink} --bfile dgrp2_QC_all_lines_imputed_correct --freq", path = "/data/derived")

  # Now cull the PLINK files to just the lines that we measured, and re-apply the 
  # MAF cut-off of 0.05 for the new smaller sample of DGRP lines
  run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed_correct",
                   " --keep-allele-order", # force PLINK to retain the major/minor allele designations that apply to the DGRP as a whole
                   " --keep lines_to_keep.txt --geno 0.1 --maf 0.05", 
                   " --make-bed --out dgrp2_QC_focal_lines"), path = "/data/derived/")
  
  # edit the new FAM file to add the phenotype data from 'phenotypes'
  add_phenotypes_to_fam("data/derived/dgrp2_QC_focal_lines.fam", phenotypes)
  
  # Clean up:
  unlink(c("data/derived/lines_to_keep.txt", 
           "data/derived/plink.log",
           "data/derived/dgrp2_QC_all_lines_imputed.bed",
           "data/derived/dgrp2_QC_all_lines_imputed.bim",
           "data/derived/dgrp2_QC_all_lines_imputed.fam",
           "data/derived/dgrp2_QC_all_lines_imputed.log",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.bed",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.bim",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.fam",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.log"))
}

perform_SNP_QC_and_imputation(phenotypes = predicted_line_means)

Run GWAS using GEMMA

Decompose the genomic relatedness matrix

This step is needed as a set up for the mixed model association tests, which use the eigenvectors and values of the GRM to adjust for ‘population structure’ among our samples.

# Any command with {bed} uses this particular PLINK file, double-checking that all the SNPs have MAF > 0.05
bed <- "dgrp2_QC_focal_lines -maf 0.05" 
#cov <- "-c covariate_file_for_GEMMA.txt" # not used at present

# Calculate the *centered* GRM for the focal lines (option -gk 1)
run_command("gemma -bfile {bed} -r2 0.99 -gk 1 -o GRM", path = "/data/derived") 

# Perform decomposition of the GRM, and save its eigenvalues and eigenvectors
run_command("gemma -bfile {bed} -r2 0.99 -k ./output/GRM.cXX.txt -eigen -o eigen_decomp", path = "/data/derived") 

# Remove eigen vectors + values for which the eigenvalue is close to zero, and save the modified file
values <- read_tsv("data/derived/output/eigen_decomp.eigenD.txt", col_names = "eigenvalue")$eigenvalue
vectors <- read_tsv("data/derived/output/eigen_decomp.eigenU.txt", col_names = FALSE)
missing_row <- which(is.na(read_delim("data/derived/dgrp2_QC_focal_lines.fam", delim = " ", col_names = FALSE)$X8))

write_tsv(vectors, path = "data/derived/output/eigen_decomp.eigenU_female.txt", col_names = FALSE)
write_tsv(data.frame(values), path = "data/derived/output/eigen_decomp.eigenD_female.txt", col_names = FALSE)

write_tsv(vectors[-missing_row, -missing_row], path = "data/derived/output/eigen_decomp.eigenU_male.txt", col_names = FALSE)
write_tsv(data.frame(values[-missing_row]), path = "data/derived/output/eigen_decomp.eigenD_male.txt", col_names = FALSE)
unlink("data/derived/output/eigen_decomp.eigenD.txt")
unlink("data/derived/output/eigen_decomp.eigenU.txt")
# vectors <- vectors[,-(which(values < 0.2))]
# values <- values[-(which(values < 0.2))]
# write_tsv(vectors, path = "data/derived/output/eigen_decomp.eigenU.txt", col_names = FALSE)
# write_tsv(data.frame(value = values), path = "data/derived/output/eigen_decomp.eigenD.txt", col_names = FALSE)

Run univariate and multivariate association tests

The following code chunk runs 5 linear mixed models, implemented in GEMMA. Each of the univariate linear mixed models uses the decomposed genomic relatedness matrix from above, and has one of our four fitness traits as the response variable. The multivariate model contains all four traits, and also uses the decomposed genomic relatedness matrix.

# The option "-lmm 1" runs a linear mixed model using Wald test to get the p-values
# The option "-r2 0.95" only keeps SNPs that are <95% correlated with Wolbachia infection status
# The option "-c covariate_file_for_GEMMA.txt" adds in the Wolbahia infection status as a covariate, and also tells the model to fit an intercept
opts <- "-lmm 1"
GRM_female <- "-d ./output/eigen_decomp.eigenD_female.txt -u ./output/eigen_decomp.eigenU_female.txt"
GRM_male <- "-d ./output/eigen_decomp.eigenD_male.txt -u ./output/eigen_decomp.eigenU_male.txt"
#cov <- "-c covariate_file_for_GEMMA.txt -r2 0.99" # not used at present

c("gemma -bfile {bed} {GRM_female} {opts} -n 1 -o female_early_lmm",
  "gemma -bfile {bed} {GRM_female} {opts} -n 2 -o female_late_lmm",
  "gemma -bfile {bed} {GRM_male} {opts} -n 3 -o male_early_lmm",
  "gemma -bfile {bed} {GRM_male} {opts} -n 4 -o male_late_lmm") %>%
  #"gemma -bfile {bed} {GRM_male} {opts} -n 1 2 3 4 -o all_four_traits") %>% 
  lapply(run_command, path = "/data/derived")

Add the allele IDs and minor allele frequencies (MAFs) to variant annotation database

This part uses the full DGRP panel, because our aim is to determine how common these alleles are in the original wild population from which the DGRP was captured. These MAFs are the ones used in our statistical analyses and plots.

# Extract and save the MAFs, SNP positions, and major/minor alleles
MAFs <- read.table("data/derived/plink.frq", header = TRUE, stringsAsFactors = FALSE) %>% 
  mutate(position = map_chr(strsplit(SNP, split = "_"), function(x) x[2])) %>%
  select(SNP, position, MAF, A1, A2) %>% 
  rename(minor_allele = A1,
         major_allele = A2) 

db <- DBI::dbConnect(RSQLite::SQLite(), "data/derived/annotations.sqlite3", create = FALSE)

MAFs <- tbl(db, "variants") %>% 
  select(SNP, FBID, site.class, distance.to.gene, chr) %>% collect() %>%
  full_join(MAFs, by = "SNP") %>%
  filter(!is.na(MAF)) %>%
  mutate(site.class = replace(site.class, is.na(site.class), "INTERGENIC"))

# Delete the original variant annotation table from the db, and add the new one back in
db %>% db_drop_table(table = "variants") 
db %>% copy_to(MAFs, 
               "variants", temporary = FALSE, 
               indexes = list("SNP", "FBID", "chr", "site.class")) 
rm(MAFs)

Clean up the GEMMA results

Discard columns we don’t need, and rename the SNP column from the name assigned by GEMMA (rs) to something more sensible (SNP).

# The 4 univariate files:
c("data/derived/output/female_early_lmm.assoc.txt",
  "data/derived/output/female_late_lmm.assoc.txt",
  "data/derived/output/male_early_lmm.assoc.txt",
  "data/derived/output/male_late_lmm.assoc.txt") %>%
  lapply(function(filename){
    df <- read_tsv(filename)
    names(df)[names(df) == "rs"] <- "SNP"
    df %>% select(SNP, beta, se, p_wald) %>% 
      write_tsv(filename)
  })

Perform SNP clumping using PLINK

# Use the un-manipulated DGRP bed/bim/fam files, since they don't throw an error from plink
bed <- "../input/dgrp2"

# Remind plink to only use the SNPs that passed QC
# read_tsv("data/derived/output/female_early_lmm.assoc.txt") %>% select(SNP) %>%
#   write_tsv(col_names = FALSE, path = "data/derived/snp_list.txt")


do_snp_clumping <- function(trait){

  run_command(glue("plink --bfile {bed}", 
                   " --clump output/{trait}.assoc.txt",
                   " --clump-field p_wald --clump-p1 0.00001"), # Each clump must contain one snp with p < 10 ^ -5
              path = "/data/derived")
  trait <- gsub("_lmm", "", trait)
  file_name <- glue("data/derived/{trait}_SNP_clumps.txt")
  file.rename("data/derived/plink.clumped", file_name)
  
  clumps <- read.table(file_name, header = TRUE) %>% 
    as_tibble() %>%
    rename(index_SNP = SNP,
           index_SNP_p_value = P,
           num_SNPs_in_clump = TOTAL,
           not_sig = NSIG, # Number of clumped SNPs that are not significant ( p > 0.05 )
           sig_p05 = S05, # Number of clumped SNPs 0.01 < p < 0.05
           sig_p01 = S01, # Number of clumped SNPs 0.001 < p < 0.01
           sig_p001 = S001, # Number of clumped SNPs 0.0001 < p < 0.001 
           sig_p0001 = S0001, # Number of clumped SNPs p < 0.0001
           other_snps = SP2) %>%
    mutate(other_snps = gsub("\\(1\\),", ", ", other_snps),
           other_snps = gsub("\\(1\\)", "", other_snps),
           other_snps = replace(other_snps, other_snps == "NONE", NA)) %>%
    select(-CHR, -F, -BP) 
  
  SNPs <- apply(clumps, 1, function(x) {
    res <- c(x[1], strsplit(x[9], split = ", ")[[1]])
    res[!is.na(res)]
  })
  
  db_local <- tbl(db, "variants") %>% select(SNP, FBID) %>% collect() 
  clumps$FBIDs <- lapply(SNPs, function(x) {
    focal <- db_local %>% filter(SNP %in% x) 
    if(nrow(focal) == 0) return(NA)
    focal %>% pull(FBID) %>% unique() %>% paste0(collapse = ", ")
    }) %>% unlist()
  
  db_local <- tbl(db, "genes") %>% collect()
  clumps$genes <- lapply(clumps$FBIDs, function(x) {
    if(is.na(x)) return(NA)
    FB <- strsplit(x, split = ", ")[[1]]
    return(db_local %>% filter(FBID %in% FB) %>% pull(gene_name) %>% paste0(collapse = ", "))
  }) %>% unlist()
  
  write_tsv(clumps, file_name)
  unlink("data/derived/plink.log")
}

lapply(c("female_early_lmm", "female_late_lmm", "male_early_lmm", "male_late_lmm"), do_snp_clumping)

Concatenate the results of the four univariate GWAS

Load up the GWAS results, which were produced by using linear mixed models implemented in GEMMA, and arrange the data in a ‘wide’ format, where each row is one SNP, and the columns hold the estimates of \(\beta\), the standard error of \(\beta\), and the p-value for the GEMMA association tests on each of the four fitness traits.

load_gwas_results <- function(fitness.component){
  new_names <- c(x = "beta", y = "se", z = "p_wald")
  names(new_names) <- c(paste("beta_", fitness.component, sep = ""), 
                        paste("SE_", fitness.component, sep = ""), 
                        paste("pvalue_", fitness.component, sep = ""))
  
  paste("data/derived/output/", fitness.component, "_lmm.assoc.txt", sep = "") %>%
    read_tsv() %>%
    select(SNP, beta, se, p_wald) %>%
    rename(!!new_names)
}
all_univariate_lmm_results <- load_gwas_results("female_early") %>%
  full_join(load_gwas_results("female_late"), by = "SNP") %>%
  full_join(load_gwas_results("male_early"), by = "SNP") %>%
  full_join(load_gwas_results("male_late"), by = "SNP")

Prune SNPs that are in 100% linkage disequilibrium

Because GEMMA is quick and easy to run, we did not bother working out how to edit PLINK files to identify SNPs that are in complete linkage disequilibrium with one another (and there is lots of short- and long-range LD in the dataset). However, it is easy to perform LD pruning at this stage, because we can take advantage of the fact that SNPs in perfect LD with each other will have identical estimates of \(|\beta|\) and its standard error, for all four fitness traits. The following code groups together SNPs that have identical association test results, producing a data frame in which the number of rows is equal to the number of unique datasets (i.e. combinations of genotypes and phenotypes) that were statistically analysed by GEMMA.

# Group loci with identical GWAS results (these are in 100% LD with each other)
all_univariate_lmm_results <- all_univariate_lmm_results %>%
  mutate(pasted = paste(beta_female_early, beta_female_late, beta_male_early, beta_male_late, 
                        SE_female_early, SE_female_late, SE_male_early, SE_male_late)) %>%
  group_by(pasted) %>%
  summarise(SNPs = paste0(SNP, collapse = ", "),    # If there are multiple SNPs, concatenate their names
            beta_female_early = beta_female_early[1], 
            beta_female_late = beta_female_late[1], 
            beta_male_early = beta_male_early[1], 
            beta_male_late = beta_male_late[1],
            SE_female_early = SE_female_early[1], 
            SE_female_late = SE_female_late[1], 
            SE_male_early = SE_male_early[1], 
            SE_male_late = SE_male_late[1],
            pvalue_female_early = pvalue_female_early[1], 
            pvalue_female_late = pvalue_female_late[1], 
            pvalue_male_early = pvalue_male_early[1], 
            pvalue_male_late = pvalue_male_late[1]) %>%
  ungroup() %>% select(-pasted) %>% 
  arrange(SNPs)


all_univariate_lmm_results <- read_csv("data/derived/all_univariate_GEMMA_results.csv")
# a handful of SNPs with low MAF could not be estimated and they have beta and SE == NA, so remove them now
# I think these are SNPs where the only lines with the minor allele were the lines where we measured female but not male fitness
SNPs_to_remove <- all_univariate_lmm_results$SNPs[!complete.cases(all_univariate_lmm_results)] 

all_univariate_lmm_results <- all_univariate_lmm_results %>% 
  filter(!(SNPs %in% SNPs_to_remove))

write_csv(all_univariate_lmm_results, file = "data/derived/all_univariate_GEMMA_results.csv")
unlink("data/derived/output/female_early_lmm.assoc.txt")
unlink("data/derived/output/female_late_lmm.assoc.txt")
unlink("data/derived/output/male_early_lmm.assoc.txt")
unlink("data/derived/output/male_late_lmm.assoc.txt")

Create a reduced list of LD-pruned SNPs with PLINK

To keep the computation time and memory usage for mashr manageable, we did not analyse every SNP analysed in the GWAS (i.e. 1207357 SNPs), but rather a subset of them that were approximately in linkage disequilibrium. We identified this LD-pruned set of SNPs using the PLINK arguments --indep-pairwise 100 10 0.2, i.e. pruning within 100kB sliding windows, sliding 10 variants along with each step, and allowing a maximum pairwise \(r^2\) threshold of 0.2 between loci. With these parameters, 1420071 SNPs were removed, leaving 2.26581^{5} for downstream analysis.

# indep-pairwise arguments are: 
# 100kB window size, 
# variant count to shift the window by 10 variants at the end of each step, 
# pairwise r^2 threshold of 0.2
run_command(glue("{plink} --bfile dgrp2_QC_focal_lines",
                 " --indep-pairwise 100 10 0.2"), path = "/data/derived/")

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --bfile dgrp2_QC_focal_lines
  --indep-pairwise 100 10 0.2

32768 MB RAM detected; reserving 16384 MB for main workspace.
1609590 variants loaded from .bim file.
125 people (0 males, 125 females) loaded from .fam.
125 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 125 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is exactly 1.
1609590 variants and 125 people pass filters and QC.
Phenotype data is quantitative.

1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 304959 variants from chromosome 1, leaving 45707.

1%
2%
2%
3%
4%
5%
5%
6%
7%
8%
8%
9%
10%
11%
11%
12%
13%
14%
14%
15%
16%
17%
17%
18%
19%
20%
20%
21%
22%
23%
23%
24%
25%
26%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 275365 variants from chromosome 2, leaving 42305.

1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
56%
57%
57%
58%
58%
59%
59%
60%
60%
61%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 335169 variants from chromosome 3, leaving 47815.

1%
2%
2%
3%
4%
5%
6%
7%
8%
9%
10%
10%
11%
12%
13%
13%
14%
15%
16%
17%
18%
19%
20%
21%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
98%
99%
Pruned 280320 variants from chromosome 4, leaving 40044.

1%
2%
3%
4%
5%
6%
6%
7%
8%
9%
10%
11%
12%
13%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
22%
23%
24%
25%
26%
27%
28%
29%
29%
30%
31%
32%
33%
34%
35%
36%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
45%
46%
47%
48%
49%
50%
51%
52%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
68%
69%
70%
71%
72%
73%
74%
75%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
91%
92%
93%
94%
95%
96%
97%
98%
98%
99%
Pruned 200786 variants from chromosome 5, leaving 34649.

1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
14%
15%
16%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
31%
32%
33%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
48%
49%
50%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
65%
66%
67%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
82%
83%
84%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 2384 variants from chromosome 6, leaving 87.
Pruning complete.  1398983 of 1609590 variants removed.
Writing...
Marker lists written to plink.prune.in and plink.prune.out .

file.rename("data/derived/plink.prune.in", "data/derived/SNPs_in_LD")

[1] TRUE

unlink("gwas_data/derived/plink.prune.out")

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future.apply_1.10.0 future_1.30.0       purrr_1.0.1        
 [4] pander_0.6.5        readr_2.1.4         bigsnpr_1.11.6     
 [7] bigstatsr_1.5.12    glue_1.6.2          ggplot2_3.4.1      
[10] dplyr_1.1.0         workflowr_1.7.0.4  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10        lattice_0.20-45    listenv_0.9.0      getPass_0.2-2     
 [5] ps_1.7.2           rprojroot_2.0.3    digest_0.6.31      foreach_1.5.2     
 [9] utf8_1.2.2         parallelly_1.34.0  R6_2.5.1           bigsparser_0.6.1  
[13] evaluate_0.20      httr_1.4.5         pillar_1.8.1       flock_0.7         
[17] rlang_1.0.6        rstudioapi_0.14    data.table_1.14.6  whisker_0.4.1     
[21] callr_3.7.3        jquerylib_0.1.4    Matrix_1.5-1       rmarkdown_2.20    
[25] bigparallelr_0.3.2 stringr_1.5.0      bit_4.0.5          munsell_0.5.0     
[29] compiler_4.2.2     httpuv_1.6.9       xfun_0.37          pkgconfig_2.0.3   
[33] globals_0.16.2     htmltools_0.5.4    tidyselect_1.2.0   tibble_3.1.8      
[37] codetools_0.2-18   fansi_1.0.4        crayon_1.5.2       tzdb_0.3.0        
[41] withr_2.5.0        later_1.3.0        grid_4.2.2         jsonlite_1.8.4    
[45] gtable_0.3.1       lifecycle_1.0.3    git2r_0.31.0       magrittr_2.0.3    
[49] scales_1.2.1       vroom_1.6.1        cli_3.6.0          stringi_1.7.12    
[53] cachem_1.0.7       fs_1.6.1           promises_1.2.0.1   doRNG_1.8.6       
[57] doParallel_1.0.17  bslib_0.4.2        ellipsis_0.3.2     generics_0.1.3    
[61] vctrs_0.5.2        cowplot_1.1.1      iterators_1.0.14   tools_4.2.2       
[65] bit64_4.0.5        hms_1.1.2          rngtools_1.5.2     processx_3.8.0    
[69] parallel_4.2.2     fastmap_1.1.1      yaml_2.3.7         colorspace_2.1-0  
[73] bigassertr_0.1.6   knitr_1.42         sass_0.4.5