Last updated: 2023-03-15
Checks: 7 0
Knit directory: fitnessGWAS/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0.4). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20180914)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2080e0c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rapp.history
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .httr-oauth
Ignored: .pversion
Ignored: analysis/.DS_Store
Ignored: analysis/correlations_SNP_effects_cache/
Ignored: code/.DS_Store
Ignored: code/Drosophila_GWAS.Rmd
Ignored: data/.DS_Store
Ignored: data/derived/
Ignored: data/input/.DS_Store
Ignored: data/input/.pversion
Ignored: data/input/dgrp.fb557.annot.txt
Ignored: data/input/dgrp2.bed
Ignored: data/input/dgrp2.bim
Ignored: data/input/dgrp2.fam
Ignored: data/input/huang_transcriptome/
Ignored: figures/.DS_Store
Untracked files:
Untracked: big_model.rds
Untracked: code/quant_gen_1.R
Untracked: data/input/genomic_relatedness_matrix.rds
Unstaged changes:
Modified: .gitignore
Modified: code/pdf_supp_material.Rmd
Modified: code/pdf_supp_material.pdf
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/perform_gwas.Rmd
) and HTML
(docs/perform_gwas.html
) files. If you’ve configured a
remote Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 2080e0c | lukeholman | 2023-03-15 | wflow_publish("analysis/perform_gwas.Rmd") |
Rmd | d94e22d | lukeholman | 2023-03-14 | New quant gen and other changes |
html | 0a415e3 | lukeholman | 2021-11-12 | Build site. |
html | 7449a90 | lukeholman | 2021-10-01 | Build site. |
html | 9f76189 | lukeholman | 2021-09-27 | Build site. |
Rmd | 1f52aab | lukeholman | 2021-09-27 | Commit Sept 2021 |
html | 8d14298 | lukeholman | 2021-09-26 | Build site. |
Rmd | af15dd6 | lukeholman | 2021-09-26 | Commit Sept 2021 |
html | 410fb73 | lukeholman | 2021-03-04 | Build site. |
html | 871ae81 | lukeholman | 2021-03-04 | Build site. |
html | e112260 | lukeholman | 2021-03-04 | Build site. |
html | 836a780 | lukeholman | 2021-03-04 | Build site. |
html | 359ff37 | lukeholman | 2021-03-04 | Build site. |
html | 5506c4b | lukeholman | 2021-03-04 | Build site. |
html | f64a3e9 | lukeholman | 2021-03-04 | Build site. |
Rmd | fa1b9b0 | lukeholman | 2021-03-04 | big first commit 2021 |
Rmd | 8d54ea5 | Luke Holman | 2018-12-23 | Initial commit |
html | 8d54ea5 | Luke Holman | 2018-12-23 | Initial commit |
library(dplyr)
library(ggplot2)
# library(ggdendro)
library(glue)
library(bigsnpr) # to install: devtools::install_github("privefl/bigsnpr")
library(readr)
library(pander)
library(purrr)
library(future.apply) # for parallel code
# Load the predicted line means, as calculated by get_predicted_line_means
predicted_line_means <- read_csv("data/derived/predicted_line_means.csv")
plink <- paste(getwd(), "code/plink", sep = "/")
# # Load the wolbachia infection status data from the DGRP website
# wolbachia <- read_csv("data/input/wolbachia.csv") %>%
# mutate(line = paste("line", line, sep = ""))
#
# # Load the chomosomal inversion data from the DGRP website
# # these are the 5 inversions that Huang et al. PNAS corrected for
# inversions <- read_csv("data/input/inversion genotypes.csv") %>%
# mutate(line = paste("line", line, sep = "")) %>%
# select(line, `In(2L)t`, `In(2R)NS`, `In(3R)P`, `In(3R)K`, `In(3R)Mo`)
# helper function to pass commands to the terminal
# Note that we set `intern = TRUE`, and pass the result of `system()` to `cat()`,
# ensuring that the Terminal output will be printed in this knitr report.
run_command <- function(shell_command, wd = getwd(), path = ""){
cat(system(glue("cd ", wd, path, "\n",shell_command), intern = TRUE), sep = '\n')
}
We cleaned up the DGRP’s .bed/.bim/.fam files (available from the Mackay lab website) as follows:
Note that in the PLINK-formatted genotype files, lines fixed for the major allele are coded as 2, and lines fixed for the minor allele as 0. This means that in the association tests we calculate, negative effect sizes mean that the minor allele is associated with lower fitness, while positive effect sizes means that the minor allele is associated with higher fitness.
beagle <- bigsnpr::download_beagle()
perform_SNP_QC_and_imputation <- function(phenotypes){
if("block" %in% names(phenotypes)) phenotypes <- phenotypes %>% select(-block)
# Make a list of the lines in our sample and save as a text file for passing to PLINK
lines_to_keep <- gsub("_", "", phenotypes$line) %>% cbind(.,.)
write.table(lines_to_keep, row.names = FALSE, col.names = FALSE, file = "data/derived/lines_to_keep.txt", quote = FALSE)
# Define a function to add our phenotype data to a .fam file, which is needed for GWAS analysis and to make sure PLINK includes these samples
# The 'phenotypes' data frame needs to have a column called 'line'
add_phenotypes_to_fam <- function(filepath, phenotypes){
read_delim(filepath, col_names = FALSE, delim = " ") %>%
select(X1, X2, X3, X4, X5) %>% # Get all the non-phenotype columns
left_join(phenotypes %>%
mutate(line = gsub("_", "", line)),
by = c("X1" = "line")) %>%
write.table(file = filepath,
col.names = FALSE, row.names = FALSE,
quote = FALSE, sep = " ")
}
# Use Plink to clean and subset the DGRP's SNP data as follows:
# Only keep SNPs for which at least 90% of DGRP lines were successfully genotyped (--geno 0.1)
# Only keep SNPs with a minor allele frequency of 0.05 or higher (--maf 0.05)
# Finally, write the processed BIM/BED/FAM files to the data/derived directory
run_command(glue("{plink} --bfile dgrp2",
" --geno 0.1 --maf 0.05 --allow-no-sex",
" --make-bed --out ../derived/dgrp2_QC_all_lines"), path = "/data/input/")
# Use the shell command 'sed' to remove underscores from the DGRP line names in the .fam file (e.g. 'line_120' becomes 'line120')
# Otherwise, these underscores cause trouble when we need to convert from PLINK to vcf format (vcf format uses underscore as a separator)
for(i in 1:2) run_command("sed -i '' 's/_//' dgrp2_QC_all_lines.fam", path = "/data/derived/")
# Now impute the missing genotypes using Beagle
# This part uses the data for the full DGRP panel of >200 lines, to infer missing genotypes as accurately as possible.
# This step uses a lot of memory (I set to 28MB max, and it used 26.5GB), but maybe it can also run on a less powerful computer?
# The bigsnpr package provides a helpful wrapper for Beagle called snp_beagleImpute(): it translates to a VCF file and back again using PLINK
snp_beagleImpute(beagle, plink,
bedfile.in = "data/derived/dgrp2_QC_all_lines.bed",
bedfile.out = "data/derived/dgrp2_QC_all_lines_imputed.bed",
ncores = 7,
memory.max = 20)
# assign a sex of 'female' to all the DGRP lines (Beagle removes the sex, and it seems PLINK needs individuals to have a sex,
# despite PLINK having a command called --allow-no-sex)
run_command("sed -i '' 's/ 0 0 0/ 0 0 2/' dgrp2_QC_all_lines_imputed.fam", path = "/data/derived/")
# Re-write the .bed file, to make sure the MAF threshold and minor/major allele designations are correctly assigned post-Beagle
run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed",
" --geno 0.1 --maf 0.05 --allow-no-sex",
" --make-bed --out dgrp2_QC_all_lines_imputed_correct"), path = "/data/derived/")
#unlink(list.files("data/derived", pattern = "~", full.names = TRUE)) # delete the original files, which were given a ~ file name by PLINK
# Use PLINK to get the allele IDs and calculate the MAFs across the whole DGRP, for all SNPs that survived QC
# The file created is called data/derived/plink.frq
run_command("{plink} --bfile dgrp2_QC_all_lines_imputed_correct --freq", path = "/data/derived")
# Now cull the PLINK files to just the lines that we measured, and re-apply the
# MAF cut-off of 0.05 for the new smaller sample of DGRP lines
run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed_correct",
" --keep-allele-order", # force PLINK to retain the major/minor allele designations that apply to the DGRP as a whole
" --keep lines_to_keep.txt --geno 0.1 --maf 0.05",
" --make-bed --out dgrp2_QC_focal_lines"), path = "/data/derived/")
# edit the new FAM file to add the phenotype data from 'phenotypes'
add_phenotypes_to_fam("data/derived/dgrp2_QC_focal_lines.fam", phenotypes)
# Clean up:
unlink(c("data/derived/lines_to_keep.txt",
"data/derived/plink.log",
"data/derived/dgrp2_QC_all_lines_imputed.bed",
"data/derived/dgrp2_QC_all_lines_imputed.bim",
"data/derived/dgrp2_QC_all_lines_imputed.fam",
"data/derived/dgrp2_QC_all_lines_imputed.log",
"data/derived/dgrp2_QC_all_lines_imputed_correct.bed",
"data/derived/dgrp2_QC_all_lines_imputed_correct.bim",
"data/derived/dgrp2_QC_all_lines_imputed_correct.fam",
"data/derived/dgrp2_QC_all_lines_imputed_correct.log"))
}
perform_SNP_QC_and_imputation(phenotypes = predicted_line_means)
The following code chunk runs 5 linear mixed models, implemented in GEMMA. Each of the univariate linear mixed models uses the decomposed genomic relatedness matrix from above, and has one of our four fitness traits as the response variable. The multivariate model contains all four traits, and also uses the decomposed genomic relatedness matrix.
# The option "-lmm 1" runs a linear mixed model using Wald test to get the p-values
# The option "-r2 0.95" only keeps SNPs that are <95% correlated with Wolbachia infection status
# The option "-c covariate_file_for_GEMMA.txt" adds in the Wolbahia infection status as a covariate, and also tells the model to fit an intercept
opts <- "-lmm 1"
GRM_female <- "-d ./output/eigen_decomp.eigenD_female.txt -u ./output/eigen_decomp.eigenU_female.txt"
GRM_male <- "-d ./output/eigen_decomp.eigenD_male.txt -u ./output/eigen_decomp.eigenU_male.txt"
#cov <- "-c covariate_file_for_GEMMA.txt -r2 0.99" # not used at present
c("gemma -bfile {bed} {GRM_female} {opts} -n 1 -o female_early_lmm",
"gemma -bfile {bed} {GRM_female} {opts} -n 2 -o female_late_lmm",
"gemma -bfile {bed} {GRM_male} {opts} -n 3 -o male_early_lmm",
"gemma -bfile {bed} {GRM_male} {opts} -n 4 -o male_late_lmm") %>%
#"gemma -bfile {bed} {GRM_male} {opts} -n 1 2 3 4 -o all_four_traits") %>%
lapply(run_command, path = "/data/derived")
This part uses the full DGRP panel, because our aim is to determine how common these alleles are in the original wild population from which the DGRP was captured. These MAFs are the ones used in our statistical analyses and plots.
# Extract and save the MAFs, SNP positions, and major/minor alleles
MAFs <- read.table("data/derived/plink.frq", header = TRUE, stringsAsFactors = FALSE) %>%
mutate(position = map_chr(strsplit(SNP, split = "_"), function(x) x[2])) %>%
select(SNP, position, MAF, A1, A2) %>%
rename(minor_allele = A1,
major_allele = A2)
db <- DBI::dbConnect(RSQLite::SQLite(), "data/derived/annotations.sqlite3", create = FALSE)
MAFs <- tbl(db, "variants") %>%
select(SNP, FBID, site.class, distance.to.gene, chr) %>% collect() %>%
full_join(MAFs, by = "SNP") %>%
filter(!is.na(MAF)) %>%
mutate(site.class = replace(site.class, is.na(site.class), "INTERGENIC"))
# Delete the original variant annotation table from the db, and add the new one back in
db %>% db_drop_table(table = "variants")
db %>% copy_to(MAFs,
"variants", temporary = FALSE,
indexes = list("SNP", "FBID", "chr", "site.class"))
rm(MAFs)
Discard columns we don’t need, and rename the SNP column from the
name assigned by GEMMA (rs
) to something more sensible
(SNP
).
# The 4 univariate files:
c("data/derived/output/female_early_lmm.assoc.txt",
"data/derived/output/female_late_lmm.assoc.txt",
"data/derived/output/male_early_lmm.assoc.txt",
"data/derived/output/male_late_lmm.assoc.txt") %>%
lapply(function(filename){
df <- read_tsv(filename)
names(df)[names(df) == "rs"] <- "SNP"
df %>% select(SNP, beta, se, p_wald) %>%
write_tsv(filename)
})
# Use the un-manipulated DGRP bed/bim/fam files, since they don't throw an error from plink
bed <- "../input/dgrp2"
# Remind plink to only use the SNPs that passed QC
# read_tsv("data/derived/output/female_early_lmm.assoc.txt") %>% select(SNP) %>%
# write_tsv(col_names = FALSE, path = "data/derived/snp_list.txt")
do_snp_clumping <- function(trait){
run_command(glue("plink --bfile {bed}",
" --clump output/{trait}.assoc.txt",
" --clump-field p_wald --clump-p1 0.00001"), # Each clump must contain one snp with p < 10 ^ -5
path = "/data/derived")
trait <- gsub("_lmm", "", trait)
file_name <- glue("data/derived/{trait}_SNP_clumps.txt")
file.rename("data/derived/plink.clumped", file_name)
clumps <- read.table(file_name, header = TRUE) %>%
as_tibble() %>%
rename(index_SNP = SNP,
index_SNP_p_value = P,
num_SNPs_in_clump = TOTAL,
not_sig = NSIG, # Number of clumped SNPs that are not significant ( p > 0.05 )
sig_p05 = S05, # Number of clumped SNPs 0.01 < p < 0.05
sig_p01 = S01, # Number of clumped SNPs 0.001 < p < 0.01
sig_p001 = S001, # Number of clumped SNPs 0.0001 < p < 0.001
sig_p0001 = S0001, # Number of clumped SNPs p < 0.0001
other_snps = SP2) %>%
mutate(other_snps = gsub("\\(1\\),", ", ", other_snps),
other_snps = gsub("\\(1\\)", "", other_snps),
other_snps = replace(other_snps, other_snps == "NONE", NA)) %>%
select(-CHR, -F, -BP)
SNPs <- apply(clumps, 1, function(x) {
res <- c(x[1], strsplit(x[9], split = ", ")[[1]])
res[!is.na(res)]
})
db_local <- tbl(db, "variants") %>% select(SNP, FBID) %>% collect()
clumps$FBIDs <- lapply(SNPs, function(x) {
focal <- db_local %>% filter(SNP %in% x)
if(nrow(focal) == 0) return(NA)
focal %>% pull(FBID) %>% unique() %>% paste0(collapse = ", ")
}) %>% unlist()
db_local <- tbl(db, "genes") %>% collect()
clumps$genes <- lapply(clumps$FBIDs, function(x) {
if(is.na(x)) return(NA)
FB <- strsplit(x, split = ", ")[[1]]
return(db_local %>% filter(FBID %in% FB) %>% pull(gene_name) %>% paste0(collapse = ", "))
}) %>% unlist()
write_tsv(clumps, file_name)
unlink("data/derived/plink.log")
}
lapply(c("female_early_lmm", "female_late_lmm", "male_early_lmm", "male_late_lmm"), do_snp_clumping)
Load up the GWAS results, which were produced by using linear mixed models implemented in GEMMA, and arrange the data in a ‘wide’ format, where each row is one SNP, and the columns hold the estimates of β, the standard error of β, and the p-value for the GEMMA association tests on each of the four fitness traits.
load_gwas_results <- function(fitness.component){
new_names <- c(x = "beta", y = "se", z = "p_wald")
names(new_names) <- c(paste("beta_", fitness.component, sep = ""),
paste("SE_", fitness.component, sep = ""),
paste("pvalue_", fitness.component, sep = ""))
paste("data/derived/output/", fitness.component, "_lmm.assoc.txt", sep = "") %>%
read_tsv() %>%
select(SNP, beta, se, p_wald) %>%
rename(!!new_names)
}
all_univariate_lmm_results <- load_gwas_results("female_early") %>%
full_join(load_gwas_results("female_late"), by = "SNP") %>%
full_join(load_gwas_results("male_early"), by = "SNP") %>%
full_join(load_gwas_results("male_late"), by = "SNP")
Because GEMMA is quick and easy to run, we did not bother working out how to edit PLINK files to identify SNPs that are in complete linkage disequilibrium with one another (and there is lots of short- and long-range LD in the dataset). However, it is easy to perform LD pruning at this stage, because we can take advantage of the fact that SNPs in perfect LD with each other will have identical estimates of |β| and its standard error, for all four fitness traits. The following code groups together SNPs that have identical association test results, producing a data frame in which the number of rows is equal to the number of unique datasets (i.e. combinations of genotypes and phenotypes) that were statistically analysed by GEMMA.
# Group loci with identical GWAS results (these are in 100% LD with each other)
all_univariate_lmm_results <- all_univariate_lmm_results %>%
mutate(pasted = paste(beta_female_early, beta_female_late, beta_male_early, beta_male_late,
SE_female_early, SE_female_late, SE_male_early, SE_male_late)) %>%
group_by(pasted) %>%
summarise(SNPs = paste0(SNP, collapse = ", "), # If there are multiple SNPs, concatenate their names
beta_female_early = beta_female_early[1],
beta_female_late = beta_female_late[1],
beta_male_early = beta_male_early[1],
beta_male_late = beta_male_late[1],
SE_female_early = SE_female_early[1],
SE_female_late = SE_female_late[1],
SE_male_early = SE_male_early[1],
SE_male_late = SE_male_late[1],
pvalue_female_early = pvalue_female_early[1],
pvalue_female_late = pvalue_female_late[1],
pvalue_male_early = pvalue_male_early[1],
pvalue_male_late = pvalue_male_late[1]) %>%
ungroup() %>% select(-pasted) %>%
arrange(SNPs)
all_univariate_lmm_results <- read_csv("data/derived/all_univariate_GEMMA_results.csv")
# a handful of SNPs with low MAF could not be estimated and they have beta and SE == NA, so remove them now
# I think these are SNPs where the only lines with the minor allele were the lines where we measured female but not male fitness
SNPs_to_remove <- all_univariate_lmm_results$SNPs[!complete.cases(all_univariate_lmm_results)]
all_univariate_lmm_results <- all_univariate_lmm_results %>%
filter(!(SNPs %in% SNPs_to_remove))
write_csv(all_univariate_lmm_results, file = "data/derived/all_univariate_GEMMA_results.csv")
unlink("data/derived/output/female_early_lmm.assoc.txt")
unlink("data/derived/output/female_late_lmm.assoc.txt")
unlink("data/derived/output/male_early_lmm.assoc.txt")
unlink("data/derived/output/male_late_lmm.assoc.txt")
To keep the computation time and memory usage for mashr
manageable, we did not analyse every SNP analysed in the GWAS
(i.e. 1207357 SNPs), but rather a subset of them that were approximately
in linkage disequilibrium. We identified this LD-pruned set of SNPs
using the PLINK arguments --indep-pairwise 100 10 0.2
,
i.e. pruning within 100kB sliding windows, sliding 10 variants along
with each step, and allowing a maximum pairwise r2 threshold of 0.2 between loci. With
these parameters, 1420071 SNPs were removed, leaving 2.26581^{5} for
downstream analysis.
# indep-pairwise arguments are:
# 100kB window size,
# variant count to shift the window by 10 variants at the end of each step,
# pairwise r^2 threshold of 0.2
run_command(glue("{plink} --bfile dgrp2_QC_focal_lines",
" --indep-pairwise 100 10 0.2"), path = "/data/derived/")
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink.log.
Options in effect:
--bfile dgrp2_QC_focal_lines
--indep-pairwise 100 10 0.2
32768 MB RAM detected; reserving 16384 MB for main workspace.
1609590 variants loaded from .bim file.
125 people (0 males, 125 females) loaded from .fam.
125 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 125 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is exactly 1.
1609590 variants and 125 people pass filters and QC.
Phenotype data is quantitative.
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 304959 variants from chromosome 1, leaving 45707.
1%
2%
2%
3%
4%
5%
5%
6%
7%
8%
8%
9%
10%
11%
11%
12%
13%
14%
14%
15%
16%
17%
17%
18%
19%
20%
20%
21%
22%
23%
23%
24%
25%
26%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 275365 variants from chromosome 2, leaving 42305.
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
56%
57%
57%
58%
58%
59%
59%
60%
60%
61%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 335169 variants from chromosome 3, leaving 47815.
1%
2%
2%
3%
4%
5%
6%
7%
8%
9%
10%
10%
11%
12%
13%
13%
14%
15%
16%
17%
18%
19%
20%
21%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
98%
99%
Pruned 280320 variants from chromosome 4, leaving 40044.
1%
2%
3%
4%
5%
6%
6%
7%
8%
9%
10%
11%
12%
13%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
22%
23%
24%
25%
26%
27%
28%
29%
29%
30%
31%
32%
33%
34%
35%
36%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
45%
46%
47%
48%
49%
50%
51%
52%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
68%
69%
70%
71%
72%
73%
74%
75%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
91%
92%
93%
94%
95%
96%
97%
98%
98%
99%
Pruned 200786 variants from chromosome 5, leaving 34649.
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
14%
15%
16%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
31%
32%
33%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
48%
49%
50%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
65%
66%
67%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
82%
83%
84%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
Pruned 2384 variants from chromosome 6, leaving 87.
Pruning complete. 1398983 of 1609590 variants removed.
Writing...
Marker lists written to plink.prune.in and plink.prune.out .
file.rename("data/derived/plink.prune.in", "data/derived/SNPs_in_LD")
[1] TRUE
unlink("gwas_data/derived/plink.prune.out")
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] future.apply_1.10.0 future_1.30.0 purrr_1.0.1
[4] pander_0.6.5 readr_2.1.4 bigsnpr_1.11.6
[7] bigstatsr_1.5.12 glue_1.6.2 ggplot2_3.4.1
[10] dplyr_1.1.0 workflowr_1.7.0.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 lattice_0.20-45 listenv_0.9.0 getPass_0.2-2
[5] ps_1.7.2 rprojroot_2.0.3 digest_0.6.31 foreach_1.5.2
[9] utf8_1.2.2 parallelly_1.34.0 R6_2.5.1 bigsparser_0.6.1
[13] evaluate_0.20 httr_1.4.5 pillar_1.8.1 flock_0.7
[17] rlang_1.0.6 rstudioapi_0.14 data.table_1.14.6 whisker_0.4.1
[21] callr_3.7.3 jquerylib_0.1.4 Matrix_1.5-1 rmarkdown_2.20
[25] bigparallelr_0.3.2 stringr_1.5.0 bit_4.0.5 munsell_0.5.0
[29] compiler_4.2.2 httpuv_1.6.9 xfun_0.37 pkgconfig_2.0.3
[33] globals_0.16.2 htmltools_0.5.4 tidyselect_1.2.0 tibble_3.1.8
[37] codetools_0.2-18 fansi_1.0.4 crayon_1.5.2 tzdb_0.3.0
[41] withr_2.5.0 later_1.3.0 grid_4.2.2 jsonlite_1.8.4
[45] gtable_0.3.1 lifecycle_1.0.3 git2r_0.31.0 magrittr_2.0.3
[49] scales_1.2.1 vroom_1.6.1 cli_3.6.0 stringi_1.7.12
[53] cachem_1.0.7 fs_1.6.1 promises_1.2.0.1 doRNG_1.8.6
[57] doParallel_1.0.17 bslib_0.4.2 ellipsis_0.3.2 generics_0.1.3
[61] vctrs_0.5.2 cowplot_1.1.1 iterators_1.0.14 tools_4.2.2
[65] bit64_4.0.5 hms_1.1.2 rngtools_1.5.2 processx_3.8.0
[69] parallel_4.2.2 fastmap_1.1.1 yaml_2.3.7 colorspace_2.1-0
[73] bigassertr_0.1.6 knitr_1.42 sass_0.4.5