Last updated: 2019-12-02

Checks: 7 passed, 0 failed

Knit directory: PSYMETAB/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .drake/
    Ignored:    data/processed/
    Ignored:    data/raw/

Untracked files:
    Untracked:  analysis/QC/
    Untracked:  post_imputation_qc.log
    Untracked:  pre_impute_qc.out
    Untracked:  qc_part2.out

Unstaged changes:
    Deleted:    pre_imputation_qc.out
    Deleted:    qc_part1.out

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 179fb3b Jenny 2019-12-02 eval false to drake launch
Rmd 0dd02a7 Jenny 2019-12-02 modify website
html 2849dcb Jenny Sjaarda 2019-12-02 wflow_git_commit(all = T)
Rmd 49a7ba9 Sjaarda Jennifer Lynn 2019-12-02 modify git ignore

Last updated: 2019-12-02

Code version: 179fb3bcc6cd7f84a181e8c0f2f0e6db2d939f94

To reproduce these results, please follow these instructions. See Data for details on data sources.

Step 1: Initiate project on remote server.

All processing scripts were run from the root sgg directory. The project was initialized using the workflowr R package, see here.

On sgg server:

project_name <- "PSYMETAB"
library("workflowr")

wflow_start(project_name) # creates directory called project_name

options("workflowr.view" = FALSE) # if using cluster
wflow_build() # create directories
options(workflowr.sysgit = "")

wflow_publish(c("analysis/index.Rmd", "analysis/about.Rmd", "analysis/license.Rmd"),
              "Publish the initial files for myproject")

wflow_use_github("jennysjaarda")
# select option 2. Create the remote repository yourself by going to https://github.com/new
# and entering the Repository name that matches the name of the directory of your workflowr project.

wflow_git_push()

You have now successfully created a GitHub repository for your project that is accessible on GitHub and from the servers.

Next, set up a local copy.

Step 2: Create local copy on personal computer.

From the terminal on your personal computer, clone the Git repository:

cd ~/Dropbox/UNIL/projects/
git clone https://github.com/jennysjaarda/PSYMETAB.git PSYMETAB

Open the project in Atom (or your preferred text editor) and modify the following files:

  • Because the terminal cannot generate a preview and workflowr does not play well with the system Git, add the following to the .Rprofile file:
    • options(workflowr.sysgit = "")
    • options("workflowr.view" = FALSE)
  • To ensure GitHub isn’t managing large files, modify the .gitignore file by adding the following lines:
    • data/*
    • !analysis/*.Rmd
    • !data/*.md
    • .git/
  • Save and push these changes to GitHub.
  • Pull the changes to the server (see the sketch below).
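
A minimal sketch of this push/pull round trip using workflowr’s Git wrappers (plain git push/git pull from the command line works just as well; the commit message below is only an example):

# On the personal computer: commit and push the edited configuration files.
library("workflowr")
wflow_git_commit(c(".Rprofile", ".gitignore"), "Add cluster-friendly workflowr options")
wflow_git_push()

# On the sgg server: pull the same changes into the server copy.
wflow_git_pull()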

Step 3: Create internal project folders.

Return to the sgg server and run the following:

project_dir=/data/sgg2/jenny/projects/PSYMETAB
mkdir $project_dir/data/raw
mkdir $project_dir/data/processed
mkdir $project_dir/data/raw/reference_files
mkdir $project_dir/data/raw/phenotype_data
mkdir $project_dir/data/raw/extraction
mkdir $project_dir/data/processed/phenotype_data
mkdir $project_dir/data/processed/extraction
mkdir $project_dir/docs/assets

This will create the following directory structure in PSYMETAB/:

PSYMETAB/
├── .gitignore
├── .Rprofile
├── _workflowr.yml
├── analysis/
│   ├── about.Rmd
│   ├── index.Rmd
│   ├── license.Rmd
│   └── _site.yml
├── code/
│   └── README.md
├── data/
│   ├── README.md
│   ├── raw/
│   │   ├── phenotype_data/
│   │   ├── reference_files/
│   │   └── extraction/
│   └── processed/
│       ├── phenotype_data/
│       ├── reference_files/
│       └── extraction/
├── docs/
│   └── assets/
├── PSYMETAB.Rproj
├── output/
│   └── README.md
└── README.md
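
Equivalently, if you prefer to stay inside R, the same sub-directories can be created with dir.create(); a sketch assuming the project_dir path used above:

project_dir <- "/data/sgg2/jenny/projects/PSYMETAB"

subdirs <- c("data/raw/reference_files", "data/raw/phenotype_data",
             "data/raw/extraction", "data/processed/phenotype_data",
             "data/processed/extraction", "docs/assets")

# recursive = TRUE also creates the intermediate data/raw and data/processed directories.
for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}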

Raw PLINK data (ped/map files) were copied from the CHUV L:/ folder after being built in GenomeStudio.

Step 4: Set up drake plan.

See make.R.

Configure a SLURM template:

options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
drake_hpc_template_file("slurm_clustermq.tmpl")

# Writes the file slurm_clustermq.tmpl, to be edited manually (see below).

The file created from the clustermq template was then edited manually; the final slurm_clustermq.tmpl is shown below:

cat(readLines('slurm_clustermq.tmpl'), sep = '\n')
#!/bin/sh
# From https://github.com/mschubert/clustermq/wiki/SLURM
#SBATCH --job-name={{ job_name }}           # job name
#SBATCH --partition={{ partition }}                 # partition
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}  # log file
####SBATCH --mem-per-cpu={{ memory | 4096 }}   # memory
#SBATCH --array=1-{{ n_jobs }}              # job array
#SBATCH --cpus-per-task={{ cpus }}
# module load R                             # Uncomment if R is an environment module.
####ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
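
For context, a minimal sketch of how make.R might use this template with drake and clustermq (the plan targets and file paths below are placeholders, not the actual PSYMETAB targets defined in make.R):

library(drake)

# Placeholder plan; the real targets are defined in make.R.
plan <- drake_plan(
  pheno = read.csv(file_in("data/raw/phenotype_data/example_pheno.csv")),
  pheno_summary = summary(pheno)
)

# Point clustermq at SLURM and at the template edited above.
options(clustermq.scheduler = "slurm",
        clustermq.template = "slurm_clustermq.tmpl")

# Run the plan, sending targets to SLURM workers via clustermq.
make(plan, parallelism = "clustermq", jobs = 2)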

Data

Provided by iGE3.

General information

  • Files received from: Mylene Docquier (Mylene.Docquier@unige.ch), iGE3 Genomics Platform Manager, University of Geneva
  • ftp server details:
  • Genotype data received in GenomeStudio format on August 28, 2019; for processing and conversion to PLINK format, see docs/miscellaneous/data_processing.md

Genotype data

  • Genotype data found in data/raw/genotypes
  • Each folder contains:
    • Initial data provided by Mylene in GenomeStudio format, with the original folder name (‘XXX/’).
    • Cluster files in GenomeStudio format (see docs/miscellaneous/data_processing.md), named ‘XXX_cluster/’.
    • PLINK files exported from GenomeStudio.

Miscellaneous GSA information provided in the following files:

  1. GSA v2 + MD Consortium.csv
  2. GSAMD-24v2-0_20024620_A1.csv
  3. GSAMD-24v2-0_A1-ACMG-GeneAnnotation.xlsx
  4. GSAMD-24v2-0_A1-ADME-CPIC-GeneAnnotation.xlsx
  5. GSAMD-24v2-0_A1-HLA-GeneAnnotation.xlsx
  6. GSAMD-24v2-0_A1-TruSight-GeneAnnotation.xlsx
  7. GSAv2_MDConsortium.bpm
  8. GSPMA24v1_0-A_4349HNR_Samples.egt
  • Files 1 and 2 appear to be identical and correspond to Illumina strand information; the same file can be found here.
  • The xlsx files contain 2 tabs: “Coverage Summary” and “GSAMD-24v2-0_A1-XXX-GeneAnnota”.
  • The bpm file is the manifest file used by GenomeStudio; manifest files describe the SNP or probe content on a standard BeadChip or in an assay product.
  • The egt file is the cluster file used for making genotype calls.
  • All files are saved in data/reference_files.

Chip details from Illumina

  • Files received from: Fe Magbanua (techsupport@illumina.com), Technical Applications Scientist, Technical Support, Illumina
  • GSAMD-24v2-0_20024620_A4_StrandReport_FDT.txt: strand report for build 38 (build 37 not available).
  • GSAMD-24v2-0_20024620_A1_b151_rsids.txt: locus-to-rsID conversion file, build 37.
  • GSAMD-24v2-0_20024620_A4_b151_rsids.txt: locus-to-rsID conversion file, build 38.
  • All saved in data/reference_files (copied using FileZilla).

Strand files from the Wellcome Centre

  • The data for each chip and genome build combination are freely downloadable from the links located here; each zip file contains three files:
    • .strand file
    • .miss file
    • .multiple file
  • More details can be found at the link above.
  • Chipendium was used to confirm that the bim files are on the TOP strand.
  • Contacted William Rayner (wrayner@well.ox.ac.uk) to find out what to do about custom SNPs, all correspondence on 22/07/2019.
    • Query: “The chip used to generate the data was the GSAMD-24v2, however about 10,000 custom SNPs were also added to the chip. Do you have any recommendations for adding such SNPs to the strand file for processing?”
    • Response: “If you have a chip with custom content on it as you do if you are able to send me the .csv annotation file (that contains the TopGenomicSeq information) I can use that to create you a custom strand file that you can then download on a private link, this will ensure the extra SNPs are not lost in the strand update (at the moment they would be removed as non-matching)”
    • Tried to obtain such a .csv file from Mylene or Smita at Illumina (spathak@illumina.com), who designed the chip.
    • On 15/07/2019 Smita provided such a file: GSA_UPPC_20023490X357589_A1_custom_only.csv.
    • The file was downloaded and saved to UPPC (Jenny/PSYMETAB_GWAS/GSA).
    • Sent .csv file to William Rayner and he provided the strand file for the custom SNP list on 16/07/2019:
      • GSA_UPPC_20023490X357589_A1_custom_only-b37-strand.zip
    • Zipped strand files were copied to the SGG server (${project_dir}/data/raw/reference_files/) and subsequently unzipped and used in QC (only the b37 files were needed).

Phenotype data

  • Sex and ethnicity data provided by Celine (via email) for each batch on July 18, 2019: GSA_sex-ethnicity.xlsx
  • Downloaded and saved to UPPC folder (Jenny/PSYMETAB_GWAS/).
  • Opened, manually changed all accented characters to standard letters (Ctrl-F and replace), and re-saved as csv/xlsx files (with a ‘no_accents’ suffix) for easier use in R; see the sketch below for a programmatic alternative.
  • Moved to the SGG folders via FileZilla (manually).
  • The name was changed (see ‘Master.sh’), as follows:
mv data/raw/phenotype_data/GSA_sex-ethnicity.xlsx data/raw/phenotype_data/QC_sex_eth.xlsx
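
As a programmatic alternative to the manual find-and-replace, the accented characters could also be stripped in R; a sketch assuming the readxl package and the renamed file above (the output file name is only an example):

library(readxl)

# Read the renamed phenotype file.
pheno <- read_excel("data/raw/phenotype_data/QC_sex_eth.xlsx")

# Transliterate accented characters to plain ASCII in all character columns.
char_cols <- vapply(pheno, is.character, logical(1))
pheno[char_cols] <- lapply(pheno[char_cols],
                           function(x) iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))

# Example output name; adjust to match the naming used by the QC scripts.
write.csv(pheno, "data/raw/phenotype_data/QC_sex_eth_no_accents.csv", row.names = FALSE)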

sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-conda_cos6-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
# 
# Matrix products: default
# BLAS/LAPACK: /data/sgg2/jenny/bin/anaconda3/envs/r_env/lib/R/lib/libRblas.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#  [1] workflowr_1.5.0 Rcpp_1.0.1      rprojroot_1.3-2 digest_0.6.18  
#  [5] later_0.8.0     R6_2.4.0        backports_1.1.4 git2r_0.26.1   
#  [9] magrittr_1.5    evaluate_0.13   stringi_1.4.3   fs_1.3.1       
# [13] promises_1.0.1  whisker_0.3-2   rmarkdown_1.12  tools_3.6.1    
# [17] stringr_1.4.0   glue_1.3.1      httpuv_1.5.1    xfun_0.6       
# [21] yaml_2.2.0      compiler_3.6.1  htmltools_0.3.6 knitr_1.22