Data sources

Last updated: 2020-06-12

Checks: 6 passed, 1 warning

Knit directory: PSYMETAB/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20191126)

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ff40613

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    ._docs
    Ignored:    .drake/
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/._GWAS.Rmd
    Ignored:    analysis/._data_processing_in_genomestudio.Rmd
    Ignored:    analysis/._quality_control.Rmd
    Ignored:    analysis/GWAS/
    Ignored:    analysis/PRS/
    Ignored:    analysis/QC/
    Ignored:    analysis/figure/
    Ignored:    analysis_prep_1_clustermq.out
    Ignored:    analysis_prep_2_clustermq.out
    Ignored:    analysis_prep_3_clustermq.out
    Ignored:    analysis_prep_4_clustermq.out
    Ignored:    data/processed/
    Ignored:    data/raw/
    Ignored:    download_impute_1_clustermq.out
    Ignored:    init_analysis_1_clustermq.out
    Ignored:    init_analysis_2_clustermq.out
    Ignored:    init_analysis_3_clustermq.out
    Ignored:    init_analysis_4_clustermq.out
    Ignored:    init_analysis_5_clustermq.out
    Ignored:    init_analysis_6_clustermq.out
    Ignored:    packrat/lib-R/
    Ignored:    packrat/lib-ext/
    Ignored:    packrat/lib/
    Ignored:    post_impute_1_clustermq.out
    Ignored:    pre_impute_qc_1_clustermq.out
    Ignored:    process_init_10_clustermq.out
    Ignored:    process_init_11_clustermq.out
    Ignored:    process_init_12_clustermq.out
    Ignored:    process_init_13_clustermq.out
    Ignored:    process_init_14_clustermq.out
    Ignored:    process_init_15_clustermq.out
    Ignored:    process_init_16_clustermq.out
    Ignored:    process_init_17_clustermq.out
    Ignored:    process_init_18_clustermq.out
    Ignored:    process_init_19_clustermq.out
    Ignored:    process_init_1_clustermq.out
    Ignored:    process_init_20_clustermq.out
    Ignored:    process_init_21_clustermq.out
    Ignored:    process_init_22_clustermq.out
    Ignored:    process_init_23_clustermq.out
    Ignored:    process_init_24_clustermq.out
    Ignored:    process_init_25_clustermq.out
    Ignored:    process_init_26_clustermq.out
    Ignored:    process_init_27_clustermq.out
    Ignored:    process_init_28_clustermq.out
    Ignored:    process_init_29_clustermq.out
    Ignored:    process_init_2_clustermq.out
    Ignored:    process_init_30_clustermq.out
    Ignored:    process_init_31_clustermq.out
    Ignored:    process_init_3_clustermq.out
    Ignored:    process_init_4_clustermq.out
    Ignored:    process_init_5_clustermq.out
    Ignored:    process_init_6_clustermq.out
    Ignored:    process_init_7_clustermq.out
    Ignored:    process_init_8_clustermq.out
    Ignored:    process_init_9_clustermq.out
    Ignored:    prs_1_clustermq.out
    Ignored:    prs_2_clustermq.out
    Ignored:    prs_3_clustermq.out
    Ignored:    prs_4_clustermq.out

Untracked files:
    Untracked:  analysis/genetic_quality_control.Rmd
    Untracked:  analysis/pheno_quality_control.Rmd
    Untracked:  analysis/plans.Rmd
    Untracked:  analysis_prep.log
    Untracked:  download_impute.log
    Untracked:  file1.txt
    Untracked:  file2.txt
    Untracked:  file3.txt
    Untracked:  flagged_rows.txt
    Untracked:  grs.log
    Untracked:  init_analysis.log
    Untracked:  problem_drugs.txt
    Untracked:  problem_ids.txt
    Untracked:  process_init.log
    Untracked:  prs.log

Unstaged changes:
    Modified:   analysis/GWAS.Rmd
    Modified:   analysis/data_sources.Rmd
    Deleted:    analysis/project.Rmd
    Modified:   analysis/quality_control.Rmd
    Modified:   cache_log.csv
    Modified:   post_impute.log
    Modified:   slurm_clustermq.tmpl

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	e0d6032	Sjaarda Jennifer Lynn	2019-12-06	edit typos
html	46477dd	Jenny Sjaarda	2019-12-06	Build site.
Rmd	b503ef0	Sjaarda Jennifer Lynn	2019-12-06	add more details to website
html	9f1ba5e	Jenny Sjaarda	2019-12-06	Build site.
Rmd	c1d579b	Jenny	2019-12-04	add details on data

Data provided from iGE3

General information

Files received from: Mylene Docquier (Mylene.Docquier@unige.ch), iGE3 Genomics Platform Manager, University of Geneva
SFTP server details:
- host: sftp://129.194.88.17
- username: 080219CE
- password: chinjennys
Genotype data were received in GenomeStudio format on August 28, 2019; for processing and conversion to PLINK format, see docs/miscellaneous/data_processing.md.

Genotype data

Genotype data are found in data/raw/genotypes.
Each folder contains:
- Initial data provided by Mylene in GenomeStudio format, with the original folder name (‘XXX/’).
- Cluster files in GenomeStudio format (see docs/miscellaneous/data_processing.md), named ‘XXX_cluster/’.
- PLINK files exported from GenomeStudio.

Miscellaneous GSA information provided in the following files:

GSA v2 + MD Consortium.csv
GSAMD-24v2-0_20024620_A1.csv
GSAMD-24v2-0_A1-ACMG-GeneAnnotation.xlsx
GSAMD-24v2-0_A1-ADME-CPIC-GeneAnnotation.xlsx
GSAMD-24v2-0_A1-HLA-GeneAnnotation.xlsx
GSAMD-24v2-0_A1-TruSight-GeneAnnotation.xlsx
GSAv2_MDConsortium.bpm
GSPMA24v1_0-A_4349HNR_Samples.egt

Details

Files 1 and 2 appear to be identical and contain Illumina strand information; the same file can be found here.
The xlsx files contain 2 tabs: “Coverage Summary” and “GSAMD-24v2-0_A1-XXX-GeneAnnota”.
The bpm file is the manifest file for use in GenomeStudio. Manifest files provide a description of the SNP or probe content on a standard BeadChip or in an assay product.
The egt file is the cluster file used for making genotype calls.
All files are saved in data/reference_files.
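
The observation that files 1 and 2 appear identical can be checked by comparing checksums. The following is a minimal sketch, assuming that "files 1 and 2" refers to the first two files in the list above and that they sit in data/reference_files; `files_identical` is a hypothetical helper, not part of the project code.

```r
# Hypothetical helper: TRUE if two files have identical MD5 checksums.
# tools::md5sum() returns NA for files it cannot read, which counts as FALSE.
files_identical <- function(path_a, path_b) {
  sums <- tools::md5sum(c(path_a, path_b))
  !anyNA(sums) && sums[[1]] == sums[[2]]
}

# Assumed locations of the two files (relative to the project root).
ref_dir <- file.path("data", "reference_files")
file_1 <- file.path(ref_dir, "GSA v2 + MD Consortium.csv")
file_2 <- file.path(ref_dir, "GSAMD-24v2-0_20024620_A1.csv")

# Only run the comparison when both files are present.
if (all(file.exists(file_1, file_2))) {
  print(files_identical(file_1, file_2))
}
```

Matching checksums indicate byte-identical files; files that differ only in line endings or column order would be reported as different.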

Chip details from Illumina

Files received from: Fe Magbanua (techsupport@illumina.com), Technical Applications Scientist, Technical Support, Illumina
GSAMD-24v2-0_20024620_A4_StrandReport_FDT.txt: strand report build38 (build37 not available).
GSAMD-24v2-0_20024620_A1_b151_rsids.txt: loci to rsid conversion file build37.
GSAMD-24v2-0_20024620_A4_b151_rsids.txt: loci to rsid conversion file build38.
All files are saved in data/reference_files (copied using FileZilla).

Strand files from Wellcome Centre

The data for each chip and genome build combination are freely downloadable from the links located here. Each zip file contains three files:
- .strand file
- .miss file
- .multiple file
More details can be found at the link above.
Chipendium was used to confirm that the bim files are on the TOP strand.
Contacted William Rayner (wrayner@well.ox.ac.uk) to find out what to do about custom SNPs (all correspondence on 22/07/2019).
- Query:
The chip used to generate the data was the GSAMD-24v2, however about 10,000 custom SNPs were also added to the chip. Do you have any recommendations for adding such SNPs to the strand file for processing?
- Response:
If you have a chip with custom content on it as you do if you are able to send me the .csv annotation file (that contains the TopGenomicSeq information) I can use that to create you a custom strand file that you can then download on a private link, this will ensure the extra SNPs are not lost in the strand update (at the moment they would be removed as non-matching).
- Trying to obtain such .csv file from Mylene or Smita at Illumina (spathak@illumina.com) who designed the chip.
- On 15/07/2019 Smita provided such a file: GSA_UPPC_20023490X357589_A1_custom_only.csv.
- The file was downloaded and saved to UPPC (Jenny/PSYMETAB_GWAS/GSA).
- Sent .csv file to William Rayner and he provided the strand file for the custom SNP list on 16/07/2019:
  - GSA_UPPC_20023490X357589_A1_custom_only-b37-strand.zip
  - GSA_UPPC_20023490X357589_A1_custom_only-b37-strand.zip
- Zipped strand files were copied to the SGG server (${project_dir}/data/raw/reference_files/), then unzipped and used in QC (only the b37 files were needed).

Phenotype data

Sex/ethnicity data

Provided by Celine (via email) for each batch on July 18, 2019: GSA_sex-ethnicity.xlsx.
Used for genetic quality control.
Downloaded and saved to UPPC folder (Jenny/PSYMETAB_GWAS/).
Opened, manually changed all accents to standard letters (ctrl-F and replace) and re-saved as csv/xlsx file (with no_accents) for easier use in R.
Moved to SGG folders manually using FileZilla.
The file was then renamed as follows:

mv data/raw/phenotype_data/GSA_sex-ethnicity.xlsx data/raw/phenotype_data/QC_sex_eth.xlsx
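
The manual find-and-replace of accented characters described above could also be scripted. The following is a minimal sketch, assuming the spreadsheet has already been read into a data frame (for example with `readxl::read_excel()`); `remove_accents` is a hypothetical helper and the example values are made up, not study data.

```r
library(stringi)

# Hypothetical helper: transliterate accented Latin characters to plain ASCII
# in every character column of a data frame; other columns are left untouched.
remove_accents <- function(dat) {
  is_chr <- vapply(dat, is.character, logical(1))
  dat[is_chr] <- lapply(dat[is_chr], stri_trans_general, id = "Latin-ASCII")
  dat
}

# Toy example (made-up values); \u00e9 is an "e" with an acute accent.
pheno <- data.frame(
  id = 1:2,
  ethnicity = c("Cauc\u00e9sien", "Am\u00e9rique latine"),
  stringsAsFactors = FALSE
)
pheno_no_accents <- remove_accents(pheno)
print(pheno_no_accents)
```

The cleaned data frame can then be written out with `write.csv()`. Scripting this step avoids having to repeat the manual edits each time a new version of the spreadsheet is provided.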

Full phenotype data

Phenotype data to be used in GWAS (and other) analyses.
All data was saved on sgg folders to: PSYMETAB/data/raw/phenotype_data/.
Initial data provided by Celine (via email) on October 24, 2019: PHENO_GWAS_241019.xlsx.
- Saved to sgg folder PSYMETAB/data/phenotype_data/.
- Opened in Excel, saved as a .csv file.
- Manually changed all accents to standard letters (ctrl-F and replace) and re-saved as csv file (with extension _no_accents.csv) for easier use in R.
Problems with the phenotype database were discovered in early 2020, and new phenotype data was provided on March 16, 2020: PHENO_GWAS_160320.xlsx.
- Same process as above was followed to remove accents and convert to csv file.
- Details on problems detected can be found on the Phenotype Quality Control page.
Additional problems were discovered and a new dataset was provided on April 16, 2020: PHENO_GWAS_160420.xlsx (processed the same as above).

Caffeine data

Requested to perform analysis using Nermine’s caffeine data.
All data was saved on sgg folders to: PSYMETAB/data/raw/phenotype_data/.
Initial data provided by Claire (via email) on March 11, 2020: Code_GEN_CG_11.03.2020.xlsx.
New variables (date of blood draw and age) were provided by Claire on May 5, 2020: CAF_Sleep_Jenny_05_05_2020_CG.xlsx.
- Noticed that there was a problem with the data set (email describing problem to Nermine on 07/05/2020):
I am a bit confused by the age and date columns in some cases. For example, for the participant below we have measurements at 4 time points: age 34, 52, 53 and 39 and the date of measurement in the next column. Does the date correspond to the date the blood draw took place? If so, I don’t understand why in 2018 (row 4) this participant was 39 and in 2012 (row 3) they were 53! Or does this column represent the date caffeine was measured (which could be very different from date of extraction)? Ideally I need the age the participant was when the extraction took place.

GEN	age	Date
UXEWHQEZ	52	2011-04-26
UXEWHQEZ	34	2013-04-02
UXEWHQEZ	52	2011-03-24
UXEWHQEZ	53	2012-02-27
UXEWHQEZ	39	2018-05-07
- After discussing with Nermine and Claire, Claire discovered the problem (as per email on 07/05/2020):
I spotted the problem, it was a mismatch in the ambulatory codes as some of them are written XXXAMB instead of AMBXXX or XXX+letter. I will send Jenny a new version of the file with paying attention to these codes if ok for you?
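
A mismatch of this kind can be avoided by harmonising the codes before merging. The following is a minimal sketch, assuming that codes written with a trailing "AMB" should carry the "AMB" prefix instead; the actual code formats are not shown on this page, so `normalise_amb_code` and the example codes are hypothetical.

```r
# Hypothetical helper: move a trailing "AMB" to the front of a code,
# e.g. "123AMB" becomes "AMB123". Codes in any other form are returned unchanged.
normalise_amb_code <- function(code) {
  sub("^(.+)AMB$", "AMB\\1", code)
}

# Made-up codes for illustration only.
codes <- c("123AMB", "AMB123", "123A")
print(normalise_amb_code(codes))
```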


A new dataset was provided by Claire on May 7, 2020: CAF_Sleep_Jenny_07_05_2020_CG.xlsx.

There were still issues with this dataset (as per email on 07/07/2020):

There is still a small problem with the IDs GAWZCNNL and WIFRYNSK. I have put the problems below. I think two participants received the same GEN ID. Perhaps there was still a problem with the merge?

GEN	Date	age	age2	Date2	days_difference	age_plus_days	age_check
GAWZCNNL	2011-02-03	53	22	2018-02-22	2576	60.05753	problem
GAWZCNNL	2011-02-03	53	60	2018-05-08	2651	60.26301	sensible
GAWZCNNL	2018-02-22	22	60	2018-05-08	75	22.20548	sensible
WIFRYNSK	2009-01-19	31	31	2018-09-10	3521	40.64658	problem
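
The columns in the table above can be recomputed with a pairwise comparison of measurements within each participant. The following is a minimal sketch, not the script used in the analysis: `check_ages` and `tolerance_years` are hypothetical names, and the one-sided rule for `age_check` is an assumption chosen because it agrees with the labels shown above. Dividing the day difference by 365 matches the tabulated `age_plus_days` values (for example 53 + 2576/365 ≈ 60.0575).

```r
# Hypothetical helper: pair every two measurements of the same participant and
# compare the age recorded at the later date with the age expected from the
# earlier measurement.
check_ages <- function(dat, tolerance_years = 1) {
  dat$Date <- as.Date(dat$Date)
  # Self-join on GEN; columns from the second copy get the suffix "2".
  pairs <- merge(dat, dat, by = "GEN", suffixes = c("", "2"))
  # Keep each pair once, with the earlier measurement first.
  pairs <- pairs[pairs$Date < pairs$Date2, ]
  pairs$days_difference <- as.numeric(
    difftime(pairs$Date2, pairs$Date, units = "days")
  )
  pairs$age_plus_days <- pairs$age + pairs$days_difference / 365
  # Assumed rule: flag a pair when the later age falls more than
  # `tolerance_years` below the expected age.
  pairs$age_check <- ifelse(
    pairs$age2 < pairs$age_plus_days - tolerance_years,
    "problem", "sensible"
  )
  pairs <- pairs[order(pairs$GEN, pairs$Date, pairs$Date2), ]
  rownames(pairs) <- NULL
  pairs
}

# The three GAWZCNNL measurements implied by the table above.
caffeine_example <- data.frame(
  GEN = "GAWZCNNL",
  Date = c("2011-02-03", "2018-02-22", "2018-05-08"),
  age = c(53, 22, 60),
  stringsAsFactors = FALSE
)
age_pairs <- check_ages(caffeine_example)
print(age_pairs)
```

A pair flagged as "problem" only indicates that one of the two ages (or the ID assignment) is wrong; deciding which one requires going back to the source records.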

Celine performed the anonymization to compare to Claire’s data to see if the same problems arose: GEN_CAF_Sleep_Jenny_05_05_2020.xlsx.
- There was still one issue with the dataset at ID WIFRYNSK.
  
  GEN	Date	age	age2	Date2	days_difference	age_plus_days	age_check
  WIFRYNSK	2009-01-07	31	31	2018-08-29	3521	40.64658	problem
- After asking Celine about this problem (08/05/2020), she said:
Okay, so I checked and this error is already present in the file that Nermine sent us, so it is not a problem related to the change of codes.
- Nermine responded (08/05/2020) with the explanation:
I just checked, and indeed I had changed the age manually and it was changed in both observations even though I specified the date of the observation… I don’t know how I did that. So on 22.12.2008 the patient was 21 years old, and on 13.08.2018 the patient was 31 years old.
- The following line was added to the script to manually change this entry: caffeine_raw %>% mutate(age = replace(age, GEN == "WIFRYNSK" & as.Date(Date) == "2009-01-07", 21))


sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRblas.so
LAPACK: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fuzzyjoin_0.1.5         kableExtra_1.1.0        R.utils_2.9.2          
 [4] R.oo_1.23.0             R.methodsS3_1.7.1       TwoSampleMR_0.4.25     
 [7] reader_1.0.6            NCmisc_1.1.6            optparse_1.6.4         
[10] readxl_1.3.1            ggthemes_4.2.0          tryCatchLog_1.1.6      
[13] futile.logger_1.4.3     DataExplorer_0.8.0      taRifx_1.0.6.1         
[16] qqman_0.1.4             MASS_7.3-51.5           bit64_0.9-7            
[19] bit_1.1-14              rslurm_0.5.0            rmeta_3.0              
[22] devtools_2.2.1          usethis_1.5.1           data.table_1.12.8      
[25] clustermq_0.8.8.1       future.batchtools_0.8.1 future_1.15.1          
[28] rlang_0.4.5             knitr_1.26              drake_7.12.0.9000      
[31] forcats_0.4.0           stringr_1.4.0           dplyr_0.8.3            
[34] purrr_0.3.3             readr_1.3.1             tidyr_1.0.3            
[37] tibble_2.1.3            ggplot2_3.2.1           tidyverse_1.3.0        
[40] pacman_0.5.1            processx_3.4.1          workflowr_1.6.0        

loaded via a namespace (and not attached):
 [1] backports_1.1.6      plyr_1.8.5           igraph_1.2.5        
 [4] lazyeval_0.2.2       storr_1.2.1          listenv_0.8.0       
 [7] digest_0.6.25        htmltools_0.4.0      fansi_0.4.1         
[10] magrittr_1.5         checkmate_1.9.4      memoise_1.1.0       
[13] base64url_1.4        remotes_2.1.0        globals_0.12.5      
[16] modelr_0.1.5         prettyunits_1.1.0    colorspace_1.4-1    
[19] rvest_0.3.5          rappdirs_0.3.1       haven_2.2.0         
[22] xfun_0.11            callr_3.4.0          crayon_1.3.4        
[25] jsonlite_1.6         brew_1.0-6           glue_1.4.0          
[28] gtable_0.3.0         webshot_0.5.2        pkgbuild_1.0.6      
[31] scales_1.1.0         futile.options_1.0.1 DBI_1.1.0           
[34] Rcpp_1.0.3           viridisLite_0.3.0    progress_1.2.2      
[37] txtq_0.2.0           htmlwidgets_1.5.1    httr_1.4.1          
[40] getopt_1.20.3        calibrate_1.7.5      ellipsis_0.3.0      
[43] pkgconfig_2.0.3      dbplyr_1.4.2         tidyselect_0.2.5    
[46] reshape2_1.4.3       later_1.0.0          munsell_0.5.0       
[49] cellranger_1.1.0     tools_3.5.3          cli_2.0.1           
[52] generics_0.0.2       broom_0.5.3          evaluate_0.14       
[55] yaml_2.2.0           fs_1.3.1             packrat_0.5.0       
[58] nlme_3.1-143         whisker_0.4          formatR_1.7         
[61] proftools_0.99-2     xml2_1.2.2           compiler_3.5.3      
[64] rstudioapi_0.10      filelock_1.0.2       testthat_2.3.1      
[67] reprex_0.3.0         stringi_1.4.5        highr_0.8           
[70] ps_1.3.0             desc_1.2.0           lattice_0.20-38     
[73] vctrs_0.2.4          pillar_1.4.3         lifecycle_0.1.0     
[76] networkD3_0.4        httpuv_1.5.2         R6_2.4.1            
[79] promises_1.1.0       gridExtra_2.3        sessioninfo_1.1.1   
[82] codetools_0.2-16     lambda.r_1.2.4       assertthat_0.2.1    
[85] pkgload_1.0.2        rprojroot_1.3-2      withr_2.1.2         
[88] batchtools_0.9.12    parallel_3.5.3       hms_0.5.3           
[91] grid_3.5.3           rmarkdown_1.18       git2r_0.26.1        
[94] lubridate_1.7.4