Last updated: 2021-03-03

Knit directory: PSYMETAB/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute                              relative
/data/sgg2/jenny/projects/PSYMETAB    .

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    ._docs
    Ignored:    .drake/
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/._GWAS.Rmd
    Ignored:    analysis/._data_processing_in_genomestudio.Rmd
    Ignored:    analysis/._quality_control.Rmd
    Ignored:    analysis/GWAS/
    Ignored:    analysis/GWAS_results.knit.md
    Ignored:    analysis/GWAS_results.utf8.md
    Ignored:    analysis/PRS/
    Ignored:    analysis/QC/
    Ignored:    analysis/Rlogo2.png
    Ignored:    analysis/figure/
    Ignored:    analysis/rplot.jpg
    Ignored:    analysis/site_libs/
    Ignored:    data/processed/
    Ignored:    data/raw/
    Ignored:    packrat/lib-R/
    Ignored:    packrat/lib-ext/
    Ignored:    packrat/lib/
    Ignored:    process_init_10_clustermq.out
    Ignored:    process_init_11_clustermq.out
    Ignored:    process_init_12_clustermq.out
    Ignored:    process_init_13_clustermq.out
    Ignored:    process_init_14_clustermq.out
    Ignored:    process_init_15_clustermq.out
    Ignored:    process_init_16_clustermq.out
    Ignored:    process_init_17_clustermq.out
    Ignored:    process_init_18_clustermq.out
    Ignored:    process_init_19_clustermq.out
    Ignored:    process_init_1_clustermq.out
    Ignored:    process_init_20_clustermq.out
    Ignored:    process_init_21_clustermq.out
    Ignored:    process_init_22_clustermq.out
    Ignored:    process_init_2_clustermq.out
    Ignored:    process_init_3_clustermq.out
    Ignored:    process_init_4_clustermq.out
    Ignored:    process_init_5_clustermq.out
    Ignored:    process_init_6_clustermq.out
    Ignored:    process_init_7_clustermq.out
    Ignored:    process_init_8_clustermq.out
    Ignored:    process_init_9_clustermq.out
    Ignored:    prs_1_clustermq.out
    Ignored:    prs_2_clustermq.out
    Ignored:    prs_3_clustermq.out
    Ignored:    prs_4_clustermq.out
    Ignored:    prs_5_clustermq.out
    Ignored:    prs_6_clustermq.out
    Ignored:    prs_7_clustermq.out
    Ignored:    prs_8_clustermq.out
    Ignored:    ukbb_analysis_10_clustermq.out
    Ignored:    ukbb_analysis_11_clustermq.out
    Ignored:    ukbb_analysis_12_clustermq.out
    Ignored:    ukbb_analysis_13_clustermq.out
    Ignored:    ukbb_analysis_14_clustermq.out
    Ignored:    ukbb_analysis_15_clustermq.out
    Ignored:    ukbb_analysis_16_clustermq.out
    Ignored:    ukbb_analysis_17_clustermq.out
    Ignored:    ukbb_analysis_18_clustermq.out
    Ignored:    ukbb_analysis_19_clustermq.out
    Ignored:    ukbb_analysis_1_clustermq.out
    Ignored:    ukbb_analysis_20_clustermq.out
    Ignored:    ukbb_analysis_21_clustermq.out
    Ignored:    ukbb_analysis_22_clustermq.out
    Ignored:    ukbb_analysis_2_clustermq.out
    Ignored:    ukbb_analysis_3_clustermq.out
    Ignored:    ukbb_analysis_4_clustermq.out
    Ignored:    ukbb_analysis_5_clustermq.out
    Ignored:    ukbb_analysis_6_clustermq.out
    Ignored:    ukbb_analysis_7_clustermq.out
    Ignored:    ukbb_analysis_8_clustermq.out
    Ignored:    ukbb_analysis_9_clustermq.out

Untracked files:
    Untracked:  Rlogo.png
    Untracked:  Rlogo2.png
    Untracked:  analysis_prep.log
    Untracked:  download_impute.log
    Untracked:  extract_sig.log
    Untracked:  grs.log
    Untracked:  init_analysis.log
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison.csv
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison2.csv
    Untracked:  output/PSYMETAB_GWAS_baseline_CEU_result.csv
    Untracked:  output/PSYMETAB_GWAS_subgroup_CEU_result.csv
    Untracked:  output/coffee_consumed_Neale_UKBB_analysis.csv
    Untracked:  process_init.log
    Untracked:  prs.log
    Untracked:  rplot.jpg
    Untracked:  test
    Untracked:  ukbb_analysis.log

Unstaged changes:
    Modified:   analysis/plans.Rmd
    Modified:   cache_log.csv
    Modified:   post_impute.log

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File  Version  Author         Date        Message
Rmd   105ad4f  Jenny Sjaarda  2021-03-03  Testing genetic QC Rmd
Rmd   551bdb5  Jenny Sjaarda  2021-03-03  testing genetics_qc
html  843aad9  Jenny Sjaarda  2021-03-03  Build site.
Rmd   e98e3a0  Jenny Sjaarda  2021-03-03  testing genetics_qc
Rmd   941b66d  Jenny Sjaarda  2021-03-02  add new Rmd files and respective html files
html  941b66d  Jenny Sjaarda  2021-03-02  add new Rmd files and respective html files

The following document outlines and summarizes the genetic quality control and processing procedure that was followed to create a clean, imputed dataset.

Step 1: Prepare and cluster GenomeStudio files.

Step 1 was performed entirely on a CHUV computer.

Part A: Randomize IDs.

  • Genetic sample IDs were recoded according to the GPCR algorithm to ensure that participants with genetic data are not identifiable.
  • code/randomize_IDs.r was run on a CHUV computer before building the GenomeStudio project.
  • This creates a new csv file, which was used to create a GenomeStudio project with data provided by the lab in Geneva.
  • The file requires manual addition of the following header before uploading to GenomeStudio:
[Header],,,,,,,,,,,,,
Investigator Name,,,,,,,,,,,,,
Project Name,,,,,,,,,,,,,
Experiment Name,,,,,,,,,,,,,
Date,,,,,,,,,,,,,
[Manifests],,,,,,,,,,,,,
A,GSA_UPPC_20023490X357589_A1,,,,,,,,,,,,
[Data],,,,,,,,,,,,,
  • Some samples were found to be duplicates (i.e. 2 samples taken at 2 different time points were analyzed for the same individual); the second sample was recoded to have the ID: ${ID}002.
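The manual header step and the duplicate ID recoding can be sketched as below. This is only an illustration in Python, not the contents of code/randomize_IDs.r: the function names are hypothetical, and the data rows are assumed to be whatever csv text the randomization script wrote.

```python
# Header rows copied verbatim from the list above.
HEADER_ROWS = [
    "[Header],,,,,,,,,,,,,",
    "Investigator Name,,,,,,,,,,,,,",
    "Project Name,,,,,,,,,,,,,",
    "Experiment Name,,,,,,,,,,,,,",
    "Date,,,,,,,,,,,,,",
    "[Manifests],,,,,,,,,,,,,",
    "A,GSA_UPPC_20023490X357589_A1,,,,,,,,,,,,",
    "[Data],,,,,,,,,,,,,",
]


def add_sample_sheet_header(data_csv_text: str) -> str:
    """Return the sample sheet text with the GenomeStudio header prepended."""
    body = data_csv_text.lstrip("\n")
    return "\n".join(HEADER_ROWS) + "\n" + body


def recode_duplicate_id(sample_id: str) -> str:
    """Hypothetical helper: the second sample of an individual gets the suffix 002."""
    return f"{sample_id}002"
```

For example, `recode_duplicate_id("BEEEDIGO")` gives `BEEEDIGO002`, which is the form of the duplicate IDs that appear later in the kinship table.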

Part B: Create GenomeStudio files.

  • Instructions can be found here.
  • Required files:
    • Sample sheet: as csv file (created above).
    • Data repository: as idat files.
    • Manifest file: as bpm file.
    • Cluster file: as egt file.
  • Data provided from Mylene Docquier, copied from sftp and saved here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data.
  • Create new IDs based on GPCR randomization (see code/randomize_IDs.r), and save to above folder as: Eap0819_1t26_27to29corrected_7b9b_randomizedID.csv.
  • Note that original IDs can be found in the same folder at the file: Eap0819_1t26_27to29corrected_7b9.csv, if needed.
  • Create an empty folder here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS, named: GS_project_26092019 (date of creation).
  • Using the new IDs, create a GenomeStudio project as follows:
  1. Open GenomeStudio.
  2. Select: File > New Genotyping Project.
  3. Select L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS as project repository.
  4. Under project name: use “GS_project_26092019” and click “Next”.
  5. Select “Use sample sheet to load intensities” and click “Next”.
  6. Select sample, data and manifests as specified below and click “Next”:
    • Sample sheet: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data\Eap0819_1t26_27to29corrected_7b9b_randomizedID.csv,
    • Data repository: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data,
    • Manifest repository: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data.
  7. Select “Import cluster positions from cluster file” and choose cluster file located here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data\GSPMA24v1_0-A_4349HNR_Samples.egt and click “Finish”.
  • GenomeStudio files were created using the following required files:
    • Sample sheet: as csv file (created above).
    • Data repository: as idat files (provided from Mylene Docquier).
    • Manifest file: as bpm file (provided from Mylene Docquier).
    • Cluster file: as egt file (provided from Mylene Docquier).

Part D: Copy files to SGG server.

  • PLINK files were copied to the SGG servers using FileZilla:
    • Host name: je4649@hpc1.chuv.ch
    • Password: <chuv-password>
    • Port: 22
  • Output saved to: /data/sgg2/jenny/projects/PSYMETAB_GWAS/data/raw.

All subsequent steps were performed on the SGG server and run using a drake plan.

Step 2: Pre-quality control data prep.

See qc_prep drake plan in code/plans.sh.

  • Processed sex and ethnicity files to be used in QC scripts.
  • The sex file was created according to the input format specified on the PLINK man page (FID, IID, sex [M/F]).
  • The ethnicity input file is used in an R script for comparison with genetically derived ethnic groups (from snpweights).
  • Ethnic groups are recoded as follows:
    • French codes are changed to English.
    • Missing is changed to unknown.
    • Small ethnic groups are grouped to missing.
  • The A1 rsid conversion file was updated to remove all SNPs labeled with a [.] (see data sources).
  • Create duplicates file.
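The ethnicity recoding rules can be illustrated with a short Python sketch. The French-to-English mapping and the cutoff for a "small" group below are hypothetical placeholders; the actual codes and threshold live in the qc_prep plan and are not given in this document.

```python
from collections import Counter

# Hypothetical mapping; the real French codes are not listed in this document.
FRENCH_TO_ENGLISH = {
    "caucasien": "caucasian",
    "africain": "african",
    "asiatique": "asian",
}
MIN_GROUP_SIZE = 2  # hypothetical cutoff for a "small" group


def recode_ethnicity(values, min_group_size=MIN_GROUP_SIZE):
    """Translate codes, label missing as 'unknown', then set small groups to None."""
    translated = []
    for value in values:
        if value is None or value.strip() == "":
            translated.append("unknown")
        else:
            key = value.strip().lower()
            translated.append(FRENCH_TO_ENGLISH.get(key, key))
    counts = Counter(translated)
    # Groups below the cutoff become missing (None); 'unknown' is kept as is.
    return [
        label if label == "unknown" or counts[label] >= min_group_size else None
        for label in translated
    ]
```

The order matters in this sketch: translation happens before counting, so French and English spellings of the same group are pooled before the size cutoff is applied.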

Step 3: Pre-imputation quality control.

Results of Steps 3-6 are saved to analysis/QC. The majority of analyses were performed using PLINK (either version 2.0 or 1.9). Each sub-step (i.e. 0-15) corresponds to one folder within analysis/QC.

Source code for Step 3 can be found at: code/pre_imputation_qc.sh.

0. Preprocessing.

  1. Create binary PLINK files (if necessary):
    • 761641 variants initially.
    • 2767 individuals initially.
  2. Exclude Y and MT variants:
    • 11641 Y / MT variants removed.
    • 750000 variants remaining.
  3. Remove duplicate variants:
    • 7945 duplicates removed.
    • 742055 variants remaining.
  4. Update sex:
    • Using the file located at: data/processed/phenotype_data/PSYMETAB_GWAS_sex.txt (created above).
    
       F    M 
    1298 1469 
  5. Remove duplicate individuals that were identified previously (i.e. two different IDs for the same participant):
    • 2 duplicates identified and removed.
    • 2752 individuals remaining.
  6. Sanity check: ensure that duplicates have the same genetic info.
    • From the PLINK webpage: Note that KING kinship coefficients are scaled such that duplicate samples have kinship 0.5, not 1. First-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc.
    • If all duplicate ID samples have KINSHIP ~0.5, then indeed they are genetic duplicates.
   #FID1         ID1 FID2      ID2   NSNP   HETHET          IBS0  KINSHIP
1   2071 BEEEDIGO002  224 BEEEDIGO 703110 0.178137 0.00000000000 0.499539
2   1873 CQLIXEZP002   64 CQLIXEZP 703045 0.153413 0.00000142238 0.499504
3   1965 EFWKQOIK002 1433 EFWKQOIK 697403 0.151525 0.00000860335 0.496680
4   1886 HFNWJHCI002 1448 HFNWJHCI 702845 0.153089 0.00000426837 0.499547
5   2075 HROOJNCI002  553 HROOJNCI 702167 0.155970 0.00000284833 0.499257
6   1974 IOAWLZGK002  549 IOAWLZGK 704278 0.153028 0.00000000000 0.499847
7   2314 KLFEBCIE002 1916 KLFEBCIE 700799 0.153949 0.00000570777 0.499007
8   2073 LWCGLSDP002  317 LWCGLSDP 702226 0.150114 0.00000427213 0.499363
9   2379 PBAIFEMQ002 2070 PBAIFEMQ 700642 0.154083 0.00000285452 0.498820
10  2009 PNWDYVRH002  494 PNWDYVRH 703993 0.153736 0.00000284094 0.499806
11  2068 QHNUPGWK002  318 QHNUPGWK 702891 0.154500 0.00000569078 0.499363
12  1928 QZAUHIPY002  559 QZAUHIPY 702896 0.144711 0.00000142269 0.499826
13  2067 SSITXXAY002  283 SSITXXAY 702409 0.152603 0.00000284734 0.499418
14  1947 WKBFDWJF002  566 WKBFDWJF 703642 0.153506 0.00000284235 0.499783
15  1657 XABRILAR002 1385 XABRILAR 698282 0.154672 0.00000000000 0.497315
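The check described in sub-step 6 can be expressed as a small Python sketch. The tolerance of 0.01 around 0.5 is an assumption chosen for illustration (the document only says "~0.5"), and the function name is hypothetical. Three rows from the table are used as input.

```python
def is_genetic_duplicate(id1, id2, kinship, tolerance=0.01):
    """True when the IDs differ only by the 002 suffix and KING kinship is ~0.5.

    The tolerance is an assumed value, not one stated in the QC pipeline.
    """
    same_person = id1 == f"{id2}002" or id2 == f"{id1}002"
    return same_person and abs(kinship - 0.5) <= tolerance


# (ID1, ID2, KINSHIP) for three of the pairs listed in the table above.
rows = [
    ("BEEEDIGO002", "BEEEDIGO", 0.499539),
    ("EFWKQOIK002", "EFWKQOIK", 0.496680),
    ("XABRILAR002", "XABRILAR", 0.497315),
]
all_confirmed = all(is_genetic_duplicate(*row) for row in rows)
```

A first-degree relative pair (kinship around 0.25) would fail this check even if the IDs matched, which is the point of the sanity check.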

1. Strand alignment (so all SNPs are on positive strand).

  1. Update chromosome
  2. Update position
  3. Flip alleles on negative strand
  4. Extract SNPs that are not in strand file
  5. Remove any non autosomal / X chromosomal variants (might have changed due to the pos and chr update):
    • 11 duplicates removed.
    • 741937 variants remaining.
  6. Update rsids using data/processed/reference_files/rsid_conversion.txt
  7. Remove duplicate variants:
    • 6723 duplicates removed.
    • 734328 variants remaining.

2. Removal of SNPs that have MAF zero.

  1. Calculate frequency of all SNPs
  2. Remove MAF 0 SNPs:
    • 93578 variants with MAF = 0.
    • 640750 variants remaining.
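As an illustration of what "MAF zero" means, the following Python sketch computes the minor allele frequency from genotypes and drops monomorphic variants. In the pipeline itself this is done with PLINK; the genotype coding (0, 1 or 2 copies of one allele, None for missing) and the function names are assumptions made for the example.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of one allele; None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def drop_monomorphic(variants):
    """Keep only variants whose minor allele is observed at least once."""
    return {
        name: genotypes
        for name, genotypes in variants.items()
        if minor_allele_frequency(genotypes) > 0
    }
```

A variant where every called genotype is 2 has an allele frequency of 1 and therefore a MAF of 0, so it is removed just like a variant where every genotype is 0.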

3. Missingness.

  1. Exclude variants with >10% missingness (using --geno 0.1):
    • 6480 variants removed.
    • 634270 variants remaining.
  2. Exclude individuals with >10% missingness (using --mind 0.1):
    • 3 individuals removed.
    • 2749 individuals remaining.
  3. Exclude variants with >5% missingness (using --geno 0.05):
    • 8256 variants removed.
    • 626014 variants remaining.
  4. Exclude individuals with >5% missingness (using --mind 0.05):
    • 3 individuals removed.
    • 2746 individuals remaining.
  5. Exclude variants with >1% missingness (using --geno 0.01):
    • 35957 variants removed.
    • 590057 variants remaining.
  6. Exclude individuals with >1% missingness (using --mind 0.01):
    • 5 individuals removed.
    • 2741 individuals remaining.

Total removed: 50693 variants (7.91%) and 11 individuals (0.40%).
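The totals follow directly from the counts reported for each pass. A minimal Python sketch of the bookkeeping, using the starting counts from the end of sub-steps 0 and 2 (2752 individuals, 640750 variants); the function name is hypothetical:

```python
def tally(start, removed_per_pass):
    """Return (remaining, total removed, percent of start removed)."""
    remaining = start
    for removed in removed_per_pass:
        remaining -= removed
    total_removed = start - remaining
    return remaining, total_removed, round(100 * total_removed / start, 2)


# Variant passes: --geno 0.1, 0.05, 0.01
variants = tally(640750, [6480, 8256, 35957])
# Individual passes: --mind 0.1, 0.05, 0.01
individuals = tally(2752, [3, 3, 5])
```

Here `variants` is `(590057, 50693, 7.91)` and `individuals` is `(2741, 11, 0.4)`, matching the summary line.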

4. Sex check.

  1. Perform sex check.
  2. Remove unambiguous sex violations.
    • 26 individuals removed.
    • 2715 individuals remaining.
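One way to read "unambiguous sex violation" is a sample whose genetic sex is confidently called and disagrees with the reported sex. The Python sketch below uses the PLINK 1.9 --check-sex default thresholds on the X chromosome inbreeding coefficient F (below 0.2 is called female, above 0.8 is called male). Whether the pipeline used these defaults, and this exact definition of "unambiguous", are assumptions; the function names are hypothetical.

```python
def genetic_sex(f_stat, female_max=0.2, male_min=0.8):
    """Call sex from the X-chromosome F estimate; None when the call is ambiguous."""
    if f_stat < female_max:
        return "F"
    if f_stat > male_min:
        return "M"
    return None


def is_unambiguous_violation(reported_sex, f_stat):
    """True when a confident genetic call contradicts the reported sex (M/F)."""
    called = genetic_sex(f_stat)
    return called is not None and reported_sex in ("M", "F") and called != reported_sex
```

Under this reading, a sample reported as female with F = 0.95 is removed, while a sample with F = 0.5 is left in because no confident genetic call can be made.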

5. Imputation preparation.

  1. Write frequency of final QC’d file (from #4) to file (using --freq).
  2. Using McCarthy Group Tools, QC’d files were prepared for imputation using the script HRC-1000G-check-bim-NoReadKey.pl (download link).
    • This script checks: Strand, alleles, position, Ref/Alt assignments and frequency differences.
    • Produces: A set of plink commands to update or remove SNPs based on the checks as well as a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.
    • Updates: Strand, position, ref/alt assignment.
    • Removes: A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference (can be removed/changed in V4.2.2), SNPs not in reference panel
  3. Replace underscores in the fam file, since the vcf conversion uses underscores between FID and IID.
  4. Run the generated Run-plink.sh script from #2.
  5. Sort the output vcf files.
  6. Download the zipped files to a personal or CHUV location, then copy them to the Michigan Imputation Server.
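The underscore replacement in item 3 can be sketched as follows. This is an illustrative Python version, not the code in code/pre_imputation_qc.sh; the replacement character ("-") is an assumption, since the document does not say what the underscores were replaced with.

```python
def strip_underscores_from_fam(fam_text, replacement="-"):
    """Replace underscores in FID and IID (the first two columns of a .fam file).

    The replacement character is an assumed choice for this example.
    """
    out = []
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            out.append(line)
            continue
        fields[0] = fields[0].replace("_", replacement)
        fields[1] = fields[1].replace("_", replacement)
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"
```

Only the ID columns are touched, so the parental IDs, sex and phenotype columns pass through unchanged. Without this step, an ID that already contains an underscore could not be split back into FID and IID unambiguously after the vcf conversion.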

Step 4: Imputation.

Source code for Step 4 can be found at: code/download_imputation.sh and code/check_imputation.sh.

6. Run and download imputation.

  1. Upload downloaded QC’d vcf.gz files to Michigan Imputation Server as follows:
    • Select Run, Genotype Imputation (Minimac4).
    • Reference panel: HRC r1.1 2016 (GRCh37/hg19).
    • Array build: GRCh37/hg19.
    • rsq Filter: off.
    • Phasing: Eagle v2.4 (phased output).
    • Population: EUR.
    • Mode: Quality Control & Imputation.

  2. Download the imputation results. Using the password from the email, retrieve the following files:
    • QC report.
    • QC stats.
    • Logs.
    • Imputation results.

 [1] "archive"           "chr1.info.gz"      "chr10.info.gz"    
 [4] "chr11.info.gz"     "chr12.info.gz"     "chr13.info.gz"    
 [7] "chr14.info.gz"     "chr15.info.gz"     "chr16.info.gz"    
[10] "chr17.info.gz"     "chr18.info.gz"     "chr19.info.gz"    
[13] "chr2.info.gz"      "chr20.info.gz"     "chr21.info.gz"    
[16] "chr22.info.gz"     "chr3.info.gz"      "chr4.info.gz"     
[19] "chr5.info.gz"      "chr6.info.gz"      "chr7.info.gz"     
[22] "chr8.info.gz"      "chr9.info.gz"      "qcreport.html"    
[25] "snps-excluded.txt"
   Chr Num imputed variants
1    1              3069931
2    2              3392237
3    3              2821894
4    4              2787581
5    5              2588168
6    6              2460111
7    7              2289305
8    8              2242705
9    9              1686471
10  10              1927503
11  11              1936990
12  12              1848117
13  13              1385433
14  14              1270436
15  15              1139215
16  16              1281297
17  17              1090072
18  18              1104755
19  19               868554
20  20               884983
21  21               531276
22  22               524544
23 all             39131578
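The "all" row is the sum of the 22 autosomal counts, which can be confirmed with a few lines of Python (values copied from the table above; the variable names are hypothetical):

```python
# Number of imputed variants per chromosome, copied from the table above.
IMPUTED_PER_CHR = {
    1: 3069931, 2: 3392237, 3: 2821894, 4: 2787581, 5: 2588168,
    6: 2460111, 7: 2289305, 8: 2242705, 9: 1686471, 10: 1927503,
    11: 1936990, 12: 1848117, 13: 1385433, 14: 1270436, 15: 1139215,
    16: 1281297, 17: 1090072, 18: 1104755, 19: 868554, 20: 884983,
    21: 531276, 22: 524544,
}

total_imputed = sum(IMPUTED_PER_CHR.values())
```

`total_imputed` equals 39131578, the value reported in the "all" row.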

7. Check imputation.

Step 5: Post imputation quality control.