Last updated: 2018-11-09
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Repository version: f98a31e
Make sure that all relevant files have been committed to Git before generating the results (you can use `wflow_publish` or `wflow_git_commit`). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .vscode/
Ignored: code/.DS_Store
Ignored: data/raw/
Ignored: src/.DS_Store
Ignored: src/Rmd/.Rhistory
Untracked files:
Untracked: Snakefile_clonality
Untracked: Snakefile_somatic_calling
Untracked: code/analysis_for_garx.Rmd
Untracked: code/selection/
Untracked: code/yuanhua/
Untracked: data/canopy/
Untracked: data/cell_assignment/
Untracked: data/de_analysis_FTv62/
Untracked: data/donor_info_070818.txt
Untracked: data/donor_info_core.csv
Untracked: data/donor_neutrality.tsv
Untracked: data/exome-point-mutations/
Untracked: data/fdr10.annot.txt.gz
Untracked: data/human_H_v5p2.rdata
Untracked: data/human_c2_v5p2.rdata
Untracked: data/human_c6_v5p2.rdata
Untracked: data/neg-bin-rsquared-petr.csv
Untracked: data/neutralitytestr-petr.tsv
Untracked: data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
Untracked: data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
Untracked: data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
Untracked: data/sces/
Untracked: data/selection/
Untracked: data/simulations/
Untracked: data/variance_components/
Untracked: figures/
Untracked: output/differential_expression/
Untracked: output/donor_specific/
Untracked: output/line_info.tsv
Untracked: output/nvars_by_category_by_donor.tsv
Untracked: output/nvars_by_category_by_line.tsv
Untracked: output/variance_components/
Untracked: references/
Untracked: tree.txt
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 218d792 | John Blischak | 2018-09-11 | Fix some links on homepage. |
html | 0540cdb | davismcc | 2018-09-02 | Build site. |
html | f0ed980 | davismcc | 2018-08-31 | Build site. |
html | ca3438f | davismcc | 2018-08-29 | Build site. |
html | e573f2f | davismcc | 2018-08-27 | Build site. |
html | 9ec2a59 | davismcc | 2018-08-26 | Build site. |
Rmd | cae617f | davismcc | 2018-08-26 | Updating simulation analyses |
html | 36acf15 | davismcc | 2018-08-25 | Build site. |
Rmd | 56d90a6 | davismcc | 2018-08-25 | Completing index with descriptions of data availability and new analyses. |
Rmd | d618fe5 | davismcc | 2018-08-25 | Updating analyses |
html | 090c1b9 | davismcc | 2018-08-24 | Build site. |
html | 02a8343 | davismcc | 2018-08-24 | Build site. |
Rmd | 97e062e | davismcc | 2018-08-24 | Updating Rmd’s |
Rmd | 43f15d6 | davismcc | 2018-08-24 | Adding data pre-processing workflow and updating analyses. |
html | d2e8b31 | davismcc | 2018-08-19 | Build site. |
html | 1489d32 | davismcc | 2018-08-17 | Add html files |
Rmd | 6b5f8c7 | davismcc | 2018-08-17 | Updating organisational pages. |
Rmd | 1cbadbd | davismcc | 2018-08-10 | Updating analyses. |
html | 2531565 | davismcc | 2018-08-08 | Tweaking clone prevalences |
Rmd | 7397e00 | davismcc | 2018-08-08 | Updating stylez and tweaking Rmds |
html | 9856275 | davismcc | 2018-08-07 | Build site. |
Rmd | 5fc189d | davismcc | 2018-08-07 | Start workflowr project. |
This project investigates clonality in human dermal fibroblast cell populations in 32 cell lines from distinct donors, using bulk whole-exome sequencing and single-cell RNA-sequencing data.
Key findings:
For a richer overview, see the About page.
Pre-processing the raw data for this project is complicated and computationally expensive, so this repository does not reproduce the pre-processing in an automated way. However, we provide the source code for the Snakemake pre-processing workflow in this repository. Docker images providing the computing environment and software used are publicly available, split into one image for command-line bioinformatics tools and one with an R installation and the necessary packages.
If you would like to pre-process the data from raw reads to results as we have, please consult our description of how to run the workflow.
Here we present the reproducible results of our analyses. They were generated by rendering the R Markdown documents into webpages available at the links below.
The results presented in the paper were produced with these analyses.
This is a complicated project, and reproducing all of the results presented, especially from raw data, is highly non-trivial. Nevertheless, we have made all data available so that everything is entirely reproducible.
Single-cell RNA-seq data have been deposited in the ArrayExpress database at EMBL-EBI under accession number E-MTAB-7167. Whole-exome sequencing data are available through the HipSci portal. Processed data and large results files are available from Zenodo with DOI 10.5281/zenodo.1403510.
To set up the project to reproduce our analyses, first clone the source code repository from GitHub. Next, download all of the reference, metadata and results files and add them to the (cloned) project folder with the following structure:
.
├── data
│   ├── canopy
│   │   ├── canopy_results.*.rds
│   ├── cell_assignment
│   │   ├── cardelino_results.*.rds
│   ├── de_analysis_FTv62
│   │   ├── cellcycle_analyses
│   │   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.rds
│   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.rds
│   ├── donor_info_070818.txt
│   ├── donor_info_core.csv
│   ├── donor_neutrality.tsv
│   ├── exome-point-mutations
│   │   ├── high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
│   │   └── high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
│   ├── human_H_v5p2.rdata
│   ├── human_c2_v5p2.rdata
│   ├── human_c6_v5p2.rdata
│   ├── neg-bin-rsquared-petr.csv
│   ├── neutralitytestr-petr.tsv
│   ├── sces
│   │   ├── sce_*.rds
│   ├── selection
│   │   ├── neg-bin-params-fit.csv
│   │   ├── neg-bin-rsquared-fit.csv
│   ├── simulations
│   │   ├── *.filt_lenient.cell_coverage_sites.mult.rds
│   │   ├── *.simulate.rds
│   └── variance_components
│       ├── covar_all.csv
│       ├── donorVar
│       │   ├── *.var_part.var1.csv
│       ├── fit_all_gene_highVar.csv
│       ├── fit_per_gene_highVar.csv
│       ├── gene_info_all.csv
│       └── logcnt_all.csv
├── metadata
│   ├── cell_metadata.csv
│   └── data_processing_metadata.tsv
├── references
│   ├── 1000G_phase1.indels.hg19.sites.vcf.gz
│   ├── GRCh37.p13.genome.ERCC92.fa
│   ├── Homo_sapiens.GRCh37.rel75.cdna.all.ERCC92.fa.gz
│   ├── Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
│   ├── dbsnp_138.hg19.biallelicSNPs.HumanCoreExome12.Top1000ExpressedIpsGenes.Maf0.01.HWE0.0001.HipSci.vcf.gz
│   ├── dbsnp_138.hg19.vcf.gz
│   ├── gencode.v19.annotation_ERCC.gtf
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz.csi
│   └── knownIndels.intervals
For simplicity, we ignore all the directories and files present in the source code repository (which you should have cloned) to focus just on where you should add the files downloaded from Zenodo. Yes, it’s still complicated, but such is life.
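Before rendering any analysis, it can help to check that the downloaded files ended up where the tree above expects them. The following is a minimal sketch in Python, not part of the project's own code: the function name `missing_entries` and the `EXPECTED` list are hypothetical, and the list covers only a handful of paths copied from the tree, so extend it as needed.

```python
from pathlib import Path

# A small, illustrative subset of the entries shown in the tree above.
# Entries containing "*" are glob patterns that must match at least one file;
# all other entries must exist exactly as written.
EXPECTED = [
    "data/canopy/canopy_results.*.rds",
    "data/cell_assignment/cardelino_results.*.rds",
    "data/sces/sce_*.rds",
    "data/simulations/*.simulate.rds",
    "data/donor_info_core.csv",
    "metadata/cell_metadata.csv",
    "references/knownIndels.intervals",
]


def missing_entries(project_root, expected=EXPECTED):
    """Return the expected entries that are absent under project_root."""
    root = Path(project_root)
    missing = []
    for entry in expected:
        if "*" in entry:
            # A pattern counts as present if it matches at least one path.
            found = any(root.glob(entry))
        else:
            found = (root / entry).exists()
        if not found:
            missing.append(entry)
    return missing
```

Calling `missing_entries(".")` from the cloned project folder returns an empty list when every listed entry is in place, and otherwise the entries still to be added.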
There is a large number of `canopy_results.*.rds` files: these should be stored in the `data/canopy` directory. Similarly, all of the `cardelino_results.*.rds` files should be stored in `data/cell_assignment`. All of the SingleCellExperiment object files (`sce_*.rds`) should be stored in `data/sces`. Simulation results files (`*.mult.rds`; `*.simulate.rds`) should be stored in `data/simulations`. Variance components results should be stored in `data/variance_components` as shown above.
Differential expression results belong in `data/de_analysis_FTv62`.
Metadata files belong in `metadata`. Reference files belong in `references`.
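The pattern-based placement rules above lend themselves to a small helper script. The sketch below is in Python and is only an illustration under stated assumptions: it assumes the Zenodo files were downloaded into a single flat folder, and the names `RULES`, `destination_for` and `sort_downloads` are hypothetical. It covers only the files for which the text gives a filename pattern; differential expression, variance components, metadata and reference files still need to be placed by hand.

```python
import shutil
from fnmatch import fnmatchcase
from pathlib import Path

# (filename pattern, destination directory) pairs taken from the text above.
# The first matching rule wins.
RULES = [
    ("canopy_results.*.rds", "data/canopy"),
    ("cardelino_results.*.rds", "data/cell_assignment"),
    ("sce_*.rds", "data/sces"),
    ("*.mult.rds", "data/simulations"),
    ("*.simulate.rds", "data/simulations"),
]


def destination_for(filename, rules=RULES):
    """Return the destination directory for filename, or None if no rule matches."""
    for pattern, dest in rules:
        if fnmatchcase(filename, pattern):
            return dest
    return None


def sort_downloads(download_dir, project_root, dry_run=True):
    """Plan (and optionally perform) moves from download_dir into project_root.

    Returns a list of (source, target) path pairs. With dry_run=True nothing
    is moved, so the plan can be inspected first.
    """
    moves = []
    for src in sorted(Path(download_dir).iterdir()):
        if not src.is_file():
            continue
        dest = destination_for(src.name)
        if dest is None:
            continue  # leave unrecognised files where they are
        target = Path(project_root) / dest / src.name
        moves.append((src, target))
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(target))
    return moves
```

Running `sort_downloads(download_dir, project_root)` first with the default `dry_run=True` lists the planned moves without touching any files; passing `dry_run=False` then carries them out.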
With the data downloaded and organised as above, you will be able to reproduce the analyses presented in the R Markdown files linked above and, if desired, even run the whole analysis pipeline from raw reads to results following these instructions.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This reproducible R Markdown analysis was created with workflowr 1.1.1