Human dermal fibroblast clonality project

Last updated: 2018-09-02

workflowr checks:

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

✔ Repository version: f5a4631

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .vscode/
    Ignored:    code/.DS_Store
    Ignored:    data/raw/
    Ignored:    src/.DS_Store
    Ignored:    src/Rmd/.Rhistory

Untracked files:
    Untracked:  Snakefile_clonality
    Untracked:  Snakefile_somatic_calling
    Untracked:  code/analysis_for_garx.Rmd
    Untracked:  code/selection/
    Untracked:  code/yuanhua/
    Untracked:  data/canopy/
    Untracked:  data/cell_assignment/
    Untracked:  data/de_analysis_FTv62/
    Untracked:  data/donor_info_070818.txt
    Untracked:  data/donor_info_core.csv
    Untracked:  data/donor_neutrality.tsv
    Untracked:  data/exome-point-mutations/
    Untracked:  data/fdr10.annot.txt.gz
    Untracked:  data/human_H_v5p2.rdata
    Untracked:  data/human_c2_v5p2.rdata
    Untracked:  data/human_c6_v5p2.rdata
    Untracked:  data/neg-bin-rsquared-petr.csv
    Untracked:  data/neutralitytestr-petr.tsv
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
    Untracked:  data/sces/
    Untracked:  data/selection/
    Untracked:  data/simulations/
    Untracked:  data/variance_components/
    Untracked:  figures/
    Untracked:  output/differential_expression/
    Untracked:  output/donor_specific/
    Untracked:  output/line_info.tsv
    Untracked:  output/nvars_by_category_by_donor.tsv
    Untracked:  output/nvars_by_category_by_line.tsv
    Untracked:  output/variance_components/
    Untracked:  references/
    Untracked:  tree.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Past versions:

File	Version	Author	Date	Message
html	f0ed980	davismcc	2018-08-31	Build site.
html	ca3438f	davismcc	2018-08-29	Build site.
html	e573f2f	davismcc	2018-08-27	Build site.
html	9ec2a59	davismcc	2018-08-26	Build site.
Rmd	cae617f	davismcc	2018-08-26	Updating simulation analyses
html	36acf15	davismcc	2018-08-25	Build site.
Rmd	56d90a6	davismcc	2018-08-25	Completing index with descriptions of data availability and new analyses.
Rmd	d618fe5	davismcc	2018-08-25	Updating analyses
html	090c1b9	davismcc	2018-08-24	Build site.
html	02a8343	davismcc	2018-08-24	Build site.
Rmd	97e062e	davismcc	2018-08-24	Updating Rmd’s
Rmd	43f15d6	davismcc	2018-08-24	Adding data pre-processing workflow and updating analyses.
html	d2e8b31	davismcc	2018-08-19	Build site.
html	1489d32	davismcc	2018-08-17	Add html files
Rmd	6b5f8c7	davismcc	2018-08-17	Updating organisational pages.
Rmd	1cbadbd	davismcc	2018-08-10	Updating analyses.
html	2531565	davismcc	2018-08-08	Tweaking clone prevalences
Rmd	7397e00	davismcc	2018-08-08	Updating stylez and tweaking Rmds
html	9856275	davismcc	2018-08-07	Build site.
Rmd	5fc189d	davismcc	2018-08-07	Start workflowr project.

Project overview

This project investigates clonality in human dermal fibroblast cell populations in 32 cell lines from distinct donors, using bulk whole-exome sequencing and single-cell RNA-sequencing data.

Key findings:

A novel approach for integrating DNA-seq and single-cell RNA-seq data to reconstruct clonal substructure and single-cell transcriptomes.
A new computational method, cardelino, to map single-cell RNA-seq profiles to clones.
Evidence for non-neutral evolution of clonal populations in human fibroblasts.
Proliferation and cell cycle pathways are commonly distorted in mutated clonal populations, with implications for cancer and ageing.

For a richer overview, see the About page.

Data pre-processing

Pre-processing the raw data for this project is complicated and computationally expensive, so this repository does not reproduce it in an automated way. However, we provide the source code for the Snakemake workflow used for data pre-processing in this repository. Docker images providing the computing environment and software used are publicly available: one image for command-line bioinformatics tools and one for an R installation with the necessary packages.

If you would like to pre-process the data from raw reads to results as we have, please consult our description of how to run the workflow.

Analyses

Here we present the reproducible results of our analyses. They were generated by rendering the R Markdown documents into webpages available at the links below.

The results presented in the paper were produced with these analyses.

Data availability

This is a complicated project, and reproducing all of the results presented, especially from raw data, is highly non-trivial. Nevertheless, we have made all data available so that everything is entirely reproducible.

Single-cell RNA-seq data have been deposited in the ArrayExpress database at EMBL-EBI under accession number E-MTAB-7167. Whole-exome sequencing data are available through the HipSci portal. Processed data and large results files are available from Zenodo with DOI 10.5281/zenodo.1403510.

To set up the project to reproduce our analyses, first clone the source code repository from GitHub. Next, download all of the reference, metadata and results files and add them to the (cloned) project folder with the following structure:

.
├── data
│   ├── canopy
│   │   └── canopy_results.*.rds
│   ├── cell_assignment
│   │   └── cardelino_results.*.rds
│   ├── de_analysis_FTv62
│   │   ├── cellcycle_analyses
│   │   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.rds
│   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.rds
│   ├── donor_info_070818.txt
│   ├── donor_info_core.csv
│   ├── donor_neutrality.tsv
│   ├── exome-point-mutations
│   │   ├── high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
│   │   └── high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
│   ├── human_H_v5p2.rdata
│   ├── human_c2_v5p2.rdata
│   ├── human_c6_v5p2.rdata
│   ├── neg-bin-rsquared-petr.csv
│   ├── neutralitytestr-petr.tsv
│   ├── sces
│   │   └── sce_*.rds
│   ├── selection
│   │   ├── neg-bin-params-fit.csv
│   │   └── neg-bin-rsquared-fit.csv
│   ├── simulations
│   │   ├── *.filt_lenient.cell_coverage_sites.mult.rds
│   │   └── *.simulate.rds
│   └── variance_components
│       ├── covar_all.csv
│       ├── donorVar
│       │   └── *.var_part.var1.csv
│       ├── fit_all_gene_highVar.csv
│       ├── fit_per_gene_highVar.csv
│       ├── gene_info_all.csv
│       └── logcnt_all.csv
├── metadata
│   ├── cell_metadata.csv
│   └── data_processing_metadata.tsv
├── references
│   ├── 1000G_phase1.indels.hg19.sites.vcf.gz
│   ├── GRCh37.p13.genome.ERCC92.fa
│   ├── Homo_sapiens.GRCh37.rel75.cdna.all.ERCC92.fa.gz
│   ├── Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
│   ├── dbsnp_138.hg19.biallelicSNPs.HumanCoreExome12.Top1000ExpressedIpsGenes.Maf0.01.HWE0.0001.HipSci.vcf.gz
│   ├── dbsnp_138.hg19.vcf.gz
│   ├── gencode.v19.annotation_ERCC.gtf
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz.csi
│   └── knownIndels.intervals

For simplicity, we ignore all the directories and files present in the source code repository (which you should have cloned) to focus just on where you should add the files downloaded from Zenodo. Yes, it’s still complicated, but such is life.

There is a large number of canopy_results.*.rds files: these should be stored in the data/canopy directory. Similarly, all of the cardelino_results.*.rds files should be stored in data/cell_assignment. All of the SingleCellExperiment object files (sce_*.rds) should be stored in data/sces. Simulation results files (*.mult.rds; *.simulate.rds) should be stored in data/simulations. Variance components results should be stored in data/variance_components as shown above.

Differential expression results belong in data/de_analysis_FTv62.

Metadata files belong in metadata. Reference files belong in references.
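The placement rules for the files matched by wildcard patterns can be sketched as a small Python helper. This is only an illustration, not part of the project's code: it assumes the Zenodo files have been downloaded into a single flat folder, and the names DESTINATIONS, destination_for and sort_downloads are hypothetical. Files that match none of the patterns (metadata, references, differential expression and variance components results) are left alone, to be placed by hand as shown in the tree.

```python
from fnmatch import fnmatch
from pathlib import Path
import shutil

# Filename pattern -> project-relative directory, taken from the text above.
# The first matching pattern wins. All names here are illustrative.
DESTINATIONS = [
    ("canopy_results.*.rds", "data/canopy"),
    ("cardelino_results.*.rds", "data/cell_assignment"),
    ("sce_*.rds", "data/sces"),
    ("*.mult.rds", "data/simulations"),
    ("*.simulate.rds", "data/simulations"),
]


def destination_for(filename):
    """Return the project-relative directory for a file name, or None."""
    for pattern, directory in DESTINATIONS:
        if fnmatch(filename, pattern):
            return directory
    return None


def sort_downloads(download_dir, project_dir, dry_run=True):
    """Plan (and optionally perform) moves from download_dir into the project.

    Returns a list of (source, target) pairs. With dry_run=True nothing is
    moved, so the plan can be inspected first.
    """
    moves = []
    for source in sorted(Path(download_dir).iterdir()):
        if not source.is_file():
            continue
        directory = destination_for(source.name)
        if directory is None:
            continue  # unrecognised files are left for manual placement
        target = Path(project_dir) / directory / source.name
        moves.append((source, target))
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
    return moves
```

Calling sort_downloads with dry_run=True first lists the planned moves without touching any files; rerunning with dry_run=False then performs them.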

With the data downloaded and organised as above, you will be able to reproduce the analyses presented in the R Markdown files linked to above and, if desired, even run the whole analysis pipeline from raw reads to results following these instructions.
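Before rendering any analysis, it may help to confirm that the layout is in place. The following Python sketch checks a handful of the paths from the directory tree and reports those that are missing. It is an illustration only: the EXPECTED list is a deliberately small subset chosen for this example, and missing_paths is a hypothetical helper, not something provided by the repository.

```python
from pathlib import Path

# A small subset of the paths shown in the directory tree above;
# entries may contain glob wildcards. Extend as needed.
EXPECTED = [
    "data/canopy/canopy_results.*.rds",
    "data/cell_assignment/cardelino_results.*.rds",
    "data/sces/sce_*.rds",
    "data/simulations/*.simulate.rds",
    "data/donor_info_core.csv",
    "metadata/cell_metadata.csv",
    "references/gencode.v19.annotation_ERCC.gtf",
]


def missing_paths(project_dir, expected=EXPECTED):
    """Return the expected patterns that match nothing under project_dir."""
    root = Path(project_dir)
    return [pattern for pattern in expected if not any(root.glob(pattern))]


if __name__ == "__main__":
    # Run from the root of the cloned project folder.
    for pattern in missing_paths("."):
        print("missing:", pattern)
```

An empty result only means that the listed patterns each match at least one file; it does not verify file contents or that every per-donor file is present.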

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This reproducible R Markdown analysis was created with workflowr 1.1.1

Davis J. McCarthy