Data overview

Last updated: 2019-02-12

workflowr checks:

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

✔ Repository version: 391ed92

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    code/method-npreg.Rmd/slurm-46123907.out
    Ignored:    code/method-npreg.Rmd/slurm-46123908.out
    Ignored:    code/method-npreg.Rmd/slurm-46123909.out
    Ignored:    code/method-npreg.Rmd/slurm-46123910.out
    Ignored:    code/method-npreg.Rmd/slurm-46123911.out
    Ignored:    code/method-npreg.Rmd/slurm-46123912.out
    Ignored:    code/method-npreg.Rmd/slurm-46123913.out
    Ignored:    code/method-npreg.Rmd/slurm-46123914.out
    Ignored:    code/method-npreg.Rmd/slurm-46123915.out
    Ignored:    code/method-npreg.Rmd/slurm-46123916.out
    Ignored:    code/npreg-methods.Rmd/slurm-45076734.out
    Ignored:    code/npreg/slurm-45320823.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893750.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893751.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893752.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893753.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893754.out
    Ignored:    code/trendfilter-individual.Rmd/slurm-49893755.out
    Ignored:    data/batch-paper/
    Ignored:    data/confess-rds/
    Ignored:    data/data-labwebsite/
    Ignored:    data/results/
    Ignored:    dsc/data/
    Ignored:    notes/
    Ignored:    output/npreg-trendfilter.Rmd/
    Ignored:    output_tmp/

Untracked files:
    Untracked:  analysis/cellcycler-seqdata-fucci.Rmd
    Untracked:  analysis/method-eval-bottcher.Rmd
    Untracked:  analysis/method-eval-buettner.Rmd
    Untracked:  analysis/method-eval-leng.Rmd
    Untracked:  analysis/norm-ercc-batch.Rmd
    Untracked:  analysis/norm-ercc-fucci.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Past versions:

File	Version	Author	Date	Message
Rmd	0201c1e	Joyce Hsiao	2019-02-12	add lab page data link
Rmd	435d656	Chiaowen Joyce Hsiao	2019-02-12	Merge branch ‘master’ into master
html	435d656	Chiaowen Joyce Hsiao	2019-02-12	Merge branch ‘master’ into master
Rmd	9c05421	Chiaowen Joyce Hsiao	2019-02-12	Merge pull request #57 from jdblischak/download
html	85b83c0	Joyce Hsiao	2019-01-31	Build site.
Rmd	453d90b	Joyce Hsiao	2019-01-31	minor edits
Rmd	2fd7c02	Joyce Hsiao	2019-01-31	minor edits
html	e87325c	Joyce Hsiao	2018-04-11	Build site.
Rmd	816dccd	Joyce Hsiao	2018-04-11	tidy up data processing overview
html	19db0c4	Joyce Hsiao	2018-04-10	Build site.
Rmd	77d10d4	Joyce Hsiao	2018-04-10	save final dataset to data/eset-final.rds: add estimated cell time to
html	28cf63f	Joyce Hsiao	2018-04-09	Build site.
Rmd	a1584da	Joyce Hsiao	2018-01-31	edits
Rmd	78d5a2c	Joyce Hsiao	2018-01-16	data description
html	b5c6d55	Joyce Hsiao	2017-12-23	Build site.
html	820a38a	Joyce Hsiao	2017-12-13	Build site.
Rmd	7509725	Joyce Hsiao	2017-12-13	wflow_publish(“analysis/data-overview.Rmd”)

Data availability

This document lists the datasets analyzed in the study.
We stored all datasets as ExpressionSet objects (these require the Biobase package).
We also provide the data in TXT format on the Gilad lab website for the 11,093 genes and 888 samples analyzed in the study: molecule counts, sample phenotypes, gene information, phenotype label descriptions, and FUCCI intensity data.

Data structure

We collected two types of data for each single-cell sample: single-cell RNA-seq data from C1 plates and FUCCI image intensity data.

Raw RNA-seq data: data/eset-raw.rds
Filtered RNA-seq data: data/eset-filtered.rds
FUCCI intensity data: data/intensity.rds
FUCCI intensity data adjusted for batch effect: output/images-normalize-anova.Rmd/pdata.adj.rds
Final data combining filtered intensity and RNA-seq data, including 11,093 genes and 888 samples: data/eset-final.rds

The code that generates data/eset-final.rds from data/eset-raw.rds is stored in code/output-raw-2-final.R.

Downloading the data files

You have two main options for downloading the data files. First, you can manually download the individual files by clicking on the links on this page or navigating to the files in the fucci-seq GitHub repository. This is the recommended strategy if you only need a few data files.

Second, you can install git-lfs. To handle large files, we used Git Large File Storage (LFS). This means that the files you download with git clone are only plain-text pointer files that contain identifiers for the actual files saved on GitHub’s servers. If you want to download all of the data files at once, you can do this after you install git-lfs.

To install git-lfs, follow their instructions to download, install, and set up (git lfs install). Alternatively, if you use conda, you can install git-lfs with conda install -c conda-forge git-lfs. Once installed, you can download the latest version of the data files with git lfs pull.
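One way to tell whether a file in your clone is still an LFS pointer rather than the real data is to inspect its first bytes. The sketch below assumes the standard Git LFS pointer format, whose first line is "version https://git-lfs.github.com/spec/v1". The helper name is_lfs_pointer is hypothetical and is not part of this repository.

```r
# Hypothetical helper (not part of the fucci-seq repository):
# returns TRUE if `path` looks like a Git LFS pointer file.
is_lfs_pointer <- function(path) {
  # Pointer files are small text files that begin with this line
  prefix <- charToRaw("version https://git-lfs.github.com/spec/v1")
  # Read only as many bytes as the prefix has; works for binary files too
  head_bytes <- readBin(path, what = "raw", n = length(prefix))
  identical(head_bytes, prefix)
}

# Demonstration on temporary files (toy content, not study data)
pointer_file <- tempfile()
writeLines(c("version https://git-lfs.github.com/spec/v1",
             "oid sha256:0000",   # placeholder identifier
             "size 10"),
           pointer_file)
real_file <- tempfile(fileext = ".rds")
saveRDS(1:10, real_file)

is_lfs_pointer(pointer_file)
is_lfs_pointer(real_file)
```

For example, if is_lfs_pointer("data/eset-final.rds") returns TRUE, the data file has not been downloaded yet and git lfs pull is still needed.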

How to access ExpressionSets

We store feature-level (gene) read counts and molecule counts in ExpressionSet objects (data/eset), which also contain sample metadata (e.g., assigned individual ID, cDNA concentration) and quality filtering criteria (e.g., number of reads mapped to FUCCI transgenes, ERCC conversion rate). Data from different C1 plates are stored in separate eset objects in data/eset/.

To combine eset objects from the different C1 plates:

library(Biobase)  # provides the combine() method for ExpressionSet objects
eset <- Reduce(combine, Map(readRDS, Sys.glob("data/eset/*.rds")))
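The one-liner above chains three steps: Sys.glob() lists the files, Map() reads each file into a list element, and Reduce() folds the list pairwise with a combining function. The sketch below shows the same pattern on small data frames, using rbind as a stand-in for combine so that it runs without Biobase. The plate IDs, column names, and file names are hypothetical.

```r
# Write two small per-plate tables to a temporary directory
# (hypothetical plate IDs and columns, for illustration only)
tmp_dir <- file.path(tempdir(), "eset-demo")
dir.create(tmp_dir, showWarnings = FALSE)
saveRDS(data.frame(sample_id = c("p1.A01", "p1.A02"), plate = "p1",
                   stringsAsFactors = FALSE),
        file.path(tmp_dir, "p1.rds"))
saveRDS(data.frame(sample_id = c("p2.A01", "p2.A02", "p2.A03"), plate = "p2",
                   stringsAsFactors = FALSE),
        file.path(tmp_dir, "p2.rds"))

# Step 1: list the files that match the pattern
paths <- Sys.glob(file.path(tmp_dir, "*.rds"))

# Step 2: read each file; the result is a list with one element per file
per_plate <- Map(readRDS, paths)

# Step 3: fold the list into one object, two elements at a time
# (rbind here stands in for Biobase's combine)
merged <- Reduce(rbind, per_plate)
nrow(merged)
```

With ExpressionSet objects the structure is the same; only the combining function changes.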

To access data stored in an ExpressionSet:

exprs(eset): access count data, 20,421 features by 1,536 single-cell samples.
pData(eset): access sample metadata. Returns a data.frame of 1,536 samples by 43 labels. Use varMetadata(phenoData(eset)) to view label descriptions.
fData(eset): access feature metadata. Returns a data.frame of 20,421 features by 6 labels. Use varMetadata(featureData(eset)) to view label descriptions.
varMetadata(phenoData(eset)): view the sample metadata labels.
varMetadata(featureData(eset)): view the feature (gene) metadata labels.
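To see what these accessors return without downloading the data, the sketch below builds a toy ExpressionSet and calls each one. It assumes the Biobase package is installed. All gene names, sample names, labels, and values are hypothetical and are not taken from the study.

```r
library(Biobase)

# Toy count matrix: 3 genes by 2 samples (hypothetical values)
counts <- matrix(c(5, 0, 2, 1, 7, 3), nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("cellA", "cellB")))

# Sample metadata with one label and its description
pheno <- AnnotatedDataFrame(
  data = data.frame(chip_id = c("ind1", "ind2"),
                    row.names = c("cellA", "cellB"),
                    stringsAsFactors = FALSE),
  varMetadata = data.frame(labelDescription = "Assigned individual ID (toy)",
                           row.names = "chip_id",
                           stringsAsFactors = FALSE))

# Feature (gene) metadata with one label and its description
feat <- AnnotatedDataFrame(
  data = data.frame(chr = c("1", "2", "X"),
                    row.names = rownames(counts),
                    stringsAsFactors = FALSE),
  varMetadata = data.frame(labelDescription = "Chromosome (toy)",
                           row.names = "chr",
                           stringsAsFactors = FALSE))

toy <- ExpressionSet(assayData = counts, phenoData = pheno, featureData = feat)

dim(exprs(toy))               # features by samples
pData(toy)                    # sample metadata as a data.frame
fData(toy)                    # feature metadata as a data.frame
varMetadata(phenoData(toy))   # descriptions of the sample labels
varMetadata(featureData(toy)) # descriptions of the feature labels
```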

Additional data information

FUCCI intensity data

Combined intensity data are stored in data/intensity.rds. These include samples that were identified as having a single nucleus.
The data were generated by combine-intensity-data.R, which combines the image analysis output stored in /project2/gilad/fucci-seq/intensities_stats/ into one data.frame and computes summary statistics, including background-corrected RFP and GFP intensity measures.

Sequencing data

Raw data from each C1 plate are stored separately in data/eset/ by experiment (batch) ID.
Raw data combined across C1 plates are stored in data/eset-raw.rds.
Filtered data, which exclude low-quality sequencing samples and genes that are lowly or overly expressed, are stored in data/eset-filtered.rds.

Phenotypic data of singleton samples

Data file 1: all 1536 samples before filtering (output/data-overview.Rmd/phenotypes_allsamples.txt)
Data file 2: 888 samples after filtering (output/data-overview.Rmd/phenotypes_singletonsamples.txt)
Data file 3: phenotype labels (output/data-overview.Rmd/phenotypes_labels.txt)

library(Biobase)

# Export phenotype data for all samples before filtering
eset_raw <- readRDS("../data/eset-raw.rds")
df <- data.frame(sample_id = rownames(pData(eset_raw)), pData(eset_raw),
                 stringsAsFactors = FALSE)
write.table(df, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE,
            file = "../output/data-overview.Rmd/phenotypes_allsamples.txt")

# Export phenotype data for the samples in the final dataset
eset_final <- readRDS("../data/eset-final.rds")
df <- data.frame(sample_id = rownames(pData(eset_final)), pData(eset_final),
                 stringsAsFactors = FALSE)
write.table(df, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE,
            file = "../output/data-overview.Rmd/phenotypes_singletonsamples.txt")

# Export label descriptions: all labels in the raw data,
# plus rows 45 to 54 of the label table in the final data
labels <- data.frame(var_names = rownames(varMetadata(eset_raw)),
                     labels = varMetadata(eset_raw)$labelDescription,
                     stringsAsFactors = FALSE)
labels <- rbind(labels,
                data.frame(var_names = rownames(varMetadata(eset_final)),
                           labels = varMetadata(eset_final)$labelDescription,
                           stringsAsFactors = FALSE)[45:54, ])
write.table(labels, quote = FALSE,
            sep = "\t", row.names = FALSE, col.names = TRUE,
            file = "../output/data-overview.Rmd/phenotypes_labels.txt")

# Check that the exported files can be read back
library(data.table)
df_all <- fread(file = "../output/data-overview.Rmd/phenotypes_allsamples.txt")
df_singles <- fread(file = "../output/data-overview.Rmd/phenotypes_singletonsamples.txt")
df_labels <- fread(file = "../output/data-overview.Rmd/phenotypes_labels.txt")

eset_final <- readRDS("../data/eset-final.rds")

# Export phenotype data for all samples before filtering
df <- data.frame(sample_id = rownames(pData(eset_raw)), pData(eset_raw),
                 stringsAsFactors = FALSE)

write.table(df, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE,
            file = "../output/data-overview.Rmd/phenotypes_allsamples.txt")

# Export the label descriptions for the raw data
write.table(data.frame(var_names = rownames(varMetadata(eset_raw)),
                       labels = varMetadata(eset_raw)$labelDescription,
                       stringsAsFactors = FALSE),
            quote = FALSE,
            sep = "\t", row.names = FALSE, col.names = TRUE,
            file = "../output/data-overview.Rmd/phenotypes_allsamples_labels.txt")

# Check that the exported files can be read back
library(data.table)
df_test <- fread(file = "../output/data-overview.Rmd/phenotypes_allsamples.txt")
df_labels_test <- fread(file = "../output/data-overview.Rmd/phenotypes_allsamples_labels.txt")
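
The code above reads the exported files back but does not compare them with what was written. A minimal sketch of such a round-trip check follows, using a toy table and base R's read.delim in place of fread so that it runs without extra packages. The column names and values are hypothetical.

```r
# Toy phenotype table (hypothetical columns and values)
pheno <- data.frame(sample_id = c("s1", "s2", "s3"),
                    cdna_conc = c(1.5, 0.8, 2.1),
                    stringsAsFactors = FALSE)

# Write with the same options used for the phenotype files above
out_file <- tempfile(fileext = ".txt")
write.table(pheno, file = out_file, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Read the file back and compare it with the original table
pheno_back <- read.delim(out_file, stringsAsFactors = FALSE)
same_shape <- identical(dim(pheno_back), dim(pheno))
same_names <- identical(names(pheno_back), names(pheno))
same_values <- isTRUE(all.equal(pheno_back, pheno))
c(same_shape, same_names, same_values)
```

Because the files are written with quote = FALSE, a field that itself contains a tab would break this check, which is one reason to run it on the real output.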

Session information

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.1.1   Rcpp_1.0.0        digest_0.6.18    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] magrittr_1.5      git2r_0.23.0      evaluate_0.12    
[10] stringi_1.2.4     whisker_0.3-2     R.oo_1.22.0      
[13] R.utils_2.7.0     rmarkdown_1.10    tools_3.5.1      
[16] stringr_1.3.1     yaml_2.2.0        compiler_3.5.1   
[19] htmltools_0.3.6   knitr_1.20

This reproducible R Markdown analysis was created with workflowr 1.1.1