Last updated: 2019-02-14

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20181214)

The command set.seed(20181214) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 8d9fe9a
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    BUSpaRse_notebooks.Rproj
    Ignored:    data/fastqs/
    Ignored:    data/hgmm_100_fastqs.tar
    Ignored:    data/hgmm_1k_fastqs.tar
    Ignored:    data/hgmm_1k_v3_fastqs.tar
    Ignored:    data/hgmm_1k_v3_fastqs/
    Ignored:    data/hs_cdna.fa.gz
    Ignored:    data/mm_cdna.fa.gz
    Ignored:    data/whitelist_v2.txt
    Ignored:    data/whitelist_v3.txt.gz
    Ignored:    output/hs_mm_tr_index.idx

Untracked files:
    Untracked:  Rplots.pdf
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	8d9fe9a	Lambda Moses	2019-02-14	More detailed explanations for kallisto bus workshop
html	b7b21a0	Lambda Moses	2019-02-02	Build site.
html	a0becdf	Lambda Moses	2018-12-19	Build site.
html	93c1053	Lambda Moses	2018-12-19	Build site.
html	85a5770	Lambda Moses	2018-12-14	Build site.
Rmd	ca1d6ce	Lambda Moses	2018-12-14	Clean up
html	6276894	Lambda Moses	2018-12-14	Build site.
html	db70187	Lambda Moses	2018-12-14	Build site.
Rmd	b67fc92	Lambda Moses	2018-12-14	Added 10xv3 notebook and further elaboration to 10xv2 notebook
html	fff442d	Lambda Moses	2018-12-14	Build site.
Rmd	695e202	Lambda Moses	2018-12-14	Changed name to BUSpaRse
html	09a56c1	Lambda Moses	2018-12-14	Build site.
Rmd	d288e19	Lambda Moses	2018-12-14	Cache rather than skip evaluation
html	30c9aa3	Lambda Moses	2018-12-14	Build site.
Rmd	fd3d5ae	Lambda Moses	2018-12-14	Don’t collapse output
html	2da3b35	Lambda Moses	2018-12-14	Build site.
Rmd	1fc3e91	Lambda Moses	2018-12-14	Publish 10xv2 notebook
Rmd	074d55f	Lambda Moses	2018-12-14	Initial commit, already with 10xv2 notebook

In this vignette, we process fastq data from scRNA-seq (10x v2 chemistry) with to make a sparse matrix that can be used in downstream analysis. Then we will start that standard downstream analysis with Seurat.

Setup

First, if you want to run this notebook, please make sure you have git installed, go to the terminal and

git clone https://github.com/BUStools/BUS_notebooks_R.git

This notebook is in the analysis directory. Also, in RStudio, go to Tools -> Global Options -> R Markdown, and set “Evaluate chunks in directory” to “Project”. The reason why is that this website was built with workflowr, which renders the notebooks in the project directory rather than the document directory.

Alternatively, you can directly download this notebook here and open it in RStudio, but make sure to adjust the paths where files are downloaded and where output files are stored to match those in your environment.

Install packages

The primary analysis section of this notebook demonstrates the use of command line tools kallisto and bustools. Please use kallisto 0.45, whose binary can be downloaded here. The binary of bustools can be found here.

After you download the binary, you should decompress the file (if it is tar.gz) with tar -xzvf file.tar.gz in the bash terminal, and add the directory containing the binary to PATH by export PATH=$PATH:/foo/bar, where /foo/bar here is the directory of interest. Then you can directly invoke the binary on the command line as we will do in this notebook.

Note for Windows users: bustools does not have a Windows binary, but you can use a Linux subsystem in Windows 10. If you use earlier version of Windows, then unfortunately you need to either use Linux dual boot or virtual machine or find a Linux or Mac computer such as a server.

We will be using the R packages below. BUSpaRse is not yet on CRAN or Bioconductor. For Mac users, see the installation note for BUSpaRse. We will also use Seurat 3.0, which is not yet on CRAN (the CRAN version is 2.3.4), for the standard downstream analysis. Seurat 3.0 has some cool new features such as getting batch corrected gene count matrix and cell label transfer across datasets. See this paper for more detail. Though we will not get into those new features here, it’s worthwhile to check them out.

# Install devtools if it's not already installed
if (!require(devtools)) {
  install.packages("devtools")
}
# Install from GitHub
devtools::install_github("BUStools/BUSpaRse")
devtools::install_github("satijalab/seurat", ref = "release/3.0")

The package DropletUtils will be used to estimate the number of real cells as opposed to empty droplets. It’s on Bioconductor, and here is how it should be installed:

if (!require(BiocManager)) {
  install.packages("BiocManager")
}
BiocManager::install("DropletUtils")

The other R packages below are on CRAN, and can be installed with install.packages.

library(BUSpaRse)
library(data.table)
library(Seurat)
library(tidyverse)
library(DropletUtils)
library(Matrix)

Download data

The data set we are using here is 1k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells from the 10x website. First, we download the fastq files (6.34 GB).

download.file("http://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_1k/hgmm_1k_fastqs.tar", destfile = "./data/hgmm_1k_fastqs.tar", quiet = TRUE)

Then untar this file

cd ./data
tar -xvf ./hgmm_1k_fastqs.tar

#> fastqs/
#> fastqs/hgmm_1k_S1_L001_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz

Primary analysis

Build the `kallisto` index

Here we use kallisto to pseudoalign the reads to the transcriptome and then to create the bus file to be converted to a sparse matrix. The first step is to build an index of the transcriptome. This data set has both human and mouse cells, so we need both human and mouse transcriptomes. The transcriptomes downloaded here are from Ensembl version 95, released in January 2019. As of the writing of this notebook, this is the most recent Ensembl release.

# Human transcriptome
download.file("ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz", "./data/hs_cdna.fa.gz", quiet = TRUE)
# Mouse transcriptome
download.file("ftp://ftp.ensembl.org/pub/release-95/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz", "./data/mm_cdna.fa.gz", quiet = TRUE)

kallisto version

#> kallisto, version 0.45.0

Actually, we don’t need to unzip the fasta files

kallisto index -i ./output/hs_mm_tr_index.idx ./data/hs_cdna.fa.gz ./data/mm_cdna.fa.gz

#> 
#> [build] loading fasta file ./data/hs_cdna.fa.gz
#> [build] loading fasta file ./data/mm_cdna.fa.gz
#> [build] k-mer length: 31
#> [build] warning: clipped off poly-A tail (longer than 10)
#>         from 2083 target sequences
#> [build] warning: replaced 8 non-ACGUT characters in the input sequence
#>         with pseudorandom nucleotides
#> [build] counting k-mers ... done.
#> [build] building target de Bruijn graph ...  done 
#> [build] creating equivalence classes ...  done
#> [build] target de Bruijn graph has 2145397 contigs and contains 206575877 k-mers

Run `kallisto bus`

Here we will generate the bus file. Here bus stands for Barbode, UMI, Set (i.e. equivalent class). In text form, it is a table whose first column is the barcode. The second column is the UMI that are associated with the barcode. The third column is the index of the equivalence class reads with the UMI maps to (equivalence class will be explained later). The fourth column is count of reads with this barcode, UMI, and equivalence class combination, which is ignored as one UMI should stand for one molecule. See this paper for more detail.

These are the technologies supported by kallisto bus:

system("kallisto bus --list", intern = TRUE)

#> Warning in system("kallisto bus --list", intern = TRUE): running command
#> 'kallisto bus --list' had status 1

#>  [1] "List of supported single cell technologies"
#>  [2] ""                                          
#>  [3] "short name       description"              
#>  [4] "----------       -----------"              
#>  [5] "10Xv1            10X chemistry version 1"  
#>  [6] "10Xv2            10X chemistry verison 2"  
#>  [7] "DropSeq          DropSeq"                  
#>  [8] "inDrop           inDrop"                   
#>  [9] "CELSeq           CEL-Seq"                  
#> [10] "CELSeq2          CEL-Seq version 2"        
#> [11] "SCRBSeq          SCRB-Seq"                 
#> [12] ""                                          
#> attr(,"status")
#> [1] 1

Here we have 8 samples. Each sample has 3 files: I1 means sample index, R1 means barcode and UMI, and R2 means the piece of cDNA. The -i argument specifies the index file we just built. The -o argument specifies the output directory. The -x argument specifies the sequencing technology used to generate this data set. The -t argument specifies the number of threads used. I ran this on a server and used 8 threads.

cd ./data
kallisto bus -i ../output/hs_mm_tr_index.idx -o ../output/out_hgmm1k -x 10xv2 -t8 \
./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz

#> 
#> [index] k-mer length: 31
#> [index] number of targets: 303,693
#> [index] number of k-mers: 206,575,877
#> [index] number of equivalence classes: 1,256,267
#> [quant] will process sample 1: ./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz
#> [quant] will process sample 2: ./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz
#> [quant] will process sample 3: ./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz
#> [quant] will process sample 4: ./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz
#> [quant] will process sample 5: ./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz
#> [quant] will process sample 6: ./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz
#> [quant] will process sample 7: ./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz
#> [quant] will process sample 8: ./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz
#> [quant] finding pseudoalignments for the reads ... done
#> [quant] processed 63,252,296 reads, 52,231,469 reads pseudoaligned

See what the outputs are

list.files("./output/out_hgmm1k/")

#> [1] "matrix.ec"       "output.bus"      "run_info.json"   "transcripts.txt"

Run `BUStools`

The output.bus file is a binary. In order to make R parse it, we need to convert it into a sorted text file. There’s a command line tool bustools for this. The reason why kallisto bus outputs this binary rather than the sorted text file directly is that the binary can be very compressed to save disk space. The such compression is achieved, a future version of R package BUSpaRse may support directly reading the binary to convert to sparse matrix. At present, as bustools is still under development, such compression has not been achieved, and as a result, the binary is slightly larger than the text file.

# Add where I installed bustools to PATH
export PATH=$PATH:/home/lambda/mylibs/bin/
# Sort, with 8 threads
bustools sort -o ./output/out_hgmm1k/output.sorted -t8 ./output/out_hgmm1k/output.bus
# Convert sorted file to text
bustools text -o ./output/out_hgmm1k/output.sorted.txt ./output/out_hgmm1k/output.sorted

#> Read in 52231469 number of busrecords
#> All sorted
#> Read in 43778790 number of busrecords

Secondary analysis

Map transcripts to genes

For the sparse matrix, most people are interested in how many UMIs per gene per cell, we here we will quantify this from the bus output, and to do so, we need to find which gene corresponds to each transcript. Remember in the output of kallisto bus, there’s the file transcripts.txt. Those are the transcripts in the transcriptome index. Now we’ll only keep the transcripts present there and make sure that the transcripts in tr2g are in the same order as those in the index. You will soon see why the order is important.

Remember that we downloaded transcriptome FASTA files from Ensembl just now. In FASTA files, each entry is a sequence with a name. In Ensembl FASTA files, the sequence name has genome annotation of the corresponding sequence, so we can extract transcript IDs and corresponding gene IDs and gene names from there.

tr2g <- transcript2gene(fasta_file = c("./data/hs_cdna.fa.gz", "./data/mm_cdna.fa.gz"),
                        kallisto_out_path = "./output/out_hgmm1k")

#> Reading FASTA file.
#> Reading FASTA file.

#> Sorting transcripts

head(tr2g) %>% knitr::kable()

transcript	gene	gene_name
ENST00000632684.1	ENSG00000282431.1	TRBD1
ENST00000434970.2	ENSG00000237235.2	TRDD2
ENST00000448914.1	ENSG00000228985.1	TRDD3
ENST00000415118.1	ENSG00000223997.1	TRDD1
ENST00000631435.1	ENSG00000282253.1	TRBD1
ENST00000390583.1	ENSG00000211923.1	IGHD3-10

Alternative ways of getting tr2g have been implemented in the BUSpaRse package. You may use tr2g_ensembl to query Ensembl with biomart to get transcript and gene IDs. If you use this method, then please make sure that the Ensembl version used in the query matches that of the transcriptome. This method is convenient for the user since you only need to input species names, but it can be slow since biomart database query can be slow. You may also use tr2g_gtf for GTF files and tr2g_gff3 for GFF3 files, which are more useful for non-model organisms absent from Ensemble. After calling the tr2g_* family of functions, you should sort the transcripts from those functions with sort_tr2g so the transcripts are in the same order as those in the kallisto index. The FASTA way is the fastest, as reading FASTA files is faster than reading GTF and GFF files. Whichever way you choose, this shouldn’t take much more than half a minute per species.

The function transcript2gene not only gets the transcript and gene IDs but also sorts the transcripts. This function only supports Ensembl biomart query and Ensembl fasta files as the source of transcript and gene IDs, as the attribute field of GTF and GFF3 files can differ between sources and further cleanup may be needed.

Map ECs to genes

The 3rd column in the output.sorted.txt is the equivalence class (EC) index of each UMI for each cell barcode. Equivalence class (EC) means the set of transcripts in the transcriptome that the read is compatible to. While in most cases, an EC only has transcripts for the same gene, there are some ECs that have transcripts for different genes. The file in the kallisto bus output, matrix.ec, maps the EC index in output.sorted.txt to sets of line numbers in the transcriptome assembly. That’s why we ensured that the tr2g data frame has the same order as the transcripts in the index.

Also note that while this function supports multithreading, unless the matrix.ec file is really large, setting ncores > 1 will not improve performance since only 1 core will be used anyway. It may even compromise performance since creating multiple processes has overhead. For this dataset, where matrix.ec is 68.8MB, it is not large enough to warrant the use of more than 1 core.

genes <- EC2gene(tr2g, "./output/out_hgmm1k", ncores = 1)

#> Reading matrix.ec

#> Processing genes

Now for each EC, we have a set of genes the EC is compatible to.

head(genes)

#> [[1]]
#> [1] "ENSG00000282431.1"
#> 
#> [[2]]
#> [1] "ENSG00000237235.2"
#> 
#> [[3]]
#> [1] "ENSG00000228985.1"
#> 
#> [[4]]
#> [1] "ENSG00000223997.1"
#> 
#> [[5]]
#> [1] "ENSG00000282253.1"
#> 
#> [[6]]
#> [1] "ENSG00000211923.1"

tail(genes)

#> [[1]]
#>  [1] "ENSG00000256683.6"  "ENSG00000112276.13" "ENSG00000274671.4" 
#>  [4] "ENSG00000278767.4"  "ENSG00000100625.8"  "ENSG00000140093.9" 
#>  [7] "ENSG00000152939.15" "ENSG00000135956.8"  "ENSG00000215271.7" 
#> [10] "ENSG00000141965.4"  "ENSG00000255561.6"  "ENSG00000196376.10"
#> [13] "ENSG00000185960.14" "ENSG00000172000.7"  "ENSG00000227454.2" 
#> [16] "ENSG00000280700.1" 
#> 
#> [[2]]
#>  [1] "ENSMUSG00000062647.16" "ENSMUSG00000115584.1" 
#>  [3] "ENSMUSG00000115188.1"  "ENSMUSG00000117228.1" 
#>  [5] "ENSMUSG00000083553.1"  "ENSMUSG00000107920.1" 
#>  [7] "ENSMUSG00000114756.1"  "ENSMUSG00000067147.7" 
#>  [9] "ENSMUSG00000104775.1"  "ENSMUSG00000081238.3" 
#> [11] "ENSMUSG00000110109.1"  "ENSMUSG00000062758.6" 
#> [13] "ENSMUSG00000116481.1"  "ENSMUSG00000081675.1" 
#> [15] "ENSMUSG00000090305.2"  "ENSMUSG00000106258.1" 
#> [17] "ENSMUSG00000106243.1"  "ENSMUSG00000117585.1" 
#> [19] "ENSMUSG00000071052.4"  "ENSMUSG00000086925.1" 
#> [21] "ENSMUSG00000099927.1"  "ENSMUSG00000112574.1" 
#> [23] "ENSMUSG00000087321.1"  "ENSMUSG00000099980.1" 
#> [25] "ENSMUSG00000098893.1"  "ENSMUSG00000082920.1" 
#> [27] "ENSMUSG00000117002.1"  "ENSMUSG00000083826.1" 
#> [29] "ENSMUSG00000116757.1" 
#> 
#> [[3]]
#>  [1] "ENSG00000165792.17" "ENSG00000171346.15" "ENSG00000180336.17"
#>  [4] "ENSG00000105443.14" "ENSG00000148225.15" "ENSG00000101940.17"
#>  [7] "ENSG00000157540.20" "ENSG00000197563.10" "ENSG00000003402.19"
#> [10] "ENSG00000136731.12" "ENSG00000160049.11"
#> 
#> [[4]]
#>  [1] "ENSG00000100626.16" "ENSG00000153283.12" "ENSG00000196312.13"
#>  [4] "ENSG00000132680.10" "ENSG00000196418.12" "ENSG00000174197.16"
#>  [7] "ENSG00000186185.13" "ENSG00000130164.13" "ENSG00000198089.15"
#> [10] "ENSG00000081320.10" "ENSG00000122779.17" "ENSG00000089053.12"
#> [13] "ENSG00000010244.18" "ENSG00000177034.15" "ENSG00000039068.18"
#> [16] "ENSG00000068079.7"  "ENSG00000149554.12" "ENSG00000104408.9" 
#> [19] "ENSG00000165525.17" "ENSG00000167523.13" "ENSG00000172339.9" 
#> [22] "ENSG00000167721.10" "ENSG00000149182.14" "ENSG00000134759.13"
#> [25] "ENSG00000100554.11" "ENSG00000116871.15" "ENSG00000198252.11"
#> [28] "ENSG00000090382.6"  "ENSG00000067533.5"  "ENSG00000185869.14"
#> [31] "ENSG00000117505.12" "ENSG00000152229.18" "ENSG00000135966.12"
#> [34] "ENSG00000148655.14" "ENSG00000100243.20" "ENSG00000115756.12"
#> [37] "ENSG00000225973.3"  "ENSG00000140451.12" "ENSG00000105793.15"
#> [40] "ENSG00000162144.9"  "ENSG00000196417.12" "ENSG00000069020.18"
#> [43] "ENSG00000175575.12" "ENSG00000153066.12" "ENSG00000160062.14"
#> [46] "ENSG00000265491.4"  "ENSG00000125965.8"  "ENSG00000078674.17"
#> [49] "ENSG00000162241.12" "ENSG00000167703.14" "ENSG00000149483.11"
#> [52] "ENSG00000158987.20" "ENSG00000110031.12" "ENSG00000144730.17"
#> [55] "ENSG00000163705.12" "ENSG00000278550.4"  "ENSG00000007545.15"
#> [58] "ENSG00000178927.17" "ENSG00000115947.13" "ENSG00000186687.15"
#> [61] "ENSG00000152348.15" "ENSG00000184898.6"  "ENSG00000138380.17"
#> [64] "ENSG00000108587.15" "ENSG00000172000.7"  "ENSG00000119906.12"
#> [67] "ENSG00000169371.13" "ENSG00000100034.13" "ENSG00000108448.21"
#> [70] "ENSG00000169895.5"  "ENSG00000088888.17" "ENSG00000178950.16"
#> [73] "ENSG00000214176.9"  "ENSG00000226789.1"  "ENSG00000204118.2" 
#> 
#> [[5]]
#> [1] "ENSG00000127311.9" "ENSG00000135722.8"
#> 
#> [[6]]
#>  [1] "ENSG00000111832.12" "ENSG00000167637.16" "ENSG00000066422.4" 
#>  [4] "ENSG00000131269.16" "ENSG00000110871.14" "ENSG00000140416.20"
#>  [7] "ENSG00000151131.10" "ENSG00000151338.18" "ENSG00000085382.11"
#> [10] "ENSG00000143315.7"  "ENSG00000144460.12" "ENSG00000128973.12"
#> [13] "ENSG00000011021.22" "ENSG00000135643.4"  "ENSG00000177565.16"
#> [16] "ENSG00000185324.21" "ENSG00000105726.16" "ENSG00000184304.14"
#> [19] "ENSG00000135747.11" "ENSG00000248905.8"  "ENSG00000143322.19"
#> [22] "ENSG00000102144.14" "ENSG00000166822.12" "ENSG00000215883.10"
#> [25] "ENSG00000163728.10" "ENSG00000127603.25" "ENSG00000161328.10"
#> [28] "ENSG00000141551.14" "ENSG00000166938.12" "ENSG00000103657.13"
#> [31] "ENSG00000114982.17" "ENSG00000148225.15" "ENSG00000114902.13"
#> [34] "ENSG00000114904.12" "ENSG00000134539.16" "ENSG00000265491.4" 
#> [37] "ENSG00000162971.10" "ENSG00000134245.17" "ENSG00000273831.2" 
#> [40] "ENSG00000224383.7"  "ENSG00000101417.11" "ENSG00000148688.13"
#> [43] "ENSG00000073605.18" "ENSG00000115641.18" "ENSG00000102290.22"
#> [46] "ENSG00000275111.4"  "ENSG00000124357.12" "ENSG00000106069.22"
#> [49] "ENSG00000163964.14" "ENSG00000138592.13" "ENSG00000090238.11"
#> [52] "ENSG00000113532.12" "ENSG00000213949.8"  "ENSG00000175387.15"
#> [55] "ENSG00000282513.1"  "ENSG00000189136.9"  "ENSG00000268949.1" 
#> [58] "ENSG00000267253.1"  "ENSG00000249616.1"  "ENSG00000229816.1" 
#> [61] "ENSG00000265158.1"

Apparently some ECs map to many genes.

Make the sparse matrix

For 10x, we do have a file with all valid cell barcodes that comes with CellRanger.

# Copy v2 chemistry whitelist to working directory
cp ~/cellranger-3.0.1/cellranger-cs/3.0.1/lib/python/cellranger/barcodes/737K-august-2016.txt \
./data/whitelist_v2.txt

# Read in the whitelist
whitelist_v2 <- fread("./data/whitelist_v2.txt", header = FALSE)$V1
length(whitelist_v2)

#> [1] 737280

There about 737K valid cell barcodes.

Now we have everything we need to make the sparse matrix. This function reads in output.sorted.txt line by line and processes them. It does not do barcode correction for now, so the barcode must exactly match those in the whitelist if one is provided. It took 5 to 6 minutes to construct the sparse matrix in the hgmm6k dataset, which has over 280 million lines in output.sorted.txt, which is over 9GB. Here the data set is smaller, and it takes less than a minute. A future version of BUSpaRse will parallelize this process to make it more scalable for huge datasets. For now, it’s single threaded.

Note that the arguments est_ncells (estimated number of cells) and est_ngenes (estimated number of genes) are important. With the estimate, this function reserves memory for the data to be added into, reducing the need of reallocation, which will slow the function down. Since the vast majority of “cells” you get in this sparse matrix are empty droplets rather than cells, please put at least 200 times more “cells” than you actually expect in est_ncells.

If you do not have a whitelist of barcodes, then it’s fine; the whitelist argument is optional.

res_mat <- make_sparse_matrix("./output/out_hgmm1k/output.sorted.txt",
                              genes = genes, est_ncells = 3.5e5,
                              est_ngenes = nrow(tr2g),
                              whitelist = whitelist_v2)

#> Reading data
#> Read 5 million lines
#> Read 10 million lines
#> Read 15 million lines
#> Read 20 million lines
#> Read 25 million lines
#> Read 30 million lines
#> Read 35 million lines
#> Read 40 million lines
#> Constructing sparse matrix

The matrix we get has genes in rows and barcode in columns. The row names are the gene IDs (not using human readable gene names since they’re not guaranteed to be unique), and the column names are cell barcodes.

For Ensembl transcriptomes, all the 3 steps in BUSpaRse we have just done can be condensed into one step, with the busparse_gene_count function.

Explore the data

Remove empty droplets

Cool, so now we have the sparse matrix. What does it look like?

dim(res_mat)

#> [1]  47367 353961

That’s way more cells than we expect, which is about 1000. So what’s going on?

How many UMIs per barcode?

tot_counts <- Matrix::colSums(res_mat)
summary(tot_counts)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     1.0     1.0     2.0   101.8    11.0 92917.0

The vast majority of “cells” have only a few UMI detected. Those are empty droplets. 10x claims to have cell capture rate of up to 65%, but in practice, depending on how many cells are in fact loaded, the rate can be much lower. A commonly used method to estimate the number of empty droplets is barcode ranking knee and inflection points, as those are often assumed to represent transition between two components of a distribution. While more sophisticated method exist (e.g. see emptyDrops in DropletUtils), for simplicity, we will use the barcode ranking method here. However, whichever way we go, we don’t have the ground truth.

# Compute barcode rank
bc_rank <- barcodeRanks(res_mat)

qplot(bc_rank$rank, bc_rank$total, geom = "line") +
  geom_hline(yintercept = bc_rank$knee, color = "blue", linetype = 2) +
  geom_hline(yintercept = bc_rank$inflection, color = "green", linetype = 2) +
  annotate("text", x = 1000, y = 1.5 * c(bc_rank$knee, bc_rank$inflection),
           label = c("knee", "inflection"), color = c("blue", "green")) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Rank", y = "Total reads") +
  theme_bw()

The inflection point looks like a reasonable number of cells.

# Filter the matrix
res_mat <- res_mat[, tot_counts > bc_rank$inflection]
dim(res_mat)

#> [1] 47367  1123

Cell species

How many cells are from humans and how many from mice? The number of cells with mixed species indicates doublet rate.

gene_species <- ifelse(str_detect(rownames(res_mat), "^ENSMUSG"), "mouse", "human")
mouse_inds <- gene_species == "mouse"
human_inds <- gene_species == "human"
# mark cells as mouse or human
cell_species <- tibble(n_mouse_umi = Matrix::colSums(res_mat[mouse_inds,]),
                       n_human_umi = Matrix::colSums(res_mat[human_inds,]),
                       tot_umi = Matrix::colSums(res_mat),
                       prop_mouse = n_mouse_umi / tot_umi,
                       prop_human = n_human_umi / tot_umi)

# Classify species based on proportion of UMI, with cutoff of 90%
cell_species <- cell_species %>% 
  mutate(species = case_when(
    prop_mouse > 0.9 ~ "mouse",
    prop_human > 0.9 ~ "human",
    TRUE ~ "mixed"
  ))

ggplot(cell_species, aes(n_human_umi, n_mouse_umi, color = species)) +
  geom_point(size = 0.5) +
  theme_bw()

Great, looks like the vast majority of cells are not mixed.

cell_species %>% 
  dplyr::count(species) %>% 
  mutate(proportion = n / ncol(res_mat))

#> # A tibble: 3 x 3
#>   species     n proportion
#>   <chr>   <int>      <dbl>
#> 1 human     579    0.516  
#> 2 mixed       3    0.00267
#> 3 mouse     541    0.482

Great, only about 0.3% of cells here are doublets, which is lower than the ~1% 10x lists. Doublet rate tends to be lower when cell concentration is lower. However, doublets can still be formed with cells from the same species, so the number of mixed species “cells” is only a lower bound of doublet rate.

Dimension reduction

seu <- CreateSeuratObject(res_mat, min.cells = 3) %>% 
  NormalizeData(verbose = FALSE) %>% 
  ScaleData(verbose = FALSE) %>% 
  FindVariableFeatures(verbose = FALSE)

# Add species to meta data
seu <- AddMetaData(seu, metadata = cell_species$species, col.name = "species")

See how number of total counts and number of genes expressed are distributed.

VlnPlot(seu, c("nCount_RNA", "nFeature_RNA"), group.by = "species")

seu <- RunPCA(seu, verbose = FALSE, npcs = 30)
ElbowPlot(seu, ndims = 30)

DimPlot(seu, reduction = "pca", pt.size = 0.5, group.by = "species")

The first PC separates species, as expected. Also as expected, the doublets are in between human and mouse cells in this plot.

seu <- RunTSNE(seu, dims = 1:20, check_duplicates = FALSE)
DimPlot(seu, reduction = "tsne", pt.size = 0.5, group.by = "species")

The species separate, and the few doublets form its own cluster, as expected.

Session information

sessionInfo()

#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] bindrcpp_0.2.2              Matrix_1.2-15              
#>  [3] DropletUtils_1.2.2          SingleCellExperiment_1.4.1 
#>  [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
#>  [7] matrixStats_0.54.0          Biobase_2.42.0             
#>  [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
#> [11] IRanges_2.16.0              S4Vectors_0.20.1           
#> [13] BiocGenerics_0.28.0         BiocParallel_1.16.5        
#> [15] forcats_0.3.0               stringr_1.4.0              
#> [17] dplyr_0.7.8                 purrr_0.3.0                
#> [19] readr_1.3.1                 tidyr_0.8.2                
#> [21] tibble_2.0.1                ggplot2_3.1.0              
#> [23] tidyverse_1.2.1             Seurat_3.0.0.9000          
#> [25] data.table_1.12.0           BUSpaRse_0.99.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] utf8_1.1.4               reticulate_1.10         
#>   [3] R.utils_2.7.0            tidyselect_0.2.5        
#>   [5] RSQLite_2.1.1            AnnotationDbi_1.44.0    
#>   [7] htmlwidgets_1.3          grid_3.5.1              
#>   [9] Rtsne_0.15               devtools_2.0.1          
#>  [11] munsell_0.5.0            codetools_0.2-15        
#>  [13] ica_1.0-2                future_1.11.1.1         
#>  [15] withr_2.1.2              colorspace_1.4-0        
#>  [17] highr_0.7                knitr_1.21              
#>  [19] rstudioapi_0.9.0         ROCR_1.0-7              
#>  [21] gbRd_0.4-11              listenv_0.7.0           
#>  [23] labeling_0.3             Rdpack_0.10-1           
#>  [25] git2r_0.23.0             GenomeInfoDbData_1.2.0  
#>  [27] bit64_0.9-7              rhdf5_2.26.0            
#>  [29] rprojroot_1.3-2          generics_0.0.2          
#>  [31] xfun_0.4                 R6_2.3.0                
#>  [33] doParallel_1.0.14        rsvd_1.0.0              
#>  [35] locfit_1.5-9.1           bitops_1.0-6            
#>  [37] assertthat_0.2.0         SDMTools_1.1-221        
#>  [39] scales_1.0.0             gtable_0.2.0            
#>  [41] npsurv_0.4-0             globals_0.12.4          
#>  [43] processx_3.2.1           workflowr_1.1.1         
#>  [45] rlang_0.3.1              zeallot_0.1.0           
#>  [47] splines_3.5.1            rtracklayer_1.42.1      
#>  [49] lazyeval_0.2.1           broom_0.5.1             
#>  [51] plyranges_1.2.0          yaml_2.2.0              
#>  [53] modelr_0.1.2             backports_1.1.2         
#>  [55] tools_3.5.1              usethis_1.4.0           
#>  [57] gplots_3.0.1.1           RColorBrewer_1.1-2      
#>  [59] sessioninfo_1.1.1        ggridges_0.5.1          
#>  [61] Rcpp_1.0.0               plyr_1.8.4              
#>  [63] progress_1.2.0           zlibbioc_1.28.0         
#>  [65] RCurl_1.95-4.11          ps_1.2.1                
#>  [67] prettyunits_1.0.2        pbapply_1.4-0           
#>  [69] cowplot_0.9.4            zoo_1.8-4               
#>  [71] haven_2.0.0              ggrepel_0.8.0           
#>  [73] cluster_2.0.7-1          fs_1.2.6                
#>  [75] magrittr_1.5             lmtest_0.9-36           
#>  [77] RANN_2.6.1               whisker_0.3-2           
#>  [79] fitdistrplus_1.0-14      pkgload_1.0.2           
#>  [81] hms_0.4.2                lsei_1.2-0              
#>  [83] evaluate_0.12            XML_3.98-1.17           
#>  [85] readxl_1.1.0             testthat_2.0.1          
#>  [87] compiler_3.5.1           biomaRt_2.38.0          
#>  [89] KernSmooth_2.23-15       crayon_1.3.4            
#>  [91] R.oo_1.22.0              htmltools_0.3.6         
#>  [93] lubridate_1.7.4          DBI_1.0.0               
#>  [95] MASS_7.3-51.1            cli_1.0.1               
#>  [97] R.methodsS3_1.7.1        gdata_2.18.0            
#>  [99] metap_1.0                bindr_0.1.1             
#> [101] igraph_1.2.2             pkgconfig_2.0.2         
#> [103] GenomicAlignments_1.18.1 plotly_4.8.0            
#> [105] xml2_1.2.0               foreach_1.4.4           
#> [107] XVector_0.22.0           bibtex_0.4.2            
#> [109] rvest_0.3.2              callr_3.1.1             
#> [111] digest_0.6.18            tsne_0.1-3              
#> [113] Biostrings_2.50.2        rmarkdown_1.11          
#> [115] cellranger_1.1.0         edgeR_3.24.3            
#> [117] Rsamtools_1.34.1         gtools_3.8.1            
#> [119] nlme_3.1-137             jsonlite_1.6            
#> [121] Rhdf5lib_1.4.0           fansi_0.4.0             
#> [123] desc_1.2.0               viridisLite_0.3.0       
#> [125] limma_3.38.3             pillar_1.3.1            
#> [127] lattice_0.20-38          httr_1.4.0              
#> [129] pkgbuild_1.0.2           survival_2.43-3         
#> [131] glue_1.3.0               remotes_2.0.2           
#> [133] png_0.1-7                iterators_1.0.10        
#> [135] bit_1.1-14               stringi_1.3.1           
#> [137] HDF5Array_1.10.1         blob_1.1.1              
#> [139] caTools_1.17.1.1         memoise_1.1.0           
#> [141] irlba_2.3.2              future.apply_1.1.0      
#> [143] ape_5.2

This reproducible R Markdown analysis was created with workflowr 1.1.1

Processing kallisto bus Output (10x v2 chemistry)

Lambda Moses

2019-02-14