In this R Notebook we preprocess spatial and corresponding reference scRNA-seq data of human HER2+ Breast Cancer for cell type deconvolution.
Spatial data preprocessing:
1.1 Input original data files
H1.tsv.gz is from zipped
file count-matrices.zip
downloaded from zenodo.4751624, and this
first section from patient H is selected for
analysis.1.2 Output data files for cell type deconvolution
x and y coordinates are directly
extracted from spot names.Reference scRNA-seq data preprocessing:
2.1 Input original data files
scRNA-seq data GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz are downloaded from GSE176078. This compressed file contains 4 files:
count_matrix_sparse.mtx,
count_matrix_barcodes.tsv and
count_matrix_genes.tsv: Raw nUMI of all 100,064 cells and
29,733 genes
metadata.csv: meta data of all 100,064
cells
2.2 Output data files for cell type deconvolution
We select 19,311 HER2+ cells. NO filtering on genes, i.e. all genes are included for analysis.
Raw nUMI of 19,311 cells with 9 cell types and 29,733 genes: Breast_Cancer_ref_scRNA_cell_nUMI.csv.gz.
Cell type annotation for those 19,311 cells: Breast_Cancer_ref_scRNA_cell_celltype.csv..
version[['version.string']][1] "R version 4.2.2 Patched (2022-11-10 r83330)"
H1.tsv.gzfile_name = file.path(home.dir, 'H1.tsv.gz')
org_data = data.table::fread(file_name, sep = "\t", check.names = FALSE)Registered S3 method overwritten by 'data.table':
method from
print.data.table
print(sprintf('load data from %s', file_name))[1] "load data from /home/hill103/Documents/SharedFolder/ToHost/CVAE-GLRM_Analysis/RealData/Breast_Cancer/H1.tsv.gz"
org_data = as.data.frame(org_data)
# extract first column as row name
row.names(org_data) = org_data$V1
org_data = org_data[, 2:ncol(org_data)]
print(sprintf('spots: %d; genes: %d', nrow(org_data), ncol(org_data)))[1] "spots: 613; genes: 15029"
org_data[1:5, 1:5]No filtering on spots or genes, directly save all spots and genes into file Breast_Cancer_spatial_spot_nUMI.csv. Rows as spatial spots and columns as genes.
write.csv(org_data, 'Breast_Cancer_spatial_spot_nUMI.csv')
print(sprintf('save %d gene nUMIs of %d spatial spots into file %s', ncol(org_data), nrow(org_data), 'Breast_Cancer_spatial_spot_nUMI.csv'))[1] "save 15029 gene nUMIs of 613 spatial spots into file Breast_Cancer_spatial_spot_nUMI.csv"
Directly extract the spatial x and y
coordinates from spot names, then save it into file Breast_Cancer_spatial_spot_loc.csv.
local_df = data.frame(names = row.names(org_data), row.names = row.names(org_data))
local_df = local_df %>%
tidyr::separate_wider_delim(names, 'x', names = c('x', 'y'))
local_df = as.data.frame(local_df)
row.names(local_df) = row.names(org_data)
local_df['x'] = as.numeric(local_df$x)
local_df['y'] = as.numeric(local_df$y)
local_df[1:5, ]
write.csv(local_df, 'Breast_Cancer_spatial_spot_loc.csv')
print(sprintf('save Physical Locations of spatial spots into file %s', 'Breast_Cancer_spatial_spot_loc.csv'))[1] "save Physical Locations of spatial spots into file Breast_Cancer_spatial_spot_loc.csv"
We define the neighborhood of a spatial spot contains the adjacent left, right, top and bottom spot, that is, one spot has at most 4 neighbors.
The generated Adjacency Matrix A only contains
1 and 0, where 1 represents
corresponding two spots are adjacent spots according to the definition
of neighborhood, while value 0 for non-adjacent spots. Note all
diagonal entries are 0s.
Adjacency Matrix are saved into file Breast_Cancer_spatial_spot_adjacency_matrix.csv.
getNeighbour = function(array_row, array_col) {
# based on the (row, col) of one spot, return the (row, col) of all 4 neighbours
return(list(c(array_row-1, array_col),
c(array_row+1, array_col),
c(array_row+0, array_col-1),
c(array_row+0, array_col+1)))
}
# adjacency matrix
A = matrix(0, nrow = nrow(local_df), ncol = nrow(local_df))
row.names(A) = rownames(local_df)
colnames(A) = rownames(local_df)
for (i in 1:nrow(local_df)) {
barcode = rownames(local_df)[i]
array_row = local_df[i, 'y']
array_col = local_df[i, 'x']
# get neighbors
neighbours = getNeighbour(array_row, array_col)
# fill the adjacency matrix
for (this.vec in neighbours) {
tmp.p = rownames(local_df[local_df$y==this.vec[1] & local_df$x==this.vec[2], ])
if (length(tmp.p) >= 1) {
# target spots have neighbors in selected spots
for (neigh.barcode in tmp.p) {
A[barcode, neigh.barcode] = 1
}
}
}
}
A[1:5, 1:5] 10x10 10x11 10x12 10x13 10x14
10x10 0 1 0 0 0
10x11 1 0 1 0 0
10x12 0 1 0 1 0
10x13 0 0 1 0 1
10x14 0 0 0 1 0
write.csv(A, 'Breast_Cancer_spatial_spot_adjacency_matrix.csv')
print(sprintf('save Adjacency Matrix of spatial spots into file %s', 'Breast_Cancer_spatial_spot_adjacency_matrix.csv'))[1] "save Adjacency Matrix of spatial spots into file Breast_Cancer_spatial_spot_adjacency_matrix.csv"
Plot Adjacency Matrix. Each node is spot, spots within neighborhood are connected with edges.
Note multiply the y coordinated with -1 to match the figure with H&E staining image
g = graph_from_adjacency_matrix(A, 'undirected', add.colnames = NA, add.rownames = NA)
# manually set nodes x and y coordinates
vertex_attr(g, name = 'x') = local_df$x
vertex_attr(g, name = 'y') = -(local_df$y)
plot(g, vertex.size=5, edge.width=4, margin=-0.05)Original meta data file metadata.csv is in the
compressed file GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz
downloaded from GSE176078.
It contains meta data of 100,064 cells from human breast cancer
tissue. We select cells from HER2+ subjects
(subtype=HER2+), and 19,311 cells from 5
subjects are selected.
The cell type annotation is stored in column
celltype_major, which includes total 9 distinct
annotations.
file_name = file.path(home.dir, 'Wu_etal_2021_BRCA_scRNASeq', 'metadata.csv')
ref_meta = read.csv(file_name, sep=',', check.names = F, header = T, row.names = 1)
print(sprintf('load data from %s', file_name))[1] "load data from /home/hill103/Documents/SharedFolder/ToHost/CVAE-GLRM_Analysis/RealData/Breast_Cancer/Wu_etal_2021_BRCA_scRNASeq/metadata.csv"
print(sprintf('total %d cells with distinct %d cell type annotations', nrow(ref_meta), length(unique(ref_meta$celltype_major))))[1] "total 100064 cells with distinct 9 cell type annotations"
# select HER2+ cells
ref_meta = ref_meta %>%
filter(subtype == 'HER2+')
print(sprintf('remain %d HER2+ cells', nrow(ref_meta)))[1] "remain 19311 HER2+ cells"
table(ref_meta$orig.ident)
CID3586 CID3838 CID3921 CID4066 CID45171
6178 2353 3024 5309 2447
count = as.matrix(table(ref_meta$celltype_major))
colnames(count) = 'num'
Matrix::t(count) B-cells CAFs Cancer Epithelial Endothelial Myeloid Normal Epithelial Plasmablasts PVL T-cells
num 624 1449 1775 1016 1422 968 226 894 10937
ref_meta[1:5, 'celltype_major', drop=F]Save cell type annotation of selected cells to file Breast_Cancer_ref_scRNA_cell_celltype.csv.
colnames(ref_meta)[colnames(ref_meta)=='celltype_major'] = 'celltype'
write.csv(ref_meta[, 'celltype', drop=F], 'Breast_Cancer_ref_scRNA_cell_celltype.csv')
print(sprintf('save cell type annotation of reference scRNA-seq cells into file %s', 'Breast_Cancer_ref_scRNA_cell_celltype.csv'))[1] "save cell type annotation of reference scRNA-seq cells into file Breast_Cancer_ref_scRNA_cell_celltype.csv"
Original gene nUMI count data file
count_matrix_sparse.mtx,
count_matrix_barcodes.tsv and
count_matrix_genes.tsv are in the compressed file GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz
downloaded from GSE176078.
It contains total 100,064 cells and 29,733 genes.
We just selected 19,311 HER2+ cells by barcodes, and discard other cells. NO filtering on genes, i.e. all 29,733 genes will be used for cell type deconvolution.
# https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices
barcode.path = file.path(home.dir, 'Wu_etal_2021_BRCA_scRNASeq', "count_matrix_barcodes.tsv")
features.path = file.path(home.dir, 'Wu_etal_2021_BRCA_scRNASeq', "count_matrix_genes.tsv")
matrix.path = file.path(home.dir, 'Wu_etal_2021_BRCA_scRNASeq', "count_matrix_sparse.mtx")
mat = Matrix::readMM(file = matrix.path)
feature.names = read.delim(features.path,
header = FALSE,
stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path,
header = FALSE,
stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
row.names(mat) = feature.names$V1
print(sprintf('load data from %s', matrix.path))[1] "load data from /home/hill103/Documents/SharedFolder/ToHost/CVAE-GLRM_Analysis/RealData/Breast_Cancer/Wu_etal_2021_BRCA_scRNASeq/count_matrix_sparse.mtx"
print(sprintf('total cells: %d; genes: %d', ncol(mat), nrow(mat)))[1] "total cells: 100064; genes: 29733"
# select HER2+ cells
mat = mat[, row.names(ref_meta)]
print(sprintf('select cells: %d; genes: %d', ncol(mat), nrow(mat)))[1] "select cells: 19311; genes: 29733"
# transpose it
mat = Matrix::t(mat)Save scRNA-seq nUMI matrix to file Breast_Cancer_ref_scRNA_cell_nUMI.csv.gz
ref_data = as.data.frame(as.matrix(mat), check.names=F)
ref_data[1:5, 1:5]
data.table::fwrite(ref_data, 'Breast_Cancer_ref_scRNA_cell_nUMI.csv.gz', row.names = T)
Written 18.2% of 19311 rows in 2 secs using 4 threads. maxBuffUsed=2%. ETA 9 secs.
Written 26.9% of 19311 rows in 3 secs using 4 threads. maxBuffUsed=2%. ETA 8 secs.
Written 33.5% of 19311 rows in 4 secs using 4 threads. maxBuffUsed=2%. ETA 7 secs.
Written 40.0% of 19311 rows in 5 secs using 4 threads. maxBuffUsed=2%. ETA 7 secs.
Written 46.3% of 19311 rows in 6 secs using 4 threads. maxBuffUsed=2%. ETA 7 secs.
Written 54.2% of 19311 rows in 7 secs using 4 threads. maxBuffUsed=2%. ETA 5 secs.
Written 60.8% of 19311 rows in 8 secs using 4 threads. maxBuffUsed=2%. ETA 5 secs.
Written 66.2% of 19311 rows in 9 secs using 4 threads. maxBuffUsed=2%. ETA 4 secs.
Written 71.9% of 19311 rows in 10 secs using 4 threads. maxBuffUsed=2%. ETA 3 secs.
Written 78.8% of 19311 rows in 11 secs using 4 threads. maxBuffUsed=2%. ETA 2 secs.
Written 86.5% of 19311 rows in 12 secs using 4 threads. maxBuffUsed=2%. ETA 1 secs.
Written 95.4% of 19311 rows in 13 secs using 4 threads. maxBuffUsed=2%. ETA 0 secs.
print(sprintf('save nUMI matrix of reference scRNA-seq cells into gzip compressed file %s', 'Breast_Cancer_ref_scRNA_cell_nUMI.csv.gz'))[1] "save nUMI matrix of reference scRNA-seq cells into gzip compressed file Breast_Cancer_ref_scRNA_cell_nUMI.csv.gz"