Identifying Selected Genes

Last updated: 2020-02-17

Checks: 7 0

Knit directory: Blancetal/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200217)

The command set.seed(20200217) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 09676ba

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  _workflowr.yml~
    Untracked:  data/FlintGarciaTableS1_fixednew.csv
    Untracked:  data/Kinship_matrices/
    Untracked:  data/Mean_centered_expression/
    Untracked:  output/Identifying_Selected_Genes/

Unstaged changes:
    Modified:   _workflowr.yml
    Modified:   analysis/Population_Structure.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	09676ba	jgblanc	2020-02-17	wflow_publish(c(“analysis/Identifying_Selected_Genes.Rmd”, “analysis/index.Rmd”))
html	d6e709f	jgblanc	2020-02-17	Build site.
Rmd	feebd86	jgblanc	2020-02-17	wflow_publish(c(“analysis/Identifying_Selected_Genes.Rmd”, “analysis/index.Rmd”))

Intro

Here is the code used to identify selected genes in each of the 9 tissue types and make Figure 2.

Code

The code below will be written to identify selected genes in the Kernel tissue. This code can be re-run for other tissues by specifying the path to the expression for that tissue.

# Read in mean-centered expression values
df1 <- read.table("../data/Mean_centered_expression/Kern.txt")
# Read in tissue specific kinship matrix 
myF <- read.table('../data/Kinship_matrices/F_Kern.txt')

First, we will calculate the Eigen values and vectors.

## Get Eigen Values and Vectors 
myE <- eigen(myF)
E_vectors <- myE$vectors
E_values <- myE$values

Next, we will multiply the mean centered expression value by each of the m eigen vectors (numerator of equation 2). This step takes a little time (~10 minutes) so I will pre-load the results to speed up the knitting process

#df2 <- data.frame(matrix(ncol=ncol(df1), nrow=nrow(df1)))
#colnames(df2) <- colnames(df1[1:ncol(df1)])
#rownames(df2) <- rownames(df1)

## Calculate Q values by multiplying the mean-centered expression value by each eigen vector 
#for (i in 1:ncol(df2)) {
#  mean_centered_data <- t(as.matrix(as.numeric(df1[,i])))
#  for (k in 1:nrow(df2)){
#    u <- as.matrix(as.numeric(E_vectors[,k]))
#    value <- mean_centered_data %*% u
#    df2[k,i] <- value
#  }
#}
df2 <- read.table("../output/Identifying_Selected_Genes/intermediate_df2.txt")

Now divide by the square root of the eigenvalue to get the \(C_m\) value from equation 2.

## Get the square root of the Eigen values   
de <- data.frame(matrix(nrow = nrow(df1),ncol = 2))
de$Egien_values <- E_values 
de$Sqrt_EV <- sqrt(de$Egien_values)

## Calculate C-values by dividing Q values by the square root of the eigen values
df4 <- data.frame(matrix(ncol=ncol(df2),nrow=nrow(df2)))
for (i in 1:ncol(df2)){
  df4[,i] <- (df2[,i] / de$Sqrt_EV)
}

We want to get the F value (equation 3) and the corresponding p-value for each of the first 5 PC’s individually

## Calculate F-values by dividing variances 
F_values <- data.frame(matrix(ncol=ncol(df2), nrow = 1))
for (j in 1:ncol(df2)){
  for (i in 1:5){
    q <- df4[i,j] 
    t <- df4[11:20,j] 
    var_q <- mean(q^2)
    var_t <- mean(t^2)
    F_value <- var_q / var_t
    F_values[i,j] <- F_value
  }
}

## Calculate P-values from recorded F values 
P_values_ind <- data.frame(matrix(ncol=ncol(df2), nrow =1))
for (j in 1:ncol(F_values)){
  for (r in 1:5) {
    f_stat <- F_values[r, j]
    p_value <- pf(q=f_stat, df1=1, df2=10, lower.tail=FALSE) 
    P_values_ind[r, j] <- p_value
  }
}

Finally, we want to get the F value for the first 5 PC’s combined (equation 4)

## Calculate F-values by dividing variances 
F_values <- data.frame(matrix(ncol=ncol(df2), nrow = 1))
for (j in 1:ncol(df2)){
  for (i in 1:1){
    q <- df4[1:5,j] 
    t <- df4[11:20,j]
    var_q <- mean(q^2)
    var_t <- mean(t^2)
    F_value <- var_q / var_t
    F_values[i,j] <- F_value
  }
}

## Calculate P-values from recorded F values 
P_values_comb <- data.frame(matrix(ncol=ncol(df2), nrow =1))
for (j in 1:ncol(F_values)){
  for (r in 1:1) {
    f_stat <- F_values[r, j]
    p_value <- pf(q=f_stat, df1=5, df2=10, lower.tail=FALSE) 
    P_values_comb[r, j] <- p_value
  }
}

The code above can be used to generate the p-values for all 9 tissues, the running this code is in “output/Identifying_Selected_Genes/”. The files labeled “ALL” have the p-values for the first 5 PC’s individually. The files labeled “Combined” have the p-values for the first 5 PC’s combined.

Figure

Read in the P-values for the first 5 PCs for each tissue

kern <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_Kern.txt"))
gshoot <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_GShoot.txt"))
groot <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_GRoot.txt"))
base <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_L3Base.txt"))
tip <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_L3Tip.txt"))
lmad8 <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_LMAD8.txt"))
lman8 <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_LMAN8.txt"))
lmad26 <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_LMAD26.txt"))
lman26 <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_LMAN26.txt"))

Calculate the Q-values for each PC in each tissue and record the number of genes with Q-value < 0.1

# Create table to collect selected genes
FDR <- as.data.frame(matrix(ncol=6, nrow=9))
colnames(FDR) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC1:5")
rownames(FDR) <- c("Kern", "GShoot", "GRoot", "L3Base", "L3Tip", "LMAD8", "LMAN8", "LMAD26", "LMAN26")

# For each tissue calculate q-values and record the number of genes with q-values (and raw p-values) under 0.1
for (i in 1:5) {
  qobj <- qvalue::qvalue(kern[,i])
  FDR[1,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(gshoot[,i])
  FDR[2,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(groot[,i])
  FDR[3,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(base[,i])
  FDR[4,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(tip[,i])
  FDR[5,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(lmad8[,i])
  FDR[6,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(lman8[,i])
  FDR[7,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(lmad26[,i])
  FDR[8,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

for (i in 1:5) {
  qobj <- qvalue::qvalue(lman26[,i])
  FDR[9,i] <- length(qobj$qvalues[qobj$qvalues < 0.1])
}

Read in P-values for PC1:5 combined for each tissue

kern2 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_Kern.txt"))
gshoot2 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_GShoot.txt"))
groot2 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_GRoot.txt"))
base2 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_L3Base.txt"))
tip2 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_L3Tip.txt"))
lmad82 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_LMAD8.txt"))
lman82 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_LMAN8.txt"))
lmad262 <- t(read.table("../output/Identifying_Selected_Genes/P_values_Combined_LMAD26.txt"))
lman262 <- t(read.table("../output/Identifying_Selected_Genes/P_values_ALL_LMAN26.txt"))

Record number of genes with qvalues < 0.1

qobj <- qvalue::qvalue(kern2[,1])
FDR[1,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(gshoot2[,1])
FDR[2,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(groot2[,1])
FDR[3,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(base2[,1])
FDR[4,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(tip2[,1])
FDR[5,6] <- length(qobj$qvalues2[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(lmad82[,1])
FDR[6,6] <- length(qobj$qvalues2[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(lman82[,1])
FDR[7,6] <- length(qobj$qvalues2[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(lmad262[,1])
FDR[8,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])
qobj <- qvalue::qvalue(lman262[,1])
FDR[9,6] <- length(qobj$qvalues[qobj$qvalues < 0.1])

Reshape Data for Plotting

FDR$Tissue <- c("Kern", "GShoot", "GRoot", "L3Base", "L3Tip", "LMAD8", "LMAN8", "LMAD26", "LMAN26")
dat <- melt(FDR, id.vars = "Tissue")
dat[dat == 0] <- NA

Plot figure 2A

pl <- ggplot(data=dat,aes(x=variable,y=Tissue)) + 
  geom_tile(aes(fill=value),color='black') + scale_fill_gradient(low = 'lightyellow', high = "#CC79A7", guide = FALSE, na.value = "white") + theme_bw() + xlab("\n") + ylab("Tissue") + 
  theme(axis.ticks=element_blank(),panel.border=element_blank(),panel.grid.major = element_blank(), axis.text.y = element_text(size=10,angle=0), axis.title.y = element_text(size=16),axis.title.x  = element_text(size=16),axis.text.x = element_text(angle = 0, hjust = 0.5,size=14)) + geom_text(aes(label=value),colour="grey15",size=3.5)

pl

Warning: Removed 26 rows containing missing values (geom_text).

Version	Author	Date
d6e709f	jgblanc	2020-02-17

Plot figure 2B

First read in population and expression information

pop <- read.csv("../data/FlintGarciaTableS1_fixednew.csv")
kern_all <- read.table("../data/Mean_centered_expression/Kern.txt")
rows <- row.names(kern_all)
kern_all$lines <- rows
pop_dat <- merge(pop, kern_all, by.x = "Inbred", by.y = "lines" )

Get Eigen values and vectors

## Kinship Matrix for all 207 Kern lines
myF <- read.table('../data/Kinship_matrices/F_Kern.txt')

## Get Eigen Values and Vectors 
myE <- eigen(myF)
E_vectors <- myE$vectors
E_values <- myE$values

Expression Plot

pop_dat$sel <- kern_all[,1706]
pop_dat$one <- E_vectors[,1]


lambda <- E_values[1]
var_C11_20 <- 190.0949 #Expected denominator calculated from script 
k <-  1.96 * sqrt(var_C11_20 * lambda) 

k_plus <- k 
k_minus <- -k 

lR <- lm(pop_dat$sel ~ pop_dat$one) 
coeff <- lR$coefficients[[2]]

col <- c('#E69F00', '#56B4E9', "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pl1 <- ggplot(data=pop_dat, aes(x = one, y= sel , color=Subpopulation)) + scale_colour_manual(values = col, labels=c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "sweet", "tropical")) + xlab("PC 1") +  ylab("Expression") + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.y = element_text(size=12), axis.title.x  = element_text(size=12), legend.position = "right", legend.title = element_text(size = 12), legend.text = element_text(size = 10), plot.title = element_text(hjust = 0.5)) + geom_point(size = 2.5) + geom_abline(slope = k_plus, linetype = 2) + geom_abline(slope = coeff, size = 1.5)+  geom_abline(slope = k_minus, linetype = 2) + ggtitle("GRMZM2G038195")

pl1

Version	Author	Date
d6e709f	jgblanc	2020-02-17

Plot Figure 2C

pop_dat$two <- E_vectors[,2]
pop_dat$sel <- kern_all[,1867]

lambda <- E_values[2]
# Expected denominator calculated from script 
k <-  1.96 * sqrt(20.96418 * lambda) 

k_plus <- k 
k_minus <- -k 

lR <- lm(pop_dat$sel ~ pop_dat$two) 
coeff <- lR$coefficients[[2]]

pl2 <- ggplot(data=pop_dat, aes(x = two, y= sel , color=Subpopulation)) + scale_colour_manual(values = col, labels=c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "sweet", "tropical")) + xlab("PC 2") +  ylab("Expression") + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.y = element_text(size=12), axis.title.x  = element_text(size=12), legend.position = "right", legend.title = element_text(size = 12), legend.text = element_text(size = 10),plot.title = element_text(hjust = 0.5)) + geom_point(size = 2.5) + geom_abline(slope = k_plus, linetype = 2) + geom_abline(slope = coeff, size = 1.5)+  geom_abline(slope = k_minus, linetype = 2) + ggtitle('GRMZM2G042055')

pl2

Version	Author	Date
d6e709f	jgblanc	2020-02-17

Final Figure

final <- ggarrange(pl,                                                 
          ggarrange(pl1, pl2, ncol = 2, labels = c("B", "C"),common.legend = T, legend = "bottom"), 
          nrow = 2, 
          labels = "A"                                       
          )
final

Version	Author	Date
d6e709f	jgblanc	2020-02-17

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggpubr_0.2     magrittr_1.5   reshape2_1.4.3 ggplot2_3.2.1 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       compiler_3.5.1   pillar_1.4.2     git2r_0.25.2    
 [5] plyr_1.8.4       workflowr_1.4.0  tools_3.5.1      digest_0.6.22   
 [9] evaluate_0.14    lifecycle_0.1.0  tibble_2.1.3     gtable_0.3.0    
[13] pkgconfig_2.0.3  rlang_0.4.1      yaml_2.2.0       xfun_0.7        
[17] gridExtra_2.3    withr_2.1.2      stringr_1.4.0    dplyr_0.8.1     
[21] knitr_1.23       fs_1.3.1         cowplot_0.9.4    rprojroot_1.3-2 
[25] grid_3.5.1       tidyselect_0.2.5 qvalue_2.14.1    glue_1.3.1      
[29] R6_2.4.1         rmarkdown_1.13   farver_2.0.1     purrr_0.3.2     
[33] whisker_0.3-2    splines_3.5.1    backports_1.1.5  scales_1.1.0    
[37] htmltools_0.3.6  assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
[41] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    crayon_1.3.4

Identifying Selected Genes

Jennifer Blanc

2/17/2020

Intro

Code

Figure