Linear Modeling to Assess Differential Contacts (DC) Across Species

Last updated: 2019-04-23

Checks: 6 0

Knit directory: HiCiPSC/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190311)

The command set.seed(20190311) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: 29c50c3

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Rplot.jpeg
    Untracked:  Rplot001.jpeg
    Untracked:  Rplot002.jpeg
    Untracked:  Rplot003.jpeg
    Untracked:  Rplot004.jpeg
    Untracked:  Rplot005.jpeg
    Untracked:  Rplot006.jpeg
    Untracked:  Rplot007.jpeg
    Untracked:  Rplot008.jpeg
    Untracked:  Rplot009.jpeg
    Untracked:  Rplot010.jpeg
    Untracked:  Rplot011.jpeg
    Untracked:  Rplot012.jpeg
    Untracked:  Rplot013.jpeg
    Untracked:  Rplot014.jpeg
    Untracked:  Rplot015.jpeg
    Untracked:  Rplot016.jpeg
    Untracked:  Rplot017.jpeg
    Untracked:  Rplot018.jpeg
    Untracked:  Rplot019.jpeg
    Untracked:  Rplot020.jpeg
    Untracked:  Rplot021.jpeg
    Untracked:  Rplot022.jpeg
    Untracked:  Rplot023.jpeg
    Untracked:  Rplot024.jpeg
    Untracked:  Rplot025.jpeg
    Untracked:  Rplot026.jpeg
    Untracked:  Rplot027.jpeg
    Untracked:  Rplot028.jpeg
    Untracked:  Rplot029.jpeg
    Untracked:  Rplot030.jpeg
    Untracked:  Rplot031.jpeg
    Untracked:  Rplot032.jpeg
    Untracked:  Rplot033.jpeg
    Untracked:  Rplot034.jpeg
    Untracked:  Rplot035.jpeg
    Untracked:  Rplot036.jpeg
    Untracked:  Rplot037.jpeg
    Untracked:  Rplot038.jpeg
    Untracked:  Rplot039.jpeg
    Untracked:  Rplot040.jpeg
    Untracked:  Rplot041.jpeg
    Untracked:  Rplot042.jpeg
    Untracked:  Rplot043.jpeg
    Untracked:  Rplot044.jpeg
    Untracked:  Rplot045.jpeg
    Untracked:  Rplot046.jpeg
    Untracked:  Rplot047.jpeg
    Untracked:  Rplot048.jpeg
    Untracked:  Rplot049.jpeg
    Untracked:  Rplot050.jpeg
    Untracked:  Rplot051.jpeg
    Untracked:  Rplot052.jpeg
    Untracked:  S2A.jpeg
    Untracked:  S2B.jpeg
    Untracked:  analysis/TADs.Rmd
    Untracked:  data/Chimp_orthoexon_extended_info.txt
    Untracked:  data/Human_orthoexon_extended_info.txt
    Untracked:  data/Meta_data.txt
    Untracked:  data/TADs/
    Untracked:  data/chimp_lengths.txt
    Untracked:  data/counts_iPSC.txt
    Untracked:  data/epigenetic_enrichments/
    Untracked:  data/final.10kb.homer.df
    Untracked:  data/final.juicer.10kb.KR
    Untracked:  data/final.juicer.10kb.VC
    Untracked:  data/hic_gene_overlap/
    Untracked:  data/human_lengths.txt
    Untracked:  data/old_mediation_permutations/
    Untracked:  output/DC_regions.txt
    Untracked:  output/IEE.RPKM.RDS
    Untracked:  output/IEE_voom_object.RDS
    Untracked:  output/data.4.filtered.lm.QC
    Untracked:  output/data.4.fixed.init.LM
    Untracked:  output/data.4.fixed.init.QC
    Untracked:  output/data.4.init.LM
    Untracked:  output/data.4.init.QC
    Untracked:  output/data.4.lm.QC
    Untracked:  output/full.data.10.init.LM
    Untracked:  output/full.data.10.init.QC
    Untracked:  output/full.data.10.lm.QC
    Untracked:  output/gene.hic.filt.RDS

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	158b881	Ittai Eres	2019-03-13	Build site.
Rmd	67d6f15	Ittai Eres	2019-03-13	Update directory storage of files; add in data frame of only contacts fixed within species.

First, load necessary libraries: limma, plyr, tidyr, data.table, reshape2, cowplot, plotly, dplyr, Hmisc, gplots, stringr, heatmaply, RColorBrewer, edgeR, tidyverse, and compiler

#Linear Modeling Having performed some basic quality control and filtering on the data, I now move to quantify how well Hi-C interaction frequency values can be predicted from species identity. To accomplish this, I utilize limma to linearly model Hi-C interaction frequencies as a function of all the metadata about each sample’s species, sex, and batch. I then add information from linear modeling to the full dataframe, including log-fold changes between species, p-values, betas, and variances. I then make a plot of the mean vs. the log-fold change as an MA plot proxy for quality control. I move on to look at distributions of all the p-values and betas for each of the terms, make some QQ plots to check for normality in the p-values for each term, and represent the log-fold change and p-values in a volcano plot of the data.

###Read in data, normalized in initial_QC document, both with all hits and conditioned upon a hit being discovered in 4 individuals. Also include the data frame conditioned upon significant contacts fixed in either species, as requested by reviewer 2.
setwd("/Users/ittaieres/HiCiPSC")
full.data <- fread("output/full.data.10.init.QC", header = TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)
data.4 <- fread("output/data.4.init.QC", header=TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)
data.4.fixed <- fread("output/data.4.fixed.init.QC", header=TRUE, data.table=FALSE, stringsAsFactors=FALSE, showProgress = FALSE)

###Reassign metadata here, just for quick reference.
meta.data <- data.frame("SP"=c("H", "H", "C", "C", "H", "H", "C", "C"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))

###Now to move on to linear modeling, doing it in both the full data set and the | 4 individuals condition. I utilize limma to make this quick, parllelizable, and simple. First make the model, then do the actual fitting, and finally do multiple testing adjustment with topTable.
design <- model.matrix(~1+meta.data$SP+meta.data$SX+meta.data$Batch) #Parameterize the linear model, with an intercept, and fixed effect terms for species, sex, and library prep batch. If you prefer to think in contrasts, my contrast is humans minus chimps. I prefer to think of one species as the baseline in the linear model, and in this case that's chimps (so chimps get a 0 for species, and humans get a 1).
lmFit(full.data[,304:311], design) %>% eBayes(.) -> model.full
lmFit(data.4[,304:311], design) %>% eBayes(.) -> model.4
lmFit(data.4.fixed[,304:311], design) %>% eBayes(.) -> model.4.fixed
volc.full <- topTable(model.full, coef = 2, sort.by = "none", number = Inf)
volc.4 <- topTable(model.4, coef = 2, sort.by = "none", number = Inf)
volc.4.fixed <- topTable(model.4.fixed, coef=2, sort.by="none", number=Inf)

###Now append the information extracted from linear modeling (p-values and betas for species, sex, and batch) to the 3 data frames:
full.data$sp_beta <- volc.full$logFC #Beta already is logFC since values are log2(obs/exp) Hi-C reads.
full.data$sp_pval <- volc.full$P.Value #Unadjusted P-val
full.data$sp_BH_pval <- volc.full$adj.P.Val #Benjamini-Hochberg adjusted p-values
full.data$avg_IF <- volc.full$AveExpr #Average Interaction Frequency value across all individuals. Useful later for a variety of plots.
full.data$t_statistic <- volc.full$t #t-statistic from limma topTable
full.data$B_statistic <- volc.full$B #B-statistic (LOD) from limma topTable
full.data$sx_pval <- model.full$p.value[,3]
full.data$btc_pval <- model.full$p.value[,4]
full.data$sx_beta <- model.full$coefficients[,3]
full.data$btc_beta <- model.full$coefficients[,4]
full.data$SE <- sqrt(model.full$s2.post)*model.full$stdev.unscaled[,2]


data.4$sp_beta <- volc.4$logFC
data.4$sp_pval <- volc.4$P.Value #Unadjusted P-val, normal scale
data.4$sp_BH_pval <- volc.4$adj.P.Val #Benajimini-Hochberg adjusted p-values
data.4$avg_IF <- volc.4$AveExpr #Average interaction frequency
data.4$t_statistic <- volc.4$t #t-statistic from limma topTable
data.4$B_statistic <- volc.4$B #B-statistic (LOD) from limma topTable
data.4$sx_pval <- model.4$p.value[,3]
data.4$btc_pval <- model.4$p.value[,4]
data.4$sx_beta <- model.4$coefficients[,3]
data.4$btc_beta <- model.4$coefficients[,4]
data.4$SE <- sqrt(model.4$s2.post) * model.4$stdev.unscaled[,2]

data.4.fixed$sp_beta <- volc.4.fixed$logFC
data.4.fixed$sp_pval <- volc.4.fixed$P.Value #Unadjusted P-val, normal scale
data.4.fixed$sp_BH_pval <- volc.4.fixed$adj.P.Val #Benajimini-Hochberg adjusted p-values
data.4.fixed$avg_IF <- volc.4.fixed$AveExpr #Average interaction frequency
data.4.fixed$t_statistic <- volc.4.fixed$t #t-statistic from limma topTable
data.4.fixed$B_statistic <- volc.4.fixed$B #B-statistic (LOD) from limma topTable
data.4.fixed$sx_pval <- model.4.fixed$p.value[,3]
data.4.fixed$btc_pval <- model.4.fixed$p.value[,4]
data.4.fixed$sx_beta <- model.4.fixed$coefficients[,3]
data.4.fixed$btc_beta <- model.4.fixed$coefficients[,4]
data.4.fixed$SE <- sqrt(model.4.fixed$s2.post) * model.4.fixed$stdev.unscaled[,2]

#MA plots; P-value and Beta distributions for the Linear Model

###Before moving to any actual assessment of the linear modeling, do a QC check by producing MA plots for all 3 data frames. The MA plot here is the mean of Hi-C contact frequencies (avg_IF) on the x-axis, and the logFC (species beta, in this case) on the y-axis. What we are generally looking for is a fairly uniform cloud around 0 stretching across the x-axis.
ggplot(data=full.data[,c('avg_IF','sp_beta')], aes(x=avg_IF, y=sp_beta)) + geom_point() + xlab("Means of Hi-C Contact Frequencies") + ylab("Species Beta (log ratio H/C)") + ggtitle("MA Plot, All Data")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,c('avg_IF', 'sp_beta')], aes(x=avg_IF, y=sp_beta)) + geom_point() + xlab("Means of Hi-C Contact Frequencies") + ylab("Species Beta (log ratio H/C)") + ggtitle("MA Plot | 4 Individuals")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,c('avg_IF', 'sp_beta')], aes(x=avg_IF, y=sp_beta)) + geom_point() + xlab("Means of Hi-C Contact Frequencies") + ylab("Species Beta (log ratio H/C)") + ggtitle("MA Plot | Species Fixed")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

#All these plots certainly have some outliers, and have a linear trend I have previously examined that is symmetric across betas--I believe these lines to primarily be created by an orthology-calling issue, as many of the hits along them have all contact frequencies in one species extremely close to 0 (and not near-zero in the other species). Aside from this, there is also a large cloud in the negative beta side that is somewhat disturbing; this is something I still need to check. At the very least, the linear trends seen are symmetric across betas, so this is hopefully an issue that is affecting the two species SOMEWHAT equally.

###Check some of the more basic linear modeling issues: distributions of betas, p-vals, QQplot, volcano plot, etc.
##Check p-vals and betas for all covariates in the full model, on all the data.
ggplot(data=full.data[,322:332], aes(x=sp_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Species") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=full.data[,322:332], aes(x=sx_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Sex") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=full.data[,322:332], aes(x=btc_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Batch") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=full.data[,322:332], aes(x=sp_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Species") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=full.data[,322:332], aes(x=sx_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Sex") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=full.data[,322:332], aes(x=btc_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Batch") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

##Check p-vals and betas for all covariates in the full model, on data conditioned upon hit discovery in 4 individuals.
ggplot(data=data.4[,322:332], aes(x=sp_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Species | 4") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,322:332], aes(x=sx_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Sex | 4") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,322:332], aes(x=btc_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Batch | 4") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,322:332], aes(x=sp_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Species | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,322:332], aes(x=sx_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Sex | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4[,322:332], aes(x=btc_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Batch | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

##Check p-vals and betas for all covariates in the full model, on data conditioned upon hit discovery in all 4 individuals of at least one species.
ggplot(data=data.4.fixed[,322:332], aes(x=sp_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Species | 4 Fixed") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,322:332], aes(x=sx_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Sex | 4 Fixed") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,322:332], aes(x=btc_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Batch | 4 Fixed") + xlab("P-value")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,322:332], aes(x=sp_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Species | 4 Fixed") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,322:332], aes(x=sx_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Sex | 4 Fixed") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=data.4.fixed[,322:332], aes(x=btc_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Batch | 4 Fixed") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

Version	Author	Date
158b881	Ittai Eres	2019-03-13

#These all look pretty good, with p-val distributions for sex and batch being fairly uniform, and betas all around being fairly normally distributed. We see distinct enrichment for significant p-values for the species term, which is what we were hoping for! Also of note is the fact that the beta distribution for species looks wider than those for batch and sex, reassuring us that species is a driving factor in differential Hi-C contacts.

#QQ plots for the Linear Model

###Now, to double-check on the p-values with some QQplots. First, I define a function for creating a qqplot easily and coloring elements of the distribution in order to understand where most of the density on the plot is:
newqqplot=function(pvals, quant, title){  
  len = length(pvals)
  res=qqplot(-log10((1:len)/(1+len)),pvals,plot.it=F)
  plot(res$x,res$y, main=title, xlab="Theoretical", ylab="Actual", col=ifelse(res$y>as.numeric(quantile(res$y, quant[1])), ifelse(res$y>as.numeric(quantile(res$y, quant[2])), "red", "blue"), "black"))
  abline(0, 1)
}

##Start QQplotting some of these actual values.
newqqplot(-log10(full.data$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Full Data")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(full.data$sx_pval), c(0.5, 0.75), "QQ Plot, Sex P-vals, Full Data")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(full.data$btc_pval), c(0.5, 0.75), "QQ Plot, Batch P-vals, Full Data")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Data | 4")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4$sx_pval), c(0.5, 0.75), "QQ Plot, Sex P-vals, Data | 4")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4$btc_pval), c(0.5, 0.75), "QQ Plot, Batch P-vals, Data | 4")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4.fixed$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Data | 4 Fixed")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4.fixed$sx_pval), c(0.5, 0.75), "QQ Plot, Sex P-vals, Data | 4 Fixed")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

newqqplot(-log10(data.4.fixed$btc_pval), c(0.5, 0.75), "QQ Plot, Batch P-vals, Data | 4 Fixed")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

#The batch and sex QQ plots look fairly decent, but the species QQplot has the distribution rising off the axis extremely quickly, before even reaching 50% of the data. I will examine this in a separate markdown document for quality control on the linear modeling. For now, I show volcano plots of the data before looking to rectify this issue.

#Volcano Plots of the Linear Modeling Results

volcplot.full <- data.frame(pval=-log10(full.data$sp_pval), beta=full.data$sp_beta, species=full.data$disc_species)
volcplot.4 <- data.frame(pval=-log10(data.4$sp_pval), beta=data.4$sp_beta, species=data.4$disc_species)
volcplot.4.fixed <- data.frame(pval=-log10(data.4.fixed$sp_pval), beta=data.4.fixed$sp_beta, species=data.4.fixed$disc_species)
ggplot(data=volcplot.full, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("P-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=volcplot.4, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery | 4") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) #Modify to add clearer labels

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=volcplot.4.fixed, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery | 4 Fixed") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) #Modify to add clearer labels

Version	Author	Date
158b881	Ittai Eres	2019-03-13

#The same, but lacking the colors
ggplot(data=volcplot.full, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=volcplot.4, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences | 4 Individuals")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

ggplot(data=volcplot.4.fixed, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences | 4 Fixed")

Version	Author	Date
158b881	Ittai Eres	2019-03-13

#Volcano plots show a distinct asymmetry, with  many more hits showing stronger contact on the chimpanzee side than on the human side. Since this makes little biological sense, in the next document I will look for a technical factor that could be driving this.

#Write out the data with the new columns added on!
fwrite(full.data, "output/full.data.10.init.LM", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE, na="NA", showProgress = FALSE)
fwrite(data.4, "output/data.4.init.LM", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress = FALSE)
fwrite(data.4.fixed, "output/data.4.fixed.init.LM", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress=FALSE)

From this analysis two concerns emerge for further exploration: the QQ plots for the species p-values seem inflated, and the volcano plot seems highly asymmetrical. I’ll examine this further in the next analysis on linear modeling QC.

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] forcats_0.4.0      purrr_0.3.2        readr_1.3.1       
 [4] tibble_2.1.1       tidyverse_1.2.1    edgeR_3.20.9      
 [7] RColorBrewer_1.1-2 heatmaply_0.15.2   viridis_0.5.1     
[10] viridisLite_0.3.0  stringr_1.4.0      gplots_3.0.1.1    
[13] Hmisc_4.2-0        Formula_1.2-3      survival_2.44-1   
[16] lattice_0.20-38    dplyr_0.8.0.1      plotly_4.8.0      
[19] cowplot_0.9.4      ggplot2_3.1.0      reshape2_1.4.3    
[22] data.table_1.12.0  tidyr_0.8.3        plyr_1.8.4        
[25] limma_3.34.9      

loaded via a namespace (and not attached):
 [1] nlme_3.1-137        bitops_1.0-6        fs_1.2.7           
 [4] lubridate_1.7.4     webshot_0.5.1       httr_1.4.0         
 [7] rprojroot_1.3-2     prabclus_2.2-7      tools_3.4.0        
[10] backports_1.1.3     R6_2.4.0            rpart_4.1-13       
[13] KernSmooth_2.23-15  lazyeval_0.2.2      colorspace_1.4-1   
[16] trimcluster_0.1-2.1 nnet_7.3-12         withr_2.1.2        
[19] tidyselect_0.2.5    gridExtra_2.3       git2r_0.25.2       
[22] cli_1.1.0           rvest_0.3.2         htmlTable_1.13.1   
[25] TSP_1.1-6           xml2_1.2.0          labeling_0.3       
[28] diptest_0.75-7      caTools_1.17.1.2    scales_1.0.0       
[31] checkmate_1.9.1     DEoptimR_1.0-8      mvtnorm_1.0-8      
[34] robustbase_0.92-8   digest_0.6.18       foreign_0.8-71     
[37] rmarkdown_1.12      base64enc_0.1-3     pkgconfig_2.0.2    
[40] htmltools_0.3.6     readxl_1.3.1        htmlwidgets_1.3    
[43] rlang_0.3.3         rstudioapi_0.10     generics_0.0.2     
[46] jsonlite_1.6        mclust_5.4.3        gtools_3.8.1       
[49] acepack_1.4.1       dendextend_1.10.0   magrittr_1.5       
[52] modeltools_0.2-22   Matrix_1.2-15       Rcpp_1.0.1         
[55] munsell_0.5.0       stringi_1.4.3       whisker_0.3-2      
[58] yaml_2.2.0          MASS_7.3-51.1       flexmix_2.3-15     
[61] grid_3.4.0          gdata_2.18.0        crayon_1.3.4       
[64] haven_2.1.0         splines_3.4.0       hms_0.4.2          
[67] locfit_1.5-9.1      knitr_1.22          pillar_1.3.1       
[70] fpc_2.1-11.1        codetools_0.2-16    stats4_3.4.0       
[73] glue_1.3.1          gclus_1.3.2         evaluate_0.13      
[76] latticeExtra_0.6-28 modelr_0.1.4        foreach_1.4.4      
[79] cellranger_1.1.0    gtable_0.3.0        kernlab_0.9-27     
[82] assertthat_0.2.1    xfun_0.5            broom_0.5.1        
[85] class_7.3-15        seriation_1.2-3     iterators_1.0.10   
[88] registry_0.5-1      workflowr_1.2.0     cluster_2.0.7-1

Linear Modeling to Assess Differential Contacts (DC) Across Species

Ittai Eres

2019-03-12