Last updated: 2019-04-30

Checks: 6 0

Knit directory: HiCiPSC/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190311) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Rplot.jpeg
    Untracked:  Rplot001.jpeg
    Untracked:  Rplot002.jpeg
    Untracked:  Rplot003.jpeg
    Untracked:  Rplot004.jpeg
    Untracked:  Rplot005.jpeg
    Untracked:  Rplot006.jpeg
    Untracked:  Rplot007.jpeg
    Untracked:  Rplot008.jpeg
    Untracked:  Rplot009.jpeg
    Untracked:  Rplot010.jpeg
    Untracked:  Rplot011.jpeg
    Untracked:  Rplot012.jpeg
    Untracked:  Rplot013.jpeg
    Untracked:  Rplot014.jpeg
    Untracked:  Rplot015.jpeg
    Untracked:  Rplot016.jpeg
    Untracked:  Rplot017.jpeg
    Untracked:  Rplot018.jpeg
    Untracked:  Rplot019.jpeg
    Untracked:  Rplot020.jpeg
    Untracked:  Rplot021.jpeg
    Untracked:  Rplot022.jpeg
    Untracked:  Rplot023.jpeg
    Untracked:  Rplot024.jpeg
    Untracked:  Rplot025.jpeg
    Untracked:  Rplot026.jpeg
    Untracked:  Rplot027.jpeg
    Untracked:  Rplot028.jpeg
    Untracked:  Rplot029.jpeg
    Untracked:  Rplot030.jpeg
    Untracked:  Rplot031.jpeg
    Untracked:  Rplot032.jpeg
    Untracked:  Rplot033.jpeg
    Untracked:  Rplot034.jpeg
    Untracked:  Rplot035.jpeg
    Untracked:  Rplot036.jpeg
    Untracked:  Rplot037.jpeg
    Untracked:  Rplot038.jpeg
    Untracked:  Rplot039.jpeg
    Untracked:  Rplot040.jpeg
    Untracked:  Rplot041.jpeg
    Untracked:  Rplot042.jpeg
    Untracked:  Rplot043.jpeg
    Untracked:  Rplot044.jpeg
    Untracked:  Rplot045.jpeg
    Untracked:  Rplot046.jpeg
    Untracked:  Rplot047.jpeg
    Untracked:  Rplot048.jpeg
    Untracked:  Rplot049.jpeg
    Untracked:  Rplot050.jpeg
    Untracked:  Rplot051.jpeg
    Untracked:  Rplot052.jpeg
    Untracked:  Rplot053.jpeg
    Untracked:  Rplot054.jpeg
    Untracked:  Rplot055.jpeg
    Untracked:  Rplot056.jpeg
    Untracked:  Rplot057.jpeg
    Untracked:  Rplot058.jpeg
    Untracked:  Rplot059.jpeg
    Untracked:  Rplot060.jpeg
    Untracked:  Rplot061.jpeg
    Untracked:  Rplot062.jpeg
    Untracked:  Rplot063.jpeg
    Untracked:  Rplot064.jpeg
    Untracked:  Rplot065.jpeg
    Untracked:  Rplot066.jpeg
    Untracked:  Rplot067.jpeg
    Untracked:  Rplot068.jpeg
    Untracked:  Rplot069.jpeg
    Untracked:  Rplot070.jpeg
    Untracked:  Rplot071.jpeg
    Untracked:  Rplot072.jpeg
    Untracked:  Rplot073.jpeg
    Untracked:  Rplot074.jpeg
    Untracked:  Rplot075.jpeg
    Untracked:  Rplot076.jpeg
    Untracked:  Rplot077.jpeg
    Untracked:  Rplot078.jpeg
    Untracked:  Rplot079.jpeg
    Untracked:  Rplot080.jpeg
    Untracked:  Rplot081.jpeg
    Untracked:  Rplot082.jpeg
    Untracked:  Rplot083.jpeg
    Untracked:  Rplot084.jpeg
    Untracked:  Rplot085.jpeg
    Untracked:  Rplot086.jpeg
    Untracked:  Rplot087.jpeg
    Untracked:  Rplot088.jpeg
    Untracked:  Rplot089.jpeg
    Untracked:  Rplot090.jpeg
    Untracked:  Rplot091.jpeg
    Untracked:  Rplot092.jpeg
    Untracked:  Rplot093.jpeg
    Untracked:  Rplot094.jpeg
    Untracked:  Rplot095.jpeg
    Untracked:  Rplot096.jpeg
    Untracked:  Rplot097.jpeg
    Untracked:  Rplot098.jpeg
    Untracked:  Rplot099.jpeg
    Untracked:  Rplot100.jpeg
    Untracked:  Rplot101.jpeg
    Untracked:  Rplot102.jpeg
    Untracked:  Rplot103.jpeg
    Untracked:  Rplot104.jpeg
    Untracked:  Rplot105.jpeg
    Untracked:  Rplot106.jpeg
    Untracked:  Rplot107.jpeg
    Untracked:  Rplot108.jpeg
    Untracked:  Rplot109.jpeg
    Untracked:  Rplot110.jpeg
    Untracked:  Rplot111.jpeg
    Untracked:  Rplot112.jpeg
    Untracked:  Rplot113.jpeg
    Untracked:  Rplot114.jpeg
    Untracked:  Rplot115.jpeg
    Untracked:  Rplot116.jpeg
    Untracked:  Rplot117.jpeg
    Untracked:  Rplot118.jpeg
    Untracked:  Rplot119.jpeg
    Untracked:  Rplot120.jpeg
    Untracked:  Rplot121.jpeg
    Untracked:  Rplot122.jpeg
    Untracked:  Rplot123.jpeg
    Untracked:  Rplot124.jpeg
    Untracked:  Rplot125.jpeg
    Untracked:  Rplot126.jpeg
    Untracked:  Rplot127.jpeg
    Untracked:  Rplot128.jpeg
    Untracked:  Rplot129.jpeg
    Untracked:  Rplot130.jpeg
    Untracked:  Rplot131.jpeg
    Untracked:  Rplot132.jpeg
    Untracked:  Rplot133.jpeg
    Untracked:  Rplot134.jpeg
    Untracked:  Rplot135.jpeg
    Untracked:  Rplot136.jpeg
    Untracked:  Rplot137.jpeg
    Untracked:  Rplot138.jpeg
    Untracked:  Rplot139.jpeg
    Untracked:  Rplot140.jpeg
    Untracked:  Rplot141.jpeg
    Untracked:  Rplot142.jpeg
    Untracked:  Rplot143.jpeg
    Untracked:  Rplot144.jpeg
    Untracked:  Rplot145.jpeg
    Untracked:  Rplot146.jpeg
    Untracked:  Rplot147.jpeg
    Untracked:  Rplot148.jpeg
    Untracked:  Rplot149.jpeg
    Untracked:  Rplot150.jpeg
    Untracked:  Rplot151.jpeg
    Untracked:  Rplot152.jpeg
    Untracked:  Rplot153.jpeg
    Untracked:  Rplot154.jpeg
    Untracked:  Rplot155.jpeg
    Untracked:  Rplot156.jpeg
    Untracked:  Rplot157.jpeg
    Untracked:  Rplot158.jpeg
    Untracked:  Rplot159.jpeg
    Untracked:  Rplot160.jpeg
    Untracked:  Rplot161.jpeg
    Untracked:  Rplot162.jpeg
    Untracked:  Rplot163.jpeg
    Untracked:  Rplot164.jpeg
    Untracked:  Rplot165.jpeg
    Untracked:  Rplot166.jpeg
    Untracked:  Rplot167.jpeg
    Untracked:  Rplot168.jpeg
    Untracked:  Rplot169.jpeg
    Untracked:  Rplot170.jpeg
    Untracked:  Rplot171.jpeg
    Untracked:  Rplot172.jpeg
    Untracked:  Rplot173.jpeg
    Untracked:  Rplot174.jpeg
    Untracked:  Rplot175.jpeg
    Untracked:  Rplot176.jpeg
    Untracked:  Rplot177.jpeg
    Untracked:  Rplot178.jpeg
    Untracked:  Rplot179.jpeg
    Untracked:  Rplot180.jpeg
    Untracked:  Rplot181.jpeg
    Untracked:  Rplot182.jpeg
    Untracked:  Rplot183.jpeg
    Untracked:  Rplot184.jpeg
    Untracked:  Rplot185.jpeg
    Untracked:  Rplot186.jpeg
    Untracked:  Rplot187.jpeg
    Untracked:  Rplot188.jpeg
    Untracked:  Rplot189.jpeg
    Untracked:  Rplot190.jpeg
    Untracked:  Rplot191.jpeg
    Untracked:  Rplot192.jpeg
    Untracked:  Rplot193.jpeg
    Untracked:  Rplot194.jpeg
    Untracked:  Rplot195.jpeg
    Untracked:  Rplot196.jpeg
    Untracked:  Rplot197.jpeg
    Untracked:  Rplot198.jpeg
    Untracked:  Rplot199.jpeg
    Untracked:  Rplot200.jpeg
    Untracked:  Rplot201.jpeg
    Untracked:  Rplot202.jpeg
    Untracked:  Rplot203.jpeg
    Untracked:  Rplot204.jpeg
    Untracked:  Rplot205.jpeg
    Untracked:  Rplot206.jpeg
    Untracked:  Rplot207.jpeg
    Untracked:  Rplot208.jpeg
    Untracked:  Rplot209.jpeg
    Untracked:  Rplot210.jpeg
    Untracked:  Rplot211.jpeg
    Untracked:  Rplot212.jpeg
    Untracked:  Rplot213.jpeg
    Untracked:  Rplot214.jpeg
    Untracked:  Rplot215.jpeg
    Untracked:  Rplot216.jpeg
    Untracked:  Rplot217.jpeg
    Untracked:  Rplot218.jpeg
    Untracked:  Rplot219.jpeg
    Untracked:  Rplot220.jpeg
    Untracked:  Rplot221.jpeg
    Untracked:  Rplot222.jpeg
    Untracked:  Rplot223.jpeg
    Untracked:  Rplot224.jpeg
    Untracked:  Rplot225.jpeg
    Untracked:  Rplot226.jpeg
    Untracked:  Rplot227.jpeg
    Untracked:  Rplot228.jpeg
    Untracked:  Rplot229.jpeg
    Untracked:  Rplot230.jpeg
    Untracked:  Rplot231.jpeg
    Untracked:  Rplot232.jpeg
    Untracked:  Rplot233.jpeg
    Untracked:  Rplot234.jpeg
    Untracked:  Rplot235.jpeg
    Untracked:  Rplot236.jpeg
    Untracked:  Rplot237.jpeg
    Untracked:  Rplot238.jpeg
    Untracked:  Rplot239.jpeg
    Untracked:  Rplot240.jpeg
    Untracked:  Rplot241.jpeg
    Untracked:  Rplot242.jpeg
    Untracked:  Rplot243.jpeg
    Untracked:  Rplot244.jpeg
    Untracked:  Rplot245.jpeg
    Untracked:  Rplot246.jpeg
    Untracked:  Rplot247.jpeg
    Untracked:  Rplot248.jpeg
    Untracked:  Rplot249.jpeg
    Untracked:  Rplot250.jpeg
    Untracked:  Rplot251.jpeg
    Untracked:  Rplot252.jpeg
    Untracked:  Rplot253.jpeg
    Untracked:  Rplot254.jpeg
    Untracked:  Rplot255.jpeg
    Untracked:  Rplot256.jpeg
    Untracked:  Rplot257.jpeg
    Untracked:  Rplot258.jpeg
    Untracked:  Rplot259.jpeg
    Untracked:  Rplot260.jpeg
    Untracked:  Rplot261.jpeg
    Untracked:  Rplot262.jpeg
    Untracked:  Rplot263.jpeg
    Untracked:  Rplot264.jpeg
    Untracked:  Rplot265.jpeg
    Untracked:  Rplot266.jpeg
    Untracked:  Rplot267.jpeg
    Untracked:  Rplot268.jpeg
    Untracked:  Rplot269.jpeg
    Untracked:  Rplot270.jpeg
    Untracked:  Rplot271.jpeg
    Untracked:  Rplot272.jpeg
    Untracked:  Rplot273.jpeg
    Untracked:  Rplot274.jpeg
    Untracked:  Rplot275.jpeg
    Untracked:  Rplot276.jpeg
    Untracked:  Rplot277.jpeg
    Untracked:  Rplot278.jpeg
    Untracked:  Rplot279.jpeg
    Untracked:  Rplot280.jpeg
    Untracked:  Rplot281.jpeg
    Untracked:  Rplot282.jpeg
    Untracked:  Rplot283.jpeg
    Untracked:  Rplot284.jpeg
    Untracked:  Rplot285.jpeg
    Untracked:  Rplot286.jpeg
    Untracked:  Rplot287.jpeg
    Untracked:  Rplot288.jpeg
    Untracked:  Rplot289.jpeg
    Untracked:  Rplot290.jpeg
    Untracked:  Rplot291.jpeg
    Untracked:  Rplot292.jpeg
    Untracked:  Rplot293.jpeg
    Untracked:  Rplot294.jpeg
    Untracked:  Rplot295.jpeg
    Untracked:  Rplot296.jpeg
    Untracked:  Rplot297.jpeg
    Untracked:  Rplot298.jpeg
    Untracked:  Rplot299.jpeg
    Untracked:  Rplot300.jpeg
    Untracked:  Rplot301.jpeg
    Untracked:  Rplot302.jpeg
    Untracked:  Rplot303.jpeg
    Untracked:  Rplot304.jpeg
    Untracked:  S2A.jpeg
    Untracked:  S2B.jpeg
    Untracked:  analysis/juicer_enrichment.Rmd
    Untracked:  code/mediate.test.regressing.R
    Untracked:  data/Chimp_orthoexon_extended_info.txt
    Untracked:  data/Human_orthoexon_extended_info.txt
    Untracked:  data/Meta_data.txt
    Untracked:  data/TADs/
    Untracked:  data/chimp_lengths.txt
    Untracked:  data/counts_iPSC.txt
    Untracked:  data/epigenetic_enrichments/
    Untracked:  data/final.10kb.homer.df
    Untracked:  data/final.juicer.10kb.KR
    Untracked:  data/final.juicer.10kb.VC
    Untracked:  data/hic_gene_overlap/
    Untracked:  data/human_lengths.txt
    Untracked:  data/old_mediation_permutations/
    Untracked:  output/DC_regions.txt
    Untracked:  output/IEE.RPKM.RDS
    Untracked:  output/IEE_voom_object.RDS
    Untracked:  output/data.4.filtered.lm.QC
    Untracked:  output/data.4.fixed.init.LM
    Untracked:  output/data.4.fixed.init.QC
    Untracked:  output/data.4.init.LM
    Untracked:  output/data.4.init.QC
    Untracked:  output/data.4.lm.QC
    Untracked:  output/full.data.10.init.LM
    Untracked:  output/full.data.10.init.QC
    Untracked:  output/full.data.10.lm.QC
    Untracked:  output/full.data.annotations.RDS
    Untracked:  output/gene.hic.filt.RDS
    Untracked:  output/juicer.filt.KR
    Untracked:  output/juicer.filt.KR.lm
    Untracked:  output/juicer.filt.VC
    Untracked:  output/juicer.filt.VC.lm

Unstaged changes:
    Modified:   analysis/TADs.Rmd
    Modified:   analysis/gene_expression.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 8380e8b Ittai Eres 2019-04-30 Add variety of juicer analyses into website files.

###Initial Data read-in and quality control First, I read in the data and normalize it with cyclic pairwise loess normalization. Then I look at histograms of the distributions of the contact frequencies on an individual-by-individual basis, to see if they are comparable. I also look at a plot of the mean vs. the variance as an additional quality control metric; this is typically done on RNA-seq count data. In that case, it’s done to check if the normalized data are distributed in a particular fashion (e.g. a poisson model would have a linear mean-variance relationship). In the limma/voom pipeline, count data are log-cpm transformed and a loess trend line of variance vs. mean is then fit to create weights for individual genes to be fed into linear modeling with limma. Since this is not a QC metric typically used for Hi-C data, the only thing I hope to see is no particularly strong relationship between variance and mean.

###Read in VC and KR normalized data, normalize each with cyclic loess (pairwise), clean it from any rows that have missing values.
full.KR <- fread("data/final.juicer.10kb.KR", header=TRUE, stringsAsFactors = FALSE, data.table=FALSE, showProgress = FALSE) #Start w/ roughly 31.5k hits here
full.VC <- fread("data/final.juicer.10kb.VC", header=TRUE, stringsAsFactors = FALSE, data.table = FALSE, showProgress = FALSE) #~84k hits...

#Note that many more hits show up as significant under the VC normalization paradigm than do under KR balancing. This is especially bad for A-21792 for some reason.

#Subsetting down to only complete cases (I.E. none of the individuals have NA values goes to ~28k and ~77k hits for KR and VC, respectively. Reductions of both ~10%.)
full.KR <- full.KR[complete.cases(full.KR[,112:119]),]
full.VC <- full.VC[complete.cases(full.VC[,112:119]),]

#Visualize both.
boxplot(full.KR[,112:119], ylim=c(0, 30), horizontal = TRUE, main="KR Distributions")

boxplot(full.VC[,112:119], ylim=c(0, 30), horizontal=TRUE, main="VC Distributions")

KR.contacts <- full.KR[,112:119]
VC.contacts <- full.VC[,112:119]

###Pearson is a product moment correlation, meaning it evaluates the relationship b/t two continuous variables. Spearman is a rank-order correlation, is mainly evaluating monotonic rlationship b/t two continuous/ordinal variables (variables change together, but not necessarily at a constant rate). It's based on ranked values for each variable, rather than the raw data itself, and is thus not quite as precise IMO. Hence here I optimize my normalization schemes WRT seeing proper clustering in a Pearson correlation heatmap. OR I do it WRT a Spearman heatmap b/c Pearson is more quantitative and thus harder to fine-tune, and keeps messing up here--can't find a single normalization scheme that gives good clustering that isn't super dirty on Pearson.
###For now, just use Spearman correlations 
full.contacts.VC <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.28, iterations=3, method="pairs"))#Pearson clusters properly, but correlation values are not very pretty for separation. Same situation in Spearman but slightly less ugly. #6 iterations of 0.1 aren't bad, and can also get beautiful results with method="fast" instead of pairs (but this takes loess to the average of all the data so that's why, maybe not as valid). 0.21 and 0.28 spans also worth checking? They produce nice looking boxplots but not great looking heatmaps in either case...

#VC Correlations
corheat <- cor(full.contacts.VC, use="complete.obs", method="pearson")
corheats <- cor(full.contacts.VC, use="complete.obs", method="spearman")
colnames(corheat) <- c("H_F1", "H_M1", "C_M1", "C_F1", "H_M2", "H_F2", "C_M2", "C_F2")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#VC Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters well, but not super distinct. #FIGS1B
heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Same as above.
#Can't manage to find a cyclic loess normalization here that makes samples cluster properly for both types of correlations...can get samples clustering in appropriate places but not get the dendrogram to look how I expect--just use same settings as above, in interest of consistency.
full.contacts.KR <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.36, iterations=3, method="pairs")) #Method="fast" doesn't give us the same awesomeness here, but that's not too surprising considering different normalization schemes. .26 and .32 may also be worth checking here.

#KR Correlations
corheat <- cor(full.contacts.KR, use="complete.obs", method="pearson") #Corheat for the full data set, and heatmap. Pearson clusters poorly.
corheats <- cor(full.contacts.KR, use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#KR Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!
heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!
#FINALLY have something that works for both, and boxplots look good. One last check on them before moving to hists of dist'ns:
boxplot(full.contacts.KR, ylim=c(0, 30), horizontal = TRUE, main="KR Distributions, Normalized")

boxplot(full.contacts.VC, ylim=c(0, 30), horizontal=TRUE, main="VC Distributions, Normalized")

###First, a quick look at histograms of the distributions of the significant Hi-C hits from homer, in both humans and chimps. Create the melted dfs for each first otherwise it takes an absurd amount of time/memory:
VC.humans <- melt(full.contacts.VC[,c(1:2, 5:6)])
No id variables; using all as measure variables
VC.chimps <- melt(full.contacts.VC[,c(3:4, 7:8)])
No id variables; using all as measure variables
KR.humans <- melt(full.contacts.KR[,c(1:2, 5:6)])
No id variables; using all as measure variables
KR.chimps <- melt(full.contacts.KR[,c(3:4, 7:8)])
No id variables; using all as measure variables
#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Human dist'ns

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Chimp Dist'ns

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Human dist'ns

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Chimp Dist'ns

###########Take a look at these again, but on the raw data before normalization between samples:
VC.humans <- melt(VC.contacts[,c(1:2, 5:6)])
No id variables; using all as measure variables
VC.chimps <- melt(VC.contacts[,c(3:4, 7:8)])
No id variables; using all as measure variables
KR.humans <- melt(KR.contacts[,c(1:2, 5:6)])
No id variables; using all as measure variables
KR.chimps <- melt(KR.contacts[,c(3:4, 7:8)])
No id variables; using all as measure variables
#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Human dist'ns

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Chimp Dist'ns

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Human dist'ns

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Chimp Dist'ns

#There is no longer a good case for not normalizing in a pairwise cyclic loess fashion, especially now that I have clustering working on both types of normalization--just use what we have from here on out.

######PiCKUP HERE and see if I then slotted these pairwise cyclic loess back into the full DF in the other paradigm. I did indeed slot things into the data frame again in the other context, but here it must be two separate DFs--one for KR and one for VC, as the significant hits differ b/t these normalization schemes anyway.
full.KR[,112:119] <- full.contacts.KR
full.VC[,112:119] <- full.contacts.VC

###As an initial quality control metric, take a look at the mean vs. variance plot for the normalized data--first for KR, then for VC. Also go ahead and add on columns for mean frequencies and variances both within and across the species, while they're being calculated here anyway:

#KR:
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR") %>% apply(., 1, mean) -> full.KR$Hmean
select(full.KR, "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, mean) -> full.KR$Cmean
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR", "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, mean) -> full.KR$ALLmean #Get means across species.
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR") %>% apply(., 1, var) -> full.KR$Hvar #Get human variances.
select(full.KR, "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, var) -> full.KR$Cvar #Get chimp variances.
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR", "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, var) -> full.KR$ALLvar #Get variance across species.
{plot(full.KR$ALLmean, full.KR$ALLvar, main="Mean vs. Variance in Juicer KR contact frequency", xlab="Mean Hi-C Contact Frequency", ylab="Variance in Hi-C Contact Frequency")
  abline(lm(full.KR$ALLmean~full.KR$ALLvar), col="red")}

#Very flat line of the regression. Solid looking.

#VC:
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC") %>% apply(., 1, mean) -> full.VC$Hmean
select(full.VC, "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, mean) -> full.VC$Cmean
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC", "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, mean) -> full.VC$ALLmean #Get means across species.
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC") %>% apply(., 1, var) -> full.VC$Hvar #Get human variances.
select(full.VC, "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, var) -> full.VC$Cvar #Get chimp variances.
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC", "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, var) -> full.VC$ALLvar #Get variance across species.
{plot(full.VC$ALLmean, full.VC$ALLvar, main="Mean vs. Variance in Juicer VC contact frequency", xlab="Mean Hi-C Contact Frequency", ylab="Variance in Hi-C Contact Frequency")
  abline(lm(full.VC$ALLmean~full.VC$ALLvar), col="red")}

#Very flat line of the regression. Solid looking.

On the whole, the distributions look fairly comparable across individuals and species, and the mean vs. variance plots show weak correlation at best between the two metrics. Now knowing the data are comparable in this sense, the next question would be whether they are different enough between the species to separate them from one another with unsupervised clustering and PCA methods.

###Data clustering with Principal Components Analysis (PCA) and correlation heatmap clustering. Now I use PCA, breaking down the interaction frequency values into principal components that best represent the variability within the data. This technique uses orthogonal transformation to convert the many dimensions in variability of this dataset into a lower-dimensional picture that can be used to separate out functional groups in the data. In this case, the expectation is that the strongest principal component, representing the plurality of the variance, will separate out humans and chimps along its axis, grouping the two species independently, as we expect their interaction frequency values to cluster together. I then also compute pairwise pearson correlations between each of the individuals across all Hi-C contacts, and use unsupervised hierarchical clustering to create a heatmap that once again will group individuals based on similarity in interaction frequency values. Again, I would expect this technique to separate out the species very distinctly from one another.

###Now do principal components analysis (PCA) on these data to see how they separate:
meta.data <- data.frame("SP"=c("H", "H", "C", "C", "H", "H", "C", "C"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2), "PE_reads"=c(1084472930, 1103077950, 1015696574, 1047650944, 980287418, 1037054332, 930089380, 964085606), "tags_per_BP"=c(0.351291, 0.357315, 0.342325, 0.353097, 0.317553, 0.335937, 0.313494, 0.324936), "inter_reads"=c(0.1825411, 0.1734168, 0.1559306, 0.2010798, 0.1711258, 0.1479523, 0.1604700, 0.1712287)) #need the metadata first to make this interesting; obtain this information from my study design
pca <- prcomp(t(full.contacts.KR), scale=TRUE, center=TRUE)
pca2 <- prcomp(t(full.contacts.VC), scale=TRUE, center=TRUE)
ggplot(data=as.data.frame(pca$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on KR normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

barplot(summary(pca)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on KR normalized Hi-C contact frequency") #Scree plot showing all the PCs and the proportion of the variance they explain.

ggplot(data=as.data.frame(pca2$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on VC normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

barplot(summary(pca2)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on VC normalized Hi-C contact frequency")

#Scree plot

#PC Correlations for VC PCA (pca2):
#Actually statistically test correlation w/ PCs:
PC1 <- pca2$x[,1]
PC2 <- pca2$x[,2]
PC3 <- pca2$x[,3]
PC4 <- pca2$x[,4]
PC5 <- pca2$x[,5]
PC6 <- pca2$x[,6]
PC7 <- pca2$x[,7]
PCS <- data.frame(PC1, PC2, PC3, PC4, PC5, PC6, PC7)
summary <- summary(pca2)
PC_pvals <- matrix(data=NA, nrow=7, ncol=6, dimnames=list(c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"), c("Species", "Sex", "Batch", "PE_reads", "tags_per_BP", "inter_reads")))
PC_rsquareds <- matrix(data=NA, nrow=7, ncol=6, dimnames=list(c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"), c("Species", "Sex", "Batch", "PE_reads", "tags_per_BP", "inter_reads")))
for(i in 1:7){
  #For species
  speciesPC1 <- lm(PCS[,i] ~ as.factor(meta.data$SP))
  fstat <- as.data.frame(summary(speciesPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,1] <- p_fstat
  PC_rsquareds[i,1] <- summary(speciesPC1)$r.squared
  
  #For sex
  sexPC1 <- lm(PCS[,i] ~ as.factor(meta.data$SX))
  fstat <- as.data.frame(summary(sexPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,2] <- p_fstat
  PC_rsquareds[i,2] <- summary(sexPC1)$r.squared
  
  #For batch
  btcPC1 <- lm(PCS[,i] ~ as.factor(meta.data$Batch))
  fstat <- as.data.frame(summary(btcPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,3] <- p_fstat
  PC_rsquareds[i,3] <- summary(btcPC1)$r.squared
  
  #For PE_reads
  PEPC1 <- lm(PCS[,i] ~ meta.data$PE_reads)
  fstat <- as.data.frame(summary(PEPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,4] <- p_fstat
  PC_rsquareds[i,4] <- summary(PEPC1)$r.squared
  
  #For tags
  tagsPC1 <- lm(PCS[,i] ~ meta.data$tags_per_BP)
  fstat <- as.data.frame(summary(tagsPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,5] <- p_fstat
  PC_rsquareds[i,5] <- summary(tagsPC1)$r.squared
  
  #For inter
  interPC1 <- lm(PCS[,i] ~ meta.data$inter_reads)
  fstat <- as.data.frame(summary(interPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,6] <- p_fstat
  PC_rsquareds[i,6] <- summary(interPC1)$r.squared
}
PC_pvals
       Species        Sex      Batch   PE_reads tags_per_BP inter_reads
PC1 0.01142577 0.30096937 0.60793795 0.08091143   0.3127508 0.602368920
PC2 0.37165036 0.77102754 0.16803309 0.68908188   0.4036627 0.930390385
PC3 0.39399059 0.18911234 0.46144460 0.25667818   0.3601012 0.567914977
PC4 0.66390784 0.09822534 0.82210831 0.52928526   0.3513486 0.376834341
PC5 0.72877547 0.78735358 0.35036759 0.79662950   0.6501039 0.009185124
PC6 0.96541467 0.91378702 0.08580873 0.21316471   0.1163830 0.567808248
PC7 0.88337621 0.36756652 0.92420574 0.76637027   0.6741631 0.993834150
PC_rsquareds
         Species         Sex       Batch   PE_reads tags_per_BP
PC1 0.6830233808 0.175898521 0.046525011 0.42272655  0.16829338
PC2 0.1344364328 0.015218818 0.290544520 0.02856576  0.11853216
PC3 0.1231722619 0.267697271 0.093483148 0.20751138  0.14057675
PC4 0.0336042094 0.389394466 0.009111427 0.06912277  0.14538288
PC5 0.0215344493 0.013088684 0.145930021 0.01195358  0.03656567
PC6 0.0003403913 0.002120169 0.412768233 0.24426223  0.35915324
PC7 0.0038888744 0.136582070 0.001637640 0.01585773  0.03149592
     inter_reads
PC1 4.794721e-02
PC2 1.380815e-03
PC3 5.733129e-02
PC4 1.317522e-01
PC5 7.042157e-01
PC6 5.736197e-02
PC7 1.081408e-05
#So PC1 correlates strongly and significantly with species, whereas PC4 has some moderate correlation with sex--but nothing else has a significant effect.

###Variance exploration Now I look at several metrics to assess variance in the data, checking which species hits were discovered as significant in and how many individuals a given hit was discovered in. I look at this to understand if there are some cutoffs that should be made to reduce the noisiness of the data and maximize the significance of further findings.

###Add columns to full.data to indicate species of discovery and number of individuals discovered in. These pieces of information are good to know about each of the hits generally, but can also be used to make decisions about filtering out certain classes of hits if variance is associated with any of these metrics.  
humNAs <- rowSums(is.na(full.KR[,1:53])) #52 is when there is no human info. 13 NAs per individual.
chimpNAs <- rowSums(is.na(full.KR[,54:105])) #Same as above of course.
full.KR$found_in_H <- (4-humNAs/13) #Set a new column identifying number of humans hit was found in
full.KR$found_in_C <- (4-chimpNAs/13) #Set a new column identifying number of chimps hit was found in
full.KR$disc_species <- ifelse(full.KR$found_in_C>0&full.KR$found_in_H>0, "B", ifelse(full.KR$found_in_C==0, "H", "C")) #Set a column identifying which species (H, C, or Both) the hit in question was identified in. Works with nested ifelses.
full.KR$tot_indiv_IDd <- full.KR$found_in_C+full.KR$found_in_H #Add a column specifying total number of individuals homer found the significant hit in.

###Take a look at what proportion of the significant hits were discovered in either of the species (or both of them).
sum(full.KR$disc_species=="H") #~9.5k
[1] 9526
sum(full.KR$disc_species=="C") #~12k
[1] 11916
sum(full.KR$disc_species=="B") #~7k
[1] 6736
#This is reassuring, as similar numbers of discoveries in both species suggests comparable power for detection.

###Now I look at variance in interaction frequency as a function of the number of individuals in which a pair was independently called as significant. Essentially, I'm checking here to see if there is some kind of cutoff I should make for the significant hits--i.e., if the variance looks to drop off significantly after a hit is discovered in at least 2 individuals, this suggests possibly filtering out any hit only discovered in 1 individual.
myplot <- data.frame(disc_species=full.KR$disc_species, found_in_H=full.KR$found_in_H, found_in_C=full.KR$found_in_C, tot_indiv_IDd=full.KR$tot_indiv_IDd, Hvar=full.KR$Hvar, Cvar=full.KR$Cvar, ALLvar=full.KR$ALLvar)
ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in KR Interaction Frequency") + coord_cartesian(ylim=c(0, 10))

ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + scale_y_continuous(limits=c(0, 1)) + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in KR Interaction Frequency")
Warning: Removed 4631 rows containing non-finite values (stat_boxplot).

#There appears to be no clear trend here; simply decide to require a hit is discovered in at least 2 individuals to call it as biologically significant and not just technical nonsense? Do the same process for the VC data after this as well:
filt.KR <- filter(full.KR, tot_indiv_IDd>=2) #Brings us down from ~13k hits to only 4k hits though...


###EXACT same thing for VC data:
humNAs <- rowSums(is.na(full.VC[,1:53])) #52 is when there is no human info. 13 NAs per individual.
chimpNAs <- rowSums(is.na(full.VC[,54:105])) #Same as above of course.
full.VC$found_in_H <- (4-humNAs/13) #Set a new column identifying number of humans hit was found in
full.VC$found_in_C <- (4-chimpNAs/13) #Set a new column identifying number of chimps hit was found in
full.VC$disc_species <- ifelse(full.VC$found_in_C>0&full.VC$found_in_H>0, "B", ifelse(full.VC$found_in_C==0, "H", "C")) #Set a column identifying which species (H, C, or Both) the hit in question was identified in. Works with nested ifelses.
full.VC$tot_indiv_IDd <- full.VC$found_in_C+full.VC$found_in_H #Add a column specifying total number of individuals homer found the significant hit in.

###Take a look at what proportion of the significant hits were discovered in either of the species (or both of them).
sum(full.VC$disc_species=="H") #~56k
[1] 55851
sum(full.VC$disc_species=="C") #~12k
[1] 11718
sum(full.VC$disc_species=="B") #~9k
[1] 9228
#This is the opposite of reassuring, as different numbers of discoveries in both species suggests incomparable power for detection...how is the coverage normalization making such a big difference here for humans?! I need to probably go back and look at my initial subsetting and significance calling and extraction again to make sure it's all right...

###Now I look at variance in interaction frequency as a function of the number of individuals in which a pair was independently called as significant. Essentially, I'm checking here to see if there is some kind of cutoff I should make for the significant hits--i.e., if the variance looks to drop off significantly after a hit is discovered in at least 2 individuals, this suggests possibly filtering out any hit only discovered in 1 individual.
myplot <- data.frame(disc_species=full.VC$disc_species, found_in_H=full.VC$found_in_H, found_in_C=full.VC$found_in_C, tot_indiv_IDd=full.VC$tot_indiv_IDd, Hvar=full.VC$Hvar, Cvar=full.VC$Cvar, ALLvar=full.VC$ALLvar)
ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in VC Interaction Frequency") + coord_cartesian(ylim=c(0, 10))

ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + scale_y_continuous(limits=c(0, 1)) + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in VC Interaction Frequency")
Warning: Removed 21571 rows containing non-finite values (stat_boxplot).

#There appears to be no clear trend here; simply decide to require a hit is discovered in at least 2 individuals to call it as biologically significant and not just technical nonsense?
filt.VC <- filter(full.VC, tot_indiv_IDd>=2) #Brings us down from ~13k hits to only 4k hits though...


sum(filt.VC$disc_species=="H") #~4k
[1] 3991
sum(filt.VC$disc_species=="C") #~2.5k
[1] 2448
sum(filt.VC$disc_species=="B") #~9k
[1] 9228
sum(filt.KR$disc_species=="H") #~56k
[1] 2136
sum(filt.KR$disc_species=="C") #~12k
[1] 2824
sum(filt.KR$disc_species=="B") #~9k
[1] 6736
###Take one last look at the new distributions after doing the filtering, just to see what they look like, and also take a look at the PCA and hierarchical clustering, just for reference (don't really think it matters at this point?):
VC.humans <- melt(filt.VC[,c(112:113, 116:117)])
No id variables; using all as measure variables
VC.chimps <- melt(filt.VC[,c(114:115, 118:119)])
No id variables; using all as measure variables
KR.humans <- melt(filt.KR[,c(112:113, 116:117)])
No id variables; using all as measure variables
KR.chimps <- melt(filt.KR[,c(114:115, 118:119)])
No id variables; using all as measure variables
#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 750)) #Human dist'ns

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 750)) #Chimp Dist'ns

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 500)) #Human dist'ns

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 500)) #Chimp Dist'ns

####Check on clustering and PCA now:
corheat <- cor(filt.KR[,112:119], use="complete.obs", method="pearson") #Corheat for the full data set, and heatmap. Pearson clusters poorly.
corheats <- cor(filt.KR[,112:119], use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#KR Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!
heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!
#FOR KR, after filtering, Spearman holds up perfectly, Pearson has slight issues (dendrogram isn't right although larger groups clustered properly)

##Repeat for VC before looking at PCA on both:
corheat <- cor(filt.VC[,112:119], use="complete.obs", method="pearson") #Corheat for the full data set, and heatmap. Pearson clusters poorly.
corheats <- cor(filt.VC[,112:119], use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#KR Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!
heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!
#FOR VC as well, after filtering, Spearman holds up perfectly, Pearson has slight issues (dendrogram isn't right although larger groups clustered properly)


###PCA:
pca <- prcomp(t(filt.KR[,112:119]), scale=TRUE, center=TRUE)
pca2 <- prcomp(t(filt.VC[,112:119]), scale=TRUE, center=TRUE)
ggplot(data=as.data.frame(pca$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on KR normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

barplot(summary(pca)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on KR normalized Hi-C contact frequency") #Scree plot showing all the PCs and the proportion of the variance they explain.

ggplot(data=as.data.frame(pca2$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on VC normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", signif(100*summary(pca)$importance[2,1],2), "% of variance)")) + ylab((paste("PC2 (", signif(100*summary(pca)$importance[2,2],2), "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17)) #FIGS1A

barplot(summary(pca2)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on VC normalized Hi-C contact frequency")

#Scree plot

#Ok, well nice to know. Now to write out files.

fwrite(filt.KR, "output/juicer.filt.KR", quote = TRUE, sep = "\t", row.names = FALSE, col.names = TRUE, na="NA", showProgress = FALSE)
fwrite(filt.VC, "output/juicer.filt.VC", quote=TRUE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress=FALSE)
###No longer used, just notes about the normalization scheme.
#full.contacts2 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.25, iterations=3, method="pairs")) #Good Spearman, poor Pearson (not even proper clustering)
#full.contacts3 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.29, iterations=3, method="pairs")) #Both terrible!
#full.contacts4 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.39, iterations=3, method="pairs")) #Spearman clusters properly but correlation values not great in terms of separation; Pearson just terrible and doesn't cluster right.
#full.contacts5 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.35, iterations=3, method="pairs")) #Both TERRIBUL
#full.contacts6 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.2, iterations=2, method="pairs")) #Gives the right clustering but not great correlations on Pearson. Same but a little better with Spearman.
#full.contacts7 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.3, iterations=1, method="pairs")) #Spearman decent but Pearson terrible.
#full.contacts8 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.333, iterations=5, method="pairs")) #Terrible!


#Normalization schemes
#KR.1 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.17, iterations=3, method="pairs"))
#KR.2 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.31, iterations=2, method="pairs"))
#VC.contacts <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.31, iterations=3, method="pairs"))

# Kfull.contacts2 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=1/4, iterations=3, method="pairs")) #Awful Pearson, slightly worse than above Spearman.
# Kfull.contacts3 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.29, iterations=3, method="pairs")) #Terrible Pearson, middling Spearman.
# Kfull.contacts6 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.3, iterations=1, method="pairs")) #Terrible Pearson, solid Spearman.
# Kfull.contacts7 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.5, iterations=2, method="pairs"))
# 
# for(myspan in seq(0.2, 0.25, 0.01)){
#   iteration <- 2
#     Kcontacts <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=myspan, iterations=iteration, method="pairs"))
#     corheat <- cor(Kcontacts, use="complete.obs", method="pearson")
#     corheats <- cor(Kcontacts, use="complete.obs", method="spearman")
#     colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
#     rownames(corheat) <- colnames(corheat)
#     colnames(corheats) <- colnames(corheat)
#     rownames(corheats) <- colnames(corheat)
#     print(heatmaply(corheat, main=paste("Pearson, 10kb, ", myspan, " span, ", iteration, " iterations.", sep=""), k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)))
#     print(heatmaply(corheats, main=paste("Spearman, 10kb, ", myspan, " span, ", iteration, " iterations.", sep=""), k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)))
# }


sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bedr_1.0.6         forcats_0.4.0      purrr_0.3.2       
 [4] readr_1.3.1        tibble_2.1.1       tidyverse_1.2.1   
 [7] edgeR_3.20.9       RColorBrewer_1.1-2 heatmaply_0.15.2  
[10] viridis_0.5.1      viridisLite_0.3.0  stringr_1.4.0     
[13] gplots_3.0.1.1     Hmisc_4.2-0        Formula_1.2-3     
[16] survival_2.44-1    lattice_0.20-38    dplyr_0.8.0.1     
[19] plotly_4.8.0       cowplot_0.9.4      ggplot2_3.1.0     
[22] reshape2_1.4.3     data.table_1.12.0  tidyr_0.8.3       
[25] plyr_1.8.4         limma_3.34.9      

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1     class_7.3-15         modeltools_0.2-22   
  [4] mclust_5.4.3         rprojroot_1.3-2      htmlTable_1.13.1    
  [7] futile.logger_1.4.3  base64enc_0.1-3      fs_1.2.7            
 [10] rstudioapi_0.10      flexmix_2.3-15       mvtnorm_1.0-8       
 [13] lubridate_1.7.4      xml2_1.2.0           R.methodsS3_1.7.1   
 [16] codetools_0.2-16     splines_3.4.0        robustbase_0.92-8   
 [19] knitr_1.22           jsonlite_1.6         workflowr_1.2.0     
 [22] broom_0.5.1          cluster_2.0.7-1      kernlab_0.9-27      
 [25] R.oo_1.22.0          shiny_1.2.0          httr_1.4.0          
 [28] backports_1.1.3      assertthat_0.2.1     Matrix_1.2-15       
 [31] lazyeval_0.2.2       cli_1.1.0            later_0.8.0         
 [34] formatR_1.6          acepack_1.4.1        htmltools_0.3.6     
 [37] tools_3.4.0          gtable_0.3.0         glue_1.3.1          
 [40] Rcpp_1.0.1           cellranger_1.1.0     trimcluster_0.1-2.1 
 [43] gdata_2.18.0         nlme_3.1-137         crosstalk_1.0.0     
 [46] iterators_1.0.10     fpc_2.1-11.1         xfun_0.5            
 [49] testthat_2.0.1       rvest_0.3.2          mime_0.6            
 [52] gtools_3.8.1         dendextend_1.10.0    DEoptimR_1.0-8      
 [55] MASS_7.3-51.1        scales_1.0.0         TSP_1.1-6           
 [58] promises_1.0.1       hms_0.4.2            parallel_3.4.0      
 [61] lambda.r_1.2.3       yaml_2.2.0           gridExtra_2.3       
 [64] rpart_4.1-13         latticeExtra_0.6-28  stringi_1.4.3       
 [67] gclus_1.3.2          foreach_1.4.4        checkmate_1.9.1     
 [70] seriation_1.2-3      caTools_1.17.1.2     rlang_0.3.3         
 [73] pkgconfig_2.0.2      prabclus_2.2-7       bitops_1.0-6        
 [76] evaluate_0.13        labeling_0.3         htmlwidgets_1.3     
 [79] tidyselect_0.2.5     magrittr_1.5         R6_2.4.0            
 [82] generics_0.0.2       pillar_1.3.1         haven_2.1.0         
 [85] whisker_0.3-2        foreign_0.8-71       withr_2.1.2         
 [88] nnet_7.3-12          modelr_0.1.4         crayon_1.3.4        
 [91] futile.options_1.0.1 KernSmooth_2.23-15   rmarkdown_1.12      
 [94] locfit_1.5-9.1       grid_3.4.0           readxl_1.3.1        
 [97] git2r_0.25.2         digest_0.6.18        diptest_0.75-7      
[100] webshot_0.5.1        xtable_1.8-3         VennDiagram_1.6.20  
[103] httpuv_1.5.0         R.utils_2.8.0        stats4_3.4.0        
[106] munsell_0.5.0        registry_0.5-1