Last updated: 2025-04-06
Checks: 6 1
Knit directory: CX5461_Project/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20250129)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 0e53214. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: 0.1 box.svg
Ignored: Rplot04.svg
Untracked files:
Untracked: 0.1 density.svg
Untracked: 0.1.emf
Untracked: 0.1.svg
Untracked: 0.5 box.svg
Untracked: 0.5 density.svg
Untracked: 0.5.svg
Untracked: Figure 1.jpeg
Untracked: Figure 1.pdf
Untracked: Rplot.svg
Untracked: Rplot01.svg
Untracked: Rplot02.svg
Untracked: Rplot03.svg
Untracked: Rplot05.svg
Untracked: Rplot06.svg
Untracked: Rplot07.svg
Untracked: Rplot08.jpeg
Untracked: Rplot08.svg
Untracked: data/CX-5461_alt.csv
Unstaged changes:
Modified: analysis/CX_DOX.Rmd
Modified: analysis/Colorectal.Rmd
Modified: analysis/GO_KEGG_DEG.Rmd
Modified: analysis/Tissue.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/Tissue.Rmd
) and HTML
(docs/Tissue.html
) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote
), click on the hyperlinks
in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 1c1e1e4 | sayanpaul01 | 2025-02-28 | Commit |
Rmd | cb00774 | sayanpaul01 | 2025-02-28 | Commit |
html | cb00774 | sayanpaul01 | 2025-02-28 | Commit |
Rmd | fa75295 | sayanpaul01 | 2025-02-28 | Commit |
html | fa75295 | sayanpaul01 | 2025-02-28 | Commit |
Rmd | cf29408 | sayanpaul01 | 2025-02-28 | Commit |
html | cf29408 | sayanpaul01 | 2025-02-28 | Commit |
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
Warning: package 'dplyr' was built under R version 4.3.2
library(tidyr)
Warning: package 'tidyr' was built under R version 4.3.3
library(org.Hs.eg.db)
Warning: package 'AnnotationDbi' was built under R version 4.3.2
Warning: package 'BiocGenerics' was built under R version 4.3.1
Warning: package 'Biobase' was built under R version 4.3.1
Warning: package 'IRanges' was built under R version 4.3.1
Warning: package 'S4Vectors' was built under R version 4.3.1
library(clusterProfiler)
Warning: package 'clusterProfiler' was built under R version 4.3.3
library(biomaRt)
Warning: package 'biomaRt' was built under R version 4.3.2
# Read the CSV file into R
file_path <- "data/count.csv"
df <- read.csv(file_path, check.names = FALSE)
# Remove 'x' from column headers
colnames(df) <- gsub("^x", "", colnames(df))
# View the updated dataframe
head(df)
# View updated column names
colnames(df)
# Step 1: Calculate IPSC_CM
# Select columns with 'VEH' in their names
veh_columns <- grep("VEH", colnames(df), value = TRUE)
# Calculate the average logCPM across VEH samples
df$IPSC_CM <- rowMeans(df[, veh_columns], na.rm = TRUE)
# Create a new dataframe with ENTREZID, SYMBOL, and IPSC_CM
veh_avg_df <- df[, c("Entrez_ID","IPSC_CM")]
# Step 2: Read the Tissue_Gtex dataset
Tissue_Gtex <- read.csv("data/Tissue_Gtex.csv")
# Step 3: Convert Ensembl IDs to Entrez IDs using biomaRt
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Extract Entrez IDs for the Ensembl IDs in the gene.id column
gene_ids <- Tissue_Gtex$gene.id
conversion <- getBM(
attributes = c("ensembl_gene_id", "entrezgene_id"),
filters = "ensembl_gene_id",
values = gene_ids,
mart = mart
)
# Merge Entrez IDs back into Tissue_Gtex
Tissue_Gtex <- merge(Tissue_Gtex, conversion, by.x = "gene.id", by.y = "ensembl_gene_id", all.x = TRUE)
# Rename the column for consistency
colnames(Tissue_Gtex)[colnames(Tissue_Gtex) == "entrezgene_id"] <- "Entrez_ID"
# Step 4: Read the veh_avg_df dataframe
veh_avg_df <- df[, c("Entrez_ID", "IPSC_CM")]
# Step 5: Merge veh_avg_df with Tissue_Gtex by Entrez_ID
# Ensure column types match
veh_avg_df$Entrez_ID <- as.character(veh_avg_df$Entrez_ID)
Tissue_Gtex$Entrez_ID <- as.character(Tissue_Gtex$Entrez_ID)
# Perform the merge
merged_df <- merge(veh_avg_df, Tissue_Gtex, by = "Entrez_ID", all.x = TRUE)
# Step 6: Remove rows with NA values
cleaned_df <- na.omit(merged_df)
# Filter relevant tissue columns
tissue_cols <- colnames(cleaned_df)[which(colnames(cleaned_df) %in% c(
"IPSC_CM", "Adrenal.Gland", "Spleen", "Heart...Atrial", "Pancreas","Artery","Breast",
"small.Intestine","Colon","Nerve...Tibial","Esophagus","Muscle...Skeletal",
"Thyroid","Heart..Ventricle", "Stomach", "Uterus","Vagina", "Skin", "Ovary", "Liver", "Lung", "Brain", "Pituitary", "Testis",
"Prostate", "Salivary.Gland"))]
data_subset <- cleaned_df[, tissue_cols]
# Compute Pearson and Spearman correlations
pearson_corr <- cor(data_subset, method = "pearson", use = "complete.obs")
spearman_corr <- cor(data_subset, method = "spearman", use = "complete.obs")
# Reorder tissues by highest correlation with IPSC_CM
order_pearson <- order(pearson_corr["IPSC_CM", ], decreasing = TRUE)
order_spearman <- order(spearman_corr["IPSC_CM", ], decreasing = TRUE)
pearson_corr <- pearson_corr[order_pearson, order_pearson]
spearman_corr <- spearman_corr[order_spearman, order_spearman]
# Plot heatmaps
library(pheatmap)
Warning: package 'pheatmap' was built under R version 4.3.1
# Pearson correlation heatmap
pheatmap(pearson_corr,
cluster_rows = TRUE, # Enable dendrogram for rows
cluster_cols = TRUE, # Enable dendrogram for columns
main = "Pearson Correlation Heatmap",
color = colorRampPalette(c("blue", "white", "red"))(100),
display_numbers = TRUE,
number_format = "%.2f",
fontsize_number = 8)
Version | Author | Date |
---|---|---|
cf29408 | sayanpaul01 | 2025-02-28 |
# Spearman correlation heatmap
pheatmap(spearman_corr,
cluster_rows = TRUE,
cluster_cols = TRUE,
main = "Spearman Correlation Heatmap",
color = colorRampPalette(c("blue", "white", "red"))(100),
display_numbers = TRUE,
number_format = "%.2f",
fontsize_number = 8)
Version | Author | Date |
---|---|---|
cf29408 | sayanpaul01 | 2025-02-28 |
# Load your Heart_genes dataset
heart_genes <- read.csv("data/Human_Heart_Genes.csv", stringsAsFactors = FALSE)
# Convert Gene names to Entrez IDs in heart_genes
heart_genes$Entrez_ID <- mapIds(
org.Hs.eg.db,
keys = heart_genes$Gene,
column = "ENTREZID",
keytype = "SYMBOL",
multiVals = "first"
)
# Create a vector of Entrez IDs specific to the heart
heart_entrez_ids <- na.omit(heart_genes$Entrez_ID)
# Load the saved datasets
prob_all_1 <- read.csv("data/prob_all_1.csv")$Entrez_ID
prob_all_2 <- read.csv("data/prob_all_2.csv")$Entrez_ID
prob_all_3 <- read.csv("data/prob_all_3.csv")$Entrez_ID
prob_all_4 <- read.csv("data/prob_all_4.csv")$Entrez_ID
# Example Response Groups Data (Replace with actual data)
response_groups <- list(
"Non Response" = prob_all_1, # Replace 'prob_all_1', 'prob_all_2', etc. with your actual response group dataframes
"CX_DOX Shared Late Response" = prob_all_2,
"DOX-Specific Response" = prob_all_3,
"Late High Dose DOX-Specific Response" = prob_all_4
)
# Combine all response groups into a single dataframe
response_groups_df <- bind_rows(
lapply(response_groups, function(ids) data.frame(Entrez_ID = ids)),
.id = "Set"
)
# Categorize genes into Heart-specific and Non-Heart-specific
response_groups_df <- response_groups_df %>%
mutate(
Category = ifelse(Entrez_ID %in% heart_entrez_ids, "Heart-specific Genes", "Non-Heart-specific Genes")
)
# Calculate counts for Heart-specific and Non-Heart-specific genes in each response group
proportion_data <- response_groups_df %>%
group_by(Set, Category) %>%
summarize(Count = n(), .groups = "drop")
# Reference counts for "Non Response"
non_response_counts <- proportion_data %>%
filter(Set == "Non Response") %>%
dplyr::select(Category, Count) %>%
{setNames(.$Count, .$Category)} # Create named vector for "Non Response" counts
# Perform Chi-square test for selected response groups compared to "Non Response"
chi_results <- proportion_data %>%
filter(Set != "Non Response") %>% # Exclude "Non Response"
group_by(Set) %>%
summarize(
p_value = {
# Extract counts for the current response group
group_counts <- Count[Category %in% c("Heart-specific Genes", "Non-Heart-specific Genes")]
# Ensure there are no missing categories, fill with 0 if missing
if (length(group_counts) < 2) group_counts <- c(group_counts, 0)
# Create contingency table
contingency_table <- matrix(c(
group_counts[1], group_counts[2],
non_response_counts["Heart-specific Genes"], non_response_counts["Non-Heart-specific Genes"]
), nrow = 2)
# Print the contingency table for debugging
print(paste("Set:", unique(Set)))
print("Contingency Table:")
print(contingency_table)
# Perform chi-square test
if (all(contingency_table >= 0 & is.finite(contingency_table))) {
chisq.test(contingency_table)$p.value
} else {
NA
}
},
.groups = "drop"
) %>%
mutate(Significance = ifelse(!is.na(p_value) & p_value < 0.05, "*", ""))
[1] "Set: CX_DOX Shared Late Response"
[1] "Contingency Table:"
[,1] [,2]
[1,] 2 123
[2,] 412 6908
[1] "Set: DOX-Specific Response"
[1] "Contingency Table:"
[,1] [,2]
[1,] 62 123
[2,] 1469 6908
[1] "Set: Late High Dose DOX-Specific Response"
[1] "Contingency Table:"
[,1] [,2]
[1,] 159 123
[2,] 4715 6908
# Merge chi-square results back into the proportion data
proportion_data <- proportion_data %>%
left_join(chi_results %>% dplyr::select(Set, p_value, Significance), by = "Set")
# Calculate proportions for plotting
proportion_data <- proportion_data %>%
group_by(Set) %>%
mutate(Percentage = (Count / sum(Count)) * 100)
# Define the correct order for response groups
response_order <- c(
"Non Response",
"CX_DOX Shared Late Response",
"DOX-Specific Response",
"Late High Dose DOX-Specific Response"
)
proportion_data$Set <- factor(proportion_data$Set, levels = response_order)
# Create the proportion plot with significance stars
ggplot(proportion_data, aes(x = Set, y = Percentage, fill = Category)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(
data = proportion_data %>% distinct(Set, Significance), # Add stars for significant groups
aes(x = Set, y = 105, label = Significance), # Position stars above the bars
inherit.aes = FALSE,
size = 6,
color = "black",
hjust = 0.5
) +
scale_fill_manual(values = c(
"Heart-specific Genes" = "#4daf4a", # Green
"Non-Heart-specific Genes" = "#377eb8" # Blue
)) +
labs(
title = "Proportion of Heart and Non-Heart-Specific Genes Across Response Sets",
x = "Response Sets",
y = "Percentage of Genes",
fill = "Gene Category"
) +
theme_minimal() +
theme(
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = 15, color = "black"),
axis.ticks = element_line(linewidth = 1.5),
axis.line = element_line(linewidth = 1.5),
axis.text.x = element_text(size = 10, color = "black", angle = 45, hjust = 1),
strip.text = element_text(size = 12, face = "bold"),
panel.border = element_rect(color = "black", fill = NA, linewidth = 1.5) # Add border
)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_text()`).
sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 26100)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] pheatmap_1.0.12 biomaRt_2.58.2 clusterProfiler_4.10.1
[4] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
[7] S4Vectors_0.40.1 Biobase_2.62.0 BiocGenerics_0.48.1
[10] tidyr_1.3.1 dplyr_1.1.4 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_1.8.9
[4] magrittr_2.0.3 farver_2.1.2 rmarkdown_2.29
[7] fs_1.6.3 zlibbioc_1.48.0 vctrs_0.6.5
[10] memoise_2.0.1 RCurl_1.98-1.13 ggtree_3.10.1
[13] htmltools_0.5.8.1 progress_1.2.3 curl_6.0.1
[16] gridGraphics_0.5-1 sass_0.4.9 bslib_0.8.0
[19] plyr_1.8.9 cachem_1.0.8 whisker_0.4.1
[22] igraph_2.1.1 lifecycle_1.0.4 pkgconfig_2.0.3
[25] Matrix_1.6-1.1 R6_2.5.1 fastmap_1.1.1
[28] gson_0.1.0 GenomeInfoDbData_1.2.11 digest_0.6.34
[31] aplot_0.2.3 enrichplot_1.22.0 colorspace_2.1-0
[34] patchwork_1.3.0 rprojroot_2.0.4 RSQLite_2.3.3
[37] labeling_0.4.3 filelock_1.0.3 httr_1.4.7
[40] polyclip_1.10-7 compiler_4.3.0 bit64_4.0.5
[43] withr_3.0.2 BiocParallel_1.36.0 viridis_0.6.5
[46] DBI_1.2.3 ggforce_0.4.2 MASS_7.3-60
[49] rappdirs_0.3.3 HDO.db_0.99.1 tools_4.3.0
[52] ape_5.8 scatterpie_0.2.4 httpuv_1.6.15
[55] glue_1.7.0 nlme_3.1-166 GOSemSim_2.28.1
[58] promises_1.3.0 grid_4.3.0 shadowtext_0.1.4
[61] reshape2_1.4.4 fgsea_1.28.0 generics_0.1.3
[64] gtable_0.3.6 data.table_1.14.10 hms_1.1.3
[67] xml2_1.3.6 tidygraph_1.3.1 XVector_0.42.0
[70] ggrepel_0.9.6 pillar_1.10.1 stringr_1.5.1
[73] yulab.utils_0.1.8 later_1.3.2 splines_4.3.0
[76] tweenr_2.0.3 BiocFileCache_2.10.2 treeio_1.26.0
[79] lattice_0.22-5 bit_4.0.5 tidyselect_1.2.1
[82] GO.db_3.18.0 Biostrings_2.70.1 knitr_1.49
[85] git2r_0.35.0 gridExtra_2.3 xfun_0.50
[88] graphlayouts_1.2.0 stringi_1.8.3 workflowr_1.7.1
[91] lazyeval_0.2.2 ggfun_0.1.8 yaml_2.3.10
[94] evaluate_1.0.3 codetools_0.2-20 ggraph_2.2.1
[97] tibble_3.2.1 qvalue_2.34.0 ggplotify_0.1.2
[100] cli_3.6.1 munsell_0.5.1 jquerylib_0.1.4
[103] Rcpp_1.0.12 GenomeInfoDb_1.38.8 dbplyr_2.5.0
[106] png_0.1-8 XML_3.99-0.17 parallel_4.3.0
[109] blob_1.2.4 prettyunits_1.2.0 DOSE_3.28.2
[112] bitops_1.0-7 viridisLite_0.4.2 tidytree_0.4.6
[115] scales_1.3.0 purrr_1.0.2 crayon_1.5.3
[118] rlang_1.1.3 cowplot_1.1.3 fastmatch_1.1-4
[121] KEGGREST_1.42.0