Colorectal cancer dataset comparison

Last updated: 2025-02-28

Checks: 6 1

Knit directory: CX5461_Project/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20250129)

The command set.seed(20250129) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: dd3f95a

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version dd3f95a. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/Colorectal.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Colorectal.Rmd) and HTML (docs/Colorectal.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	1c1e1e4	sayanpaul01	2025-02-28	Commit
html	1c1e1e4	sayanpaul01	2025-02-28	Commit

📌 Proportion of Colorectal cancer DE genes in Corrmotif clusters

📌 Load Required Libraries

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.3.3

library(dplyr)

Warning: package 'dplyr' was built under R version 4.3.2

library(tidyr)

Warning: package 'tidyr' was built under R version 4.3.3

library(org.Hs.eg.db)

Warning: package 'AnnotationDbi' was built under R version 4.3.2

Warning: package 'BiocGenerics' was built under R version 4.3.1

Warning: package 'Biobase' was built under R version 4.3.1

Warning: package 'IRanges' was built under R version 4.3.1

Warning: package 'S4Vectors' was built under R version 4.3.1

library(clusterProfiler)

Warning: package 'clusterProfiler' was built under R version 4.3.3

library(biomaRt)

Warning: package 'biomaRt' was built under R version 4.3.2

📌 Load Data

# Define correct timepoints
timepoints <- c("6hr", "24hr", "72hr")

# Initialize an empty dataframe to store chi-square results
all_chi_results <- data.frame()

# Loop through each timepoint and perform Chi-square test
for (time in timepoints) {
  
  # Load proportion data for the given timepoint
  file_path <- paste0("data/Colorectal/Proportion_data_", time, ".csv")
  
  # Check if the file exists before reading
  if (!file.exists(file_path)) {
    cat("\n🚨 Warning: File does not exist:", file_path, "\nSkipping this timepoint...\n")
    next  # Skip this iteration if the file is missing
  }
  
  proportion_data <- read.csv(file_path)
  
  # Extract counts for Non-Response group
  non_response_counts <- proportion_data %>%
    filter(Set == "Non Response") %>%
    dplyr::select(DEG, `Non.DEG`) %>%
    unlist(use.names = FALSE)  # Convert to numeric vector
  
  # Debugging: Print Non-Response counts
  cat("\nNon-Response Counts for", time, ":\n")
  print(non_response_counts)
  
  # Initialize a list to store chi-square test results for this timepoint
  chi_results <- list()
  
  # Perform chi-square test for each response group
  for (group in unique(proportion_data$Set)) {
    if (group == "Non Response") next  # Skip Non Response group
    
    # Extract counts for the current response group
    group_counts <- proportion_data %>%
      filter(Set == group) %>%
      dplyr::select(DEG, `Non.DEG`) %>%
      unlist(use.names = FALSE)  # Convert to numeric vector
    
    # Ensure valid counts for chi-square test
    if (length(group_counts) < 2) group_counts <- c(group_counts, 0)
    if (length(non_response_counts) < 2) non_response_counts <- c(non_response_counts, 0)
    
    # Create contingency table
    contingency_table <- matrix(c(
      group_counts[1], group_counts[2],  # Current response group counts
      non_response_counts[1], non_response_counts[2]  # Non-Response counts
    ), nrow = 2, byrow = TRUE)
    
    # Debugging: Print contingency table
    cat("\nProcessing Group:", group, "at", time, "\n")
    cat("Contingency Table:\n")
    print(contingency_table)
    
    # Perform chi-square test
    test_result <- chisq.test(contingency_table)
    p_value <- test_result$p.value
    significance <- ifelse(p_value < 0.05, "*", "")  # Mark * for p < 0.05
    
    # Store results
    chi_results[[group]] <- data.frame(
      Set = group,
      Timepoint = time,
      Chi2 = test_result$statistic,
      p_value = p_value,
      Significance = significance
    )
  }
  
  # Combine results for this timepoint into a single dataframe
  chi_results <- do.call(rbind, chi_results)
  
  # Append to the overall results dataframe
  all_chi_results <- rbind(all_chi_results, chi_results)
}

# Save final chi-square results
write.csv(all_chi_results, "data/Colorectal//Chi_Square_Results_All.csv", row.names = FALSE)

📌 Proportion of DE genes across response groups

# Load the saved datasets
prob_all_1 <- read.csv("data/prob_all_1.csv")$Entrez_ID
prob_all_2 <- read.csv("data/prob_all_2.csv")$Entrez_ID
prob_all_3 <- read.csv("data/prob_all_3.csv")$Entrez_ID
prob_all_4 <- read.csv("data/prob_all_4.csv")$Entrez_ID

# Example Response Groups Data (Replace with actual data)
response_groups <- list(
  "Non Response" = prob_all_1, # Replace 'prob_all_1', 'prob_all_2', etc. with your actual response group dataframes
  "CX_DOX Shared Late Response" = prob_all_2,
  "DOX-Specific Response" = prob_all_3,
  "Late High Dose DOX-Specific Response" = prob_all_4
)

# Load the proportion data again for visualization
proportion_data <-read.csv("data/Colorectal/Proportion_data.csv")
# Merge chi-square results into proportion data for plotting
proportion_data <- proportion_data %>%
  left_join(all_chi_results %>% dplyr::select(Set, Timepoint, Significance), by = c("Set", "Timepoint"))

# Convert to factors for ordered display
proportion_data$Set <- factor(proportion_data$Set, levels = c(
  "Non Response",
  "CX_DOX Shared Late Response",
  "DOX-Specific Response",
  "Late High Dose DOX-Specific Response"
))
proportion_data$Timepoint <- factor(proportion_data$Timepoint, levels = timepoints)

# Plot proportions with significance stars
ggplot(proportion_data, aes(x = Set, y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", position = "stack") +
  facet_wrap(~Timepoint, scales = "fixed") +
  scale_fill_manual(values = c("DEG" = "#e41a1c", "Non-DEG" = "#377eb8")) +
  geom_text(
    data = proportion_data %>% filter(Significance == "*") %>% distinct(Set, Timepoint, Significance),
    aes(x = Set, y = 105, label = Significance),
    inherit.aes = FALSE,
    size = 6,
    color = "black",
    hjust = 0.5
  ) +
  labs(
    title = "Proportion of Colorectal cancer DEGs and Non-DEGs\nAcross Response Groups (6hr, 24hr, 72hr)",
    x = "Response Groups",
    y = "Percentage",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = rel(1.5), hjust = 0.5),
    axis.title = element_text(size = 15, color = "black"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    legend.title = element_blank(),
    panel.border = element_rect(color = "black", fill = NA, size = 1.2),
    strip.background = element_blank(),
    strip.text = element_text(size = 12, face = "bold")
  )

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

Version	Author	Date
1c1e1e4	sayanpaul01	2025-02-28

📌 Proportion of colorectal cancer genes in CX and DOX DEGs

📌 Read and Process Data

# Load DEGs Data
CX_0.1_3 <- read.csv("data/DEGs/Toptable_CX_0.1_3.csv")
CX_0.1_24 <- read.csv("data/DEGs/Toptable_CX_0.1_24.csv")
CX_0.1_48 <- read.csv("data/DEGs/Toptable_CX_0.1_48.csv")
CX_0.5_3 <- read.csv("data/DEGs/Toptable_CX_0.5_3.csv")
CX_0.5_24 <- read.csv("data/DEGs/Toptable_CX_0.5_24.csv")
CX_0.5_48 <- read.csv("data/DEGs/Toptable_CX_0.5_48.csv")

DOX_0.1_3 <- read.csv("data/DEGs/Toptable_DOX_0.1_3.csv")
DOX_0.1_24 <- read.csv("data/DEGs/Toptable_DOX_0.1_24.csv")
DOX_0.1_48 <- read.csv("data/DEGs/Toptable_DOX_0.1_48.csv")
DOX_0.5_3 <- read.csv("data/DEGs/Toptable_DOX_0.5_3.csv")
DOX_0.5_24 <- read.csv("data/DEGs/Toptable_DOX_0.5_24.csv")
DOX_0.5_48 <- read.csv("data/DEGs/Toptable_DOX_0.5_48.csv")

# Extract Significant DEGs
DEG1 <- as.character(CX_0.1_3$Entrez_ID[CX_0.1_3$adj.P.Val < 0.05])
DEG2 <- as.character(CX_0.1_24$Entrez_ID[CX_0.1_24$adj.P.Val < 0.05])
DEG3 <- as.character(CX_0.1_48$Entrez_ID[CX_0.1_48$adj.P.Val < 0.05])
DEG4 <- as.character(CX_0.5_3$Entrez_ID[CX_0.5_3$adj.P.Val < 0.05])
DEG5 <- as.character(CX_0.5_24$Entrez_ID[CX_0.5_24$adj.P.Val < 0.05])
DEG6 <- as.character(CX_0.5_48$Entrez_ID[CX_0.5_48$adj.P.Val < 0.05])
DEG7 <- as.character(DOX_0.1_3$Entrez_ID[DOX_0.1_3$adj.P.Val < 0.05])
DEG8 <- as.character(DOX_0.1_24$Entrez_ID[DOX_0.1_24$adj.P.Val < 0.05])
DEG9 <- as.character(DOX_0.1_48$Entrez_ID[DOX_0.1_48$adj.P.Val < 0.05])
DEG10 <- as.character(DOX_0.5_3$Entrez_ID[DOX_0.5_3$adj.P.Val < 0.05])
DEG11 <- as.character(DOX_0.5_24$Entrez_ID[DOX_0.5_24$adj.P.Val < 0.05])
DEG12 <- as.character(DOX_0.5_48$Entrez_ID[DOX_0.5_48$adj.P.Val < 0.05])

📌 Proportion of colorectal cancer genes in CX and DOX DEGs datasets

data_6hr <- read.csv("data/Colorectal/6hr.csv", stringsAsFactors = FALSE)
data_24hr <- read.csv("data/Colorectal/24hr.csv", stringsAsFactors = FALSE)
data_72hr <- read.csv("data/Colorectal/72hr.csv", stringsAsFactors = FALSE)

# Map gene symbols to Entrez IDs using org.Hs.eg.db
data_6hr <- data_6hr %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

data_24hr <- data_24hr %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

data_72hr <- data_72hr %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

# Define CX-5461 DEG lists
CX_DEGs <- list(
  "CX_0.1_3" = DEG1, "CX_0.1_24" = DEG2, "CX_0.1_48" = DEG3,
  "CX_0.5_3" = DEG4, "CX_0.5_24" = DEG5, "CX_0.5_48" = DEG6
)

# Define DOX DEG lists
DOX_DEGs <- list(
  "DOX_0.1_3" = DEG7, "DOX_0.1_24" = DEG8, "DOX_0.1_48" = DEG9,
  "DOX_0.5_3" = DEG10, "DOX_0.5_24" = DEG11, "DOX_0.5_48" = DEG12
)

# Extract Entrez_IDs from cancer gene datasets for 6hr, 24hr, and 72hr
lit_genes <- list(
  "6hr" = na.omit(data_6hr$Entrez_ID),
  "24hr" = na.omit(data_24hr$Entrez_ID),
  "72hr" = na.omit(data_72hr$Entrez_ID)
)

# Combine CX-5461 DEGs into a dataframe with a "Drug" column
CX_DEGs_df <- bind_rows(
  lapply(CX_DEGs, function(ids) data.frame(Entrez_ID = ids, Drug = "CX-5461")),
  .id = "Sample"
)

# Combine DOX DEGs into a dataframe with a "Drug" column
DOX_DEGs_df <- bind_rows(
  lapply(DOX_DEGs, function(ids) data.frame(Entrez_ID = ids, Drug = "DOX")),
  .id = "Sample"
)

# Merge CX-5461 and DOX datasets
DEGs_df <- bind_rows(CX_DEGs_df, DOX_DEGs_df)

# Match Entrez_IDs with DEGs from 6hr, 24hr, and 72hr datasets
DEGs_df <- DEGs_df %>%
  mutate(
    Category_6hr = ifelse(Entrez_ID %in% lit_genes[["6hr"]], "Yes", "No"),
    Category_24hr = ifelse(Entrez_ID %in% lit_genes[["24hr"]], "Yes", "No"),
    Category_72hr = ifelse(Entrez_ID %in% lit_genes[["72hr"]], "Yes", "No")
  )

# Convert to long format for ggplot
proportion_data <- DEGs_df %>%
  pivot_longer(cols = starts_with("Category"), names_to = "Timepoint", values_to = "Category") %>%
  group_by(Sample, Drug, Timepoint, Category) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Sample, Drug, Timepoint) %>%
  mutate(Percentage = (Count / sum(Count)) * 100)

# **Fix: Normalize Percentages to Sum Exactly 100%**
proportion_data <- proportion_data %>%
  group_by(Sample, Timepoint) %>%
  mutate(Percentage = round(Percentage, 2)) %>%  # Round each percentage
  mutate(Adjustment = 100 - sum(Percentage, na.rm = TRUE)) %>%  # Compute rounding difference
  mutate(Percentage = ifelse(Category == "No", Percentage + Adjustment, Percentage)) %>%  # Apply adjustment to "No"
  mutate(Percentage = ifelse(Percentage < 0, 0, Percentage)) %>%  # Prevent negatives
  mutate(Percentage = ifelse(Percentage > 100, 100, Percentage)) %>%  # Prevent values > 100
  ungroup() %>%  # Remove grouping to avoid select() errors
  replace_na(list(Percentage = 0))  # Ensure no NA values

# **Ensure "Yes" is at the Bottom and "No" is at the Top**
proportion_data$Category <- factor(proportion_data$Category, levels = c("Yes", "No"))

# Define correct factor orders for samples and timepoints
sample_order <- c(
  "CX_0.1_3", "CX_0.1_24", "CX_0.1_48", "CX_0.5_3", "CX_0.5_24", "CX_0.5_48",
  "DOX_0.1_3", "DOX_0.1_24", "DOX_0.1_48", "DOX_0.5_3", "DOX_0.5_24", "DOX_0.5_48"
)
proportion_data$Sample <- factor(proportion_data$Sample, levels = sample_order)

proportion_data$Timepoint <- factor(
  proportion_data$Timepoint,
  levels = c("Category_6hr", "Category_24hr", "Category_72hr"),
  labels = c("6hr", "24hr", "72hr")
)

# **Generate Combined Proportion Plot for CX-5461 and DOX**
ggplot(proportion_data, aes(x = Sample, y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", position = "stack") +  # Stacked bars
  facet_wrap(~Timepoint, scales = "fixed") +  # Facet by Timepoint (6hr, 24hr, 72hr)
  scale_y_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +  # Y-axis as percentage
  scale_fill_manual(values = c("Yes" = "#e41a1c", "No" = "#377eb8")) +  # Yes (Red) at Bottom, No (Blue) on Top
  labs(
    title = "Proportion of Human Colorectal Cancer Genes in CX-5461 and DOX DEGs",
    x = "Samples (CX-5461 and DOX)",
    y = "Percentage",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = rel(1.5), hjust = 0.5),
    axis.title = element_text(size = 15, color = "black"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    legend.title = element_blank(),
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1.2),  # Updated for ggplot2 3.4.0+
    strip.background = element_blank(),
    strip.text = element_text(size = 12, face = "bold")
  )

📌 Correlation of colorectal cancer genes with CX and DOX expressed genes

📌 6hr

# Import the CSV file
data_6hr_logFC <- read.csv("data/Colorectal/6hrlogFC.csv", stringsAsFactors = FALSE)

# Map gene symbols to Entrez IDs using org.Hs.eg.db
data_6hr_logFC <- data_6hr_logFC %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

# Merge datasets on Entrez_ID for CX and DOX at both concentrations (3 hours)
merged_CX_0.1 <- merge(data_6hr_logFC, CX_0.1_3, by = "Entrez_ID")
merged_CX_0.5 <- merge(data_6hr_logFC, CX_0.5_3, by = "Entrez_ID")
merged_DOX_0.1 <- merge(data_6hr_logFC, DOX_0.1_3, by = "Entrez_ID")
merged_DOX_0.5 <- merge(data_6hr_logFC, DOX_0.5_3, by = "Entrez_ID")

# Remove NA values
merged_CX_0.1 <- na.omit(merged_CX_0.1)
merged_CX_0.5 <- na.omit(merged_CX_0.5)
merged_DOX_0.1 <- na.omit(merged_DOX_0.1)
merged_DOX_0.5 <- na.omit(merged_DOX_0.5)

# Rename columns explicitly to avoid conflicts
colnames(merged_CX_0.1) <- colnames(merged_CX_0.5) <-
  colnames(merged_DOX_0.1) <- colnames(merged_DOX_0.5) <-
  c("Entrez_ID", "Symbol_6hr", "logFC_6hr", "logFC_Drug", "AveExpr_Drug", "t_Drug", "P.Value_Drug", "adj.P.Val_Drug", "B_Drug")

# Add drug and concentration labels for faceting
merged_CX_0.1$Drug <- "CX (3hr)"
merged_CX_0.5$Drug <- "CX (3hr)"
merged_DOX_0.1$Drug <- "DOX (3hr)"
merged_DOX_0.5$Drug <- "DOX (3hr)"

merged_CX_0.1$Concentration <- "0.1 µM"
merged_CX_0.5$Concentration <- "0.5 µM"
merged_DOX_0.1$Concentration <- "0.1 µM"
merged_DOX_0.5$Concentration <- "0.5 µM"

# Combine all datasets into a single data frame
merged_data <- rbind(
  merged_CX_0.1[, c("Entrez_ID", "logFC_Drug", "logFC_6hr", "Concentration", "Drug")],
  merged_CX_0.5[, c("Entrez_ID", "logFC_Drug", "logFC_6hr", "Concentration", "Drug")],
  merged_DOX_0.1[, c("Entrez_ID", "logFC_Drug", "logFC_6hr", "Concentration", "Drug")],
  merged_DOX_0.5[, c("Entrez_ID", "logFC_Drug", "logFC_6hr", "Concentration", "Drug")]
)

# Calculate correlations for each facet
cor_CX_0.1 <- cor.test(merged_CX_0.1$logFC_Drug, merged_CX_0.1$logFC_6hr, method = "pearson")
cor_CX_0.5 <- cor.test(merged_CX_0.5$logFC_Drug, merged_CX_0.5$logFC_6hr, method = "pearson")
cor_DOX_0.1 <- cor.test(merged_DOX_0.1$logFC_Drug, merged_DOX_0.1$logFC_6hr, method = "pearson")
cor_DOX_0.5 <- cor.test(merged_DOX_0.5$logFC_Drug, merged_DOX_0.5$logFC_6hr, method = "pearson")

# Data frame for r and p-values annotations
correlation_data <- data.frame(
  Drug = rep(c("CX (3hr)", "DOX (3hr)"), each = 2),
  Concentration = rep(c("0.1 µM", "0.5 µM"), times = 2),
  x = max(merged_data$logFC_Drug, na.rm = TRUE) * 0.75,  # Adjusted placement for better fit
  y = max(merged_data$logFC_6hr, na.rm = TRUE) * 0.85,
  label = c(
    paste0("r = ", round(cor_CX_0.1$estimate, 3), "\np = ", signif(cor_CX_0.1$p.value, 3)),
    paste0("r = ", round(cor_CX_0.5$estimate, 3), "\np = ", signif(cor_CX_0.5$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.1$estimate, 3), "\np = ", signif(cor_DOX_0.1$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.5$estimate, 3), "\np = ", signif(cor_DOX_0.5$p.value, 3))
  )
)

# Create scatter plot with facets for concentration (vertically) and drug (horizontally)
scatter_plot <- ggplot(merged_data, aes(x = logFC_Drug, y = logFC_6hr)) +
  geom_point(alpha = 0.6, color = "black") +  # Black scatter points
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Black regression line
  labs(
    title = "Correlation between Drug (3hr) and Colorectal 6hr logFC",
    x = "logFC (Drug_3hr)",
    y = "logFC (Colorectal 6hr)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),

    # Outer box around the entire facet plot
    panel.border = element_rect(color = "black", fill = NA, linewidth = 2),

    # Black box around each facet title
    strip.background = element_rect(fill = "white", color = "black", linewidth = 1.5),
    strip.text = element_text(size = 12, face = "bold", color = "black")
  ) +

  # Facet by drug (horizontally) and concentration (vertically)
  facet_grid(Concentration ~ Drug) +

  # Add r and p-value in the upper-right corner of each facet with better fitting font size
  geom_text(data = correlation_data,
            aes(x = x, y = y, label = label),
            inherit.aes = FALSE, size = 3, fontface = "bold")  # Reduced font size for better fit

# Display plot
print(scatter_plot)

📌 24hr

# Import the CSV file
data_24hr_logFC <- read.csv("data/Colorectal/24hrlogFC.csv", stringsAsFactors = FALSE)

# Map gene symbols to Entrez IDs using org.Hs.eg.db
data_24hr_logFC <- data_24hr_logFC %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

# Merge datasets on Entrez_ID for CX and DOX at both concentrations (24 hours)
merged_CX_0.1_24 <- merge(data_24hr_logFC, CX_0.1_24, by = "Entrez_ID")
merged_CX_0.5_24 <- merge(data_24hr_logFC, CX_0.5_24, by = "Entrez_ID")
merged_DOX_0.1_24 <- merge(data_24hr_logFC, DOX_0.1_24, by = "Entrez_ID")
merged_DOX_0.5_24 <- merge(data_24hr_logFC, DOX_0.5_24, by = "Entrez_ID")

# Remove NA values
merged_CX_0.1_24 <- na.omit(merged_CX_0.1_24)
merged_CX_0.5_24 <- na.omit(merged_CX_0.5_24)
merged_DOX_0.1_24 <- na.omit(merged_DOX_0.1_24)
merged_DOX_0.5_24 <- na.omit(merged_DOX_0.5_24)

# Rename columns explicitly to avoid conflicts
colnames(merged_CX_0.1_24) <- colnames(merged_CX_0.5_24) <-
  colnames(merged_DOX_0.1_24) <- colnames(merged_DOX_0.5_24) <-
  c("Entrez_ID", "Symbol_24hr", "logFC_24hr", "logFC_Drug", "AveExpr_Drug", "t_Drug", "P.Value_Drug", "adj.P.Val_Drug", "B_Drug")

# Add drug and concentration labels for faceting
merged_CX_0.1_24$Drug <- "CX (24hr)"
merged_CX_0.5_24$Drug <- "CX (24hr)"
merged_DOX_0.1_24$Drug <- "DOX (24hr)"
merged_DOX_0.5_24$Drug <- "DOX (24hr)"

merged_CX_0.1_24$Concentration <- "0.1 µM"
merged_CX_0.5_24$Concentration <- "0.5 µM"
merged_DOX_0.1_24$Concentration <- "0.1 µM"
merged_DOX_0.5_24$Concentration <- "0.5 µM"

# Combine all datasets into a single data frame
merged_data_24hr <- rbind(
  merged_CX_0.1_24[, c("Entrez_ID", "logFC_Drug", "logFC_24hr", "Concentration", "Drug")],
  merged_CX_0.5_24[, c("Entrez_ID", "logFC_Drug", "logFC_24hr", "Concentration", "Drug")],
  merged_DOX_0.1_24[, c("Entrez_ID", "logFC_Drug", "logFC_24hr", "Concentration", "Drug")],
  merged_DOX_0.5_24[, c("Entrez_ID", "logFC_Drug", "logFC_24hr", "Concentration", "Drug")]
)

# Calculate correlations for each facet
cor_CX_0.1_24 <- cor.test(merged_CX_0.1_24$logFC_Drug, merged_CX_0.1_24$logFC_24hr, method = "pearson")
cor_CX_0.5_24 <- cor.test(merged_CX_0.5_24$logFC_Drug, merged_CX_0.5_24$logFC_24hr, method = "pearson")
cor_DOX_0.1_24 <- cor.test(merged_DOX_0.1_24$logFC_Drug, merged_DOX_0.1_24$logFC_24hr, method = "pearson")
cor_DOX_0.5_24 <- cor.test(merged_DOX_0.5_24$logFC_Drug, merged_DOX_0.5_24$logFC_24hr, method = "pearson")

# Data frame for r and p-values annotations
correlation_data_24hr <- data.frame(
  Drug = rep(c("CX (24hr)", "DOX (24hr)"), each = 2),
  Concentration = rep(c("0.1 µM", "0.5 µM"), times = 2),
  x = max(merged_data_24hr$logFC_Drug, na.rm = TRUE) * 0.75,  # Adjusted placement for better fit
  y = max(merged_data_24hr$logFC_24hr, na.rm = TRUE) * 0.85,
  label = c(
    paste0("r = ", round(cor_CX_0.1_24$estimate, 3), "\np = ", signif(cor_CX_0.1_24$p.value, 3)),
    paste0("r = ", round(cor_CX_0.5_24$estimate, 3), "\np = ", signif(cor_CX_0.5_24$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.1_24$estimate, 3), "\np = ", signif(cor_DOX_0.1_24$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.5_24$estimate, 3), "\np = ", signif(cor_DOX_0.5_24$p.value, 3))
  )
)

# Create scatter plot with facets for concentration (vertically) and drug (horizontally)
scatter_plot_24hr <- ggplot(merged_data_24hr, aes(x = logFC_Drug, y = logFC_24hr)) +
  geom_point(alpha = 0.6, color = "black") +  # Black scatter points
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Black regression line
  labs(
    title = "Correlation between Drug (24hr) and Colorectal 24hr logFC",
    x = "logFC (Drug_24hr)",
    y = "logFC (Colorectal 24hr)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),

    # Outer box around the entire facet plot
    panel.border = element_rect(color = "black", fill = NA, linewidth = 2),

    # Black box around each facet title
    strip.background = element_rect(fill = "white", color = "black", linewidth = 1.5),
    strip.text = element_text(size = 12, face = "bold", color = "black")
  ) +

  # Facet by drug (horizontally) and concentration (vertically)
  facet_grid(Concentration ~ Drug) +

  # Add r and p-value in the upper-right corner of each facet with better fitting font size
  geom_text(data = correlation_data_24hr,
            aes(x = x, y = y, label = label),
            inherit.aes = FALSE, size = 3, fontface = "bold")  # Reduced font size for better fit

# Display plot
print(scatter_plot_24hr)

📌 72hr

# Import the CSV file
data_72hr_logFC <- read.csv("data/Colorectal/72hrlogFC.csv", stringsAsFactors = FALSE)

# Map gene symbols to Entrez IDs using org.Hs.eg.db
data_72hr_logFC <- data_72hr_logFC %>%
  mutate(Entrez_ID = mapIds(org.Hs.eg.db,
                            keys = Symbol,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first"))

# Merge datasets on Entrez_ID for CX and DOX at both concentrations (48 hours)
merged_CX_0.1_48 <- merge(data_72hr_logFC, CX_0.1_48, by = "Entrez_ID")
merged_CX_0.5_48 <- merge(data_72hr_logFC, CX_0.5_48, by = "Entrez_ID")
merged_DOX_0.1_48 <- merge(data_72hr_logFC, DOX_0.1_48, by = "Entrez_ID")
merged_DOX_0.5_48 <- merge(data_72hr_logFC, DOX_0.5_48, by = "Entrez_ID")

# Remove NA values
merged_CX_0.1_48 <- na.omit(merged_CX_0.1_48)
merged_CX_0.5_48 <- na.omit(merged_CX_0.5_48)
merged_DOX_0.1_48 <- na.omit(merged_DOX_0.1_48)
merged_DOX_0.5_48 <- na.omit(merged_DOX_0.5_48)

# Rename columns explicitly to avoid conflicts
colnames(merged_CX_0.1_48) <- colnames(merged_CX_0.5_48) <-
  colnames(merged_DOX_0.1_48) <- colnames(merged_DOX_0.5_48) <-
  c("Entrez_ID", "Symbol_72hr", "logFC_72hr", "logFC_Drug", "AveExpr_Drug", "t_Drug", "P.Value_Drug", "adj.P.Val_Drug", "B_Drug")

# Add drug and concentration labels for faceting
merged_CX_0.1_48$Drug <- "CX (48hr)"
merged_CX_0.5_48$Drug <- "CX (48hr)"
merged_DOX_0.1_48$Drug <- "DOX (48hr)"
merged_DOX_0.5_48$Drug <- "DOX (48hr)"

merged_CX_0.1_48$Concentration <- "0.1 µM"
merged_CX_0.5_48$Concentration <- "0.5 µM"
merged_DOX_0.1_48$Concentration <- "0.1 µM"
merged_DOX_0.5_48$Concentration <- "0.5 µM"

# Combine all datasets into a single data frame
merged_data_48hr <- rbind(
  merged_CX_0.1_48[, c("Entrez_ID", "logFC_Drug", "logFC_72hr", "Concentration", "Drug")],
  merged_CX_0.5_48[, c("Entrez_ID", "logFC_Drug", "logFC_72hr", "Concentration", "Drug")],
  merged_DOX_0.1_48[, c("Entrez_ID", "logFC_Drug", "logFC_72hr", "Concentration", "Drug")],
  merged_DOX_0.5_48[, c("Entrez_ID", "logFC_Drug", "logFC_72hr", "Concentration", "Drug")]
)

# Calculate correlations for each facet
cor_CX_0.1_48 <- cor.test(merged_CX_0.1_48$logFC_Drug, merged_CX_0.1_48$logFC_72hr, method = "pearson")
cor_CX_0.5_48 <- cor.test(merged_CX_0.5_48$logFC_Drug, merged_CX_0.5_48$logFC_72hr, method = "pearson")
cor_DOX_0.1_48 <- cor.test(merged_DOX_0.1_48$logFC_Drug, merged_DOX_0.1_48$logFC_72hr, method = "pearson")
cor_DOX_0.5_48 <- cor.test(merged_DOX_0.5_48$logFC_Drug, merged_DOX_0.5_48$logFC_72hr, method = "pearson")

# Data frame for r and p-values annotations
correlation_data_48hr <- data.frame(
  Drug = rep(c("CX (48hr)", "DOX (48hr)"), each = 2),
  Concentration = rep(c("0.1 µM", "0.5 µM"), times = 2),
  x = max(merged_data_48hr$logFC_Drug, na.rm = TRUE) * 0.75,  # Adjusted placement for better fit
  y = max(merged_data_48hr$logFC_72hr, na.rm = TRUE) * 0.85,
  label = c(
    paste0("r = ", round(cor_CX_0.1_48$estimate, 3), "\np = ", signif(cor_CX_0.1_48$p.value, 3)),
    paste0("r = ", round(cor_CX_0.5_48$estimate, 3), "\np = ", signif(cor_CX_0.5_48$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.1_48$estimate, 3), "\np = ", signif(cor_DOX_0.1_48$p.value, 3)),
    paste0("r = ", round(cor_DOX_0.5_48$estimate, 3), "\np = ", signif(cor_DOX_0.5_48$p.value, 3))
  )
)

# Create scatter plot with facets for concentration (vertically) and drug (horizontally)
scatter_plot_48hr <- ggplot(merged_data_48hr, aes(x = logFC_Drug, y = logFC_72hr)) +
  geom_point(alpha = 0.6, color = "black") +  # Black scatter points
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Black regression line
  labs(
    title = "Correlation between Drug (48hr) and Colorectal 72hr logFC",
    x = "logFC (Drug_48hr)",
    y = "logFC (Colorectal 72hr)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),

    # Outer box around the entire facet plot
    panel.border = element_rect(color = "black", fill = NA, linewidth = 2),

    # Black box around each facet title
    strip.background = element_rect(fill = "white", color = "black", linewidth = 1.5),
    strip.text = element_text(size = 12, face = "bold", color = "black")
  ) +

  # Facet by drug (horizontally) and concentration (vertically)
  facet_grid(Concentration ~ Drug) +

  # Add r and p-value in the upper-right corner of each facet with better fitting font size
  geom_text(data = correlation_data_48hr,
            aes(x = x, y = y, label = label),
            inherit.aes = FALSE, size = 3, fontface = "bold")  # Reduced font size for better fit

# Display plot
print(scatter_plot_48hr)

sessionInfo()

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] biomaRt_2.58.2         clusterProfiler_4.10.1 org.Hs.eg.db_3.18.0   
 [4] AnnotationDbi_1.64.1   IRanges_2.36.0         S4Vectors_0.40.1      
 [7] Biobase_2.62.0         BiocGenerics_0.48.1    tidyr_1.3.1           
[10] dplyr_1.1.4            ggplot2_3.5.1         

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_1.8.9         
  [4] magrittr_2.0.3          farver_2.1.2            rmarkdown_2.29         
  [7] fs_1.6.3                zlibbioc_1.48.0         vctrs_0.6.5            
 [10] memoise_2.0.1           RCurl_1.98-1.13         ggtree_3.10.1          
 [13] htmltools_0.5.8.1       progress_1.2.3          curl_6.0.1             
 [16] gridGraphics_0.5-1      sass_0.4.9              bslib_0.8.0            
 [19] plyr_1.8.9              cachem_1.0.8            whisker_0.4.1          
 [22] igraph_2.1.1            lifecycle_1.0.4         pkgconfig_2.0.3        
 [25] Matrix_1.6-1.1          R6_2.5.1                fastmap_1.1.1          
 [28] gson_0.1.0              GenomeInfoDbData_1.2.11 digest_0.6.34          
 [31] aplot_0.2.3             enrichplot_1.22.0       colorspace_2.1-0       
 [34] patchwork_1.3.0         rprojroot_2.0.4         RSQLite_2.3.3          
 [37] labeling_0.4.3          filelock_1.0.3          mgcv_1.9-1             
 [40] httr_1.4.7              polyclip_1.10-7         compiler_4.3.0         
 [43] bit64_4.0.5             withr_3.0.2             BiocParallel_1.36.0    
 [46] viridis_0.6.5           DBI_1.2.3               ggforce_0.4.2          
 [49] MASS_7.3-60             rappdirs_0.3.3          HDO.db_0.99.1          
 [52] tools_4.3.0             ape_5.8                 scatterpie_0.2.4       
 [55] httpuv_1.6.15           glue_1.7.0              nlme_3.1-166           
 [58] GOSemSim_2.28.1         promises_1.3.0          grid_4.3.0             
 [61] shadowtext_0.1.4        reshape2_1.4.4          fgsea_1.28.0           
 [64] generics_0.1.3          gtable_0.3.6            data.table_1.14.10     
 [67] hms_1.1.3               xml2_1.3.6              tidygraph_1.3.1        
 [70] XVector_0.42.0          ggrepel_0.9.6           pillar_1.10.1          
 [73] stringr_1.5.1           yulab.utils_0.1.8       later_1.3.2            
 [76] splines_4.3.0           tweenr_2.0.3            BiocFileCache_2.10.2   
 [79] treeio_1.26.0           lattice_0.22-5          bit_4.0.5              
 [82] tidyselect_1.2.1        GO.db_3.18.0            Biostrings_2.70.1      
 [85] knitr_1.49              git2r_0.35.0            gridExtra_2.3          
 [88] xfun_0.50               graphlayouts_1.2.0      stringi_1.8.3          
 [91] workflowr_1.7.1         lazyeval_0.2.2          ggfun_0.1.8            
 [94] yaml_2.3.10             evaluate_1.0.3          codetools_0.2-20       
 [97] ggraph_2.2.1            tibble_3.2.1            qvalue_2.34.0          
[100] ggplotify_0.1.2         cli_3.6.1               munsell_0.5.1          
[103] jquerylib_0.1.4         Rcpp_1.0.12             GenomeInfoDb_1.38.8    
[106] dbplyr_2.5.0            png_0.1-8               XML_3.99-0.17          
[109] parallel_4.3.0          blob_1.2.4              prettyunits_1.2.0      
[112] DOSE_3.28.2             bitops_1.0-7            viridisLite_0.4.2      
[115] tidytree_0.4.6          scales_1.3.0            purrr_1.0.2            
[118] crayon_1.5.3            rlang_1.1.3             cowplot_1.1.3          
[121] fastmatch_1.1-4         KEGGREST_1.42.0