ATAC_fastqc

Last updated: 2024-01-30

Checks: 7 passed, 0 failed

Knit directory: ATAC_learning/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20231016)

The command set.seed(20231016) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ccb3f28

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ccb3f28. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/initial_complete_stats_run1.txt
    Ignored:    data/multiqc_fastqc_run1.txt
    Ignored:    data/multiqc_fastqc_run2.txt
    Ignored:    data/multiqc_genestat_run1.txt
    Ignored:    data/multiqc_genestat_run2.txt

Untracked files:
    Untracked:  code/just_for_Fun.R

Unstaged changes:
    Modified:   README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Fastqc_results.Rmd) and HTML (docs/Fastqc_results.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	ccb3f28	reneeisnowhere	2024-01-30	first updates

library(tidyverse)
# library(ggsignif)
# library(cowplot)
# library(ggpubr)
# library(scales)
# library(sjmisc)
library(kableExtra)
# library(broom)
# library(biomaRt)
library(RColorBrewer)
# library(gprofiler2)
# library(qvalue)

This code takes the MultiQC FastQC output file and does the following. It splits the rows into trimmed and untrimmed files, separates the trimmed file names into the categories I want, then adds the untrimmed rows back in (splitting their file names the same way). After the rbind, I split treatmenttime by position, fix the values in the time column, and remove the digits from the trt column. I then add a new column called “trimmed” that lets me group by trimmed versus untrimmed files, and select only the columns I want to keep.

multiqc_fastqc2 <- read_csv("data/multiqc_fastqc_run2.txt")
multiqc_general_stats2 <- read_csv("data/multiqc_genestat_run2.txt")


fastqc_full <- multiqc_fastqc2 %>% 
  # the last 144 rows are taken to be the trimmed files:
  # drop the "trimmed" prefix and the S-number while splitting the name
  slice_tail(n=144) %>% 
  separate(Filename, into = c(NA,"ind","treatmenttime",NA,"read")) %>% 
  # the first 144 rows are taken to be the untrimmed files: same split, no prefix
  rbind(., (multiqc_fastqc2 %>% slice_head(n=144) %>% separate(Filename, into = c("ind","treatmenttime",NA,"read")))) %>% 
  # skip the first 2 characters, then take 2 for trt and up to 3 for time
  separate_wider_position(., cols = treatmenttime, widths = c(2,trt=2,time=3),too_few = "align_start") %>% 
  # one-letter treatment codes leave a time digit in trt, so rebuild time from it
  mutate(time=case_match(trt,"E2"~"24h","E3"~"3h","M2"~"24h", "M3"~"3h","T2"~"24h","T3"~"3h","V2"~"24h","V3"~"3h",.default = time)) %>% 
  mutate(trt=gsub("[[:digit:]]", "", trt) ) %>% 
  # flag the rows that come from trimmed files
  mutate(trimmed = if_else(grepl(pattern ="^trim", x = Sample), "yes","no")) %>% 
  select(Sample:read, trimmed,`Total Sequences`:avg_sequence_length) %>% 
  # add the general stats columns and give them shorter names
  full_join(., multiqc_general_stats2, join_by(Sample)) %>% 
   rename("percent_gc"="FastQC_mqc-generalstats-fastqc-percent_gc",
         "avg_seq_len"= "FastQC_mqc-generalstats-fastqc-avg_sequence_length",
         "percent_dup"= "FastQC_mqc-generalstats-fastqc-percent_duplicates",
         "percent_fails"= "FastQC_mqc-generalstats-fastqc-percent_fails",
         "total_sequences"= "FastQC_mqc-generalstats-fastqc-total_sequences") %>% 
  # set factor levels and labels so plots and tables use a fixed order
  mutate(ind = factor(ind, levels = c("Ind1", "Ind2", "Ind3", "Ind4", "Ind5", "Ind6"))) %>%
  mutate(time = factor(time, levels = c("3h", "24h"), labels= c("3 hours","24 hours"))) %>% 
  mutate(trt = factor(trt, levels = c("DX","E", "DA","M", "T", "V"), labels = c("DOX","EPI", "DNR", "MTX", "TRZ", "VEH")))

The columns kept are Sample, ind, trt, time, read, trimmed, Total Sequences, Sequences flagged as poor quality, Sequence length, %GC, total_deduplicated_percentage, and avg_sequence_length. I then join in the general stats file and rename its columns to shorter names.
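To make the file-name parsing easier to follow, here is a minimal sketch that runs the same splitting steps on three sample names taken from the output below. The object names `toy` and `parsed` are only for illustration, and only the "E" treatment code is handled in the `case_match()` call. It shows why the time column needs fixing: a one-letter treatment code such as "E" leaves the first digit of the time in trt.

```r
library(dplyr)
library(tidyr)

# Three sample names copied from the Sample column; `toy` is an illustrative name
toy <- tibble(Filename = c("trimmed_Ind1_75DA24h_S7_R1",
                           "trimmed_Ind1_75DA3h_S1_R1",
                           "trimmed_Ind1_75E24h_S9_R1"))

parsed <- toy %>%
  # split on "_"; NA drops the "trimmed" prefix and the S-number
  separate(Filename, into = c(NA, "ind", "treatmenttime", NA, "read")) %>%
  # skip 2 characters ("75"), then take 2 for trt and up to 3 for time
  separate_wider_position(treatmenttime, widths = c(2, trt = 2, time = 3),
                          too_few = "align_start") %>%
  # "75E24h" gives trt = "E2" and time = "4h", so recover the time from trt
  mutate(time = case_match(trt, "E2" ~ "24h", "E3" ~ "3h", .default = time),
         # then strip the stray digit from the treatment code
         trt = gsub("[[:digit:]]", "", trt))

parsed
```

After these steps trt holds "DA", "DA", "E" and time holds "24h", "3h", "24h", which is the form the factor() calls in the full pipeline expect.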

fastqc_full

# A tibble: 288 × 17
   Sample                     ind   trt   time   read  trimmed `Total Sequences`
   <chr>                      <fct> <fct> <fct>  <chr> <chr>               <dbl>
 1 trimmed_Ind1_75DA24h_S7_R1 Ind1  DNR   24 ho… R1    yes              42551223
 2 trimmed_Ind1_75DA24h_S7_R2 Ind1  DNR   24 ho… R2    yes              42551223
 3 trimmed_Ind1_75DA3h_S1_R1  Ind1  DNR   3 hou… R1    yes              48230311
 4 trimmed_Ind1_75DA3h_S1_R2  Ind1  DNR   3 hou… R2    yes              48230311
 5 trimmed_Ind1_75DX24h_S8_R1 Ind1  DOX   24 ho… R1    yes              43284466
 6 trimmed_Ind1_75DX24h_S8_R2 Ind1  DOX   24 ho… R2    yes              43284466
 7 trimmed_Ind1_75DX3h_S2_R1  Ind1  DOX   3 hou… R1    yes              44480840
 8 trimmed_Ind1_75DX3h_S2_R2  Ind1  DOX   3 hou… R2    yes              44480840
 9 trimmed_Ind1_75E24h_S9_R1  Ind1  EPI   24 ho… R1    yes              42757767
10 trimmed_Ind1_75E24h_S9_R2  Ind1  EPI   24 ho… R2    yes              42757767
# ℹ 278 more rows
# ℹ 10 more variables: `Sequences flagged as poor quality` <dbl>,
#   `Sequence length` <chr>, `%GC` <dbl>, total_deduplicated_percentage <dbl>,
#   avg_sequence_length <dbl>, percent_dup <dbl>, percent_gc <dbl>,
#   avg_seq_len <dbl>, percent_fails <dbl>, total_sequences <dbl>

drug_pal <- c("#8B006D","#DF707E","#F1B72B", "#3386DD","#707031","#41B333")
fastqc_full %>% 
  filter(trimmed=="no") %>%
  ggplot(., aes(x=trt, y= `Total Sequences`))+
  geom_col(aes(fill= trt))+
  facet_wrap(ind~time)+
  scale_fill_manual(values=drug_pal)+
  theme_bw()+
  ggtitle("Total Sequences, untrimmed")+
      # ylab(ylab)+
      xlab("")+
      theme(strip.background = element_rect(fill = "white",linetype=1, linewidth = 0.5),
          plot.title = element_text(size=14,hjust = 0.5,face="bold"),
          axis.title = element_text(size = 10, color = "black"),
          axis.ticks = element_line(linewidth = 0.5),
          axis.line = element_line(linewidth = 0.5),
          axis.text.x = element_blank(),
          strip.text.x = element_text(margin = margin(2,0,2,0, "pt"),face = "bold"))

fastqc_full %>% 
  filter(trimmed=="yes") %>% 
  ggplot(., aes(x=trt, y= `Total Sequences`))+
  geom_col(aes(fill= trt))+
  facet_wrap(ind~time)+
  scale_fill_manual(values=drug_pal)+
  theme_bw()+
  ggtitle("Total Sequences, trimmed")+
      # ylab(ylab)+
      xlab("")+
      theme(strip.background = element_rect(fill = "white",linetype=1, linewidth = 0.5),
          plot.title = element_text(size=14,hjust = 0.5,face="bold"),
          axis.title = element_text(size = 10, color = "black"),
          axis.ticks = element_line(linewidth = 0.5),
          axis.line = element_line(linewidth = 0.5),
          axis.text.x = element_blank(),
          strip.text.x = element_text(margin = margin(2,0,2,0, "pt"),face = "bold"))

# Compare read counts before ("no") and after ("yes") trimming.
# Only R1 is kept; in the rows printed above, R1 and R2 report the same total.
totseq <- fastqc_full %>% 
  dplyr::filter(read == "R1") %>% 
  select(Sample, ind, trt, time, trimmed, `Total Sequences`) %>% 
  pivot_wider(id_cols = c(ind,trt,time), names_from = trimmed, values_from = `Total Sequences`) %>% 
  mutate(perc_removed = (no-yes)/no*100)

# Show the 72 rows as two tables of 36 rows each
kable(list(totseq[1:36,], totseq[37:72,]), caption = "Summary of Total sequences before and after trimming, with percentage of removed sequences") %>%
  kable_paper("striped", full_width = FALSE) %>%
  kable_styling(full_width = FALSE, font_size = 18)

Summary of Total sequences before and after trimming, with percentage of removed sequences

ind	trt	time	yes	no	perc_removed
Ind1	DNR	24 hours	42551223	42552417	0.0028060
Ind1	DNR	3 hours	48230311	48236407	0.0126378
Ind1	DOX	24 hours	43284466	43287755	0.0075980
Ind1	DOX	3 hours	44480840	44487109	0.0140917
Ind1	EPI	24 hours	42757767	42760898	0.0073221
Ind1	EPI	3 hours	43425217	43436091	0.0250345
Ind1	MTX	24 hours	42126655	42128326	0.0039665
Ind1	MTX	3 hours	34347823	34353188	0.0156172
Ind1	TRZ	24 hours	39592932	39618391	0.0642606
Ind1	TRZ	3 hours	32503524	32515737	0.0375603
Ind1	VEH	24 hours	43814361	43825945	0.0264318
Ind1	VEH	3 hours	33684259	33702857	0.0551823
Ind2	DNR	24 hours	45642245	45642976	0.0016016
Ind2	DNR	3 hours	49765031	49766630	0.0032130
Ind2	DOX	24 hours	47963129	47964028	0.0018743
Ind2	DOX	3 hours	43218795	43219481	0.0015872
Ind2	EPI	24 hours	37876743	37877549	0.0021279
Ind2	EPI	3 hours	46509493	46509868	0.0008063
Ind2	MTX	24 hours	47565953	47566199	0.0005172
Ind2	MTX	3 hours	46547872	46548086	0.0004597
Ind2	TRZ	24 hours	42224852	42225791	0.0022238
Ind2	TRZ	3 hours	38631467	38632152	0.0017731
Ind2	VEH	24 hours	38128833	38135521	0.0175375
Ind2	VEH	3 hours	39053023	39053395	0.0009525
Ind3	DNR	24 hours	37681759	37682029	0.0007165
Ind3	DNR	3 hours	71147689	71149112	0.0020000
Ind3	DOX	24 hours	39048735	39049130	0.0010115
Ind3	DOX	3 hours	43756089	43756875	0.0017963
Ind3	EPI	24 hours	41488932	41489927	0.0023982
Ind3	EPI	3 hours	50138075	50139561	0.0029637
Ind3	MTX	24 hours	36498742	36498973	0.0006329
Ind3	MTX	3 hours	49431453	49431897	0.0008982
Ind3	TRZ	24 hours	40927652	40928915	0.0030858
Ind3	TRZ	3 hours	38780291	38781957	0.0042958
Ind3	VEH	24 hours	31324889	31327676	0.0088963
Ind3	VEH	3 hours	44853273	44855052	0.0039661

ind	trt	time	yes	no	perc_removed
Ind4	DNR	24 hours	41965134	41969945	0.0114630
Ind4	DNR	3 hours	57305036	57308058	0.0052733
Ind4	DOX	24 hours	39197691	39206235	0.0217925
Ind4	DOX	3 hours	38800646	38806358	0.0147192
Ind4	EPI	24 hours	43019600	43031730	0.0281885
Ind4	EPI	3 hours	43496611	43500262	0.0083931
Ind4	MTX	24 hours	41030501	41031137	0.0015500
Ind4	MTX	3 hours	41964607	41971195	0.0156965
Ind4	TRZ	24 hours	45437929	45444700	0.0148994
Ind4	TRZ	3 hours	53428222	53433439	0.0097635
Ind4	VEH	24 hours	38506744	38516586	0.0255526
Ind4	VEH	3 hours	37431664	37434854	0.0085215
Ind5	DNR	24 hours	46899322	46905257	0.0126532
Ind5	DNR	3 hours	49843467	49867445	0.0480835
Ind5	DOX	24 hours	36866406	36868936	0.0068621
Ind5	DOX	3 hours	50184488	50205715	0.0422800
Ind5	EPI	24 hours	40050758	40060480	0.0242683
Ind5	EPI	3 hours	45761710	45805846	0.0963545
Ind5	MTX	24 hours	47153857	47154510	0.0013848
Ind5	MTX	3 hours	35648456	35648586	0.0003647
Ind5	TRZ	24 hours	44126954	44145364	0.0417031
Ind5	TRZ	3 hours	38314546	38349375	0.0908203
Ind5	VEH	24 hours	49541600	49581967	0.0814147
Ind5	VEH	3 hours	41343052	41487846	0.3490034
Ind6	DNR	24 hours	41059389	41059920	0.0012932
Ind6	DNR	3 hours	44095342	44095825	0.0010953
Ind6	DOX	24 hours	37577690	37578841	0.0030629
Ind6	DOX	3 hours	45117924	45118756	0.0018440
Ind6	EPI	24 hours	43365041	43366727	0.0038878
Ind6	EPI	3 hours	43646213	43647011	0.0018283
Ind6	MTX	24 hours	41005624	41006798	0.0028629
Ind6	MTX	3 hours	45458055	45458459	0.0008887
Ind6	TRZ	24 hours	35700533	35701361	0.0023192
Ind6	TRZ	3 hours	35875024	35875416	0.0010927
Ind6	VEH	24 hours	39816658	39820062	0.0085485
Ind6	VEH	3 hours	42608440	42608980	0.0012673
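As a quick check of how perc_removed is computed, the sketch below redoes the arithmetic by hand for the first row of the table (Ind1, DNR, 24 hours). The variable names are only for illustration. Note that the value is already a percentage, so this sample loses roughly 0.003% of its reads to trimming.

```r
# Read counts for Ind1 / DNR / 24 hours, copied from the table above
untrimmed <- 42552417  # column "no"
trimmed   <- 42551223  # column "yes"

# Same formula as in the mutate() call: (no - yes) / no * 100
reads_removed <- untrimmed - trimmed
perc_removed  <- reads_removed / untrimmed * 100

reads_removed  # 1194 reads dropped by trimming
perc_removed   # matches the 0.0028060 shown in the table after rounding
```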

   # scroll_box(width = "100%", height = "400px")
  
  totseq %>% 
    ggplot(.,aes(x=trt,y=perc_removed) )+
    geom_col(aes(fill= trt))+
  facet_wrap(ind~time)+
  scale_fill_manual(values=drug_pal)+
  theme_bw()+
  ggtitle("Total Sequences, percent removed")+
      # ylab(ylab)+
      xlab("")+
      theme(strip.background = element_rect(fill = "white",linetype=1, linewidth = 0.5),
          plot.title = element_text(size=14,hjust = 0.5,face="bold"),
          axis.title = element_text(size = 10, color = "black"),
          axis.ticks = element_line(linewidth = 0.5),
          axis.line = element_line(linewidth = 0.5),
          axis.text.x = element_blank(),
          strip.text.x = element_text(margin = margin(2,0,2,0, "pt"),face = "bold"))

Percenttrimmed
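The three bar charts on this page differ only in the data, the y variable, and the title. One possible way to avoid repeating the theme settings is a small helper function. This is a sketch, not part of the original analysis: `plot_by_trt` and its arguments are hypothetical names, and the toy data frame reuses four rows from the summary table.

```r
library(ggplot2)

# Hypothetical helper (not in the original analysis): draws one faceted bar
# chart per call so the shared styling is written only once.
plot_by_trt <- function(df, yvar, title, pal) {
  ggplot(df, aes(x = trt, y = .data[[yvar]])) +
    geom_col(aes(fill = trt)) +
    facet_wrap(ind ~ time) +
    scale_fill_manual(values = pal) +
    theme_bw() +
    ggtitle(title) +
    xlab("") +
    ylab(yvar) +
    theme(strip.background = element_rect(fill = "white", linetype = 1, linewidth = 0.5),
          plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
          axis.title = element_text(size = 10, color = "black"),
          axis.ticks = element_line(linewidth = 0.5),
          axis.line = element_line(linewidth = 0.5),
          axis.text.x = element_blank(),
          strip.text.x = element_text(margin = margin(2, 0, 2, 0, "pt"), face = "bold"))
}

# Toy data: four rows copied from the summary table
toy <- data.frame(
  ind  = "Ind1",
  trt  = c("DNR", "DNR", "DOX", "DOX"),
  time = c("24 hours", "3 hours", "24 hours", "3 hours"),
  perc_removed = c(0.0028060, 0.0126378, 0.0075980, 0.0140917)
)

# Colours follow the order of drug_pal (DOX first, DNR third)
p <- plot_by_trt(toy, "perc_removed", "Total Sequences, percent removed",
                 pal = c(DNR = "#F1B72B", DOX = "#8B006D"))
```

With the full data, a call such as `plot_by_trt(filter(fastqc_full, trimmed == "no"), "Total Sequences", "Total Sequences, untrimmed", drug_pal)` would be expected to give the same kind of plot as the first chart.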

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-3 kableExtra_1.3.4   lubridate_1.9.3    forcats_1.0.0     
 [5] stringr_1.5.0      dplyr_1.1.3        purrr_1.0.2        readr_2.1.4       
 [9] tidyr_1.3.0        tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0   
[13] workflowr_1.7.1   

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      xfun_0.41         bslib_0.5.1       processx_3.8.2   
 [5] callr_3.7.3       tzdb_0.4.0        vctrs_0.6.4       tools_4.3.1      
 [9] ps_1.7.5          generics_0.1.3    parallel_4.3.1    fansi_1.0.5      
[13] highr_0.10        pkgconfig_2.0.3   webshot_0.5.5     lifecycle_1.0.4  
[17] farver_2.1.1      compiler_4.3.1    git2r_0.32.0      munsell_0.5.0    
[21] getPass_0.2-2     httpuv_1.6.12     htmltools_0.5.7   sass_0.4.7       
[25] yaml_2.3.7        crayon_1.5.2      later_1.3.1       pillar_1.9.0     
[29] jquerylib_0.1.4   whisker_0.4.1     cachem_1.0.8      tidyselect_1.2.0 
[33] rvest_1.0.3       digest_0.6.33     stringi_1.7.12    labeling_0.4.3   
[37] rprojroot_2.0.4   fastmap_1.1.1     grid_4.3.1        colorspace_2.1-0 
[41] cli_3.6.1         magrittr_2.0.3    utf8_1.2.4        withr_2.5.2      
[45] scales_1.2.1      promises_1.2.1    bit64_4.0.5       timechange_0.2.0 
[49] rmarkdown_2.25    httr_1.4.7        bit_4.0.5         hms_1.1.3        
[53] evaluate_0.23     knitr_1.45        viridisLite_0.4.2 rlang_1.1.2      
[57] Rcpp_1.0.11       glue_1.6.2        xml2_1.3.5        svglite_2.1.2    
[61] rstudioapi_0.15.0 vroom_1.6.4       jsonlite_1.8.7    R6_2.5.1         
[65] systemfonts_1.0.5 fs_1.6.3

ERM

2024-01-30