Last updated: 2024-12-26
Checks: 6 1
Knit directory:
reproduce_genomics_paper_figures/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20241226)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ff491b5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/01_download_fastq_from_GEO.Rmd
Untracked: analysis/02_align_to_hg38.Rmd
Untracked: data/fastq/
Untracked: data/reference/
Untracked: imgs/
Unstaged changes:
Modified: README.md
Modified: reproduce_genomics_paper_figures.Rproj
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish()
to start tracking its development.
The next step is to align the fastq files to the genome. We will use hg38 reference human genome for the alignment.
Where to download the reference?
watch xx
use refgenine.
It is single-end read with 50 base pairs.
zless TAZ.fatq.gz | head -2 |tail -1
ATAGGCTTTAAGCTGTCTTTGGTTTGAAGGTGGGATTTTACCGGGGACCC
zless TAZ.fatq.gz | head -2 |tail -1 | wc -L
50
The ChIP-seq signal can happen to anywhere in the genome, so we need an aligner to align the reads to the genome.
For RNAseq data, STAR
is a popular aligner.
BWA, BWA-mem is very popular for DNA-seq (WGS, WES) alignment.
Let’s use bowtie2
https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes.
conda install bowtie2 -c bioconda
Channels:
- bioconda
- defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- nothing provides libcxx >=16 needed by bowtie2-2.5.4-hdcd892d_2
Could not solve for environment specs
The following packages are incompatible
└─ bowtie2 is not installable because there are no viable options
├─ bowtie2 2.5.4 would require
│ └─ libcxx >=16 , which does not exist (perhaps a missing channel);
└─ bowtie2 2.5.4 would require
└─ libcxx >=18 , which does not exist (perhaps a missing channel).
Ask ChatGPT, and it is lacking the libcxx. let’s install it first.
conda install libcxx -c conda-forge
conda install bowtie2 -c bioconda
Now it works, and after it finishes installing you can
bowtie2
scroll down the page https://bowtie-bio.sourceforge.net/bowtie2/index.shtml and on the left you will see the index to download.
We will download the GRCH38 (aka, hg38 without decoy). click it, it
will download a GRCh38_noalt_as.zip
file of 3.5G. unzip it
and place the folder to the data/reference
folder.
Which reference genome to use? Bonus, read this post by HengLi.
# make sure you are in the data folder
cd data/
bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U fastq/YAP.fastq.gz -S fastq/YAP1.sam --threads 8 --no-mixed --no-discordant -k 1
-x reference/GRCh38_noalt_as/GRCh38_noalt_as
: The path
to the Bowtie2 reference genome index (built with bowtie2-build
reference.fa reference_index).
-U input.fastq: Input FASTQ file for single-end reads.
-S output.sam: Output alignment file in SAM format.
–threads 8: Use 8 threads to speed up alignment.
–no-mixed: Ensures that only properly paired reads are reported (not relevant for single-end data but good for ChIP-seq).
–no-discordant: Prevents reporting discordant alignments (not relevant for single-end data but ensures consistent results for paired-end).
-k 1: Reports only the best alignment for each read (important for ChIP-seq to avoid multi-mapping).
Below is the output. And it is super-fast!! for 24 million reads, it only took me 1.5 mins with 8 CPUs.
24549590 reads; of these:
24549590 (100.00%) were unpaired; of these:
895629 (3.65%) aligned 0 times
23653961 (96.35%) aligned exactly 1 time
0 (0.00%) aligned >1 times
96.35% overall alignment rate
93.73 real 733.08 user 21.44 sys
The output is the sam
file which is a format to store
the alignment. It is just a text file and you can use
less -S YAP1.sam
to see the content.
After we made one sample work, we can loop over and do the same for other fastq files.
rm YAP.sam
for fq in fastq/*fastq.gz
do
bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U $fq -S ${fq/fastq.gz/sam} --threads 8 --no-mixed --no-discordant -k 1
done
here we used bash string manipulation ${fq/fastq.gz/sam}
to replace the fastq.gz
to sam
as the
output.
Bonus: read my blog post on it https://crazyhottommy.blogspot.com/2015/07/string-manipulation-in-bash.html
Make sure you have all the sam files:
ls fastq/*sam
fastq/IgG.sam fastq/TAZ.sam fastq/TEAD4.sam fastq/YAP.sam
Most of the tools work with the binary form of the sam file which is
a bam file. The size of the bam file is smaller samtools
is
the key toolkit to deal with it.
conda install samtools -c bioconda
for sam in fastq/*sam
do
samtools view -bS $sam > ${sam/sam/bam}
done
# check the size of the sam and bam files
ls -shlt fastq/*am
2590616 -rw-r--r--@ 1 tommytang staff 1.2G Dec 26 16:02 fastq/YAP.bam
3310856 -rw-r--r--@ 1 tommytang staff 1.6G Dec 26 16:01 fastq/TEAD4.bam
2885488 -rw-r--r--@ 1 tommytang staff 1.4G Dec 26 15:59 fastq/TAZ.bam
3277792 -rw-r--r--@ 1 tommytang staff 1.5G Dec 26 15:58 fastq/IgG.bam
10158320 -rw-r--r--@ 1 tommytang staff 4.8G Dec 26 15:54 fastq/YAP.sam
14483624 -rw-r--r--@ 1 tommytang staff 6.9G Dec 26 15:52 fastq/TEAD4.sam
11437112 -rw-r--r--@ 1 tommytang staff 5.4G Dec 26 15:50 fastq/TAZ.sam
12947640 -rw-r--r--@ 1 tommytang staff 6.2G Dec 26 15:48 fastq/IgG.sam
# remove the sam files to save space
rm fastq/*sam
sort the reads in the bam file by coordinates.
for bam in fastq/*bam
do
samtools sort -@ 4 -o ${bam/bam/sorted.bam} $bam
done
ls fastq/*bam
fastq/IgG.bam fastq/TAZ.bam fastq/TEAD4.bam fastq/YAP.bam
fastq/IgG.sorted.bam fastq/TAZ.sorted.bam fastq/TEAD4.sorted.bam fastq/YAP.sorted.bam
# remove all the unsorted bam
ls fastq/*bam | grep -v sorted.bam | xargs rm
Always test the command for a single sample and then apply it to all samples after you confirm it works.
Why Sort a BAM File?
Sorting arranges the reads in the BAM file by their genomic coordinates (e.g., by chromosome and position). This is crucial for downstream tools (e.g., variant calling, visualization) that rely on efficient access to reads aligned to specific regions. Compatibility with Analysis Tools:
for bam in fastq/*sorted.bam
do
samtools index $bam
done
Why Index a BAM File?
The BAM index (.bai or .csi file) allows tools to quickly locate and access reads aligned to specific genomic regions without scanning the entire file. This dramatically improves performance for tasks like visualization (e.g., in IGV) or focused analyses of particular loci.
Many tools (e.g., samtools, deepTools, GATK) use the index to load only relevant portions of the BAM file, reducing memory and computation time.
Genome browsers like IGV or UCSC Genome Browser require indexed BAM files to display aligned reads efficiently.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 httr_1.4.7 cli_3.6.3 knitr_1.48
[5] rlang_1.1.4 xfun_0.46 stringi_1.8.4 processx_3.8.4
[9] promises_1.3.0 jsonlite_1.8.8 glue_1.7.0 rprojroot_2.0.4
[13] git2r_0.35.0 htmltools_0.5.8.1 httpuv_1.6.15 ps_1.7.7
[17] sass_0.4.9 fansi_1.0.6 rmarkdown_2.27 jquerylib_0.1.4
[21] tibble_3.2.1 evaluate_0.24.0 fastmap_1.2.0 yaml_2.3.10
[25] lifecycle_1.0.4 whisker_0.4.1 stringr_1.5.1 compiler_4.4.1
[29] fs_1.6.4 pkgconfig_2.0.3 Rcpp_1.0.13 rstudioapi_0.16.0
[33] later_1.3.2 digest_0.6.36 R6_2.5.1 utf8_1.2.4
[37] pillar_1.9.0 callr_3.7.6 magrittr_2.0.3 bslib_0.8.0
[41] tools_4.4.1 cachem_1.1.0 getPass_0.2-4