Last updated: 2022-03-15
Checks: 7 0
Knit directory: chipseq-cross-species/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220209) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 61f21eb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: analysis/.RData
Ignored: analysis/.Rhistory
Ignored: data/genomic_data/
Ignored: output/annotations/
Ignored: output/bam_files/
Ignored: output/filtered_peaks/
Ignored: output/logs/
Ignored: output/peaks/
Ignored: output/plots/
Ignored: output/qc/A-1_input.SeqDepthNorm.bw
Ignored: output/qc/A-1_input.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/A-1_input.ccurve.txt
Ignored: output/qc/A-1_input.extrap.txt
Ignored: output/qc/A-1_input_R1_trimmed.fastq.gz
Ignored: output/qc/A-1_input_est_lib_complex_metrics.txt
Ignored: output/qc/A-2_H3K4me3.SeqDepthNorm.bw
Ignored: output/qc/A-2_H3K4me3.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/A-2_H3K4me3.ccurve.txt
Ignored: output/qc/A-2_H3K4me3.extrap.txt
Ignored: output/qc/A-2_H3K4me3_R1_trimmed.fastq.gz
Ignored: output/qc/A-2_H3K4me3_est_lib_complex_metrics.txt
Ignored: output/qc/A-2_H3K4me3_vs_A-1_input.frip_default.txt
Ignored: output/qc/A-2_H3K4me3_vs_B-1_input.frip_default.txt
Ignored: output/qc/A-3_H3K27ac.SeqDepthNorm.bw
Ignored: output/qc/A-3_H3K27ac.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/A-3_H3K27ac.ccurve.txt
Ignored: output/qc/A-3_H3K27ac.extrap.txt
Ignored: output/qc/A-3_H3K27ac_R1_trimmed.fastq.gz
Ignored: output/qc/A-3_H3K27ac_est_lib_complex_metrics.txt
Ignored: output/qc/A-3_H3K27ac_vs_A-1_input.frip_default.txt
Ignored: output/qc/A-3_H3K27ac_vs_B-1_input.frip_default.txt
Ignored: output/qc/B-1_input.SeqDepthNorm.bw
Ignored: output/qc/B-1_input.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/B-1_input.ccurve.txt
Ignored: output/qc/B-1_input.extrap.txt
Ignored: output/qc/B-1_input_R1_trimmed.fastq.gz
Ignored: output/qc/B-1_input_est_lib_complex_metrics.txt
Ignored: output/qc/B-2_H3K4me3.SeqDepthNorm.bw
Ignored: output/qc/B-2_H3K4me3.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/B-2_H3K4me3.ccurve.txt
Ignored: output/qc/B-2_H3K4me3.extrap.txt
Ignored: output/qc/B-2_H3K4me3_R1_trimmed.fastq.gz
Ignored: output/qc/B-2_H3K4me3_est_lib_complex_metrics.txt
Ignored: output/qc/B-2_H3K4me3_vs_A-1_input.frip_default.txt
Ignored: output/qc/B-2_H3K4me3_vs_A-1_input.frip_default_dunnart_downSampled.txt
Ignored: output/qc/B-2_H3K4me3_vs_B-1_input.frip_default.txt
Ignored: output/qc/B-3_H3K27ac.SeqDepthNorm.bw
Ignored: output/qc/B-3_H3K27ac.SeqDepthNorm_dunnart_downSampled.bw
Ignored: output/qc/B-3_H3K27ac.ccurve.txt
Ignored: output/qc/B-3_H3K27ac.extrap.txt
Ignored: output/qc/B-3_H3K27ac_R1_trimmed.fastq.gz
Ignored: output/qc/B-3_H3K27ac_est_lib_complex_metrics.txt
Ignored: output/qc/B-3_H3K27ac_vs_A-1_input.frip_default.txt
Ignored: output/qc/B-3_H3K27ac_vs_B-1_input.frip_default.txt
Ignored: output/qc/bamPEFragmentSize_rawcounts.tab
Ignored: output/qc/bamPEFragmentSize_rawcounts_dunnart_downSampled.tab
Ignored: output/qc/multiBAM_fingerprint_metrics.txt
Ignored: output/qc/multiBAM_fingerprint_metrics_dunnart_downSampled.txt
Ignored: output/qc/multiBAM_fingerprint_rawcounts.txt
Ignored: output/qc/multiBAM_fingerprint_rawcounts_dunnart_downSampled.txt
Ignored: output/qc/multibamsum.npz
Ignored: output/qc/multibamsum.tab
Ignored: output/qc/multibamsum_dunnart_downSampled.npz
Ignored: output/qc/multibamsum_dunnart_downSampled.tab
Ignored: output/qc/pearsoncor_multibamsum_matrix.txt
Ignored: output/qc/pearsoncor_multibamsum_matrix_dunnart_downSampled.txt
Ignored: output/wga/
Ignored: results/
Untracked files:
Untracked: .snakemake/
Untracked: Rplots.pdf
Untracked: data/raw_reads/
Untracked: output/qc/H3K27ac_overlap_default.frip
Untracked: output/qc/ucsc_alignment/
Untracked: output/rnaseq/
Untracked: slurm-33931929.out
Untracked: slurm-33931930.out
Untracked: slurm-33931931.out
Untracked: slurm-33931932.out
Untracked: slurm-33931933.out
Untracked: slurm-33931934.out
Untracked: slurm-33934389.out
Untracked: slurm-33934390.out
Untracked: slurm-33934391.out
Untracked: slurm-33934392.out
Untracked: slurm-33934393.out
Untracked: slurm-33934394.out
Untracked: slurm-33935177.out
Untracked: slurm-33935178.out
Untracked: slurm-33935179.out
Untracked: slurm-33935180.out
Untracked: slurm-33935181.out
Untracked: slurm-33935182.out
Untracked: slurm-33938210.out
Untracked: slurm-33938211.out
Untracked: slurm-33938212.out
Untracked: slurm-33938213.out
Untracked: slurm-33938214.out
Untracked: slurm-33938215.out
Untracked: slurm-33948861.out
Untracked: slurm-33948862.out
Untracked: slurm-33948863.out
Untracked: slurm-33948864.out
Untracked: slurm-33948865.out
Untracked: slurm-33948866.out
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/whole_genome_alignment.Rmd) and HTML (docs/whole_genome_alignment.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 61f21eb | lecook | 2022-03-15 | wflow_publish(“analysis/whole_genome_alignment.Rmd”) |
| html | c3418c7 | lecook | 2022-03-15 | Build site. |
| Rmd | 09ea90e | lecook | 2022-03-15 | wflow_publish(“analysis/whole_genome_alignment.Rmd”) |
| html | a557720 | lecook | 2022-03-15 | Build site. |
| Rmd | 4b30eee | lecook | 2022-03-15 | wflow_publish(“analysis/whole_genome_alignment.Rmd”) |
| Rmd | 6ebc287 | lecook | 2022-03-14 | fix |
| Rmd | bd46533 | lecook | 2022-03-14 | update |
| Rmd | d8b90f6 | lecook | 2022-03-14 | updated |
| Rmd | 290c569 | lecook | 2022-03-14 | update |
| Rmd | 7e64d0f | lecook | 2022-03-14 | add wga notes |
| Rmd | a97d3c5 | lecook | 2022-03-01 | first commit |
| Rmd | 1fa92d9 | lecook | 2022-02-23 | add analysis files |
There are a number of online tutorials to build the alignment and nets using Lastz: http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto https://www.bioconductor.org/packages/release/bioc/vignettes/CNEr/inst/doc/PairwiseWholeGenomeAlignment.html https://github.com/hillerlab/GenomeAlignmentTools (more fine-tuned and advanced)
Methods from Hecker & Hiller 2020 Whole Genome Alignment:
To compute pairwise and multiple genome alignments, we used the human hg38 assembly as the reference (Supplementary Fig. 1 shows the entire workflow). We first built pairwise alignments between human and a query species using lastz and axtChain to compute co-linear alignment chains [82]. To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) that have a sufficient sensitivity to capture orthologous exons [16]. To align chimpanzee, bonobo, and gorilla, we changed the lastz parameters (K = 4500 and L = 4500).
After building chains, we applied RepeatFiller (RRID:SCR_017414), a method that performs another round of local alignment, considering unaligning regions ≤20 kb in size that are bounded by co-linear alignment blocks up- and downstream. RepeatFiller removes any repeat masking from the unaligned region and is therefore able to detect novel alignments between repetitive regions. We have previously shown that RepeatFiller detects several megabases of aligning repetitive sequences that would be missed otherwise. After RepeatFiller, we applied chainCleaner with parameters -LRfoldThreshold = 2.5 -doPairs -LRfoldThresholdPairs = 10 -maxPairDistance = 10000 -maxSuspectScore = 100000 -minBrokenChainScore = 75000 to improve alignment specificity. Pairwise alignment chains were converted into alignment nets using a modified version of chainNet that computes real scores of partial nets. Nets were filtered using NetFilterNonNested.perl with parameters -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000, which applies the UCSC “syntenic net” score thresholds (minTopScore of 300000 and minSynScore of 200000) and keeps nested nets that align to the same locus (inversions or local translocations; net type “inv” or “syn” according to netClass) if they score ≥5,000. For the Mongolian gerbil, tarsier, Malayan flying lemur, sperm whale, Przewalski’s horse, Weddell seal, Malayan pangolin, Chinese pangolin, Hoffmann’s two-fingered sloth, and Cape rock hyrax that have genome assemblies with a scaffold N50 ≤1,000,000 and a contig N50 ≤100,000, we just required that nets have a score ≥100,000. For marsupials and platypus, we lowered the score threshold for nets to 10,000 and kept inv or syn nets with scores ≥3,000.
In chain and net lingo, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query.
A gapless block is a base-for-base alignment between part of the target and part of the query, possibly including mismatching bases. It has the same length in bases on the target and the query. This is the output of the most primitive alignment algorithms.
A gap is a link between two gapless blocks, indicating that the target or the query has sequence that should be skipped over in order to make the best-scoring alignment. In other words, the scoring penalty for skipping over one or more bases is less than the penalty for continuing to align the sequences without skipping.
A single-sided gap is a gap in which sequence in either target or query must be skipped over. A plausible explanation for needing to skip over a base in the target while not skipping a base in the query is that either the target has an inserted base or the query has a deleted base. Many alignment tools produce alignments with single-sided gaps between gapless blocks.
A double-sided gap skips over sequence in both target and query because the sum of penalties for mismatching bases exceeds the penalty for extending a gap across them. This is possible only when the penalty for extending a gap is less than the penalty for creating a new gap and less than the penalty for a mismatch, and when the alignment algorithm is capable of considering double-sided gaps.
A chain is a sequence of non-overlapping gapless blocks, with single- or double-sided gaps between blocks. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat). Chains are constructed by the axtChain program which finds pairwise alignments with the same target and query sequence, on the same strand, that can be merged if overlapping and joined into one longer alignment with a higher score under an affine gap-scoring system (progressively decreasing penalties for longer gaps).
A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, which in turn may have gaps filled in by lower-level chains and so on.
Spartan modules
module load foss
module load lastz
module load ucsc/21072020
module load perl
Create conda environment will all the dependencies:
conda create -n wga
conda activate wga
conda config --add channels conda-forge
conda config --add channels biocore
conda config --add channels bioconda
conda install repeatmasker
Run RepeatModeler to de novo find repeat regions in the dunnart genome:
BuildDatabase -name dunnart -engine ncbi data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta
nohup RepeatModeler -database dunnart -pa 20 >& output/wga/repeat_modeler
repeat_modeler/repeatmodeler.out
Run RepeatMasker to mask repeats in dunnart genome (makes repeats lowercase). Run as an array for scaffolds to make it quicker. Create commands for array slurm script: repeatMasker.sh
RepeatMasker -q -xsmall data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta -default_search_engine hmmer -trf_prgm /home/lecook/.conda/envs/wga/bin/trf -hmmer_dir /home/lecook/.conda/envs/wga/bin/
Using faSplit from the UCSC Kent Tools
faSplit byName data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta data/genomic_data/smiCra1_scaffolds/
faToTwoBit data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr_RM.fasta data/genomic_data/smiCra1.2bit
twoBitInfo data/genomic_data/smiCra1.2bit stdout | sort -k2rn > data/genomic_data/smiCra1.chrom.sizes
http://hgdownload.cse.ucsc.edu/goldenpath/mm10/bigZips/mm10.2bit
mm10.2bit - contains the complete mouse/mm10 genome sequence in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) that have a sufficient sensitivity to capture orthologous exons
To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, correspond- ing to parameter set 2 in Table 1. To align vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369–77.
Create commands for running lastZ for all scaffolds: code/whole_genome_alignment/lastz.sh
Repeated with vertebrate alignment parameters and HoxD55 scoring matrix. Retrieve A LOT more aligned sequences. For example scaffold00002 with mammal parameters retrieved 863M of data, while with the new parameters it’s 2.5GB.
Run as an array on slurm: code/whole_genome_alignment//array_wrapper.slurm
http://last.cbrc.jp/doc/maf-convert.html
module load last/last/1066
maf-convert axt output/wga/maf/${tr}.maf > output/wga/axt/${tr}.axt
script: code/whole_genome_alignment//maf-convert_commands.sh run with: code/whole_genome_alignment//array_wrapper.slurm
We use axtChain (http://www.soe.ucsc.edu/~kent; default parameters) to build co-linear alignment chains.
axtChain -linearGap=loose -scoreScheme=lecook/bin/GenomeAlignmentTools/HoxD55.q output/axt/[axt_files] data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit output/wga/axt_chain/mm10_smiCra1.chain
script: code/whole_genome_alignment//axt_commands.sh run with: code/whole_genome_alignment//array_wrapper.slurm
Merge short chains into longer ones, concatenate chains and sort
chainMergeSort output/wga/axt_chain/*.chain > output/wga/chain_merge/smiCra1_mm10.chain
git clone https://github.com/hillerlab/GenomeAlignmentTools.git
cd GenomeAlignmentTools/kent/src
make
# Kent binaries
PATH=$PATH:/Users/lauracook/bin/GenomeAlignmentTools/kent/bin;export PATH
export KENTSRC_DIR=/Users/lauracook/bin/GenomeAlignmentTools/kent/src/
cd /Users/lauracook/bin/GenomeAlignmentTools/src
export MACHTYPE=x86_64
make
PATH=$PATH:/Users/lauracook/bin/GenomeAlignmentTools/src
https://github.com/hillerlab/GenomeAlignmentTools
RepeatFiller [5] is a tool to incorporate newly-detected repeat-overlapping alignments into pairwise alignment chains [4]. Its runtime adds little to the computationally more expensive step of generating chains in pairwise whole-genome alignments. RepeatFiller circumvents the problem that considering all repeat-overlapping alignment seeds during whole genome alignment is computationally not feasible. Therefore, RepeatFiller only aligns local genomic regions that are bounded by colinear aligning blocks, as provided in the chains, which makes it feasible to consider all seeds including those that overlap repetitive regions. RepeatFiller application to mammalian genome alignment chains can add between 22 and 84 Mb of previously-undetected alignments that mostly originate from transposable elements [5]. This helps to comprehensively align repetitive regions and improves the annotation of conserved non-coding elements.
python3 RepeatFiller.py -c output/wga/chain_merge/smiCra1_mm10.chain -T2 data/genomic_data/mm10.2bit -Q2 data/genomic_data/smiCra1.2bit
python3 RepeatFiller.py -c smiCra1_mm10_patched_sorted.chain -T2 data/genomic_data/mm10.2bit -Q2 data/genomic_data/smiCra1.2bit
RepeatFiller adds 0.3G data to the alignment - 2 Feb 2021
patchChain.perl performs a highly sensitive local pairwise alignment for loci flanked by aligning blocks [1,3]. Given an alignment chain [4], it considers all chains that pass the score and span filters (optional parameters), extracts all the unaligning loci and creates local alignment jobs. After executing these alignment jobs, the newly found and the original local alignments are combined and used to produce a new set of improved chains.
This procedure is recommended for comparisons between species that are separated by >0.75 substitutions per neutral site [1].
patchChain.perl output/wga/chain_merge/smiCra1_mm10.chain data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit data/genomic_data/mm10.chrom.sizes data/genomic_data/smiCra1.chrom.sizes -chainMinScore 5000 -gapMaxSizeT 500000 -gapMaxSizeQ 500000 -gapMinSizeT 30 -gapMinSizeQ 30 -numJobs 10 -jobDir jobs -jobList jobList -outputDir output/wga/patch_chain/pslOutput -minEntropy 1.8 -windowSize 30 -minIdentity 60 -lastzParameters "--format=axt K=1500 L=2500 M=0 T=0 W=5 Q=lecook/bin/GenomeAlignmentTools/example/HoxD55.q"
This results in 10 alignment jobs that are located in jobs/ and listed in ‘jobList’ now execute these jobs on a compute cluster or run them sequentially by doing ‘chmod +x jobList; ./jobList’
# concatenate all new results
find output/wga/psl -name "*.psl" | xargs -i cat {} > output/wga/patch_chain/newAlignments.psl
# concatenate all genomewide psl lastz alignments
find psl -name "*.psl" | xargs -i cat {} > otuput/wga/psl/genomeWide.lastz.psl
# combine the genome-wide lastz results (the combined psl file that was used to create the input chains) and the newly found psl alignments
cat psl/genomeWide.lastz.psl patchChain/newAlignments.psl > patch_chain/all.psl
patchChain adds 1.1G to the genome wide alignment file
use axtChain from the Kent source to compute alignment chains that include the new alignments
This create a folder with the jobs and a jobList which can be called in an array slurm script. Then you can run the jobs in parallel.
axtChain on psl alignments (patchChain new alignments plus genomeWide lastz alignments)
axtChain -psl -linearGap=loose -scoreScheme=/lecook/bin/GenomeAlignmentTools/HoxD55.q output/wga/patch_chain/all.psl data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit output/wga/patch_chain/smiCra1_mm10_patched.chain
4.7G Jan 31 18:42 smiCra1_mm10.chain 5.3G Feb 1 17:52 smiCra1_mm10_patched.chain
https://github.com/hillerlab/GenomeAlignmentTools
After RepeatFiller, we applied chainCleaner with parameters -LRfoldThreshold = 2.5 -doPairs -LRfoldThresholdPairs = 10 -maxPairDistance = 10000 -maxSuspectScore = 100000 -minBrokenChainScore = 75000 to improve alignment specificity.
chainCleaner improves the specificity in genome alignment chains by detecting and removing local alignments that obscure the evolutionary history of genomic rearrangements [2]. The input is a chain file, ideally after adding alignments found with highly sensitive parameters if distal species are compared. The output is a chain file that contains re-scored and score-sorted chains after removing the local alignments from the parent chains and adding them as individual chains. The resulting output file can be used to get alignment nets by running chainNet [4].
chainCleaner output/wga/repeat_filler/smiCra1_mm10_repFil.chain data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit output/wga/chain_cleaner/smiCra1_mm10_repFill_chainCl.chain output/wga/chain_cleaner/smiCra1_mm10_repFill_chainCl.bed -tSizes=mm10.chrom.sizes -qSizes=smiCra1.chrom.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 -linearGap=loose
Run chain cleaner on patched chain file
chainCleaner smiCra1_mm10_patched_sorted_repFil.chain data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit output/wga/chain_cleaner/smiCra1_mm10_patched_sorted_repFil_chainCl.chain output/wga/chain_cleaner/smiCra1_mm10_patched_sorted_repFil_chainCl.bed -tSizes=data/genomic_data/mm10.sizes -qSizes=data/genomic_data/smiCra1.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 -linearGap=loose
-LRfoldThreshold = threshold for removing local alignment blocks if the score of the left and right fill of brokenChain. Default is 2.5 -doPairs = flag: if set, do test if pairs of chain breaking alignments can be removed -LRfoldThresholdPairs = threshold for removing local alignment blocks if the score of the left and right fill of broken chains (for pairs). Default = 10 -maxPairDistance = only consider pairs of chain breaking alignments where the distance between the end of the upstream CBA and the start of the downstream CBA is at most that many bp (default 10000) -maxSuspectScore = threshold for score of suspect subChain. If higher, do not remove suspect.
-linearGap=loose
directory: output/wga/chain_cleaner/
For filtering the chains, we need the size of each chromosome
faSize data/genomic_data/smiCra1.fa -detailed > data/genomic_data/smiCra1.sizes
faSize data/genomic_data/mm10.fa -detailed > data/genomic_data/mm10.sizes
chainPreNet output/wga/chain_cleaner/smiCra1_mm10_patched_sorted_chainCl.chain data/genomic_data/mm10.sizes data/genomic_data/smiCra1.sizes smiCra1_mm10_patched_sorted_chainCl_preNet.chain
chainPreNet without chainCleaner
chainPreNet smiCra1_mm10_patched_sorted_repFil_chainCl.chain data/genomic_data/mm10.sizes data/genomic_data/smiCra1.sizes output/wga/chain_prenet/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet.chain
Given a set of alignment chains, chainNet produces alignment nets, which is a hierarchical collection of chains or parts of chains that attempt to capture only orthologous alignments [4]. The original chainNet implementation approximates the score of “sub-nets” (nets that come from a part of a chain and fill a gap in a higher-level net) by the fraction of aligning bases. This can lead to a bias in case the aligning blocks of a chain are not equally distributed. We implemented a new parameter “-rescore” in chainNet that computes the real score of each subnet [2].
Make the alignment nets:
chainNet output/wga/chain_prenet/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet.chain -minSpace=1 data/genomic_data/mm10.sizes data/genomic_data/smiCra1.sizes stdout /dev/null | netSyntenic stdin output/wga/chain_net/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet.net
Creates a single chain file using only the chains that also appear in the net.
netChainSubset output/wga/chain_prenet/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet.net output/wga/chain_cleaner/smiCra1_mm10_patched_sorted_repFil_chainCl.chain output/wga/chain_net/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet.chain
chainStitchId output/wga/chain_net/smiCra1_mm10_patched_sorted_chainCl_netted.chain chainStitchId output/wga/chain_stitch/smiCra1_mm10_patched_sorted_chainCl_netted_stitched.chain
chainStitchId chainStitchId output/wga/chain_net/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet.chain chainStitchId output/wga/chain_stitch/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain
From the synteny files (positions), get the sequences and re-create alignments.
smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain
netToAxt output/wga/chain_net/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet.net chain_stitch/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain data/genomic_data/mm10.2bit data/genomic_data/smiCra1.2bit stdout | axtSort stdin output/wga/net_to_axt/mm10-smiCra1.axt
axtToMaf output/wga/net_to_axt/mm10-smiCra1.axt data/genomic_data/mm10.sizes data/genomic_data/smiCra1.sizes output/wga/maf/mm10-smiCra1.maf -tPrefix=mm10. -qPrefix=smiCra1.
chainSwap mm10_to_smiCra1_patched_sorted_chainCl_netted_stitched.chain stdout | nice chainSort stdin stdout | nice gzip -c > smiCra1_to_mm10_patched_sorted_chainCl_netted_stitched.chain
chainSwap smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain stdout | nice chainSort stdin stdout > mm10_smiCra1_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain
MafFilter. Using mm10-smiCra1 maf file generated using axtToMaf look at the block lengths and genomic coverage.
maffilter param=options-file
Parameter options file
DATA=mm10-smiCra1 //A user-defined variable, representing the input maf file, without extension
input.file=$(DATA).maf.gz //Input maf file, gzipped
input.file.compression=gzip
output.log=$(DATA).maffilter.log //Output log file
maf.filter= \
SequenceStatistics( \
statistics=( \
BlockLength, \
AlnScore, \
BlockCounts, \
PairwiseDivergence( \
species1=mm10, \
species2=smiCra1)), \
ref_species=mm10, \
file=divergence.statistics.csv)
Summarise block lengths, block counts, pairwise divergence and calculate exon coverage in the maf file between species in R.
# Load libraries
library(plyr)
library(data.table)
library(seqinr)
library(Biostrings)
library(tidyverse)
mm10 <- fread("output/qc/ucsc_alignment/mm10.statistics.csv") # read in table
smiCra1 <- fread("output/qc/ucsc_alignment/smiCra1.statistics.csv") # read in table
head(mm10) #check file has been imported correctly
Chr Start Stop BlockLength AlnScore Counts.A Counts.C Counts.G
1: chr1 3012982 3013097 118 2740 23 16 12
2: chr1 3027832 3027936 104 2614 31 23 11
3: chr1 3028771 3028920 149 2821 51 18 36
4: chr1 3029833 3029947 122 2687 38 19 21
5: chr1 3044214 3044384 170 787 49 21 38
6: chr1 3044384 3044519 142 2483 30 35 29
Counts.T Counts.Gap Counts.Unresolved
1: 64 3 0
2: 39 0 0
3: 44 0 0
4: 36 8 0
5: 62 0 0
6: 41 7 0
mm10ChrCount <- mm10 %>% dplyr::count(Chr, sort=TRUE) # how many blocks per chromosome
smiCra1ChrCount <- smiCra1 %>% dplyr::count(Chr, sort=TRUE) # how many blocks per chromosome
mm10ChrCount
Chr n
1: chr1 207321
2: chr2 206723
3: chr3 167622
4: chr6 163865
5: chr4 163152
6: chr5 163055
7: chr7 152609
8: chr8 142300
9: chr10 141726
10: chrX 138973
11: chr11 138669
12: chr9 137467
13: chr12 134230
14: chr14 132548
15: chr13 130426
16: chr15 114484
17: chr16 106874
18: chr17 103335
19: chr18 97483
20: chrY 73394
21: chr19 67800
22: chr5_JH584299_random 417
23: chrY_JH584301_random 233
24: chr1_GL456211_random 224
25: chr4_JH584293_random 218
26: chr4_GL456350_random 209
27: chr1_GL456210_random 199
28: chr4_JH584294_random 182
29: chr1_GL456221_random 181
30: chrY_JH584300_random 179
31: chrX_GL456233_random 170
32: chr1_GL456212_random 141
33: chrY_JH584303_random 132
34: chrY_JH584302_random 101
35: chrUn_GL456379 88
36: chr4_GL456216_random 87
37: chr5_GL456354_random 87
38: chr7_GL456219_random 87
39: chr5_JH584298_random 84
40: chr5_JH584297_random 83
41: chr5_JH584296_random 78
42: chrUn_JH584304 77
43: chr1_GL456213_random 53
44: chrUn_GL456359 38
45: chrUn_GL456378 35
46: chrUn_GL456393 35
47: chrUn_GL456367 32
48: chrUn_GL456382 30
49: chrUn_GL456370 27
50: chrUn_GL456372 27
51: chrUn_GL456387 27
52: chrUn_GL456239 25
53: chrUn_GL456366 23
54: chr4_JH584292_random 22
55: chrUn_GL456385 19
56: chrUn_GL456394 18
57: chrUn_GL456381 17
58: chrUn_GL456360 12
59: chrUn_GL456368 10
60: chrM 7
61: chr4_JH584295_random 3
Chr n
smiCra1ChrCount
Chr n
1: scaffold00002_pilon_pilon 456793
2: scaffold00001_pilon_pilon 105481
3: scaffold00004_pilon_pilon 73635
4: scaffold00020_pilon_pilon 73492
5: scaffold00009_pilon_pilon 69917
---
609: scaffold00699_pilon_pilon 1
610: scaffold00704_pilon_pilon 1
611: scaffold00708_pilon_pilon 1
612: scaffold00714_pilon_pilon 1
613: scaffold00715_pilon_pilon 1
just use mouse as the statistics are the same
size1to10bp <- mm10[which(mm10$BlockLength>=1 & mm10$BlockLength<=10),]
size10to100bp <- mm10[which(mm10$BlockLength>=10 & mm10$BlockLength<=100),]
size100to1kb <- mm10[which(mm10$BlockLength>=100 & mm10$BlockLength<=1000),]
size1kbto10kb <- mm10[which(mm10$BlockLength>=1000 & mm10$BlockLength<=10000),]
size10kbto100kb <- mm10[which(mm10$BlockLength>=10000 & mm10$BlockLength<=100000),]
size100kbto1Mb <- mm10[which(mm10$BlockLength>=100000 & mm10$BlockLength<=1000000),]
dplyr::count(size1to10bp)
n
1: 5523
count(size10to100bp)
n
1: 816883
count(size100to1kb)
n
1: 1975414
count(size1kbto10kb)
n
1: 103456
count(size10kbto100kb)
n
1: 17
count(size100kbto1Mb)
n
1: 0
sum(size1to10bp$BlockLength)
[1] 28265
sum(size10to100bp$BlockLength)
[1] 57934059
sum(size100to1kb$BlockLength)
[1] 578314504
sum(size1kbto10kb$BlockLength)
[1] 165302684
sum(size10kbto100kb$BlockLength)
[1] 212515
sum(size100kbto1Mb$BlockLength)
[1] 0
df <- data.table(
size = c("size1to10bp", "size10to100bp", "size100to1kb", "size1kbto10kb", "size10kbto100kb", "size100kbto1Mb"),
sum_block_length = c(sum(size1to10bp$BlockLength),
sum(size10to100bp$BlockLength),
sum(size100to1kb$BlockLength),
sum(size1kbto10kb$BlockLength),
sum(size10kbto100kb$BlockLength),
sum(size100kbto1Mb$BlockLength)),
number_blocks = c(dplyr::count(size1to10bp), dplyr::count(size10to100bp), dplyr::count(size100to1kb),
dplyr::count(size1kbto10kb), dplyr::count(size10kbto100kb), dplyr::count(size100kbto1Mb))
)
df
size sum_block_length number_blocks
1: size1to10bp 28265 5523
2: size10to100bp 57934059 816883
3: size100to1kb 578314504 1975414
4: size1kbto10kb 165302684 103456
5: size10kbto100kb 212515 17
6: size100kbto1Mb 0 0
mm10 exons from refseq (downloaded from UCSC)
dunnart exons from lifted devil annotation gtf file converted to a bed file using BEDOPS gtf2bed function
gff2bed < data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff > data/genomic_data/dunnart_gff.bed
less data/genomic_data/dunnart_gff.bed | grep "exon" > data/genomic_data/dunnart_exons.bed
mm10 refseq ncbi curated exons downloaded from ucsc
system("bedtools sort -i data/genomic_data/mm10exons.bed > output/qc/ucsc_alignment/mm10exonsSorted.bed", intern=TRUE) ## sort by coordinates
system("bedtools merge -i output/qc/ucsc_alignment/mm10exonsSorted.bed > output/qc/ucsc_alignment/mm10exonsMerged.bed", intern=TRUE) ## bedtools merge where transcripts are the same coordinates or within each other
system("bedtools merge -i output/qc/ucsc_alignment/mm10_mafAlignment.bed > output/qc/ucsc_alignment/mm10_mafAlignmentMerged.bed", intern=TRUE) ## also merge the maf alignments from the liftOver
system("bedtools intersect -a output/qc/ucsc_alignment/mm10exonsMerged.bed -b output/qc/ucsc_alignment/mm10_mafAlignmentBlocks.bed -wo > output/qc/ucsc_alignment/mm10exonOverlaps.bed", intern=TRUE) ## intersect exon coordinates with maf alignment file
mm10exon <- fread("output/qc/ucsc_alignment/mm10exonsMerged.bed") ## exon file
mm10exon$length <- (mm10exon$V3 - mm10exon$V2) ## exon length in ncbi refseq
mm10overlaps <- fread("output/qc/ucsc_alignment/mm10exonOverlaps.bed") ## overlap file
dim(mm10overlaps) ## check file imported correctly
[1] 190176 10
sum(mm10overlaps$V10) ## 31776094bps of exons found in maf alignment
[1] 31776094
sum(mm10exon$length) ## 34221920bps of exons in refseq mm10
[1] 34221920
mm10exonCoverage <- sum(mm10overlaps$V10)/sum(mm10exon$length) ## converge alignment exon bps/total exon bps
print(paste0("mm10 exon coverage = ", mm10exonCoverage))
[1] "mm10 exon coverage = 0.928530427281695"
system("bedtools sort -i data/genomic_data/dunnart_exons.bed > output/qc/ucsc_alignment/dunnart_exonsSorted.bed", intern = TRUE) ## sort by coordinates
system("bedtools merge -i output/qc/ucsc_alignment/dunnart_exonsSorted.bed > output/qc/ucsc_alignment/smiCra1exonsMerged.bed", intern = TRUE) ## bedtools merge where transcripts are the same coordinates or within each other
system("bedtools merge -i output/qc/ucsc_alignment/smiCra1_mafAlignmentBlocksSorted.bed > output/qc/ucsc_alignment/smiCra1_mafAlignmentMerged.bed", intern = TRUE) ## also merge the maf alignments from the liftOver
system("bedtools intersect -a output/qc/ucsc_alignment/smiCra1exonsMerged.bed -b output/qc/ucsc_alignment/smiCra1_mafAlignmentMerged.bed -wo > output/qc/ucsc_alignment/smiCra1_exonOverlaps.bed", intern=TRUE) ## intersect exon coordinates with maf alignment file
smiCra1exon <- fread("output/qc/ucsc_alignment/smiCra1exonsMerged.bed") ## exon file
dim(smiCra1exon) ## check file imported correctly
[1] 193182 3
smiCra1exon$length <- (smiCra1exon$V3 - smiCra1exon$V2) ## exon length
smiCra1overlaps <- fread("output/qc/ucsc_alignment/smiCra1_exonOverlaps.bed") ## exon overlaps with maf alignment file
dim(smiCra1overlaps) ## check imported correctly
[1] 193996 7
sum(smiCra1exon$length) # 63502228
[1] 63502228
sum(smiCra1overlaps$V7) # 43939135
[1] 43939135
smiCra1exonCoverage <- sum(smiCra1overlaps$V7)/sum(smiCra1exon$length)
print(paste0("dunnart exon coverage = ", smiCra1exonCoverage))
[1] "dunnart exon coverage = 0.691930604387613"
mm10 <- fread("output/qc/ucsc_alignment/mm10.divergence.statistics.csv")
smiCra1 <- fread("output/qc/ucsc_alignment/smiCra1.divergence.statistics.csv")
mean(mm10$`Div.mm10-smiCra1`)
[1] 37.1781
mean(smiCra1$`Div.smiCra1-mm10`)
[1] 37.1781
## Genome coverage
readDNAmm10 <- readDNAStringSet("output/qc/ucsc_alignment/mm10_mafAlignment_nogaps.fasta") #number bp in alignment 743350114
readDNAsmiCra1 <- readDNAStringSet("output/qc/ucsc_alignment/smiCra1_mafAlignment_nogaps.fasta") #number bp in alignment 747935780
mm10genomeSize <- 2652783500
smiCra1genomeSize <-2838290115
mm10coverage <- readDNAmm10@ranges@width/mm10genomeSize
smiCra1coverage <- readDNAsmiCra1@ranges@width/smiCra1genomeSize
print(paste0("mm10 genome coverage = ", mm10coverage))
[1] "mm10 genome coverage = 0.280215145336964"
print(paste0("dunnart genome coverage =", smiCra1coverage))
[1] "dunnart genome coverage =0.263516324863077"
liftOver data/genomic_data/smiCra1exonsMerged.bed output/wga/liftOver_chains/mm10_smiCra1_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain output/qc/ucsc_alignment/smiCra1exonsMapped.txt output/qc/ucsc_alignment/smiCra1exonsUnmapped.txt
number of dunnart exons = 193182
number of dunnart exons that can be mapped to mouse = 111693
liftOver output/qc/ucsc_alignment/mm10exonsMerged.bed output/wga/liftOver_chains/smiCra1_mm10_patched_sorted_repFil_chainCl_preNet_chainNet_stitched.chain output/qc/ucsc_alignment/mm10exonsMapped.txt output/qc/ucsc_alignment/mm10exonsUnmapped.txt
number of mouse exons = 193776
number of mouse exons that can be lifted over to dunnart = 160213
[1] Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res., 45(14), 8369–8377, 2017
[2] Suarez H, Langer BE, Ladde P, Hiller M. chainCleaner improves genome alignment specificity and sensitivity. Bioinformatics, 33(11):1596-1603, 2017
[3] Hiller M, Agarwal S, Notwell JH, Parikh R, Guturu H, Wenger AM, Bejerano G. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res, 41(15):e151.
[4] Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. PNAS, 100(20):11484-9, 2003
[5] Osipova E, Hecker N, Hiller M. RepeatFiller newly identifies megabases of aligning repetitive sequences and improves annotations of conserved non-exonic elements, submitted
[6] MafFilter
[7] Kent tools UCSC
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/local/easybuild-2019/easybuild/software/compiler/gcc/10.2.0/openblas/0.3.12/lib/libopenblas_haswellp-r0.3.12.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8
[4] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[7] tibble_3.1.2 ggplot2_3.3.3 tidyverse_1.3.1
[10] Biostrings_2.62.0 GenomeInfoDb_1.30.1 XVector_0.34.0
[13] IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0
[16] seqinr_4.2-5 data.table_1.14.0 plyr_1.8.6
[19] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 sass_0.4.0 jsonlite_1.7.2
[4] modelr_0.1.8 bslib_0.2.5.1 assertthat_0.2.1
[7] getPass_0.2-2 cellranger_1.1.0 GenomeInfoDbData_1.2.7
[10] yaml_2.2.1 progress_1.2.2 pillar_1.6.1
[13] backports_1.2.1 glue_1.4.2 digest_0.6.27
[16] promises_1.2.0.1 rvest_1.0.0 colorspace_2.0-1
[19] htmltools_0.5.1.1 httpuv_1.6.1 pkgconfig_2.0.3
[22] broom_0.7.6 haven_2.4.1 zlibbioc_1.40.0
[25] scales_1.1.1 processx_3.5.2 whisker_0.4
[28] later_1.2.0 git2r_0.28.0 generics_0.1.0
[31] ellipsis_0.3.2 withr_2.4.2 cli_2.5.0
[34] magrittr_2.0.1 crayon_1.4.1 readxl_1.3.1
[37] evaluate_0.14 ps_1.6.0 fs_1.5.0
[40] fansi_0.5.0 MASS_7.3-54 xml2_1.3.2
[43] tools_4.1.0 prettyunits_1.1.1 hms_1.1.0
[46] lifecycle_1.0.1 reprex_2.0.0 munsell_0.5.0
[49] callr_3.7.0 ade4_1.7-16 compiler_4.1.0
[52] jquerylib_0.1.4 rlang_1.0.2 grid_4.1.0
[55] RCurl_1.98-1.3 rstudioapi_0.13 bitops_1.0-7
[58] rmarkdown_2.8 gtable_0.3.0 DBI_1.1.1
[61] R6_2.5.0 lubridate_1.7.10 knitr_1.33
[64] utf8_1.2.1 rprojroot_2.0.2 stringi_1.6.2
[67] Rcpp_1.0.8 vctrs_0.3.8 dbplyr_2.1.1
[70] tidyselect_1.1.1 xfun_0.23