Showing docs for version 3.7-0

IndelRealigner

Perform local realignment of reads around indels

Category Sequence Data Processing Tools

Traversal ReadWalker

PartitionBy READ

Overview

The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percentage of regions requiring local realignment are due to the presence of insertions or deletions (indels) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, and these mismatches are easily mistaken for SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect to the true indel, also requiring realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.

Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However, it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.

There are 2 steps to the realignment process:

Determining (small) suspicious intervals which are likely in need of realignment (see the RealignerTargetCreator tool)
Running the realigner over those intervals (IndelRealigner)

For more details, see the indel realignment method documentation.
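The two steps can be chained in a small script. This is a minimal sketch, not an official workflow: the file names are placeholders, and the run helper and the DRY_RUN variable are hypothetical conveniences introduced here so the commands can be printed for review before anything is executed.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: determine the intervals that are likely in need of realignment.
run java -jar GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R reference.fasta \
  -I input.bam \
  -known indels.vcf \
  -o intervalListFromRTC.intervals

# Step 2: run the realigner over those intervals, with the same inputs.
run java -jar GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R reference.fasta \
  -I input.bam \
  -known indels.vcf \
  -targetIntervals intervalListFromRTC.intervals \
  -o realignedBam.bam
```

Setting DRY_RUN=0 in the environment would execute both commands in order; step 2 depends on the interval file written by step 1.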

Input

One or more aligned BAM files and optionally one or more lists of known indels.

Output

A realigned version of your input BAM file(s).

Usage example

 java -jar GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R reference.fasta \
   -I input.bam \
   -known indels.vcf \
   -targetIntervals intervalListFromRTC.intervals \
   -o realignedBam.bam

Caveats

The input BAM(s), reference, and known indel file(s) should be the same ones used for the RealignerTargetCreator step.
Because reads produced from the 454 technology inherently contain false indels, the realigner will not work with them (or with reads from similar technologies).
This tool also ignores MQ0 reads and reads with consecutive indel operators in the CIGAR string.
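To make the last caveat concrete, the sketch below flags CIGAR strings in which two indel operators (I or D) appear back to back. It only mimics the rule as stated above with a regular expression; it is not the check the tool itself performs, and the function name has_consecutive_indels is a hypothetical one chosen for this example.

```shell
# Succeeds (exit status 0) when a CIGAR string contains two indel
# operators (I or D) directly after one another, e.g. 2I3D.
has_consecutive_indels() {
  printf '%s\n' "$1" | grep -Eq '[0-9]+[ID][0-9]+[ID]'
}

# Example CIGAR strings: a plain match, a single insertion,
# and an insertion immediately followed by a deletion.
for cigar in 50M 20M2I28M 10M2I3D35M; do
  if has_consecutive_indels "$cigar"; then
    echo "$cigar: consecutive indel operators"
  else
    echo "$cigar: ok"
  fi
done
```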

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by IndelRealigner.

Downsampling settings

This tool does not apply any downsampling by default.

Command-line Arguments

Engine arguments

All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

CommandLineGATK

IndelRealigner specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)	Default value	Summary
Required Inputs
--targetIntervals	NA	Intervals file output from RealignerTargetCreator
Optional Inputs
--knownAlleles -known	[]	Input VCF file(s) with known indels
Optional Outputs
--out -o	NA	Output bam
Optional Parameters
--consensusDeterminationModel -model	USE_READS	Determines how to compute the possible alternate consensuses
--LODThresholdForCleaning -LOD	5.0	LOD threshold above which the cleaner will clean
--nWayOut	NA	Generate one output file for each input (-I) bam file (not compatible with -o)
Advanced Parameters
--entropyThreshold -entropy	0.15	Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)
--maxConsensuses	30	Max alternate consensuses to try (necessary to improve performance in deep coverage)
--maxIsizeForMovement -maxIsize	3000	maximum insert size of read pairs that we attempt to realign
--maxPositionalMoveAllowed -maxPosMove	200	Maximum positional move in basepairs that a read can be adjusted during realignment
--maxReadsForConsensuses -greedy	120	Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)
--maxReadsForRealignment -maxReads	20000	Max reads allowed at an interval for realignment
--maxReadsInMemory -maxInMemory	150000	max reads allowed to be kept in memory at a time by the SAMFileWriter
Advanced Flags
--noOriginalAlignmentTags -noTags	false	Don't output the original cigar or alignment start tags for each realigned read in the output bam

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.

--consensusDeterminationModel / -model

Determines how to compute the possible alternate consensuses
We recommend that users run with USE_READS when trying to realign high quality longer read data mapped with a gapped aligner; Smith-Waterman is really only necessary when using an ungapped aligner (e.g. MAQ in the case of single-end read data).

The --consensusDeterminationModel argument is an enumerated type (ConsensusDeterminationModel), which can have one of the following values:

KNOWNS_ONLY: Uses only indels from a provided ROD of known indels.
USE_READS: Additionally uses indels already present in the original alignments of the reads.
USE_SW: Additionally uses 'Smith-Waterman' to generate alternate consensuses.

ConsensusDeterminationModel USE_READS
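As an illustration, the sketch below assembles a command line that selects a non-default model. The MODEL and CMD variables are hypothetical names used only in this example, the file names are the placeholders from the usage example, and the command is printed rather than executed.

```shell
# Model to use; must be one of the three values listed above.
MODEL="KNOWNS_ONLY"

# Reject anything that is not a documented value before building the command.
case "$MODEL" in
  KNOWNS_ONLY|USE_READS|USE_SW) ;;
  *) echo "unknown model: $MODEL" >&2; exit 1 ;;
esac

# Assemble the command line as a string and print it for review.
CMD="java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I input.bam -known indels.vcf -targetIntervals intervalListFromRTC.intervals -model $MODEL -o realignedBam.bam"
echo "$CMD"
```

With KNOWNS_ONLY, the -known file is the only source of candidate indels, so it should not be omitted.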

--entropyThreshold / -entropy

Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)
For expert users only! This is similar to the argument in the RealignerTargetCreator walker. The point here is that the realigner will only proceed with the realignment (even above the given threshold) if it minimizes entropy among the reads (and doesn't simply push the mismatch column to another position). This parameter is just a heuristic and should be adjusted based on your particular data set.

double 0.15 [ [ -∞ ∞ ] ]

--knownAlleles / -known

Input VCF file(s) with known indels
Any number of VCF files representing known indels to be used for constructing alternate consensuses. These could be, e.g., dbSNP and/or official 1000 Genomes indel calls. Non-indel variants in these files will be ignored.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

List[RodBinding[VariantContext]] []

--LODThresholdForCleaning / -LOD

LOD threshold above which the cleaner will clean
This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment? Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.

double 5.0 [ [ -∞ ∞ ] ]

--maxConsensuses / -maxConsensuses

Max alternate consensuses to try (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.

int 30 [ [ -∞ ∞ ] ]

--maxIsizeForMovement / -maxIsize

maximum insert size of read pairs that we attempt to realign
For expert users only!

int 3000 [ [ -∞ ∞ ] ]

--maxPositionalMoveAllowed / -maxPosMove

Maximum positional move in basepairs that a read can be adjusted during realignment
For expert users only!

int 200 [ [ -∞ ∞ ] ]

--maxReadsForConsensuses / -greedy

Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.

int 120 [ [ -∞ ∞ ] ]

--maxReadsForRealignment / -maxReads

Max reads allowed at an interval for realignment
For expert users only! If this value is exceeded at a given interval, realignment is not attempted and the reads are passed to the output file(s) as-is. If you need to allow more reads (e.g. with very deep coverage) regardless of memory, use a higher number.

int 20000 [ [ -∞ ∞ ] ]

--maxReadsInMemory / -maxInMemory

max reads allowed to be kept in memory at a time by the SAMFileWriter
For expert users only! To minimize memory consumption you can lower this number (but then the tool may skip realignment on regions with too much coverage; and if the number is too low, it may generate errors during realignment). Just make sure to give Java enough memory! 4 GB should be enough with the default value.

int 150000 [ [ -∞ ∞ ] ]
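The Java heap size is set on the java command itself, not through a tool argument. A minimal sketch, assuming the standard JVM -Xmx option and the placeholder file names from the usage example; JAVA_HEAP, MAX_IN_MEMORY and CMD are hypothetical variable names, and the command is printed rather than executed.

```shell
# Heap size passed to the JVM; 4g matches the guidance for the default -maxInMemory.
JAVA_HEAP="4g"
# Default value of -maxInMemory, stated explicitly for clarity.
MAX_IN_MEMORY=150000

# Assemble the command line as a string and print it for review.
CMD="java -Xmx${JAVA_HEAP} -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I input.bam -targetIntervals intervalListFromRTC.intervals -maxInMemory ${MAX_IN_MEMORY} -o realignedBam.bam"
echo "$CMD"
```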

--noOriginalAlignmentTags / -noTags

Don't output the original cigar or alignment start tags for each realigned read in the output bam

boolean false

--nWayOut / -nWayOut

Generate one output file for each input (-I) bam file (not compatible with -o)
Reads from all input files will be realigned together, but then each read will be saved in the output file corresponding to the input file that the read came from. There are two ways to generate output bam file names: 1) if the value of this argument is a general string (e.g. '.cleaned.bam'), then extensions (".bam" or ".sam") will be stripped from the input file names and the provided string value will be pasted on instead; 2) if the value ends with '.map' (e.g. input_output.map), then a two-column tab-separated file with the specified name must exist and list a unique output file name (2nd column) for each input file name (1st column). Note that some GATK arguments do NOT work in conjunction with nWayOut (e.g. --disable_bam_indexing).

String NA
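For the second naming mode, the map file can be generated with a few lines of shell. This is a sketch under assumptions: sample1.bam and sample2.bam are hypothetical input names, and the .realigned.bam output suffix is an arbitrary choice for this example.

```shell
# Write a two-column, tab-separated map: input file name, then output file name.
map_file="input_output.map"
: > "$map_file"   # create or truncate the map file

for bam in sample1.bam sample2.bam; do
  # Strip the .bam extension and append a new suffix to get a unique output name.
  printf '%s\t%s\n' "$bam" "${bam%.bam}.realigned.bam" >> "$map_file"
done

cat "$map_file"
```

The resulting file would then be passed as -nWayOut input_output.map, with each input listed in the first column also supplied through its own -I argument.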

--out / -o

Output bam
The realigned bam file.

GATKSAMFileWriter NA

--targetIntervals / -targetIntervals

Intervals file output from RealignerTargetCreator
The interval list output from the RealignerTargetCreator tool using the same bam(s), reference, and known indel file(s).

R IntervalBinding[Feature] NA