Perform local realignment of reads around indels
The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percentage of regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs. Moreover, since read mapping algorithms operate on each read independently, they cannot place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect to the true indel, and also require realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.
Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However, it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.
There are 2 steps to the realignment process:
1. Determining (small) suspicious intervals which are likely in need of realignment (see the RealignerTargetCreator tool).
2. Running the realigner over those intervals (IndelRealigner).
For more details, see the indel realignment method documentation.
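The two steps can be chained in a small script. The following is only a minimal sketch: all file names are placeholders, the `run` helper and the `DRY_RUN` variable are hypothetical conveniences that are not part of GATK, and by default the commands are printed rather than executed so the sketch can be tried without GATK installed.

```shell
#!/bin/sh
# Placeholder file names; adjust to your own data.
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"
REF="reference.fasta"
BAM="input.bam"
KNOWN="indels.vcf"
TARGETS="intervalListFromRTC.intervals"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

realign_two_step() {
    # Step 1: find the intervals that are likely to need realignment.
    run java -jar "$GATK_JAR" -T RealignerTargetCreator \
        -R "$REF" -I "$BAM" -known "$KNOWN" -o "$TARGETS" || return 1
    # Step 2: realign reads over those intervals, using the same
    # BAM, reference, and known indel file as in step 1.
    run java -jar "$GATK_JAR" -T IndelRealigner \
        -R "$REF" -I "$BAM" -known "$KNOWN" \
        -targetIntervals "$TARGETS" -o realignedBam.bam
}

realign_two_step
```

Setting `DRY_RUN=0` in the environment makes the same script execute the commands instead of printing them.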
Input: One or more aligned BAM files and, optionally, one or more lists of known indels.
Output: A realigned version of your input BAM file(s).
java -jar GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R reference.fasta \
  -I input.bam \
  -known indels.vcf \
  -targetIntervals intervalListFromRTC.intervals \
  -o realignedBam.bam
These Read Filters are automatically applied to the data by the Engine before processing by IndelRealigner.
This tool does not apply any downsampling by default.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
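As an illustration of an inherited engine argument, the sketch below adds -L to the usage example. The interval `20` is a placeholder contig name that must match the naming in your reference, the output file name is made up for the example, and the command is only printed, not run. With -L set, the output BAM is expected to contain only reads from the requested region.

```shell
# Placeholder interval: a contig name as it appears in the reference.
INTERVAL="20"

# Build the command as a string and print it (nothing is executed).
cmd="java -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R reference.fasta -I input.bam \
-targetIntervals intervalListFromRTC.intervals \
-L $INTERVAL \
-o realignedBam.chr20.bam"
echo "$cmd"
```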
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Inputs | |
--targetIntervals | NA | Intervals file output from RealignerTargetCreator
Optional Inputs | |
--knownAlleles, -known | [] | Input VCF file(s) with known indels
Optional Outputs | |
--out, -o | NA | Output BAM
Optional Parameters | |
--consensusDeterminationModel, -model | USE_READS | Determines how to compute the possible alternate consensuses
--LODThresholdForCleaning, -LOD | 5.0 | LOD threshold above which the cleaner will clean
--nWayOut | NA | Generate one output file for each input (-I) BAM file (not compatible with -o)
Advanced Parameters | |
--entropyThreshold, -entropy | 0.15 | Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)
--maxConsensuses | 30 | Max alternate consensuses to try (necessary to improve performance in deep coverage)
--maxIsizeForMovement, -maxIsize | 3000 | Maximum insert size of read pairs that we attempt to realign
--maxPositionalMoveAllowed, -maxPosMove | 200 | Maximum positional move in base pairs that a read can be adjusted during realignment
--maxReadsForConsensuses, -greedy | 120 | Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)
--maxReadsForRealignment, -maxReads | 20000 | Max reads allowed at an interval for realignment
--maxReadsInMemory, -maxInMemory | 150000 | Max reads allowed to be kept in memory at a time by the SAMFileWriter
Advanced Flags | |
--noOriginalAlignmentTags, -noTags | false | Don't output the original CIGAR or alignment start tags for each realigned read in the output BAM
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
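--consensusDeterminationModel / -model
Determines how to compute the possible alternate consensuses
We recommend that users run with USE_READS when trying to realign high-quality, longer-read data mapped with a gapped aligner; Smith-Waterman is really only necessary when using an ungapped aligner (e.g. MAQ in the case of single-end read data).
The --consensusDeterminationModel argument is an enumerated type (ConsensusDeterminationModel), which can have one of the following values:
KNOWNS_ONLY: uses only indels from the provided known indel file(s).
USE_READS: additionally uses indels already present in the original alignments of the reads.
USE_SW: additionally uses Smith-Waterman to generate alternate consensuses.
Type: ConsensusDeterminationModel. Default: USE_READS.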
--entropyThreshold / -entropy
Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)
For expert users only! This is similar to the argument in the RealignerTargetCreator walker. The point here is that the realigner will only proceed with the realignment (even above the given threshold) if it minimizes entropy among the reads (and doesn't simply push the mismatch column to another position). This parameter is just a heuristic and should be adjusted based on your particular data set.
Type: double. Default: 0.15. Range: -∞ to ∞.
--knownAlleles / -known
Input VCF file(s) with known indels
Any number of VCF files representing known indels to be used for constructing alternate consensuses. These could be, for example, dbSNP and/or official 1000 Genomes indel calls. Non-indel variants in these files will be ignored.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
Type: List[RodBinding[VariantContext]]. Default: [].
--LODThresholdForCleaning / -LOD
LOD threshold above which the cleaner will clean
This term is equivalent to "significance", i.e. is the improvement significant enough to merit realignment? Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.
Type: double. Default: 5.0. Range: -∞ to ∞.
--maxConsensuses
Max alternate consensuses to try (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
Type: int. Default: 30. Range: -∞ to ∞.
--maxIsizeForMovement / -maxIsize
Maximum insert size of read pairs that we attempt to realign
For expert users only!
Type: int. Default: 3000. Range: -∞ to ∞.
--maxPositionalMoveAllowed / -maxPosMove
Maximum positional move in base pairs that a read can be adjusted during realignment
For expert users only!
Type: int. Default: 200. Range: -∞ to ∞.
--maxReadsForConsensuses / -greedy
Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)
For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
Type: int. Default: 120. Range: -∞ to ∞.
--maxReadsForRealignment / -maxReads
Max reads allowed at an interval for realignment
For expert users only! If this value is exceeded at a given interval, realignment is not attempted and the reads are passed to the output file(s) as-is. If you need to allow more reads (e.g. with very deep coverage) regardless of memory, use a higher number.
Type: int. Default: 20000. Range: -∞ to ∞.
--maxReadsInMemory / -maxInMemory
Max reads allowed to be kept in memory at a time by the SAMFileWriter
For expert users only! To minimize memory consumption you can lower this number, but then the tool may skip realignment on regions with too much coverage, and if the number is too low, it may generate errors during realignment. Just make sure to give Java enough memory! 4 GB should be enough with the default value.
Type: int. Default: 150000. Range: -∞ to ∞.
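The memory advice above can be applied through the JVM's standard maximum-heap option, -Xmx. The following is a minimal sketch that assumes the placeholder file names from the usage example and passes the documented default for -maxInMemory explicitly. The command is printed rather than executed.

```shell
# Documented default for --maxReadsInMemory; lower it to reduce memory use.
MAX_IN_MEMORY=150000

# -Xmx4g gives the JVM a 4 GB maximum heap, matching the guidance above.
cmd="java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R reference.fasta -I input.bam \
-targetIntervals intervalListFromRTC.intervals \
-maxInMemory $MAX_IN_MEMORY \
-o realignedBam.bam"
echo "$cmd"
```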
--noOriginalAlignmentTags / -noTags
Don't output the original CIGAR or alignment start tags for each realigned read in the output BAM
Type: boolean. Default: false.
--nWayOut
Generate one output file for each input (-I) BAM file (not compatible with -o)
Reads from all input files will be realigned together, but then each read will be saved in the output file corresponding to the input file that the read came from. There are two ways to generate output BAM file names:
1) if the value of this argument is a general string (e.g. '.cleaned.bam'), then extensions (".bam" or ".sam") will be stripped from the input file names and the provided string value will be pasted on instead;
2) if the value ends with '.map' (e.g. input_output.map), then a two-column tab-separated file with the specified name must exist and list a unique output file name (2nd column) for each input file name (1st column).
Note that some GATK arguments do NOT work in conjunction with nWayOut (e.g. --disable_bam_indexing).
Type: String. Default: NA.
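The two naming modes can be illustrated with a pair of shell helpers. This is only a sketch of the rules as described above, not GATK's own code: `nway_output_name` and `write_nway_map` are hypothetical names, the `.realigned.bam` suffix is an arbitrary choice, and whether GATK matches map entries by bare file name or by full path is not stated here, so bare file names are an assumption.

```shell
# Mode 1 (suffix): strip a trailing ".bam" or ".sam" from the input
# file name, then append the string given to --nWayOut.
nway_output_name() {
    input="$1"
    suffix="$2"
    case "$input" in
        *.bam) base="${input%.bam}" ;;
        *.sam) base="${input%.sam}" ;;
        *)     base="$input" ;;
    esac
    printf '%s%s\n' "$base" "$suffix"
}

# Mode 2 (map file): write one tab-separated line per input,
# input file name in column 1 and a unique output name in column 2.
write_nway_map() {
    map_file="$1"
    shift
    : > "$map_file"
    for bam in "$@"; do
        printf '%s\t%s\n' "$bam" \
            "$(nway_output_name "$bam" .realigned.bam)" >> "$map_file"
    done
}

nway_output_name sample1.bam .cleaned.bam   # prints sample1.cleaned.bam
```

For the second mode, a call such as `write_nway_map input_output.map sample1.bam sample2.bam` would produce a file that could then be passed as `--nWayOut input_output.map` in place of `-o`.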
--out / -o
Output BAM
The realigned BAM file.
Type: GATKSAMFileWriter. Default: NA.
--targetIntervals
Intervals file output from RealignerTargetCreator
The interval list output from the RealignerTargetCreator tool using the same BAM(s), reference, and known indel file(s).
Required. Type: IntervalBinding[Feature]. Default: NA.