Apply a score cutoff to filter variants based on a recalibration table
This tool performs the second pass in a two-stage process called VQSR; the first pass is performed by the VariantRecalibrator tool. In brief, the first pass builds a Gaussian mixture model from the distribution of annotation values over a high-quality subset of the input call set, and then scores all input variants according to the model. The second pass filters variants based on score cutoffs identified in the first pass.
Using the tranche file and recalibration table generated by the previous step, the ApplyRecalibration tool looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their FILTER field annotated with the corresponding tranche level. This will result in a call set that is filtered to the desired level but retains the information necessary to increase sensitivity if needed.
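The tranche lookup described above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: it assumes each tranche is reduced to a pair of (target truth sensitivity, minimum VQSLOD), the numeric values are made up, and the function name `assign_filter` and the `Tranche...to...` label format are hypothetical.

```python
from typing import List, Optional, Tuple

# (target truth sensitivity in percent, minimum VQSLOD needed to reach that tranche).
# Simplified stand-in for the rows of a tranches file; values are invented.
Tranche = Tuple[float, float]


def assign_filter(vqslod: float, tranches: List[Tranche], ts_filter_level: float) -> str:
    """Return "PASS" or a hypothetical tranche label for one variant."""
    if not tranches:
        raise ValueError("at least one tranche is required")
    lower: Optional[float] = None
    # Walk from the most stringent tranche (lowest sensitivity, highest
    # VQSLOD cutoff) to the most lenient one.
    for sensitivity, min_vqslod in sorted(tranches):
        if vqslod >= min_vqslod:
            if sensitivity <= ts_filter_level:
                return "PASS"
            start = 0.0 if lower is None else lower
            return f"Tranche{start:.2f}to{sensitivity:.2f}"
        lower = sensitivity
    # Below every cutoff: the variant sits past the last tranche.
    return f"Tranche{lower:.2f}to100.00+"


example_tranches = [(90.0, 5.0), (99.0, 0.0), (99.9, -3.0)]
for score in (6.0, -1.0, -20.0):
    print(score, assign_filter(score, example_tranches, ts_filter_level=99.0))
```

Because the label records which tranche the variant fell into, a later pass can relax the cutoff (for example, accept everything up to 99.9) without rerunning the model.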
To be clear, please note that by "filtered", we mean that variants failing the requested tranche cutoff are marked as filtered in the output VCF; they are not discarded.
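Since filtered records stay in the output, a quick tally of the FILTER column shows how many variants passed and how many were marked. The sketch below is a minimal, assumption-laden helper (the name `count_filters` is hypothetical, the records are invented, and the tranche filter label is only illustrative); it relies solely on FILTER being the seventh tab-separated column of a VCF record.

```python
from collections import Counter
from typing import Iterable


def count_filters(vcf_lines: Iterable[str]) -> Counter:
    """Tally the FILTER column (7th field) over all VCF data lines."""
    counts: Counter = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Invented records; the non-PASS label is illustrative only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=4.2",
    "1\t200\t.\tC\tT\t30\tVQSRTrancheSNP99.00to99.90\tVQSLOD=-1.3",
    "1\t300\t.\tG\tA\t60\tPASS\tVQSLOD=7.9",
]
print(count_filters(example_vcf))
```

If the tool is run with --excludeFiltered, the filtered loci are not written to the output, so only passing records would appear in such a tally.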
VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.
    java -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R reference.fasta \
        -input raw_variants.vcf \
        --ts_filter_level 99.0 \
        -tranchesFile output.tranches \
        -recalFile output.recal \
        -mode SNP \
        -o path/to/output.recalibrated.filtered.vcf
    java -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R reference.fasta \
        -input raw_variants.withASannotations.vcf \
        -AS \
        --ts_filter_level 99.0 \
        -tranchesFile output.AS.tranches \
        -recalFile output.AS.recal \
        -mode SNP \
        -o path/to/output.recalibrated.ASfiltered.vcf

Each allele is annotated with its corresponding entry in the AS_FilterStatus INFO field. Allele-specific VQSLOD and culprit values are also carried through from VariantRecalibrator and stored in the AS_VQSLOD and AS_culprit INFO fields, respectively. The site-level filter is set to the most lenient of the allele filters: if any allele passes, the whole site is PASS; if no allele passes, the site-level filter is set to the lowest sensitivity tranche among all the alleles.

Note that the .tranches and .recal files should be derived from an allele-specific run of VariantRecalibrator. Also note that the AS_culprit, AS_FilterStatus, and AS_VQSLOD fields will have placeholder values (NA or NaN) for alleles of a type that has not yet been processed by ApplyRecalibration. The spanning deletion allele (*) will not be recalibrated because it represents missing data: its VQSLOD will remain NaN, and its culprit and FilterStatus will be NA.
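The site-level rule for allele-specific filtering can be illustrated with a short Python sketch. It is not the tool's code: the function name `site_level_filter`, the tranche labels, and the label-to-sensitivity mapping are all hypothetical, and returning "NA" when no allele has been scored is an assumption of this sketch rather than documented behaviour.

```python
from typing import Dict, List


def site_level_filter(allele_status: List[str], tranche_sensitivity: Dict[str, float]) -> str:
    """Combine per-allele filter statuses into one site-level filter.

    allele_status holds "PASS", "NA" (placeholder, e.g. the spanning
    deletion allele), or a tranche label found in tranche_sensitivity.
    """
    # Placeholder entries carry no score, so they do not take part.
    scored = [status for status in allele_status if status != "NA"]
    if not scored:
        return "NA"  # assumption of this sketch
    # Most lenient outcome wins: a single passing allele makes the site PASS.
    if "PASS" in scored:
        return "PASS"
    # Otherwise use the lowest-sensitivity tranche among the alleles.
    return min(scored, key=lambda status: tranche_sensitivity[status])


# Hypothetical labels mapped to the upper sensitivity bound of each tranche.
sensitivity = {"T99.0to99.9": 99.9, "T99.9to100": 100.0}
print(site_level_filter(["T99.9to100", "PASS"], sensitivity))
print(site_level_filter(["T99.9to100", "T99.0to99.9", "NA"], sensitivity))
```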
These Read Filters are automatically applied to the data by the Engine before processing by ApplyRecalibration.
This tool can be run in multi-threaded mode using this option.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Inputs | |
--input | NA | The raw input variants to be recalibrated
--recal_file, -recalFile | NA | The input recal file used by ApplyRecalibration
Optional Inputs | |
--tranches_file, -tranchesFile | NA | The input tranches file describing where to cut the data
Optional Outputs | |
--out, -o | stdout | The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value
Optional Parameters | |
--ignore_filter, -ignoreFilter | NA | If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file
--mode | SNP | Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
--ts_filter_level | NA | The truth sensitivity level at which to start filtering
Optional Flags | |
--excludeFiltered, -ef | false | Don't output filtered loci after applying the recalibration
--ignore_all_filters, -ignoreAllFilters | false | If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
--useAlleleSpecificAnnotations, -AS | false | If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.
Advanced Parameters | |
--lodCutoff | NA | The VQSLOD score below which to start filtering
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--excludeFiltered / -ef
Don't output filtered loci after applying the recalibration
boolean false

--ignore_all_filters / -ignoreAllFilters
If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
boolean false

--ignore_filter / -ignoreFilter
If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file
For this to work properly, the -ignoreFilter argument should also be applied to the VariantRecalibrator command.
String[] NA

--input
The raw input variants to be recalibrated
These calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
R List[RodBinding[VariantContext]] NA

--lodCutoff
The VQSLOD score below which to start filtering
Double NA

--mode
Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
The --mode argument is an enumerated type (Mode), which can have one of the following values: SNP, INDEL, BOTH.
Mode SNP

--out / -o
The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value
VariantContextWriter stdout

--recal_file / -recalFile
The input recal file used by ApplyRecalibration
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
R RodBinding[VariantContext] NA

--tranches_file / -tranchesFile
The input tranches file describing where to cut the data
File NA

--ts_filter_level
The truth sensitivity level at which to start filtering
Double NA

--useAlleleSpecificAnnotations / -AS
If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.
Filter the input file based on allele-specific recalibration data. See tool docs for site-level and allele-level filtering details.
Requires a .recal file produced using an allele-specific run of VariantRecalibrator.
boolean false