  • Showing docs for version 3.7-0


    VariantRecalibrator

    Build a recalibration model to score variant quality for filtering purposes

    Category Variant Discovery Tools

    Traversal LocusWalker

    PartitionBy NONE


    Overview

    The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. You can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call. The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (such as QD, MQ, and ReadPosRankSum, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array (in humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds of being a true variant versus being false under the trained Gaussian mixture model.
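
    The log-odds idea behind the VQSLOD can be made concrete with a toy calculation. The sketch below is not GATK's implementation: it assumes a single annotation, hand-picked mixture parameters, a base-10 logarithm, and no resource priors. All names and numbers in it are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a mixture given (weight, mean, sd) triples whose weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def log_odds(x, positive, negative):
    """log10 ratio of the 'true variant' density to the 'artifact' density."""
    return math.log10(mixture_pdf(x, positive) / mixture_pdf(x, negative))

# Hypothetical models over one annotation (values invented for illustration).
positive_model = [(0.7, 20.0, 4.0), (0.3, 12.0, 3.0)]
negative_model = [(1.0, 3.0, 2.0)]

good = log_odds(18.0, positive_model, negative_model)  # value typical of the positive model
bad = log_odds(2.0, positive_model, negative_model)    # value typical of the negative model
print(good > 0, bad < 0)
```

    A value that looks like the training data yields a positive score, and a value that looks like the artifact model yields a negative one. The real model works over several annotations jointly.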

    This tool performs the first pass in a two-stage process called VQSR; the second pass is performed by the ApplyRecalibration tool. In brief, the first pass consists of creating a Gaussian mixture model by looking at the distribution of annotation values over a high quality subset of the input call set, and then scoring all input variants according to the model. The second pass consists of filtering variants based on score cutoffs identified in the first pass.

    VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.

    Inputs

    Output

    Usage example

    Recalibrating SNPs in exome data:

     java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference.fasta \
       -input raw_variants.vcf \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
       -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
       -mode SNP \
       -recalFile output.recal \
       -tranchesFile output.tranches \
       -rscriptFile output.plots.R
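
    Long commands like the one above are easy to break with a missing line continuation. As a hedged alternative, the following sketch assembles the same argument list programmatically. The helper names resource_arg and build_command are hypothetical, the file names are the placeholders from the example, and nothing is executed.

```python
import shlex

def resource_arg(name, path, known, training, truth, prior):
    """Return the two tokens of one -resource argument in the tag style shown above."""
    def flag(value):
        return "true" if value else "false"
    tag = "-resource:{},known={},training={},truth={},prior={}".format(
        name, flag(known), flag(training), flag(truth), prior)
    return [tag, path]

def build_command(resources, annotations, mode="SNP"):
    """Assemble the VariantRecalibrator command as a token list (placeholder file names)."""
    cmd = ["java", "-Xmx4g", "-jar", "GenomeAnalysisTK.jar",
           "-T", "VariantRecalibrator",
           "-R", "reference.fasta",
           "-input", "raw_variants.vcf"]
    for res in resources:
        cmd += resource_arg(*res)
    for name in annotations:
        cmd += ["-an", name]
    cmd += ["-mode", mode,
            "-recalFile", "output.recal",
            "-tranchesFile", "output.tranches",
            "-rscriptFile", "output.plots.R"]
    return cmd

resources = [
    ("hapmap", "hapmap_3.3.b37.sites.vcf", False, True, True, 15.0),
    ("omni", "1000G_omni2.5.b37.sites.vcf", False, True, False, 12.0),
    ("1000G", "1000G_phase1.snps.high_confidence.vcf", False, True, False, 10.0),
    ("dbsnp", "dbsnp_135.b37.vcf", True, False, False, 2.0),
]
annotations = ["QD", "MQ", "MQRankSum", "ReadPosRankSum", "FS", "SOR", "InbreedingCoeff"]
command = build_command(resources, annotations)

# Print a single-line, shell-safe rendering for inspection.
print(" ".join(shlex.quote(token) for token in command))
```

    The token list can be passed to a process launcher as-is, which avoids shell line continuations altogether.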
     

    Allele-specific usage

     java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference.fasta \
       -input raw_variants.withASannotations.vcf \
       -AS \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
       -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
       -mode SNP \
       -recalFile output.AS.recal \
       -tranchesFile output.AS.tranches \
       -rscriptFile output.plots.AS.R
     
    The input VCF must have been produced using allele-specific annotations in HaplotypeCaller. Note that each allele will have a separate line in the output .recal file with its own VQSLOD and culprit that will be transferred to the final VCF in ApplyRecalibration.

    Caveats


    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by VariantRecalibrator.

    Parallelism options

    This tool can be run in multi-threaded mode using this option.


    Command-line Arguments

    Engine arguments

    All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

    VariantRecalibrator specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s) | Default value | Summary

    Required Inputs
    --input | NA | One or more VCFs of raw input variants to be recalibrated
    --resource | [] | A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)

    Required Outputs
    --recal_file, -recalFile | NA | The output recal file used by ApplyRecalibration
    --tranches_file, -tranchesFile | NA | The output tranches file used by ApplyRecalibration

    Required Parameters
    --mode | SNP | Recalibration mode to employ
    --use_annotation, -an | [] | The names of the annotations which should be used for calculations

    Optional Inputs
    --aggregate | NA | Additional raw input variants to be used in building the model

    Optional Outputs
    --model_file, -modelFile | stdout | A GATKReport containing the positive and negative model fits
    --rscript_file, -rscriptFile | NA | The output rscript file generated by the VQSR to aid in visualization of the input data and learned model

    Optional Parameters
    --ignore_filter, -ignoreFilter | [] | If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file
    --target_titv, -titv | 2.15 | The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES!
    --TStranche, -tranche | [100.0, 99.9, 99.0, 90.0] | The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1 percent)

    Optional Flags
    --ignore_all_filters, -ignoreAllFilters | false | If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
    --output_model, -outputModel | false | If specified, the variant recalibrator will output the VQSR model fit to the file specified by -modelFile or to stdout
    --useAlleleSpecificAnnotations, -AS | false | If specified, the variant recalibrator will attempt to use the allele-specific versions of the specified annotations.

    Advanced Parameters
    --badLodCutoff | -5.0 | LOD score cutoff for selecting bad variants
    --dirichlet | 0.001 | The dirichlet parameter in the variational Bayes algorithm.
    --max_attempts | 1 | Number of attempts to build a model before failing
    --maxGaussians, -mG | 8 | Max number of Gaussians for the positive model
    --maxIterations, -mI | 150 | Maximum number of VBEM iterations
    --maxNegativeGaussians, -mNG | 2 | Max number of Gaussians for the negative model
    --maxNumTrainingData | 2500000 | Maximum number of training data
    --minNumBadVariants, -minNumBad | 1000 | Minimum number of bad variants
    --MQCapForLogitJitterTransform, -MQCap | 0 | Apply logit transform and jitter to MQ values
    --numKMeans, -nKM | 100 | Number of k-means iterations
    --priorCounts | 20.0 | The number of prior counts to use in the variational Bayes algorithm.
    --shrinkage | 1.0 | The shrinkage parameter in the variational Bayes algorithm.
    --stdThreshold, -std | 10.0 | Annotation value divergence threshold (number of standard deviations from the means)

    Advanced Flags
    --trustAllPolymorphic, -allPoly | false | Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


    --aggregate / -aggregate

    Additional raw input variants to be used in building the model
    These additional calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

    List[RodBinding[VariantContext]]  NA


    --badLodCutoff / -badLodCutoff

    LOD score cutoff for selecting bad variants
    Variants scoring lower than this threshold will be used to build the Gaussian model of bad variants.

    double  -5.0  [ [ -∞  ∞ ] ]


    --dirichlet / -dirichlet

    The dirichlet parameter in the variational Bayes algorithm.

    double  0.001  [ [ -∞  ∞ ] ]


    --ignore_all_filters / -ignoreAllFilters

    If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.

    boolean  false


    --ignore_filter / -ignoreFilter

    If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file
    For this to work properly, the -ignoreFilter argument should also be applied to the ApplyRecalibration command.

    List[String]  []


    --input / -input

    One or more VCFs of raw input variants to be recalibrated
    These variant calls must be annotated with the annotations that will be used for modeling. If the calls come from multiple samples, they must have been obtained by joint calling the samples, either directly (running HaplotypeCaller on all samples together) or via the GVCF workflow (HaplotypeCaller with -ERC GVCF per-sample then GenotypeGVCFs on the resulting gVCFs) which is more scalable. Note that the ability to pass multiple input files is only intended to facilitate scatter-gather parallelism (to enable e.g. running on VCFs generated per-chromosome), not to combine different callsets. The variant calls in the separate input files should not overlap.

    R List[RodBindingCollection[VariantContext]]  NA
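
    Because the separate input files should not overlap, a quick pre-flight check can be useful. The sketch below is a simplified illustration, not part of GATK: it assumes each callset has already been reduced to (chromosome, position) pairs and ignores alleles. The function name overlapping_sites is hypothetical.

```python
def overlapping_sites(*callsets):
    """Return the (chrom, pos) sites that occur in more than one of the given callsets."""
    seen, duplicates = set(), set()
    for sites in callsets:
        sites = set(sites)
        duplicates |= seen & sites  # sites already present in an earlier callset
        seen |= sites
    return duplicates

# Hypothetical per-chromosome shards.
chr1_sites = [("1", 100), ("1", 200)]
chr2_sites = [("2", 100)]
stray_sites = [("1", 200), ("3", 5)]

print(overlapping_sites(chr1_sites, chr2_sites))               # no shared sites
print(overlapping_sites(chr1_sites, chr2_sites, stray_sites))  # reports the shared site
```

    An empty result is what a clean per-chromosome scatter should produce.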


    --max_attempts / -max_attempts

    Number of attempts to build a model before failing
    The statistical model being built by this tool may fail due to simple statistical sampling issues. Rather than dying immediately when the initial model fails, this argument allows the tool to restart with a different random seed and try to build the model again. The first successfully built model will be kept. Note that the most common underlying cause of model building failure is that there is insufficient data to build a really robust model. This argument provides a workaround for that issue but it is preferable to provide this tool with more data (typically by including more samples or more territory) in order to generate a more robust model.

    int  1  [ [ -∞  ∞ ] ]
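
    The restart behaviour described above follows a common retry pattern. The sketch below illustrates that pattern only. The build_model callback, the use of ValueError to signal a failed fit, and the seed scheme are all assumptions rather than GATK internals.

```python
import random

def build_with_retries(build_model, max_attempts=1, base_seed=0):
    """Call build_model with a fresh random generator per attempt; keep the first success."""
    last_error = None
    for attempt in range(max_attempts):
        rng = random.Random(base_seed + attempt)  # a different seed on every attempt
        try:
            return build_model(rng)
        except ValueError as err:  # assumed failure signal for this sketch
            last_error = err
    raise RuntimeError("no model built after {} attempt(s)".format(max_attempts)) from last_error

# Hypothetical builder that fails twice before succeeding.
calls = {"n": 0}

def flaky_builder(rng):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("model failed to converge")
    return "model built on attempt {}".format(calls["n"])

result = build_with_retries(flaky_builder, max_attempts=3)
print(result)
```

    As the text notes, retrying only works around the symptom. Supplying more data is the more robust fix.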


    --maxGaussians / -mG

    Max number of Gaussians for the positive model
    This parameter determines the maximum number of Gaussians that should be used when building a positive model using the variational Bayes algorithm.

    int  8  [ [ -∞  ∞ ] ]


    --maxIterations / -mI

    Maximum number of VBEM iterations
    This parameter determines the maximum number of VBEM iterations to be performed in the variational Bayes algorithm. The procedure will normally end when convergence is detected.

    int  150  [ [ -∞  ∞ ] ]


    --maxNegativeGaussians / -mNG

    Max number of Gaussians for the negative model
    This parameter determines the maximum number of Gaussians that should be used when building a negative model using the variational Bayes algorithm. The actual maximum used is the smaller of the -mG and -mNG values, meaning that if -mG is smaller than -mNG, -mG will be used for both. Note that this number should be small (e.g. 4) to achieve the best results.

    int  2  [ [ -∞  ∞ ] ]


    --maxNumTrainingData / -maxNumTrainingData

    Maximum number of training data
    The number of variants to use in building the Gaussian mixture model. Training sets larger than this will be randomly downsampled.

    int  2500000  [ [ -∞  ∞ ] ]
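
    Random downsampling of an oversized training set can be sketched as follows. This is an illustration under stated assumptions: the function name downsample and the fixed seed are hypothetical, and GATK's own sampling procedure may differ.

```python
import random

def downsample(training, max_num, seed=0):
    """Return at most max_num items, chosen uniformly at random without replacement."""
    if len(training) <= max_num:
        return list(training)  # small enough, keep everything
    return random.Random(seed).sample(training, max_num)

subset = downsample(list(range(10)), 4)
unchanged = downsample(list(range(3)), 4)
print(len(subset), unchanged)
```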


    --minNumBadVariants / -minNumBad

    Minimum number of bad variants
    This parameter determines the minimum number of variants that will be selected from the list of worst scoring variants to use for building the Gaussian mixture model of bad variants.

    int  1000  [ [ -∞  ∞ ] ]


    --mode / -mode

    Recalibration mode to employ
    Use either SNP for recalibrating only SNPs (emitting indels untouched in the output VCF) or INDEL for indels (emitting SNPs untouched in the output VCF). There is also a BOTH option for recalibrating both SNPs and indels simultaneously, but this is meant for testing purposes only and should not be used in actual analyses.

    The --mode argument is an enumerated type (Mode), which can have one of the following values:

    SNP
    INDEL
    BOTH

    R Mode  SNP


    --model_file / -modelFile

    A GATKReport containing the positive and negative model fits

    PrintStream  stdout


    --MQCapForLogitJitterTransform / -MQCap

    Apply logit transform and jitter to MQ values
    MQ is capped at a "max" value (60 for bwa-mem) when the alignment is considered perfect. Typically, a huge proportion of the reads in a dataset are perfectly mapped, which yields a distribution of MQ values with a blob below the max value and a huge peak at the max value. This does not conform to the expectations of the Gaussian mixture model of VQSR and has been observed to yield a ROC curve with a jump. This argument aims to mitigate this problem. Using MQCap = X has two effects: (1) MQs are transformed by a scaled logit on [0,X] (plus an epsilon to avoid division by zero) to make the blob more Gaussian-like, and (2) the transformed MQ=X values are jittered to break the peak into a narrow Gaussian. Beware that IndelRealigner, if used, adds 10 to MQ for successfully realigned indels. We recommend either using the ReassignOriginalMQAfterIndelRealignment read filter (-rf) with HaplotypeCaller or setting MQCap=max+10 to take that into account. If this option is not used, or if MQCap is set to 0, MQ will not be transformed.

    int  0  [ [ -∞  ∞ ] ]
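
    The two effects of MQCap can be sketched in a few lines. This is not GATK's code: the epsilon of 1.0, the jitter standard deviation of 0.05, and the function names are assumptions chosen only to show the shape of the transform.

```python
import math
import random

def scaled_logit(mq, cap, epsilon=1.0):
    """Logit of mq on [0, cap], widened by epsilon so that 0 and cap stay finite."""
    lo, hi = -epsilon, cap + epsilon
    return math.log((mq - lo) / (hi - mq))

def transform_mq(mq, cap, jitter_sd=0.05, rng=random):
    """Apply the scaled logit, then add Gaussian jitter only to values sitting at the cap."""
    value = scaled_logit(mq, cap)
    if mq == cap:
        value += rng.gauss(0.0, jitter_sd)
    return value

rng = random.Random(0)
at_cap = [transform_mq(60, 60, rng=rng) for _ in range(3)]  # spread into a narrow peak
below_cap = transform_mq(40, 60, rng=rng)                   # transformed, not jittered
print(at_cap, below_cap)
```

    In this sketch an MQ far enough above the cap makes the logarithm undefined, which illustrates why the cap needs to account for the +10 that IndelRealigner can add.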


    --numKMeans / -nKM

    Number of k-means iterations
    This parameter determines the number of k-means iterations to perform in order to initialize the means of the Gaussians in the Gaussian mixture model.

    int  100  [ [ -∞  ∞ ] ]


    --output_model / -outputModel

    If specified, the variant recalibrator will output the VQSR model fit to the file specified by -modelFile or to stdout
    This GATKReport gives information to describe the VQSR model fit. Normalized means for the positive model are concatenated as one table and negative model normalized means as another table. Covariances are also concatenated for positive and negative models, respectively. Tables of annotation means and standard deviations are provided to help describe the normalization. The model fit report can be read in with our R gsalib package. Individual model Gaussians can be subset by the value in the "Gaussian" column if desired.

    boolean  false


    --priorCounts / -priorCounts

    The number of prior counts to use in the variational Bayes algorithm.

    double  20.0  [ [ -∞  ∞ ] ]


    --recal_file / -recalFile

    The output recal file used by ApplyRecalibration

    R VariantContextWriter  NA


    --resource / -resource

    A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
    Any set of VCF files to use as lists of training, truth, or known sites.
      • Training - The program builds the Gaussian mixture model using input variants that overlap with these training sites.
      • Truth - The program uses these truth sites to determine where to set the cutoff in VQSLOD sensitivity.
      • Known - The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel). They are not used in any calculations by the algorithm itself.
      • Bad - A database of known bad variants can be used to supplement the set of worst ranked variants (compared to the Gaussian mixture model) that the program selects from the data to model "bad" variants.

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

    R List[RodBinding[VariantContext]]  []


    --rscript_file / -rscriptFile

    The output rscript file generated by the VQSR to aid in visualization of the input data and learned model

    File  NA


    --shrinkage / -shrinkage

    The shrinkage parameter in the variational Bayes algorithm.

    double  1.0  [ [ -∞  ∞ ] ]


    --stdThreshold / -std

    Annotation value divergence threshold (number of standard deviations from the means)
    If a variant has annotations more than -std standard deviations away from mean, it won't be used for building the Gaussian mixture model.

    double  10.0  [ [ -∞  ∞ ] ]
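
    The exclusion rule above amounts to a per-annotation z-score test. The sketch below assumes each variant is a tuple of annotation values and that means and standard deviations are computed over the same rows. The function name and data are hypothetical, and GATK's normalization details may differ.

```python
import statistics

def within_std_threshold(rows, std_threshold=10.0):
    """Keep rows whose every annotation lies within std_threshold SDs of that annotation's mean."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    kept = []
    for row in rows:
        # A constant annotation (sd == 0) cannot diverge, so it never excludes a row.
        if all(sd == 0 or abs(v - m) / sd <= std_threshold
               for v, m, sd in zip(row, means, sds)):
            kept.append(row)
    return kept

# Ten ordinary rows plus one extreme value in the second annotation.
rows = [(1.0, 5.0)] * 10 + [(1.0, 500.0)]
print(len(within_std_threshold(rows, std_threshold=3.0)))  # the outlier is dropped
print(len(within_std_threshold(rows)))                     # the default of 10.0 keeps it
```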


    --target_titv / -titv

    The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES!
    The expected transition / transversion ratio of true novel variants in your targeted region (whole genome, exome, specific genes), which varies greatly by the CpG and GC content of the region. See the expected Ti/Tv ratios section of the GATK best practices documentation (http://www.broadinstitute.org/gatk/guide/best-practices) for more information. Normal values are 2.15 for human whole genomes and 3.2 for human whole exomes. Note that this parameter is used for display purposes only and isn't used anywhere in the algorithm!

    double  2.15  [ [ -∞  ∞ ] ]
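
    For readers unfamiliar with the quantity, the Ti/Tv ratio of a SNP set can be computed as in this minimal sketch. It assumes biallelic SNPs given as (ref, alt) pairs of distinct uppercase bases. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps):
    """Ratio of transitions to transversions among (ref, alt) SNP pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")

snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(titv_ratio(snps))  # 3 transitions / 2 transversions = 1.5
```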


    --tranches_file / -tranchesFile

    The output tranches file used by ApplyRecalibration

    R File  NA


    --trustAllPolymorphic / -allPoly

    Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

    Boolean  false


    --TStranche / -tranche

    The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1 percent)
    Add truth sensitivity slices through the call set at the given values. The default values are 100.0, 99.9, 99.0, and 90.0 which will result in 4 estimated tranches in the final call set: the full set of calls (100% sensitivity at the accessible sites in the truth set), a 99.9% truth sensitivity tranche, along with progressively smaller tranches at 99% and 90%.

    List[Double]  [100.0, 99.9, 99.0, 90.0]
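
    One way to picture a truth sensitivity slice is as the score cutoff that still retains the requested percentage of truth sites. The sketch below shows that idea only. The scores are invented, the function name is hypothetical, and the tool's actual tranche computation may differ in its details.

```python
import math

def tranche_cutoff(truth_scores, sensitivity_percent):
    """Lowest score that must be accepted to retain the requested percentage of truth sites."""
    ranked = sorted(truth_scores, reverse=True)
    needed = math.ceil(len(ranked) * sensitivity_percent / 100.0)
    needed = max(1, min(needed, len(ranked)))  # clamp to a valid count
    return ranked[needed - 1]

# Hypothetical scores of the truth sites found in a call set.
truth_scores = [9.0, 7.5, 6.0, 4.0, 2.5, 1.0, 0.5, -1.0, -3.0, -8.0]

for level in (100.0, 99.9, 99.0, 90.0):
    print(level, tranche_cutoff(truth_scores, level))
```

    With only ten truth sites, the 99.9 and 99.0 levels collapse onto the 100.0 cutoff, whereas the 90.0 level drops the single lowest-scoring truth site.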


    --use_annotation / -an

    The names of the annotations which should be used for calculations
    See the input VCF file's INFO field for a list of all available annotations.

    R List[String]  []


    --useAlleleSpecificAnnotations / -AS

    If specified, the variant recalibrator will attempt to use the allele-specific versions of the specified annotations.
    Generate a VQSR model using per-allele data instead of the default per-site data, assuming that the input VCF contains allele-specific annotations. Annotations should be specified using their full names with AS_ prefix. Non-allele-specific (scalar) annotations will be applied to all alleles.

    boolean  false