    Showing docs for version 3.7-0


    BaseRecalibrator

    Detect systematic errors in base quality scores

    Category: Sequence Data Processing Tools

    Traversal: ReadWalker

    PartitionBy: READ


    Overview

    Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately, the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of our variant calls.

    The base recalibration process involves two key steps: first, the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if none is available for your organism); then it adjusts the base quality scores in the data based on the model. There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

    This tool performs the first step described above: it builds the model of covariation and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see at those sites are therefore errors and indicative of poor base quality. This tool generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at a site, where p(error) = num mismatches / num observations. The output file is a table listing the covariate values, the number of observations, the number of mismatches, and the empirical quality score.

    Inputs

    A BAM file containing data that needs to be recalibrated.

    A database of known polymorphic sites to mask out.

    Output

    A GATKReport file containing several tables.

    The GATKReport table format is intended to be easy to read by both humans and computer languages (especially R). Check out the documentation of the GATKReport (in the FAQs) to learn how to manipulate this table.
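As a rough illustration of reading such a report from a script, the sketch below pulls one named table out of a report with awk. The sample text is hand-written for this example, its columns are simplified, and the header layout it relies on (two `#:GATKTable:` lines per table, the second carrying the table name, with a blank line ending each table) is an assumption about the format rather than a specification. The `extract_table` helper is hypothetical.

```shell
# Illustrative input only: a hand-written stand-in for a real report.
report=$(cat <<'EOF'
#:GATKReport.v1.1:2
#:GATKTable:2:1:%s:%s:;
#:GATKTable:Arguments:Example argument table
Argument  Value
covariate null

#:GATKTable:3:2:%s:%d:%d:;
#:GATKTable:RecalTable0:Example read group table
ReadGroup  Observations  Errors
rg1        1000000       1000
rg2        500000        250
EOF
)

# Print the column header and data rows of the table whose name is $1.
# Reads the report on stdin; a blank line ends the current table.
extract_table() {
  awk -v name="$1" '
    /^#:GATKTable:/ { split($0, f, ":"); in_table = (f[3] == name); next }
    /^$/            { in_table = 0; next }
    in_table        { print }
  '
}

rows=$(printf '%s\n' "$report" | extract_table RecalTable0)
printf '%s\n' "$rows"
```

Keeping the extraction in a small function makes it easy to pipe the selected rows into further awk or sort steps.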

    Usage example

     java -jar GenomeAnalysisTK.jar \
       -T BaseRecalibrator \
       -R reference.fasta \
       -I my_reads.bam \
       -knownSites latest_dbsnp.vcf \
       -o recal_data.table
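
The remaining steps described in the Overview (the optional second pass, the before/after plots, and applying the recalibration) might look like the sketch below. It assumes the GATK 3.x tools AnalyzeCovariates and PrintReads and the engine's -BQSR argument. All file names are placeholders, and the leading "echo" makes this a dry run that only prints the commands.

```shell
# Dry run: the leading "echo" prints each command; remove it to execute.
GATK="echo java -jar GenomeAnalysisTK.jar"

# Optional second pass: model the error that remains after recalibration.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  -BQSR recal_data.table \
  -o post_recal_data.table

# Before/after plots for quality control.
$GATK -T AnalyzeCovariates \
  -R reference.fasta \
  -before recal_data.table \
  -after post_recal_data.table \
  -plots recalibration_plots.pdf

# Apply the first-pass table to the reads.
$GATK -T PrintReads \
  -R reference.fasta \
  -I my_reads.bam \
  -BQSR recal_data.table \
  -o recal_reads.bam
```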
     

    Notes


    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator.

    Parallelism options

    This tool can be run in multi-threaded mode using the -nct option.

    Downsampling settings

    This tool does not apply any downsampling by default.


    Command-line Arguments

    Engine arguments

    All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
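
A minimal sketch of combining these engine arguments with BaseRecalibrator follows. It assumes the reference has a contig named "20" and that read filters are selected by their name without the "Filter" suffix (here BadMate, for BadMateFilter). File names are placeholders, and the leading "echo" makes this a dry run.

```shell
# Dry run: the leading "echo" prints the command; remove it to execute.
GATK="echo java -jar GenomeAnalysisTK.jar"

# Restrict processing to one interval and add an extra read filter.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  -L 20 \
  -rf BadMate \
  -o recal_data.chr20.table
```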

    BaseRecalibrator specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s)                                 Default value    Summary

    Required Outputs
    --out, -o                                        NA               The output recalibration table file to create

    Optional Inputs
    --knownSites                                     []               A database of known polymorphic sites

    Optional Parameters
    --covariate, -cov                                NA               One or more covariates to be used in the recalibration. Can be specified multiple times
    --indels_context_size, -ics                      3                Size of the k-mer context to be used for base insertions and deletions
    --maximum_cycle_value, -maxCycle                 500              The maximum cycle value permitted for the Cycle covariate
    --mismatches_context_size, -mcs                  2                Size of the k-mer context to be used for base mismatches
    --solid_nocall_strategy                          THROW_EXCEPTION  Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
    --solid_recal_mode, -sMode                       SET_Q_ZERO       How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS

    Optional Flags
    --list, -ls                                      false            List the available covariates and exit
    --lowMemoryMode                                  false            Reduce memory usage in multi-threaded code at the expense of threading efficiency
    --no_standard_covs, -noStandard                  false            Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
    --sort_by_all_columns, -sortAllCols              false            Sort the rows in the tables of reports

    Advanced Parameters
    --binary_tag_name, -bintag                       NA               The binary tag covariate name if using it
    --bqsrBAQGapOpenPenalty, -bqsrBAQGOP             40.0             BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets
    --deletions_default_quality, -ddq                45               Default quality for the base deletions covariate
    --insertions_default_quality, -idq               45               Default quality for the base insertions covariate
    --low_quality_tail, -lqt                         2                Minimum quality for the bases in the tail of the reads to be considered
    --mismatches_default_quality, -mdq               -1               Default quality for the base mismatches covariate
    --quantizing_levels, -ql                         16               Number of distinct quality scores in the quantized output

    Advanced Flags
    --run_without_dbsnp_potentially_ruining_quality  false            If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


    --binary_tag_name / -bintag

    the binary tag covariate name if using it
    The tag name for the binary tag covariate (if using it)

    String  NA


    --bqsrBAQGapOpenPenalty / -bqsrBAQGOP

    BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets

    double  40.0  [ [ -∞  ∞ ] ]


    --covariate / -cov

    One or more covariates to be used in the recalibration. Can be specified multiple times
    Note that the ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no_standard_covs is specified, the Cycle and Context covariates are standard and are included by default. Use the --list argument to see the available covariates.

    String[]  NA
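
A sketch of how the covariate arguments might be combined follows. The first command lists the available covariates (note that --list still needs an otherwise complete command line). The second keeps the two required covariates and adds only one optional covariate. The name ContextCovariate is an assumed spelling: check the output of --list for the exact names your version accepts. File names are placeholders and the leading "echo" makes this a dry run.

```shell
# Dry run: the leading "echo" prints each command; remove it to execute.
GATK="echo java -jar GenomeAnalysisTK.jar"

# List the available covariates and exit.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  -o recal_data.table \
  --list

# Drop the standard set, then add back a single covariate by name.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  --no_standard_covs \
  -cov ContextCovariate \
  -o recal_data.context_only.table
```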


    --deletions_default_quality / -ddq

    default quality for the base deletions covariate
    A default base quality to use as a prior (reported quality) in the deletion covariate model. This value replaces all base qualities in the read with this default value. A negative value turns it off. [default is on]

    byte  45  [ [ -∞  ∞ ] ]


    --indels_context_size / -ics

    Size of the k-mer context to be used for base insertions and deletions
    The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.

    int  3  [ [ -∞  ∞ ] ]
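
To see why larger context sizes cost more runtime and heap, note that a context of k bases over A, C, G and T can take 4^k distinct values, so the number of possible context bins grows exponentially with the context size. The sketch below only does that arithmetic; how the tool actually allocates memory per bin is not shown and is not claimed here. The `context_count` helper is hypothetical.

```shell
# Number of distinct k-mers over a 4-letter alphabet: 4^k.
context_count() {
  size=$1
  count=1
  while [ "$size" -gt 0 ]; do
    count=$((count * 4))
    size=$((size - 1))
  done
  echo "$count"
}

# Default mismatch size (2), default indel size (3), and the maximum (13).
for k in 2 3 13; do
  echo "context size $k: $(context_count "$k") possible contexts"
done
```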


    --insertions_default_quality / -idq

    default quality for the base insertions covariate
    A default base quality to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads without insertion quality scores for each base. [default is on]

    byte  45  [ [ -∞  ∞ ] ]


    --knownSites / -knownSites

    A database of known polymorphic sites
    This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3

    List[RodBinding[Feature]]  []
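
Because this argument takes a list, it can be given more than once, for example to mask both known SNPs and known indels. The sketch below assumes a second resource file named known_indels.vcf, which is a placeholder. The leading "echo" makes this a dry run.

```shell
# Dry run: the leading "echo" prints the command; remove it to execute.
GATK="echo java -jar GenomeAnalysisTK.jar"

# Mask sites from two known-sites resources.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  -knownSites known_indels.vcf \
  -o recal_data.table
```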


    --list / -ls

    List the available covariates and exit
    Note that the --list argument requires a fully resolved and correct command-line to work.

    boolean  false


    --low_quality_tail / -lqt

    minimum quality for the bases in the tail of the reads to be considered
    Reads with low quality bases on either tail (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail is considered low quality.

    byte  2  [ [ -∞  ∞ ] ]
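
One way to picture this threshold is the sketch below. Given a threshold and a list of per-base qualities, it reports the 1-based span of bases that remains once the leading and trailing runs of bases at or below the threshold are set aside. This is an interpretation of the description above, not the tool's actual implementation, and the `usable_span` helper is hypothetical.

```shell
# usable_span THRESHOLD Q1 Q2 ... prints "first-last" (1-based),
# or "0-0" when every base is at or below the threshold.
usable_span() {
  threshold=$1
  shift
  first=0
  last=0
  i=0
  for q in "$@"; do
    i=$((i + 1))
    if [ "$q" -gt "$threshold" ]; then
      if [ "$first" -eq 0 ]; then
        first=$i
      fi
      last=$i
    fi
  done
  echo "$first-$last"
}

# Two low-quality bases at the start and one at the end are tails;
# the low-quality base in the middle is not part of a tail.
usable_span 2  2 2 30 31 2 35 2
```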


    --lowMemoryMode / -lowMemoryMode

    Reduce memory usage in multi-threaded code at the expense of threading efficiency
    When you use nct > 1, BQSR uses nct times more memory to compute its recalibration tables, for efficiency purposes. If you have many covariates, and therefore are using a lot of memory, you can use this flag to safely access only one table. There may be some CPU cost, but as long as the table is really big the cost should be relatively reasonable.

    boolean  false
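
A multi-threaded run with this flag might look like the following sketch. The thread count of 4 is an arbitrary choice for illustration, file names are placeholders, and the leading "echo" makes this a dry run.

```shell
# Dry run: the leading "echo" prints the command; remove it to execute.
GATK="echo java -jar GenomeAnalysisTK.jar"

# Four CPU threads sharing a single recalibration table.
$GATK -T BaseRecalibrator \
  -R reference.fasta \
  -I my_reads.bam \
  -knownSites latest_dbsnp.vcf \
  -nct 4 \
  --lowMemoryMode \
  -o recal_data.table
```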


    --maximum_cycle_value / -maxCycle

    The maximum cycle value permitted for the Cycle covariate
    The cycle covariate will generate an error if it encounters a cycle greater than this value. This argument is ignored if the Cycle covariate is not used.

    int  500  [ [ -∞  ∞ ] ]


    --mismatches_context_size / -mcs

    Size of the k-mer context to be used for base mismatches
    The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.

    int  2  [ [ -∞  ∞ ] ]


    --mismatches_default_quality / -mdq

    default quality for the base mismatches covariate
    A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value replaces all base qualities in the read with this default value. A negative value turns it off. [default is off]

    byte  -1  [ [ -∞  ∞ ] ]


    --no_standard_covs / -noStandard

    Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
    The Cycle and Context covariates are standard and are included by default unless this argument is provided. Note that the ReadGroup and QualityScore covariates are required and cannot be excluded.

    boolean  false


    --out / -o

    The output recalibration table file to create
    After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data: the number of observations for this combination of covariates, the number of reference mismatches, and the raw empirical quality score calculated by phred-scaling the mismatch rate.

    R File  NA
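
The phred-scaling mentioned above can be sketched as Q = -10 * log10(mismatches / observations). The example below applies that raw formula to a couple of made-up records. The record layout and the `empirical_q` helper are hypothetical, and any smoothing or rounding the tool applies internally is not reproduced. The raw formula is undefined when a record has zero mismatches.

```shell
# Raw empirical quality: phred-scale the mismatch rate.
# $1 = number of observations, $2 = number of mismatches (must be > 0).
empirical_q() {
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", -10 * log(m / n) / log(10) }'
}

# Made-up records: read group, reported quality, observations, mismatches.
while read -r rg qual obs mism; do
  echo "$rg reported=Q$qual empirical=Q$(empirical_q "$obs" "$mism")"
done <<'EOF'
rg1 30 1000000 1000
rg1 25 200000 2000
EOF
```

In the second record the reported quality (Q25) is higher than what the mismatch rate supports, which is the kind of discrepancy recalibration is meant to correct.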


    --quantizing_levels / -ql

    number of distinct quality scores in the quantized output
    BQSR generates a quantization table for quick quantization later by subsequent tools. BQSR does not quantize the base qualities itself; this is done by the engine with the -qq or -BQSR options. This parameter tells BQSR the number of levels of quantization to use to build the quantization table.

    int  16  [ [ -∞  ∞ ] ]


    --run_without_dbsnp_potentially_ruining_quality / -run_without_dbsnp_potentially_ruining_quality

    If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
    This calculation is critically dependent on being able to skip over known polymorphic sites. Please be sure that you know what you are doing if you use this option.

    boolean  false


    --solid_nocall_strategy / -solid_nocall_strategy

    Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
    BaseRecalibrator accepts a --solid_nocall_strategy flag which governs how the recalibrator handles no calls in the color space tag. Unfortunately, because of the reference-inserted bases (see --solid_recal_mode), reads with no calls in their color space tag cannot be recalibrated.

    The --solid_nocall_strategy argument is an enumerated type (SOLID_NOCALL_STRATEGY), which can have one of the following values:

    THROW_EXCEPTION
    When a no call is detected throw an exception to alert the user that recalibrating this SOLiD data is unsafe. This is the default option.
    LEAVE_READ_UNRECALIBRATED
    Leave the read in the output bam completely untouched. This mode is only okay if the no calls are very rare.
    PURGE_READ
    Mark these reads as failing vendor quality checks so they can be filtered out by downstream analyses.

    SOLID_NOCALL_STRATEGY  THROW_EXCEPTION


    --solid_recal_mode / -sMode

    How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
    BaseRecalibrator accepts a --solid_recal_mode flag which governs how the recalibrator handles the reads which have had the reference inserted because of color space inconsistencies.

    The --solid_recal_mode argument is an enumerated type (SOLID_RECAL_MODE), which can have one of the following values:

    DO_NOTHING
    Treat reference inserted bases as reference matching bases. Very unsafe!
    SET_Q_ZERO
    Set reference inserted bases and the previous base (because of color space alignment details) to Q0. This is the default option.
    SET_Q_ZERO_BASE_N
    In addition to setting the quality scores to zero, also set the base itself to 'N'. This is useful to visualize in IGV.
    REMOVE_REF_BIAS
    Look at the color quality scores and probabilistically decide to change the reference inserted base to be the base which is implied by the original color space instead of the reference.

    SOLID_RECAL_MODE  SET_Q_ZERO


    --sort_by_all_columns / -sortAllCols

    Sort the rows in the tables of reports
    Whether GATK report tables should have rows in sorted order, starting from leftmost column

    Boolean  false