Detect systematic errors in base quality scores
Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately, the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism), then it adjusts the base quality scores in the data based on the model. There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

This tool performs the first step described above: it builds the model of covariation and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore errors and indicative of poor base quality. This tool generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).
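The empirical error probability described above can be illustrated with a minimal sketch. It Phred-scales the raw mismatch rate, Q = -10 * log10(p(error)); the function name is hypothetical, and the tool itself may apply smoothing or caps that are not modeled here.

```python
import math

def empirical_quality(num_mismatches: int, num_observations: int) -> float:
    """Phred-scale the raw rate p(error) = mismatches / observations."""
    if num_observations <= 0:
        raise ValueError("need at least one observation")
    if num_mismatches == 0:
        # No observed errors: the raw rate gives an unbounded quality.
        return float("inf")
    p_error = num_mismatches / num_observations
    return -10.0 * math.log10(p_error)

# 10 mismatches in 10,000 observations is an error rate of 0.001, about Q30.
print(round(empirical_quality(10, 10_000), 2))
```

A bin of bases that the machine reported as Q40 but that shows this mismatch rate would be a candidate for downward adjustment.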
A BAM file containing data that needs to be recalibrated.
A database of known polymorphic sites to mask out.
A GATKReport file with many tables:
The GATKReport table format is intended to be easy to read by both humans and computer languages (especially R). Check out the documentation of the GATKReport (in the FAQs) to learn how to manipulate this table.
    java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fasta \
        -I my_reads.bam \
        -knownSites latest_dbsnp.vcf \
        -o recal_data.table
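When the command above is launched from a script, it can help to assemble it as an argument list rather than a single string. The sketch below uses only the flags shown in the example command; the helper name and its parameters are hypothetical.

```python
import shlex

def build_base_recalibrator_command(reference, bam, known_sites, output,
                                    jar="GenomeAnalysisTK.jar"):
    """Return the BaseRecalibrator invocation as a list of arguments."""
    return [
        "java", "-jar", jar,
        "-T", "BaseRecalibrator",
        "-R", reference,
        "-I", bam,
        "-knownSites", known_sites,
        "-o", output,
    ]

cmd = build_base_recalibrator_command(
    "reference.fasta", "my_reads.bam", "latest_dbsnp.vcf", "recal_data.table")
# Print a shell-safe rendering of the command for logging.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` so that file names containing spaces are handled without manual quoting.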
These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator.
This tool can be run in multi-threaded mode using this option.
This tool does not apply any downsampling by default.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Outputs | |
--out -o | NA | The output recalibration table file to create
Optional Inputs | |
--knownSites | [] | A database of known polymorphic sites
Optional Parameters | |
--covariate -cov | NA | One or more covariates to be used in the recalibration. Can be specified multiple times
--indels_context_size -ics | 3 | Size of the k-mer context to be used for base insertions and deletions
--maximum_cycle_value -maxCycle | 500 | The maximum cycle value permitted for the Cycle covariate
--mismatches_context_size -mcs | 2 | Size of the k-mer context to be used for base mismatches
--solid_nocall_strategy | THROW_EXCEPTION | Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
--solid_recal_mode -sMode | SET_Q_ZERO | How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
Optional Flags | |
--list -ls | false | List the available covariates and exit
--lowMemoryMode | false | Reduce memory usage in multi-threaded code at the expense of threading efficiency
--no_standard_covs -noStandard | false | Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
--sort_by_all_columns -sortAllCols | false | Sort the rows in the tables of reports
Advanced Parameters | |
--binary_tag_name -bintag | NA | the binary tag covariate name if using it
--bqsrBAQGapOpenPenalty -bqsrBAQGOP | 40.0 | BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets
--deletions_default_quality -ddq | 45 | default quality for the base deletions covariate
--insertions_default_quality -idq | 45 | default quality for the base insertions covariate
--low_quality_tail -lqt | 2 | minimum quality for the bases in the tail of the reads to be considered
--mismatches_default_quality -mdq | -1 | default quality for the base mismatches covariate
--quantizing_levels -ql | 16 | number of distinct quality scores in the quantized output
Advanced Flags | |
--run_without_dbsnp_potentially_ruining_quality | false | If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
the binary tag covariate name if using it
The tag name for the binary tag covariate (if using it)
String NA
BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets
double 40.0 [ [ -∞ ∞ ] ]
One or more covariates to be used in the recalibration. Can be specified multiple times
Note that the ReadGroup and QualityScore covariates are required and do not need to be specified.
Also, unless --no_standard_covs is specified, the Cycle and Context covariates are standard and are included by default.
Use the --list argument to see the available covariates.
String[] NA
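The rules above for required, standard, and user-specified covariates can be summarized in a small sketch. The covariate names are taken from this page; the function name is hypothetical, and the exact strings the -cov argument accepts are an assumption (use --list to check).

```python
REQUIRED = ["ReadGroup", "QualityScore"]   # always used, cannot be excluded
STANDARD = ["Cycle", "Context"]            # used unless --no_standard_covs is set

def effective_covariates(requested=(), no_standard_covs=False):
    """Return the covariates that would be in effect for a given command line."""
    covariates = list(REQUIRED)
    if not no_standard_covs:
        covariates += STANDARD
    for name in requested:
        if name not in covariates:
            covariates.append(name)
    return covariates

print(effective_covariates())
print(effective_covariates(requested=["Cycle"], no_standard_covs=True))
```

The second call models a run that keeps Cycle but drops Context: the required pair is still present even though it was never requested.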
default quality for the base deletions covariate
A default base quality to use as a prior (reported quality) in the deletion covariate model. This value will replace all base qualities in the read. A negative value turns it off. [default is on]
byte 45 [ [ -∞ ∞ ] ]
Size of the k-mer context to be used for base insertions and deletions
The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.
int 3 [ [ -∞ ∞ ] ]
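One way to see why larger context sizes increase runtime and heap usage is to count the distinct k-mers the context covariate may have to track. The sketch below assumes a four-letter alphabet (A, C, G, T) and enforces the 1 to 13 range stated above; the function name is hypothetical.

```python
MIN_CONTEXT_SIZE, MAX_CONTEXT_SIZE = 1, 13

def num_possible_contexts(context_size: int) -> int:
    """Number of distinct k-mers over A, C, G, T for a given context size."""
    if not MIN_CONTEXT_SIZE <= context_size <= MAX_CONTEXT_SIZE:
        raise ValueError("context size must be between 1 and 13 (inclusive)")
    return 4 ** context_size

for k in (2, 3, 8, 13):
    print(k, num_possible_contexts(k))
```

The count grows by a factor of four with each extra base, so the default of 3 gives 64 contexts while 13 gives over 67 million.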
default quality for the base insertions covariate
A default base quality to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads without insertion quality scores for each base. [default is on]
byte 45 [ [ -∞ ∞ ] ]
A database of known polymorphic sites
This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference,
so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3
List[RodBinding[Feature]] []
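The masking logic described above can be sketched as follows: every observed base counts as an observation, every reference mismatch counts as an error, and known polymorphic sites are skipped entirely. The data layout (tuples of position, read base, reference base) and the function name are assumptions made for illustration.

```python
def count_mismatches(observations, known_sites):
    """Count observations and reference mismatches outside known sites.

    observations: iterable of (position, read_base, reference_base)
    known_sites:  set of positions to mask out (e.g. loaded from dbSNP)
    """
    num_observations = 0
    num_mismatches = 0
    for position, read_base, ref_base in observations:
        if position in known_sites:
            continue  # real variation is expected here, so skip it
        num_observations += 1
        if read_base != ref_base:
            num_mismatches += 1
    return num_observations, num_mismatches

observations = [(100, "A", "A"), (101, "C", "T"), (102, "G", "G"), (103, "T", "C")]
known_sites = {101}
print(count_mismatches(observations, known_sites))  # (3, 1)
```

Without the mask, the mismatch at position 101 would be counted as a sequencing error and would bias the empirical quality downward.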
List the available covariates and exit
Note that the --list argument requires a fully resolved and correct command-line to work.
boolean false
minimum quality for the bases in the tail of the reads to be considered
Bases in a low-quality tail at either end of a read (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail is considered low quality.
byte 2 [ [ -∞ ∞ ] ]
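A minimal sketch of the tail rule, under the assumption that a tail is the run of consecutive bases at either end of the read whose quality is at or below the threshold. The function name is hypothetical and the tool's exact treatment of these bases may differ.

```python
def high_quality_span(qualities, low_quality_tail=2):
    """Return (start, end) so that qualities[start:end] excludes low-quality tails."""
    start, end = 0, len(qualities)
    # Trim from the beginning while quality is at or below the threshold.
    while start < end and qualities[start] <= low_quality_tail:
        start += 1
    # Trim from the end in the same way.
    while end > start and qualities[end - 1] <= low_quality_tail:
        end -= 1
    return start, end

quals = [2, 2, 30, 35, 2, 28, 2]
print(high_quality_span(quals))  # (2, 6)
```

Note that the low-quality base in the middle of the read is kept: only the tails are trimmed.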
Reduce memory usage in multi-threaded code at the expense of threading efficiency
When you use nct > 1, BQSR uses nct times more memory to compute its recalibration tables, for efficiency
purposes. If you have many covariates, and therefore are using a lot of memory, you can use this flag
to safely access only one table. There may be some CPU cost, but as long as the table is really big
the cost should be relatively reasonable.
boolean false
The maximum cycle value permitted for the Cycle covariate
The cycle covariate will generate an error if it encounters a cycle greater than this value.
This argument is ignored if the Cycle covariate is not used.
int 500 [ [ -∞ ∞ ] ]
Size of the k-mer context to be used for base mismatches
The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.
int 2 [ [ -∞ ∞ ] ]
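To make the k-mer context concrete, the sketch below derives a context value for each base of a read. It assumes the context of a base is the k-mer ending at that base (the base itself plus the preceding k-1 bases), and that bases without a full k-mer, or whose k-mer contains N, get no context. These are assumptions for illustration, not a description of the tool's internals.

```python
def mismatch_contexts(bases, context_size=2):
    """Return one context per base: the k-mer ending at that base, or None."""
    contexts = []
    for i in range(len(bases)):
        start = i - context_size + 1
        if start < 0:
            contexts.append(None)  # not enough preceding bases
            continue
        kmer = bases[start:i + 1]
        contexts.append(kmer if "N" not in kmer else None)
    return contexts

print(mismatch_contexts("ACGTN"))  # [None, 'AC', 'CG', 'GT', None]
```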
default quality for the base mismatches covariate
A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read. A negative value turns it off. [default is off]
byte -1 [ [ -∞ ∞ ] ]
Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
The Cycle and Context covariates are standard and are included by default unless this argument is provided.
Note that the ReadGroup and QualityScore covariates are required and cannot be excluded.
boolean false
The output recalibration table file to create
After the header, data records occur one per line until the end of the file. The first several items on a line are the
values of the individual covariates and will change depending on which covariates were specified at runtime. The last
three items are the data: the number of observations for this combination of covariates, the number of reference mismatches,
and the raw empirical quality score calculated by Phred-scaling the mismatch rate.
R File NA
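The record layout described above (covariate values first, then three data items) can be read with a sketch like the one below. It assumes whitespace-delimited fields and uses a made-up example line; the actual column layout of a given report should be checked against the GATKReport documentation before relying on this.

```python
def parse_record(line, num_covariates):
    """Split a data record into covariate values and the three data items."""
    fields = line.split()
    if len(fields) != num_covariates + 3:
        raise ValueError("unexpected number of fields in record")
    covariates = fields[:num_covariates]
    observations = int(fields[-3])
    mismatches = int(fields[-2])
    empirical_quality = float(fields[-1])
    return covariates, observations, mismatches, empirical_quality

# Hypothetical record: read group, reported quality, cycle, context, then data.
example = "rg1 30 12 AC 10000 10 30.0"
print(parse_record(example, num_covariates=4))
```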
number of distinct quality scores in the quantized output
BQSR generates a quantization table for quick quantization later by subsequent tools. BQSR does not quantize the base qualities itself; this is done by the engine with the -qq or -BQSR options.
This parameter tells BQSR the number of levels of quantization to use to build the quantization table.
int 16 [ [ -∞ ∞ ] ]
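To illustrate what a quantization table is, the sketch below maps every quality score to one of a fixed number of representative values using simple equal-width binning. This is a swapped-in technique chosen for clarity, not the method the tool uses; the maximum quality of 93 and the choice of the lowest score in each bin as its representative are assumptions.

```python
def uniform_quantization_table(levels=16, max_quality=93):
    """Map each quality 0..max_quality to a representative from `levels` bins."""
    num_scores = max_quality + 1
    if not 1 <= levels <= num_scores:
        raise ValueError("levels must be between 1 and the number of scores")
    representatives = {}
    table = {}
    for quality in range(num_scores):
        bin_index = quality * levels // num_scores
        # The first (lowest) quality seen in a bin becomes its representative.
        representatives.setdefault(bin_index, quality)
        table[quality] = representatives[bin_index]
    return table

table = uniform_quantization_table(levels=16)
print(len(set(table.values())))  # 16
```

After quantization, a downstream consumer only ever sees the representative values, which reduces the number of distinct quality scores stored in the output.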
If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
This calculation is critically dependent on being able to skip over known polymorphic sites. Please be sure that you know what you are doing if you use this option.
boolean false
Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
The --solid_nocall_strategy argument is an enumerated type (SOLID_NOCALL_STRATEGY), which can have one of the following values: THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ.
SOLID_NOCALL_STRATEGY THROW_EXCEPTION
How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
The --solid_recal_mode argument is an enumerated type (SOLID_RECAL_MODE), which can have one of the following values: DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS.
SOLID_RECAL_MODE SET_Q_ZERO
Sort the rows in the tables of reports
Whether GATK report tables should have rows in sorted order, starting from leftmost column
Boolean false