Call SNPs and indels on a per-locus basis
This tool uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting a genotype for each sample. The system can either emit just the variant sites or complete genotypes (which includes homozygous reference calls) satisfying some phred-scaled confidence value.
The read data from which to make variant calls.
A raw, unfiltered, highly sensitive callset in VCF format.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I sample1.bam [-I sample2.bam ...] \
    --dbsnp dbSNP.vcf \
    -o snps.raw.vcf \
    -stand_call_conf [50.0] \
    [-L targets.interval_list]
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I input.bam \
    -o raw_variants.vcf \
    --output_mode EMIT_ALL_SITES
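When the command lines above are generated from a script, it can help to assemble them as an argument list instead of a hand-edited string. The following is a minimal sketch in Python; the helper name build_unified_genotyper_cmd and its parameters are hypothetical and not part of GATK, and only flags that appear on this page are used.

```python
import shlex


def build_unified_genotyper_cmd(reference, bams, output, glm="SNP",
                                stand_call_conf=None, intervals=None,
                                jar="GenomeAnalysisTK.jar"):
    """Assemble a UnifiedGenotyper command line as a list of arguments.

    Hypothetical helper: it only strings together flags documented on this page.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = ["java", "-jar", jar, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]  # one -I per input BAM
    cmd += ["-o", output, "-glm", glm]
    if stand_call_conf is not None:
        cmd += ["-stand_call_conf", str(stand_call_conf)]
    if intervals is not None:
        cmd += ["-L", intervals]
    return cmd


cmd = build_unified_genotyper_cmd(
    "reference.fasta", ["sample1.bam", "sample2.bam"], "snps.raw.vcf",
    glm="BOTH", stand_call_conf=50.0)
# Print a shell-safe rendering; the list itself could be passed to subprocess.run.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list form to a process launcher avoids quoting problems with file names that contain spaces.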
This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.
These Read Filters are automatically applied to the data by the Engine before processing by UnifiedGenotyper.
This tool can be run in multi-threaded mode using these options.
This tool applies the following downsampling settings by default.
This tool uses a sliding window on the reference.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
Argument name(s) | Default value | Summary
---|---|---
Optional Inputs | |
--alleles | none | Set of alleles to use in genotyping
--comp | [] | Comparison VCF file
--dbsnp -D | none | dbSNP file
Optional Outputs | |
--out -o | stdout | File to which variants should be written
Optional Parameters | |
--annotation -A | [] | One or more specific annotations to apply to variant calls
--contamination_fraction_to_filter -contamination | 0.0 | Fraction of contamination to aggressively remove
--excludeAnnotation -XA | [] | One or more specific annotations to exclude
--genotype_likelihoods_model -glm | SNP | Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together
--genotyping_mode -gt_mode | DISCOVERY | Specifies how to determine the alternate alleles to use for genotyping
--group -G | [Standard, StandardUG] | One or more classes/groups of annotations to apply to variant calls. The single value 'none' removes the default group
--heterozygosity -hets | 0.001 | Heterozygosity value used to compute prior likelihoods for any locus
--heterozygosity_stdev -heterozygosityStandardDeviation | 0.01 | Standard deviation of heterozygosity for SNP and indel calling.
--indel_heterozygosity -indelHeterozygosity | 1.25E-4 | Heterozygosity for indel calling
--max_deletion_fraction -deletions | 0.05 | Maximum fraction of reads with deletions spanning this locus for it to be callable
--min_base_quality_score -mbq | 17 | Minimum base quality required to consider a base for calling
--min_indel_count_for_genotyping -minIndelCnt | 5 | Minimum number of consensus indels required to trigger genotyping run
--min_indel_fraction_per_sample -minIndelFrac | 0.25 | Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles
--pair_hmm_implementation -pairHMM | LOGLESS_CACHING | The PairHMM implementation to use for -glm INDEL genotype likelihood calculations
--pcr_error_rate -pcr_error | 1.0E-4 | The PCR error rate to be used for computing fragment-based likelihoods
--sample_ploidy -ploidy | 2 | Ploidy per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
--standard_min_confidence_threshold_for_calling -stand_call_conf | 10.0 | The minimum phred-scaled confidence threshold at which variants should be called
Optional Flags | |
--annotateNDA -nda | false | Annotate number of alleles observed
--computeSLOD -slod | false | If provided, we will calculate the SLOD (SB annotation)
--useNewAFCalculator -newQual | false | Use new AF model instead of the so-called exact model
Advanced Parameters | |
--contamination_fraction_per_sample_file -contaminationFile | NA | Contamination per sample
--indelGapContinuationPenalty -indelGCP | 10 | Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)
--indelGapOpenPenalty -indelGOP | 45 | Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)
--input_prior -inputPrior | [] | Input prior for calls
--max_alternate_alleles -maxAltAlleles | 6 | Maximum number of alternate alleles to genotype
--max_genotype_count -maxGT | 1024 | Maximum number of genotypes to consider at any site
--max_num_PL_values -maxNumPLValues | 100 | Maximum number of PL values to output
--onlyEmitSamples | [] | If provided, only these samples will be emitted into the VCF, regardless of which samples are present in the BAM file
--output_mode -out_mode | EMIT_VARIANTS_ONLY | Which type of calls we should output
Advanced Flags | |
--allSitePLs | false | Annotate all sites with PLs
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Set of alleles to use in genotyping
When --genotyping_mode is set to GENOTYPE_GIVEN_ALLELES, the caller will genotype the samples using only the alleles provided in this callset. Note that this is not well tested in HaplotypeCaller, and is definitely not suitable for use with HaplotypeCaller in -ERC GVCF mode. In addition, it does not apply to MuTect2 at all.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
RodBinding[VariantContext] none
Annotate all sites with PLs
Experimental argument FOR USE WITH UnifiedGenotyper ONLY: if SNP likelihood model
is specified, and if EMIT_ALL_SITES output mode is set, when we set this argument then we
will also emit PLs at all sites. This will give a measure of reference confidence and a
measure of which alt alleles are more plausible (if any).
WARNINGS:
- This feature will inflate VCF file size considerably.
- All SNP ALT alleles will be emitted with corresponding 10 PL values.
- An error will be emitted if EMIT_ALL_SITES is not set, or if anything other than diploid
SNP model is used
- THIS WILL NOT WORK WITH HaplotypeCaller, GenotypeGVCFs or MuTect2! Use HaplotypeCaller with
-ERC GVCF then GenotypeGVCFs instead. See the Best Practices documentation for more information.
boolean false
Annotate number of alleles observed
Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles
being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the
number of alternate alleles that were originally discovered (but not necessarily genotyped) at the site.
boolean false
One or more specific annotations to apply to variant calls
Which annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available annotations.
List[String] []
Comparison VCF file
If a call overlaps with a record from the provided comp track, the INFO field will be annotated
as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field).
Records that are filtered in the comp track will be ignored.
Note that 'dbSNP' has been special-cased (see the --dbsnp argument).
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
List[RodBinding[VariantContext]] []
If provided, we will calculate the SLOD (SB annotation)
Note that calculating the SLOD increases the runtime by an appreciable amount.
boolean false
Contamination per sample
This argument specifies a tab-separated file with two columns, "sample" and "contamination", listing one
sample per line together with its contamination level (given as a decimal number, not an integer).
There should be no header. Samples that do not appear in this file will be processed with
CONTAMINATION_FRACTION (the value of --contamination_fraction_to_filter).
File NA
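The file layout described above can be checked before a run with a small reader. This is only a sketch under the stated format (two tab-separated columns, no header); the function name load_contamination, the sample names, and the choice to skip blank lines are assumptions for illustration and do not reflect how GATK itself parses the file.

```python
import io


def load_contamination(handle, default_fraction=0.0):
    """Read 'sample<TAB>contamination' lines (no header) and return a lookup function.

    Samples missing from the file fall back to default_fraction, mirroring the
    documented fallback to the global contamination fraction.
    """
    per_sample = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        sample, fraction = fields[0], float(fields[1])
        per_sample[sample] = fraction
    return lambda sample: per_sample.get(sample, default_fraction)


# Made-up example content; real files would list the sample names found in the BAMs.
example = io.StringIO("sampleA\t0.02\nsampleB\t0.05\n")
lookup = load_contamination(example, default_fraction=0.0)
print(lookup("sampleA"), lookup("sampleC"))
```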
Fraction of contamination to aggressively remove
If this fraction is greater than zero, the caller will aggressively attempt to remove
contamination through biased down-sampling of reads (for all samples). Basically, it will ignore the
contamination fraction of reads for each alternate allele. So if the pileup contains N
total bases, then we will try to remove (N * contamination fraction) bases for each alternate
allele.
double 0.0 [ [ -∞ ∞ ] ]
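As a worked example of the arithmetic described above, the sketch below computes the target number of bases removed per alternate allele, N * contamination fraction. The function name is hypothetical, the restriction of the fraction to [0, 1] is an assumption, and how the tool rounds or samples the reads is not specified here.

```python
def bases_to_remove(pileup_depth, contamination_fraction):
    """Target number of bases to drop for each alternate allele: N * fraction."""
    if not 0.0 <= contamination_fraction <= 1.0:
        # Assumption: a fraction outside [0, 1] is not meaningful.
        raise ValueError("contamination fraction should lie in [0, 1]")
    return pileup_depth * contamination_fraction


# A pileup of 200 bases with a 5% contamination fraction.
print(bases_to_remove(200, 0.05))
```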
dbSNP file
rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate.
dbSNP is not used in any way for the calculations themselves.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
RodBinding[VariantContext] none
One or more specific annotations to exclude
Which annotations to exclude from output in the VCF file. Note that this argument has higher priority than the -A or -G arguments,
so annotations will be excluded even if they are explicitly included with the other options.
List[String] []
Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together
The --genotype_likelihoods_model argument is an enumerated type (Model), which can have one of the following values:
Model SNP
Specifies how to determine the alternate alleles to use for genotyping
The --genotyping_mode argument is an enumerated type (GenotypingOutputMode), which can have one of the following values:
GenotypingOutputMode DISCOVERY
One or more classes/groups of annotations to apply to variant calls. The single value 'none' removes the default group
If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups.
Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.
String[] [Standard, StandardUG]
Heterozygosity value used to compute prior likelihoods for any locus
The expected heterozygosity value used to compute prior probability that a locus is non-reference. See
https://software.broadinstitute.org/gatk/documentation/article?id=8603 for more details.
Double 0.001 [ [ -∞ ∞ ] ]
Standard deviation of heterozygosity for SNP and indel calling.
The standard deviation of the distribution of alt allele fractions. The above heterozygosity parameters give
the *mean* of this distribution; this parameter gives its spread.
double 0.01 [ [ -∞ ∞ ] ]
Heterozygosity for indel calling
This argument informs the prior probability of having an indel at a site.
double 1.25E-4 [ [ -∞ ∞ ] ]
Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)
byte 10 [ [ -∞ ∞ ] ]
Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)
byte 45 [ [ -∞ ∞ ] ]
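The Phred-scaled penalties above map to probabilities through the standard relation p = 10^(-Q/10). A minimal sketch of that conversion, applied to the two defaults and to the Q30 example from the summaries (the function name is illustrative):

```python
def phred_to_probability(q):
    """Convert a Phred-scaled value Q to a probability: 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Defaults for -indelGCP and -indelGOP, plus the Q30 example from the text.
for name, q in [("indelGCP", 10), ("indelGOP", 45), ("example", 30)]:
    print(name, q, phred_to_probability(q))
```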
Input prior for calls
By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a
particular locus, using an infinite sites model (see e.g. Watterson, 1975 or Tajima, 1996). This model asserts that
the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for k=1:N.
However, there are instances where using this prior might not be desirable, e.g. in population studies where
the ancestral status of the reference allele is not known.
This argument allows you to manually specify a list of probabilities for each AC>=1 to be used as
priors for genotyping, with the following restrictions: only diploid calls are supported; you must specify 2 *
N values where N is the number of samples; probability values must be positive and specified in Double format,
in linear space (not log10 space nor Phred-scale); and all values must sum to 1.
For completely flat priors, specify the same value (=1/(2*N+1)) 2*N times, e.g.
-inputPrior 0.33 -inputPrior 0.33
for the single-sample diploid case.
List[Double] []
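The flat-prior recipe above (the value 1/(2*N+1), repeated 2*N times) can be generated for any number of diploid samples. The sketch below is illustrative only: the helper names are hypothetical, and rounding to two decimals simply mirrors the 0.33 shown in the single-sample example.

```python
def flat_input_priors(num_samples):
    """Return 2*N identical prior values of 1/(2*N+1) for N diploid samples."""
    if num_samples < 1:
        raise ValueError("need at least one sample")
    n_values = 2 * num_samples
    return [1.0 / (n_values + 1)] * n_values


def as_arguments(priors, digits=2):
    """Render the priors as repeated -inputPrior arguments."""
    args = []
    for p in priors:
        args += ["-inputPrior", f"{p:.{digits}f}"]
    return args


# Single-sample diploid case, matching the example in the text.
print(" ".join(as_arguments(flat_input_priors(1))))
```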
Maximum number of alternate alleles to genotype
If there are more than this number of alternate alleles presented to the genotyper (either through discovery or
GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many
alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate
alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around
with this parameter.
See also --max_genotype_count.
int 6 [ [ -∞ ∞ ] ]
Maximum fraction of reads with deletions spanning this locus for it to be callable
If the fraction of reads with deletions spanning a locus is greater than this value, the site will not be considered callable and will be skipped.
To disable the use of this parameter, set its value to >1.
Double 0.05 [ [ -∞ ∞ ] ]
Maximum number of genotypes to consider at any site
If there are more than this number of genotypes at a locus presented to the genotyper, then only this many
genotypes will be used. This is intended to deal with sites where the combination of high ploidy and high alt
allele count can lead to an explosion in the number of possible genotypes, with extreme adverse effects on
runtime performance.
How does it work? The possible genotypes are simply different ways of partitioning alleles given a specific
ploidy assumption. Therefore, we remove genotypes from consideration by removing alternate alleles that are the
least well supported. The estimate of allele support is based on the ranking of the candidate haplotypes coming
out of the graph building step. Note however that the reference allele is always kept.
The maximum number of alternative alleles used in the genotyping step will be the lesser of the two:
1. the largest number of alt alleles, given ploidy, that yields a genotype count no higher than --max_genotype_count
2. the value of --max_alternate_alleles
As noted above, genotyping sites with large genotype counts is both CPU and memory intensive. Unless you have
a good reason to change the default value, we highly recommend that you not play around with this parameter.
See also --max_alternate_alleles.
int 1024 [ [ -∞ ∞ ] ]
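To see how ploidy and allele count drive the number of genotypes, and how the two limits interact, the sketch below uses the standard multiset count C(ploidy + alleles - 1, ploidy), where alleles includes the reference. The function names are hypothetical and this is not the tool's implementation, only the "lesser of the two" rule described above.

```python
from math import comb


def genotype_count(ploidy, num_alleles):
    """Number of unordered genotypes for `num_alleles` alleles (ref included)."""
    return comb(ploidy + num_alleles - 1, ploidy)


def effective_max_alt_alleles(ploidy, max_genotype_count=1024,
                              max_alternate_alleles=6):
    """Lesser of (a) the most alt alleles that fit the genotype limit and (b) the alt-allele cap."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    alt = 0
    # Try one more alt allele; alleles = ref + (alt + 1) = alt + 2.
    while genotype_count(ploidy, alt + 2) <= max_genotype_count:
        alt += 1
    return min(alt, max_alternate_alleles)


print(genotype_count(2, 4))            # diploid, ref + 3 alts
print(effective_max_alt_alleles(2))    # diploid with the default limits
print(effective_max_alt_alleles(20))   # e.g. a pool of 10 diploid samples
```

With the default limits, a diploid sample is bounded by --max_alternate_alleles, whereas at high ploidy the genotype-count limit becomes the binding constraint.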
Maximum number of PL values to output
Determines the maximum number of PL values that will be logged in the output. If the number of genotypes
(which is determined by the ploidy and the number of alleles) exceeds the value provided by this argument,
then output of all of the PL values will be suppressed.
int 100 [ [ -∞ ∞ ] ]
Minimum base quality required to consider a base for calling
The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base
is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value.
Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).
int 17 [ [ -∞ ∞ ] ]
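The interaction between base quality and mapping quality described above can be sketched as follows. This is a simplified reading, not the tool's code: the function name is hypothetical, and whether the comparison against the threshold is inclusive is an assumption.

```python
def usable_for_snp_calling(base_quality, mapping_quality, min_base_quality=17):
    """Return True if a base passes the -mbq filter after capping by mapping quality."""
    effective_quality = min(base_quality, mapping_quality)  # base quality capped by MAPQ
    # Assumption: a base at exactly the threshold is kept.
    return effective_quality >= min_base_quality


print(usable_for_snp_calling(30, 60))  # high-quality base on a well-mapped read
print(usable_for_snp_calling(30, 10))  # same base, but the read maps poorly
```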
Minimum number of consensus indels required to trigger genotyping run
A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site.
Decreasing this value will increase sensitivity, but at the cost of longer calling time and a larger number of false positives.
int 5 [ [ -∞ ∞ ] ]
Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles
Complementary argument to minIndelCnt. Only samples with at least this fraction of indel-containing reads will contribute
to counting and overcoming the threshold minIndelCnt. This parameter ensures that in deep data you don't end
up summing lots of very rare errors to overcome the 5-read default threshold. It should work equally well for
low-coverage and high-coverage samples, as low-coverage samples with any indel-containing reads should easily
overcome this threshold.
double 0.25 [ [ -∞ ∞ ] ]
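The way minIndelFrac and minIndelCnt combine can be illustrated with a simplified model in which each sample is summarized by its number of indel-containing reads and its total read count at the locus. The function below is a hypothetical sketch: it ignores per-allele bookkeeping, and the inclusive comparisons are assumptions.

```python
def indel_site_triggers_genotyping(samples, min_indel_count=5,
                                   min_indel_fraction=0.25):
    """samples: iterable of (indel_reads, total_reads) pairs, one per sample."""
    contributing = 0
    for indel_reads, total_reads in samples:
        if total_reads == 0:
            continue  # no coverage, nothing to contribute
        # Only samples with a high enough indel fraction add to the count.
        if indel_reads / total_reads >= min_indel_fraction:
            contributing += indel_reads
    return contributing >= min_indel_count


# Deep samples with rare indel-like errors do not add up to a trigger...
print(indel_site_triggers_genotyping([(4, 1000), (3, 1000)]))
# ...whereas shallow samples with real indel support do.
print(indel_site_triggers_genotyping([(3, 8), (2, 6)]))
```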
If provided, only these samples will be emitted into the VCF, regardless of which samples are present in the BAM file
Set[String] []
File to which variants should be written
A raw, unfiltered, highly sensitive callset in VCF format.
VariantContextWriter stdout
Which type of calls we should output
Experimental argument FOR USE WITH UnifiedGenotyper ONLY. When using HaplotypeCaller, use -ERC
instead. When using GenotypeGVCFs, see -allSites.
The --output_mode argument is an enumerated type (OutputMode), which can have one of the following values:
OutputMode EMIT_VARIANTS_ONLY
The PairHMM implementation to use for -glm INDEL genotype likelihood calculations
The PairHMM implementation to use for -glm INDEL genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.
The --pair_hmm_implementation argument is an enumerated type (HMM_IMPLEMENTATION), which can have one of the following values:
HMM_IMPLEMENTATION LOGLESS_CACHING
The PCR error rate to be used for computing fragment-based likelihoods
The PCR error rate is independent of the sequencing error rate, which is necessary because we cannot necessarily
distinguish between PCR errors vs. sequencing errors. The practical implication for this value is that it
effectively acts as a cap on the base qualities.
Double 1.0E-4 [ [ -∞ ∞ ] ]
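One way to read the "cap on the base qualities" remark above is to express the PCR error rate on the Phred scale, Q = -10 * log10(p), and compare it with a base quality. The sketch below shows that reading only; the function names are hypothetical and the min() is an interpretation of the text, not the tool's likelihood calculation.

```python
import math


def probability_to_phred(p):
    """Convert an error probability to a Phred-scaled quality: -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def capped_base_quality(base_quality, pcr_error_rate=1.0e-4):
    """Interpretation: confidence in a base cannot exceed what PCR errors allow."""
    return min(base_quality, probability_to_phred(pcr_error_rate))


print(probability_to_phred(1.0e-4))  # Phred equivalent of the default rate
print(capped_base_quality(30), capped_base_quality(50))
```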
Ploidy per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
Sample ploidy - equivalent to number of chromosome copies per pool. For pooled experiments this should be set to
the number of samples in pool multiplied by individual sample ploidy.
int 2 [ [ -∞ ∞ ] ]
The minimum phred-scaled confidence threshold at which variants should be called
The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with
confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling; note
that the default value listed for this argument is 10.0.
double 10.0 [ [ -∞ ∞ ] ]
Use new AF model instead of the so-called exact model
This activates a model for calculating QUAL that was introduced in version 3.7 (November 2016). We expect this
model will become the default in future versions.
boolean false