Select a subset of variants from a larger callset
Often, a VCF containing many samples and/or variants needs to be subset to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements; displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose.
There are many different options for selecting subsets of variants from a larger callset; they are summarized in the argument table further down.
There are also several options for recording the original values of certain annotations that are recalculated when subsetting the callset, trimming alleles, and so on.
Input: A variant call set from which to select a subset.
Output: A new VCF file containing the selected subset of variants.
    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -sn SAMPLE_A_PARC \
        -sn SAMPLE_B_ACTG

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -sn SAMPLE_1_PARC \
        -sn SAMPLE_1_ACTG \
        -se 'SAMPLE.+PARC'

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        -o output.vcf \
        -xl_sn SAMPLE_1_PARC \
        -xl_sn SAMPLE_1_ACTG \
        -xl_se 'SAMPLE.+PARC'

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -se 'SAMPLE.+PARC' \
        -select "QD > 10.0"

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        -o output.vcf \
        -se 'SAMPLE.+PARC' \
        -select "QD > 10.0" -invertSelect

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -sn SAMPLE_1_ACTG \
        -env \
        -ef

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -sn SAMPLE_1_ACTG \
        -env \
        -noTrim

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -L /path/to/my.interval_list \
        -sn SAMPLE_1_ACTG

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V hapmap.vcf \
        --discordance myCalls.vcf \
        -o output.vcf \
        -sn mySample

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V myCalls.vcf \
        --concordance theirCalls.vcf \
        -o output.vcf \
        -sn mySample

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -ped family.ped \
        -mv -mvq 50 \
        -o violations.vcf

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -ped family.ped \
        -mv -mvq 50 -invMv \
        -o violations.vcf

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -fraction 0.5

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -selectType INDEL --minIndelSize 2 --maxIndelSize 5

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        -o output.vcf \
        --selectTypeToExclude INDEL

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V input.vcf \
        -o output.vcf \
        -selectType SNP -selectType MNP \
        -restrictAllelesTo MULTIALLELIC

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        -o output.vcf \
        -IDs fileKeep \
        -xlIDs fileExclude

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        --maxFilteredGenotypes 5 --minFilteredGenotypes 2 --maxFractionFilteredGenotypes 0.60 --minFractionFilteredGenotypes 0.10

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T SelectVariants \
        --variant input.vcf \
        --setFilteredGtToNocall
These Read Filters are automatically applied to the data by the Engine before processing by SelectVariants.
This tool can be run in multi-threaded mode using this option.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Inputs | |
--variant, -V | NA | Input VCF file
Optional Inputs | |
--concordance, -conc | none | Output variants also called in this comparison track
--discordance, -disc | none | Output variants not called in this comparison track
--exclude_sample_expressions, -xl_se | [] | List of sample expressions to exclude
--exclude_sample_file, -xl_sf | [] | List of samples to exclude
--sample_file, -sf | NA | File containing a list of samples to include
Optional Outputs | |
--out, -o | stdout | File to which variants should be written
Optional Parameters | |
--exclude_sample_name, -xl_sn | [] | Exclude genotypes from this sample
--excludeIDs, -xlIDs | NA | List of variant IDs to exclude
--keepIDs, -IDs | NA | List of variant IDs to select
--maxFilteredGenotypes | 2147483647 | Maximum number of samples filtered at the genotype level
--maxFractionFilteredGenotypes | 1.0 | Maximum fraction of samples filtered at the genotype level
--maxIndelSize | 2147483647 | Maximum size of indels to include
--maxNOCALLfraction | 1.0 | Maximum fraction of samples with no-call genotypes
--maxNOCALLnumber | 2147483647 | Maximum number of samples with no-call genotypes
--mendelianViolationQualThreshold, -mvq | 0.0 | Minimum GQ score for each trio member to accept a site as a violation
--minFilteredGenotypes | 0 | Minimum number of samples filtered at the genotype level
--minFractionFilteredGenotypes | 0.0 | Minimum fraction of samples filtered at the genotype level
--minIndelSize | 0 | Minimum size of indels to include
--remove_fraction_genotypes, -fractionGenotypes | 0.0 | Select a fraction of genotypes at random from the input and set them to no-call
--restrictAllelesTo | ALL | Select only variants of a particular allelicity
--sample_expressions, -se | NA | Regular expression to select multiple samples
--sample_name, -sn | [] | Include genotypes from this sample
--select_random_fraction, -fraction | 0.0 | Select a fraction of variants at random from the input
--selectexpressions, -select | [] | One or more criteria to use when selecting the data
--selectTypeToExclude, -xlSelectType | [] | Do not select certain type of variants from the input file
--selectTypeToInclude, -selectType | [] | Select only a certain type of variants from the input file
Optional Flags | |
--excludeFiltered, -ef | false | Don't include filtered sites
--excludeNonVariants, -env | false | Don't include non-variant sites
--forceValidOutput | false | Forces output VCF to be compliant to up-to-date version
--invertMendelianViolation, -invMv | false | Output non-mendelian violation sites only
--invertselect, -invertSelect | false | Invert the selection criteria for -select
--keepOriginalAC | false | Store the original AC, AF, and AN values after subsetting
--keepOriginalDP | false | Store the original DP value after subsetting
--mendelianViolation, -mv | false | Output mendelian violation sites only
--preserveAlleles, -noTrim | false | Preserve original alleles, do not trim
--removeUnusedAlternates, -trimAlternates | false | Remove alternate alleles not present in any genotypes
--setFilteredGtToNocall | false | Set filtered genotypes to no-call
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Output variants also called in this comparison track
A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
in both the variant and concordance tracks or (2) every sample present in the variant track is present in the
concordance track and they have the same genotype call.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
RodBinding[VariantContext] none
Output variants not called in this comparison track
A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype
and either the site isn't present in this track, the sample isn't present in this track,
or the sample is called reference in this track.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
RodBinding[VariantContext] none
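The discordance rule described above can be written as a small predicate. The following Python sketch is illustrative only and is not GATK's implementation: the function names, the dictionary-based genotype representation, and the treatment of no-calls (a no-call counts neither as non-reference nor as a reference call) are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Alleles called for one sample at one site, e.g. ("A", "T"); "." marks a no-call.
Genotype = Tuple[str, ...]


def is_non_reference(gt: Genotype, ref: str) -> bool:
    """True if the genotype carries at least one called non-reference allele."""
    return any(allele not in (ref, ".") for allele in gt)


def is_called_reference(gt: Genotype, ref: str) -> bool:
    """True if every allele in the genotype is the reference allele."""
    return all(allele == ref for allele in gt)


def is_discordant(ref: str,
                  variant_gts: Dict[str, Genotype],
                  comp_gts: Optional[Dict[str, Genotype]]) -> bool:
    """Apply the discordance rule to one site.

    variant_gts: sample -> genotype in the --variant track.
    comp_gts:    sample -> genotype in the --discordance track,
                 or None if the site is absent from that track.
    """
    for sample, gt in variant_gts.items():
        if not is_non_reference(gt, ref):
            continue  # only non-reference genotypes can make a site discordant
        if comp_gts is None:
            return True  # site not present in the comparison track
        if sample not in comp_gts:
            return True  # sample not present in the comparison track
        if is_called_reference(comp_gts[sample], ref):
            return True  # sample is called reference in the comparison track
    return False
```

A site is reported as discordant as soon as one sample satisfies any of the three conditions; sites where no sample carries a non-reference genotype are never discordant under this rule.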
List of sample expressions to exclude
Using a regular expression allows you to match multiple sample names that have that pattern in common. Note that sample exclusion takes precedence
over inclusion, so that if a sample is in both lists it will be excluded. This argument can be specified multiple times in order to use multiple
different matching patterns.
Set[String] []
List of samples to exclude
Sample names should be in a plain text file listing one sample name per line. Note that sample exclusion takes precedence over inclusion, so that
if a sample is in both lists it will be excluded. This argument can be specified multiple times in order to
provide multiple sample list files.
Set[File] []
Exclude genotypes from this sample
Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be
excluded. This argument can be specified multiple times in order to provide multiple sample names.
Set[String] []
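How inclusion and exclusion interact can be sketched in a few lines of Python. This illustrates the precedence rule stated above and is not GATK code: `select_samples` and its parameters are hypothetical, unanchored regular-expression search is an assumed matching mode, and keeping every sample when no inclusion argument is given is also an assumption.

```python
import re
from typing import Iterable, List, Sequence


def select_samples(all_samples: Sequence[str],
                   include_names: Iterable[str] = (),
                   include_patterns: Iterable[str] = (),
                   exclude_names: Iterable[str] = (),
                   exclude_patterns: Iterable[str] = ()) -> List[str]:
    """Return the samples to keep, preserving the input order."""
    include_names = set(include_names)
    exclude_names = set(exclude_names)
    include_patterns = list(include_patterns)
    exclude_patterns = list(exclude_patterns)

    def matches(sample: str, patterns: List[str]) -> bool:
        return any(re.search(p, sample) for p in patterns)

    # Inclusion step: with no inclusion criteria, start from all samples (assumption).
    if include_names or include_patterns:
        included = [s for s in all_samples
                    if s in include_names or matches(s, include_patterns)]
    else:
        included = list(all_samples)

    # Exclusion step runs last, so exclusion wins over inclusion.
    return [s for s in included
            if s not in exclude_names and not matches(s, exclude_patterns)]
```

For example, including `SAMPLE_1_ACTG` by name and `SAMPLE.+PARC` by expression while excluding `SAMPLE_1_PARC` by name drops `SAMPLE_1_PARC` even though the inclusion expression matches it.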
Don't include filtered sites
If this flag is enabled, sites that have been marked as filtered (i.e. have anything other than `.` or `PASS`
in the FILTER field) will be excluded from the output.
boolean false
List of variant IDs to exclude
If a file containing a list of IDs is provided to this argument, the tool will not select variants whose ID
field is present in this list of IDs. The matching is done by exact string matching. The expected file format
is simply plain text with one ID per line.
File NA
Don't include non-variant sites
boolean false
Forces output VCF to be compliant to up-to-date version
If this argument is provided, the output will be compliant with the version in the header, however it will also
cause the tool to run slower than without the argument. Without the argument the header will be compliant with
the up-to-date version, but the output in the body may not be compliant. If an up-to-date input file is used,
then the output will also be up-to-date regardless of this argument.
boolean false
Output non-mendelian violation sites only
If this flag is enabled, this tool will select only variants that do not correspond to a mendelian violation as
determined on the basis of family structure. Requires passing a pedigree file using the engine-level
`-ped` argument.
Boolean false
Invert the selection criteria for -select
If this flag is enabled, the tool selects the variants that do not match the `-select` expressions.
boolean false
List of variant IDs to select
If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID
field is present in this list of IDs. The matching is done by exact string matching. The expected file format
is simply plain text with one ID per line.
File NA
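As a rough illustration of how the keep and exclude ID lists behave, the Python sketch below reads a one-ID-per-line file and applies exact string matching to a record's ID field. The helper names are hypothetical, and the way the two lists combine when both are supplied (a record must pass both checks) is an assumption rather than documented behavior.

```python
import io
from typing import Iterable, Optional, Set


def read_id_list(lines: Iterable[str]) -> Set[str]:
    """Parse a plain-text ID list: one ID per line, blank lines ignored."""
    return {line.strip() for line in lines if line.strip()}


def passes_id_filters(variant_id: str,
                      keep_ids: Optional[Set[str]] = None,
                      exclude_ids: Optional[Set[str]] = None) -> bool:
    """Exact string match of the ID field against the keep and exclude lists."""
    if keep_ids is not None and variant_id not in keep_ids:
        return False
    if exclude_ids is not None and variant_id in exclude_ids:
        return False
    return True


# Stand-ins for the files passed as fileKeep and fileExclude.
keep = read_id_list(io.StringIO("rs123\nrs456\n\n"))
exclude = read_id_list(io.StringIO("rs456\n"))
```

Because matching is exact, an ID such as `rs1234` is not selected by a keep list that contains only `rs123`.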
Store the original AC, AF, and AN values after subsetting
When subsetting a callset, this tool recalculates the AC, AF, and AN values corresponding to the contents of the
subset. If this flag is enabled, the original values of those annotations will be stored in new annotations called
AC_Orig, AF_Orig, and AN_Orig.
boolean false
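To make the recalculation concrete, here is a minimal Python sketch of how AC, AF, and AN follow from the genotypes that remain after subsetting, and how the original values could be preserved under the `_Orig` names. It is a simplified model, not GATK's code: the allele-index genotype encoding and the function names are assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# One genotype = allele indices (0 = REF, 1 = first ALT, ...); None = no-call.
Genotype = Tuple[Optional[int], ...]


def recompute_ac_af_an(genotypes: Sequence[Genotype], n_alt: int):
    """Count called alleles (AN), per-ALT counts (AC) and frequencies (AF)."""
    an = 0
    ac = [0] * n_alt
    for gt in genotypes:
        for allele in gt:
            if allele is None:
                continue  # no-calls do not contribute to AN
            an += 1
            if allele > 0:
                ac[allele - 1] += 1
    af = [count / an if an else 0.0 for count in ac]
    return ac, af, an


def subset_info(info: Dict[str, object],
                kept_genotypes: Sequence[Genotype],
                n_alt: int,
                keep_original_ac: bool = False) -> Dict[str, object]:
    """Return INFO annotations updated for the kept samples."""
    new_info = dict(info)
    if keep_original_ac:
        for key in ("AC", "AF", "AN"):
            if key in info:
                new_info[key + "_Orig"] = info[key]
    ac, af, an = recompute_ac_af_an(kept_genotypes, n_alt)
    new_info.update({"AC": ac, "AF": af, "AN": an})
    return new_info
```

With three samples genotyped 0/1, 1/1, and ./., the full callset has AC=3 and AN=4; keeping only the first sample gives AC=1 and AN=2, and the earlier values survive only if the original-value option is enabled.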
Store the original DP value after subsetting
When subsetting a callset, this tool recalculates the site-level (INFO field) DP value corresponding to the contents of the
subset. If this flag is enabled, the original value of the DP annotation will be stored in a new annotation called
DP_Orig.
boolean false
Maximum number of samples filtered at the genotype level
If this argument is provided, select sites where at most a maximum number of samples are filtered at the genotype level.
int 2147483647 [ [ -∞ ∞ ] ]
Maximum fraction of samples filtered at the genotype level
If this argument is provided, select sites where a fraction or less of the samples are filtered at the genotype level.
double 1.0 [ [ -∞ ∞ ] ]
Maximum size of indels to include
If this argument is provided, indels that are larger than the specified size will be excluded.
int 2147483647 [ [ -∞ ∞ ] ]
Maximum fraction of samples with no-call genotypes
If this argument is provided, select sites where at most the given fraction of samples have no-call genotypes.
double 1.0 [ [ -∞ ∞ ] ]
Maximum number of samples with no-call genotypes
If this argument is provided, select sites where at most the given number of samples have no-call genotypes.
int 2147483647 [ [ -∞ ∞ ] ]
Output mendelian violation sites only
If this flag is enabled, this tool will select only variants that correspond to a mendelian violation as
determined on the basis of family structure. Requires passing a pedigree file using the engine-level
`-ped` argument.
Boolean false
Minimum GQ score for each trio member to accept a site as a violation
This argument specifies the genotype quality (GQ) threshold that all members of a trio must have in order
for a site to be accepted as a mendelian violation. Note that the `-mv` flag must be set for this argument to have an effect.
double 0.0 [ [ -∞ ∞ ] ]
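The interaction between the violation test and the GQ threshold can be sketched as follows. This Python example assumes a diploid, autosomal site with fully called genotypes and a simple inheritance rule (the child must carry one allele from each parent); it is not GATK's implementation, and the function names are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple

Genotype = Tuple[str, str]


def is_mendelian_violation(mother: Genotype, father: Genotype, child: Genotype) -> bool:
    """True if the child cannot be formed from one maternal and one paternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible


def accept_as_violation(genotypes: Sequence[Genotype],
                        gqs: Sequence[float],
                        min_gq: float = 0.0) -> bool:
    """genotypes and gqs are ordered (mother, father, child).

    The site is accepted only if every trio member reaches min_gq
    and the genotypes are inconsistent with Mendelian inheritance.
    """
    if any(gq < min_gq for gq in gqs):
        return False
    mother, father, child = genotypes
    return is_mendelian_violation(mother, father, child)
```

With `-mvq 50`, a trio whose genotypes are inconsistent but where one member has GQ 40 would not be reported under this model, because the threshold applies to all three members.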
Minimum number of samples filtered at the genotype level
If this argument is provided, select sites where at least a minimum number of samples are filtered at the genotype level.
int 0 [ [ -∞ ∞ ] ]
Minimum fraction of samples filtered at the genotype level
If this argument is provided, select sites where a fraction or more of the samples are filtered at the genotype level.
double 0.0 [ [ -∞ ∞ ] ]
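The four filtered-genotype thresholds can be read as bounds on a count and on a fraction. The Python sketch below shows one way to combine them; requiring all four bounds to hold at once is an assumption, and the function name and the boolean-list input are made up for the example. The defaults mirror the default values listed for these arguments.

```python
from typing import Sequence


def passes_filtered_genotype_limits(gt_is_filtered: Sequence[bool],
                                    min_count: int = 0,
                                    max_count: int = 2147483647,
                                    min_fraction: float = 0.0,
                                    max_fraction: float = 1.0) -> bool:
    """gt_is_filtered holds one flag per sample: True if that genotype is filtered."""
    n_samples = len(gt_is_filtered)
    n_filtered = sum(gt_is_filtered)
    fraction = n_filtered / n_samples if n_samples else 0.0
    return (min_count <= n_filtered <= max_count
            and min_fraction <= fraction <= max_fraction)


# Ten samples, three of them with filtered genotypes.
site = [True, True, True] + [False] * 7
```

Using the values from the example command (maximum 5, minimum 2, fractions between 0.10 and 0.60), a site with 3 of 10 genotypes filtered is kept, while a site with only 1 of 10 filtered fails the minimum count.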
Minimum size of indels to include
If this argument is provided, indels that are smaller than the specified size will be excluded.
int 0 [ [ -∞ ∞ ] ]
File to which variants should be written
VariantContextWriter stdout
Preserve original alleles, do not trim
The default behavior of this tool is to remove bases common to all remaining alleles after subsetting
operations have been completed, leaving only their minimal representation. If this flag is enabled, the original
alleles will be preserved as recorded in the input VCF.
boolean false
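The default trimming behavior can be illustrated with a short Python sketch. It assumes plain base-string alleles (no symbolic alleles), trims shared trailing bases before shared leading bases, and always leaves at least one base per allele; these details are assumptions for the example and not a description of GATK's exact procedure.

```python
from typing import List, Sequence, Tuple


def trim_alleles(pos: int, alleles: Sequence[str]) -> Tuple[int, List[str]]:
    """Reduce alleles (REF first) to a minimal representation.

    Returns the possibly shifted position and the trimmed alleles.
    """
    trimmed = list(alleles)

    # Drop trailing bases shared by all alleles.
    while all(len(a) > 1 for a in trimmed) and len({a[-1] for a in trimmed}) == 1:
        trimmed = [a[:-1] for a in trimmed]

    # Drop leading bases shared by all alleles; the start position moves right.
    while all(len(a) > 1 for a in trimmed) and len({a[0] for a in trimmed}) == 1:
        trimmed = [a[1:] for a in trimmed]
        pos += 1

    return pos, trimmed
```

For instance, if a site with REF `ATT` and ALT alleles `AT,A` is subset so that only `AT` remains, the pair `ATT`/`AT` trims to `AT`/`A`. With the no-trim flag, the record would keep `ATT`/`AT` as written in the input.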
Select a fraction of genotypes at random from the input and set them to no-call
The value of this argument should be a number between 0 and 1 specifying the fraction of total genotypes to be
randomly selected from the input callset and set to no-call (./.). Note that this is done using a probabilistic
function, so the final result is not guaranteed to carry the exact fraction requested. Can be used for large fractions.
double 0.0 [ [ -∞ ∞ ] ]
Remove alternate alleles not present in any genotypes
When this flag is enabled, all alternate alleles that are not present in the (output) samples will be removed.
Note that this even extends to biallelic SNPs - if the alternate allele is not present in any sample, it will be
removed and the record will contain a '.' in the ALT column. Note also that sites-only VCFs, by definition, do
not include the alternate allele in any genotype calls.
boolean false
Select only variants of a particular allelicity
When this argument is used, we can choose to include only multiallelic or biallelic sites, depending on how many alleles are listed in the ALT column of a VCF.
For example, a multiallelic record such as:
1 100 . A AAA,AAAAA
will be excluded if `-restrictAllelesTo BIALLELIC` is used, because there are two alternate alleles, whereas a record such as:
1 100 . A T
will be included in that case, but would be excluded if `-restrictAllelesTo MULTIALLELIC` is used.
The --restrictAllelesTo argument is an enumerated type (NumberAlleleRestriction); valid values are ALL (default), MULTIALLELIC, and BIALLELIC.
NumberAlleleRestriction ALL
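The allelicity rule amounts to counting the entries in the ALT column. A minimal Python sketch, using the two example records above, is shown below. The function name is hypothetical, and treating a site with no ALT allele as neither biallelic nor multiallelic is an assumption.

```python
def passes_allele_restriction(alt_field: str, restriction: str = "ALL") -> bool:
    """Decide whether a record is kept, given its ALT column value."""
    n_alt = len([a for a in alt_field.split(",") if a and a != "."])
    if restriction == "ALL":
        return True
    if restriction == "BIALLELIC":
        return n_alt == 1
    if restriction == "MULTIALLELIC":
        return n_alt > 1
    raise ValueError("unknown restriction: " + restriction)
```

The record with ALT `AAA,AAAAA` passes only the MULTIALLELIC restriction, and the record with ALT `T` passes only the BIALLELIC one; both pass with ALL.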
Regular expression to select multiple samples
Using a regular expression allows you to match multiple sample names that have that pattern in common. This
argument can be specified multiple times in order to use multiple different matching patterns.
Set[String] NA
File containing a list of samples to include
Sample names should be in a plain text file listing one sample name per line. This argument can be specified multiple times in order to provide
multiple sample list files.
Set[File] NA
Include genotypes from this sample
This argument can be specified multiple times in order to provide multiple sample names.
Set[String] []
Select a fraction of variants at random from the input
The value of this argument should be a number between 0 and 1 specifying the fraction of total variants to be
randomly selected from the input callset. Note that this is done using a probabilistic function, so the final
result is not guaranteed to carry the exact fraction requested. Can be used for large fractions.
double 0.0 [ [ -∞ ∞ ] ]
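The note about the probabilistic function can be illustrated with a per-record random draw. In the Python sketch below, each record is kept independently with probability equal to the requested fraction, which is why the selected count only approximates the target. This models the described behavior and is not GATK's code; the function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_random_fraction(records: Sequence[T],
                           fraction: float,
                           seed: Optional[int] = None) -> List[T]:
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]


records = list(range(10000))
picked = select_random_fraction(records, 0.5, seed=0)
# len(picked) is close to, but generally not exactly, 5000.
```

Because every record is decided on its own, no second pass over the input is needed, and the relative order of the selected records is preserved.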
One or more criteria to use when selecting the data
See example commands above for detailed usage examples. Note that these expressions are evaluated *after* the
specified samples are extracted and the INFO field annotations are updated.
ArrayList[String] []
Do not select certain type of variants from the input file
This argument excludes particular kinds of variants out of a list. If left empty, there is no type selection
and all variant types are considered for other selection criteria. Valid types are INDEL, SNP, MIXED, MNP,
SYMBOLIC, NO_VARIATION. Can be specified multiple times.
List[Type] []
Select only a certain type of variants from the input file
This argument selects particular kinds of variants out of a list. If left empty, there is no type selection
and all variant types are considered for other selection criteria. Valid types are INDEL, SNP, MIXED, MNP,
SYMBOLIC, NO_VARIATION. Can be specified multiple times.
List[Type] []
Set filtered genotypes to no-call
If this argument is provided, set filtered genotypes to no-call (./.).
boolean false
Input VCF file
Variants from this VCF file are used by this tool as input.
The file must at least contain the standard VCF header lines, but
can be empty (i.e., no variants are contained in the file).
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
RodBinding[VariantContext] NA (required)