Randomly select variant records according to specified options
This tool is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study.
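Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: the samples considered (see --sampleMode) and the allele frequency spectrum (see --frequencySelectionMode).
Input: One or more variant sets to choose from.
Output: A sites-only VCF with the desired number of randomly selected sites.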
    java -jar GenomeAnalysisTK.jar \
      -T ValidationSiteSelector \
      -R reference.fasta \
      -V input1.vcf \
      -V input2.vcf \
      -sn NA12878 \
      -o output.vcf \
      --numValidationSites 200 \
      -sampleMode POLY_BASED_ON_GT \
      -freqMode KEEP_AF_SPECTRUM

    java -jar GenomeAnalysisTK.jar \
      -T ValidationSiteSelector \
      -R reference.fasta \
      -V:foo input1.vcf \
      -V:bar input2.vcf \
      --numValidationSites 200 \
      -sf samples.txt \
      -o output.vcf \
      -sampleMode POLY_BASED_ON_GT \
      -freqMode UNIFORM \
      -selectType INDEL
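After a run like the ones above, it can be useful to confirm that the output holds the requested number of sites. The sketch below is not part of GATK: `count_vcf_records` is a hypothetical helper, and the small temporary file merely stands in for output.vcf. It assumes an uncompressed VCF in which every header line starts with `#`.

```shell
# Hypothetical helper (not a GATK command): print the number of data
# records, i.e. lines that do not start with '#', in an uncompressed VCF.
count_vcf_records() {
    # grep -c exits non-zero when the count is 0; '|| true' keeps that
    # case from being treated as an error.
    grep -vc '^#' "$1" || true
}

# Made-up stand-in for output.vcf with two sites-only records.
demo_vcf=$(mktemp)
trap 'rm -f "$demo_vcf"' EXIT
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf '1\t100\t.\tA\tG\t.\t.\t.\n' >> "$demo_vcf"
printf '1\t200\t.\tC\tT\t.\t.\t.\n' >> "$demo_vcf"

requested=2   # the value that was passed to --numValidationSites
selected=$(count_vcf_records "$demo_vcf")

if [ "$selected" -eq "$requested" ]; then
    echo "output contains all $requested requested sites"
else
    echo "output contains $selected sites, but $requested were requested" >&2
fi
```

In real use, replace the temporary file with the path given to -o and set `requested` to the value given to --numValidationSites.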
These Read Filters are automatically applied to the data by the Engine before processing by ValidationSiteSelector.
All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument lets you apply read filters that exclude some of the data from the analysis.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Inputs | |
--variant -V | NA | Input VCF file, can be specified multiple times
Required Parameters | |
--numValidationSites -numSites | 0 | Number of output validation sites
Optional Inputs | |
--sample_file -sf | NA | File containing a list of samples (one per line) to include. Can be specified multiple times
Optional Outputs | |
--out -o | stdout | File to which variants should be written
Optional Parameters | |
--frequencySelectionMode -freqMode | KEEP_AF_SPECTRUM | Allele Frequency selection mode
--sample_expressions -se | NA | Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
--sample_name -sn | [] | Include genotypes from this sample. Can be specified multiple times
--sampleMode | NONE | Sample selection mode
--samplePNonref | 0.99 | GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
--selectTypeToInclude -selectType | [] | Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times
Optional Flags | |
--ignoreGenotypes | false | If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
--ignorePolymorphicStatus | false | If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
--includeFilteredSites -ifs | false | If true, will include filtered sites in set to choose variants from
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--frequencySelectionMode / -freqMode
Allele Frequency selection mode
This argument sets the allele frequency selection mode. See the wiki for more information.
The --frequencySelectionMode argument is an enumerated type (AF_COMPUTATION_MODE). The usage examples above use the values KEEP_AF_SPECTRUM and UNIFORM.
Type: AF_COMPUTATION_MODE. Default: KEEP_AF_SPECTRUM.
--ignoreGenotypes
If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
Argument for the frequency selection mode. AC, AF, and AN are taken from the VCF INFO field rather than recalculated. Typically specified for sites-only VCFs that still have AC/AF/AN information.
Type: boolean. Default: false.
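Because --ignoreGenotypes relies on the AC, AF, and AN annotations already present in the INFO field, a quick pre-check of the input can help. The following is a minimal sketch, not GATK code: `count_missing_af_annotations` is a hypothetical helper, the demo records are made up, and the input is assumed to be an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helper (not a GATK command): print how many data records
# lack at least one of the AC, AF or AN keys in the INFO column (field 8).
count_missing_af_annotations() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            info = ";" $8 ";"              # pad so every key is preceded by ";"
            if (info !~ /;AC=/ || info !~ /;AF=/ || info !~ /;AN=/) missing++
        }
        END { print missing + 0 }          # "+ 0" prints 0 when nothing is missing
    ' "$1"
}

# Made-up sites-only input: the first record is complete, the second has
# no allele annotations at all, and the third lacks AF.
demo_vcf=$(mktemp)
trap 'rm -f "$demo_vcf"' EXIT
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf '1\t100\t.\tA\tG\t.\tPASS\tAC=1;AF=0.5;AN=2\n' >> "$demo_vcf"
printf '1\t200\t.\tC\tT\t.\tPASS\tDP=10\n' >> "$demo_vcf"
printf '1\t300\t.\tG\tA\t.\tPASS\tAC=2;AN=4\n' >> "$demo_vcf"

missing=$(count_missing_af_annotations "$demo_vcf")
echo "records without complete AC/AF/AN annotations: $missing"
```

Padding the INFO string with semicolons keeps keys such as MLEAC from being mistaken for AC.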
--ignorePolymorphicStatus
If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
Argument for the frequency selection mode. Allows reference (non-polymorphic) sites to be included in the validation set.
Type: boolean. Default: false.
--includeFilteredSites / -ifs
If true, will include filtered sites in set to choose variants from
Do not exclude filtered sites (e.g. sites whose FILTER value is neither PASS nor .) from consideration for validation.
Type: boolean. Default: false.
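To gauge how many candidate sites this flag affects, one can count the records whose FILTER column is neither PASS nor `.`. This is a minimal sketch rather than GATK code: `count_filtered_records` is a hypothetical helper, the demo records are made up, and the input is assumed to be an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helper (not a GATK command): print the number of data
# records whose FILTER column (field 7) is neither "PASS" nor ".".
count_filtered_records() {
    awk -F'\t' '
        /^#/ { next }                           # skip header lines
        $7 != "PASS" && $7 != "." { filtered++ }
        END { print filtered + 0 }              # "+ 0" prints 0 when none match
    ' "$1"
}

# Made-up input: one PASS record, one unfiltered (".") record and one
# record that failed a filter named LowQual.
demo_vcf=$(mktemp)
trap 'rm -f "$demo_vcf"' EXIT
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf '1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$demo_vcf"
printf '1\t200\t.\tC\tT\t50\t.\t.\n' >> "$demo_vcf"
printf '1\t300\t.\tG\tA\t5\tLowQual\t.\n' >> "$demo_vcf"

filtered=$(count_filtered_records "$demo_vcf")
echo "filtered records: $filtered"
```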
--numValidationSites / -numSites
Number of output validation sites
The number of sites in your validation set.
Required. Type: int. Default: 0. Allowed range: -∞ to ∞.
--out / -o
File to which variants should be written
The output VCF file.
Type: VariantContextWriter. Default: stdout.
--sample_expressions / -se
Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
Sample regular expressions to subset the input VCF to, prior to selecting variants. For example, -se '^NA12.*' subsets to all samples whose names start with NA12.
Type: Set[String]. Default: NA.
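Before passing an expression to -se, it can help to preview which sample names it would pick up. The sketch below only approximates this with `grep -E`: GATK evaluates the expression as a Java regular expression, whose syntax and matching rules can differ from POSIX extended regular expressions, so treat the result as a rough check. `preview_sample_matches` is a hypothetical helper, and the sample names are illustrative.

```shell
# Hypothetical helper (not a GATK command): print the sample names that
# match an extended regular expression.
#   $1        the expression
#   $2 ...    candidate sample names
preview_sample_matches() {
    pattern=$1
    shift
    # grep exits non-zero when nothing matches; '|| true' keeps that case
    # from being treated as an error.
    printf '%s\n' "$@" | grep -E -- "$pattern" || true
}

# Illustrative sample names; only the first two start with NA12.
matches=$(preview_sample_matches '^NA12' NA12878 NA12891 NA19238 HG00096)
echo "$matches"
```

Anchoring the expression with `^` makes the intent (a prefix match) explicit regardless of whether the matcher looks for partial or full matches.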
--sample_file / -sf
File containing a list of samples (one per line) to include. Can be specified multiple times
File containing a list of sample names to subset the input VCF to. Equivalent to specifying each name in the file separately with -sn.
Type: Set[File]. Default: NA.
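A sample file in the one-name-per-line layout can be derived from the header of the input VCF and then trimmed by hand. The following is a minimal sketch, not GATK code: `list_vcf_samples` is a hypothetical helper, the header is made up, and the input is assumed to be an uncompressed, tab-delimited VCF whose `#CHROM` line lists sample names from column 10 onward.

```shell
# Hypothetical helper (not a GATK command): print the sample names of a
# VCF, one per line, by reading columns 10 and up of the #CHROM line.
list_vcf_samples() {
    awk -F'\t' '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) print $i
            exit                                # the header line occurs once
        }
    ' "$1"
}

# Made-up input with three samples and no data records.
demo_vcf=$(mktemp)
samples_file=$(mktemp)
trap 'rm -f "$demo_vcf" "$samples_file"' EXIT
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA12891\tNA12892\n' >> "$demo_vcf"

# Write the names in the layout expected by -sf: one sample per line.
list_vcf_samples "$demo_vcf" > "$samples_file"
cat "$samples_file"
```

The resulting file can then be edited down to the samples of interest and passed with -sf.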
--sample_name / -sn
Include genotypes from this sample. Can be specified multiple times
Sample name(s) to subset the input VCF to, prior to selecting variants. -sn A -sn B subsets to samples A and B.
Type: Set[String]. Default: [].
--sampleMode
Sample selection mode
A mode for selecting sites based on sample-level data. See the wiki documentation for more information.
The --sampleMode argument is an enumerated type (SAMPLE_SELECTION_MODE). Values that appear on this page are NONE (the default), POLY_BASED_ON_GT, and POLY_BASED_ON_GL.
Type: SAMPLE_SELECTION_MODE. Default: NONE.
--samplePNonref
GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
A P[nonref] threshold for SAMPLE_SELECTION_MODE=POLY_BASED_ON_GL. See the wiki documentation for more information.
Type: double. Default: 0.99. Allowed range: -∞ to ∞.
--selectTypeToInclude / -selectType
Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times
This argument selects particular kinds of variants (e.g. SNP, INDEL) out of a list. If left unspecified, all types are considered.
Type: List[Type]. Default: [].
--variant / -V
Input VCF file, can be specified multiple times
The input VCF file.
This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3.
Required. Type: List[RodBinding[VariantContext]]. Default: NA.