    Showing docs for version 3.7-0


    ValidationSiteSelector

    Randomly select variant records according to specified options

    Category Variant Manipulation Tools

    Traversal LocusWalker

    PartitionBy LOCUS


    Overview

    This tool is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study.

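    Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: sample restrictions (which samples are considered, via -sn, -sf or -se, and whether polymorphism is judged from genotypes or from genotype likelihoods, via --sampleMode) and the sampling method (--frequencySelectionMode, either KEEP_AF_SPECTRUM or UNIFORM).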

    Input

    One or more variant sets to choose from.

    Output

    A sites-only VCF with the desired number of randomly selected sites.
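    The core idea of drawing a fixed number of sites at random from a candidate set can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function name select_validation_sites, the (contig, position) tuples and the seed parameter are all assumptions made for the example, and contigs are sorted by name rather than by reference order.

```python
import random


def select_validation_sites(candidate_sites, num_sites, seed=None):
    """Pick up to num_sites candidates uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    # Never ask for more sites than are available.
    k = min(num_sites, len(candidate_sites))
    chosen = rng.sample(candidate_sites, k)
    # Sort by (contig name, position); a real VCF follows reference contig order.
    return sorted(chosen)


candidates = [("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900), ("chr3", 7)]
picked = select_validation_sites(candidates, 3, seed=42)
print(picked)
```

    Passing a seed makes the draw repeatable, which is convenient when a validation set has to be regenerated later.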

    Usage examples

     # Select 200 sites that are polymorphic (by genotype) in NA12878, keeping the AF spectrum
     java -jar GenomeAnalysisTK.jar \
       -T ValidationSiteSelector \
       -R reference.fasta \
       -V input1.vcf \
       -V input2.vcf \
       -sn NA12878 \
       -o output.vcf \
       --numValidationSites 200 \
       -sampleMode POLY_BASED_ON_GT \
       -freqMode KEEP_AF_SPECTRUM

     # Select 200 indel sites uniformly, using the samples listed in samples.txt
     java -jar GenomeAnalysisTK.jar \
       -T ValidationSiteSelector \
       -R reference.fasta \
       -V:foo input1.vcf \
       -V:bar input2.vcf \
       --numValidationSites 200 \
       -sf samples.txt \
       -o output.vcf \
       -sampleMode POLY_BASED_ON_GT \
       -freqMode UNIFORM \
       -selectType INDEL
     

    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by ValidationSiteSelector.


    Command-line Arguments

    Engine arguments

    All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument applies read filters that exclude some of the data from the analysis.

    ValidationSiteSelector specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s)                      Default value     Summary

    Required Inputs
    --variant, -V                         NA                Input VCF file, can be specified multiple times

    Required Parameters
    --numValidationSites, -numSites       0                 Number of output validation sites

    Optional Inputs
    --sample_file, -sf                    NA                File containing a list of samples (one per line) to include. Can be specified multiple times

    Optional Outputs
    --out, -o                             stdout            File to which variants should be written

    Optional Parameters
    --frequencySelectionMode, -freqMode   KEEP_AF_SPECTRUM  Allele Frequency selection mode
    --sample_expressions, -se             NA                Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
    --sample_name, -sn                    []                Include genotypes from this sample. Can be specified multiple times
    --sampleMode                          NONE              Sample selection mode
    --samplePNonref                       0.99              GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
    --selectTypeToInclude, -selectType    []                Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times

    Optional Flags
    --ignoreGenotypes                     false             If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
    --ignorePolymorphicStatus             false             If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
    --includeFilteredSites, -ifs          false             If true, will include filtered sites in set to choose variants from

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


    --frequencySelectionMode / -freqMode

    Allele Frequency selection mode
    This argument selects allele frequency selection mode. See the wiki for more information.

    The --frequencySelectionMode argument is an enumerated type (AF_COMPUTATION_MODE), which can have one of the following values:

    KEEP_AF_SPECTRUM
    UNIFORM

    AF_COMPUTATION_MODE  KEEP_AF_SPECTRUM
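    One plausible reading of KEEP_AF_SPECTRUM is stratified sampling: sites are grouped into allele-frequency bins, and each bin contributes in proportion to its share of the input, so the selected set has roughly the same AF distribution as the original. The sketch below illustrates that idea only. The helper names, the ten equal-width bins and the rounding rule are assumptions, not the tool's actual binning scheme.

```python
import random
from collections import defaultdict


def af_bin(af, num_bins=10):
    """Map an allele frequency in [0, 1] to a bin index in [0, num_bins - 1]."""
    return min(int(af * num_bins), num_bins - 1)


def select_keep_af_spectrum(sites, num_sites, num_bins=10, seed=None):
    """Sample (site_id, allele_frequency) pairs so each AF bin keeps its share."""
    if not sites:
        return []
    rng = random.Random(seed)
    bins = defaultdict(list)
    for site in sites:
        bins[af_bin(site[1], num_bins)].append(site)
    selected = []
    for index in sorted(bins):
        members = bins[index]
        # Quota proportional to the bin's share of all input sites.
        # Rounding means the total can differ slightly from num_sites.
        quota = round(num_sites * len(members) / len(sites))
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected


# 80 rare sites and 20 common sites: a 10-site selection keeps the 8:2 ratio.
sites = [(f"rare{i}", 0.01) for i in range(80)] + [(f"common{i}", 0.5) for i in range(20)]
picked = select_keep_af_spectrum(sites, 10, seed=0)
print(len(picked))
```

    A plain uniform draw matches the input spectrum only on average, whereas the stratified draw fixes the per-bin counts, which matters when the requested number of sites is small.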


    --ignoreGenotypes / -ignoreGenotypes

    If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
    Argument for the frequency selection mode. AC, AF and AN are taken from the VCF INFO field rather than recalculated. Typically specified for sites-only VCFs that still have AC/AF/AN information.

    boolean  false
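    To make the AC/AF/AN lookup concrete, here is a minimal sketch of reading those values from a VCF INFO string instead of recomputing them from genotypes. The helper names are hypothetical, only the first ALT allele is considered, and AF falls back to AC/AN when the AF key is absent. Those choices are simplifying assumptions, not the tool's behaviour.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'AC=2;AF=0.5;AN=4;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, _, value = entry.partition("=")
        # Keys without a value (e.g. DB) are flags.
        info[key] = value if value else True
    return info


def allele_stats(info_field):
    """Return (AC, AF, AN) for the first ALT allele, taken from INFO only."""
    info = parse_info(info_field)
    ac = int(info["AC"].split(",")[0])
    an = int(info["AN"])
    # Assumption: derive AF from AC/AN when the AF annotation is missing.
    af = float(info["AF"].split(",")[0]) if "AF" in info else ac / an
    return ac, af, an


print(allele_stats("AC=2;AF=0.5;AN=4;DB"))
```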


    --ignorePolymorphicStatus / -ignorePolymorphicStatus

    If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
    Argument for the frequency selection mode. Allows reference (non-polymorphic) sites to be included in the validation set.

    boolean  false


    --includeFilteredSites / -ifs

    If true, will include filtered sites in set to choose variants from
    Do not exclude filtered sites (i.e. those whose FILTER value is neither PASS nor .) from consideration for validation.

    boolean  false
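    The effect of this flag can be pictured as a simple predicate on the VCF FILTER column. The sketch below uses hypothetical helper names and represents each record as a plain dict. It illustrates the rule described above rather than the tool's code.

```python
def is_filtered(filter_field):
    """A record counts as filtered unless its FILTER value is PASS or '.'."""
    return filter_field not in ("PASS", ".")


def candidate_records(records, include_filtered_sites=False):
    """Keep unfiltered records, or every record when the flag is set."""
    return [
        record
        for record in records
        if include_filtered_sites or not is_filtered(record["FILTER"])
    ]


records = [
    {"ID": "site1", "FILTER": "PASS"},
    {"ID": "site2", "FILTER": "LowQual"},
    {"ID": "site3", "FILTER": "."},
]
print([r["ID"] for r in candidate_records(records)])
```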


    --numValidationSites / -numSites

    Number of output validation sites
    The number of sites in your validation set

    R int  0  [ [ -∞  ∞ ] ]


    --out / -o

    File to which variants should be written
    The output VCF file

    VariantContextWriter  stdout


    --sample_expressions / -se

    Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
    Sample regexps to subset the input VCF to, prior to selecting variants. For example, -se '^NA12.*' subsets to all samples with prefix NA12.

    Set[String]  NA
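    Because this argument takes regular expressions rather than shell globs, a glob-style pattern such as NA12* does not mean "names starting with NA12". The sketch below shows the difference using Python's re module. Whether the tool requires the whole name to match or only part of it is not stated here, so both behaviours are shown, and the helper name is hypothetical.

```python
import re

samples = ["NA12878", "NA12891", "NA19240", "HG00096"]


def select_by_expression(names, expression, full_match=True):
    """Return the names matching a regular expression.

    full_match=True requires the whole name to match; False accepts a
    match anywhere in the name. Which one the tool uses is an assumption.
    """
    pattern = re.compile(expression)
    matcher = pattern.fullmatch if full_match else pattern.search
    return [name for name in names if matcher(name)]


# As a regex, 'NA12*' means 'NA1' followed by zero or more '2' characters.
print(select_by_expression(samples, "NA12*"))
print(select_by_expression(samples, "NA12*", full_match=False))
# An anchored prefix pattern behaves the same under both interpretations.
print(select_by_expression(samples, "^NA12.*"))
print(select_by_expression(samples, "^NA12.*", full_match=False))
```

    Quoting the expression on the command line also keeps the shell from expanding characters such as * before the tool sees them.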


    --sample_file / -sf

    File containing a list of samples (one per line) to include. Can be specified multiple times
    File containing a list of sample names to subset the input VCF to. Equivalent to specifying the contents of the file separately with -sn.

    Set[File]  NA


    --sample_name / -sn

    Include genotypes from this sample. Can be specified multiple times
    Sample name(s) to subset the input VCF to, prior to selecting variants. -sn A -sn B subsets to samples A and B.

    Set[String]  []


    --sampleMode / -sampleMode

    Sample selection mode
    A mode for selecting sites based on sample-level data. See the wiki documentation for more information.

    The --sampleMode argument is an enumerated type (SAMPLE_SELECTION_MODE), which can have one of the following values:

    NONE
    POLY_BASED_ON_GT
    POLY_BASED_ON_GL

    SAMPLE_SELECTION_MODE  NONE
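    For POLY_BASED_ON_GT, the natural reading is that a site qualifies when at least one of the selected samples carries a non-reference genotype. The sketch below illustrates that reading with hypothetical helpers operating on VCF GT strings. No-call alleles (.) are treated as non-informative, which is an assumption.

```python
import re


def is_nonref_genotype(gt):
    """True if a GT string such as '0/1' or '1|1' carries any ALT allele."""
    alleles = re.split(r"[/|]", gt)
    # '0' is the reference allele and '.' is a no-call; anything else is ALT.
    return any(allele not in ("0", ".") for allele in alleles)


def polymorphic_in_samples(genotypes, selected_samples):
    """True if any selected sample has a non-reference genotype at the site."""
    return any(
        is_nonref_genotype(genotypes[sample])
        for sample in selected_samples
        if sample in genotypes
    )


site_genotypes = {"NA12878": "0/1", "NA12891": "0/0", "NA12892": "./."}
print(polymorphic_in_samples(site_genotypes, ["NA12878"]))
print(polymorphic_in_samples(site_genotypes, ["NA12891", "NA12892"]))
```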


    --samplePNonref / -samplePNonref

    GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
    A P[nonref] threshold for SAMPLE_SELECTION_MODE=POLY_BASED_ON_GL. See the wiki documentation for more information.

    double  0.99  [ [ -∞  ∞ ] ]
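    As a rough illustration of how a P[nonref] value could be derived from genotype likelihoods, the sketch below converts phred-scaled PL values to normalized likelihoods and computes the probability that at least one sample is not homozygous reference. It assumes a flat genotype prior and independent samples, and the function names are hypothetical. The tool's actual calculation may differ.

```python
def p_nonref_from_pls(pls):
    """P(genotype is not hom-ref) for one sample.

    pls holds phred-scaled genotype likelihoods with hom-ref first,
    e.g. [0, 30, 60] for 0/0, 0/1, 1/1. A flat prior is assumed.
    """
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    return 1.0 - likelihoods[0] / sum(likelihoods)


def site_p_nonref(sample_pls):
    """P(at least one sample is non-reference), assuming independent samples."""
    p_all_hom_ref = 1.0
    for pls in sample_pls:
        p_all_hom_ref *= 1.0 - p_nonref_from_pls(pls)
    return 1.0 - p_all_hom_ref


def passes_threshold(sample_pls, threshold=0.99):
    """Compare the site-level probability with a --samplePNonref style cutoff."""
    return site_p_nonref(sample_pls) >= threshold


confident_het = [50, 0, 20]
confident_hom_ref = [0, 30, 60]
print(passes_threshold([confident_hom_ref]))
print(passes_threshold([confident_hom_ref, confident_het]))
```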


    --selectTypeToInclude / -selectType

    Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times
    This argument selects particular kinds of variants (e.g. SNP, INDEL) out of a list. If left unspecified, all types are considered.

    List[Type]  []
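    The type names listed above can be related to REF and ALT alleles with a simplified classifier. The sketch below is an approximation written for illustration: the helper names are hypothetical, and edge cases such as the spanning-deletion allele (*) are not handled the way a full VCF library would handle them.

```python
def allele_type(ref, alt):
    """Classify a single REF/ALT pair (simplified rules)."""
    if alt in (".", ref):
        return "NO_VARIATION"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        # Symbolic alleles such as <DEL> and breakend notation.
        return "SYMBOLIC"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "INDEL"


def variant_type(ref, alts):
    """Classify a record; differing per-allele types give MIXED."""
    types = {allele_type(ref, alt) for alt in alts} or {"NO_VARIATION"}
    return types.pop() if len(types) == 1 else "MIXED"


def keep_record(ref, alts, select_types):
    """Mimic -selectType: an empty selection keeps every type."""
    return not select_types or variant_type(ref, alts) in select_types


print(variant_type("A", ["G"]))
print(variant_type("A", ["G", "AT"]))
print(keep_record("A", ["AT"], {"INDEL"}))
```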


    --variant / -V

    Input VCF file, can be specified multiple times
    The input VCF file

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

    R List[RodBinding[VariantContext]]  NA