  • Showing docs for version 3.7-0


    GenotypeConcordance

    Genotype concordance between two callsets

    Category Variant Evaluation Tools

    Traversal LocusWalker

    PartitionBy LOCUS


    Overview

    This tool takes two callsets (VCFs) and tabulates the number of sites that overlap and share alleles and, for each sample, the genotype-by-genotype counts (e.g. the number of sites at which a sample was called homozygous-reference in the EVAL callset but homozygous-variant in the COMP callset). It outputs these counts as well as convenient proportions (such as the proportion of het calls in the EVAL callset that were called REF in the COMP callset) and metrics such as non-reference discrepancy (NRD) and non-reference sensitivity (NRS).

    Input

    Genotype concordance requires two callsets (as it does a comparison): an EVAL and a COMP callset, specified via the -eval and -comp arguments. Typically, the EVAL callset is an experimental set you want to evaluate, while the COMP callset is a previously existing set used as a standard for comparison (taken to represent "truth").

    (Optional) JEXL expressions for genotype-level filtering of EVAL or COMP genotypes, specified via the -gfe and -gfc arguments, respectively.

    Output

    Genotype Concordance writes a GATK report to the specified file (via -o), consisting of multiple tables of counts and proportions. These tables are constructed on a per-sample basis, and include counts of EVAL vs COMP genotype states.

    Tables

    Headers for the (non-moltenized; see below) GenotypeConcordance counts and proportions tables give the genotype of the EVAL callset followed by the genotype of the COMP callset. For example, the value corresponding to HOM_REF_HET reflects variants called HOM_REF in the EVAL callset and HET in the COMP callset. Variants for which the alternate alleles of the EVAL and COMP samples did not match are excluded from genotype comparisons and counted in the "Mismatching_Alleles" field.

    It may be informative to reshape rows of the GenotypeConcordance counts and proportions tables into separate row-major tables where the columns indicate the COMP genotype and the rows indicate the EVAL genotype for easy comparison between the two callsets. This can be done with the gsa.reshape.concordance.table function in the gsalib R library. In Excel this can be accomplished using the OFFSET function.
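
    As an illustration of the reshaping described above, here is a minimal Python sketch. It is not the gsalib function; the helper names (split_header, reshape_row), the list of genotype states, and the counts are all assumptions made up for this example, and a real report may contain further states.

```python
# Genotype states assumed for this sketch; a real report may list others.
STATES = ["NO_CALL", "HOM_REF", "HET", "HOM_VAR", "UNAVAILABLE"]


def split_header(header):
    """Split a header such as 'HOM_REF_HET' into (eval_state, comp_state)."""
    for state in STATES:
        prefix = state + "_"
        rest = header[len(prefix):]
        if header.startswith(prefix) and rest in STATES:
            return state, rest
    raise ValueError(f"unrecognized header: {header}")


def reshape_row(row):
    """Turn one flat per-sample row into a nested table[eval][comp]."""
    table = {e: {c: 0 for c in STATES} for e in STATES}
    for header, value in row.items():
        eval_state, comp_state = split_header(header)
        table[eval_state][comp_state] = value
    return table


# Made-up counts for a single sample, keyed by EVAL_COMP header.
row = {"HOM_REF_HOM_REF": 90, "HOM_REF_HET": 3, "HET_HET": 40, "NO_CALL_HOM_VAR": 1}
table = reshape_row(row)

# Rows are EVAL genotypes, columns are COMP genotypes.
print("EVAL\\COMP".ljust(12) + "".join(c.ljust(12) for c in STATES))
for eval_state in STATES:
    cells = "".join(str(table[eval_state][c]).ljust(12) for c in STATES)
    print(eval_state.ljust(12) + cells)
```

    Cells that are absent from the flat row default to 0 in this sketch, which may or may not be appropriate for proportion tables.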

    Term and metrics definitions

    Site-level allelic concordance

    For strictly bi-allelic VCFs, only the ALLELES_MATCH, EVAL_ONLY, TRUTH_ONLY fields will be populated, but where multi-allelic sites are involved counts for EVAL_SUBSET_TRUTH and EVAL_SUPERSET_TRUTH will be generated.

    For example, in the following situation

        eval:  ref - A   alt - C
        comp:  ref - A   alt - C,T
      
    then the site is tabulated as EVAL_SUBSET_TRUTH. Were the situation reversed, it would be EVAL_SUPERSET_TRUTH. However, in the case where EVAL has both C and T alternate alleles, both must be observed in the genotypes (that is, there must be at least one of (0/1,1/1) and at least one of (0/2,1/2,2/2) in the genotype field). If one of the alleles has no observations in the genotype fields of the EVAL, the site-level concordance is tabulated as though that allele were not present in the record.
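
    The site-level rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, genotypes are plain strings such as "0/1", the observed-allele rule is applied to EVAL only (as the text describes), and the "OTHER" label for allele sets that are neither equal nor nested is a placeholder invented for this sketch.

```python
def observed_alts(alts, genotypes):
    """Return the alt alleles that appear in at least one genotype.

    alts: list of alt alleles, e.g. ["C", "T"] (allele indices 1, 2, ...).
    genotypes: list of strings such as "0/1"; "." marks a missing allele.
    """
    seen = set()
    for gt in genotypes:
        for idx in gt.replace("|", "/").split("/"):
            if idx.isdigit() and int(idx) > 0:
                seen.add(int(idx))
    return {alt for i, alt in enumerate(alts, start=1) if i in seen}


def classify_site(eval_alts, comp_alts):
    """Compare two sets of alt alleles and return a site-level label."""
    if not eval_alts and not comp_alts:
        return None  # nothing discovered in either callset
    if not eval_alts:
        return "TRUTH_ONLY"
    if not comp_alts:
        return "EVAL_ONLY"
    if eval_alts == comp_alts:
        return "ALLELES_MATCH"
    if eval_alts < comp_alts:
        return "EVAL_SUBSET_TRUTH"
    if eval_alts > comp_alts:
        return "EVAL_SUPERSET_TRUTH"
    return "OTHER"  # hypothetical placeholder, not a label from the text


# The example from the text: eval has alt C, comp has alts C and T.
print(classify_site({"C"}, {"C", "T"}))

# EVAL lists C and T, but T is never observed in a genotype,
# so the site is treated as if EVAL only had C.
eval_seen = observed_alts(["C", "T"], ["0/1", "0/0", "1/1"])
print(classify_site(eval_seen, {"C", "T"}))
```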

    Monomorphic Records

    A site which has an alternate allele, but which is monomorphic in samples, is treated as not having been discovered, and will be recorded in the TRUTH_ONLY column (if a record exists in the COMP set), or not at all (if no record exists in the COMP set).

    That is, the situation

       eval:  ref - A   alt - C   genotypes - 0/0  0/0  0/0 ... 0/0
       comp:  ref - A   alt - C   ...         0/0  0/0  ...
      
    is equivalent to
       eval:  ref - A   alt - .   genotypes - 0/0  0/0  0/0 ... 0/0
       comp:  ref - A   alt - C   ...         0/0  0/0  ...
      

    When a record is present in the COMP set, the *genotypes* for the monomorphic site will still be used to evaluate per-sample genotype concordance counts.

    Filtered Records

    Filtered records are treated as though they were not present in the VCF, unless --ignoreFilters is provided, in which case all records are used. There is currently no way to assess concordance metrics on filtered sites exclusively. As a workaround, SelectVariants can be used to extract filtered sites and VariantFiltration to un-filter them.

    Moltenized tables

    These tables may be optionally moltenized via the -moltenize argument. That is, the standard table

      Sample   NO_CALL_HOM_REF  NO_CALL_HET  NO_CALL_HOM_VAR   (...)
      NA12878       0.003        0.001            0.000        (...)
      NA12891       0.005        0.000            0.000        (...)
      
    would instead be displayed
      NA12878  NO_CALL_HOM_REF   0.003
      NA12878  NO_CALL_HET       0.001
      NA12878  NO_CALL_HOM_VAR   0.000
      NA12891  NO_CALL_HOM_REF   0.005
      NA12891  NO_CALL_HET       0.000
      NA12891  NO_CALL_HOM_VAR   0.000
      (...)
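
    The transformation from the standard table to the moltenized one can be sketched in Python as below. The moltenize function here is a hypothetical helper written for illustration, not part of GATK; the values are the ones from the example tables above.

```python
def moltenize(header, rows):
    """Turn a wide table into (sample, column_name, value) triples."""
    _, *value_cols = header  # the first column holds the sample name
    molten = []
    for row in rows:
        sample, *values = row
        for name, value in zip(value_cols, values):
            molten.append((sample, name, value))
    return molten


header = ["Sample", "NO_CALL_HOM_REF", "NO_CALL_HET", "NO_CALL_HOM_VAR"]
rows = [
    ["NA12878", 0.003, 0.001, 0.000],
    ["NA12891", 0.005, 0.000, 0.000],
]

molten = moltenize(header, rows)
for sample, name, value in molten:
    print(f"{sample}  {name:<16}  {value:.3f}")
```

    Each input row of N value columns becomes N output rows, which is why the moltenized report is much longer than the standard one.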
      

    Usage example

     java -jar GenomeAnalysisTK.jar \
       -T GenotypeConcordance \
       -R reference.fasta \
       -eval test_set.vcf \
       -comp truth_set.vcf \
       -o output.grp
     

    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by GenotypeConcordance.


    Command-line Arguments

    Engine arguments

    All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument lets you apply certain read filters to exclude some of the data from the analysis.

    GenotypeConcordance specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s)                       Default value   Summary

    Required Inputs
    --comp                                 NA              The variants and genotypes to compare against
    --eval                                 NA              The variants and genotypes to evaluate

    Optional Outputs
    --out, -o                              stdout          An output file created by the walker. Will overwrite contents if file exists

    Optional Parameters
    --genotypeFilterExpressionComp, -gfc   []              One or more criteria to use to set COMP genotypes to no-call. These genotype-level filters are only applied to the COMP rod.
    --genotypeFilterExpressionEval, -gfe   []              One or more criteria to use to set EVAL genotypes to no-call. These genotype-level filters are only applied to the EVAL rod.
    --printInterestingSites, -sites        NA              File to output the discordant sites and genotypes.

    Optional Flags
    --ignoreFilters                        false           Filters will be ignored
    --moltenize                            false           Molten rather than tabular output

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


    --comp / -comp

    The variants and genotypes to compare against
    The callset you want to treat as 'truth'. Can also be of unknown quality for the sake of callset comparisons.

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

    R RodBinding[VariantContext]  NA


    --eval / -eval

    The variants and genotypes to evaluate
    The callset you want to evaluate. Typically, this is where you'd put 'unassessed' callsets.

    This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

    R RodBinding[VariantContext]  NA


    --genotypeFilterExpressionComp / -gfc

    One or more criteria to use to set COMP genotypes to no-call. These genotype-level filters are only applied to the COMP rod.
    Identical to -gfe except the filter is applied to genotypes in the comp rod.

    ArrayList[String]  []


    --genotypeFilterExpressionEval / -gfe

    One or more criteria to use to set EVAL genotypes to no-call. These genotype-level filters are only applied to the EVAL rod.
    A genotype level JEXL expression to apply to eval genotypes. Genotypes filtered in this way will be replaced by NO_CALL. For instance: -gfe 'GQ<20' will set to no-call any genotype with genotype quality less than 20.

    ArrayList[String]  []
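
    To illustrate the effect of a genotype-level filter such as -gfe 'GQ<20', here is a minimal Python sketch. It does not evaluate JEXL; a plain Python predicate stands in for the expression, and the function name, the dictionary layout, and the sample values are assumptions made up for this example.

```python
def apply_genotype_filter(genotypes, should_filter):
    """Replace every genotype matching the predicate with NO_CALL.

    genotypes: dict mapping sample name to {"state": ..., "GQ": ...}.
    should_filter: function returning True for genotypes to no-call.
    """
    return {
        sample: "NO_CALL" if should_filter(gt) else gt["state"]
        for sample, gt in genotypes.items()
    }


# Made-up genotypes for two samples at one site.
genotypes = {
    "NA12878": {"state": "HET", "GQ": 35},
    "NA12891": {"state": "HOM_VAR", "GQ": 12},
}

# Python stand-in for the JEXL expression 'GQ<20'.
filtered = apply_genotype_filter(genotypes, lambda gt: gt["GQ"] < 20)
print(filtered)
```

    Genotypes that become NO_CALL this way are then counted in the NO_CALL rows or columns of the concordance tables rather than under their original state.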


    --ignoreFilters / NA

    Filters will be ignored
    The FILTER field of the eval and comp VCFs will be ignored. If this flag is not included, all filtered sites will be treated as not being present in the VCF. (That is, the genotypes will be assigned UNAVAILABLE, as distinct from NO_CALL.)

    boolean  false


    --moltenize / -moltenize

    Molten rather than tabular output
    Moltenize the count and proportion tables. Rather than arranging per-sample data as one wide row, the table is fully moltenized into individual elements. That is, WITHOUT this argument, each row of the table begins with the sample name and proceeds directly with the counts/proportions for each eval/comp genotype pair (for instance HOM_REF/HOM_REF, HOM_REF/NO_CALL). If the -moltenize argument is given, each row begins with a sample name, followed by the contrastive genotype type (such as HOM_REF/HOM_REF), followed by the count or proportion. This significantly increases the number of rows.

    boolean  false


    --out / -o

    An output file created by the walker. Will overwrite contents if file exists

    PrintStream  stdout


    --printInterestingSites / -sites

    File to output the discordant sites and genotypes.
    Print sites where genotypes are mismatched between callsets, along with annotations giving the genotype of each callset, to the given file.

    PrintStream  NA