  • Showing docs for version 3.7-0


    ClipReads

    Read clipping based on quality, position or sequence matching

    Category Sequence Data Processing Tools

    Traversal ReadWalker

    PartitionBy READ


    Overview

    This tool provides simple, powerful read clipping capabilities that allow you to remove low quality strings of bases, sections of reads, and reads containing user-provided sequences.

    There are three options for clipping (quality, position and sequence), which can be used alone or in combination. In addition, you can also specify a clipping representation, which determines exactly how ClipReads applies clips to the reads (soft clips, writing Q0 base quality scores, etc.). Please note that you MUST specify at least one of the three clipping options, and specifying a clipping representation is not sufficient. If you do not specify a clipping option, the program will run but it will not do anything to your reads.
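The requirement above can be checked before launching a run. The following is a minimal sketch in Python, not part of GATK: the function name `has_clipping_option` and its parameter names are hypothetical, and the rule that the quality clipper is only active for values > 0 is taken from the --qTrimmingThreshold argument details on this page.

```python
def has_clipping_option(q_trimming_threshold=-1, cycles_to_trim=None,
                        clip_sequences=(), clip_sequences_file=None):
    """Return True if at least one of the three clipping options is active.

    Hypothetical helper: mirrors -QT, -CT, -X and -XF as plain Python values.
    """
    quality = q_trimming_threshold > 0          # -QT only applies when > 0
    position = bool(cycles_to_trim)             # -CT "1-5,11-15"
    sequence = bool(clip_sequences) or bool(clip_sequences_file)  # -X / -XF
    return quality or position or sequence


# A clipping representation alone is not enough, so nothing is active here.
print(has_clipping_option())
# Any one of the three options is sufficient.
print(has_clipping_option(q_trimming_threshold=10))
```

A wrapper script could refuse to start when this returns False, since the tool would otherwise run without changing any reads.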

    Quality score based clipping
    Clips bases from the read, starting at the position
    \operatorname{argmax}_x \sum_{i = x + 1}^{l} (\mathrm{qTrimmingThreshold} - \mathrm{qual}_i)
    and continuing to the end of the read, where l is the read length and qual_i is the base quality at position i. This approach is copied from BWA. The clipper walks through the read from the end (in machine cycle order) to the beginning, calculating the running sum of qTrimmingThreshold - qual. While doing so, it tracks the maximum value of this sum where the delta > 0. After the loop, clipPoint is either -1 (do nothing) or the clipping index in the read (from the end).
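The running-sum walk can be sketched in a few lines of Python. This is an illustration of the description above, not the GATK source: the function names are hypothetical, qualities are assumed to be Phred+33 encoded, and "maximum where the delta > 0" is interpreted here as "largest strictly positive running sum". The quality string is the one from the read in the detailed clipping examples on this page; because that read is on the reverse strand (flag 16), machine cycle order is assumed to be the reverse of the stored order.

```python
def decode_phred33(qual_string):
    """Convert a Phred+33 quality string into a list of integer qualities."""
    return [ord(c) - 33 for c in qual_string]


def quality_clip_point(quals_in_cycle_order, threshold):
    """Return the cycle-order index where clipping starts, or -1 for no clip."""
    running = 0      # running sum of (threshold - qual), from the last cycle back
    best = 0         # largest positive running sum seen so far
    clip_point = -1  # -1 means "don't do anything"
    for i in range(len(quals_in_cycle_order) - 1, -1, -1):
        running += threshold - quals_in_cycle_order[i]
        if running > best:
            best = running
            clip_point = i
    return clip_point


stored_quals = "#################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA"
cycle_quals = decode_phred33(stored_quals)[::-1]  # reverse-strand read
point = quality_clip_point(cycle_quals, threshold=10)
clipped = len(cycle_quals) - point if point >= 0 else 0
print(clipped)  # 17 bases, the run of '#' (Q2) qualities
```

Under these assumptions the sketch clips 17 bases, which agrees with the 17 clipped positions shown in the detailed examples.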

    Cycle based clipping
    Clips machine cycles from the read. Accepts a string of ranges of the form start1-end1,start2-end2, etc. For each start/end pair, removes bases in machine cycles from start to end, inclusive. These are 1-based values (positions). For example, 1-5,10-12 clips the first 5 bases, and then three bases at cycles 10, 11, and 12.
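To make the range syntax concrete, here is a minimal Python sketch that expands a -CT style string into the individual 1-based cycles it covers. The function names `parse_cycle_ranges` and `cycles_to_clip` are hypothetical and the validation rules are assumptions; GATK's own parsing may differ in how it handles malformed input.

```python
def parse_cycle_ranges(spec):
    """Parse 'start1-end1,start2-end2' into 1-based inclusive (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        start_text, end_text = part.strip().split("-")
        start, end = int(start_text), int(end_text)
        if start < 1 or end < start:
            raise ValueError(f"invalid cycle range: {part!r}")
        ranges.append((start, end))
    return ranges


def cycles_to_clip(spec):
    """Return the sorted list of 1-based machine cycles covered by the spec."""
    cycles = set()
    for start, end in parse_cycle_ranges(spec):
        cycles.update(range(start, end + 1))  # end is inclusive
    return sorted(cycles)


print(cycles_to_clip("1-5,10-12"))  # [1, 2, 3, 4, 5, 10, 11, 12]
```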

    Sequence matching
    Clips bases from the read that exactly match one of a number of base sequences. This employs an exact match algorithm, clipping only bases whose sequence exactly matches SEQ.
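As an illustration of exact matching, the Python sketch below reports every span of a read that matches a given sequence exactly. It is a simplified stand-in rather than the GATK implementation: the function name `find_exact_matches` is hypothetical, overlapping matches are reported separately, and strand handling is deliberately ignored.

```python
def find_exact_matches(read_bases, seq):
    """Return half-open (start, end) spans where seq occurs exactly in read_bases."""
    if not seq:
        raise ValueError("seq must be a non-empty string")
    spans = []
    start = read_bases.find(seq)
    while start != -1:
        spans.append((start, start + len(seq)))
        # Advance by one base so that overlapping matches are also found.
        start = read_bases.find(seq, start + 1)
    return spans


print(find_exact_matches("ACGTCCCCCAGT", "CCCCC"))  # [(4, 9)]
print(find_exact_matches("ACGTCCCACAGT", "CCCCC"))  # [] (one mismatch, no clip)
```

Because the match is exact, a single mismatching base inside the read prevents that stretch from being clipped.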

    Input

    Any number of BAM files.

    Output

    A new BAM file containing all of the reads from the input BAMs with the user-specified clipping operation applied to each read.

    Summary output (console)

         Number of examined reads              13
         Number of clipped reads               13
         Percent of clipped reads              100.00
         Number of examined bases              988
         Number of clipped bases               126
         Percent of clipped bases              12.75
         Number of quality-score clipped bases 126
         Number of range clipped bases         0
         Number of sequence clipped bases      0
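The percentage rows in this summary follow directly from the counts. A small Python check, using only the numbers shown above (the helper name `percent` is hypothetical):

```python
def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


print(f"{percent(13, 13):.2f}")    # reads:  100.00
print(f"{percent(126, 988):.2f}")  # bases:  12.75
```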
         

    Usage example

       java -jar GenomeAnalysisTK.jar \
         -T ClipReads \
         -R reference.fasta \
         -I original.bam \
         -o clipped.bam \
         -XF seqsToClip.fasta \
         -X CCCCC \
         -CT "1-5,11-15" \
         -QT 10
     

    The command line shown above will apply all three options in combination. See the detailed examples below to see how the choice of clipping representation affects the output.

    Detailed clipping examples

    Suppose we are given this read:

         314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
              TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
              #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
         

    If we are clipping reads with -QT 10 and -CR WRITE_NS, we get:

         314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
              NNNNNNNNNNNNNNNNNTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
              #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
         

    Whereas with -QT 10 -CR WRITE_Q0S:

         314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
              TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
              !!!!!!!!!!!!!!!!!4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
         

    Or -QT 10 -CR SOFTCLIP_BASES:

         314KGAAXX090507:1:19:1420:1123#0        16      chrM    3133    29      17S59M  *       *       *
              TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
              #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
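The three outputs above can be reproduced with a short Python sketch. It is a simplified model, not GATK code: the function name `apply_leading_clip` is hypothetical, it only handles a clip at the start of the stored read, it assumes the CIGAR is a single M block, and it covers only WRITE_NS, WRITE_Q0S, WRITE_NS_Q0S and SOFTCLIP_BASES.

```python
BASES = "TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA"
QUALS = "#################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA"


def apply_leading_clip(pos, bases, quals, n, representation):
    """Clip the first n stored bases; return (pos, cigar, bases, quals)."""
    supported = {"WRITE_NS", "WRITE_Q0S", "WRITE_NS_Q0S", "SOFTCLIP_BASES"}
    if representation not in supported:
        raise ValueError(f"unsupported representation: {representation}")
    cigar = f"{len(bases)}M"
    if representation in ("WRITE_NS", "WRITE_NS_Q0S"):
        bases = "N" * n + bases[n:]
    if representation in ("WRITE_Q0S", "WRITE_NS_Q0S"):
        quals = "!" * n + quals[n:]  # '!' encodes Q0 in Phred+33
    if representation == "SOFTCLIP_BASES":
        cigar = f"{n}S{len(bases) - n}M"
        pos += n  # alignment start moves past the soft-clipped bases
    return pos, cigar, bases, quals


for rep in ("WRITE_NS", "WRITE_Q0S", "SOFTCLIP_BASES"):
    pos, cigar, bases, quals = apply_leading_clip(3116, BASES, QUALS, 17, rep)
    print(rep, pos, cigar)
    print(bases)
    print(quals)
```

With n = 17 the sketch yields the same position, CIGAR, bases and qualities as the three examples: Ns over the first 17 bases, '!' over the first 17 qualities, or a 17S59M CIGAR starting at 3133.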
         

    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by ClipReads.

    Downsampling settings

    This tool does not apply any downsampling by default.


    Command-line Arguments

    Engine arguments

    All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

    ClipReads specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s)             Default value   Summary

    Optional Outputs
    --out, -o                    stdout          Write BAM output here
    --outputStatistics, -os      NA              File to output statistics

    Optional Parameters
    --clipRepresentation, -CR    WRITE_NS        How should we actually clip the bases?
    --clipSequence, -X           NA              Remove sequences within reads matching this sequence
    --clipSequencesFile, -XF     NA              Remove sequences within reads matching the sequences in this FASTA file
    --cyclesToTrim, -CT          NA              String indicating machine cycles to clip from the reads
    --qTrimmingThreshold, -QT    -1              If provided, the Q-score clipper will be applied

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


    --clipRepresentation / -CR

    How should we actually clip the bases?
    The value of this argument determines how ClipReads applies clips to the reads. This can range from writing Ns over the clipped bases to hard clipping away the bases from the BAM.

    The --clipRepresentation argument is an enumerated type (ClippingRepresentation), which can have one of the following values:

    WRITE_NS
    Clipped bases are changed to Ns
    WRITE_Q0S
    Clipped bases are changed to have Q0 quality score
    WRITE_NS_Q0S
    Clipped bases are changed to have both an N base and a Q0 quality score
    SOFTCLIP_BASES
    Change the read's cigar string to soft clip (S, see sam-spec) away the bases. Note that this can only be applied to cases where the clipped bases occur at the start or end of a read.
    HARDCLIP_BASES
    WARNING: THIS OPTION IS STILL UNDER DEVELOPMENT AND IS NOT SUPPORTED. Change the read's cigar string to hard clip (H, see sam-spec) away the bases. Hard clipping, unlike soft clipping, actually removes bases from the read, reducing the resulting file's size but introducing an irreversible (i.e., lossy) operation. Note that this can only be applied to cases where the clipped bases occur at the start or end of a read.
    REVERT_SOFTCLIPPED_BASES
    Turn all soft-clipped bases into matches

    ClippingRepresentation  WRITE_NS


    --clipSequence / -X

    Remove sequences within reads matching this sequence
    Clips bases from the reads that match the provided SEQ. Can be provided any number of times on the command line.

    String[]  NA


    --clipSequencesFile / -XF

    Remove sequences within reads matching the sequences in this FASTA file
    Reads the sequences in the provided FASTA file, and clips any bases that exactly match any of the sequences in the file.

    String  NA


    --cyclesToTrim / -CT

    String indicating machine cycles to clip from the reads
    Clips machine cycles from the read. Accepts a string of ranges of the form start1-end1,start2-end2, etc. For each start/end pair, removes bases in machine cycles from start to end, inclusive. These are 1-based values (positions). For example, 1-5,10-12 clips the first 5 bases, and then three bases at cycles 10, 11, and 12.

    String  NA


    --out / -o

    Write BAM output here
    The output SAM/BAM file will be written here

    GATKSAMFileWriter  stdout


    --outputStatistics / -os

    File to output statistics
    If provided, ClipReads will write summary statistics about the clipping operations applied to the reads in this file.

    PrintStream  NA


    --qTrimmingThreshold / -QT

    If provided, the Q-score clipper will be applied
    If a value > 0 is provided, then the quality score based read clipper will be applied to the reads using this quality score threshold.

    int  -1  [ [ -∞  ∞ ] ]