uniq - ensure reads are not duplicates

Synopsis

$ dnmtools uniq [OPTIONS] <input-sorted.sam> [out-sorted.sam]

Description

The uniq command removes PCR duplicates. Before calculating methylation level, you should now remove duplicate reads, which in wgbs data are typically identified by their mapping to identical genomic locations. These reads are most likely PCR clones rather than representations of distinct DNA molecules. The command uniq remove such duplicates. It collects duplicate reads and/or fragments that have identical sequences and are mapped to the same genomic location (same chromosome, same start and end positions, and same strand), and chooses a random one to be the representative of the original DNA sequence.

Note As of dnmtools v1.2.5, the option to use the sequence of reads when deciding if two reads are duplicates has been removed. In the context of analyzing bisulfite sequencing reads, this has the danger of introducing bias in downstream analyses. Also, in the same version the test for sorted order of reads cannot be disabled. Empirical tests showed very little improvement to speed when disabling this test.

The uniq command can take reads sorted by (chrom, start, end, strand). If the reads in the input file are not sorted, run the following sort command using samtools:

$ samtools sort -o reads_sorted.bam reads.bam

Next, execute the following command to remove duplicate reads:

$ dnmtools uniq -S duplicate-removal-stats.txt reads_sorted.bam reads_uniq.bam

Options

 -t, -threads

The number of threads to use. These threads are used for I/O, and are most helpful when the input and output are both BAM, where the threads can really speed things up.

 -S, -summary

Save statistics on duplication rates to this file. The statistics are not reported unless a file is specified here. This option is correct as of v1.4.0.

 -hist

Output a histogram of duplication frequencies into the specified file for library complexity analysis.

 -B, -bam

The output is in BAM format. This is an option to help prevent accidentally writing BAM format to the terminal or through a pipe that expects plain text, e.g., SAM.

 -stdout

Write the output to standard out. This is not done by default even without an output file given, because of the danger of writing BAM to the terminal or through a pipe unexpectedly. It is possible to write BAM redirected or through a pipe, but the -stdout argument is required.

 -s, -seed

Random number seed. Affects which read is kept among duplicates. The default seed is 408. This option is typically only used for testing.

 -v, -verbose

Report more information while the program is running.