format - prepare SAM files for dnmtools analysis

Synopsis

$ dnmtools format [OPTIONS] -f <mapper> <input.bam> [output.bam]

Description

Not all bisulfite sequencing mappers use the same format for their output files, so the SAM output generated by a mapper needs some formatting before it can be analyzed. The first formatting step is to merge paired-end mates into single-end entries. This is particularly important for quantifying methylation: when the two mates of a fragment overlap, the overlapping bases must be counted only once, and both mates must be treated as originating from the same allele. Both requirements are met by merging the mates into a single entry. SAM/BAM files generated by abismal, Bismark and BSMAP can be formatted using the format command.
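To illustrate why merging matters, here is a minimal Python sketch of the coordinate arithmetic only. It is not the dnmtools implementation; the function name merge_mates and the 0-based, half-open interval convention are assumptions made for this example.

```python
def merge_mates(start1, end1, start2, end2):
    """Merge two mate intervals (0-based, half-open) into one fragment.

    Returns (fragment_start, fragment_end, overlap_length).
    Hypothetical helper for illustration; not part of dnmtools.
    """
    frag_start = min(start1, start2)
    frag_end = max(end1, end2)
    # Overlap is zero when the mates do not touch.
    overlap = max(0, min(end1, end2) - max(start1, start2))
    return frag_start, frag_end, overlap


# Mates covering [100, 200) and [150, 250) share 50 bases.
frag_start, frag_end, overlap = merge_mates(100, 200, 150, 250)

# Counting each mate separately visits the shared bases twice.
bases_counted_separately = (200 - 100) + (250 - 150)
# The merged fragment visits every base exactly once.
bases_counted_merged = frag_end - frag_start
```

In this example the two mates contribute 200 base observations when counted separately but only 150 after merging, so the 50 overlapping bases no longer count twice toward methylation estimates.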

An example use of this command to format a mapped reads file is:

dnmtools format -f abismal input.bam output.sam

Above, the file input.bam would have been generated by abismal. The file output.sam is the output; an output file is required unless the -stdout argument is specified (see below). Another example:

dnmtools format -f abismal -t 8 -B input.bam output.bam

This will use 8 threads because of the -t 8 and will produce output in BAM format because of the -B flag. The output format is determined by the flag, not by the extension of the output filename.

Note: As of dnmtools v1.2.5, there is no longer a "buffer size" argument, because it introduced arbitrary behavior. Now format assumes reads are sorted by read name, which should ensure that mates in paired-end sequencing are consecutive in the file. No "buffer" is needed, and improperly formatted input is more easily detected.
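The name-sorted assumption can be checked before running format. The sketch below is a hypothetical Python helper, not something dnmtools provides. It assumes that every read has both mates present and that mate names differ only in a fixed-length suffix such as "/1" and "/2".

```python
def mates_are_consecutive(names, suffix_len):
    """Check that read names come in adjacent mate pairs.

    names: read names in file order.
    suffix_len: number of trailing characters that distinguish the
        two ends (an assumption of this sketch, e.g. 2 for "/1", "/2").
    """
    if len(names) % 2 != 0:
        return False  # an unpaired read is left over

    def base(name):
        # Strip the end-specific suffix; works for suffix_len == 0 too.
        return name[:len(name) - suffix_len]

    for i in range(0, len(names), 2):
        if base(names[i]) != base(names[i + 1]):
            return False
    return True


sorted_by_name = ["r1/1", "r1/2", "r2/1", "r2/2"]
sorted_by_position = ["r1/1", "r2/1", "r1/2", "r2/2"]
```

Here sorted_by_name passes the check with suffix_len=2, while sorted_by_position fails because the mates of r1 are separated by a read from r2.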

Options

-f, -format

This option indicates the format of the input SAM file, corresponding to the mapper that generated it (options: abismal, bsmap, bismark).

-t, -threads

The number of threads to use. These threads are used for I/O and help most when both the input and the output are BAM.

-B, -bam

The output is in BAM format. This is an option to help prevent accidentally writing BAM format to the terminal or through a pipe that expects plain text, e.g., SAM.

-stdout

Write the output to standard output. This is not done by default, even when no output file is given, because of the danger of unexpectedly writing BAM to the terminal or through a pipe. BAM output can still be redirected or piped, but the -stdout argument is required to do so.

-s, -suff

The length of the suffix for read names, which indicates whether the read is from end 1 or end 2 for paired-end reads. If this is not specified, but the data is paired-end (i.e., the flag -single-end is not used; see below), then the length of this suffix is inferred.

-single-end

Using this argument tells format not to look for mates to merge as a single fragment. The default assumption is that the data is paired-end and that mates are consecutive in the input.

-L, -max-frag

The maximum allowed insert size in base pairs (default: unlimited). Normally this parameter is determined during read mapping, but format can also reject mates that are on opposite strands of the same chromosome but map more than this many bases apart.
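The rule above can be sketched in Python as follows. The Mate class and the function too_far_apart are hypothetical names for this example. Measuring the distance as the outer span of the two mates is also an assumption; the exact measure used by format is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Mate:
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    reverse: bool   # True if mapped to the reverse strand


def too_far_apart(a, b, max_frag):
    """Return True if the pair matches the rejection rule described above:
    same chromosome, opposite strands, and spanning more than max_frag bases.
    """
    if a.chrom != b.chrom or a.reverse == b.reverse:
        return False  # the rule as described does not cover these cases
    # Assumption: "apart" is measured as the outer span of both mates.
    span = max(a.end, b.end) - min(a.start, b.start)
    return span > max_frag


near = (Mate("chr1", 100, 200, False), Mate("chr1", 300, 400, True))
far = (Mate("chr1", 100, 200, False), Mate("chr1", 5000, 5100, True))
```

With max_frag set to 1000, the pair named near spans 300 bases and is kept, while the pair named far spans 5000 bases and matches the rejection rule.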

-S, -sam

The input follows SAM standards for orientation, where reads that map to the reverse complement of the reference genome are stored as their reverse complement in the SAM/BAM file.
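For readers unfamiliar with this convention, the following Python sketch shows what storing a read "as its reverse complement" means. The function name reverse_complement is chosen for this example and is not part of dnmtools.

```python
# Translation table for the complement of each base; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]


# A read sequenced as ACGTT that maps to the reverse strand would be
# stored under the SAM convention as its reverse complement.
as_sequenced = "ACGTT"
as_stored = reverse_complement(as_sequenced)
```

Applying the function twice returns the original sequence, which is why a tool only needs to know which convention the input follows in order to recover the read as sequenced.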

-F, -force

This option "forces" the format command to process paired-end reads even if it is unable to detect mates. Without this argument, failure to detect mates will cause format to terminate. This option is useful, for example, if the reads were paired-end, but the second end is of such low quality that only reads from the first end were mapped. In a data analysis pipeline, it might not be apparent that one of the two ends failed entirely, so providing this option can help. If you are only analyzing a small number of data sets, you probably want to be made aware of this problem rather than force it to be ignored.

-v, -verbose

Report more information while the program is running.