format - prepare SAM files for dnmtools analysis
Synopsis
$ dnmtools format [OPTIONS] -f <mapper> <input.bam> [output.bam]
Description
Bisulfite sequencing mappers do not all use the same output format,
so the SAM/BAM output of a mapper must be formatted before it can be
analyzed. The first formatting step is to merge paired-end mates into
single-end entries. This is particularly important when quantifying
methylation: bases where the two mates of a fragment overlap must be
counted only once, and both mates must be treated as originating from
the same allele. Merging the mates into a single entry ensures both.
SAM/BAM files generated by abismal, Bismark and BSMAP can be formatted
using the format command.
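To illustrate why merging matters, here is a minimal sketch in Python of how a merged fragment counts overlapping bases once. It is not the dnmtools implementation; the function names and the half-open (start, end) coordinates are assumptions made for this example only.

```python
def merged_span(mate1, mate2):
    """Return the (start, end) reference interval spanned by both mates.

    Each mate is a hypothetical half-open interval (start, end) on the
    same chromosome.
    """
    return min(mate1[0], mate2[0]), max(mate1[1], mate2[1])


def covered_bases(mate1, mate2):
    """Count reference positions covered by the pair, overlap counted once."""
    start, end = merged_span(mate1, mate2)
    # Positions between the mates that neither mate covers (0 if they overlap).
    gap = max(0, max(mate1[0], mate2[0]) - min(mate1[1], mate2[1]))
    return (end - start) - gap


def naive_bases(mate1, mate2):
    """Count bases as if the mates were independent reads."""
    return (mate1[1] - mate1[0]) + (mate2[1] - mate2[0])
```

For two 100-base mates at (100, 200) and (150, 250), treating the mates independently counts the 50 overlapping positions twice, while the merged view counts each covered position once.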
An example use of the format command on a file of mapped reads is:
dnmtools format -f abismal input.bam output.sam
Above, the file input.bam would have been generated by abismal.
The file output.sam is the output; an output file is required
here unless the -stdout argument is specified (see below). Another
example:
dnmtools format -f abismal -t 8 -B input.bam output.bam
This uses 8 threads because of -t 8 and writes its output in BAM
format because of the -B flag. The output format is determined by the
flag, not by the name of the output file.
Note: As of dnmtools v1.2.5, there is no longer a "buffer size"
argument, because it introduced arbitrary behavior. Now format assumes
reads are sorted by read name, which should ensure that mates in
paired-end sequencing are consecutive in the file. No "buffer" is
needed, and input that does not conform is more easily detected.
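The name-sorted assumption can be pictured with a small Python sketch that checks whether read names arrive in adjacent mate pairs. This is only an illustration, not how format itself validates input: the function names are hypothetical, and it makes the simplifying assumption that every read has a mapped mate and that mate names differ only in a fixed-length suffix. Coordinate-sorted input would typically need to be re-sorted by name first, for example with samtools sort -n.

```python
def base_name(name, suffix_len):
    """Strip a fixed-length end suffix (e.g. '1' or '2') from a read name."""
    return name[:max(0, len(name) - suffix_len)]


def mates_are_consecutive(names, suffix_len):
    """Return True if names form adjacent pairs with the same base name.

    Simplifying assumption: every read has a mate, so the number of
    names is even and pairs occupy positions (0, 1), (2, 3), and so on.
    """
    if len(names) % 2 != 0:
        return False
    for i in range(0, len(names), 2):
        if base_name(names[i], suffix_len) != base_name(names[i + 1], suffix_len):
            return False
    return True
```

With names such as read1/1, read1/2, read2/1, read2/2 and a suffix length of 1, the check passes; if the same names are interleaved as in coordinate-sorted data, it fails.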
Options
-f, -format
This option indicates the format of the input SAM/BAM file, which corresponds to the mapper that generated it (options: abismal, bsmap, bismark).
-t, -threads
The number of threads to use. These threads are used for I/O, and are most helpful when the input and output are both BAM.
-B, -bam
Write the output in BAM format. BAM output must be requested explicitly, which helps prevent accidentally writing BAM to the terminal or through a pipe that expects plain text, e.g., SAM.
-stdout
Write the output to standard output. This is not done by default, even
when no output file is given, because of the danger of unexpectedly
writing BAM to the terminal or through a pipe. BAM output can be
redirected or piped, but the -stdout argument is required to do so.
-s, -suff
The length of the suffix for read names, which indicates whether the
read is from end 1 or end 2 for paired-end reads. If this is not
specified, but the data is paired-end (i.e., the flag -single-end is
not used; see below), then the length of this suffix is inferred.
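As a rough picture of what inferring a suffix length could look like, the Python sketch below compares the names of two reads known to be mates and measures where they stop agreeing. This is a hypothetical heuristic for illustration; the source does not describe how format performs the inference, and the function name is an assumption.

```python
import os


def infer_suffix_len(name1, name2):
    """Guess the end-suffix length from the names of two known mates.

    Returns None when the names differ in length, since this simple
    heuristic cannot handle that case.
    """
    if len(name1) != len(name2):
        return None
    if name1 == name2:
        return 0  # mates share an identical name, so there is no suffix
    # commonprefix compares the strings character by character.
    prefix = os.path.commonprefix([name1, name2])
    return len(name1) - len(prefix)
```

The heuristic is only meaningful when the two names really belong to mates; two unrelated reads would yield a misleading suffix length.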
-single-end
Using this argument tells format not to look for mates to merge as a
single fragment. The default assumption is that data is paired-end
and that mates are consecutive in the input.
-L, -max-frag
The maximum allowed insert size in base-pairs (default:
unlimited). Normally this parameter is determined during read mapping,
but format can also reject reads that map to opposite strands of the
same chromosome but more than this many bases apart.
-S, -sam
The input follows SAM standards for orientation, where reads that map to the reverse complement of the reference genome are stored as their reverse complement in the SAM/BAM file.
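For readers unfamiliar with this convention, the Python sketch below shows what storing a read "as its reverse complement" means for the sequence itself. The helper name is an assumption made for this example; it only illustrates the operation and is not part of dnmtools.

```python
# Translation table for DNA bases; N stays N, and case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is why either orientation can be recovered from the other as long as the strand of the mapping is known.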
-F, -force
This option "forces" the format command to process paired-end reads
even if it is unable to detect mates. Without this argument, failure
to detect mates will cause format to terminate. This option is
useful, for example, if the reads were paired-end, but the second
end is of such low quality that only reads from the first end were
mapped. In a data analysis pipeline, it might not be apparent that one
of two ends failed entirely, so providing this option can help. If you
are only analyzing a small number of data sets, you probably want to
be made aware of this problem rather than force it to be ignored.
-v, -verbose
Report more information while the program is running.
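When format is called from a pipeline script, the options above can be assembled programmatically. The following Python sketch builds an argument list using only flags documented on this page; the function name and its parameters are hypothetical, and the sketch covers only a subset of the options.

```python
def build_format_command(mapper, input_path, output_path=None, threads=None,
                         bam=False, to_stdout=False, single_end=False):
    """Build an argument list for 'dnmtools format' (illustrative helper)."""
    if mapper not in ("abismal", "bsmap", "bismark"):
        raise ValueError("mapper must be abismal, bsmap or bismark")
    # An output file is required unless -stdout is given.
    if output_path is None and not to_stdout:
        raise ValueError("give an output file or set to_stdout=True")
    cmd = ["dnmtools", "format", "-f", mapper]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if bam:
        cmd.append("-B")
    if to_stdout:
        cmd.append("-stdout")
    if single_end:
        cmd.append("-single-end")
    cmd.append(input_path)
    if output_path is not None:
        cmd.append(output_path)
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Building a list rather than a shell string avoids quoting problems with unusual file names.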