states - Allele-specific methylation file format
Synopsis
$ dnmtools states [OPTIONS] <input.sam>
Description
All programs that calculate statistics related to ASM must take the
linked states of CpG sites within reads into account. Using full read
sequences for this purpose is inefficient, so we defined an
intermediate format, "epiread," for this purpose. The states command
will convert a BAM or SAM file of mapped reads into a "states" file in
the format used by amrfinder and amrtester.
The epiread format consists of three columns. The first column is the
chromosome name for the mapped read, the second is the "index" of the
first CpG in the read. The index x indicates that the first CpG site
in the read corresponds to the x'th (starting from 0) CpG site in
the chromosome. Therefore, these are not nucleotide positions in the
genome. The final column in the epiread format is the sequence of
methylation states within the read. This sequence of states is
composed of 3 possible letters: C if the corresponding letter at that
CpG site in the mapped read is a C, and similar for T. Within this
state sequence, letters in mapped reads at positions corresponding to
CpG sites that are neither C nor T are encoded as N. Aside from the
"N" this is effectively a binary encoding of methylation states.
Here is an example showing how some lines of an epiread format file might look:
chr1 1460 CCCCCCCC
chr1 1460 CCC
chr1 1461 TCTTNNNNTTCT
chr1 1468 CCCC
chr1 1469 CCC
chr1 1469 CCCT
chr1 1469 CCC
chr1 1469 CCCCCCT
chr1 1469 CCC
chr1 1470 CCCC
chr1 1471 CCCNNNNNNTCCC
chr1 1472 CCC
Those epireads with the "N" in the middle correspond to paired-end
reads with ends that are joined. It is important to use these as one
fragment because linking methylation states within a fragment, over as
large a distance as possible, helps the inference methods within both
amrfinder and amrtester.
The following is an example of how to run the states command:
$ dnmtools states -c /path/to/genome.fa -o output.epiread input.sam
Options
-o, -output
The name of the output file.
-c, -chrom
FASTA file of chromosomes containing FASTA files [required].
-v, -verbose
Print information to the terminal while the program runs.
-z, -zip
Write output in gzip compressed format.