sym - collapse counts for symmetric CpG sites
Synopsis
$ dnmtools sym [OPTIONS] <input.meth>
Description
Many of our tools were designed for data from vertebrate species. In these
species, the methylation levels at CpG sites tend to be symmetric, that is,
the same on each strand. Of course there are exceptions, but in many
analysis settings, combining data from both strands for the same CpG
site is a good idea. Assume you have output from
counts. The sym command will merge the data on both strands
for each CpG site. It takes files in the same format as output by
counts, containing either all cytosines or CpGs only (generated with the -n
option when running counts).
$ dnmtools sym -o human_esc_CpG.meth human_esc.meth
The above command will merge all CpG pairs while also discarding sites with an indication that the CpG has mutated. Note that if either site of a pair is mutated, the whole pair is discarded. This is the default mode. If you want to keep those mutated sites, run the following:
$ dnmtools sym -m -o human_esc_CpG.meth human_esc.meth
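The discard rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used by sym: the function names is_mutated and keep_pair are hypothetical, and the sketch relies on the convention given under the -m option that a mutated entry has an "x" terminating the fourth column.

```python
def is_mutated(context):
    """Return True if the context column marks the site as mutated.

    Assumption: a mutated entry has an "x" as the last character of the
    fourth (context) column, as described for the -m option.
    """
    return context.endswith("x")


def keep_pair(plus_context, minus_context, keep_mutated=False):
    """Decide whether a CpG pair is kept (hypothetical helper).

    By default the pair is dropped if either strand is marked as mutated.
    Setting keep_mutated=True mimics running with -m.
    """
    if keep_mutated:
        return True
    return not (is_mutated(plus_context) or is_mutated(minus_context))
```

With the defaults, a pair in which only one of the two sites carries the "x" marker is still dropped, which mirrors the behavior described for the default mode.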
Here is an example to show what sym actually does with the data.
First, the following are several lines of output generated by
counts. This partial output includes sites in multiple
contexts, four of which are CpG sites:
chr10 11473 + CHH 0 3
chr10 11474 + CXG 0 13
chr10 11476 - CXG 0 22
chr10 11477 + CpG 0.181818 11
chr10 11478 - CpG 0.391304 23
chr10 11479 - CCG 0 22
chr10 11481 - CHH 0 23
chr10 11483 + CCG 0 11
chr10 11484 + CpG 0.909091 11
chr10 11485 - CpG 0.913043 23
chr10 11486 - CCG 0 19
chr10 11487 - CHH 0 20
chr10 11489 - CHH 0.105263 19
The first CpG site above is at position 11477 on chr10, and there is
another one immediately following it on the opposite strand. These are
the two cytosines of the same CpG site. The first one is covered by 11
reads, and among those 2 indicate methylation (a C in the reads). This
count is obtained as 0.181818 x 11, which is approximately 2. The next
CpG has a "-" for the strand, so it refers to the G on the positive
reference strand, which is the same as the C on the opposite strand for
that site. This one is covered by 23 reads, 9 of which indicate
methylation (0.391304 x 23). For this one CpG dinucleotide, the total
methylation observations are 2 + 9 = 11, and the total reads are
11 + 23 = 34. Therefore, the methylation level for the dinucleotide is
11/34 = 0.3235294. The sym command would produce the following:
chr10 11477 + CpG 0.323529 34
chr10 11484 + CpG 0.911765 34
By chance, the other CpG site in the partial output above is also covered by 34 reads in total when counting both strands. Notice that non-CpG sites are removed. Your input/output might look slightly different in your terminal, as the format uses tabs and not spaces.
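The arithmetic in the example above can be sketched in Python. This is a minimal illustration, not the actual implementation of sym. The function names (parse, merge_pair, symmetrize, format_record) are hypothetical. The sketch assumes that methylated read counts can be recovered by rounding level x reads, that the two records of a pair appear on adjacent lines, and that unpaired CpG records are simply skipped, which may differ from what sym does. It also ignores mutated sites.

```python
def parse(line):
    """Split one whitespace- or tab-separated counts line into typed fields."""
    chrom, pos, strand, context, level, reads = line.split()
    return chrom, int(pos), strand, context, float(level), int(reads)


def merge_pair(plus, minus):
    """Combine the + and - strand records of one CpG into a single record."""
    chrom, pos, _, context, level_p, reads_p = plus
    _, _, _, _, level_m, reads_m = minus
    # Recover methylated read counts from level and coverage (assumption:
    # rounding to the nearest integer undoes the decimal truncation).
    meth = round(level_p * reads_p) + round(level_m * reads_m)
    total = reads_p + reads_m
    level = meth / total if total > 0 else 0.0
    return chrom, pos, "+", context, level, total


def symmetrize(lines):
    """Merge adjacent +/- CpG records; drop everything else."""
    records = [parse(line) for line in lines]
    merged = []
    for a, b in zip(records, records[1:]):
        is_pair = (
            a[3] == "CpG" and b[3] == "CpG"
            and a[2] == "+" and b[2] == "-"
            and a[0] == b[0] and b[1] == a[1] + 1
        )
        if is_pair:
            merged.append(merge_pair(a, b))
    return merged


def format_record(record):
    """Write a record as a tab-separated line with six significant digits."""
    chrom, pos, strand, context, level, total = record
    return f"{chrom}\t{pos}\t{strand}\t{context}\t{level:.6g}\t{total}"
```

Applied to the four CpG lines in the example, symmetrize returns two records, at positions 11477 and 11484, each with 34 reads and with levels of 11/34 and 31/34 respectively.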
Options
-o, -output
The name of the output file (default: stdout). The format is the same as output by counts.
-m, -muts
Include mutated CpG sites among the output, i.e. entries with an "x" terminating the fourth column of each line of input.
-v, -verbose
Report more information while the program is running.