sym - collapse counts for symmetric CpG sites

Synopsis

$ dnmtools sym [OPTIONS] <input.meth>

Description

Many of our tools were designed for data from vertebrate species. In these species, methylation levels at CpG sites tend to be symmetric, that is, the same on each strand. There are exceptions, but in many analysis settings, combining data from both strands for the same CpG site is a good idea. Assume you have output from counts. The sym command will merge data on both strands for each CpG site. It takes files in the same format as output by counts, containing either all cytosines or CpGs only (generated with the -n option when running counts).

$ dnmtools sym -o human_esc_CpG.meth human_esc.meth

The above command will merge all CpG pairs while also discarding sites with an indication that the CpG has mutated. Note that if either site of the pair is mutated, the whole pair is discarded. This is the default mode. If you want to keep those mutated sites, run the following:

$ dnmtools sym -m -o human_esc_CpG.meth human_esc.meth
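The discard rule can be sketched in a few lines of Python. This is only an illustration of the behavior described here, not the actual implementation of sym: the function names and the `include_mutated` parameter are hypothetical, and the check relies on the convention, noted under the -m option, that a mutated entry has an "x" terminating its fourth column.

```python
def is_mutated(context: str) -> bool:
    # Assumption: a mutated site is marked by an "x" ending the
    # fourth (context) column, as described for the -m option.
    return context.endswith("x")


def keep_pair(context_pos: str, context_neg: str, include_mutated: bool = False) -> bool:
    # Hypothetical helper: by default a pair is dropped as soon as
    # either of its two sites is flagged as mutated; with
    # include_mutated=True (mirroring -m) every pair is kept.
    if include_mutated:
        return True
    return not (is_mutated(context_pos) or is_mutated(context_neg))


# Illustrative context strings only.
print(keep_pair("CpG", "CpG"))
print(keep_pair("CpG", "CpGx"))
print(keep_pair("CpG", "CpGx", include_mutated=True))
```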

Here is an example to show what sym actually does with the data. First, the following are several lines of output generated by counts. This partial output includes sites in multiple contexts, and among them four are CpG sites:

chr10        11473        +        CHH        0          3
chr10        11474        +        CXG        0          13
chr10        11476        -        CXG        0          22
chr10        11477        +        CpG        0.181818   11
chr10        11478        -        CpG        0.391304   23
chr10        11479        -        CCG        0          22
chr10        11481        -        CHH        0          23
chr10        11483        +        CCG        0          11
chr10        11484        +        CpG        0.909091   11
chr10        11485        -        CpG        0.913043   23
chr10        11486        -        CCG        0          19
chr10        11487        -        CHH        0          20
chr10        11489        -        CHH        0.105263   19

The first CpG site above is at position 11477 on chr10, and there is another one immediately following it on the opposite strand. These are the two cytosines of the same CpG site. The first one is covered by 11 reads, and among those 2 indicate methylation (a C in the reads). This count is obtained as 0.181818 x 11. The next CpG has a "-" for the strand, so it refers to the G on the positive reference strand, which corresponds to the C on the opposite strand for that site. This one is covered by 23 reads, 9 of which indicate methylation (0.391304 x 23). For this one CpG dinucleotide, the total methylation observations are 2 + 9 = 11, and the total reads are 11 + 23 = 34. Therefore, the methylation level for the dinucleotide is 11/34 = 0.3235294. The sym command would produce the following:

chr10   11477   +   CpG 0.323529    34
chr10   11484   +   CpG 0.911765    34

By chance, the other CpG site in the partial output above is also covered by 34 reads when counting both strands (11 + 23); its 10 + 21 = 31 methylation observations give a level of 31/34 = 0.911765. Notice that non-CpG sites are removed. Your input/output might look slightly different in your terminal, as the format involves tabs and not spaces.
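The arithmetic in the example above can be reproduced with a short Python sketch. It is not the code used by sym; the function name `merge_pair` is hypothetical, and the sketch assumes that the number of methylated reads can be recovered by rounding level x reads to the nearest integer.

```python
def merge_pair(level_pos, reads_pos, level_neg, reads_neg):
    """Combine the two strands of one CpG into a single (level, reads) pair."""
    # Recover the methylated-read counts from level * coverage.
    # Rounding absorbs the six-digit precision of the printed levels.
    meth_pos = round(level_pos * reads_pos)
    meth_neg = round(level_neg * reads_neg)
    total_meth = meth_pos + meth_neg
    total_reads = reads_pos + reads_neg
    # Guard against sites with no coverage on either strand.
    level = total_meth / total_reads if total_reads > 0 else 0.0
    return level, total_reads


# The two CpG sites from the example: (position, + strand, - strand).
pairs = [
    (11477, (0.181818, 11), (0.391304, 23)),
    (11484, (0.909091, 11), (0.913043, 23)),
]
for pos, (lp, rp), (ln, rn) in pairs:
    level, reads = merge_pair(lp, rp, ln, rn)
    print(f"chr10\t{pos}\t+\tCpG\t{level:.6g}\t{reads}")
```

The merged level is a read-weighted combination of the two strands, so the strand with more coverage contributes more than it would in a plain average of the two levels.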

Options

-o, -output

The name of the output file (default: stdout). The format is the same as output by counts.

-m, -muts

Include mutated CpG sites in the output, i.e. input entries with an "x" terminating the fourth column.

-v, -verbose

Report more information while the program is running.