counts - compute single-site methylation

Synopsis

$ dnmtools counts [OPTIONS] -c <chroms> <input.bam>

Description

The counts command takes the mapped reads and produces the methylation level at each genomic cytosine (both strands), with the option to produce only levels for CpG-context cytosines. While most DNA methylation in vertebrates is in the CpG context, cytosines in other sequence contexts, such as CXG or CHH (where H denotes adenines, thymines, or cytosines and X denotes adenines or thymines) may also be methylated. Non-CpG methylation occurs frequently in plant genomes and pluripotent mammalian cells such as embryonic stem cells. And possibly whatever cells you are studying. The output of counts serves as the input for many downstream analyses.

The input mapped reads file (input.bam) is in SAM/BAM format. The reads should be sorted so those mapping to the same chromosome are consecutive in the file. Duplicate reads should be probably be removed using the uniq command first, but that depends on your data.

The methylation level for every cytosine site at single base resolution is estimated as the ratio of methylated to total bases in reads mapped over that site. If a site has no reads covering it, then we leave the value at 0, but also indicate this in the output (see below).

Each site is marked with a context in the output of the counts command, and you can find out more about cytosine contexts here.

To compute methylation levels at each cytosine site along the genome you can use the following command:

$ dnmtools counts -c /path/to/genome.fa -o output.counts input.sam

The argument -c gives the name of a FASTA file containing all chromosome sequences (as of v1.2.5, a directory of separate files is no longer supported). Importantly, the "name" line in each chromosome FASTA file must begin with the character > followed immediately by the same name that identifies that chromosome in the SAM output (the .bam files). If you use the same FASTA format file you used to map the reads, everything should be fine. An example of the output and explanation of each column follows:

chr1  1869  +  CCG  0         1
chr1  1870  +  CpG  0         1
chr1  1871  -  CpG  0.666667  6
chr1  1874  +  CHH  0         1
chr1  1875  +  CCG  0         1
chr1  1876  +  CpG  0         1
chr1  1877  -  CpG  0.333333  6
chr1  1880  +  CHH  0         1
chr1  1881  +  CCG  0         1
chr1  1882  +  CpG  0         1
chr1  1883  -  CpG  0.5       6
chr1  1886  +  CHH  0         1
chr1  1887  +  CCG  0         1
chr1  1888  +  CpG  0         1
chr1  1889  -  CpG  0.333333  6
chr1  1892  +  CHH  0         1
chr1  1893  +  CCG  0         1
chr1  1894  +  CpG  0         1
chr1  1895  -  CpG  0.25      4
chr1  1898  +  CHH  0         1

The output file contains one line per cytosine site. The first column is the chromosome. The second is the position of the cytosine (zero-based). The 3rd column indicates the strand, which can be either '+' or '-', corresponding to either a C on the positive reference strand or a G on the positive reference strand. The 4th column is the sequence context of that site, followed by an x if the site has mutated in the sample away from the reference genome. The estimate for mutation is based on whether the reads mapping on the opposite strand for a cytosine are biased towards 'A', which happens if a 'C' to 'T' mutation has occurred (a G to A on the other strand) and would otherwise be interpreted as a C-to-T conversion by bisulfite. The 5th column is the estimated methylation level, equal to the number of Cs in reads at position corresponding to the site, divided by the sum of the Cs and Ts mapping to that position. The final column is number of reads overlapping with that site.

Note that because counts produces a file containing one line for every cytosine in the genome, the file can get quite large. For reference assembly mm10, the output is approximately 25GB. The -n option produces methylation data for CpG context sites only, and for mm10 this produces an output file that is approximately 1GB. It is recommended that users allocate at least 8GB of memory when running counts.

To examine the methylation status of cytosines a particular sequence context, you can use the awk command (or grep) to filter those lines based on the fourth column. For example, to get all cytosines within the CHG context, run the following:

$ awk '$4 == CHG' human_esc.counts > human_esc_chg.counts

Our convention is to name counts output with all cytosines like *.counts, with CHG like *.chg.counts and with CHH like *.chh.counts.

Estimated "mutations" in the sample: In version 1.2.1 sites are flagged if there is an indication that a mutation at a C in the sample might cause a T to appear in the reads, and then confound the issues of mutations vs. the absence of methylation. The 4th column in the output (as indicated above) is the "context" string for that cytosine, and if a site is flagged as a possible mutation, then an x will appear at the end of the context string. This is an example of what you would see in the output:

chr1  1870  +  CpGx  0         1

The data used to make this inference is on the opposite strand: if there is a C on the positive strand, then we should observe a G at the same position in reads mapping to the negative strand, regardless of the bisulfite. If the reference has a C, but molecules in the sample have mutated C->T, independent of methylation or bisulfite, then we would observe an A in reads mapping to the same position on the opposite strand. At present we use very poor statistical criteria for this, so unless you believe you understand it well, it's best to ignore. Hopefully we can do this in a more rigorous and interpretable way in the future.

Options

-o, -output

Output file name. The default is to write output to the terminal, which is not useful unless commands are piped.

-c, -chrom

Reference genome file, which must be in FASTA format. This is required.

-t, -threads

The number of threads to use. This is only helpful if the input is BAM or if the output is to be zipped (see -z below). The threads speed up decompressing BAM input or compressing gzip output. Because counts spends most of its computing time processing reads sequentially, there are diminishing returns for specifying too many threads.

-z, -zip

The output should be zipped (in gzip format). This is not deduced by the filename, but specifying this argument should be accompanied by using a .gz filename suffix for the output.

-n, -cpg-only

Print only CpG context cytosines. This significantly reduces the output size in most genomes. Note that using this option does not merge data as symmetric CpGs.

-v, -verbose

Report more information while the program is running.

-progress

Show progress while the program is running.