hmr - hypomethylated regions

Synopsis

$ dnmtools hmr [OPTIONS] <input.meth>

Description

This command identifies "hypomethylated regions" which we abbreviate as HMRs. The valleys are identified using a 2-state hidden Markov model with a beta-binomial emission distribution for each CpG site. This distribution accounts for methylation and random changes to coverage along the genome. hmr automatically learns the average methylation levels inside and outside the HMRs, and also the average size of those HMRs.

The input to hmr is a file with the output format from the counts command. The output is a BED format file containing the HMRs as genomic intervals.

For mammalian healthy somatic primary cells, these are the most important functional features in the methylome. They tend to correspond to promoters and enhancers (and other regions of possible regulatory activity) that are functional/active, are accessible and poised for function, or that were so in a progenitor cell. Although the latter two categories are important, in most somatic primary cells the HMRs mark active regulatory regions. From a global view of the methylome, these are the valleys in an otherwise high background methylation level. In a typical sample of healthy somatic primary cells, you should expect to find between roughly 40k-100k HMRs, and their mean size should be 1.5 kbp to 3 kbp. If your results deviate too much from this, then you should consider whether it makes sense to identify HMRs in your sample (e.g., if it's a cancer sample, or immortalized, then the HMRs will be obscured by other features). I have never observed HMRs defined in this way in species outside vertebrates. For example, Arabidopsis has a low background methylation level punctuated by peaks, rather than valleys (and the difference isn't as simple as subtracting the methylation level from 1.0).

Requirements on the data

Running hmr requires a file of methylation levels formatted like the output of the counts. For identifying HMRs in mammalian methylomes, use the symmetric CpG methylation levels. This is obtained by using the sym command after having used the counts command.

We typically like to have about 10x coverage to feel very confident in the HMRs called in mammalian genomes, but the method will work with lower coverage. Coverage can be calculated using the levels command, and is summarized in the mean_depth_covered statistic under cpg_symmetric group.

If reads have low coverage, the boundaries of HMRs will be less accurate, but overall most of the HMRs will probably be in the right places if you have coverage of 5-8x (depending on the methylome). Boundaries of these regions are ignored by analysis methods based on smoothing or using fixed-width windows, so you will get better precision on boundaries (and accuracy overall) using hmr.

Output

The output will be in BED format, and the indicated strand (always positive) is not informative. The name column in the output will just assign a unique name to each HMR, and the score column indicates how many CpGs exist inside the HMR. Each time the hmr is run it requires parameters for the HMM to use in identifying the HMRs. We usually train these HMM parameters on the data being analyzed, since the parameters depend on the average methylation level and variance of methylation level; the variance observed can also depend on the coverage. However, in some cases it might be desirable to use the parameters trained on one data set to find HMRs in another. The option -p indicates a file in which the trained parameters are written, and the argument -P indicates a file containing parameters (as produced with the -p option on a previous run) to use:

$ dnmtools hmr -p params.txt -o output.hmr input.meth

Above the output file has the extension .hmr but this doesn't matter. The format of the output is 6-column BED.

In the above example the trained parameters are stored in the file params.txt but are also used to find HMRs in the input methylome. Storing these parameters can be useful if a particular methylome seems to have very strange methylation levels through much of the genome, and the HMRs would be more comparable with those from some other methylome if the model were not trained on that strange methylome.

Partially methylated regions (PMRs)

The hmr command also has the option of directly identifying partially methylated regions (PMRs), not to be confused with partially methylated domains. These are contiguous intervals where the methylation level at individual sites is close to 0.5. This should also not be confused with regions that have allele-specific methylation (ASM) or regions with alternating high and low methylation levels at nearby sites. Regions with ASM are almost always among the PMRs, but most PMRs are not regions of ASM. The hmr command is run with the same input but a different optional argument to find PMRs:

$ dnmtools hmr -partial -o output.pmr input.meth

Options

 -o, -out

The name of the output file. If no file name is provided, the output will be written to standard output. Due to the size of this output, a file name should be specified unless the output will be piped to another command or program. The output file contains genomic intervals in BED format.

 -d, -desert

The maximum distance between covered CpGs in HMR (default: 1000 bp). Beyond this distance, adjacent CpG sites will be considered part of distinct HMRs, regardless of their methylation status.

 -i, -itr

The maximum number of iterations for learning parameters (default: 10).

 -v, -verbose

Report more information while the program is running.

 -partial

Identify PMRs instead of HMRs.

 -post-hypo

Output file for single-CpG posterior hypomethylation probability (default: none). By default this is information is not reported, and only reported if a file is specified here.

 -post-meth

Output file for single-CpG posteiror methylation probability (default: none). By default this is information is not reported, and only reported if a file is specified here.

 -P, -params-in

File containing existing parameters to use in the model (skip the training step). This should be a file produced previously by the hmr command using the -p parameter.

 -p, -params-out

File in which to write parameters learned during the current run.

 -s, -seed

A random number seed. Randomization is used in a shuffling step prior to filering candidate HMRs. This parameter is typically only used for testing (default: 408).

 -S, -summary

Write the analysis summary to this file. The summary is not reported unless a file is specified here. This option is correct as of v1.4.0.