pmd - partially methylated domains
Synopsis
$ dnmtools pmd [OPTIONS] <input.meth>
Huge genomic blocks with abnormal hypomethylation have been
extensively observed in human cancer methylomes and more recently in
extraembryonic tissues like the placenta. These domains are
characterized by enrichment in intergenic regions or Lamina associated
domains (LAD), which are usually hypermethylated in normal tissues.
Partially methylated domains (PMDs) are not homogeneously
hypomethylated as in the case of HMRs, and contain focal
hypermethylation at specific sites. PMDs are large domains with sizes
ranging from 10kb to over 1Mb. Hidden Markov Models can also identify
these larger domains. The pmd command is provided for their
identification, and can be run as follows:
$ dnmtools pmd -i 1000 -o output.bed input.meth
The underlying Hidden Markov Model for PMD identification is very similar to that of HMR identification, with a few key differences. Because PMDs are large, megabase-scale regions of methylation loss with individual CpGs exhibiting high methylation variance, each step in the HMM is the weighted average methylation level for a genomic region rather than a single CpG site. For samples with high sequencing coverage and depth, a bin size of 1000bp suffices most of the time.
The PMD program has a built-in bin-size selection method that chooses
the smallest bin size (in 500bp increments) such that bins have enough
observations for their average methylation to be confidently
estimated. The bin size can be fixed by specifying -b. We recommend
defaulting to 1000 iterations to ensure the Baum-Welch training
procedure converges.
The sequence of genomic bins is segmented into hypermethylation and
partial-methylation domains, where the latter are the candidate PMDs.
Further processing of candidate PMDs includes trimming the two ends of
a domain to the first and last CpG positions, and merging candidates
that are "close" to each other. Currently, we are using twice the bin
size as the merging distance. Development in later versions of the
pmd program will include randomization procedures for choosing
merging distance.
The output of the program is a BED file formatted as follows:
chr1 4083054 4756012 PMD0:3.455556:278:4.584455:260 673 +
chr1 4846663 5430747 PMD1:3.842801:276:5.246181:166 580 +
chr1 5463102 5912049 PMD2:3.839765:217:4.191114:282 448 +
chr1 11366900 11716004 PMD3:4.141159:240:3.132796:411 349 +
chr1 12781853 13024981 PMD4:3.499296:241:2.168534:3 227 +
chr1 14741633 15210084 PMD5:5.306778:173:5.009876:147 467 +
chr1 18150329 18718393 PMD6:5.247428:179:3.984616:45 569 +
chr1 18789404 19197536 PMD7:3.511930:168:5.494317:268 408 +
chr1 29655012 29877666 PMD8:4.887023:232:1.406238:162 221 +
chr1 30028301 30464612 PMD9:5.101033:16:2.987482:118 434 +
The first three columns give the chromosome, start, and end positions of the identified PMD. The fourth column has an arbitrarily assigned name for the PMD (counting up from zero) and boundary quality information: specifically the likelihood and certainty of the boundary at the PMD start (1&2) and boundary at the PMD end (3&4). These values can be used relative to other PMDs to assess the "sharpness" of the PMD boundaries, and the confidence we have in the boundary estimate based on the coverage of the sample. NaN boundary scores represent the case where the joint likelihood and confidence score of a boundary cannot be calculated because one side of the boundary has either zero coverage or no CpGs: this occurs when PMDs cut off due to deserts, for instance. The fifth column is the number of bins in the PMD, which is analogous to the number of CpGs segmented in the HMR program. PMDs should be called using information from both strands, so the last column is a placeholder.
In general, the presence of a single HMR would not cause the program
to report a PMD in that region. However, in cases where a number of
HMRs are close to each other, such as the promoter HMRs in a gene
cluster, the pmd program might report a PMD covering those HMRs. Users
should be cautious with using such PMD calls in their further studies.
In addition, not all methylomes have PMDs, some initial visualization
or summary statistics can be of help in deciding whether to use pmd
program on the methylome of interest.
In addition, calling HMRs in samples with PMDs can be difficult: PMDs can obscure the sites we are trying to identify by providing an alternative foreground methylation state to the focused, very low methylation typically at promoter regions. A good workaround for this is to call PMDs first, and then call HMRs separately inside and outside of PMDs (e.g. using selectsites and using the output of PMDs as the BED format input). This ensures that the foreground methylation state learned by the HMM in both types of background is the focused hypomethylation at CpG islands and promoter regions.
PMD detection in multiple samples
The pmd program allows for multiple samples to be provided at once,
and it models each emission distribution separately. This behavior is
not yet well understood, but does yield an estimate of the "composite"
PMD state across the samples input. Because each emission distribution
is learned separately with no concept of a normally distributed error
on the parameters, this mode is less applicable with technical
replicates of the same tissue but potentially useful when looking
across many tissue types where substantial sample-to-sample biological
variability is expected.
PMD detection in microarray data
Our recent addition of a bin size selection method, coupled with tuning
the desert size, can lead to a rough estimation of PMDs in human
Illumina MethylationEPIC microarray data. The -a option can be used to
specify that the input is all microarray data, and will result in the
emission distributions being modeled as Beta rather than
Beta-binomial. One should keep in mind that to use this option will
require some tuning of the desert size and bin sizes (we have seen
large desert sizes > 100kb and bin sizes around 20kb perform well) and
is best paired with matching WGBS data.
Options
-o, -out
The name of the output file. If no file name is provided, the output will be written to standard output. Due to the size of this output, a file name should be specified unless the output will be piped to another command or program. The output file contains genomic intervals in BED format.
-d, -desert
The maximum allowed span where no data is observed inside a PMD. If a PMD is larger than this size, it will be broken up.
-f, -fixedbin
Value of the fixed bin size
-b, -bin
Starting bin size
-a, -arraymode
Input is microarray data (e.g. from MethylationEPIC)
-i, -itr
max number of iterations
-v, -verbose
Print more information while the command is running.
-D, -debug
Print debug information while the command is running.
-P, -params-in
HMM parameter files for individual methylomes. If multiple files are specified they must be separated with comma and not spaces.
-r, -posteriors-out
Write posterior probabilities, in counts format, to this file.
-p, -params-out
Write the trained parameters to this file.
-s, -seed
Specify a random seed value.
-S, -summary
Write the analysis summary to this file. The summary is not reported unless a file is specified here. This option is correct as of v1.4.0.