radmeth - Differential CpG methylation

Synopsis

dnmtools radmeth -factor case -o output.txt design_matrix.txt data_matrix.txt

Description

The differential-methylation detection method described in this section is based on the beta-binomial regression. We recommend using this method when more than three replicates are available in each group, or for a more complicated experimental design, there are enough samples to fit the parameters. The radmeth command can take advantage of multiple cores for improved speed. The processing time depends on the number of sites analyzed and the number of methylomes in the dataset.

Generating input tables

The first step in the differential methylation analysis is to assemble a table with the counts of total reads and methylated reads for each site (rows) and each methylome (columns). Dnmtools includes a command merge to generate this kind of table from the given collection of methylomes in counts format.

Features: although the explanation here focuses on individual CpG sites, the identical method can be applied to genomic intervals as features (rows). The common requirement is that in each row, each "column" has a count for total reads and a count for reads indicating methylation. One prominent example is to run the roi command on a set of genomic intervals, and use the output as the basis for differential methylation analysis.

Suppose that we want to compare methylomes named control_a.sym, control_b.sym, control_c.sym to the methylomes case_a.sym, case_b.sym, case_c.sym. Each of these filenames suggests that the methylation is for symmetric CpG sites as would be obtained using the sym command.

The data matrix can be created with the following command:

$ dnmtools merge -t -radmeth \
     control_a.meth control_b.meth control_c.meth \
     case_a.meth case_b.meth case_c.meth > data_table.txt

This is what data_table.txt looks like:

control_a       control_b       control_c       case_a  case_b  case_c
chr1:108:+:CpG 9       6       10      8       1       1       2       2       2       1       14      1
chr1:114:+:CpG 17      7       10      0       14      3       5       1       9       1       7       1
chr1:160:+:CpG 12      8       10      5       17      4       15      14      13      6       4       4
chr1:309:+:CpG 1       1       1       0       17      12      12      8       2       1       19      8

As indicated in the header, this table contains information about 6 methylomes: 3 controls and 3 cases. Each row of the table contains information about an individual CpG site, and two values for each column heading: one value being the count of total reads and the second the count of methylated reads. Note that the second value does not need to be integers, which can happen when reads contribute fractionally, as in counts from nanopore. For example, the first row describes a cytosine within a CpG site located on chromosome 1 at position 108. This site is covered by 9 reads in the methylome "control a" and 6 of those reads indicate methylation. Note that the dnmtools merge command adds methylomes into the data matrix in the order in which they are listed on the command line.

Note: the header does not include a column name for the first column, and for each methylome, currently there is only one column name for each pair of columns in this format.

Design matrix

The next step is to specify the design matrix, which describes the structure of the experiment. For our running example, the design matrix looks like this:

base   case
control_a      1       0
control_b      1       0
control_c      1       0
case_a 1       1
case_b 1       1
case_c 1       1

The design matrix shows that samples in this dataset are associated with two factors: base and case. The first column corresponds to the intercept and will always be present in the design matrix. Think of it as stating that all samples have the same baseline mean methylation level. To distinguish cases from controls we add another factor case (second column). The 1s in this second column indicate the methylomes that belong to the cases group. You can use this design matrix as a template to create design matrices for two-group comparisons involving arbitrary many methylomes.

After creating the data matrix and the design matrix, we are now ready to start the differential methylation analysis. It consists of (1) regression, (2) combining significance, and (3) multiple testing adjustment steps.

Regression

Suppose that the data_table.txt and design_matrix.txt are as described above. The regression step is run with the command

$ dnmtools radmeth -factor case -o output.txt design_matrix.txt data_table.txt

The -factor parameter specifies the factor with respect to which we want to test for differential methylation. The test factor is 'case', meaning that we are testing for differential methylation between cases and controls. In the output file output.txt, the last four columns correspond to the total read counts and methylated read counts of the case group and control group, respectively. The p-value (5-th column) is given by the log-likelihood ratio test.

Depending on the distribution of methylated and unmethylated CpGs, some p-values may not be calculated, and a value of NA will be output. If the argument -na-info is used, the following detail will be given. If there are no reads that cover a given CpG, the value NA_LOW_COV will be printed instead of a p-value. If all CpGs are fully methylated and fully methylated in both case and control, we say the CpG has a "extreme" counts. In these cases, we cannot reject the null and output NA_EXTREME_CNT. Finally, if there are numerical errors in the log-likelihood ratio test, we simply print NA.

Output format

The output for radmeth is one line for each row of the input data matrix. If the input data matrix has rownames with "tokens" separated by colons, for example "chr3:67363494:+:CpG", then this will appear in the output but with the colons replaced by tabs. If the rownames in the input file do not have colons, they will be output as is, in the first column of the output file.

  • The second column, or the first column after the (possibly transformed) rowname, is the p-value for that column, or an NA value (see above).
  • After the p-value, the mean methylation per "group" is given in each subsequent column. Here, a "group" is a combination of factor levels, assuming binary factors. The order of groups corresponds to the order of their binary representation in the design matrix. The order for the groups in the example design matrix below would be (MB=100, MA=101, FB=110, FA=111).
  • The final column in the output is the "overdispersion factor" for the beta-binomial. Assume phi is the dispersion. The binomial variance is np(1-p) and the beta-binomial variance is np(1-p)((n+phi)/(phi+1)). So (n - 1)/(phi + 1) is the factor by which the variance for the beta-binomial exceeds that of the binomial. An overdispersion factor of 0 means the beta-binomial is behaving just like a binomial.

Tutorial

The data used here is available at:

https://github.com/andrewdavidsmith/radmeth-demo-simulated-data

And the results below were generated with v1.5.1 on Linux.

This tutorial should answer questions users often have about the assumptions of radmeth and how to interepret the results. We will assume a factor of interest, with levels A and B. We will also assume a potentially confounding factor, sex, with levels M and F. Further, we have 8 total samples, 2 samples for each of the 4 combinations of {A,B}x{M,F}. The design matrix should look like this:

base    sex factor
sample_FA1  1   1   1
sample_FA2  1   1   1
sample_FB1  1   1   0
sample_FB2  1   1   0
sample_MA1  1   0   1
sample_MA2  1   0   1
sample_MB1  1   0   0
sample_MB2  1   0   0

Note that the 3 headings correspond with the three binary columns, but need not be aligned over them. The first sample in the design matrix, sample_FA1, has an indicator in the other columns for both the "F" and the "A", both encoded with a "1" in their respective columns. But the replicate number 1 in the _FA1 is just a replicate, and if it had meaning we would consider encoding it using some factor. Designating it as a replicate with no meaning is a choice here, aligning with the idea that the replicates are statistically identical. If all replicates identified with a "2" were done on "day 2", for example, then we might reconsider.

For our example we will use a genome file genome.fa that is very small and not a real genome. We will also use a file features.bed that will be the starting point for generating data that fit with our hypotheses. We will be generating data that should show us the differences we want to see -- assuming we use the tools properly. I will refer to each of the intervals in features.bed as a "feature" but just think of it as a genomic interval with a chromosome name, a start and an end. We need to first obtain some features that are associated specifically with each of the levels of our factors: A, B, M and F. We can do this starting with our features file and using the bedtools shuffle command:

$ bedtools shuffle -i features.bed -g genome.chrom.sizes > sim/features_M.bed;
$ cat sim/features_*.bed | sort -k 1,1 -k 2,2g -o excl.bed;
$ bedtools shuffle -excl excl.bed -i features.bed -g genome.chrom.sizes > sim/features_F.bed;
$ cat sim/features_*.bed | sort -k 1,1 -k 2,2g -o excl.bed;
$ bedtools shuffle -excl excl.bed -i features.bed -g genome.chrom.sizes > sim/features_A.bed;
$ cat sim/features_*.bed | sort -k 1,1 -k 2,2g -o excl.bed;
$ bedtools shuffle -excl excl.bed -i features.bed -g genome.chrom.sizes > sim/features_B.bed;

This ensures none of the "features" are overlapping between A, B, M or F, and all are similar in number and size distribution. Using each of these, I then made files of features corresponding to the combinations of factors:

for i in M F; do
    for j in A B; do
        cat sim/features_${i}.bed sim/features_${j}.bed |\
            sort -k 1,1 -k 2,2g -o sim/features_${i}${j}.bed;
    done;
done;

Now I have files named:

sim/features_MA.bed
sim/features_MB.bed
sim/features_FA.bed
sim/features_FB.bed

The files named with an "M" include all intervals in sim/features_M.bed, and similarly the files named with a "B" include all intervals in sim/features_B.bed. So the file sim/features_FA.bed is all the features for "A" and all the features for "F". It will differ from sim/features_FB.bed less than it will from sim/features_MB.bed.

Using these files, I simulated methylation levels at every CpG site in the fake genome genome.fa with CpGs outside the intervals receiving a methylation level of 0.8 and CpGs inside the intervals receiving a methylation level of 0.15. At each CpG site, a number of reads was sampled (Poisson with mean 10) and then the methylation levels were sampled with a coin flip weighed as either 0.8 or 0.15, depending on the CpG site. I did this twice for each combination of {M,F}x{A,B}. The result is 8 files named as follows:

$ ls -1 sim/*.counts
sim/sample_FA1.counts
sim/sample_FA2.counts
sim/sample_FB1.counts
sim/sample_FB2.counts
sim/sample_MA1.counts
sim/sample_MA2.counts
sim/sample_MB1.counts
sim/sample_MB2.counts

The program I used for the simulation above is not currently available. In this simulation, the only difference between replicates (e.g., sim/sample_FA1.counts vs. sim/sample_FA2.counts) is random noise due to sampling.

Here is a look inside one of these files:

$ head sim/sample_FA1.counts
chr1    163 +   CG  0.857143    7
chr1    206 +   CG  0.928571    14
chr1    232 +   CG  0.7 10
chr1    278 +   CG  0.555556    9
chr1    296 +   CG  0.714286    7
chr1    310 +   CG  0.545455    11
chr1    322 +   CG  0.833333    6
chr1    324 +   CG  0.555556    9
chr1    350 +   CG  0.933333    15
chr1    356 +   CG  0.923077    13

Then I used the merge command from dnmtools to make a data matrix as follows:

$ dnmtools merge -t -radmeth -v -remove .counts -o sim/table.txt sim/sample_*.counts

And this is what it gave me:

$ head sim/table.txt
sample_FA1  sample_FA2  sample_FB1  sample_FB2  sample_MA1  sample_MA2  sample_MB1  sample_MB2
chr1:163:+:CG   7   6   7   2   9   7   10  8   11  7   10  7   12  10  14  11
chr1:206:+:CG   14  13  4   2   12  11  15  14  8   7   11  8   5   4   11  9
chr1:232:+:CG   10  7   5   3   9   8   14  9   10  8   3   2   2   1   10  8
chr1:278:+:CG   9   5   16  14  14  12  8   8   10  9   8   7   6   4   9   7
chr1:296:+:CG   7   5   15  14  9   7   11  10  5   5   6   5   9   8   9   7
chr1:310:+:CG   11  6   11  11  8   6   9   5   5   4   10  7   9   7   5   4
chr1:322:+:CG   6   5   6   6   9   8   19  14  13  11  13  11  9   6   10  7
chr1:324:+:CG   9   5   10  9   8   5   10  9   7   6   14  9   7   6   6   6
chr1:350:+:CG   15  14  8   6   6   5   9   6   7   6   6   4   9   8   11  10

With this simulated data, in which each column has the influence of two different factors, we can use radmeth with the design matrix shown above (in a file named desgin.txt) to get p-values for differential methylation between levels A and B of our factor of interest -- while controling for the effect of M/F:

$ dnmtools radmeth -o sim/samples.radmeth -f factor design.txt sim/table.txt

As explained already, the 5th column of the output gives the p-values for differential methylation between levels for our factor of interest (named "factor" in the design matrix). Here is a look at part of the output:

$ head samples.radmeth | column -t
chr1  163  +  CG  0.0863989  0.818157  0.653660  0.775035  0.591029  0.000052
chr1  206  +  CG  0.443989   0.841557  0.764998  0.908700  0.859154  0.000091
chr1  232  +  CG  0.756234   0.776885  0.744420  0.725115  0.688140  0.000242
chr1  278  +  CG  0.77863    0.831485  0.807046  0.842138  0.818908  0.000808
chr1  296  +  CG  0.648263   0.848138  0.884828  0.836674  0.875727  0.000341
chr1  310  +  CG  0.640836   0.729179  0.780351  0.689915  0.745921  0.139695
chr1  322  +  CG  0.0972604  0.679385  0.849711  0.789027  0.908915  0.000235
chr1  324  +  CG  0.25803    0.861294  0.753022  0.822346  0.694456  0.008238
chr1  350  +  CG  0.890771   0.844111  0.855183  0.807851  0.820951  0.000267
chr1  356  +  CG  0.0527503  0.750013  0.933344  0.818157  0.954543  0.000315

Focusing on the 5th column, we see values between 0 and 1, but none of them are very extreme. This particular output file has 17903 rows. We can use the dnmtools command selectsites to pull out the rows in this file that correspond to each of the original feature sets. Here is how we would do it for all 4 different original sets of features:

for i in M F A B; do
    dnmtools selectsites -relaxed -o sim/features_${i}.txt sim/features_${i}.bed sim/samples.radmeth;
done

(The argument -relaxed to the selectsites command is to allow for flexibility in the column format after the first 6.)

The result allows us to verify that radmeth with the design matrix given above identifies sites as significant when they differ between levels A and B of the factor of interest:

$ for i in sim/features_[ABFM].txt; do echo $i `awk 'BEGIN{k=0;c=0}{k+=$5;c+=1}END{print k,c,k/c}' $i`; done | column -t
sim/features_A.txt  0.255514  430  0.000594218
sim/features_B.txt  0.11431   372  0.000307286
sim/features_F.txt  196.602   400  0.491505
sim/features_M.txt  296.222   588  0.50378

The p-values in features associated with either M or F are averaging around 0.5, which is what we want if we are trying to control this possible confounding variable. In the features associated either with A or B, the p-values are very small on average. Of course this is literally by design. Knowing that we simulated just as much difference between M and F as between A and B, if we had used -f sex then radmeth would have correctly assigned low p-values to sites in the intervals of features_M.bed and features_F.bed.

We can use the same simulated data to verify that we do need the intercept in our design. In other words, the column "base" in design.txt needs to be present (you can give it any name you want; make sure it's all 1s).

We change the design matrix to make the following, which has the intercept column removed:

$ cat design_noint.txt
sex factor
sample_FA1  1   1
sample_FA2  1   1
sample_FB1  1   0
sample_FB2  1   0
sample_MA1  0   1
sample_MA2  0   1
sample_MB1  0   0
sample_MB2  0   0

Then we re-run all the above commands (I won't exaplain each) to get results analogous but with this modified design matrix:

$ dnmtools radmeth -o sim/samples_noint.radmeth -f factor design_noint.txt sim/table.txt
$ for i in M F A B; do dnmtools selectsites -relaxed -o sim/features_noint_${i}.txt sim/features_${i}.bed sim/samples_noint.radmeth; done
$ for i in sim/*_noint_*.txt; do echo $i `awk 'BEGIN{k=0;c=0}{k+=$5;c+=1}END{print k,c,k/c}' $i`; done | column -t
sim/features_noint_A.txt  8.27797  430  0.0192511
sim/features_noint_B.txt  15.2391  372  0.0409654
sim/features_noint_F.txt  77.2454  400  0.193114
sim/features_noint_M.txt  93.7184  588  0.159385

Although the sites within features corresponding to A and B are much lower on average than those for M or F, the distinction has been reduced. More importantly, we know that the sites in sim/features_M.bed and sim/features_F.bed are not influenced by the level A/B of our factor of interest. The p-values for sites in the last two rows above should have mean of 0.5. But without the intercept included in the design matrix, the outcomes to not match our statistical assumptions. I cannot predict what will happen if you use the wrong design matrix. In analyzing bisulfite sequencing data for differential methylation, we become aware of uncertainty through the counts, and we leverage this to try and be more accurate. So unlike other regression settings, if you try to normalize the data first, in order to avoid having to deal with an intercept, you will not arrive at an equivalent procedure -- if you want to normalize the data yourself (i.e. subtract mu and divide by sigma) then you are probably better off using a general regression package.

Options

 -o, -out

The name of the output file (default: stdout).

 -t, -threads

Use multiple threads. This gives very good speedup as long as you do not exceed the number of available physical cores. Most CPUs support 2 hardware threads per physical core via SMT/Hyper-Threading, so check your number of cores if you are unsure.

 -n -na-info

If a p-value is not calculated, print NAs in more detail: low coverage (NA_LOW_COV), extreme values (NA_EXTREME) or numerical errors in likelihood ratios (NA).

 -v, -verbose

Print more information while the command is running.

 -progress

Show progress while radmeth runs.

 -f, -factor

The name of the factor on which to test differences. This must be the name of one of the columns in file design matrix. This is required.

 -tolerance

The numerical tolerance for convergence when estimating model parameters. This cutoff is applied to the norm of the gradient of the log-likelihood function for MLE parameter estimation, and is multiplied by both the square root of the number of parameters and by the number of methylomes in the analysis.

 -max-iter

The maximum iterations to allow when estimating model parameters.