roi - average methylation level in each region of interest

Synopsis

$ dnmtools roi [OPTIONS] <intervals.bed> <input.counts>

Description

One of the most common analysis tasks is to compute the average methylation level through each of a set of genomic intervals. The roi command does this. It reports weighted mean methylation in each interval, along with read counts contributing to this mean. It can also report an unweighted mean and a fraction of sites "called" methylated in each interval. More details on these quantities can be found in the documentation for the levels command.

The roi command requires two input files. The first is a sorted counts output file, i.e. input.counts in the example above. This file provides data for every site, either a cytosine or CpG, that is of interest. The second input file (intervals.bed) specifies the genomic intervals in which methylation statistics should be summarized. If either file is not sorted by (chrom,end,start,strand) it can be sorted using the following command:

$ LC_ALL=C sort -k 1,1 -k 3,3n -k 2,2n -k 6,6 -o input-sorted.counts input.counts

Note: As of v1.4.0, the sorted order of chromosomes/targets within these files is not important, but the sites within each chromosome must still be sorted.

The intervals must be specified as a BED format file, and these can be sorted using bedtools sort. Note that the roi command accepts intervals in two different BED formats: (1) 6-column BED format, which may have more than 6 columns, but requires the first 6 columns to match the specification, or (2) 3-column BED format.

An important note about the input files: several aspects of the output for roi depend on the number of sites within each region of interest. If the .counts file provided as input does not have all the sites you might expect, for example if it is missing sites that have been excluded from some earlier step in your pipeline, then the results will be affected. We hope to make roi more robust to this issue in the future, for example by accepting some information about the reference genome to ensure that the numbers of sites are as expected by the user.

From there, the roi command can be run as follows:

$ dnmtools roi -o output.bed regions.bed input-sorted.counts

The default output format is a 6-column BED format file, with the "score" column (column 5 in the 6-column BED format) of each interval now indicating the methylation level for the interval. If the input intervals are 6-column BED format, then the default output only differs in the "score" column. If the input intervals are a 3-column BED file, then a name (auto-numbered and starting with "X") will be assigned to each interval and all will be assumed on the positive strand.

The value reported in the "score" column of the output file is, by default, the weighted mean methylation through the interval. This is calculated the same way as explained for the levels command. Users may specify two other values (the unweighted mean and the "fractional" methylation) with a command line argument. If the input regions are as follows:

chr1  3011124  3015902
chr1  3015904  3016852
chr1  3017204  3017572
chr1  3021791  3025633
chr1  3026050  3027589

Then the output might appear like:

chr1    3011124 3015902 X0  0.458647    +
chr1    3015904 3016852 X1  0.866667    +
chr1    3017204 3017572 X2  0.946429    +
chr1    3021791 3025633 X3  0.938038    +
chr1    3026050 3027589 X4  0.927007    +

If the input were given as 6-column BED format:

chr1    3011124 3015902 REGION_A    0   +
chr1    3015904 3016852 REGION_B    0   +
chr1    3017204 3017572 REGION_C    0   -
chr1    3021791 3025633 REGION_D    0   -
chr1    3026050 3027589 REGION_E    0   -

Then the output might be as follows:

chr1    3011124 3015902 REGION_A    0.458647    +
chr1    3015904 3016852 REGION_B    0.866667    +
chr1    3017204 3017572 REGION_C    0.946429    -
chr1    3021791 3025633 REGION_D    0.938038    -
chr1    3026050 3027589 REGION_E    0.927007    -

If additional information is desired in the methylation summary for each interval, users can request this with a command line argument. The additional information is all 3 varieties of "levels" (weighted, unweighted and fractional), along with related integer counts through the intervals. These are as follows, appearing after the first 6 columns of the 6-column BED format:

  • (7) weighted mean methylation
  • (8) unweighted mean methylation
  • (9) fractional methylation
  • (10) number of CpGs in the region
  • (11) number of CpGs covered at least once
  • (12) number of observations in reads indicating methylation
  • (13) total number of observations from reads in the region

The weighted mean methylation level is then (12) divided by (13) above. Example output might look like this:

chr1    3011124 3015902 REGION_A    0.458647    +   0.458647    0.448519    0.459302    172 172 915 1995
chr1    3015904 3016852 REGION_B    0.866667    +   0.866667    0.863492    1   6   6   39  45
chr1    3017204 3017572 REGION_C    0.946429    -   0.946429    0.954545    1   11  11  106 112
chr1    3021791 3025633 REGION_D    0.938038    -   0.938038    0.935424    1   109 109 1090    1162
chr1    3026050 3027589 REGION_E    0.927007    -   0.927007    0.918554    0.923077    13  13  127 137

Clearly if there are no reads mapping in a region, then the methylation level will be undefined. By default these are indicated with a value of "NA" in the output. This violates the 6-column BED format convention of using a numerical score in the 5th column (although strictly there are additional constraints on the possible numerical values of the score column in BED format). So the option -N for "numeric" will exclude such intervals in the output. It tends to be rare that downstream analysis will not be robust to the "NA" value, for example any processing in R can interpret the NA. Note: when using the -N flag, if several methylation files are used with the same set of genomic intervals, the output files may not have the same number of intervals.

By default roi performs a binary search on the methylation file to find sites (C or CpG) associated with each interval and loads only those sites, which often saves both time and memory. However, it is routine to compute the average methylation state across a large number of target regions (e.g., fixed-size bins tiling the genome). When the number of intervals is sufficiently large, it can be faster to load the entire methylation file first. The -L option loads all lines of the methcounts file into memory, which saves time at the expense of an increased memory requirement.

Performance There are two different methods for reading the .counts input file. The default mode of assuming the counts file is sorted and doing binary search on disk will only work if the file is not zipped. If the file is not zipped, the user must specify -L to load all the counts into memory. These two methods should produce identical output files, but depending on the application one might be much faster than the other. The numbers below are based on sampling random intervals from a file with mean interval size of 2237 bp. The time is measured in seconds, and all work was done on local SSDs.

N on disk in memory
10 1.70 s 9.55 s
100 1.75 s 9.43 s
1000 2.05 s 9.56 s
10000 5.13 s 9.64 s
20000 8.49 s 9.81 s
30000 11.93 s 9.83 s
40000 15.20 s 10.06 s
50000 18.96 s 9.95 s
10 6.7 MB 3079.7 MB
50000 10.1 MB 3084.0 MB

The counts file for the above data had 30962770 lines, corresponding to CpG sites in hg38 including those in the "extra" chromosomes. Be aware that if you work on a cluster or in the cloud, latency of the disks might become a problem so the "in memory" option might be better all the way. This depends on how your storage is configured. Even if throughput is highly tuned, latency can cause major slowdown for the "on disk" mode.

Options

-o, -output

The name of the output file. If no file name is provided, the output will be written to standard output. Due to the size of this output, a file should be specified unless the output will be piped to another command or program. The output file contains genomic intervals in BED format, with intervals corresponding to those provided as input.

-N, -numeric

print numeric values only (not NAs)

-L, -preload

Load all CpG sites

-s, -sort

sort data if needed

-l, -level

the level to report as score column in bed format output (w, u or f), corresponding to weighted, unweighted or fractional methylation (default: w)

-M, -more-levels

report more methylation information

-v, -verbose

print more run info