multistat

Synopsis

$ dnmtools multistat [OPTIONS] <intervals.bed> <input-tabular.tsv>

Description

The multistat program is similar to roi, but instead of creating a BED file with averge methylation levels from a single counts file, it takes as an input the output of merge with tabular format (i.e. using the -tabular flag to make a data frame) and using the -radmeth flag to remove suffixes that are not used in this program. In other words, multistat takes a data frame as input and produces a data frame as output.

The input of multistat starts with a line with 2n column names, with each column name appearing sequentially twice. The file is then followed by a set of lines containing 2n+1 elements. Each sample contains two columns. The first column is the number of reads that cover the CpG in the sample, and the second column is the number of CpGs that are methylated among the reads.

Here is a visual example of a file called input-tabular.tsv with four samples (D083a, D083b, D091a and D091b):

  D083a       D083a       D083b       D083b       D091a       D091a       D091b       D091b
chr1:10468:+:CpG        3       0       2       0       2       0       1       0
chr1:10470:+:CpG        6       0       3       0       4       0       3       0
chr1:10483:+:CpG        7       0       3       0       5       0       3       1
chr1:10488:+:CpG        7       0       3       0       5       0       3       0
chr1:10492:+:CpG        7       0       2       0       5       0       3       0
chr1:10496:+:CpG        6       0       4       0       5       0       4       0
chr1:10524:+:CpG        6       2       4       0       7       0       5       1
chr1:10541:+:CpG        4       0       4       0       7       2       5       0
chr1:10562:+:CpG        3       0       3       0       6       0       4       0
chr1:10570:+:CpG        2       0       3       0       6       0       4       0
chr1:10576:+:CpG        2       0       3       0       6       0       4       0

Note that, if you do not add the -radmeth flag when running merge, the tabular output may contain suffixes _R and _M on the column names (e.g. D083a_R and D083a_M corresponding to the "Reads" and "Methylated" columns). You can remove these to make the input proper by running

 $ sed -i '1s/_[MR]//g' input-tabular.tsv

multistat also requires an input BED file representing the genomic intervals of interest. The regions must be sorted by chromosome, position and strand. If they are not, you can add the -s flag to sort the file prior to running the program. Note that for very large BED files, this may take a long time. Given an input file regions.bed, you can sort it in one of the two following ways:

 $ LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k6,6 -o regions.bed regions.bed

 $ bedtools sort -i regions.bed

Finally, to create a file data-frame.tsv with methylation levels (which can be weighted, unweighted or fractional methylation), run

 $ dnmtools multistat -o data-frame.tsv regions.bed input-tabular.tsv

Options

 -o, -output

Name of output file (default: STDOUT)

 -N, -numeric

print numeric values only (not NAs), guaranteeing that the output contains as many rows as there are regions in the BED input.

 -L, -preload

Load all CpG sites

 -s, -sort

sort data if needed

 -l, -level

the level to report as score column in bed format output (w, u or f), corresponding to weighted, unweighted or fractional methylation (default: w)

 -M, -more-levels

report more methylation information

 -v, -verbose

print more run info to STDERR