merge - Combine counts files
Synopsis
$ dnmtools merge [OPTIONS] <file1.meth> <file2.meth> ...
Description
This command does two things, both of which can be called "merging" methylation files. The two behaviors live in one command because the internal work is the same, but the output and the motivation differ considerably between the two uses of merge.
(1) The merge command can take a set of counts output files and combine them into one file. There are several reasons to do this. One example is when technical replicates are sequenced and initially analyzed separately, but later need to be combined for downstream analysis. If the counts output files are already available, merging them with the merge command is easier than redoing the analysis with all replicates together. Similarly, if some data is generated early and more is produced later from the same sample, merging in this way avoids repeating the earlier analysis.
Suppose you have three counts output files from three different replicates: R1.meth, R2.meth and R3.meth. To merge those individual counts files, execute:
$ dnmtools merge -o combined.meth R1.meth R2.meth R3.meth
The command can handle an arbitrary number of files, and the files do not need to have the same number of lines/sites. The merge command does assume that chromosomes appear in the same order in every input file and that, within each chromosome, all sites appear in increasing order of position.
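To illustrate what combining replicates means for a single site, here is a minimal Python sketch. It is not the dnmtools implementation. It assumes the six columns of a counts file are chromosome, position, strand, context, methylation level and number of reads, and it recovers the methylated read count by rounding level times reads. The function name merge_counts is hypothetical. Because the sketch holds every site in memory, it does not rely on the sort-order assumption described above.

```python
def merge_counts(replicates):
    """Sum read counts per site across replicates.

    Each replicate is an iterable of text lines with six whitespace-separated
    columns, ASSUMED to be: chrom, position, strand, context,
    methylation level (a fraction), number of reads.
    """
    merged = {}  # (chrom, pos, strand, context) -> [methylated reads, total reads]
    for lines in replicates:
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and any header lines
            chrom, pos, strand, context, level, n_reads = line.split()
            n_reads = int(n_reads)
            n_meth = round(float(level) * n_reads)  # assumption: level = meth / reads
            entry = merged.setdefault((chrom, int(pos), strand, context), [0, 0])
            entry[0] += n_meth
            entry[1] += n_reads

    rows = []
    # Sites are emitted in lexicographic chromosome order, then by position.
    for (chrom, pos, strand, context), (n_meth, n_reads) in sorted(merged.items()):
        level = n_meth / n_reads if n_reads > 0 else 0.0
        rows.append((chrom, pos, strand, context, level, n_reads))
    return rows


# Two made-up replicates; the second one lacks the site at position 114.
r1 = ["chr1\t108\t+\tCpG\t0.5\t4", "chr1\t114\t+\tCpG\t1.0\t2"]
r2 = ["chr1\t108\t+\tCpG\t1.0\t6"]
rows = merge_counts([r1, r2])
```

A site missing from one replicate simply contributes no reads, which is why the inputs do not need the same number of lines.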
(2) The merge command can take a set of counts output files and combine them into a table that contains all the same information. The table format is helpful if subsequent analyses are to be done using a data table, for example a data frame in R. When producing this tabular format, merge lets the user choose whether the output is given as counts (both the total number of reads and the number of methylated reads for every site) or as fractions. If a fraction would involve division by zero, "NA" is written in the output, although this behavior can be adjusted with the command line options.
Suppose you have 4 different methylomes, two replicates from wild type and two from a mutant. You want to create a table of methylation information so you can analyze the data in R as a data frame. The merge command can help by pasting the data as a table, ensuring a consistent order and filling in any missing values from the individual input files:
$ dnmtools merge -t -o table.txt wt1.meth wt2.meth mut1.meth mut2.meth
The file table.txt is not in the same format as the input files, which each have exactly 6 columns. The first column of the output holds the row names. It is followed by two columns for each input file: one with the count of total reads and one with the count of reads indicating methylation. In the example below, the header line has no entry for the row-name column, so it has one fewer field than the data rows. Here is what the output might look like:
wt1_R wt1_M wt2_R wt2_M mut1_R mut1_M mut2_R mut2_M
chr1:108:+:CpG 9 6 10 8 2 2 2 1
chr1:114:+:CpG 17 7 10 0 5 1 9 1
chr1:160:+:CpG 12 8 10 5 15 14 13 6
chr1:309:+:CpG 1 1 1 0 12 8 2 1
Note: Currently the output from merge may not be immediately compatible with radmeth, since the column headings might not be in the expected format. An option for this has been added in the most recent code.
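The table can also be processed outside R. The following Python sketch parses the example table shown above and converts each pair of count columns into a fraction, using None where a fraction is undefined or a site has too few reads. It is only an illustration of the idea behind fractional output, not the code dnmtools runs. The function name table_to_fractions and the min_reads parameter are hypothetical, and the "_R"/"_M" heading suffixes are taken from the example output.

```python
TABLE = """\
wt1_R wt1_M wt2_R wt2_M mut1_R mut1_M mut2_R mut2_M
chr1:108:+:CpG 9 6 10 8 2 2 2 1
chr1:114:+:CpG 17 7 10 0 5 1 9 1
chr1:160:+:CpG 12 8 10 5 15 14 13 6
chr1:309:+:CpG 1 1 1 0 12 8 2 1
"""


def table_to_fractions(lines, min_reads=1):
    """Turn (total, methylated) count pairs into methylation fractions.

    The header is ASSUMED to have one fewer field than the data rows,
    with columns ordered as total reads then methylated reads per sample.
    """
    header = lines[0].split()
    # "wt1_R" -> "wt1": take the sample name from each total-reads column.
    samples = [name.rsplit("_", 1)[0] for name in header[0::2]]
    fractions = {}
    for line in lines[1:]:
        fields = line.split()
        site, counts = fields[0], [int(x) for x in fields[1:]]
        row = []
        for i in range(0, len(counts), 2):
            n_reads, n_meth = counts[i], counts[i + 1]
            if n_reads >= min_reads and n_reads > 0:
                row.append(n_meth / n_reads)
            else:
                row.append(None)  # plays the role of "NA" in the text output
        fractions[site] = dict(zip(samples, row))
    return fractions


fractions = table_to_fractions(TABLE.splitlines())
strict = table_to_fractions(TABLE.splitlines(), min_reads=3)
```

With the stricter threshold, sites covered by only one or two reads in a sample become None for that sample, while the other samples at the same site keep their values.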
Options
-o, -output
Output file, in counts format unless -tabular is used (default: stdout)
-h, -header
Print a header given by the input string at the top of the file (ignored for tabular)
-t, -tabular
Output is in table format.
-remove
Suffix to remove from filenames when making column names for tabular format. If not specified, the suffix starting at the final dot is removed.
-s, -suff
Column name suffixes, one for total reads and one for methylated reads, to be separated from sample name with underscore in the header for tabular format output.
-f, -fractional
Output table will give fractions (requires -tabular).
-r, -reads
Minimum number of reads required when using the -f flag (default: 1)
-ignore
Ignore sorting. Do not attempt to determine chromosome order. Lexicographic order on chromosome names will be assumed.
-v, -verbose
Report more information while the program is running.
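The column naming described under -remove and -suff can be sketched in Python as follows. This is an illustration of the rule as stated above, not the dnmtools source. The function column_names is hypothetical, the default suffixes "R" and "M" are taken from the example table, and dropping the directory part of the path is an assumption.

```python
import os


def column_names(filename, remove=None, suffixes=("R", "M")):
    """Build the two tabular column headings for one input file."""
    base = os.path.basename(filename)  # assumption: directories are dropped
    if remove is None:
        # Default rule: strip everything from the final dot onward.
        sample = base.rsplit(".", 1)[0]
    elif base.endswith(remove):
        sample = base[: len(base) - len(remove)]
    else:
        sample = base  # assumption: a non-matching suffix leaves the name as is
    # Sample name and suffix are joined with an underscore.
    return [f"{sample}_{s}" for s in suffixes]


default_names = column_names("wt1.meth")
custom_names = column_names("data/mut1.counts.meth", remove=".counts.meth")
```

Here default_names reproduces the wt1_R and wt1_M headings from the example table, and custom_names shows how a longer suffix could be stripped to leave a clean sample name.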