guessprotocol - Identify bisulfite sequencing protocol

Synopsis

$ dnmtools guessprotocol [OPTIONS] <file-1.fastq> [<file-2.fastq>]

Description

Mapping a WGBS dataset requires knowledge of the sequencing protocol generated to process the data. This may not be properly documented where the data was obtained, so we created this command to guess the protocol based on the nucleotide content in the input FASTQ file (or files, for paired-end).

The guessprotocol tool uses two models of nucleotide content following bisulfite conversion and applies this model to each read. One model is for WGBS, and the other is for PBAT. For each read, both models are applied, and the result is a probability for whether the read (or read pair) was generated using WGBS or PBAT. Once the requested number of reads is processed, the aggregate results for all reads are used to guess whether the protocol used to generate the data was WGBS, PBAT or rPBAT. The criteria are roughly as follows: if most of the reads look like they are from WGBS, then we conclude WGBS. If most of the reads look like they are from PBAT, then we conclude PBAT. If the result is more towards the middle, then we conclude rPBAT.

More details: the number of As, Cs, Gs and Ts differs depending on WGBS (traditional WGBS or MethylC-seq), PBAT -- post bisulfite adaptor tagging, or rPBAT (random PBAT).

For WGBS, a single-end sequenced read should be T-rich, and if the data is paired-end, read1 is T-rich and read2 is A-rich.
For PBAT, a single-end sequenced read should be A-rich, and if the data is paired-end, read1 is A-rich and read2 is T-rich.
For rPBAT, we have a random mix of the above situations. However, in practice it seems almost never to be 50% each.

In most cases, when the data is WGBS or PBAT, it is very obvious which is the protocol used.

As of dnmtools v1.4.1, guessprotocol will always make a conclusion, but includes a confidence level.

The output of guessprotocol is useful prior to mapping. For example, it can be used to decide whether or not to map with the -R flag (for "random PBAT") when using abismal.

For paired-end data, guessprotocol finds ensures reads are mates by finding identical read names. Some datasets finish the read name with identifiers like ".1" on end 1 and ".2" on end 2, thus making the read names technically different at the last two characters. You can tell the program to ignore a certain suffix size (like size 2 in this example) when matching read names using the -i flag.

The output includes the following values in a YAML format: * protocol: this is the guessed protocol (wgbs, pbat or rpbat) based on the content of the reads. * confidence: indicates the level of confidence in the guess for the protocol (values: low or high). * layout: indicates whether the supplied reads were paired or single-ended. * n_reads_wgbs: the average number of reads (for single-ended reads) or read pairs (for paired reads) where read1 is determined by the model to be T-rich. * n_reads: the number of evaluated reads or read pairs. * wgbs_fraction: the probability that a read (for single-ended reads) or the read1 of a read pair (for paired reads) is T-rich.

Options

-n, -nreads

Number of reads to check. The program stops after collecting statistics for the first n reads (default: 1,000,000). Fewer than the default are usually sufficient, but increase this value if you suspect reads at the start of the file might be problematic.

 -i -ignore

Length of the read name suffix to ignore when matching read names to ensure mates are correctly synchronized when the data is paired-end.

-b, -bisulfite

Assumed bisulfite conversion rate for the models (default: 0.98).

-H, -human

Use human genome nucleotide frequencies. A good assumption for samples from a mammal.

-o, -output

The output file name.

-v, -verbose

Report available information during the run.