bsrate - estimate bisulfite conversion rate

Synopsis

$ dnmtools bsrate [OPTIONS] -c <chroms> <input.sam>

Description

Unmethylated cytosines in DNA fragments are converted to uracil by sodium bisulfite treatment. As these fragments are amplified, the uracils are converted to thymines and so unmethylated Cs are ultimately read as Ts (assuming no error). Despite its high fidelity, bisulfite conversion of C to T does have some inherent failure rate, depending on the bisulfite kit used, reagent concentration, time of treatment, etc. These factors impact the success rate of the reaction. Therefore, the bisulfite conversion rate, defined as the rate at which unmethylated cytosines in the sample appear as Ts in the sequenced reads, should be measured and should be very high (e.g. > 0.99) for the experiment to be considered a success.

Measuring bisulfite conversion rate this way requires some kind of control set of genomic cytosines not believed to be methylated. Three options are (1) to spike in some DNA known not to be methylated, such as a Lambda virus, (2) to use the human mitochondrial genome, which is known to be unmethylated almost everywhere in nearly every cell type (or chloroplast genomes), or (3) to use non-CpG cytosines which are believed to be almost completely unmethylated in most mammalian cells. In general the procedure is to identify the positions in reads that correspond to these presumed unmethylated cytosines, then compute the ratio of T over (C + T) at these positions. If the bisulfite reaction is perfect, then this ratio should be very close to 1, and if there is no bisulfite treatment, then this ratio should be close to 0.

The bsrate command will estimate the bisulfite conversion rate in this way. Assuming method (3) from the above paragraph of measuring conversion rate at non-CpG cytosines in a mammalian methylome. The following command will estimate the conversion rate:

$ dnmtools bsrate -c /path/to/genome.fa -o output.bsrate input-sorted.sam

Note that we often use the output of uniq to reduce any bias introduced by differential PCR amplification as a function of conversion. The bsrate command requires that the input be sorted so that reads mapping to the same chromosome are consecutive within the file and the input should have duplicate reads already removed. The first several lines of the output might look like the following:

OVERALL CONVERSION RATE = 0.994141
POS CONVERSION RATE = 0.994166  832349
NEG CONVERSION RATE = 0.994116  825919
BASE PTOT  PCONV PRATE   NTOT  NCONV NRATE   BTHTOT BTHCONV BTHRATE ERR ALL    ERRRATE
1    8964  8813  0.9831  9024  8865  0.9823  17988  17678   0.9827  95  18083  0.0052
2    7394  7305  0.9879  7263  7183  0.9889  14657  14488   0.9884  100 14757  0.0067
3    8530  8442  0.9896  8323  8232  0.9890  16853  16674   0.9893  98  16951  0.0057
4    8884  8814  0.9921  8737  8664  0.9916  17621  17478   0.9918  76  17697  0.0042
5    8658  8596  0.9928  8872  8809  0.9929  17530  17405   0.9928  70  17600  0.0039
6    9280  9218  0.9933  9225  9177  0.9948  18505  18395   0.9940  59  18564  0.0031
7    9165  9117  0.9947  9043  8981  0.9931  18208  18098   0.9939  69  18277  0.0037
8    9323  9268  0.9941  9370  9314  0.9940  18693  18582   0.9940  55  18748  0.0029
9    9280  9228  0.9944  9192  9154  0.9958  18472  18382   0.9951  52  18524  0.0028
10   9193  9143  0.9945  9039  8979  0.9933  18232  18122   0.9939  66  18298  0.0036

The above example is based on a very small number of mapped reads in order to make the output fit here. The first thing to notice is that the conversion rate is computed separately for each strand. The information is presented separately because this is often a good way to see when some problem has occurred in the context of paired-end reads. If the conversion rate looks significantly different between the two strands, then we would go back and look for a mistake that has been made at an earlier stage in the pipeline. The first 3 lines in the output indicate the overall conversion rate, the conversion rate for positive strand mappers, and the conversion rate for negative strand mappers. The total number of nucleotides used (e.g. all C+T mapping over genomic non-CpG Cs for method (3)) is given for positive and negative strand conversion rate computation, and if everything has worked up to this point these two numbers should be very similar. The 4th line gives column labels for a table showing conversion rate at each position in the reads. The labels PTOT, PCONV and PRATE give the total nucleotides used, the number converted, and the ratio of those two, for the positive-strand mappers. The corresponding numbers are also given for negative strand mappers (NTOT, NCONV, NRATE) and combined (BTH). The sequencing error rate is also shown for each position, though this is an underestimate because we assume at these genomic sites any read with either a C or a T contains no error.

The bisulfite conversion rate reported in MethBase is computed from all non-CpG cytosines, and assumes zero non-CpG methylation. Because this is likely not 100% true, it is a conservative estimate of conversion rate. A better method for computing conversion rate is to use an unmethylated spike-in or reads mapping to mitochondria, which has been shown to be entirely unmethylated in most human tissues. To use the mitochondria, you can extract mitochondrial reads from a SAM file using awk:

$ awk '$1 /~/ ^@ || $3 == "chrM"' input.sam >input-chrM-only.sam

After completing bisulfite conversion rate analysis, remember to remove any control reads not naturally occurring in the sample (lambda virus, mitochondrial DNA from another organism, etc.) before continuing your analysis. The output from two different runs of bsrate can be merged using the command merge-bsrate.

Options

-o, -output

The name of the output file (default: stdout).

-c, -chrom

File or directory of files containing the chromosome sequences (FASTA format; .fa suffix assumed). If the input is a directory, it should contain several FASTA files, each one of which contains a chromosome sequence. These should be the exact same as used for mapping the reads. This is required.

-N, -all

Use all Cs (including CpGs) when estimating bisulfite conversion. This will only be useful if the estimate is made from sequences known to be unmethylated, like with a spike-in.

-seq

Use only reads that map to this chromosome (e.g. chrM). This simply avoids having to extract reads separately from the mapped reads file before using bsrate.

 -A, -a-rich

The reads are A-rich. This may be relevant if a BPAT protocol is used.

-v, -verbose

Print more information while the program is running.

 -S, -summary

Write the analysis summary to this file. The summary is not reported unless a file is specified here. This information includes nothing beyond what is currently generated for the output file. This option is correct as of v1.4.0.

 -t, -threads

Use this many threads. This only has an effect if set to a value greater than 1. The additional threads are only used for decompressing BAM format input. It is not recommended to use many threads due to diminishing returns, and performance will degrade if more threads are specified than physical cores available. This option is correct as of v1.4.0.

 -p, -per-read

This option will generate a histogram of conversion rate per read, printed to the terminal. This option is correct as of v1.4.0.

 -b, -bins

Assming the -p option is used, this option determines the number of bins in the histogram. This option is correct as of v1.4.0.

 -S, -summary

Write a summary of information gathered during the run to this file.