abismal - another bisulfite mapping algorithm
Synopsis
dnmtools abismal [OPTIONS] input.fq [input-r2.fq]
Description
During bisulfite treatment, unmethylated cytosines in the original DNA
sequences are converted to uracils, which are then incorporated as
thymines (T) during PCR amplification. These PCR products are referred
to as T-rich sequences as a result of their high thymine constitution.
With paired-end sequencing experiments, the compliments of these
T-rich sequences are also sequenced. These complimentary sequences
have high adenine (A) constitution (A is the complimentary base pair
of T), and are referred to as A-rich sequences. Mapping consists of
finding sequence similarity, based on context specific criteria,
between these short sequences, or reads, and an orthologous reference
genome. When mapping T-rich reads to the reference genome, either a
cytosine (C) or a thymine (T) in a read is considered a valid match
for a cytosine in the reference genome. For A-rich reads, an adenine
or a guanine is considered a valid match for a guanine in the
reference genome. The mapping of reads to the reference genome by
abismal is described below. If you choose to map reads with a
different tool, make sure that your post-mapping files are
appropriately formatted for the next components of the dnmtools
pipeline (necessary file formats for each step are covered in the
corresponding sections). The default behavior of abismal is to
assume that reads are T-rich and map accordingly, but different
sequencing protocols that generate A-rich and T-rich reads in
different combinations are equally accepted. abismal is
available here.
Input and output file formats
We assume that the original data is a set of sequenced read files,
typically as produced by Illumina sequencing. These are FASTQ format
files, and can be quite large. After the reads are mapped, these files
are not used by our pipeline. The reference genome should be a single
FASTA file that was previously indexed using the abismalidx
tool. The abismal program requires an indexed FASTA reference genome
and the input FASTQ files(s), after which it generates a Sequence
Alignment/Map(SAM) output indicating the coordinates of mapped reads.
Details of the SAM file format can be found at the SAM file format
documentation. These
SAM files will be the input files for the postprocessing quality
control and analysis programs to follow, including bsrate
and counts.
abismal operates by preprocessing the reference genome into a large
index, where k-mers of set length are used as keys to a list of
potential mapping locations for reads that begin with their sequence.
To produce this index run the following command:
$ dnmtools abismalidx genome.fa genome.idx
Where genome.fa is the name of your reference genome, and genome.idx is
what you decided to name your abismal index file (you can name it anything).
For the human genome hg38 reference, with size 3 GB, the resulting index is approximately 2.5 GB.
Decompressing and isolating paired-end reads
Sometimes paired-end reads are stored in the same FASTQ file. Because
we treat these paired ends differently, they must be separated into
two files and both files must be passed as inputs to abismal.
If your data is compressed as a Sequenced Read Archive, or SRA file,
you can decompress and split paired-end reads into two files at the
same time using fastq-dump , which is a program included in the
sra-toolkit
package, available for most unix systems. Below is an example of using
fastq-dump to decompress and separate FASTQ data by end:
$ fastq-dump --split-3 human_esc.sra
If you have a FASTQ file not compressed in SRA format, you can split paired ends into two separate files by running the following commands:
$ sed -ne '1~8{N;N;N;p}' *.fastq > *_1.fastq
$ sed -ne '4~8{N;N;N;p}' *.fastq > *_2.fastq
Sequencing adapters
These are a problem in any sequencing experiment with short fragments
relative to the lengths of reads. A robust method for removing
adapters is available via the Babraham Institute's Bioinformatics
group, called
trim_galore.
This program takes into account adapter contamination of fewer than 10
nucleotides and can handle mismatches with the provided sequence. We
strongly recommend reads are trimmed prior to mapping. Our recommended
parameter choice for trim_galore restricts trimming only to
sequencing adapters, leaving all other bases unaltered. This can be
attained by running it with the following command for single-end
mapping
$ trim_galore -q 0 --length 0 human_esc.fastq
and the following command for paired-end
$ trim_galore --paired -q 0 --length 0 human_esc_1.fastq human_esc_2.fastq
Note that the --paired flag is necessary to ensure the program does
not interpret the two input files as independent and that the
resulting FASTQ files still have corresponding read mates in the
correct order.
Single-end reads
When working with data from a single-end sequencing experiment, you
will have T-rich reads only. abismal expects T-rich reads as a
default. Execute the following command to map all of your single-end
reads with abismal:
$ abismal -i <genome index> -o <output SAM> [options] <input fastq>
To save files in BAM format, which significantly reduce disk space,
simply redirect the abismal output to the samtools
view program using the -b
flag to compress to BAM and the -S flag to indicate that input is in
SAM format.
$ abismal -i <genome index> -s <output STATS> [options] <input fastq> | samtools view -bS > <output BAM>
Paired-end reads
When working with data from a paired-end sequencing experiment, you
will have T-rich and A-rich reads. T-rich reads are often kept in
files labeled with an _1 and A-rich reads are often kept in files
labeled with an _2. T-rich reads are sometimes referred to as
5' reads or mate 1 and A-rich reads are sometimes referred
to 3' reads or mate 2. We assume that the T-rich file and
the A-rich contain the same number of reads, and each pair of mates
occupy the same lines in their respective files. We will follow this
convention throughout the manual and strongly suggest that you do the
same. Run the following command to map two
reads files from a paired-end sequencing experiment:
$ abismal -i <index> -o <output SAM> [options] <input fastq 1> <input fastq 2>
In brief, what happens internally in abismal is as follows.
abismal finds candidate mapping locations for a T-rich mate with
CG-wildcard mapping, and candidate mapping locations for the
corresponding A-rich mate with AG-wildcard mapping. If two candidate
mapping locations of the pair of mates are within certain distance in
the same chromosome and strand and with correct orientation, the two
mates are combined into a single read (after reverse complement of the
A-rich mate), referred to as a fragment. The overlapping region
between the two mates, if any, is included once, and the gap region
between them, if any, is filled with Ns. The parameters -l and -L
to abismal indicate the minimum and maximum size of fragments to
allow to be merged, respectively.
Here the fragment size is the sum of the read lengths at
both ends, plus whatever distance is between them. So this is the
length of the original molecule that was sequenced, excluding the
sequencing adapters. It is possible for a given read pair that the
molecule was shorter than twice the length of the reads, in which case
the ends of the mates will overlap, and so in the merged fragment will
only be included once. Also, it is possible that the entire molecule
was shorter than the length of even one of the mates, in which case
the merged fragment will be shorter than either of the read ends. If
the two mates cannot be merged because they are mapped to different
chromosomes or different strand, or they are far away from each other,
abismal will output each mate individually if its mapping
position is unambiguous.
abismal provides a statistical summary
of its mapping job in the mapstats file, which includes the total
number and proportion of reads mapped, how many paired end mates were
mapped together, and the distribution of fragment lengths computed by
the matched pairs. The mapstats file name is usually the same as
the SAM output with mapstats appended to it, but custom file
names can be provided using the -s flag.
Mapping reads in a large file:
Mapping reads may take a while, and mapping reads from WGBS takes
even longer. It usually take quite a long time to map reads from a
single large file with tens of millions of reads. If you have access
to a cluster, one strategy is to launch multiple jobs, each working on
a subset of reads simultaneously, and finally combine their output.
abismal uses standard c++ threads to parallelize the process of
mapping reads using the shared index.
If each node can only access its local storage, dividing the set of
reads to into k equal sized smaller reads files, and mapping these all
simultaneously on multiple nodes, will make the mapping finish about k
times faster. The UNIX split command is good for dividing the reads
into smaller parts. The following BASH commands will take a directory
named reads containing Illumina sequenced reads files, and split
them into files containing at most 3M reads:
$ mkdir reads_split
$ for i in reads/*.txt; do split -a 3 -d -l 12000000 ${i} reads_split/$(basename $i); done
Notice that the number of lines per split file is 12M, since we want 3M reads,
and there are 4 lines per read. If you split the reads like this, you will
need to "unsplit" them after the mapping is done. This can be done using the
samtools merge command.
Abismal also exists as a standalon program, and more details on abismal are available in its documentation manual