2011-03-21

de novo assembly of Illumina CEO genome in 11.5 h

The initial aims of our group regarding de novo assembly of genomes were:

1. To assemble genomes using mixes of sequencing technologies simultaneously.
2. To assemble large repeat-rich genomes.
3. To devise novel approaches to deal with repeats.

1. To assemble genomes using mixes of sequencing technologies simultaneously.

We showed that mixes of sequencing technologies can be assembled simultaneously with Ray, a de novo genome assembler (see Journal of Computational Biology 17(11): 1519-1533). Ray follows the Single Program, Multiple Data (SPMD) approach and is implemented with the Message-Passing Interface (MPI).

With this method, a computation is separated into parts according to the data. Each part is given to a processor, and a processor communicates with the others whenever it needs information stored remotely.
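To make the idea concrete, here is a minimal single-process sketch in Python of how data could be split across processors. It is not code from Ray: the names `owner_rank` and `partition_kmers`, and the choice of a CRC32 hash to pick an owner, are assumptions made for illustration only. In a real SPMD program, every MPI rank would run the same code and keep only its own share.

```python
import zlib
from collections import defaultdict


def owner_rank(kmer: str, n_ranks: int) -> int:
    """Map a k-mer to the rank that stores it (hypothetical hashing scheme)."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


def partition_kmers(kmers, n_ranks):
    """Group k-mers by owning rank, simulating what each processor would hold."""
    parts = defaultdict(list)
    for kmer in kmers:
        parts[owner_rank(kmer, n_ranks)].append(kmer)
    return dict(parts)


if __name__ == "__main__":
    shares = partition_kmers(["ACGTA", "CGTAC", "GTACG", "TACGT"], n_ranks=4)
    for rank, owned in sorted(shares.items()):
        print(rank, owned)
```

Because the owner is a pure function of the k-mer, any processor can work out which rank to ask for a given k-mer without a central directory.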

2. To assemble large repeat-rich genomes.

It was thought that message transit in Ray would make the assembly of large repeat-rich genomes impractical.

On the 12th of December 2010, I wrote an entry on the resource consumption of Ray. That entry reported only the memory usage -- no computation time was available because it was still unclear whether using the Message-Passing Interface end-to-end would be feasible. The k-mer length used was 19. This short k-mer length did not allow Ray to produce significant assemblies.

Since then, I have worked on the in-memory data storage systems, refined manual message aggregation systems, and further investigated ways of automatically grouping communications. One approach that effectively accelerates the computation is a virtual communicator that I designed, implemented and integrated on top of MPI's default communicator -- MPI_COMM_WORLD. For an overview of the Virtual Communicator, read Arnie's story.
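The general idea of grouping communications can be sketched as follows. This is a toy illustration in Python, not the Virtual Communicator itself: the class `AggregatingSender`, its `capacity` parameter and the `send_batch` callback are hypothetical names, and `send_batch` merely stands in for a real transport call such as an MPI send.

```python
from collections import defaultdict


class AggregatingSender:
    """Buffer small messages per destination and send them in batches.

    Illustration only: `send_batch` stands in for a real transport call,
    and the capacity value is arbitrary.
    """

    def __init__(self, send_batch, capacity=4):
        self.send_batch = send_batch      # callable(destination, list_of_messages)
        self.capacity = capacity          # messages per batch before a flush
        self.buffers = defaultdict(list)

    def push(self, destination, message):
        """Queue one message; flush the destination's buffer when it is full."""
        buf = self.buffers[destination]
        buf.append(message)
        if len(buf) >= self.capacity:
            self.flush(destination)

    def flush(self, destination):
        """Send whatever is pending for one destination as a single batch."""
        buf = self.buffers[destination]
        if buf:
            self.send_batch(destination, list(buf))
            buf.clear()

    def flush_all(self):
        """Send every pending buffer, e.g. at the end of a computation step."""
        for destination in list(self.buffers):
            self.flush(destination)


if __name__ == "__main__":
    sent = []
    sender = AggregatingSender(lambda dest, batch: sent.append((dest, batch)), capacity=3)
    for i in range(7):
        sender.push(i % 2, i)
    sender.flush_all()
    print(sent)
```

In this sketch, seven small messages leave as three batches. Fewer, larger messages usually cost less than many tiny ones because each message carries a fixed latency.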

On the 1st of January 2011, our group was awarded a special allocation resource project from Compute Canada. However, we gained access to this resource only one month later: on the 4th of February 2011.

I wandered on the Internet and fetched the genome data of the CEO of Illumina (SRA010766). The numbers are large: 6 372 129 288 paired reads, each 75 nucleotides long. In total, they sum to 477 909 696 600 nucleotides.
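The total is simply the number of reads multiplied by the read length. A quick check in Python, using only the two figures given above (the variable names are mine):

```python
reads = 6_372_129_288      # number of reads, from the text
read_length = 75           # nucleotides per read, from the text

total_nucleotides = reads * read_length
print(f"{total_nucleotides:,}".replace(",", " "))  # prints 477 909 696 600
```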

Here, I report a de novo assembly of the genome of the CEO of Illumina in 11.5 hours using a supercomputer (on Compute Canada's colosse; Job identifier: 2814556; code name: Nitro). The computation was done with Ray 1.3.0 and the k-mer length was 21 (default value in Ray).
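For readers unfamiliar with the term, a k-mer is a substring of length k taken from a read. The small Python sketch below lists them for one read; the function name `kmers` and the made-up read are illustrative assumptions and say nothing about how Ray stores or processes k-mers internally.

```python
def kmers(read: str, k: int = 21):
    """Return every substring of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    read = ("ACGT" * 19)[:75]    # made-up 75-nucleotide read
    print(len(kmers(read)))      # 75 - 21 + 1 = 55 overlapping 21-mers
```

A read of length n yields n - k + 1 k-mers, so a longer k means fewer k-mers per read and fewer observations of each one at a given sequencing depth.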

Ray detected pairs with an average outer distance equal to 190 and a standard deviation of 30. The peak coverage was 22 and the minimum coverage was 6. There were 1 803 534 contiguous sequences with at least 100 nucleotides, with a total of 1 772 120 417 nucleotides. The N50 was 1341 and the average length was 982. Finally, the length of the longest contiguous sequence was 14584.
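As a reminder of what the N50 measures, here is a short Python sketch using the usual definition: the length L such that sequences of length at least L contain at least half of all assembled nucleotides. The function name `n50` and the toy lengths are mine. The last lines only check that the reported average length agrees with the reported totals.

```python
def n50(lengths):
    """Return L such that sequences of length >= L hold at least half of all nucleotides."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))   # toy example

    total_nucleotides = 1_772_120_417       # from the text
    sequence_count = 1_803_534              # from the text
    print(total_nucleotides // sequence_count)
```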

Table 1: Running time of Ray on genome data of the CEO of Illumina.

Algorithm step                          Elapsed time
Distribution of sequence reads          42 minutes, 1 second
Distribution of vertices & edges        18 minutes, 36 seconds
Calculation of coverage distribution    19 seconds
Indexing of sequence reads              24 minutes, 42 seconds
Computation of seeds                    16 minutes, 14 seconds
Computation of library sizes            2 minutes, 58 seconds
Extension of seeds                      7 hours, 10 minutes, 38 seconds
Computation of fusions                  2 hours, 26 minutes, 26 seconds
Collection of fusions                   1 minute, 46 seconds
Total                                   11 hours, 23 minutes, 40 seconds
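
The step times in Table 1 can be added up to confirm the total and to see where the time goes. The Python sketch below copies the durations from the table; the variable names are mine.

```python
from datetime import timedelta

# Durations copied from Table 1.
steps = {
    "Distribution of sequence reads": timedelta(minutes=42, seconds=1),
    "Distribution of vertices & edges": timedelta(minutes=18, seconds=36),
    "Calculation of coverage distribution": timedelta(seconds=19),
    "Indexing of sequence reads": timedelta(minutes=24, seconds=42),
    "Computation of seeds": timedelta(minutes=16, seconds=14),
    "Computation of library sizes": timedelta(minutes=2, seconds=58),
    "Extension of seeds": timedelta(hours=7, minutes=10, seconds=38),
    "Computation of fusions": timedelta(hours=2, minutes=26, seconds=26),
    "Collection of fusions": timedelta(minutes=1, seconds=46),
}

total = sum(steps.values(), timedelta())
print(total)  # 11:23:40

# Share of the total running time taken by each step, largest first.
for name, duration in sorted(steps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {100 * duration / total:.1f}%")
```

The extension of seeds and the computation of fusions together account for roughly 84% of the running time.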

64 computers were utilised. Each had 24 GiB of physical memory and 2 Intel Nehalem-EP processors. Each processor had 4 compute cores. There were a total of 512 compute cores and 1536 GiB of distributed physical memory. The computers were interconnected with InfiniBand QDR (40 Gigabits per second).
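The totals follow directly from the per-computer figures. A small Python check (variable names are mine) also gives the memory available per core:

```python
nodes = 64                  # computers, from the text
processors_per_node = 2     # from the text
cores_per_processor = 4     # from the text
memory_per_node_gib = 24    # from the text

total_cores = nodes * processors_per_node * cores_per_processor
total_memory_gib = nodes * memory_per_node_gib
memory_per_core_gib = total_memory_gib / total_cores

print(total_cores, total_memory_gib, memory_per_core_gib)  # 512 1536 3.0
```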

For people who don't have access to a supercomputer nearby, compute time can be rented from compute cloud providers. An example is Amazon Elastic Compute Cloud (EC2). Ray is cloud-ready, so to say. The Amazon EC2 Cluster Compute offering provides low latency and full-bisection bandwidth of 10 Gigabits per second.

3. To devise novel approaches to deal with repeats.

In addition, I developed novel ways of enabling repeat traversal with paired sequences. I will present this work during the First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing. The 15-minute talk, titled Constrained traversal of repeats with paired sequences, will take place on 27 March 2011 from 12:10 to 12:30.


Perspective

* To try a larger k-mer length.
* To assemble the Yoruban African Genome (SRA000271).

Ray is freely (librement) available on SourceForge.

The Ray paper is freely accessible:

Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.
Sébastien Boisvert, François Laviolette, and Jacques Corbeil.
Journal of Computational Biology (Mary Ann Liebert, Inc. publishers).
November 2010, 17(11): 1519-1533.
doi:10.1089/cmb.2009.0238

Edited on March 21st, 2011 to correct a typographical error.