2014-08-28

Profiling a high-performance actor application for metagenomics


I am currently in an improvement phase where I break, build and improve various components of the system.


The usual way of doing things is to have a static view of one node among all the nodes inside an actor computation. The graphs look like this:


(Graphs for the 512x16, 1024x16, 1536x16, and 2048x16 configurations.)

But with 2048 nodes, a single selected node may not be an accurate representation of what is going on. This is why we are instead generating 3D graphs from Thorium profiles. They look like this:



(3D graphs for the 512x16, 1024x16, 1536x16, and 2048x16 configurations.)


2014-08-02

The public datasets from the DOE/JGI Great Prairie Soil Metagenome Grand Challenge



I am working on a couple of very large public metagenomics datasets from the Department of Energy (DOE) Joint Genome Institute (JGI). These datasets were produced in the context of the Grand Challenge program.

Professor Janet Jansson was the Principal Investigator for the proposal named Great Prairie Soil Metagenome Grand Challenge (Proposal ID: 949).


Professor C. Titus Brown wrote a blog article about this Grand Challenge.
Moreover, the Brown research group published at least one paper using these Grand Challenge datasets (assembly with digital normalization and partitioning).

Professor James Tiedje presented the Grand Challenge at the 2012 Metagenomics Workshop.

Alex Copeland presented interesting work at Sequencing, Finishing and Analysis in the Future (SFAF) in 2012 related to this Grand Challenge.



Jansson's Grand Challenge included 12 projects. I made the list below (originally color-coded by sample site and type of soil).

  1. Great Prairie Soil Metagenome Grand Challenge: Kansas, Cultivated corn soil metagenome reference core (402463)
  2. Great Prairie Soil Metagenome Grand Challenge: Kansas, Native Prairie metagenome reference core (402464)
  3. Great Prairie Soil Metagenome Grand Challenge: Kansas, Native Prairie metagenome reference core (402464) (I don't know why it's listed twice)
  4. Great Prairie Soil Metagenome Grand Challenge: Kansas soil pyrotag survey (402466)
  5. Great Prairie Soil Metagenome Grand Challenge: Iowa, Continuous corn soil metagenome reference core (402461)
  6. Great Prairie Soil Metagenome Grand Challenge: Iowa, Native Prairie soil metagenome reference core (402462)
  7. Great Prairie Soil Metagenome Grand Challenge: Iowa soil pyrotag survey (402465)
  8. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Continuous corn soil metagenome reference core (402460)
  9. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Native Prairie soil metagenome reference core (402459)
  10. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Restored Prairie soil metagenome reference core (402457)
  11. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Switchgrass soil metagenome reference core (402458)
  12. Great Prairie Soil Metagenome Grand Challenge: Wisconsin soil pyrotag survey (402456)

I thank the Jansson research group for making these datasets public so that I don't have to look further for large politics-free metagenomics datasets.


Table 1: number of files, reads, and bases in the Grand Challenge datasets. Most of the sequences are paired reads.
Dataset                              File count   Read count       Base count
Iowa_Continuous_Corn_Soil            25           2 055 601 258    196 708 830 076
Iowa_Native_Prairie_Soil             25           3 750 844 486    326 986 888 235
Kansas_Cultivated_Corn_Soil          30           2 677 222 281    272 276 185 410
Kansas_Native_Prairie_Soil           33           5 126 775 452    597 933 511 278
Wisconsin_Continuous_Corn_Soil       18           1 912 865 700    192 128 891 088
Wisconsin_Native_Prairie_Soil        20           2 098 317 886    211 016 377 208
Wisconsin_Restored_Prairie_Soil      6            347 778 670      52 514 579 170
Wisconsin_Switchgrass_Soil           7            448 382 766      58 323 428 574
Total                                164          18 417 788 499   1 907 888 691 039


At Argonne we are using these datasets to develop a next-generation metagenomics assembler named "Spate", built on top of the Thorium actor engine. The word spate means a large number of similar things or events appearing or occurring in quick succession. With the actor model, every single message is an active message. Active messages are very neat, and there are a lot of them with the actor model.



2014-08-01

The Thorium actor engine is now operational, so we can start to work on actor applications for metagenomics

I have been very busy during the last few months. In particular, I completed my doctorate on April 10th, 2014, and we moved from Canada to the United States on April 15th, 2014. I started a new position on April 21st, 2014 at Argonne National Laboratory (a U.S. Department of Energy laboratory).

But the biggest change, perhaps, was not one of those listed above. The biggest change was to stop working on Ray. Ray is built on top of RayPlatform, which in turn uses MPI for parallelism and distribution. But this approach is not an easy way of devising applications, because message passing alone is a very leaky, not self-contained, abstraction. Ray usually works fine, but it has some bugs.

The problem with leaky abstractions is that they lack simplicity, which makes applications built on them way too complex to scale out.

For example, it is hard to add new code to an existing code base without breaking anything. This is the case because MPI only offers a fixed number of ranks. Sure, the MPI standard has features to spawn new ranks, but they are not supported on most platforms, and when they are, the new ranks are spawned as operating system processes.

There are arguably 3 known methods to reduce the number of bugs. The first is to (1) write a lot of tests, although it is better to have fewer bugs in the first place. The second is to use pure (2) functional programming. The third is to use the (3) actor model.

If you look at what the industry is doing, Erlang, Scala (and perhaps D) use the actor model of computation. The actor model of computation was introduced by the legendary (that's my opinion) Carl Hewitt in two seminal papers (Hewitt, Bishop, Steiger 1973 and Hewitt and Baker 1977).

Erlang is cooler than Scala (this is an opinion, not a fact) because it enforces both the actor model and functional programming whereas Scala (arguably) does not enforce anything.

The real thing, perhaps, is to apply the actor model to high-performance computing. In particular, I am applying it to metagenomics because there is a lot of data. For example, Janet Jansson and her team generated huge datasets in 2011 in the context of a Grand Challenge.

So basically I started to work on biosal (biological sequence analysis library) on May 22nd, 2014. The initial momentum for the SAL concept (Sequence Analysis Library) was created in 2012 at a workshop. So far, at least two projects (that I am aware of) are related to this workshop: KMI (Kmer Matching Interface) and biosal.

The biosal team is small: we are currently 6 people, and only 2 of us are pushing code.

Here is the current team:







Person (alphabetical order), with roles in the biosal project:
Pavan Balaji
  • MPI consultant
Sébastien Boisvert
  • Master branch owner
  • Actor model enthusiast
  • Metagenomics person
  • Scrum master
Huy Bui
  • PAMI consultant
  • Communication consultant
Rick Stevens
  • Supervisor
  • Metagenomics person
  • Stakeholder
  • Product owner
  • Exascale computing enthusiast
Venkatram Vishwanath
  • Actor model enthusiast
  • Exascale computing enthusiast
Fangfang Xia
  • Product manager
  • Actor model enthusiast
  • Metagenomics person





When I started to implement the runtime system in biosal, I did not plan to give a name to that component. But I changed my mind because the code is general and very cool. It is a distributed actor engine written in C 1999 (C99), MPI 1.0, and Pthreads, and it is named Thorium (like the chemical element).

Thorium uses the actor model, but does not use functional programming.

It is quite easy to get started with this. It is a two-step process.

The first step is to create an actor script (3 C functions called init, destroy, and receive). For a given actor script, you need to write 2 files (a .h header file and a .c implementation file).

This first step defines an actor script structure like this:

struct bsal_script hello_script = {
    .name = HELLO_SCRIPT,
    .init = hello_init,
    .destroy = hello_destroy,
    .receive = hello_receive,
    .size = sizeof(struct hello),
    .description = "hello"
};
The prototypes for the 3 functions are:



Function    Concrete actor function
init        void hello_init(struct bsal_actor *self);
destroy     void hello_destroy(struct bsal_actor *self);
receive     void hello_receive(struct bsal_actor *self,
                struct bsal_message *message);
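
Putting the script structure and the prototypes together, the header file for this script could look roughly like the sketch below. The include path, the script number, and the contents of struct hello are illustrative assumptions on my part, not the exact files from the biosal repository.

/* hello.h: a sketch of an actor script header.
 * The include path, the script number, and struct hello
 * are assumptions for illustration purposes. */
#ifndef HELLO_H
#define HELLO_H

#include <biosal.h> /* assumed umbrella header */

/* every script needs a unique number; this value is made up */
#define HELLO_SCRIPT 0x7e6a5c3b

/* per-actor state; Thorium allocates sizeof(struct hello) bytes
 * for each spawned actor of this script */
struct hello {
    int received_messages;
};

extern struct bsal_script hello_script;

void hello_init(struct bsal_actor *self);
void hello_destroy(struct bsal_actor *self);
void hello_receive(struct bsal_actor *self, struct bsal_message *message);

#endif /* HELLO_H */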


The functions init and destroy are called automatically by Thorium when an actor is spawned and killed, respectively. The function receive is called automatically by Thorium when the actor receives a message. Sending messages is the only way to interact with an actor.

There is only one (very simple) way to send a message to an actor:

void bsal_actor_send(struct bsal_actor *self, int destination, struct bsal_message *message);
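
To show how receive and bsal_actor_send fit together, here is a minimal sketch of a receive function that answers a message. The message accessors (bsal_message_tag, bsal_message_source, bsal_message_init, bsal_message_destroy) and the application tag are written from memory as assumptions and may not match the biosal sources exactly.

/* hello.c (sketch): replying to a message inside receive.
 * The bsal_message_* accessors and the tag below are assumptions. */

#define HELLO_HELLO 0x00001234 /* made-up application message tag */

void hello_receive(struct bsal_actor *self, struct bsal_message *message)
{
    int tag;
    int source;
    struct bsal_message reply;

    tag = bsal_message_tag(message);       /* what kind of message was received */
    source = bsal_message_source(message); /* the actor that sent it */

    if (tag == HELLO_HELLO) {
        /* send back an empty message with the same tag */
        bsal_message_init(&reply, HELLO_HELLO, 0, NULL);
        bsal_actor_send(self, source, &reply);
        bsal_message_destroy(&reply);
    }
}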


The second step is to create a Thorium runtime node in a C file with a main function (around 10 lines).
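
For concreteness, such a main function might look roughly like the following sketch. The bsal_node_* calls (init, add_script, spawn, run, destroy) are assumptions about the runtime API based on the description above, not the verbatim biosal interface.

/* main.c (sketch): a Thorium runtime node in about 10 lines.
 * The bsal_node_* function names are assumptions. */
#include "hello.h"

int main(int argc, char **argv)
{
    struct bsal_node node;

    bsal_node_init(&node, &argc, &argv);                      /* set up MPI and the worker threads */
    bsal_node_add_script(&node, HELLO_SCRIPT, &hello_script); /* register the actor script */
    bsal_node_spawn(&node, HELLO_SCRIPT);                     /* spawn one initial actor on this node */
    bsal_node_run(&node);                                     /* deliver messages until all actors die */
    bsal_node_destroy(&node);

    return 0;
}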

After completing these two easy steps, you just need to compile and link the code.

After that, you can perform actor computations anywhere. A typical command to do so is:

mpiexec -n 1024 ./hello_world -threads-per-node 32
 
 

Obviously, you need more than just one actor script to actually do something cool with actors.


On a final note, biosal is an object-oriented project. The current object is typically called self, like in Swift, Ruby, and Smalltalk.
