2011-11-04

Who is in charge?

Construction of a building

2012 will soon begin and the Centre de génomique de Québec is still in a sorry state. Renovations here, unbearable noise there. And it was supposed to be delivered in 2005. Its major defects already need to be repaired.

According to the Journal de Québec, this project will have cost (at least) $22 million.

Jean-Claude Forest, the former director of the CHUQ research centre, says that the financial impact is being calculated.

But it is (a bit) late for the researchers.


Computerized student management

Another failure is Université Laval's Capsule computer system -- a web portal that manages (very badly) student records. This project will have cost $26 million.

André Armstrong, the director of the study-modernization project, attributes its failure to the lack of student participation in follow-up committees.

I find it hard to understand how a project with such a large budget can be allowed to fail so badly because of vacant student seats.


Computerized health record

"Physicians could treat 20% more patients if they had access to electronic records."

But the Dossier de santé du Québec -- a computerized health record with a budget of $563 million -- must now get through the decisive step: implementing interoperability. It is very difficult to modify software that was not initially designed to work together (inter-operate).

Another failure, then.


Emerging market

Law firms love it. Corruption and collusion are a lucrative market for them. Some researchers are even suing the university in the Superior Court of Québec (and vice versa).

2011-09-27

Code review: What happens in Open-MPI's MPI_Iprobe?


Update 1

Subject: Re: [OMPI devel] Implementation of MPI_Iprobe
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-09-27 15:34:05

Sebastien,

Your analysis is correct in case the checkpoint/restart approach maintained by ORNL is enabled. This is not the code path of the "normal" MPI processes, where the PML OB1 is used. In this generic case the function mca_pml_ob1_iprobe, defined in the file ompi/mca/pml/ob1/pml_ob1_iprobe.c is used.

george.


http://www.open-mpi.org/community/lists/devel/2011/09/9766.php

End of update 1


The message-passing interface (MPI) standard defines an interface for passing messages between processes.

These processes are not necessarily running on the same physical computer.


Open-MPI is an implementation of the MPI standard.

Here, I have utilised openmpi-1.4.3 to find out what is happening when a call to MPI_Iprobe occurs.


According to the MPI 2.2 standard:


"The MPI_PROBE and MPI_IPROBE operations allow incoming messages to be checked for, without actually receiving them."


The prototype of MPI_Iprobe is

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status);
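
To make the semantics concrete, here is a minimal sketch (mine, not from the Open-MPI sources) that polls with MPI_Iprobe and only posts the receive once the flag is raised. Run it with 2 MPI ranks.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int flag = 0;
        MPI_Status status;
        while (!flag) {
            /* Returns immediately; raises flag without receiving the message. */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            /* A real application would do useful work here instead of spinning. */
        }
        int value;
        MPI_Recv(&value, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d from rank %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}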


MPI_Iprobe is implemented in ompi/mpi/c/iprobe.c.


The code (some lines omitted):

ompi/mpi/c/iprobe.c

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
{
    int rc;

    MEMCHECKER(
        memchecker_comm(comm);
    );

    if ( MPI_PARAM_CHECK ) {
        rc = MPI_SUCCESS;
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
        if (((tag < 0) && (tag != MPI_ANY_TAG)) || (tag > mca_pml.pml_max_tag)) {
            rc = MPI_ERR_TAG;
        } else if (ompi_comm_invalid(comm)) {
            rc = MPI_ERR_COMM;
        } else if ((source != MPI_ANY_SOURCE) &&
                   (MPI_PROC_NULL != source) &&
                   ompi_comm_peer_invalid(comm, source)) {
            rc = MPI_ERR_RANK;
        }
        OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
    }

    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;
            /*
             * Per MPI-1, the MPI_ERROR field is not defined for single-completion calls
             */
            MEMCHECKER(
                opal_memchecker_base_mem_undefined(&status->MPI_ERROR, sizeof(int));
            );
        }
        return MPI_SUCCESS;
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = MCA_PML_CALL(iprobe(source, tag, comm, flag, status));

    /*
     * Per MPI-1, the MPI_ERROR field is not defined for single-completion calls
     */
    MEMCHECKER(
        opal_memchecker_base_mem_undefined(&status->MPI_ERROR, sizeof(int));
    );

    OMPI_ERRHANDLER_RETURN(rc, comm, rc, FUNC_NAME);
}



In this function, the following line is the one we want to follow.

rc = MCA_PML_CALL(iprobe(source, tag, comm, flag, status));


The rest is mostly parameter validation, plus the early return for MPI_PROC_NULL.


MCA_PML_CALL is probably a macro. Let's search for it.


In ompi/mca/pml/pml.h, MCA_PML_CALL is defined. Here is the code (some lines omitted):

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

#if MCA_pml_DIRECT_CALL

#include MCA_pml_DIRECT_CALL_HEADER

#define MCA_PML_CALL_EXPANDER(a, b) MCA_PML_CALL_STAMP(a,b)
#define MCA_PML_CALL(a) MCA_PML_CALL_EXPANDER(MCA_pml_DIRECT_CALL_COMPONENT, a)

#else
#define MCA_PML_CALL(a) mca_pml.pml_ ## a
#endif


When MCA_pml_DIRECT_CALL is not set (the usual build), MCA_PML_CALL(iprobe(...)) expands to mca_pml.pml_iprobe(...) through token pasting. When direct calling is enabled, the call is instead stamped directly onto the selected PML component.
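
A reduced model of the token-pasting path (the macro has the same shape as Open-MPI's; the struct is trimmed down to a single field):

#include <cstdio>

/* Trimmed-down model of the non-direct-call path in pml.h. */
struct fake_pml {
    int (*pml_iprobe)(int src, int tag);
};

static int my_iprobe(int src, int tag) { return src + tag; }

static fake_pml mca_pml = { my_iprobe };

/* Same shape as Open-MPI's macro: pastes "pml_" onto the call expression. */
#define MCA_PML_CALL(a) mca_pml.pml_ ## a

int main() {
    /* Expands to: mca_pml.pml_iprobe(1, 2) */
    printf("%d\n", MCA_PML_CALL(iprobe(1, 2)));
    return 0;
}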




In ompi/mca/pml/pml.h, mca_pml_base_module_t is defined.

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

struct mca_pml_base_module_1_0_0_t {

    /* downcalls from MCA to PML */
    mca_pml_base_module_add_procs_fn_t    pml_add_procs;
    mca_pml_base_module_del_procs_fn_t    pml_del_procs;
    mca_pml_base_module_enable_fn_t       pml_enable;
    mca_pml_base_module_progress_fn_t     pml_progress;

    /* downcalls from MPI to PML */
    mca_pml_base_module_add_comm_fn_t     pml_add_comm;
    mca_pml_base_module_del_comm_fn_t     pml_del_comm;
    mca_pml_base_module_irecv_init_fn_t   pml_irecv_init;
    mca_pml_base_module_irecv_fn_t        pml_irecv;
    mca_pml_base_module_recv_fn_t         pml_recv;
    mca_pml_base_module_isend_init_fn_t   pml_isend_init;
    mca_pml_base_module_isend_fn_t        pml_isend;
    mca_pml_base_module_send_fn_t         pml_send;
    mca_pml_base_module_iprobe_fn_t       pml_iprobe;
    mca_pml_base_module_probe_fn_t        pml_probe;
    mca_pml_base_module_start_fn_t        pml_start;

    /* diagnostics */
    mca_pml_base_module_dump_fn_t         pml_dump;

    /* FT Event */
    mca_pml_base_module_ft_event_fn_t     pml_ft_event;

    /* maximum constant sizes */
    int                                   pml_max_contextid;
    int                                   pml_max_tag;
};
typedef struct mca_pml_base_module_1_0_0_t mca_pml_base_module_1_0_0_t;
typedef mca_pml_base_module_1_0_0_t mca_pml_base_module_t;




In mca_pml_base_module_t, there is the field pml_iprobe. The type of this field is mca_pml_base_module_iprobe_fn_t.


This type is also defined in ompi/mca/pml/pml.h and is a function pointer type.

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

/**
 * Probe to poll for pending recv.
 *
 * @param src (IN)        Source rank w/in communicator.
 * @param tag (IN)        User defined tag.
 * @param comm (IN)       Communicator.
 * @param matched (OUT)   Flag indicating if matching recv exists.
 * @param status (OUT)    Completion statuses.
 * @return                OMPI_SUCCESS or failure status.
 *
 */
typedef int (*mca_pml_base_module_iprobe_fn_t)(
    int src,
    int tag,
    struct ompi_communicator_t* comm,
    int *matched,
    ompi_status_public_t *status
);




I believe that this scheme allows several implementations of pml_iprobe to coexist: each PML component installs its own functions in the module table.
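
A minimal sketch of that pluggability, with made-up component names (my illustration, not Open-MPI code):

#include <cstdio>

/* Trimmed-down module table: one function pointer per operation. */
struct pml_module {
    int (*pml_iprobe)(int src, int tag, int* matched);
};

/* Two hypothetical components exposing the same interface. */
static int ob1_iprobe(int src, int tag, int* matched)   { *matched = 1; return 0; }
static int crcpw_iprobe(int src, int tag, int* matched) { *matched = 0; return 0; }

int main() {
    /* At startup, the MCA framework selects one component and installs it. */
    pml_module mca_pml = { ob1_iprobe };  /* or { crcpw_iprobe } */

    int matched = 0;
    mca_pml.pml_iprobe(0, 7, &matched);   /* callers never name the component */
    printf("matched=%d\n", matched);
    return 0;
}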



Let us look for these now.


In ompi/mca/pml/crcpw/pml_crcpw_module.c, there is this code (some lines omitted):

ompi/mca/pml/crcpw/pml_crcpw_module.c

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

int mca_pml_crcpw_iprobe(int dst, int tag, struct ompi_communicator_t* comm, int *matched, ompi_status_public_t* status )
{
    int ret;
    ompi_crcp_base_pml_state_t * pml_state;

    PML_CRCP_STATE_ALLOC(pml_state, ret);

    pml_state->wrapped_pml_component = &(mca_pml_crcpw_module.wrapped_pml_component);
    pml_state->wrapped_pml_module    = &(mca_pml_crcpw_module.wrapped_pml_module);

    pml_state->state = OMPI_CRCP_PML_PRE;
    pml_state = ompi_crcp.pml_iprobe(dst, tag, comm, matched, status, pml_state);
    if( OMPI_SUCCESS != pml_state->error_code) {
        ret =  pml_state->error_code;
        PML_CRCP_STATE_RETURN(pml_state);
        return ret;
    }

    if( OMPI_CRCP_PML_DONE == pml_state->state) {
        goto CLEANUP;
    }

    if( OMPI_CRCP_PML_SKIP != pml_state->state) {
        if( OMPI_SUCCESS != (ret = mca_pml_crcpw_module.wrapped_pml_module.pml_iprobe(dst, tag, comm, matched, status) ) ) {
            PML_CRCP_STATE_RETURN(pml_state);
            return ret;
        }
    }

    pml_state->state = OMPI_CRCP_PML_POST;
    pml_state = ompi_crcp.pml_iprobe(dst, tag, comm, matched, status, pml_state);
    if( OMPI_SUCCESS != pml_state->error_code) {
        ret =  pml_state->error_code;
        PML_CRCP_STATE_RETURN(pml_state);
        return ret;
    }

 CLEANUP:
    PML_CRCP_STATE_RETURN(pml_state);

    return OMPI_SUCCESS;
}



This code, however, seems to rely heavily on ompi_crcp.pml_iprobe.



In ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (some lines omitted):

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */
/**************** Probe *****************/
/* JJH - Code reuse: Combine iprobe and probe logic */
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_iprobe(
                                  int dst, int tag,
                                  struct ompi_communicator_t* comm,
                                  int *matched,
                                  ompi_status_public_t* status,
                                  ompi_crcp_base_pml_state_t* pml_state )
{
    ompi_crcp_bkmrk_pml_drain_message_ref_t   *drain_msg_ref = NULL;
    ompi_crcp_bkmrk_pml_message_content_ref_t *content_ref = NULL;
    int exit_status = OMPI_SUCCESS;
    int ret;

    OPAL_OUTPUT_VERBOSE((30, mca_crcp_bkmrk_component.super.output_handle,
                        "crcp:bkmrk: pml_iprobe(%d, %d)", dst, tag));

    /*
     * Before PML Call
     * - Determine if this can be satisfied from the drained list
     * - Otherwise let the PML handle it
     */
    if( OMPI_CRCP_PML_PRE == pml_state->state) {
        /*
         * Check to see if this message is in the drained message list
         */
        if( OMPI_SUCCESS != (ret = drain_message_find_any(PROBE_ANY_COUNT, tag, dst,
                                                          comm, PROBE_ANY_SIZE,
                                                          &drain_msg_ref,
                                                          &content_ref,
                                                          NULL) ) ) {
            ERROR_SHOULD_NEVER_HAPPEN("crcp:bkmrk: pml_iprobe(): Failed trying to find a drained message.");
            exit_status = ret;
            goto DONE;
        }

        /*
         * If the message is a drained message
         *  - Copy of the status structure to pass back to the user
         *  - Mark the 'matched' flag as true
         */
        if( NULL != drain_msg_ref ) {
            OPAL_OUTPUT_VERBOSE((12, mca_crcp_bkmrk_component.super.output_handle,
                                 "crcp:bkmrk: pml_iprobe(): Matched a drained message..."));

            /* Copy the status information */
            if( MPI_STATUS_IGNORE != status ) {
                memcpy(status, &content_ref->status, sizeof(ompi_status_public_t));
            }

            /* Mark as complete */
            *matched = 1;

            /* This will identify to the wrapper that this message is complete */
            pml_state->state = OMPI_CRCP_PML_DONE;
            pml_state->error_code = OMPI_SUCCESS;
            return pml_state;
        }
        /*
         * Otherwise the message is not drained (common case), so let the PML deal with it
         */
        else {
            /* Mark as not complete */
            *matched = 0;
        }
    }

 DONE:
    pml_state->error_code = exit_status;
    return pml_state;
}




As I understand it, ompi_crcp_bkmrk_pml_iprobe first checks, with a call to drain_message_find_any, whether any message in the drained list satisfies what the calling code wants.

If not, the code "lets the PML deal with it", which probably means that this case is handled by the wrapped PML module back in mca_pml_crcpw_iprobe, although I am not sure.


drain_message_find_any is defined in ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c:

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

static int drain_message_find_any(size_t count, int tag, int peer,
                                  struct ompi_communicator_t* comm, size_t ddt_size,
                                  ompi_crcp_bkmrk_pml_drain_message_ref_t ** found_msg_ref,
                                  ompi_crcp_bkmrk_pml_message_content_ref_t ** content_ref,
                                  ompi_crcp_bkmrk_pml_peer_ref_t **peer_ref)
{
    ompi_crcp_bkmrk_pml_peer_ref_t *cur_peer_ref = NULL;
    opal_list_item_t* item = NULL;

    *found_msg_ref = NULL;

    for(item  = opal_list_get_first(&ompi_crcp_bkmrk_pml_peer_refs);
        item != opal_list_get_end(&ompi_crcp_bkmrk_pml_peer_refs);
        item  = opal_list_get_next(item) ) {
        cur_peer_ref = (ompi_crcp_bkmrk_pml_peer_ref_t*)item;

        /*
         * If we ware not MPI_ANY_SOURCE, then extract the process name from the
         * communicator, and search only the peer that matches.
         */
        if( MPI_ANY_SOURCE != peer && peer >= 0) {
            /* Check to see if peer could possibly be in this communicator */
            if( comm->c_local_group->grp_proc_count <= peer ) {
                continue;
            }

            if( OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
                                                            &(cur_peer_ref->proc_name),
                                                            &(comm->c_local_group->grp_proc_pointers[peer]->proc_name)) ) {
                continue;
            }
        }

        drain_message_find(&(cur_peer_ref->drained_list),
                           count, tag, peer,
                           comm->c_contextid, ddt_size,
                           found_msg_ref,
                           content_ref);
        if( NULL != *found_msg_ref) {
            if( NULL != peer_ref ) {
                *peer_ref = cur_peer_ref;
            }
            return OMPI_SUCCESS;
        }
    }

    return OMPI_SUCCESS;
}



If peer is MPI_ANY_SOURCE, then drain_message_find_any will fetch a message from any source.
Otherwise, it will look only for messages from the desired peer.


ompi_crcp_bkmrk_pml_peer_refs is populated in ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (some lines omitted):

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

/**************** Processes *****************/
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_add_procs(
                                   struct ompi_proc_t **procs,
                                   size_t nprocs,
                                   ompi_crcp_base_pml_state_t* pml_state )
{
    int ret;
    ompi_crcp_bkmrk_pml_peer_ref_t *new_peer_ref;
    size_t i;

    if( OMPI_CRCP_PML_PRE != pml_state->state ){
        goto DONE;
    }

    OPAL_OUTPUT_VERBOSE((30, mca_crcp_bkmrk_component.super.output_handle,
                        "crcp:bkmrk: pml_add_procs()"));

    /*
     * Save pointers to the wrapped PML
     */
    wrapped_pml_component = pml_state->wrapped_pml_component;
    wrapped_pml_module    = pml_state->wrapped_pml_module;

    /*
     * Create a peer_ref for each peer added
     */
    for( i = 0; i < nprocs; ++i) {
        HOKE_PEER_REF_ALLOC(new_peer_ref, ret);

        new_peer_ref->proc_name.jobid  = procs[i]->proc_name.jobid;
        new_peer_ref->proc_name.vpid   = procs[i]->proc_name.vpid;

        opal_list_append(&ompi_crcp_bkmrk_pml_peer_refs, &(new_peer_ref->super));
    }

 DONE:
    pml_state->error_code = OMPI_SUCCESS;
    return pml_state;
}


Presumably, nprocs is the number of MPI ranks in MPI_COMM_WORLD, because ompi_crcp_bkmrk_pml_add_procs has no struct ompi_communicator_t* comm argument.



Let's say that the peer is 255 and that there are 256 MPI ranks in MPI_COMM_WORLD. In this case, drain_message_find_any will loop over all the peers before calling drain_message_find for peer 255!

That is slow, but it works.

With MPI_ANY_SOURCE, I suspect message starvation can occur because there is no round-robin for the starting point of the loop -- the code always starts with the first peer.

Accordingly, if all peers send messages at the same rate, the messages of peers with lower ranks will be received first, which starves the peers with higher ranks.
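
A sketch of one possible mitigation -- rotating the starting index of the scan so that no peer is always examined first. This is my own illustration, not Open-MPI code; the Peer struct stands in for ompi_crcp_bkmrk_pml_peer_ref_t:

#include <cstdio>
#include <vector>

struct Peer {
    int rank;
    bool hasDrainedMessage;
};

/* Scan the peers starting at a rotating offset instead of always at 0,
   so that under MPI_ANY_SOURCE every peer eventually gets served first. */
int findAnyRoundRobin(std::vector<Peer>& peers) {
    static size_t start = 0;                  /* rotates across calls */
    size_t n = peers.size();
    for (size_t step = 0; step < n; ++step) {
        size_t i = (start + step) % n;
        if (peers[i].hasDrainedMessage) {
            start = (i + 1) % n;              /* next call starts after the match */
            return peers[i].rank;
        }
    }
    return -1;                                 /* nothing found */
}

int main() {
    std::vector<Peer> peers;
    for (int rank = 0; rank < 3; ++rank) {
        Peer peer = { rank, true };
        peers.push_back(peer);
    }
    printf("%d\n", findAnyRoundRobin(peers)); /* 0 */
    printf("%d\n", findAnyRoundRobin(peers)); /* 1 */
    printf("%d\n", findAnyRoundRobin(peers)); /* 2 */
    return 0;
}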

2011-08-25

The VirtualProcessor technology in Ray



The VirtualProcessor I developed enables any MPI rank to have thousands of workers working on different tasks within a single process. Obviously, only one worker can run at any given moment, but the VirtualProcessor schedules the workers fairly on the single available instruction pipeline, so that is not a problem at all.

The idea is that when a worker pushes a message on the VirtualCommunicator, it has to wait for a reply, and this reply may arrive later. The VirtualProcessor makes it easy to submit such communication-intensive tasks: while one worker waits for its reply, the others keep working.

Basically, the VirtualCommunicator groups smaller messages into larger messages to send fewer messages on the physical network.

But to achieve that, an easy way of generating a lot of small messages is needed. That is what the VirtualProcessor provides.


== Implementation ==

Author: Sébastien Boisvert
License: GPL
code/scheduling/VirtualProcessor.h
code/scheduling/VirtualProcessor.cpp

http://github.com/sebhtml/ray

== Workers ==

In Ray, each MPI rank actually has thousands of workers within it. All these workers are scheduled, one after the other, inside the same process. No threads are used, as threads would trash the processor cache.

At any point in time, each of these workers is in one of these states:

- active;
- sleeping;
- completed.



active

    The worker is not waiting for a message reply.

sleeping

    The worker is waiting for a message reply.

completed

    The worker has completed its task.
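
A minimal sketch of how such a cooperative scheduler might look (the names are mine, not Ray's actual classes; a worker would additionally enter the sleeping state while waiting for a message reply):

#include <cstdio>
#include <vector>

enum WorkerState { ACTIVE, SLEEPING, COMPLETED };

class Worker {
    WorkerState m_state;
    int m_remaining;                  /* toy task: needs a few slices of work */
public:
    Worker() : m_state(ACTIVE), m_remaining(3) {}
    void work() {
        if (m_state != ACTIVE)
            return;                   /* sleeping or completed: skip */
        if (--m_remaining == 0)
            m_state = COMPLETED;
    }
    bool isCompleted() const { return m_state == COMPLETED; }
};

int main() {
    std::vector<Worker> workers(4);
    size_t completed = 0;
    /* Round-robin over the workers; no threads, one instruction pipeline. */
    while (completed < workers.size()) {
        completed = 0;
        for (size_t i = 0; i < workers.size(); ++i) {
            workers[i].work();
            if (workers[i].isCompleted())
                ++completed;
        }
    }
    printf("all %lu workers completed\n", (unsigned long)workers.size());
    return 0;
}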

2011-07-22

Understand the main loop of message-passing-interface software

In video games, the main loop usually looks like this:


while(running){
    processInput();
    updateGameState();
    drawScreen();
}


(from higherorderfun.com)

Message-passing-interface (MPI) software can be designed in a similar fashion.

Each rank of an MPI program has its own message inbox and its own message outbox.

As with email, received messages go in the inbox and sent messages go in the outbox.

The main loop of an MPI rank usually looks like this:

while(running){
    receiveMessages();
    processMessages();
    processData();
    sendMessages();
}

This general architecture is utilised in the Ray de novo genome assembler.
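
On the receiving side, a sketch of what receiveMessages() can look like with the MPI C API -- a non-blocking probe followed by a matching receive. The Message struct and buffer handling here are illustrative, not Ray's exact code:

#include <mpi.h>
#include <stdint.h>
#include <vector>

struct Message {
    int source;
    int tag;
    std::vector<uint8_t> bytes;
};

/* Drain every message that is already available, without blocking. */
void receiveMessages(std::vector<Message>& inbox) {
    int flag = 1;
    while (flag) {
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (!flag)
            break;
        int count = 0;
        MPI_Get_count(&status, MPI_BYTE, &count);
        Message message;
        message.source = status.MPI_SOURCE;
        message.tag = status.MPI_TAG;
        message.bytes.resize(count);
        void* buffer = count > 0 ? (void*)&message.bytes[0] : (void*)0;
        MPI_Recv(buffer, count, MPI_BYTE, message.source, message.tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        inbox.push_back(message);
    }
}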

Read the code: github.com/sebhtml/ray

2011-07-19

I don't understand why Ray was not included in the paper Bioinformatics 27, 2031–2037

In the paper

Lin, Y. et al. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27, 2031–2037 (2011).


Ray is not mentioned.

Why is that so?
We have been working on Ray for quite a while; an early prototype of the assembly engine, called OpenAssembler, was started on 2009-01-21.

We published Ray in 2010 in Journal of Computational Biology.


We have also presented Ray at a few places.
 


Cool facts about Ray:

- We are participating in Assemblathon 2.
- We assembled a genome on 512 compute cores in 18 hours; there were > 3 000 000 000 Illumina TruSeq paired reads (that is a lot of reads!)
- Ray is a high-performance peer-to-peer assembler
- Can assemble mixtures of technologies
- Is open source, licensed with the GNU GPL
- Ray is free.
- Works on POSIX systems (Linux, Mac, and others) and on Microsoft Windows
- Compiles cleanly with gcc and with Microsoft Visual Studio

- Ray utilises the message-passing interface

- Works well with Open-MPI or MPICH2 -- the 2 main open source implementations of the MPI standard.

- Ray makes very few assembly errors.

- Ray is a single executable called Ray
- Implemented in C++
- Ray is object-oriented
- Ray is modular
- Ray is scalable

- Ray is easy to use.

- All the code is on github

I think the paper should have compared Ray with the other assemblers...



                                                     Sébastien

2011-06-13

More on virtual communication with the message-passing interface.

The message-passing interface (MPI) is a standard that allows numerous computers to communicate in order to achieve large-scale peer-to-peer communication in a high-performance manner.

OK, let's face it: writing computer programs with MPI can be hard and tedious. However, one can devise a set of techniques that collectively enhance one's software craftsmanship.

I write C++ (1998) using MPI -- you already know that if you read my blog. In Ray, I implemented a few tricks to make the message-passing thing a lot easier.

Slave modes and master modes

First, each MPI rank (a rank is usually mapped to a process, if that matters) has a slave mode and a master mode. A slave mode can be called SLAVE_MODE_COUNT_ENTRIES, which would obviously count entries. Master modes follow the same principle. If an MPI rank is in a state such that it must wait for others, then there is this very special slave mode called SLAVE_MODE_DO_NOTHING.




Each of these modes has an associated callback method. SLAVE_MODE_DO_NOTHING's callback method is called

void Machine::call_SLAVE_MODE_DO_NOTHING();



Its implementation is quite simple, actually.

void Machine::call_SLAVE_MODE_DO_NOTHING(){
    /* do nothing */
}
OK, that was funny.

Given a slave mode, I want to call its associated callback method. To do so, I use an array of method pointers. Simple enough.
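
A minimal sketch of that dispatch (the member names are illustrative, not necessarily Ray's exact ones):

#include <cstdio>

class Machine {
    typedef void (Machine::*SlaveMethod)();

    enum { SLAVE_MODE_COUNT_ENTRIES, SLAVE_MODE_DO_NOTHING, NUMBER_OF_SLAVE_MODES };

    SlaveMethod m_slave_methods[NUMBER_OF_SLAVE_MODES];
    int m_slave_mode;

public:
    Machine() : m_slave_mode(SLAVE_MODE_COUNT_ENTRIES) {
        m_slave_methods[SLAVE_MODE_COUNT_ENTRIES] = &Machine::call_SLAVE_MODE_COUNT_ENTRIES;
        m_slave_methods[SLAVE_MODE_DO_NOTHING]    = &Machine::call_SLAVE_MODE_DO_NOTHING;
    }
    void call_SLAVE_MODE_COUNT_ENTRIES() { printf("counting entries\n"); }
    void call_SLAVE_MODE_DO_NOTHING()    { /* do nothing */ }

    /* One indirect call per iteration of the main loop. */
    void runSlaveMethod() { (this->*m_slave_methods[m_slave_mode])(); }
};

int main() {
    Machine machine;
    machine.runSlaveMethod();  /* dispatches to call_SLAVE_MODE_COUNT_ENTRIES */
    return 0;
}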

Keep it simple, stupid

But wait, for each slave mode (the same is true for master modes), I have to

1. Add it in an enumeration so it can be used later;
2. Define its string (read char*) representation for debugging purposes;
3. Define its callback method prototype (in .h);
4. Define its callback method definition (in .cpp);
5. Add its callback method pointer in the array of method pointers.


Accordingly, to add a new slave mode, I would need to add things in at least 5 places. To avoid that, I use macros.

An example follows.

core/slave_mode_macros.h

MACRO_LIST_ITEM( RAY_SLAVE_MODE_LOAD_SEQUENCES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_START_SEEDING )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_DO_NOTHING )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_SEND_EXTENSION_DATA )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_ASSEMBLE_WAVES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_FUSION )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_FINISH_FUSIONS )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_AMOS )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_AUTOMATIC_DISTANCE_DETECTION )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_SEND_LIBRARY_DISTANCES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_EXTRACT_VERTICES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_SEND_DISTRIBUTION )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_EXTENSION )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_INDEX_SEQUENCES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_REDUCE_MEMORY_CONSUMPTION )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_DELETE_VERTICES )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_SEND_SEED_LENGTHS )
MACRO_LIST_ITEM( RAY_SLAVE_MODE_SCAFFOLDER )


1. Add it in a enumeration so it can be used later

core/slave_modes.h

#define MACRO_LIST_ITEM(element) element,

enum{
#include "core/slave_mode_macros.h"
SLAVE_MODES_H_DUMMY
};

#undef MACRO_LIST_ITEM


2. Define its string (read char*) representation for debugging purposes.

core/slave_modes.cpp

#define MACRO_LIST_ITEM(x) #x,

const char* SLAVE_MODES[]={
#include "core/slave_mode_macros.h"
};

#undef MACRO_LIST_ITEM


3. Define its callback method prototype (in .h)

core/Machine.h

#define MACRO_LIST_ITEM(x) void call_ ## x();
#include "core/slave_mode_macros.h"
#undef MACRO_LIST_ITEM


5. Add its callback method pointer in the array of method pointers

core/Machine.cpp

#define MACRO_LIST_ITEM(x) m_slave_methods[x]=&Machine::call_ ## x ;
#include "core/slave_mode_macros.h"
#undef MACRO_LIST_ITEM

Note that step 4 -- the definition of a callback method -- is not done with these macros because it cannot be automated.

The same principle is applied to master modes and to MPI message types. A callback method for an MPI message type receives a message and processes it.


Wary workers

Let us say that I want to compute the following function.

int sum(){
     int sum=0;
     int i=1000;
     while(i--){
          sum+=i;
     }
     return sum;
}


Now, let's do it the worker's way.

The methods of a worker usually are:


void Worker::work();
bool Worker::isDone();
Result Worker::getWorkResult();

In my case:

class Worker{
    int m_i;
    int m_sum;
    bool m_done;
public:
    void constructor(){
        m_i=1000;
        m_sum=0;
        m_done=false;
    }
    void work(){
        if(m_done){
            return;
        }
        m_i--;
        m_sum+=m_i;
        if(m_i==0){
            m_done=true;
        }
    }
    bool isDone(){
        return m_done;
    }
    int getWorkResult(){
        return m_sum;
    }
};



I can now use this worker to compute function sum:

int sumWithWorker(){
    Worker worker;
    worker.constructor();
    while(!worker.isDone()){
        worker.work();
    }
    return worker.getWorkResult();
}

This allows the computation to be sliced into parts. This is handy for programming an MPI application because it naturally produces fine granularity. By the way, fine granularity is necessary to obtain low-latency message passing. Otherwise, an MPI rank may have to wait several minutes before receiving its response -- which is perhaps not very acceptable.


Proper message passing and virtual communication


Now, the most time-consuming part of programming with MPI: message transit. On one end, we send a message to a destination rank; on the other end, we receive the message. Doing this message-passing recipe by hand can be tedious, and obviously you don't want to spin-wait for a message's response. Instead, you want to keep receiving messages from other MPI ranks in the meantime.

In a previous post, I described the virtual communicator. Here, I use the concept again.

Given a piece of MPI code, the 3 basic necessary operations from a worker's point of view are

- void VirtualCommunicator::pushMessage(uint64_t workerId, Message* message);
- bool VirtualCommunicator::isMessageProcessed(uint64_t workerId);
- vector VirtualCommunicator::getMessageResponseElements(uint64_t workerId);


Typical workflow:


The worker identified by workerId pushes a message to the virtual communicator.

The worker identified by workerId asks the virtual communicator whether its message has been processed.

If so, the worker identified by workerId asks the virtual communicator for its message response.
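
A sketch of this workflow from inside a worker's work() method. The class skeletons below follow the three operations above, but the bodies are stubs and the element type of the response vector is my assumption:

#include <stdint.h>
#include <cstdio>
#include <vector>

struct Message { /* destination, tag, payload, ... */ };

class VirtualCommunicator {
    bool m_processed;
public:
    VirtualCommunicator() : m_processed(false) {}
    void pushMessage(uint64_t workerId, Message* message) { m_processed = true; } /* stub */
    bool isMessageProcessed(uint64_t workerId) { return m_processed; }
    std::vector<uint64_t> getMessageResponseElements(uint64_t workerId) {
        m_processed = false;
        return std::vector<uint64_t>(1, 42);  /* stub reply */
    }
};

class Worker {
    uint64_t m_workerId;
    VirtualCommunicator* m_virtualCommunicator;
    bool m_messagePushed;
    bool m_done;
    Message m_message;
public:
    Worker(uint64_t id, VirtualCommunicator* vc)
        : m_workerId(id), m_virtualCommunicator(vc),
          m_messagePushed(false), m_done(false) {}
    void work() {
        if (m_done) {
            return;
        } else if (!m_messagePushed) {
            /* 1. Push the request; the worker now sleeps until the reply. */
            m_virtualCommunicator->pushMessage(m_workerId, &m_message);
            m_messagePushed = true;
        } else if (m_virtualCommunicator->isMessageProcessed(m_workerId)) {
            /* 2. The reply arrived; fetch it and continue the task. */
            std::vector<uint64_t> response =
                m_virtualCommunicator->getMessageResponseElements(m_workerId);
            printf("worker got %lu element(s)\n", (unsigned long)response.size());
            m_done = true;
        }
        /* Otherwise: still waiting; return so that other workers can run. */
    }
    bool isDone() const { return m_done; }
};

int main() {
    VirtualCommunicator vc;
    Worker worker(0, &vc);
    while (!worker.isDone())
        worker.work();
    return 0;
}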

Now, we don't want to send numerous skinny messages (one for each call to pushMessage). Instead, automatic message aggregation is highly desired.
In Ray, the VirtualCommunicator does just that right away, for you.

To achieve automatic message aggregation, a pool of workers is needed. Each of these workers is independent, but their messages will be grouped without them knowing it.

This method alone yields a significant performance gain.
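
A sketch of the aggregation idea itself, greatly simplified -- Ray's actual implementation also has to deal with tags, message types, and flush policies, and the capacity below is a toy value:

#include <stdint.h>
#include <cstdio>
#include <map>
#include <vector>

/* Group small payloads per destination; emit one big message when full. */
class Aggregator {
    enum { CAPACITY = 4 };   /* elements per physical message (toy value) */
    std::map<int, std::vector<uint64_t> > m_buffers;
    size_t m_physicalMessages;
public:
    Aggregator() : m_physicalMessages(0) {}
    void push(int destination, uint64_t element) {
        std::vector<uint64_t>& buffer = m_buffers[destination];
        buffer.push_back(element);
        if (buffer.size() == CAPACITY)
            flush(destination);
    }
    void flush(int destination) {
        std::vector<uint64_t>& buffer = m_buffers[destination];
        if (buffer.empty())
            return;
        /* A real implementation would MPI_Isend the whole buffer here. */
        ++m_physicalMessages;
        buffer.clear();
    }
    size_t physicalMessages() const { return m_physicalMessages; }
};

int main() {
    Aggregator aggregator;
    for (int i = 0; i < 16; ++i)
        aggregator.push(0, i);
    printf("16 pushed messages, %lu physical messages\n",
           (unsigned long)aggregator.physicalMessages());  /* 4 */
    return 0;
}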

What follows is an example of the gain.

Rank 14 VirtualCommunicator Statistics
Rank 14: 60466 virtual messages for 20854019 pushed messages
Rank 14: percentage= 0.289949%

Out of the 20854019 pushed messages, only 60466 virtual messages were sent.


Berserk producers overflow ring buffers

Note that a worker never goes berserk -- it never sends numerous messages without first getting the response for its previously pushed message. Thus, no producer-consumer overflow problem can occur.


gg

sebhtml

2011-03-21

de novo assembly of Illumina CEO genome in 11.5 h

THE initial aims of our group regarding de novo assembly of genomes were:

1. To assemble genomes using mixes of sequencing technologies simultaneously.
2. To assemble large repeat-rich genomes.
3. To devise novel approaches to deal with repeats.

1. To assemble genomes using mixes of sequencing technologies simultaneously.

We showed the feasibility of using mixes of sequencing technologies simultaneously using Ray -- a de novo genome assembler -- see Journal of Computational Biology 17(11): 1519-1533. Ray follows the Single Program, Multiple Data approach. It is implemented using the Message-Passing Interface.

With this method, a computation is separated into data-dependent parts. Each part is given to a processor and any processor communicates with others to access remote information.
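
A toy illustration of this Single Program, Multiple Data approach (not Ray code): every rank runs the same program, computes its own data-dependent part, and communicates to combine the results. Run it with any number of ranks.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same program everywhere; the rank selects the part of the data. */
    long localSum = 0;
    for (long i = rank; i < 1000; i += size)
        localSum += i;

    long globalSum = 0;
    MPI_Allreduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld\n", globalSum);  /* 499500 on any number of ranks */

    MPI_Finalize();
    return 0;
}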

2. To assemble large repeat-rich genomes.

It was thought that message transit in Ray would interfere strongly with the feasibility of genome assembly of large repeat-rich genomes.

On the 12th of December 2010, I had written an entry on the resource consumption of Ray. In that entry, only the sheer memory usage was reported -- no computation time was available, as it was still unclear whether it would be feasible to use the Message-Passing Interface end-to-end. The k-mer length used was 19. This low k-mer value did not allow Ray to produce significant assemblies.

Since then, I have worked on the in-memory data storage systems, refined the manual message aggregation systems, and further investigated ways of automatically grouping communications. One approach that I developed that effectively accelerates the computation was the design, implementation and integration of a virtual communicator on top of MPI's default communicator -- MPI_COMM_WORLD. For an overview of the Virtual Communicator, read Arnie's story.

On the 1st of January 2011, our group was awarded a special allocation resource project from Compute Canada. However, we gained access to this resource only one month later: on the 4th of February 2011.

I wandered on the Internet and fetched the genome data of the CEO of Illumina (SRA010766). These are large numbers: 6 372 129 288 reads, each paired and 75 nucleotides long. These sum to 477 909 696 600 nucleotides.

Here, I report a de novo assembly of the Illumina CEO's genome in 11.5 hours using a supercomputer (on Compute Canada's colosse; job identifier: 2814556; code name: Nitro). The computation was done with Ray 1.3.0 and the k-mer length was 21 (the default value in Ray).

Ray detected pairs with an average outer distance equal to 190 and a standard deviation of 30. The peak coverage was 22 and the minimum coverage was 6. There were 1 803 534 contiguous sequences with at least 100 nucleotides, with a total of 1 772 120 417 nucleotides. The N50 was 1341 and the average length was 982. Finally, the length of the longest contiguous sequence was 14584.

Table 1: Running time of Ray on genome data of the CEO of Illumina.

Algorithm step                          Elapsed time
Distribution of sequence reads          42 minutes, 1 second
Distribution of vertices & edges        18 minutes, 36 seconds
Calculation of coverage distribution    19 seconds
Indexing of sequence reads              24 minutes, 42 seconds
Computation of seeds                    16 minutes, 14 seconds
Computation of library sizes            2 minutes, 58 seconds
Extension of seeds                      7 hours, 10 minutes, 38 seconds
Computation of fusions                  2 hours, 26 minutes, 26 seconds
Collection of fusions                   1 minute, 46 seconds
Total                                   11 hours, 23 minutes, 40 seconds

64 computers were utilised. Each had 24 GiB of physical memory and 2 Intel Nehalem-EP processors. Each processor had 4 compute cores. There were a total of 512 compute cores and 1536 GiB of distributed physical memory. Computers were interconnected with InfiniBand QDR (40 gigabits per second).

For people who don't have access to a supercomputer nearby, compute time can be rented from cloud computing providers. An example is Amazon Elastic Compute Cloud (EC2). Ray is cloud-ready, so to speak. The Amazon EC2 Cluster Compute instances provide low latency and full-bisection bandwidth of 10 gigabits per second.

3. To devise novel approaches to deal with repeats.

In addition, I developed novel ways of enabling repeat traversal with paired sequences. I will present this work during the First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing. The 15-minute talk, titled Constrained traversal of repeats with paired sequences, will take place on 27 March 2011 from 12:10 to 12:30.


Perspective

* To try a larger k-mer length.
* To assemble the Yoruban African Genome (SRA000271).

Ray is freely (librement) available on Sourceforge.

The Ray paper is freely accessible:

Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.
Sébastien Boisvert, François Laviolette, and Jacques Corbeil.
Journal of Computational Biology (Mary Ann Liebert, Inc. publishers).
November 2010, 17(11): 1519-1533.
doi:10.1089/cmb.2009.0238

Edited on March 21st, 2011 to correct a typographical error.

2011-01-23

The Virtual Communicator

IT WAS a wintry day of January, in a coldly-tempered land. On this island lived peculiar citizens whose main everyday whereabouts involved producing goods and getting back some credits for the hard work. Arnie was one of them. He was cultivating vegetables -- in a greenhouse in the cold season, obviously. He would prepare and send a shipment whenever his plums were ready. But most often his daily production, which departed his ranch daily, did not fill all of the available space provided by the Raw Communicator. The Raw Communicator is the entity whose personhood solely involves delivering the production of a worker's goods to the markets, and bringing back the gained assets to the farmer. And because he was so raw, the Raw Communicator would not even care to visit other folks after visiting Arnie. He would directly transit to the markets. Afterwards, he would visit Arnie's first neighbour, return to the market, and so on.

Arnie, although just a potent grower of greens, felt that the Raw Communicator should be less raw and more genuinely efficient. Arnie had brought up the matter to the Raw Communicator, but to no avail. The latter had stated that his raw nature prohibited him from doing anything other than what he was here to do: delivering goods to the markets and bringing back good fortune to the goods' erstwhile owner.

Just like him, there were many of his kind who wished communications were faster -- they were insulted by the delays brought by the way of message transportation. And they would not secure a way of grouping their parcels -- before sending them -- by speaking to each other; that would be utterly called communism in this sacred freedom-driven world. Although they knew that the transportation mechanism could hold a particular amount of goods -- an amount upper-bounded by some arcane physical laws -- they held onto their traditions of not gathering their deeds together and instead continued to ship them individually, one by one, in these harsh wintering times. What was going on, in fact, was simply an unwillingness to modify their habits. What they needed was a novel mechanism under which they could continue their whereabouts without changing their customs. Should this almighty desired interface be available, the earthlings would only want to send their package to a destination, and to receive a timely response from that very said destination. Less is more. No more, no less.

Then arrived an opportunistic celebrity. His name was Virtual Communicator. And he said on his arrival:

“Virtual Communicator, that is my name. Oh! Honourable citizens of this spectacular communicative gateway, not only could your population save energy by grouping your merchandise, but your collective wisdom would also save precious time should you select me as your communicator. Yes! I say drop your vanilla communicator; the Raw Communicator is outdated, rather slow and quite expensive!”

This newly-arrived character then went on:

“My ways are the same for you, but I can group bits and packets to make you save some gas and minerals. All I ask is a mere percentage of your planned financial gains, as predicted by the scheduled grouping of your workers' production.”

Arnie and his fellow neighbours listened carefully -- they truly thought that the Virtual Communicator sounded like a huge improvement over the Raw Communicator. Being a spectator of these reactions, the Virtual Communicator resumed talking:

“In the end, you can send and receive letters faster; the process saves your energy and your time. You will also gain considerable financial advantages. Should you assign me to this gracious task, I would also be granted an occupation with associated remuneration.”

But the folks from the cold city wanted to know the motivation of this recently introduced entity. The Virtual Communicator then explained his motives.

“Yes, I the great Virtual Communicator am a hub -- a gateway. I make profits from work others perform. These workers -- who produce the goods and needs -- are however wordless when it comes to leveraging efficiency and distribution. Needless to say, I the great Virtual Communicator can't do work like worker Arnie does. And neither can Arnie do my job. Pals, it is clearly a win-win situation here.”

“What about the Raw Communicator?” asked Arnie. “He too deserves occupational remuneration,” he added. The Virtual Communicator then explained that he was not able to actually send and receive merchandise; all he could do was group and ungroup things. The rest was part of the personhood of the Raw Communicator. To put it simply, he explained that the Raw Communicator actually receives credit on a per-message basis. Although he would notice a reduction in messaging throughput, the state-sponsored part of his salary would remain a frozen asset for him nonetheless.

“What do you say?” finally asked the Virtual Communicator.


I wrote this story while working on virtual (read abstract) message-grouping in a message-passing-interface environment.