2016-05-13

The Bioinformatics Adventure continues.


According to the social network LinkedIn, I have been working on optical mapping problems at Gydle Inc. (with Philippe and Marc-Alexandre) for 8 months so far. I previously worked at Argonne National Laboratory (with Fangfang, Rick, and other people).

The Bioinformatics Adventure continues. Yes, I am still doing bioinformatics. No, I no longer work on assemblers.

I have not worked on any genome assembler for a while. As of now, I am more of an aligner person than an assembler person (see "Sense from sequence reads: methods for alignment and assembly").

At Gydle Inc, we are 3 employees (the CEO, a computer scientist, and myself). Our CEO worked at Illumina at some point. Working in a small company is very fun. We get to do a lot of different things:

  • algorithm design,
  • software design,
  • software development,
  • software quality control,
  • test development,
  • ticket management,
  • data analysis,
  • data curation.

I schedule my time around activities related to the development of optical mapping software -- mainly an aligner and tools to parse / analyze the resulting alignments. We use g++ and clang++, CMake, and Qt. Right now, we are using the option -std=c++11 (short for C++ 2011).


Here is a list of Gydle optical tools.

  • gydle-optical-aligner
  • gydle-optical-alignment-aligner
  • gydle-optical-alignment-viewer
  • gydle-optical-asset-viewer
  • gydle-optical-checker
  • gydle-optical-extender
  • gydle-optical-filter
  • gydle-optical-joiner
  • gydle-optical-recaller
  • gydle-optical-splitter
  • gydle-optical-tester
  • gydle-optical-tiler
  • gydle-optical-validator
  • gydle-optical-xmap-comparator

Our main tool is gydle-optical-aligner. It is multi-threaded, and aligns sequences to references. The sequences are typically optical molecules and the references are optical maps.

Here are some examples of optical alignments (with the format of BioNano Genomics, and the format of Gydle).

SequenceName: Wheat_7AS_mol_FC1_00000170

XMAP alignment:

271 Wheat_7AS_mol_FC1_00000170 Wheat_7AS_map_00415 7715.8 143567.6 677658.2 812605.7 + 13.60 11M1D3M 156833.3 898716.1 1 (73,2)(74,3)(75,4)(76,5)(77,6)(78,7)(79,8)(80,9)(81,10)(82,11)(83,12)(85,13)(86,14)(87,15)

ALI alignments (1)

### GammaScore: 0.53
Wheat_7AS_map_00415 898256 96 Wheat_7AS_mol_FC1_00000170 156833 16 T 1 + 11M1D1G2g2M1e 0 134960 13266 0.66 11.36 16.36 73 677230 87 812189 2 7715 16 143567
 

==> The reference is the same.
===================================================
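
To make the XMAP line above a bit easier to read, here is a minimal C++11 sketch that parses its last column (the list of index pairs) and its CIGAR-like string. This is not code from the Gydle tools. The function and type names are hypothetical, and I am assuming that each pair is (reference label, query label), that M means matched labels, and that D means a deletion.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper types for this sketch.
typedef std::vector<std::pair<int, int> > SitePairs;  // (first index, second index)
typedef std::vector<std::pair<int, char> > HitOps;    // (run length, operation letter)

// Parse text such as "(73,2)(74,3)" into integer pairs.
// Returns an empty list when the text is malformed.
SitePairs parseSitePairs(const std::string& text) {
    SitePairs pairs;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t comma = text.find(',', i);
        const std::size_t close = text.find(')', i);
        if (text[i] != '(' || comma == std::string::npos ||
            close == std::string::npos || comma > close) {
            return SitePairs();
        }
        try {
            const int first = std::stoi(text.substr(i + 1, comma - i - 1));
            const int second = std::stoi(text.substr(comma + 1, close - comma - 1));
            pairs.push_back(std::make_pair(first, second));
        } catch (const std::exception&) {
            return SitePairs();  // a field was not a number
        }
        i = close + 1;
    }
    return pairs;
}

// Parse a CIGAR-like string such as "11M1D3M" into (run length, letter) entries.
// Returns an empty list when a letter has no run length or a number has no letter.
HitOps parseHitEnum(const std::string& text) {
    HitOps ops;
    int count = 0;
    bool haveDigits = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (std::isdigit(static_cast<unsigned char>(c))) {
            count = count * 10 + (c - '0');
            haveDigits = true;
        } else {
            if (!haveDigits) return HitOps();
            ops.push_back(std::make_pair(count, c));
            count = 0;
            haveDigits = false;
        }
    }
    if (haveDigits) return HitOps();
    return ops;
}

// Sum the run lengths of one operation letter.
int countOps(const HitOps& ops, char op) {
    int total = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (ops[i].second == op) total += ops[i].first;
    }
    return total;
}
```

Under those assumptions, the two columns of the XMAP line agree with each other: "11M1D3M" contains 11 + 3 = 14 matched labels, and the pair list contains 14 pairs, with reference index 84 skipped where the 1D sits.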



Every day, we need to lock down the behavior of our software. This is achieved using a test suite with many test cases, which in turn contain many assertions. So far, this approach is paying off because we rarely have regressions. However, there is of course a cost (human time) to maintain a test suite. I found the article "Good Coding Practices" written by Nicholas Nethercote very interesting. In it, the author introduces the concept of an envelope of known behavior. For the Gydle Inc optical mapping business case, the envelope of known behavior for our optical aligner mostly consists of the capability of aligning things to other things, within a range of scaling factors.

Despite its very small size, Gydle Inc. does have a culture. The tenets at Gydle Inc are not numerous, but they are important. I compiled a list of the ones that I find to be true.

  1. The reference is always right, but the reference can evolve according to newly obtained evidence.
  2. Everything must have a name: sequences, assets, maps, references, contigs, and so on. We use these names to better communicate and describe what is going on.
  3. We always test something on a whole dataset and not on just a sample.
  4. The paradigms implemented in our aligners use several phases: indexing, searching, baby-HSP (high-scoring segment pair) generation, HSP sorting, HSP binning, HSP optimization, filtering.
  5. It is generally better to combine different data types to alleviate their respective weaknesses.

Tenet number 2 is important. If you take, for example, the PDF (Portable Document Format) specifications of the BNX, CMAP, and XMAP formats from the next-generation mapping (NGM) company BioNano Genomics, all the names of molecules, maps, or whatnot, are stored as integers (the type int). So it follows that you have to say things like:

"Hey Joe, how are you? Can you align the molecule "1" on the map "1" and report back to me if you are getting anything good at all?" Now, say the mapping project used many flow cells; then you are possibly stuck with many optical assets with the exact same name. Yes, you could have 3 molecules with the name "1". Adding a namespace, as in "FC1:1", "FC2:1", and "FC3:1", promptly solves the problem, but this is not possible directly because, as stated above, the BNX, CMAP, and XMAP formats specify that integers are to be used for the names.
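
The namespace idea can be sketched in a few lines of C++11. This is only an illustration of the "FC1:1" naming scheme described above, not the way any Gydle tool does it, and the function name makeNamespacedName is hypothetical.

```cpp
#include <sstream>
#include <string>

// Build a name such as "FC1:1" from a flow cell number and the integer
// identifier that the molecule has inside that flow cell.
// Hypothetical helper for illustration only.
std::string makeNamespacedName(int flowCell, long moleculeId) {
    std::ostringstream out;
    out << "FC" << flowCell << ":" << moleculeId;
    return out.str();
}
```

With this, the three molecules that were all called "1" become "FC1:1", "FC2:1", and "FC3:1", so each one can be talked about without ambiguity.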

If you look at Illumina sequences, they too have names that are not integers.


The Bioinformatics Adventure continues.
