2013-02-28

Introducing genome subway maps

It's no secret that data visualization is more appealing than bare tables of floating-point numbers and integers. And visualization can be dynamic and responsive too, if designed correctly. In November 2012, I started to work on a pet project called Ray Cloud Browser. From the name, you might think that it's something to browse stuff related to astronomy: rays and clouds. In fact, that's untrue. Ray Cloud Browser is a data browser that can run in the cloud -- an abstraction for virtualized hardware that you pay for by the hour. Ray is just the brand name of the products I am working on during my doctoral projects.

Ray Cloud Browser is open source and free software. It's all on GitHub with nice documentation and all that. Anyway, enough with the chitchat.

The first picture I want to share is this view that illustrates repeated regions in a genome. It looks a lot like a subway map, hence the title of this post.





You can visit this subway location yourself here. The demo is running on a t1.micro spot instance on Amazon EC2.

It's even possible (boom!) to have a menu when navigating this genetic map in the cloud.


The visual landscape of regions that are unique in a genome (or in a metagenome, transcriptome, or whatever -ome you deem the best for you) is calmer and simpler, like the one below.


In the scientific literature, repeats are usually described as simple branching points in the string graph. Well, some of them are simple (such as the one in the picture below), but most of them are complex, with repeats within repeats (worlds within worlds).



My backlog is almost depleted, meaning that soon Ray Cloud Browser will be full of features.

There is a short guide on how to deploy this super-cool software for your own use.

2013-02-25

Building a client for visualizing graphs, in a browser

A graph has a set of vertices and a set of edges. An edge is a relationship between two vertices.

If you take Facebook, the vertices are people and the edges are friendships. If you take two people on Facebook, they are probably connected by just a few links -- like in pretty much every discrete system known to mankind. A path (like the path between two people) is the second class of interesting objects for visualizing a given system, the first class being graphs.

With graphs and paths, it is possible to describe numerous discrete systems.

In Ray Cloud Browser -- a graph visualizer for genomics -- vocabulary terms were carefully selected. The 4 main object types are maps, sections, regions, and locations. A map is a graph in genomics. The vertices of a map are DNA sequences (like GATTACA), and the edges are direct neighbourhood relationships (such as GATTACA -> ATTACAG). A section is really just a bunch of paths in the graph. The paths in the map are called regions. Several locations can be explored in a region.
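
To make the vocabulary concrete, here is a minimal Python sketch of a map as a graph of k-mers. It is not the data structure used by Ray Cloud Browser: the function name build_kmer_graph and the dictionary layout are assumptions made only for illustration.

```python
def build_kmer_graph(sequence, k):
    """Return {kmer: set of child k-mers} for every k-mer of `sequence`.

    Hypothetical helper: two k-mers are direct neighbours when the
    second one is the first one shifted by a single nucleotide.
    """
    graph = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Register the vertex even if it ends up without children.
        children = graph.setdefault(kmer, set())
        if i + k < len(sequence):
            children.add(sequence[i + 1:i + k + 1])
    return graph


graph = build_kmer_graph("GATTACAG", 7)
# graph["GATTACA"] contains the edge GATTACA -> ATTACAG
```

A region is then simply a walk through this dictionary, from one vertex to one of its children, and so on.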

The geometrical landscape of data in Ray Cloud Browser is quite easy to browse because any map has an index associated with it, and the same goes for sections, regions, and locations. For instance, {"map": 0, "section": 3, "region": 5, "location": 3000} will get you somewhere in a genome.

Mathematically, there is a mapping from the set of locations -- each one a 4-tuple of integers (map, section, region, location) -- to the union of the set of all possible sequences and the set containing only the nil object. Going the other way, for a given sequence, a set of 4-tuples (like those described above) can be obtained.

When the operator is at the end of a given region (for example, a contig), it is useful to know which regions are nearby in the map. To do so, the web service must have an action to search for the regions associated with any sequence (remember, sequences are vertices in the map).
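
The forward lookup (location to sequence) and the reverse lookup (sequence to locations) described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the regions dictionary, the value of K, and the names kmer_at and build_annotations are all hypothetical and not part of Ray Cloud Browser.

```python
from collections import defaultdict

K = 7  # k-mer length, kept small for the example

# Hypothetical toy data: (map, section, region) -> nucleotide string.
regions = {
    (0, 0, 0): "GATTACAGA",
    (0, 0, 1): "TTGATTACA",
}


def kmer_at(map_id, section, region, location):
    """Forward lookup: a 4-tuple gives one k-mer, or None (the nil object)."""
    sequence = regions.get((map_id, section, region))
    if sequence is None or location < 0 or location + K > len(sequence):
        return None
    return sequence[location:location + K]


def build_annotations():
    """Reverse lookup table: k-mer -> list of 4-tuples where it occurs."""
    annotations = defaultdict(list)
    for (map_id, section, region), sequence in regions.items():
        for location in range(len(sequence) - K + 1):
            kmer = sequence[location:location + K]
            annotations[kmer].append((map_id, section, region, location))
    return annotations


annotations = build_annotations()
# annotations["GATTACA"] lists every location where GATTACA occurs,
# which is what is needed to find the regions near the end of a contig.
```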

This is about to become a reality in Ray Cloud Browser. It is exciting to reach this significant milestone after 4 months of relentless work.


In my backlog, I have only 5 tasks remaining! Yay!

  • store path data inside Region class (data engine) (20 min)
  • push other paths in region list when receiving annotations (data engine) (30 min)
  • do readahead for other paths too (data engine) (30 min)
  • select region in menu (UI) (30 min)
  • paths in other colors (rendering) (30 min)

I will use this new powerful feature to better understand what's going on in various Ray issues.


2013-02-22

Using Cost Allocation Report on Amazon Web Services (AWS)


AWS offers web services like compute instances. Lately, I have been using one cc2.8xlarge instance for 3 hours on a weekly basis to give training sessions. My 14 students connect to orion.cloud.raytrek.com (a canonical name for my AWS instance) during every training session. The instance has one additional 300 GiB EBS volume attached to it so that my students keep their data for the whole duration of the training program.

On AWS, I can tag anything I use: EC2 instances, EC2 EBS volumes, S3 buckets, and so on. A tag is a key and a value (key=value), for example Project=Ray-Cloud-Browser-public-demo. On AWS, it's possible to activate a feature called Cost Allocation Report. This feature deposits detailed usage reports in one S3 bucket that I own. These reports include costs.

I tagged my Cost Allocation Report S3 bucket with Project=Billing to get a grasp on how much it costs to use the Cost Allocation Report feature. The cost of getting my Cost Allocation Report reports is less than $0.03.

I wrote a Ruby script that generates pivot tables for my projects. Below are my Cost Allocation Report tables with some confidential information redacted. Things under Project=not-classified are things that were not tagged.

Pivot table for Cost Allocation Report on AWS
File from AWS S3: #####################-aws-cost-allocation-2013-02.csv

Project=ray-in-cloud-cc2.8xlarge-CLI
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 2.8E-7 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00006033 | 0.0
AmazonEC2 | SpotUsage:cc2.8xlarge | instance-hours | 10.00000000 | 2.7
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00002029 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00004414 | 5.0e-06
Total=2.700005

Project=Ray-Cloud-Browser-##############
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 0.00004765 | 6.0e-06
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00000286 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 8.65691169 | 0.0
AmazonEC2 | SpotUsage:t1.micro | instance-hours | 338.75138122 | 1.017348
AmazonEC2 | EBS:VolumeUsage | GB-months | 36.33402209 | 3.633367
AmazonEC2 | EBS:VolumeIOUsage | I/O requests | 3861215.84788799 | 0.385104
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00000230 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.18180145 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00000500 | 1.0e-06
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.39561271 | 0.047268
Total=5.083094

Project=formation-#############-bioinformatique-hiver-2013
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 0.00045277 | 6.0e-05
AWSDataTransfer | DataTransfer-In-Bytes | GB | 3.90998689 | 0.0
AmazonEC2 | BoxUsage:cc2.8xlarge | instance-hours | 10.00000000 | 24.0
AmazonEC2 | EBS:VolumeUsage | GB-months | 145.47762856 | 14.547622
AmazonEC2 | EBS:VolumeIOUsage | I/O requests | 267824.57214411 | 0.026712
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.10216341 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.22231474 | 0.026562
Total=38.600956

Project=not-classified
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 0.07435498 | 0.009904
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 0.00000645 | 1.0e-06
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00165190 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.04349139 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.04270176 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.04325174 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 5.47499061 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00000197 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00018585 | 0.0
AmazonS3 | Requests-Tier1 | HTTP requests | 26.42622951 | 0.001066
AmazonEC2 | BoxUsage:cc2.8xlarge | instance-hours | 1.00000000 | 2.4
AmazonEC2 | SpotUsage:t1.micro | instance-hours | 1.02651934 | 0.003083
AmazonEC2 | SpotUsage:t1.micro | instance-hours | 250.47071823 | 0.752221
AmazonEC2 | EBS:VolumeUsage | GB-months | 28.04440006 | 2.804413
AmazonEC2 | DataProcessing-Bytes | | 0.00146411 | 0.01
AmazonSNS | Requests-Tier1 | HTTP requests | 459.00000000 | 0.0
AmazonEC2 | EBS:VolumeIOUsage | I/O requests | 3051481.50818382 | 0.304344
AmazonEC2 | SpotUsage:cr1.8xlarge | instance-hours | 9.00000000 | 3.09
AmazonEC2 | LoadBalancerUsage | | 1.00000000 | 0.03
AmazonEC2 | SpotUsage:cc2.8xlarge | instance-hours | 10.00000000 | 2.7
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00025260 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00293550 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00102118 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00026375 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.14774928 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 4.3E-7 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00040591 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00054968 | 6.6e-05
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00638785 | 0.000763
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00222216 | 0.000266
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00057394 | 6.9e-05
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.32151280 | 0.038415
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 9.4E-7 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00088329 | 0.000106
AmazonEC2 | SpotUsage:cr1.8xlarge | instance-hours | 3.00000000 | 1.03
AmazonEC2 | SpotUsage:cc2.8xlarge | instance-hours | 1.00000000 | 0.27
Total=13.444717

Project=Ray-Cloud-Browser-public-demo
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-Regional-Bytes | GB | 0.00021335 | 2.8e-05
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00002007 | 0.0
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.22613437 | 0.0
AmazonEC2 | SpotUsage:t1.micro | instance-hours | 338.75138122 | 1.017348
AmazonEC2 | EBS:VolumeUsage | GB-months | 36.33402209 | 3.633367
AmazonEC2 | EBS:VolumeIOUsage | I/O requests | 3726756.78467940 | 0.371693
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00001613 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.56092834 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00003510 | 4.0e-06
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 1.22061940 | 0.145841
Total=5.168281

Project=ray-in-cloud-cc2.8xlarge
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.04343685 | 0.0
AmazonEC2 | SpotUsage:cc2.8xlarge | instance-hours | 10.00000000 | 2.7
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00232737 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00506452 | 0.000605
Total=2.700605

Project=Billing
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AWSDataTransfer | DataTransfer-In-Bytes | GB | 0.00301682 | 0.0
AmazonS3 | Requests-Tier1 | HTTP requests | 221.57377049 | 0.008934
AmazonS3 | Requests-Tier2 | HTTP requests | 362.00000000 | 0.01
AmazonS3 | TimedStorage-ByteHrs | GB | 0.00004882 | 0.01
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00011206 | 0.0
AWSDataTransfer | DataTransfer-Out-Bytes | GB | 0.00024386 | 2.9e-05
Total=0.028963

Project=Ray-TestSuite
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AmazonEC2 | EBS:VolumeUsage | GB-months | 0.01230827 | 0.001231
AmazonEC2 | EBS:VolumeIOUsage | I/O requests | 21529.28710468 | 0.002147
Total=0.003378

Project=###############-instance-testing
Product Code | Usage Type | Units | Usage Quantity | Total Cost ($)
AmazonEC2 | BoxUsage:hs1.8xlarge | instance-hours | 1.00000000 | 4.6
Total=4.6
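
For readers who want to build similar pivot tables, here is a minimal sketch in Python (my own script is in Ruby and is not shown here). The column names ProductCode, UsageType, UsageQuantity, TotalCost and user:Project are assumptions about the header of the report file, and the sample rows are toy data; check both against your own report before relying on this.

```python
import csv
import io
from collections import defaultdict


def pivot_by_project(csv_text, tag_column="user:Project"):
    """Group report rows by project tag and sum their costs.

    The column names are assumptions about the report header; rows
    without a tag are filed under "not-classified".
    """
    tables = defaultdict(list)
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        project = row.get(tag_column) or "not-classified"
        cost = float(row["TotalCost"])
        tables[project].append(
            (row["ProductCode"], row["UsageType"],
             float(row["UsageQuantity"]), cost))
        totals[project] += cost
    # Round the sums to hide floating-point noise in the totals.
    return dict(tables), {p: round(t, 6) for p, t in totals.items()}


# Toy rows for illustration only.
sample = """ProductCode,UsageType,UsageQuantity,TotalCost,user:Project
AmazonEC2,SpotUsage:cc2.8xlarge,10.0,2.7,demo-project
AWSDataTransfer,DataTransfer-Out-Bytes,0.00506452,0.000605,demo-project
AmazonEC2,BoxUsage:hs1.8xlarge,1.0,4.6,
"""
tables, totals = pivot_by_project(sample)
```

Each key of tables is one Project=... block, and totals holds the matching Total=... line.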

People at Amazon.com, Inc. always say that they are obsessed with their customers. I am a happy customer of Amazon Web Services, Inc. (AWS), and I can confirm that AWS is really easy for the customer for many reasons, like the Cost Allocation Report.

Cost Allocation Report is really a feature for the customer that allows a better understanding of costs in the cloud. AWS could have charged a lot for that kind of feature -- banks charge their customers a lot for getting account statements from 3 years ago.


P.S.: I have no financial or commercial links with AWS; I am really just one happy AWS customer. I really think that AWS is giving me a great service for the money I give them. It's a win-win situation.

Big milestone reached for Ray Cloud Browser

It's almost March, and yet another milestone for Ray Cloud Browser was successfully reached this week. The data model of this software is composed of 4 types of objects: a map (a DNA k-mer graph with a name), a section (a group of DNA sequences which are called regions), a region (a DNA sequence), and a location (a position in a region).

Ray Cloud Browser is a distributed application: some parts run in your browser, and other parts run in the cloud (or your other favorite place to host your infrastructure). The client is written in JavaScript and HTML5 and runs in a web browser. The web service is written in C++ and runs atop a web server.

The web service is really efficient. There are 3 binary file formats (each with an ASCII version that can be converted). The first is the map, which contains all the k-mers of a sample, their coverage, their parents and their children. Any k-mer can be obtained in logarithmic time using the C++ API of this file format. The second file format is the region file format. It allows the retrieval of parts of any region in constant time (each operation is constant time, so fetching N locations of a section obviously performs O(N) operations). The last format implements annotations. Annotations allow a reverse search: with annotations, it's easy and fast to get a list of locations (map, section, region, location) for any k-mer. This is necessary to have a rich user experience in the HTML5 client, where several regions are to be rendered in the user interface.
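
One common way to get logarithmic-time lookups is a binary search over records sorted by k-mer. The sketch below shows that idea in Python on an in-memory list; it is an assumption about the general technique, not a description of the actual binary file format or of the C++ API. The names records and find_kmer and the coverage values are hypothetical.

```python
from bisect import bisect_left

# Hypothetical stand-in for a sorted map file:
# (k-mer, coverage) records sorted by k-mer.
records = sorted([
    ("GATTACA", 12),
    ("ATTACAG", 11),
    ("TTACAGA", 9),
])
keys = [kmer for kmer, _ in records]


def find_kmer(kmer):
    """Binary search: O(log n) comparisons over n sorted records."""
    index = bisect_left(keys, kmer)
    if index < len(keys) and keys[index] == kmer:
        return records[index]
    return None
```

With fixed-size records, the same search can run directly on a file by seeking to record number index instead of indexing a list.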

For the end user, the starting point is

http://smart.cloud.raytrek.com:55001/client/

The port 55001 is just because I am using IBM SmartCloud. Usually, the port is implicit and it's 80.

Below, each HTTP query is followed by the HTTP response. In some cases, I truncated the message body.

The HTTP query for the first communication follows.

GET /client/ HTTP/1.1
Host: smart.cloud.raytrek.com:55001

HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 04:56:58 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 20 Feb 2013 03:37:53 GMT
ETag: "102a33-9f5-4d61faedea640"
Accept-Ranges: bytes
Content-Length: 2549
Connection: close
Content-Type: text/html; charset=UTF-8







<br>Ray Cloud Browser: interactively skim processed genomics data with energy<br>


(message body is truncated)

This returns HTML content; the client then fetches all the required JavaScript files and so on.

The first HTTP query performed by the client returns the list of maps and associated sections for each map.

GET /server/?tag=RAY_MESSAGE_TAG_GET_MAPS HTTP/1.1
Host: smart.cloud.raytrek.com:55001
 
HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 04:54:03 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{"maps": [
{    "name": "Sample 2-3 2013-02-19-1",
    "sections": [
        { "name": "contigs" } ,
        { "name": "scaffolds" } ,
        { "name": "seeds" } ,
        { "name": "extensions" }
] },
{    "name": "American eel 2013-01-31-8",
    "sections": [
        { "name": "contigs" }
] }
]}



The next query fetches information about a particular map.

GET /server/?tag=RAY_MESSAGE_TAG_GET_MAP_INFORMATION&map=0 HTTP/1.1
Host: smart.cloud.raytrek.com:55001

HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 05:00:36 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{
"map": 0,
"kmerLength": 61,
"entries": 177593546
}

The next query fetches the list of regions in a section of a map.


GET /server/?tag=RAY_MESSAGE_TAG_GET_REGIONS&map=0&section=0&first=0&readahead=4096 HTTP/1.1
Host: smart.cloud.raytrek.com:55001

Date: Fri, 22 Feb 2013 05:03:23 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{ "map": 0,
"section": 0,
"count": 31701,
"first": 0,
"readahead": 4096,
"regions": [
{"name":"contig-256000092 485463 nucleotides", "nucleotides":485463},
{"name":"contig-207000075 447363 nucleotides", "nucleotides":447363},
{"name":"contig-255000091 320321 nucleotides", "nucleotides":320321},
{"name":"contig-17 290352 nucleotides", "nucleotides":290352},
{"name":"contig-80 255554 nucleotides", "nucleotides":255554},
{"name":"contig-269000011 233955 nucleotides", "nucleotides":233955},
{"name":"contig-5 207507 nucleotides", "nucleotides":207507},
{"name":"contig-253000001 203979 nucleotides", "nucleotides":203979},
{"name":"contig-24 176868 nucleotides", "nucleotides":176868},
{"name":"contig-51 139462 nucleotides", "nucleotides":139462},
{"name":"contig-79 134613 nucleotides", "nucleotides":134613},
{"name":"contig-93 132985 nucleotides", "nucleotides":132985},
{"name":"contig-105 125302 nucleotides", "nucleotides":125302},


(message body is truncated)

The client can then ask for a bunch of k-mers for a given region.

GET /server/?tag=RAY_MESSAGE_TAG_GET_REGION_KMER_AT_LOCATION&map=0&section=0&region=4&location=2000&readahead=512 HTTP/1.1
Host: smart.cloud.raytrek.com:55001

HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 05:05:29 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{
"map": 0,
"section": 0,
"region": 4,
"kmerLength": 61,
"location": 2000,
"name":"contig-80 255554 nucleotides",
"nucleotides":255554,
"readahead": 512,
"vertices": [
{"position":1744,"value":"CCGGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTG"},
{"position":1745,"value":"CGGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGC"},
{"position":1746,"value":"GGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCA"},
{"position":1747,"value":"GTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCAT"},
{"position":1748,"value":"TCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATG"},
{"position":1749,"value":"CAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGA"},
{"position":1750,"value":"AAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAA"},
{"position":1751,"value":"AACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAG"},
{"position":1752,"value":"ACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGC"},
{"position":1753,"value":"CGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCG"},
{"position":1754,"value":"GTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCGT"},
{"position":1755,"value":"TACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCGTT"},
{"position":1756,"value":"ACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCGTTA"},
{"position":1757,"value":"CATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCGTTAT"},
{"position":1758,"value":"ATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCATGAAGCGTTATC"},


(message body is truncated)
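
Consecutive vertices in this response overlap on all but one nucleotide, so a client can stitch them back into the nucleotide sequence of the region. Here is a minimal Python sketch of that step; the function name join_kmers is hypothetical, and the three vertices are copied from the response above.

```python
def join_kmers(vertices):
    """Merge consecutive k-mers (each shifted by one nucleotide) into one sequence."""
    vertices = sorted(vertices, key=lambda v: v["position"])
    sequence = vertices[0]["value"]
    for previous, current in zip(vertices, vertices[1:]):
        if current["position"] != previous["position"] + 1:
            raise ValueError("positions are not consecutive")
        # Consecutive k-mers must overlap on k - 1 nucleotides.
        if previous["value"][1:] != current["value"][:-1]:
            raise ValueError("k-mers do not overlap")
        sequence += current["value"][-1]
    return sequence


vertices = [
    {"position": 1744, "value": "CCGGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTG"},
    {"position": 1745, "value": "CGGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGC"},
    {"position": 1746, "value": "GGTCAAACGTACATAACGAATGGTAGGATACAGGACGTATTTACCTTCACATTTGACTGCA"},
]
sequence = join_kmers(vertices)
```

Each extra vertex adds exactly one nucleotide, so N consecutive k-mers of length k give a sequence of k + N - 1 nucleotides.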

The last two queries in the HTTP API of Ray Cloud Browser allow the client to get the attributes of a k-mer and to get the annotations of a k-mer.

GET /server/?tag=RAY_MESSAGE_TAG_GET_KMER_FROM_STORE&map=0&object=CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC&depth=512 HTTP/1.1
Host: smart.cloud.raytrek.com:55001

HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 05:07:59 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{
"map": 0,
"object": "CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC",
"vertices": [
{
        "value": "CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC",
        "coverage": 144,
        "parents": ["G"],
        "children": ["G"]
},
{
        "value": "GCGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATAT",
        "coverage": 155,
        "parents": ["C", "T"],
        "children": ["A", "C"]
},


(message body is truncated)
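
In the response above, each parent or child is a single letter. Assuming that a parent letter is prepended to the k-mer (dropping its last nucleotide) and a child letter is appended (dropping its first nucleotide), the full neighbouring k-mers can be rebuilt as in this Python sketch. That reading is consistent with the two vertices shown, but it is my assumption for the example, and the function name neighbours is hypothetical.

```python
def neighbours(vertex):
    """Expand the one-letter parents/children of a vertex into full k-mers."""
    value = vertex["value"]
    parents = [letter + value[:-1] for letter in vertex["parents"]]
    children = [value[1:] + letter for letter in vertex["children"]]
    return parents, children


vertex = {
    "value": "CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC",
    "coverage": 144,
    "parents": ["G"],
    "children": ["G"],
}
parents, children = neighbours(vertex)
# parents[0] is the value of the second vertex in the response above.
```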

GET /server/?tag=RAY_MESSAGE_TAG_GET_OBJECT_ANNOTATIONS&map=0&object=CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC HTTP/1.1
Host: smart.cloud.raytrek.com:55001

HTTP/1.1 200 OK
Date: Fri, 22 Feb 2013 05:10:08 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Ray Cloud Browser by Ray Technologies
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Content-Type: application/json

{
"results": [
{ "object": "CGGCGCTTCCCATCACCTTAAGTTATCCAGAGGACATATTTGTGATGGAATCACACATATC",
"annotations": [
{ "type": "LocationAnnotation", "section": 0,  "region": 4,  "location": 2000 }
]
}]}
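
All the queries above share one shape: a tag that selects the action, plus a few parameters. A small Python helper can build them; this sketch only assembles the URL and does not contact any server. The helper name build_query is hypothetical, while the tag and parameter names are the ones from the transcripts above.

```python
from urllib.parse import urlencode

BASE_URL = "http://smart.cloud.raytrek.com:55001/server/"


def build_query(tag, **parameters):
    """Build the URL for one action of the HTTP API; `tag` selects the action."""
    query = {"tag": tag}
    query.update(parameters)
    return BASE_URL + "?" + urlencode(query)


url = build_query("RAY_MESSAGE_TAG_GET_REGION_KMER_AT_LOCATION",
                  map=0, section=0, region=4, location=2000, readahead=512)
```

The resulting URL can then be fetched with any HTTP client, and the JSON body parsed with the json module.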



The data inside the web service are currently added and managed with RayCloudBrowser-client -- a command-line client that uses the Ray Cloud Browser C++ API. The available commands are:

RayCloudBrowser-client add-map
RayCloudBrowser-client add-section
RayCloudBrowser-client create-map
RayCloudBrowser-client create-map-annotations-with-section
RayCloudBrowser-client create-section
RayCloudBrowser-client describe-configuration
RayCloudBrowser-client describe-json-file
RayCloudBrowser-client describe-map
RayCloudBrowser-client describe-map-annotations
RayCloudBrowser-client describe-map-object
RayCloudBrowser-client describe-map-object-annotations
RayCloudBrowser-client describe-map-objects
RayCloudBrowser-client describe-map-with-region
RayCloudBrowser-client describe-section


Running any of these commands without arguments will give you a help page.


I think this visualization project is exciting. Eventually, the command-line client for managing a deployment will be replaced entirely by new actions available in the endpoint of the web service, like pushing new maps or new sections.

A really cool feature for the long-term vision is to have a web action in the HTTP API of Ray Cloud Browser to allow end users to push their FASTQ sequences directly into the cloud.

Something that I am really proud of with the HTTP API of Ray Cloud Browser is that it completely hides how the objects are actually stored by the web service.

For instance, RAY_MESSAGE_TAG_GET_MAP_INFORMATION with map=0 just tells the endpoint that the request is about map #0 in the list of maps returned by RAY_MESSAGE_TAG_GET_MAPS.

Right now, the storage engine uses memory-mapped files with O_RDONLY for open(), and PROT_READ and MAP_SHARED for mmap().
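
For readers unfamiliar with these flags, here is the same combination in Python (the web service itself is in C++). This is a Unix-only sketch on a temporary file created just for the example; it is not code from Ray Cloud Browser.

```python
import mmap
import os
import tempfile

# Create a small file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"GATTACA")
    path = handle.name

fd = os.open(path, os.O_RDONLY)
try:
    # Read-only, shared mapping: the kernel pages the file in on demand,
    # and processes mapping the same file can share those pages.
    view = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    try:
        first_bytes = view[0:4]
    finally:
        view.close()
finally:
    os.close(fd)
    os.remove(path)
```

Slicing the mapping reads straight from the mapped file, so fetching a record at a known offset needs no explicit read() call.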

2013-02-20

Using canonical names for cloud instances

I am using these public cloud services:


Product | Service Provider
Amazon Elastic Compute Cloud (EC2) | Amazon Web Services, Inc. (AWS)
Windows Azure Linux Virtual Machines | Microsoft Corporation
Rackspace Cloud Servers | Rackspace, U.S. Inc.
IBM SmartCloud® | IBM Corporation


My canonical names:


Name Type Value
browser.cloud.raytrek.com. CNAME ec2-23-23-55-35.compute-1.amazonaws.com.
thor.cloud.raytrek.com. CNAME 108-166-117-29.static.cloud-ips.com.
smart.cloud.raytrek.com. CNAME vhost0147.dc1.on.ca.compute.ihost.com.
azure.cloud.raytrek.com. CNAME ray-tech.cloudapp.net.
plp.cloud.raytrek.com. CNAME ec2-54-235-237-179.compute-1.amazonaws.com.
orion.cloud.raytrek.com. CNAME ec2-54-242-199-219.compute-1.amazonaws.com.

2013-02-19

Testing a Silver instance on IBM SmartCloud

IBM SmartCloud is free for 90 days.

My Silver instance runs Red Hat Enterprise Linux v5.4 and has 4 vCPUs, 8 GiB of RAM, and a 1060 GiB disk.

I can connect to my free-of-charge instance with the following command:

ssh -i ibmcloud_seb@boisvert.info_rsa idcuser@vhost0147.dc1.on.ca.compute.ihost.com


In the documentation, it says:

Open ports

  • 22 is the SSH port for the idcuser account
  • 523 is the DB2 Administration Server port
  • 50001 is the DB2 instance port for the db2inst1 user
  • 55001 is the DB2 Text Search port
  • 60000:60003 are the DPF ports for the FCM protocol
Warning: Every additional port is a potential security risk.


I usually like to have my port 80 open.


Another thing that I don't like is that there is no vim (the editor).

[idcuser@vhost0147 conf]$ vim
-bash: vim: command not found

And you cannot install it either.

[idcuser@vhost0147 conf]$ sudo yum install -y vim
Loaded plugins: rhnplugin, security
This system is not registered with RHN.
RHN support will be disabled.
Setting up Install Process
No package vim available.
Nothing to do

Anyway, IBM SmartCloud is cool regardless because it's in the cloud, right?

So I just told Apache httpd to use the port 55001 (DB2-related, but not in use).

Now I have a nice web server with 8 GiB RAM, 4 vCPUs, and 1 TiB disk.


http://vhost0147.dc1.on.ca.compute.ihost.com:55001/

I created these two CNAME entries that point right to this instance:

smart.cloud.boisvert.info

and

smart.cloud.raytrek.com

So I have these two nice addresses too:

http://smart.cloud.boisvert.info:55001

http://smart.cloud.raytrek.com:55001


Now, let's deploy Ray Cloud Browser on that.


(To be continued)

Update 2013-02-20:

Ray Cloud Browser is now deployed in IBM SmartCloud:

http://smart.cloud.raytrek.com:55001/client

2013-02-16

IT architecture for buying tickets to the Festival d'été de Québec

Today, buying tickets for the Festival d'été de Québec was painful for customers (Radio-Canada.ca, journaldequebec.com).

From the site http://infofestival.com/ you can go to the ticket purchasing site, which is https://achat.infofestival.com/

The infofestival.com site is hosted in Ontario (Brampton), whereas achat.infofestival.com is in the cloud, at Amazon Web Services, Inc., in the city of Ashburn in the United States.

The A record in the DNS zone file for infofestival:

infofestival.com.    31    IN    A    50.100.3.86

Here are the DNS records in the zone file for achat.infofestival.com (one CNAME record and the A records behind it):


achat.infofestival.com. 300     IN      CNAME   sale-gtickets-1409583281.us-east-1.elb.amazonaws.com.
sale-gtickets-1409583281.us-east-1.elb.amazonaws.com. 60 IN A 23.23.167.37
sale-gtickets-1409583281.us-east-1.elb.amazonaws.com. 60 IN A 23.21.140.223
sale-gtickets-1409583281.us-east-1.elb.amazonaws.com. 60 IN A 107.22.217.101
sale-gtickets-1409583281.us-east-1.elb.amazonaws.com. 60 IN A 54.243.112.146


So, the ticket purchasing system implements load balancing across 4 instances in Amazon Elastic Compute Cloud. The load balancing is done with Amazon Elastic Load Balancing.

Here are the 4 A records in the DNS zone file for the 4 instances:

ec2-23-23-167-37.compute-1.amazonaws.com. 604800 IN A 23.23.167.37
ec2-23-21-140-223.compute-1.amazonaws.com. 604800 IN A 23.21.140.223
ec2-107-22-217-101.compute-1.amazonaws.com. 300 IN A 107.22.217.101
ec2-54-243-112-146.compute-1.amazonaws.com. 604800 IN A 54.243.112.146
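
In the records listed above, each EC2 host name embeds the IPv4 address it resolves to, which is how the addresses behind the load balancer name can be matched to instances without another lookup. A minimal Python sketch of that matching follows; the function name ip_from_ec2_hostname is hypothetical, and the pattern is only inferred from the host names in this post.

```python
import re


def ip_from_ec2_hostname(hostname):
    """Extract the dotted IPv4 address embedded in an ec2-a-b-c-d host name."""
    match = re.match(r"^ec2-(\d+)-(\d+)-(\d+)-(\d+)\.", hostname)
    if match is None:
        return None
    return ".".join(match.groups())


address = ip_from_ec2_hostname("ec2-23-23-167-37.compute-1.amazonaws.com.")
```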

The hypothesis is that it was infofestival.com that failed under the requests, not achat.infofestival.com.

Another hypothesis is that the 4 instances were talking to the same storage engine, so the load balancing was of no use.

Tools used: dig, nslookup.

Addendum (2013-02-17):

@JeanSebTr and @j15e pointed out that the 4 EC2 instances -- ec2-23-23-167-37, ec2-23-21-140-223, ec2-107-22-217-101, ec2-54-243-112-146 -- are not the instances that run the #FEQ application. These 4 EC2 instances handle load balancing across other EC2 instances. Yesterday's technical problems were (probably) on the other #FEQ EC2 instances, the ones that receive the requests. I learned a lot from talking with @JeanSebTr because he knows ELB well.

Example architecture:

The address MyLoadBalancer-2119387095.us-east-1.elb.amazonaws.com. is the load balancer.

It points to one EC2 instance (or several) running the load-balancing software of Amazon Web Services, Inc.

Here are the records in the DNS zone file:

MyLoadBalancer-2119387095.us-east-1.elb.amazonaws.com. 53 IN A 107.20.157.169
 
ec2-107-20-157-169.compute-1.amazonaws.com. 300    IN A 107.20.157.169

Two of my EC2 instances are registered with the load balancer and run an application (Ray Cloud Browser).

ec2-23-23-55-35.compute-1.amazonaws.com. 603363    IN A 23.23.55.35
 

ec2-54-235-237-179.compute-1.amazonaws.com. 86400 IN A 54.235.237.179

Visitors to http://myloadbalancer-2119387095.us-east-1.elb.amazonaws.com/client/ will see two versions (because my two EC2 instances do not have the same data -- my example does not make sense).

2013-02-13

A catalog of IBM Blue Gene/Q errors (for science)

Once in a while, I get to run things on an IBM Blue Gene/Q. And when that happens, some of my jobs always crash with random errors.

For science, here are some of them.

Update 2013-02-27 with MPI I/O: 

This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 4,cmd F_SETLKW/E,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 23.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Resource deadlock avoided
ADIOI_Set_lock:offset 25625204815, length 6282003
Abort(1) on node 2501 (rank 2501 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2501


SRA056234-Picea-glauca-2013-02-12-1

2013-02-13 04:56:26.627 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: terminated due to: killing the job timed out
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: abnormal termination by signal 35 from rank 2712 due to RAS event with record ID 279083. END_JOB control action heartbeat timed out after 60 seconds
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: 937 RAS events
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: most recent RAS event text: CFAM Machine Check. Message=REASON: Core4 failed (uncorrectable error).  DETAILS: CFAM_Status=0xc0000000, MachineCheck [Core 4 Chiplet chkstp reg=0x8400000000000000: , Summary bit for xfir_lt, Chkstp from FIR1 [Core 4 PCB FIR1=0x0000000020000000: , A2-L2 UE]]!, DrillDown=CFAM_Status=0xc0000000, MachineCheck [Core 4 Chiplet chkstp reg=0x8400000000000000: , Summary bit for xfir_lt, Chkstp from FIR1 [Core 4 PCB FIR1=0x0000000020000000: , A2-L2 UE]]!


SRA056234-Picea-glauca-2012-12-22-13

2012-12-22 21:40:47.545 (FATAL) [0x40000ee8a50] :23842:ibm.runjob.client.Job: could not start job: block is unavailable due to a previous failure
2012-12-22 21:40:47.546 (FATAL) [0x40000ee8a50] :23842:ibm.runjob.client.Job: node R00-M0-N00-J00 is not available: Software Failure


SRA056234-Picea-glauca-2013-01-18-15

2013-01-19 13:49:22.860 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: terminated due to: killing the job timed out
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: abnormal termination by signal 35 from rank 3345 due to RAS event with record ID 259250. END_JOB control action heartbeat timed out after 120 seconds
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: 189 RAS events
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: most recent RAS event text: A BQL double bit error threshold was exceeded for Switch 2 Group EVEN and ODD


SRA056234-Picea-glauca-2013-02-10-1

2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: received signal 15
2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: signal sent from USER
2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: sent from pid 12709
2013-02-11 16:10:37.816 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: could not read /proc/12709/exe
2013-02-11 16:10:37.816 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: Permission denied
2013-02-11 16:10:37.817 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: sent from uid 0 (root)
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: terminated by signal 9
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 10513
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: 139 RAS events
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=1 MCFIR error status:  [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op;


2013-02-07

Ray is a software robot


The quoted text snippets below are from Wikipedia.

According to Wikipedia, Ray

    "comes in two variants: a manned prototype version (...) and an unmanned, computer-controlled version (...)."

This part means basically that our product can be launched in an interactive terminal (manned) or with a job scheduler using a job description (unmanned).

    "RAY differs from previous Metal Gears in that it is not a nuclear launch platform, but instead a weapon of conventional warfare."

Ray is a versatile software for conventional workflows, although it can also perform assembly from nuclear DNA (DNA from the nucleus of a cell).

    "The Metal Gear RAY is more organic in appearance and in function than previous models."

Ray is appealing not only in its exterior look, but also in its design blueprints and source code.

    "Its streamlined shape helps to deflect enemy fire and allows for greater maneuverability both on land and in water."

Ray can maneuver in the cloud, on supercomputers, on laptops, on servers, and on toasters.

    "It also has a nervous-system-like network of conductive nanotubes, which connect the widely dispersed sensor systems and relay commands from the cockpit to the various parts of RAY's body, automatically bypassing damaged systems and rerouting to auxiliary systems when needed."

This part of the Wikipedia article is actually the one that I prefer. In that extract, it is acknowledged that Ray uses a sophisticated network to "relay commands" to "the various parts." This unmatched technology allows automatic rerouting of commands using a tailored system of relays. This system is the virtual communicator in Ray Platform.

I recommend a video called "Are robots hurting job growth?" from CBSNEWS to learn more about software robots.

Finally, for those wondering why Ray performed so well on the Snake dataset in the Assemblathon 2 competition, it's because one of the purposes of Ray is to annihilate Solid Snake. According to Wikipedia:

    "When Solidus Snake took over Arsenal along with the slave RAYs, he had them confront Raiden, who destroys them all."

So now you know!

It should be stated that the "slave RAYs" receive "relay commands" from "the cockpit", which is actually the master Ray.

I sure hope that this highly technical and scientific post is as educational (and entertaining) as I wanted it to be when I started it around one hour ago. The extensive external references contained herein will contribute to the upcoming popularity of this post.

Disclaimers: 
Ray is free software (GPLv3); 
Ray Platform is free software (LGPLv3);
Ray Cloud Browser is free software (GPLv3).
---
Sébastien Boisvert
Principal Product Manager, Ray Genomics Software Suite
(that's obviously a self-proclaimed not-serious title -- I am a PhD student ;-))
