2014-10-17

Profiling the Thorium actor model engine with LTTng UST

Thorium is an actor model engine in C 1999. It uses MPI and Pthreads.

The latency (in Thorium) when sending small messages between actors recently came to my attention.

In this post, LTTng-UST is used to generate actor message delivery paths annotated with time deltas in each step.

Perf


I have been working with perf for a while now, but found it only useful mostly for hardware counters.

I typically use the following command to record events with perf. Note that ($thread is the Linux LWP (lightweight process) thread number.

perf record -g \
    -e cache-references,cache-misses,cpu-cycles,ref-cycles,instructions,branch-instructions,branch-misses \
    -t $thread -o $thread.perf.data

 

 As far as I know, perf can not trace things like message delivery paths in userspace.

Tracing with LTTng-UST

 
This week, I started to read about tracepoints (perf does support "Tracepoint Events"). In particular, I wanted to use tracepoints to understand some erratic behaviors in Thorium.

LTTng-UST, the Linux Trace Toolkit Next Generation Userspace Tracer, has quite a long name.

A friend of mine (Francis Giraldeau) is a researcher in the field of tracing. He helped me getting started with LTTng-UST. I also got some help from the lttng-dev mailing list.

The data model for defining and storing tracepoint events (called CTF or Common Trace Format) is probably the big difference between LTTng and the other tracers. The LWN papers (part 1 and part 2) about LTTng are very interesting too and they discuss the CTF format.

According to the LTTng documentation (which is great by the way), tracing is done today in 3 easy steps:
  1. Instrument (add tracepoints in the source code);
  2. Trace (run the application while recording tracepoint events);
  3. Investigate (analyze tracepoint data using various techniques).

In the BIOSAL project, we already have (primitive) tracepoints, but the data gathered is basically analyzed (printed or discarded) in real time, which is probably a bad idea in the first place.

Regardless of where the tracepoint data go, we are using the same semantic as the one in LTTng-UST (LTTng was the inspiration). We insert a tracepoint in our code with something like this:

    /* trace message:actor_send events */
    thorium_tracepoint(message, actor_send, message);


Here is a little explanation: message (the first argument) is the provider, actor_send is the event name, and message (the third argument) is the data that we want to submit for tracing.


Adding my first tracepoint


To try LTTng, I added a define (#define thorium_tracepoint tracepoint) to use LTTng's tracepoint().

I also added a tracepoint in thorium_actor_send (actor.c) with the line below.

tracepoint(hello_world, my_first_tracepoint, name * name, "x^2");


I also added a rule for engine/thorium/tracepoints/lttng/hello-tp.o
in the Makefile (option: THORIUM_USE_LTTNG).

I then started the Spate application, and ran lttng list.

[boisvert@bigmem biosal]$ lttng list --userspace
Spawning a session daemon
UST events:
-------------
None





I ran it again since the LTTng daemon was not running.




[boisvert@bigmem biosal]$ lttng list --userspace
UST events:
-------------

PID: 23298 - Name: ./applications/spate_metagenome_assembler/spate
      ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
      hello_world:my_first_tracepoint (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)

PID: 23300 - Name: ./applications/spate_metagenome_assembler/spate
      ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
      hello_world:my_first_tracepoint (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)

PID: 23299 - Name: ./applications/spate_metagenome_assembler/spate
      ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
      hello_world:my_first_tracepoint (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)

PID: 23297 - Name: ./applications/spate_metagenome_assembler/spate
      ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
      hello_world:my_first_tracepoint (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)



Great. This is very simple to use.


Tracing inside a LTTng session


Now, I am ready to do actual tracing (with the tracepoint "hello_world:my_first_tracepoint"). The first thing to do now is to create a session.

[boisvert@bigmem biosal]$ lttng create my-userspace-session
Session my-userspace-session created.
Traces will be written in /home/boisvert/lttng-traces/my-userspace-session-20141017-154714


Then, I enable all userspace tracepoints for my session and start tracing.


[boisvert@bigmem biosal]$ lttng enable-event --userspace --all
All UST events are enabled in channel channel0


[boisvert@bigmem biosal]$ lttng start
Tracing started for session my-userspace-session


Then I run my application.

[boisvert@bigmem biosal]$ mpiexec -n 4 ./applications/spate_metagenome_assembler/spate -k 51 -threads-per-node 8 ~/dropbox/S.aureus.fasta.gz


Once my application terminates, I stopped tracing and destroyed my session.

[boisvert@bigmem biosal]$ lttng stop
Waiting for data availability.
Tracing stopped for session my-userspace-session
[boisvert@bigmem biosal]$
[boisvert@bigmem biosal]$ lttng destroy
Session my-userspace-session destroyed


Tracepoint data files were written in a directory in my home.

[boisvert@bigmem biosal]$ ls ~/lttng-traces/
my-userspace-session-20141017-154714



There are now 135 tracepoint files (all related to "channel0" for this single LTTng tracepoint session).

[boisvert@bigmem biosal]$ find ~/lttng-traces/my-userspace-session-20141017-154714/|wc -l
135


So far, I did step 1) Instrument and step 2) Trace. The third and last step is 3) Investigate.

One easy way to inspect LTTng traces is a tool named babeltrace. By default, this tool dumps the events in ASCII text in the standard output.

[boisvert@bigmem ~]$ babeltrace ~/lttng-traces/my-userspace-session-20141017-154714/ | head
[15:50:56.347344203] (+?.?????????) bigmem hello_world:my_first_tracepoint: { cpu_id = 38 }, { my_string_field = "x^2", my_integer_field = -689379607 }
[15:50:56.347520299] (+0.000176096) bigmem hello_world:my_first_tracepoint: { cpu_id = 39 }, { my_string_field = "x^2", my_integer_field = -695379712 }
[15:50:56.347582196] (+0.000061897) bigmem hello_world:my_first_tracepoint: { cpu_id = 38 }, { my_string_field = "x^2", my_integer_field = -691379644 }
[15:50:56.347698898] (+0.000116702) bigmem hello_world:my_first_tracepoint: { cpu_id = 36 }, { my_string_field = "x^2", my_integer_field = -695379712 }
[15:50:56.347765853] (+0.000066955) bigmem hello_world:my_first_tracepoint: { cpu_id = 38 }, { my_string_field = "x^2", my_integer_field = -693379679 }
[15:50:56.348014813] (+0.000248960) bigmem hello_world:my_first_tracepoint: { cpu_id = 7 }, { my_string_field = "x^2", my_integer_field = -695379712 }
[15:50:56.348053162] (+0.000038349) bigmem hello_world:my_first_tracepoint: { cpu_id = 38 }, { my_string_field = "x^2", my_integer_field = -683379484 }
[15:50:56.348172916] (+0.000119754) bigmem hello_world:my_first_tracepoint: { cpu_id = 46 }, { my_string_field = "x^2", my_integer_field = -695379712 }
[15:50:56.348228416] (+0.000055500) bigmem hello_world:my_first_tracepoint: { cpu_id = 38 }, { my_string_field = "x^2", my_integer_field = -683379484 }
[15:50:56.348305385] (+0.000076969) bigmem hello_world:my_first_tracepoint: { cpu_id = 46 }, { my_string_field = "x^2", my_integer_field = -695379712 }






Let's create useful tracepoints now.


Concrete tracepoints in Thorium



Below is a list of tracepoints in the delivery path of any message in Thorium.

- thorium_message:actor_send
- thorium_message:worker_send
- thorium_message:worker_send_enqueue
- thorium_message:node_send
- thorium_message:node_send_system
- thorium_message:node_send_dispatch
- thorium_message:node_dispatch_message
- thorium_message:worker_pool_enqueue
- thorium_message:transport_send
- thorium_message:transport_receive
- thorium_message:node_receive
- thorium_message:worker_receive
- thorium_message:actor_receive


Here is a delivery trace for a message delivered within a node (suspicious time delta is shown in red).

[boisvert at bigmem biosal]$ babeltrace  ~/lttng-traces/auto-20141018-104939/ |grep "message_number = 8000," | awk '{print $4" "$2}'|sed 's/: / /g'|sed 's/(//g'|sed 's/)//g'
thorium_message:actor_send +0.000010835
thorium_message:worker_send +0.000000876
thorium_message:worker_enqueue_message +0.000001410
thorium_message:worker_dequeue_message +0.003886256  
thorium_message:worker_pool_dequeue +0.000001434
thorium_message:node_send +0.000001438
thorium_message:node_send_system +0.000001006
thorium_message:node_dispatch_message +0.000001231
thorium_message:worker_pool_enqueue +0.000001827
thorium_message:node_send_dispatch +0.000001100
thorium_message:worker_receive +0.000003944
thorium_message:actor_receive +0.000003028

A message sent between nodes also shows this high delta (4 ms).

[boisvert at bigmem biosal]$ babeltrace  ~/lttng-traces/auto-20141018-104939/ |grep "message_number = 8800," | awk '{print $4" "$2}'|sed 's/: / /g'|sed 's/(//g'|sed 's/)//g'
thorium_message:actor_send +0.004115537
thorium_message:worker_send +0.000001402
thorium_message:worker_enqueue_message +0.000001457
thorium_message:worker_dequeue_message +0.004062639  
thorium_message:worker_pool_dequeue +0.000001536
thorium_message:node_send +0.000000934
thorium_message:node_send_system +0.000001185
thorium_message:transport_send +0.000001039
thorium_message:node_receive +0.000061961
thorium_message:node_dispatch_message +0.000001502
thorium_message:worker_pool_enqueue +0.000002299
thorium_message:worker_receive +0.000001632
thorium_message:actor_receive +0.000003461


Conclusion


According to the LTTng traces, the way messages are dequeued in Thorium is not very efficient.


Edit 2014-10-18: added actual tracepoints
Edit 2014-10-18: fixed format
There was an error in this gadget