Title: Analyze performance overhead of using Fast Downward through a Singularity container
Priority: feature
Status: resolved
Superseder: -
Nosy List: cedric, florian, guillem, jendrik, malte, silvan
Assigned To: guillem
Keywords: -
Optional summary: -

Created on 2018-03-19.12:28:02 by guillem, last changed by guillem.

Files:
brfs-time-memory-scatter-plots.tar.gz (uploaded by guillem, 2018-05-09.12:18:38, application/gzip)
msg7132 (view) Author: guillem Date: 2018-05-11.17:19:02
Yes, you're right, the distinction between Docker / Singularity should not be too relevant at the level of support
(to be tested, though), since Singularity easily allows starting an image off a Docker image, but it could make a
difference at the performance level, which needs to be tested. One point, though, is whether we want to provide support
for a Docker "user container", as this would be geared towards easing the execution of experiments, but Docker doesn't
seem to be too HPC-friendly anyhow.
An option would be to test things (both user and dev containers) first with Singularity, and after that see if it is
worth providing either of the two as a Docker container as well, if we think that might increase the outreach.

The questions you mention at the end need, of course, to be addressed in the form of documentation. I suggest the
first step here could be to create a (Github? Bitbucket?) repository with documentation, scripts and
Singularity recipes; the documentation could be moved to the website if we finally approve supporting this, but
in the meantime it can be used by us.

The issue of running experiments with Lab should perhaps be addressed separately. I've run the experiments using an
ugly hack, but on the other hand integrating the container with Lab should not be terribly difficult, given the
appropriate Lab experience :-) Where before we used a call string of the form e.g.:

./fast-downward.py --overall-time-limit 30m --overall-memory-limit 3584M --alias lama-first domain.pddl problem.pddl

now we would use:

./fast-downward.img --overall-time-limit 30m --overall-memory-limit 3584M --alias lama-first domain.pddl problem.pddl

i.e. simply replace the Python script with the binary image. Special care should be taken with options such as "--build
XXX" (i.e. do we want to provide images with different builds inside? probably not, but to be discussed) or "--
validate" (i.e. do we want to distribute VAL inside the image?), but "standard" command line options should work the same.

On the other hand, all of Lab's build steps could be done away with if we use a Singularity container, and replaced
with some "singularity pull downward:latest" command, or similar, which simply fetches the binary image, etc.
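The driver-to-image substitution described above can be sketched as a small helper that a Lab integration might use; the image filename and the assumption that all remaining arguments pass through unchanged are illustrative, not a committed design:

```python
import shlex


def containerize_call(callstring, image="fast-downward.img"):
    """Replace the Fast Downward driver script in a call string with a
    Singularity image binary, keeping all other arguments unchanged.

    The image filename is a placeholder; options such as --build or
    --validate may need special handling, as discussed above."""
    parts = shlex.split(callstring)
    # Swap the first token (the Python driver script) for the image.
    parts[0] = "./" + image
    return " ".join(parts)
```

For example, `containerize_call("./fast-downward.py --alias lama-first d.pddl p.pddl")` yields `./fast-downward.img --alias lama-first d.pddl p.pddl`.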

To wrap up, I suggest closing this issue, putting Docker temporarily on hold, and going for a Singularity "end-user
container" (issue788) and a Singularity-based dev kit (issue787).
msg7129 (view) Author: malte Date: 2018-05-11.12:11:46
Agreed on the way to proceed. I like the proposed name change for this issue.
Some comments before we close the issue:

Firstly, for those who are still not too sure about the design goals of
Singularity etc., I found these links useful:

Secondly, sometimes you say "Singularity" and sometimes you say
"Docker/Singularity". It is important to note that we have only really looked at
Singularity, and it would be good if we could do some investigation of
runtime/memory overhead with Docker, too. I expect it to be low because we don't
access the network, file system or kernel space much, so containerization should
not make a huge difference. But we should have some numbers, so that we can tick
this box before we recommend using Docker. Because of its root requirement, we
can only do a local experiment, but it should not be terribly difficult to find
a machine on which we can run a local 5-minute blind search experiment, even for
the whole benchmark suite, over a few days or so.

Thirdly, I agree that we should clearly separate the use cases of "user
container" (a container to run experiments) and "dev container" (a container for
development). User containers are what Singularity is intended for, and I can
see this being very useful for a number of reasons. I'm more skeptical regarding
dev containers (like Jendrik, but perhaps a bit less so), but I'm willing to be
convinced.
Before we move on to more complex things, I would first like to better
understand how to build a user container, though. It looks like you've already
done all the major work, but I don't know how to reproduce it. For example, I'd
be interested in the following things:

1) How do I create a user container myself?
2) How do I run the planner locally on my machine from a user container?
3) How do I run the planner on the grid, in a lab experiment, from a user container?
msg7128 (view) Author: guillem Date: 2018-05-11.11:18:46
I would perhaps close this current issue (and perhaps change its title
to "Analyze performance overhead of using Fast Downward through a Singularity
container" for easier future reference?), and then articulate what still needs to be done
on two main fronts:

(1) Provide a Singularity/Docker-based "Fast Downward Developer Toolkit".
    The idea here would be of course to (a) think through the general structure,
    (b) write a few scripts to automate everything that can be automated, and
    (c) test this a bit among ourselves before deciding whether it is useful
    enough to be released / maintained.

(2) Provide an off-the-shelf Singularity/Docker Fast Downward image which can be used
    by researchers to run experiments with Fast Downward without having to worry about
    installation of dependencies, etc. (the issue of LP software licenses should be
    considered here as well). The main issue here would be to decide which revisions / tags
    get built into an image, how we tag the images, etc., and in general which policy we
    follow in order to minimize the overhead of supporting containerized versions
    of the planner. Relevant points here would include: do we patch images if a bug is found?
    do we offer more than one image? when do we update the image? what exactly do we include
    in the image (e.g. GCC, Clang, Validate...)? etc.

I think that the work on both of these issues can be relatively (but not completely) independent.
I could start with (1) and perhaps give a bit of thought to (2) to kick off some discussion in
the next meeting...?
msg7127 (view) Author: malte Date: 2018-05-09.13:39:10
Thanks, Guillem! Very thorough. I would summarize that the results are very
positive: there is no significant performance penalty through containerization,
and being able to use more modern software even gives us a bit of a speed boost.
So from that side there would be no obstacles towards providing a container.

Referring back to our recent discussion in the Fast Downward meeting, I suppose
the other question is how easy/difficult it is to work with the container. I
suppose there are three aspects to this: what we would need to do to provide
containers, how difficult it is to use them for development, and how difficult
it is to use them for running the planner/conducting experiments. Of course not
all these aspects need to be part of this issue.

How do you suggest to proceed?
msg7126 (view) Author: guillem Date: 2018-05-09.13:29:52
Ok, a few notes about the results:

** Image 1 vs Image 2 **
With respect to mem. consumption, the second image has a small constant increase (less than 1 MB)
which might be due to the small diff in compiler version, but which doesn't seem a cause for
concern. This holds across the three different search strategies, both opt and sat.
The differences in runtime seem equally minor. Overall, this supports the idea that, as expected,
there is no major difference between running an image with a binary compiled from within the
image and another where the binary has been compiled in the cluster and moved into the image.

** Image 1 vs native planner (column 4) **
Mem. consumption: the "native" planner (no Singularity involved) has a smaller mem. consumption,
but the difference seems to be roughly constant, and even in the instances where overall mem.
consumption is highest, the difference is not larger than 4-5 MB. This holds again across the
three different search strategies.
The scatter plot for BrFS (file "brfs-memory-xenial-half_vs_native.png") shows that indeed the
relative differences are higher when the total mem. consumption is very low, but they become
smaller as the total mem. consumption increases.

In terms of total runtime, the scatter plot for BrFS ("brfs-total_time-xenial-half_vs_native.png")
shows a slight imbalance towards the Singularity image being marginally faster in more instances
than the other way around, which is somewhat surprising. I don't have a clear explanation for this.
Results for LMCUT do not show significant differences; in lama, there are indeed significant
differences in runtime between same instances, but overall they seem to cancel out.
A reason for this could be the possible non-determinism of the lama config?

** Image 1 vs Image 3 **
Finally, comparing these two images gives a rough idea of the improvement we could get if we
switch to an image with the latest Ubuntu, newer compiler and Python, etc.
In terms of memory, the BrFS scatter plot ("brfs-memory-xenial-half_vs_bionic-full.png") shows
there are some consistent savings in using Ubuntu bionic, but they become quite marginal as the
total memory consumption increases. The same happens with the other search strategies, with the
exception that for lama, in a few domains (e.g. the philosophers domain) the trend seems to
invert beyond a certain mem. limit... which is somewhat strange, given that for those very same
instances the number of expanded nodes remains the same.

In terms of total runtime, the BrFS scatter plot ("brfs-total_time-xenial-half_vs_bionic-full.png")
shows that in most instances the latest Ubuntu is between 5% and 15% faster, which is quite
interesting. Only in a handful of instances is the xenial image faster, and these seem to be
instances where the overall time is quite small.
msg7125 (view) Author: guillem Date: 2018-05-09.12:18:38
I have run a separate experiment for each configuration, for further clarity. These are:

== Breadth-First Search ==:
== A* + lm-cut ==:
== lama-first ==:

Let me provide the interpretation of each revision here again:

1dd300e9d9fc: "Half" singularity image: the experiments run a containerized singularity image, 
               but the binary inside the container was compiled directly on the cluster (GCC 5.4.0, Python 2.7.11)
552ab3681c03: "Full" singularity image on ubuntu xenial (16.04). GCC 5.3.1
77378d182932: "Full" singularity image on ubuntu bionic (18.04). GCC 7.3.0
ffd0ec660144: planner run natively on the cluster, no containerization at all. (GCC 5.4.0, Python 2.7.11)

One would expect the first and the second images above to perform roughly the same, but I thought it would be good to validate that as well.
The main objective is to compare the first and the fourth, which should give an idea of the overhead caused by the containerization technology.
Finally, the third image is there so that we can also get an idea of the benefits of using a newer compiler / Python version
(which is interesting in itself if the performance difference is large enough, as all experiments run in the cluster could benefit from that).

I am also attaching a tarball with relative scatter plots for memory and total time figures, only for the breadth-first search experiment,
for all possible pairs of images.
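Relative scatter plots like these pair up the per-task values of two configurations; the ratios behind them can be sketched as below, where the data layout (a dict of per-task properties per configuration) and the attribute names are hypothetical:

```python
def relative_ratios(runs_a, runs_b, attribute):
    """Given {task: properties} dicts for two configurations, return
    {task: value_b / value_a} for tasks with data in both.
    Ratios above 1 mean configuration B used more of the attribute."""
    ratios = {}
    for task, props_a in runs_a.items():
        props_b = runs_b.get(task)
        if props_b is None:
            continue
        a = props_a.get(attribute)
        b = props_b.get(attribute)
        if a and b:  # skip missing or zero values
            ratios[task] = b / a
    return ratios
```

Feeding these ratios to a log-log scatter plot (one point per task) reproduces the kind of plots attached to this issue.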

I'll add my interpretation of the results in a different message.
msg7120 (view) Author: malte Date: 2018-05-07.14:17:10
I would also be interested in the blind search scatter plots that Florian mentions.
msg7111 (view) Author: florian Date: 2018-05-02.22:23:44
The planner will compile with an LP solver if it detects one. If you have one
installed on your grid account, the binaries compiled on the grid would have one
but the binaries compiled in the container wouldn't. In that case it might be
good to try a build without LPs (I believe there is a build config for that, or
at least a CMake option).

Also, for the record: as we discussed earlier, testing blind search would make
it easier to debug the memory behavior and a relative scatter plot would make it
easier to see the time and memory differences.
msg7110 (view) Author: guillem Date: 2018-05-02.22:14:21
That's a good hint. Another probable cause we were discussing with Florian could be the different OS
- the Singularity image is using Ubuntu xenial, which is the one that features the versions of GCC /
Python closest to the ones on the cluster, but a better alternative to troubleshoot this would be to
use the same version of CentOS as used in the cluster. I'm seeing now that not all nodes in the
cluster have the same version (e.g. the login node is 7.3.1611, the ase01 node is 7.4.1708), but
still, switching to 7.3 should be a better approximation. I'll test that tomorrow.

As for your point, indeed the binary sizes of the main downward binary vary significantly. Just for 
the record:

* compiled on cluster with GCC 5.4.0: 122868 KB.
* compiled on Ubuntu xenial with GCC 5.3.1: 93030 KB.
* compiled on Ubuntu 18.04 with GCC 7.3.0: 96033 KB.
msg7109 (view) Author: jendrik Date: 2018-05-02.18:55:37
Interesting! Two quick comments: Yes, the memory is only measured for the search 
component. The difference in memory might stem from different executable sizes. 
We saw this before when one executable had the LP code linked in and a second 
executable had not.
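One way to cross-check such memory figures independently of any container layer is to read the peak RSS of the child process from the operating system; a minimal sketch, assuming Linux (where ru_maxrss is reported in KiB):

```python
import resource
import subprocess


def run_and_peak_rss_kib(cmd):
    """Run a command and return the peak resident set size (KiB) of
    terminated children, via getrusage(RUSAGE_CHILDREN).

    Caveat: ru_maxrss is the maximum over all children waited for so
    far, so call this in a fresh process for a clean measurement."""
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss
```

Comparing this outside-view figure against the planner's own reported peak memory would show whether the containerized measurement itself is suspect.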
msg7108 (view) Author: guillem Date: 2018-05-02.16:46:06
ok, so I finally pushed this a bit forward. These are the results of the comparison:

I'm using Lab's default satisficing suite, with (separate) time limits of 5min+5min for the translator and search components, the "lama-first" configuration only, and a 64-bit build.

I didn't fully adapt Lab to running a container-based Fast Downward, but rather went for a quick and dirty hack (sorry!) where I simply generate with Lab
all of the "run" scripts of one single experiment with different code revisions, and then, before submitting the actual experiment,
grep the run scripts and replace in some of them the calls to the planner with calls to the appropriate Singularity images, depending on the revision ID of the call.
I use 4 Mercurial revision IDs: one is the actual version of the FD code which will be benchmarked; the other 3 get replaced by the different Singularity images
which I wanted to benchmark as well. Quite ugly, but unless I'm mistaken, this guarantees that the results are reliable.

The different columns are:

* 1dd300e9d9fc-fd_lama: The image suggested by Malte in some previous comment: the C++ binary is compiled in the cluster with GCC 5.4.0 and embedded into an Ubuntu-xenial-based Singularity container, which uses GCC 5.3.1 and Python 2.7.11.
* 552ab3681c03-fd_lama: Same configuration as above, but the C++ binary is built not in the cluster but within the xenial image, i.e. with its GCC 5.3.1.
* 77378d182932-fd_lama: This builds the code in a singularity container, but using the latest Ubuntu 18.04, with GCC 7.3.0 and Python 2.7.15.
* ffd0ec660144-fd_lama: This is the actual planner _without any containerization_, i.e. built and run directly in the cluster.

Note that the different revision IDs are just a consequence of the ugly hack, all singularity images are built (64 bits) on the same revision as the "singularity-less" binary, i.e. revision ffd0ec660144.

From the results, I'd say that:
(a) The differences don't seem too significant: runtimes are e.g. consistently better for the latest-compiler image, but that doesn't seem to have a large impact on coverage.
(b) The memory consumption figures are way more shocking. The two Singularity images where the binary has been built within the Singularity image show a reduction in the aggregated memory consumption of ~40%.
      The per-domain differences can be much higher, e.g. 5 times smaller in gripper, similar in blocksworld.
      I do not have an explanation for such a figure - I doubt it is entirely due to differences in the compiler itself, since the difference is large even when comparing GCC 5.3.1 vs GCC 5.4.0.
      Perhaps there's some problem in taking memory measurements through Singularity (but if that were the case, the first image would also be affected...).
      I understand this is memory consumption by the search component only, right?

(c) Finally, I'm not sure either what to make of the (small) differences in number of node expansions, etc. Is this something that you usually observe? Might it again be due to some random behavior in Python, etc.?
      Otherwise, the planners are all built at the same revision and run with exactly the same command line parameters.

In any case, if we can make sense of these two last observations, it'd seem that performance is not an issue despite the use of the intermediate containerization layer.
msg6956 (view) Author: malte Date: 2018-03-22.18:52:07
Hi Guillem,

very interesting!

However, with the results spread over different experiments, the time-related
attributes are not meaningful, at least not their summaries. Can you add
"score_total_time" (or is it called score_time?) to the analysis? This can be
better compared in terms of absolute numbers and is generally more interpretable
than the raw search_time/total_time attributes.
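A logarithmic time score of this kind can be sketched as follows; the exact formula lab uses for score_total_time is an assumption here, but the usual shape is 1 for very fast runs, 0 for unsolved tasks or runs at the limit, and a logarithmic interpolation in between:

```python
import math


def time_score(total_time, time_limit, lower_bound=1.0):
    """Score a run's total time in [0, 1] (assumed formula, in the
    spirit of lab's score_total_time): 1 at or below lower_bound
    seconds, 0 for unsolved tasks (total_time is None) or times at
    the limit, log-interpolated in between."""
    if total_time is None or total_time >= time_limit:
        return 0.0
    if total_time <= lower_bound:
        return 1.0
    return 1.0 - math.log(total_time / lower_bound) / math.log(time_limit / lower_bound)
```

Unlike raw runtimes, these scores can be summed across tasks and compared in absolute terms even when the runs come from different experiments.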

And perhaps someone can help Guillem out in aggregating the data into one table?

Given that you are using an overall time and memory limit, the Python version
can also make a big difference in some domains where the translator is critical.
Can you report the Python versions, too? Are they all 64-bit Pythons?

I find the 2011 and 2014 benchmarks a bit problematic and would be interested in
seeing results on a wider benchmark set at some point if we can manage that. If
it's with a shorter time limit (like 5 minutes, or even 3 minutes), that would
be fine with me.

The point about using modern compiler versions is well taken, although the
numbers for 7.2 look worse than for 5.4 with reduced coverage in three domains.
(But perhaps this changes once we can look at the runtime and not just coverage.)

From the output of "module", it appears that GCC 5.4 is available on the grid,
so that could be useful to test. But I think it would be better to actually test
with the same executable. With static builds, it is usually possible to just
copy the executable from machine A to machine B if they are not too far apart.
This is the main reason why we introduced static builds (precisely because the
computer servers' compiler always lagged behind a lot). Perhaps in this case
it's also useful to use a search time limit instead of an overall time limit to
reduce the influence of the Python version.

At some point it would be good to have some more directly comparable data, but
this is certainly looking interesting.
msg6955 (view) Author: guillem Date: 2018-03-22.18:41:57
For the record - a better strategy to achieve compiler version parity could be to "module load
GCC/5.4.0-2.26" on the cluster. This seems to run into some module incompatibility with the
"Python/2.7.11-goolf-1.7.20" module that we load routinely from within Lab, but we can force the
loading anyway (with LMOD_DISABLE_SAME_NAME_AUTOSWAP=no, some use-at-your-own-risk option), and as
far as I'm seeing, that does not cause any runtime problems.
msg6954 (view) Author: guillem Date: 2018-03-22.18:33:39
I have run some preliminary comparison between running the planner "natively"
and running it through a Singularity container. For the latter option, I've
used two different base Ubuntu images on top of which FD is built: 16.04 and 17.10.
Domains are from the SAT track of the last two IPCs, "lama-first" configuration only.
I adapted Lab to be able to run a Singularity image, but didn't manage to get
the native and the Singularity runs into the same experiment / results
table; sorry for that, it is definitely not too convenient to analyze the
results. These are:

* Native version, compiled with the default GCC 4.8.4 in the cluster.

* Singularity image w./ Ubuntu 16.04 (xenial), GCC 5.4.0

* Singularity image w./ Ubuntu 17.10 (artful), GCC 7.2.0.

I couldn't find a readily available Docker image with the same exact version
of GCC 4.8.4 as in the cluster, which unfortunately makes these results not
too informative, as there is no way to tell apart the overhead of the
virtualization layer from the potential improvements due to using a more
modern compiler. I will try to find / build some image with the exact same
compiler version soon and update these results.
One thing that these results _do_ tell, however, is that using a
containerized version of the planner seems to be a good option in clusters
where the installed environments are not too modern.
Not directly related to this, but perhaps interesting as well: providing a
container of the exact environment on which the experiments for a certain
paper are run seems to me a good practice in terms of reproducibility.

For reference, the base Singularity recipe is here:
All versions, BTW, run the 64-bit release.
Time and memory measurements are "as perceived" by the planner from within
the container, i.e. they do not include the time and memory overhead of
bootstrapping the container itself.
I'll check with Florian next week on the best way to measure these.
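One possible way to measure that bootstrap overhead from the outside is to time the whole invocation and subtract the planner's self-reported total time; a sketch, where the actual singularity command line is a placeholder:

```python
import subprocess
import time


def wall_clock_seconds(cmd):
    """Run a command and return its total wall-clock time in seconds,
    which includes any container bootstrap cost that the planner
    cannot observe from the inside."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Hypothetical usage: compare
#   wall_clock_seconds(["singularity", "run", "fast-downward.img", "..."])
# against the planner-reported total_time to estimate startup overhead.
```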
msg6950 (view) Author: guillem Date: 2018-03-19.15:50:23
We can run some full-IPC-benchmarks Singularity tests on the grid to compare the 
performance of the "raw" vs the "singularity-containerized" planner; we can then run 
some three-way comparison with easier IPC instances on a laptop - won't be as 
conclusive, but at least a starting point to have a more informed discussion.
msg6948 (view) Author: florian Date: 2018-03-19.15:37:38
It is going to be difficult to run performance test with Docker because sciCORE
will not install this on the grid. The Docker service apparently needs more
access rights to run. This is one of the selling points of Singularity that HPC
admins like.
msg6946 (view) Author: guillem Date: 2018-03-19.12:28:02
As per the e-mail discussion, we'd like to offer the users of the planner a ready-made
Docker / Singularity Fast Downward image.
Before that, we should however run some performance tests to better understand the
overheads of each option.
In case we finally go ahead with this, we should make sure the usage of the 
container is properly documented and encouraged.
Date / User / Action / Args:
2018-05-11 17:19:02  guillem  set  status: in-progress -> resolved; messages: + msg7132; title: Provide Fast Downward Container -> Analyze performance overhead of using Fast Downward through a Singularity container
2018-05-11 12:11:46  malte    set  messages: + msg7129
2018-05-11 11:18:46  guillem  set  messages: + msg7128
2018-05-09 13:39:11  malte    set  messages: + msg7127
2018-05-09 13:29:52  guillem  set  messages: + msg7126
2018-05-09 12:18:38  guillem  set  files: + brfs-time-memory-scatter-plots.tar.gz; messages: + msg7125
2018-05-07 14:17:10  malte    set  messages: + msg7120
2018-05-02 22:23:44  florian  set  messages: + msg7111
2018-05-02 22:14:21  guillem  set  messages: + msg7110
2018-05-02 18:55:37  jendrik  set  messages: + msg7109
2018-05-02 16:46:06  guillem  set  messages: + msg7108
2018-03-22 18:52:07  malte    set  messages: + msg6956
2018-03-22 18:41:57  guillem  set  messages: + msg6955
2018-03-22 18:33:39  guillem  set  messages: + msg6954
2018-03-22 10:30:18  silvan   set  nosy: + silvan
2018-03-19 15:50:23  guillem  set  messages: + msg6950
2018-03-19 15:37:38  florian  set  messages: + msg6948
2018-03-19 15:24:24  florian  set  nosy: + florian
2018-03-19 14:09:42  cedric   set  nosy: + cedric
2018-03-19 13:21:26  jendrik  set  nosy: + malte, jendrik
2018-03-19 12:28:02  guillem  create