Issue770

Title: Analyze performance overhead of using Fast Downward through a Singularity container
Priority: feature
Status: resolved
Superseder: (none)
Nosy List: cedric, florian, guillem, jendrik, malte, silvan
Assigned To: guillem
Keywords: (none)
Optional summary: (none)

Created on 2018-03-19.12:28:02 by guillem, last changed by guillem.

Files
brfs-time-memory-scatter-plots.tar.gz - uploaded by guillem on 2018-05-09.12:18:38 (application/gzip)
Messages
msg7132 (view) Author: guillem Date: 2018-05-11.17:19:02
Yes, you're right, the distinction between Docker and Singularity should not matter much at the level of support
(though that remains to be tested), since Singularity easily allows starting an image off a Docker image, but it could
make a difference at the performance level, which needs to be measured. One open point, though, is whether we want to
provide support for a Docker "user container", as this would be geared towards easing the execution of experiments,
but Docker doesn't seem to be too HPC-friendly anyhow.
An option would be to test things (both user and dev containers) first with Singularity, and after that see whether it
is worth providing either of the two as a Docker container as well, if we think that might increase the outreach.

The questions you mention at the end of course need to be addressed in the form of documentation. I suggest the
first step could be to create a (GitHub? Bitbucket?) repository with documentation, scripts and Singularity recipes;
the documentation could be moved to the website if we eventually decide to support this, but in the meantime it can be
used by us.

The issue of running experiments with Lab should perhaps be addressed separately. I've run the experiments using an
ugly hack, but integrating the container with Lab should not be terribly difficult, given the appropriate Lab
experience :-) Where before we used a call string of the form, e.g.:

./fast-downward.py --overall-time-limit 30m --overall-memory-limit 3584M --alias lama-first domain.pddl problem.pddl

now we would use:

./fast-downward.img --overall-time-limit 30m --overall-memory-limit 3584M --alias lama-first domain.pddl problem.pddl

i.e. simply replace the Python script with the binary image. Special care should be taken with options such as "--build
XXX" (i.e. do we want to provide images with different builds inside? Probably not, but to be discussed) or
"--validate" (i.e. do we want to distribute VAL inside the image?), but "standard" command line options should work the
same.

On the other hand, all of Lab's build steps could be done away with if we use a Singularity container, and replaced
with some "singularity pull downward:latest" command, or similar, which simply fetches the binary image.

To wrap up, I suggest closing this issue, putting Docker temporarily on hold and going for a Singularity "end-user
container" (issue788) and a Singularity-based dev kit (issue787).
msg7129 (view) Author: malte Date: 2018-05-11.12:11:46
Agreed on the way to proceed. I like the proposed name change for this issue.
Some comments before we close the issue:


Firstly, for those who are still not too sure about the design goals of
Singularity etc., I found these links useful:

https://singularity.lbl.gov/about
https://singularity.lbl.gov/faq
https://geekyap.blogspot.de/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html


Secondly, sometimes you say "Singularity" and sometimes you say
"Docker/Singularity". It is important to note that we have only really looked at
Singularity, and it would be good if we could do some investigation of
runtime/memory overhead with Docker, too. I expect it to be low because we don't
access the network, file system or kernel space much, so containerization should
not make a huge difference. But we should have some numbers, so that we can tick
this box before we recommend using Docker. Because of its root requirement, we
can only do a local experiment, but it should not be terribly difficult to find
a machine on which we can run a local 5-minute blind search experiment, even for
the whole benchmark suite, over a few days or so.


Thirdly, I agree that we should clearly separate the use cases of "user
container" (a container to run experiments) and "dev container" (a container for
development). User containers are what Singularity is intended for, and I can
see this being very useful for a number of reasons. I'm more skeptical regarding
dev containers (like Jendrik, but perhaps a bit less so), but I'm willing to be
convinced.

Before we move on to more complex things, I would first like to better
understand how to build a user container, though. It looks like you've already
done all the major work, but I don't know how to reproduce it. For example, I'd
be interested in the following things:

1) How do I create a user container myself?
2) How do I run the planner locally on my machine from a user container?
3) How do I run the planner on the grid, in a lab experiment, from a user container?
msg7128 (view) Author: guillem Date: 2018-05-11.11:18:46
I would perhaps close this current issue (and perhaps change its title retrospectively to "Analyze performance
overhead of using Fast Downward through a Singularity container", for easier future reference?), and then articulate
what still needs to be done on two main issues:

(1) Provide a Singularity/Docker-based "Fast Downward Developer Toolkit".
    The idea here would of course be to (a) think through the general structure/workflow,
    (b) write a few scripts to automate everything that can be automated, and
    (c) test this a bit among ourselves before deciding whether it is useful
    enough to be released/maintained.

(2) Provide an off-the-shelf Singularity/Docker Fast Downward image which can be used
    by researchers to run experiments with Fast Downward without having to worry about
    installation of dependencies, etc. (the issue of LP software licenses should be
    considered here as well). The main issue here would be to decide which revisions/tags
    get built into an image, how we tag the images, etc., and in general which policy we
    follow in order to minimize the overhead of supporting containerized versions
    of the planner. Relevant points here include: do we patch images if a bug is found?
    Do we offer more than one image? When do we update the image? What exactly do we include
    in the image (e.g. GCC, Clang, Validate...)? Etc.


I think that the work on these two issues can be relatively (but not completely) independent.
I could start with (1) and perhaps give a bit of thought to (2) to kick off some discussion in
the next meeting...?
msg7127 (view) Author: malte Date: 2018-05-09.13:39:10
Thanks, Guillem! Very thorough. I would summarize that the results are very
positive: there is no significant performance penalty through containerization,
and being able to use more modern software even gives us a bit of a speed boost.
So from that side there would be no obstacles towards providing a container.

Referring back to our recent discussion in the Fast Downward meeting, I suppose
the other question is how easy/difficult it is to work with the container. I
suppose there are three aspects to this: what we would need to do to provide
containers, how difficult it is to use them for development, and how difficult
it is to use them for running the planner/conducting experiments. Of course not
all these aspects need to be part of this issue.

How do you suggest we proceed?
msg7126 (view) Author: guillem Date: 2018-05-09.13:29:52
Ok, a few notes about the results:

** Image 1 vs Image 2 **
With respect to memory consumption, the second image has a small constant increase (less than 1 MB) which might be due
to the small difference in compiler version, but which doesn't seem to be cause for concern. This holds across the
three different search strategies, both opt and sat.
The differences in runtime seem equally minor. Overall, this supports the idea that, as expected, there is no major
difference between running an image with a binary compiled from within the image and one where the binary has been
compiled on the cluster and moved into the image.


** Image 1 vs native planner (column 4) **
Mem. consumption: the "native" planner (no singularity involved) has a smaller mem. consumption, 
but the difference seems to be roughly constant and even in the 
instances where overall mem. consumption is highest, the difference is not larger than 4-5MB. 
This holds again across the three different search strategies.
The scatter plot for BrFS (file "brfs-memory-xenial-half_vs_native.png") shows that indeed the 
relative differences are higher when the total mem consumption is very low,
but as the total mem consumption increases, become smaller.

In terms of total runtime, the scatter plot for BrFS ("brfs-total_time-xenial-
half_vs_native.png") shows a slight imbalance towards the singularity image being marginally 
faster
in more instances than the other way around, which is somewhat surprising. I don't have a clear 
explanation for this.
Results for LMCUT do not show significant differences; in lama, there are indeed
significant differences in runtime between same instances, but overall they seem to cancel off. 
A reason for this could be the possible non-determinism of the lama config?


** Image 1 vs Image 3 **
Finally, comparing these two images gives a rough idea of the improvement we could get if we switched to an image with
the latest Ubuntu, a newer compiler and Python, etc.
In terms of memory, the BrFS scatter plot ("brfs-memory-xenial-half_vs_bionic-full.png") shows that there are some
consistent savings in using Ubuntu bionic, but they become quite marginal as the total memory consumption increases.
The same happens with the other search strategies, except that for lama, in a few domains (e.g. the philosophers
domain), the trend seems to invert beyond a certain memory limit... which is somewhat strange, given that for those
very same instances the number of expanded nodes remains the same.

In terms of total runtime, the BrFS scatter plot ("brfs-total_time-xenial-half_vs_bionic-full.png") shows that in most
instances the latest Ubuntu is between 5% and 15% faster, which is quite interesting.
Only in a handful of instances is the xenial image faster, and these seem to be instances where the overall time is
quite small.
msg7125 (view) Author: guillem Date: 2018-05-09.12:18:38
For clarity, I have run a separate experiment for each configuration. These are:

== Breadth-First Search ==: http://ai.cs.unibas.ch/_tmp_files/frances/singularity-embedded-nolp-brfs.html
== A* + lm-cut ==: http://ai.cs.unibas.ch/_tmp_files/frances/singularity-embedded-nolp-lmcut.html
== lama-first ==: http://ai.cs.unibas.ch/_tmp_files/frances/singularity-embedded-nolp-lama.html

Let me provide the interpretation of each revision here again:

1dd300e9d9fc: "Half" singularity image: the experiments run a containerized singularity image, 
               but the binary inside the container was compiled directly on the cluster (GCC 5.4.0, Python 2.7.11)
552ab3681c03: "Full" singularity image on ubuntu xenial (16.04). GCC 5.3.1
77378d182932: "Full" singularity image on ubuntu bionic (18.04). GCC 7.3.0
ffd0ec660144: planner run natively on the cluster, no containerization at all. (GCC 5.4.0, Python 2.7.11)

One would expect the first and the second image above to perform roughly the same, but I thought it would be good to validate that as well.
The main objective is to compare the first and the fourth, which should give an idea of the overhead caused by the containerization technology.
Finally, the third image is there so that we can also get an idea of the benefits of using a newer compiler / Python version
(which is interesting in itself if the performance difference is large enough, as all experiments run on the cluster could benefit from that).

I am also attaching a tarball with relative scatter plots for memory and total time figures, only for the breadth-first search experiment,
for all possible pairs of images.

I'll add my interpretation of the results in a different message.
msg7120 (view) Author: malte Date: 2018-05-07.14:17:10
I would also be interested in the blind search scatter plots that Florian mentions.
msg7111 (view) Author: florian Date: 2018-05-02.22:23:44
The planner will compile with an LP solver if it detects one. If you have one
installed on your grid account, the binaries compiled on the grid would have one
but the binaries compiled in the container wouldn't. In that case it might be
good to try a build without LPs (I believe there is a build config for that, or
at least a CMake option).

Also, for the record: as we discussed earlier, testing blind search would make
it easier to debug the memory behavior and a relative scatter plot would make it
easier to see the time and memory differences.
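
For reference, a build without LP support would be configured roughly along these lines; this is an untested sketch,
and the option name is from memory, so it should be double-checked:

# Untested sketch: configure and build a release without LP support.
# USE_LP is my recollection of the CMake option name; verify it before relying on it.
mkdir -p builds/release64-nolp
cd builds/release64-nolp
cmake -DUSE_LP=NO ../../src
make -j4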
msg7110 (view) Author: guillem Date: 2018-05-02.22:14:21
That's a good hint. Another probable cause, which we were discussing with Florian, could be the different OS: the
Singularity image uses Ubuntu xenial, which is the one featuring the GCC/Python versions closest to those on the
cluster, but a better way to troubleshoot this would be to use the same version of CentOS as used in the cluster. I
see now that not all nodes in the cluster have the same version (e.g. the login node is 7.3.1611, the ase01 node is
7.4.1708), but switching to 7.3 should still be a better approximation. I'll test that tomorrow.

As for your point, indeed the size of the main downward binary varies significantly. Just for
the record:

* compiled on cluster with GCC 5.4.0: 122868 KB.
* compiled on Ubuntu xenial with GCC 5.3.1: 93030 KB.
* compiled on Ubuntu 18.04 with GCC 7.3.0: 96033 KB.
msg7109 (view) Author: jendrik Date: 2018-05-02.18:55:37
Interesting! Two quick comments: Yes, the memory is only measured for the search
component. The difference in memory might stem from different executable sizes.
We saw this before when one executable had the LP code linked in and a second
one did not.
msg7108 (view) Author: guillem Date: 2018-05-02.16:46:06
ok, so I finally pushed this a bit forward. These are the results of the comparison:

http://ai.cs.unibas.ch/_tmp_files/frances/singularity-embedded-all.html

I'm using Lab's default satisficing suite, with (separate) time limits of 5 min + 5 min for the translator and search components, "lama-first" configuration only, 64-bit build.

I didn't fully adapt Lab to running a container-based Fast Downward, but rather went for a quick and dirty hack (sorry!): I simply generate with Lab
all of the "run" scripts of one single experiment with different code revisions, and then, before submitting the actual experiment,
grep the run scripts and, in some of them, replace the calls to fast-downward.py with calls to the appropriate Singularity images, depending on the revision ID of the call.
I use 4 Mercurial revision IDs: one is the actual version of the FD code which will be benchmarked; the other 3 get replaced by the different Singularity images
which I wanted to benchmark as well. Quite ugly, but unless I'm mistaken, this guarantees that the results are reliable.
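
Roughly, the replacement step looks like this (illustrative sketch only; the experiment path, image path and revision
ID are placeholders):

# Sketch of the hack: rewrite the planner call in the run scripts belonging to one "proxy" revision.
for run in experiment-dir/runs-*/*/run; do
    sed -i 's|.*1dd300e9d9fc/fast-downward.py|/path/to/fd-xenial-half.img|' "$run"
done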

The different columns are:

* 1dd300e9d9fc-fd_lama: The image suggested by Malte in a previous comment: the C++ binary is compiled on the cluster with GCC 5.4.0 and embedded into an Ubuntu xenial-based Singularity container, which uses GCC 5.3.1 and Python 2.7.11.
* 552ab3681c03-fd_lama: Same configuration as above, but the C++ binary is built not on the cluster but within the xenial image, i.e. with its GCC 5.3.1.
* 77378d182932-fd_lama: This builds the code in a Singularity container, but using the latest Ubuntu 18.04, with GCC 7.3.0 and Python 2.7.15.
* ffd0ec660144-fd_lama: This is the actual planner _without any containerization_, i.e. built and run directly on the cluster.

Note that the different revision IDs are just a consequence of the ugly hack; all Singularity images are built (64-bit) at the same revision as the "Singularity-less" binary, i.e. revision ffd0ec660144.

From the results, I'd say that:
(a) The differences don't seem too significant: runtimes are, e.g., consistently better for the latest-compiler image, but that doesn't seem to have a large impact on coverage.
(b) The memory consumption figures are much more striking. The two Singularity images where the binary has been built within the image show a reduction in aggregated memory consumption of ~40%.
    The per-domain differences can be much higher, e.g. 5 times smaller in gripper, similar in blocksworld.
    I do not have an explanation for such a figure - I doubt it is entirely due to differences in the compiler itself, since the difference is large even when comparing GCC 5.3.1 vs GCC 5.4.0.
    Perhaps there's some problem in taking memory measurements through Singularity (but if that were the case, the first image would also be affected...).
    I understand this is memory consumption by the search component only, right?

(c) Finally, I'm not sure either what to make of the (small) differences in the number of node expansions, etc. Is this something that you usually observe? Might it again be due to some random behavior in Python, etc.?
    Otherwise, the planners are all built at the same revision and run with exactly the same command line parameters.


In any case, if we can make sense of these last two observations, it would seem that performance is not an issue despite the use of the intermediate containerization layer.
msg6956 (view) Author: malte Date: 2018-03-22.18:52:07
Hi Guillem,

very interesting!

However, with the results spread over different experiments, the time-related
attributes are not meaningful, at least not their summaries. Can you add
"score_total_time" (or is it called score_time?) to the analysis? This can be
better compared in terms of absolute numbers and is generally more interpretable
than the raw search_time/total_time attributes.

And perhaps someone can help Guillem out in aggregating the data into one table?

Given that you are using an overall time and memory limit, the Python version
can also make a big difference in some domains where the translator is critical.
Can you report the Python versions, too? Are they all 64-bit Pythons?

I find the 2011 and 2014 benchmarks a bit problematic and would be interested in
seeing results on a wider benchmark set at some point if we can manage that. If
it's with a shorter time limit (like 5 minutes, or even 3 minutes), that would
be fine with me.

The point about using modern compiler versions is well taken, although the
numbers for 7.2 look worse than for 5.4 with reduced coverage in three domains.
(But perhaps this changes once we can look at the runtime and not just coverage.)

From the output of "module", it appears that GCC 5.4 is available on the grid,
so that could be useful to test. But I think it would be better to actually test
with the same executable. With static builds, it is usually possible to just
copy the executable from machine A to machine B if they are not too far apart.
This is the main reason why we introduced static builds (precisely because the
computer servers' compiler always lagged behind a lot). Perhaps in this case
it's also useful to use a search time limit instead of an overall time limit to
reduce the influence of the Python version.

At some point it would be good to have some more directly comparable data, but
this is certainly looking interesting.
msg6955 (view) Author: guillem Date: 2018-03-22.18:41:57
For the record - a better strategy to achieve compiler version parity could be to "module load
GCC/5.4.0-2.26" on the cluster. This seems to run into some module incompatibility with the
"Python/2.7.11-goolf-1.7.20" module that we routinely load from within Lab, but we can force the
loading anyway (with LMOD_DISABLE_SAME_NAME_AUTOSWAP=no, a use-at-your-own-risk option), and as far
as I can see that does not cause any runtime problems.
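
Concretely, forcing the load would be something along these lines (a sketch based on the modules mentioned above; not
carefully tested):

export LMOD_DISABLE_SAME_NAME_AUTOSWAP=no   # the use-at-your-own-risk option mentioned above
module load Python/2.7.11-goolf-1.7.20      # the module we routinely load from within Lab
module load GCC/5.4.0-2.26                  # newer GCC, for compiler version parity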
msg6954 (view) Author: guillem Date: 2018-03-22.18:33:39
I have run a preliminary comparison between running the planner "natively" and running it through
a Singularity container. For the latter, I used two different base Ubuntu images on top of which FD
is built: 16.04 and 17.10.
Domains are from the SAT tracks of the last two IPCs, "lama-first" configuration only.
I adapted Lab to be able to run a Singularity image, but didn't manage to get the native and the
Singularity runs into the same experiment/results table; sorry for that, it definitely makes the
results less convenient to analyze. These are:

* Native version, compiled with the default GCC 4.8.4 in the cluster.
http://ai.cs.unibas.ch/_tmp_files/frances/raw01.html

* Singularity image w/ Ubuntu 16.04 (xenial), GCC 5.4.0:
http://ai.cs.unibas.ch/_tmp_files/frances/singularity-xenial.html

* Singularity image w/ Ubuntu 17.10 (artful), GCC 7.2.0:
http://ai.cs.unibas.ch/_tmp_files/frances/singularity01.html

I couldn't find a readily available Docker image with the exact same version of GCC 4.8.4 as on the
cluster, which unfortunately makes these results not very informative, as there is no way to tell
apart the overhead of the virtualization layer from the potential improvements due to using a more
modern compiler. I will try to find or build an image with the exact same compiler version soon
and update these results.
One thing that these results _do_ tell, however, is that using a containerized version of the
planner seems to be a good option on clusters where the installed environments are not too modern.
Not directly related to this, but perhaps interesting as well: providing a container with the exact
environment on which the experiments for a certain paper were run seems to me a good practice in
terms of reproducibility.

For reference, the base Singularity recipe is here:
https://github.com/gfrances/downward-images/blob/master/singularity/latest.
All versions BTW run the 64-bit release.
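
For those unfamiliar with the format, a Singularity recipe has roughly this shape (an illustrative skeleton only, not
the actual recipe linked above; paths and package lists are placeholders):

Bootstrap: docker
From: ubuntu:16.04

%post
    # Placeholder: install build dependencies and build the planner inside the image.
    apt-get update && apt-get -y install cmake g++ make mercurial python
    # ... fetch the Fast Downward code into /planner and run ./build.py here ...

%runscript
    # Forward all command-line arguments to the planner's driver script.
    exec /planner/fast-downward.py "$@"
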
Time and memory measurements are "as perceived" by the planner from within 
the container, i.e. do not include the time and memory overhead of 
bootstraping the container itself.
I'll check with Florian next on week on the best way to measure these 
properly.
msg6950 (view) Author: guillem Date: 2018-03-19.15:50:23
We can run some full-IPC-benchmark Singularity tests on the grid to compare the performance of the
"raw" vs. the "Singularity-containerized" planner; we can then run a three-way comparison with
easier IPC instances on a laptop - it won't be as conclusive, but it is at least a starting point
for a more informed discussion.
msg6948 (view) Author: florian Date: 2018-03-19.15:37:38
It is going to be difficult to run performance tests with Docker because sciCORE
will not install this on the grid. The Docker service apparently needs more
access rights to run. This is one of the selling points of Singularity that HPC
admins like.
msg6946 (view) Author: guillem Date: 2018-03-19.12:28:02
As per the e-mail discussion, we'd like to offer the users of the planner a ready-made
Docker / Singularity Fast Downward image.
Before that, however, we should run some performance tests to better understand the
overheads of each option.
If we do go ahead with this, we should make sure the usage of the
container is properly documented and encouraged.
History
Date                 User     Action  Args
2018-05-11 17:19:02  guillem  set     status: in-progress -> resolved; messages: + msg7132; title: Provide Fast Downward Container -> Analyze performance overhead of using Fast Downward through a Singularity container
2018-05-11 12:11:46  malte    set     messages: + msg7129
2018-05-11 11:18:46  guillem  set     messages: + msg7128
2018-05-09 13:39:11  malte    set     messages: + msg7127
2018-05-09 13:29:52  guillem  set     messages: + msg7126
2018-05-09 12:18:38  guillem  set     files: + brfs-time-memory-scatter-plots.tar.gz; messages: + msg7125
2018-05-07 14:17:10  malte    set     messages: + msg7120
2018-05-02 22:23:44  florian  set     messages: + msg7111
2018-05-02 22:14:21  guillem  set     messages: + msg7110
2018-05-02 18:55:37  jendrik  set     messages: + msg7109
2018-05-02 16:46:06  guillem  set     messages: + msg7108
2018-03-22 18:52:07  malte    set     messages: + msg6956
2018-03-22 18:41:57  guillem  set     messages: + msg6955
2018-03-22 18:33:39  guillem  set     messages: + msg6954
2018-03-22 10:30:18  silvan   set     nosy: + silvan
2018-03-19 15:50:23  guillem  set     messages: + msg6950
2018-03-19 15:37:38  florian  set     messages: + msg6948
2018-03-19 15:24:24  florian  set     nosy: + florian
2018-03-19 14:09:42  cedric   set     nosy: + cedric
2018-03-19 13:21:26  jendrik  set     nosy: + malte, jendrik
2018-03-19 12:28:02  guillem  create