Issue 889: Code tests on buildbot run out of memory

Title	Code tests on buildbot run out of memory
Priority	bug	Status	resolved
Superseder		Nosy List	florian, jendrik, malte, silvan
Assigned To	jendrik	Keywords	infrastructure

Created on 2019-01-18.17:12:46 by florian, last changed by jendrik.

Messages
msg8625 (view)	Author: jendrik	Date: 2019-03-06.12:20:32
The problem here was that 50M was too low for the search code (100M was enough). Our buildbots recently started complaining that 50M is also too low for the translator, so I raised the limit to 100M there as well, which fixed the problem.
msg8499 (view)	Author: jendrik	Date: 2019-01-23.12:47:21
I improved the output.
msg8498 (view)	Author: jendrik	Date: 2019-01-22.19:28:52
Done. All builds are green again on the buildbot :-) I'll leave this open until the logging output is fixed (see msg8477).
msg8497 (view)	Author: malte	Date: 2019-01-22.11:28:24
Please update the baseline, but please also check in a diff comparison against the old data that the new data looks reasonable.
msg8496 (view)	Author: jendrik	Date: 2019-01-22.10:09:42
Yes, the build changed from 32-bit to 64-bit. Should I just update the baseline? The link also shows a second failed test ("medium_translator"), which I fixed by "building" the translator before running the tests.
msg8483 (view)	Author: malte	Date: 2019-01-21.10:38:27
Did any build options or compiler versions change? For example, did this change from a 32-bit build to a 64-bit build or something along those lines? Any other changes in the environment, e.g. OS version?
msg8482 (view)	Author: florian	Date: 2019-01-21.09:47:30
I updated the title to make it clear this is a memory issue instead of a timeout. The code tests passed with the higher limits but now a nightly test failed. It showed a much higher memory usage than before (http://buildbot.fast-downward.org/#/builders/11/builds/41). I'm not sure what caused this.
msg8481 (view)	Author: malte	Date: 2019-01-19.12:23:09
Trying 100 MiB sounds good.
msg8480 (view)	Author: florian	Date: 2019-01-19.11:38:51
Looks like I crossed channels with both of you :-) Should we just raise the limit to 100 MB? Merge-and-shrink will still run out of memory on this large satellite task even with the higher limit. As far as I know, there is no way to test this on all systems without pushing to the default branch. (We could enable building other revisions on the buildbot, but we want HTTPS access first.) I tried testing locally, but the limits are different here: problems only occur below 22 MB.
msg8479 (view)	Author: jendrik	Date: 2019-01-19.11:17:17
Yes, the list of failures is collected and only printed at the end. I agree that it would be better to print the failure message right after a failing test. Ideally, we would have a pytest-like setup where the output is only printed if the test fails. Here, the second-to-last test fails (not the last one, as suspected below). It sets a memory limit of 50 MB for the search, which leads to the error message:
    /buildbot/worker/linux-build-worker-clang3-lp/build/builds/release/bin/downward: error while loading shared libraries: libstdc++.so.6: failed to map segment from shared object
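A minimal sketch of the reporting style described above: each test's output is captured and only printed if the test fails, and the failure report appears right after the failing test together with the expected and actual exit code. The function names `run_test` and `run_all` and the `(name, command, expected_exit_code)` tuple format are hypothetical and are not taken from the actual test scripts.

```python
import subprocess
import sys


def run_test(name, command, expected_exit_code):
    """Run one test. Return None on success or a failure report string."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured output
        universal_newlines=True,
    )
    if completed.returncode == expected_exit_code:
        return None  # success: the captured output is discarded
    return "\n".join([
        "test: {}".format(name),
        "command: {}".format(command),
        "expected exit code: {}".format(expected_exit_code),
        "actual exit code: {}".format(completed.returncode),
        "captured output:",
        completed.stdout,
    ])


def run_all(tests):
    """Run all tests, reporting each failure immediately after it happens."""
    failures = []
    for name, command, expected_exit_code in tests:
        report = run_test(name, command, expected_exit_code)
        if report is None:
            print("PASS {}".format(name))
        else:
            print("FAIL {}".format(name))
            print(report)
            failures.append(name)
    return failures
```

A caller could still print the collected list of failed test names at the end, as the current setup does, so the summary and the per-test reports complement each other.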
msg8478 (view)	Author: florian	Date: 2019-01-19.11:07:42
Ahh, searching for "search memory limit: 50 MB" only finds one hit and it looks to be return code 127 that we got. Could it be that the memory limit is too low to load the shared library (libstdc++.so.6)?
    Run astar(merge_and_shrink(...)) on large task:
    INFO Running translator.
    ...
    Done! [5.310s CPU, 5.307s wall-clock]
    translate exit code: 0
    INFO Running search (release).
    INFO search stdin: output.sas
    INFO search time limit: None
    INFO search memory limit: 50 MB
    INFO search command line string: ...downward --search 'astar(...)' ... < output.sas
    /buildbot/worker/linux-build-worker-clang3-lp/build/builds/release/bin/downward: error while loading shared libraries: libstdc++.so.6: failed to map segment from shared object
    Remove intermediate file output.sas
    search exit code: 127
    Driver aborting after search
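The exit codes that appear in the logs of this thread can be told apart with a small lookup, sketched below. The meanings are inferred from the log excerpts only: 127 accompanies the shared-library loading error, 23 accompanies "Time limit has been reached", and 22 is the code the memory-limited test expected, so it is assumed here to signal a graceful out-of-memory exit. The names `EXIT_CODE_HINTS` and `explain_exit_code` are hypothetical.

```python
# Hints inferred from the log excerpts in this thread (assumptions, not an
# authoritative list of the planner's exit codes).
EXIT_CODE_HINTS = {
    22: "search ran out of memory and terminated gracefully (assumed)",
    23: "search hit its time limit",
    127: "binary could not start, e.g. a shared library failed to load",
}


def explain_exit_code(expected, actual):
    """Describe a test result in the 'expected X, got Y' style of the log."""
    if actual == expected:
        return "ok: got expected exit code {}".format(actual)
    hint = EXIT_CODE_HINTS.get(actual, "no hint recorded for this exit code")
    return "failed: expected {}, got {} ({})".format(expected, actual, hint)
```

For the failing run discussed here, `explain_exit_code(22, 127)` would point at a startup problem rather than at the search itself.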
msg8477 (view)	Author: malte	Date: 2019-01-19.11:06:58
Looking at the detailed output, I think the failing run is the one that starts on line 4878 in the output. I suppose it would be useful to print the exact test parameters where that test begins, in the same format as at the end of the report. It currently mentions some information about the test, but not all (like which limits are set). Also, for failing tests it would be useful to repeat the same information along with the expected and actual exit code after the output. Looking at the output for that run, perhaps the problem is that the memory limit is so low that the search cannot even start, so the code that triggers on out-of-memory conditions cannot execute to terminate the search gracefully.
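The suspected failure mode can be explored locally with a sketch like the following. It assumes that the memory limit is enforced as an address-space limit (`RLIMIT_AS`) set in the child process before the binary is executed; whether the driver does exactly this is an assumption here. With such a limit, the dynamic loader is already subject to the cap when it maps shared libraries, so a very low limit can stop the program before any of its own out-of-memory handling is installed. The helper names `make_limiter` and `run_with_memory_limit` are hypothetical, and the code only works on POSIX systems.

```python
import resource
import subprocess
import sys

MIB = 1024 * 1024


def make_limiter(limit_bytes):
    """Return a callable that caps the address space of the current process."""
    def apply_limit():
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            # The soft limit may not exceed the existing hard limit.
            soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return apply_limit


def run_with_memory_limit(command, limit_mib):
    """Run command under an address-space limit; return (exit code, stderr)."""
    completed = subprocess.run(
        command,
        # Runs in the child after fork and before exec, so the limit already
        # applies while the loader maps shared libraries.
        preexec_fn=make_limiter(limit_mib * MIB),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stderr
```

Calling `run_with_memory_limit` on a dynamically linked binary with progressively smaller limits should eventually produce a loader error on stderr instead of the program's own out-of-memory exit. The threshold depends on the system, which would be consistent with the differing limits observed locally and on the buildbot.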
msg8476 (view)	Author: florian	Date: 2019-01-19.10:54:40
It looks like the last test runs without a memory limit (I copied part of the log that shows this below). Is the list of failures that is printed in the last line collected throughout the tests and maybe not referring to the test just preceding it? If so, how can we find the failed test?
    Run astar(merge_and_shrink(...)) on large task:
    INFO Running translator.
    INFO translator stdin: None
    INFO translator time limit: None
    INFO translator memory limit: None
    INFO translator command line string: ...translate.py ...domain.pddl \
        ...p25-HC-pfile5.pddl --sas-file output.sas
    ...
    translate exit code: 0
    INFO Running search (release).
    INFO search stdin: output.sas
    INFO search time limit: 1s
    INFO search memory limit: None
    INFO search command line string: ...builds/release/bin/downward --search 'astar(...)' --internal-plan-file sas_plan < output.sas
    reading input... [t=0.000346632s]
    ...
    Peak memory: 922120 KB
    caught signal 24 -- exiting
    Time limit has been reached.
    Remove intermediate file output.sas
    search exit code: 23
    Driver aborting after search
    Failures:
    [... fast-downward.py', '--search-memory-limit', '50M', ...] failed: expected 22, got 127
msg8475 (view)	Author: malte	Date: 2019-01-18.22:50:11
If it is supposed to use a 50 MB limit, that limit seems not to be working. The output you linked mentioned a peak memory of 922120 KB.
msg8474 (view)	Author: florian	Date: 2019-01-18.17:12:46
The buildbot currently fails on some code tests. The tests use a 50 MB memory limit but unexpectedly run out of time after ~20 seconds. Here is an example: http://buildbot.fast-downward.org/#/builders/13/builds/93

History
Date	User	Action	Args
2019-03-06 12:20:32	jendrik	set	messages: + msg8625
2019-01-23 12:47:21	jendrik	set	status: in-progress -> resolved messages: + msg8499
2019-01-22 19:28:52	jendrik	set	status: chatting -> in-progress assignedto: jendrik messages: + msg8498
2019-01-22 11:28:24	malte	set	messages: + msg8497
2019-01-22 10:09:42	jendrik	set	messages: + msg8496
2019-01-21 10:38:27	malte	set	messages: + msg8483
2019-01-21 09:47:30	florian	set	messages: + msg8482 title: Code tests on buildbot time out -> Code tests on buildbot run out of memory
2019-01-19 12:23:09	malte	set	messages: + msg8481
2019-01-19 11:38:51	florian	set	messages: + msg8480
2019-01-19 11:17:17	jendrik	set	messages: + msg8479
2019-01-19 11:07:42	florian	set	messages: + msg8478
2019-01-19 11:06:58	malte	set	messages: + msg8477
2019-01-19 10:54:40	florian	set	messages: + msg8476
2019-01-18 22:50:11	malte	set	status: unread -> chatting messages: + msg8475
2019-01-18 17:12:46	florian	create
