Issue889

Title: Code tests on buildbot run out of memory
Priority: bug
Status: resolved
Nosy List: florian, jendrik, malte, silvan
Assigned To: jendrik
Keywords: infrastructure

Created on 2019-01-18.17:12:46 by florian, last changed by jendrik.

Messages
msg8625 Author: jendrik Date: 2019-03-06.12:20:32
The problem here was that 50M was too low for the search code (100M was enough).
Our buildbots recently started complaining that 50M is also too low for the
translator, so I raised the limit to 100M there as well, which fixed the problem.
msg8499 Author: jendrik Date: 2019-01-23.12:47:21
I improved the output.
msg8498 Author: jendrik Date: 2019-01-22.19:28:52
Done. All builds are green again on the buildbot :-)

I'll leave this open until the logging output is fixed (see msg8477).
msg8497 Author: malte Date: 2019-01-22.11:28:24
Please update the baseline, but please also have a look that the new data looks
reasonable in a diff comparison to the old one.
msg8496 Author: jendrik Date: 2019-01-22.10:09:42
Yes, the build changed from 32-bit to 64-bit. Should I just update the baseline?

The link also shows a second failed test ("medium_translator"), which I fixed
by "building" the translator before running the tests.
msg8483 Author: malte Date: 2019-01-21.10:38:27
Did any build options or compiler versions change? For example, did this change
from a 32-bit build to a 64-bit build or something along those lines? Any other
changes in the environment, e.g. OS version?
msg8482 Author: florian Date: 2019-01-21.09:47:30
I updated the title to make it clear this is a memory issue instead of a timeout.

The code tests passed with the higher limits but now a nightly test failed. It
showed a much higher memory usage than before
(http://buildbot.fast-downward.org/#/builders/11/builds/41). I'm not sure what
caused this.
msg8481 Author: malte Date: 2019-01-19.12:23:09
Trying 100 MiB sounds good.
msg8480 Author: florian Date: 2019-01-19.11:38:51
Looks like I crossed channels with both of you :-)

Should we just raise the limit to 100 MB? Merge and shrink will still run out of
memory on this large satellite task even with the higher limit. As far as I
know, there is no way to test this on all systems without pushing to the default
branch. (We could enable building other revisions on the buildbot but we want
https access first.) I tried testing locally, but the limits behave differently
here and cause problems only below 22 MB.
msg8479 Author: jendrik Date: 2019-01-19.11:17:17
Yes, the list of failures is collected and only printed at the end. I agree that
it would be better to print the failure message right after a failing test.
Ideally, we would have a pytest-like setup where the output is only printed if
the test fails.
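
To illustrate the idea above, here is a minimal sketch of a runner that captures each test's output and prints it, together with the failure message, only when the exit code differs from the expected one. This is not the actual test script: the names run_test and run_tests and the (cmd, expected_exit_code) pair format are assumptions made for the example. Only the wording of the failure message follows the "Failures:" line in the buildbot log.

```python
import subprocess
import sys


def run_test(cmd, expected_exit_code):
    """Run one test command; return None on success or a failure message.

    Hypothetical helper: output is captured and only shown on failure.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)
    if proc.returncode == expected_exit_code:
        return None  # Passing tests stay silent.
    message = "{} failed: expected {}, got {}".format(
        cmd, expected_exit_code, proc.returncode)
    # Print the captured output and the verdict right after the failing run.
    print(proc.stdout, end="")
    print(message)
    return message


def run_tests(tests):
    """Run (cmd, expected_exit_code) pairs and return all failure messages."""
    failures = []
    for cmd, expected_exit_code in tests:
        message = run_test(cmd, expected_exit_code)
        if message is not None:
            failures.append(message)
    return failures


if __name__ == "__main__":
    # Stand-in command that simply exits with code 3.
    demo = [sys.executable, "-c", "import sys; sys.exit(3)"]
    failures = run_tests([(demo, 3), (demo, 22)])
    sys.exit(1 if failures else 0)
```

Because the message is printed directly after the captured output, the failing run can be found without searching the whole log, and the returned list can still be repeated as a summary at the end.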

Here, the second to last test fails (not the last one, as suspected below). It
sets a memory limit of 50 MB for the search which leads to the error message

/buildbot/worker/linux-build-worker-clang3-lp/build/builds/release/bin/downward:
error while loading shared libraries: libstdc++.so.6: failed to map segment from
shared object
msg8478 Author: florian Date: 2019-01-19.11:07:42
Ahh, searching for "search memory limit: 50 MB" only finds one hit and it looks
to be return code 127 that we got. Could it be that the memory limit is too low
to load the shared library (libstdc++.so.6)?


Run astar(merge_and_shrink(...)) on large task:
INFO     Running translator.
...
Done! [5.310s CPU, 5.307s wall-clock]

translate exit code: 0
INFO     Running search (release).
INFO     search stdin: output.sas
INFO     search time limit: None
INFO     search memory limit: 50 MB
INFO     search command line string: ...downward --search 'astar(...)' ... <
output.sas
/buildbot/worker/linux-build-worker-clang3-lp/build/builds/release/bin/downward: 
error while loading shared libraries: libstdc++.so.6: failed to map segment from
shared object
Remove intermediate file output.sas

search exit code: 127
Driver aborting after search
msg8477 Author: malte Date: 2019-01-19.11:06:58
Looking at the detailed output, I think the failing run is the one that starts
on line 4878 in the output. I suppose it would be useful to print the exact test
parameters where that test begins, in the same format as at the end of the
report. It currently mentions some information about the test, but not all (like
which limits are set).

Also, for failing tests it would be useful to repeat the same information along
with the expected and actual exit code after the output.

Looking at the output for that run, perhaps the problem is that the memory limit
is so low that the search cannot even start, so the code that triggers on
out-of-memory conditions cannot execute to terminate the search gracefully.
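
The difference between a graceful out-of-memory exit and a process that cannot even start can be tried out in isolation. The sketch below assumes that the limit is enforced as an address-space limit (RLIMIT_AS) on a POSIX system, which may not match how the driver applies its limits. The helper names, the 1 GiB limit and the 4 GiB allocation are made up for the example, and exit code 22 is simply the value the failing test expected.

```python
import resource
import subprocess
import sys

# Exit code the failing test expected (taken from the log, not from the code).
OUT_OF_MEMORY_EXIT_CODE = 22

# Stand-in for the search: tries a large allocation and exits gracefully
# with the out-of-memory code if the allocation fails.
CHILD = """
import sys
try:
    data = bytearray(4 * 1024**3)
except MemoryError:
    sys.exit({})
""".format(OUT_OF_MEMORY_EXIT_CODE)


def limit_address_space(num_bytes):
    """Return a function that caps the address space of the calling process."""
    def set_limit():
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # The soft limit may not exceed the hard limit.
        if hard == resource.RLIM_INFINITY:
            soft = num_bytes
        else:
            soft = min(num_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return set_limit


def run_with_memory_limit(cmd, num_bytes):
    """Run cmd with an address-space limit and return its exit code."""
    # preexec_fn runs in the child just before the command is executed.
    return subprocess.call(cmd, preexec_fn=limit_address_space(num_bytes))


if __name__ == "__main__":
    cmd = [sys.executable, "-c", CHILD]
    print(run_with_memory_limit(cmd, 1024**3))
```

With a limit that leaves room for the interpreter to start, the child reaches its own error handling and reports the out-of-memory exit code. If the limit is lowered far enough, the process fails before any of its own code runs, so the exit code then comes from the loader or the interpreter startup rather than from the handler, which would match the 127 seen in the log.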
msg8476 Author: florian Date: 2019-01-19.10:54:40
It looks like the last test runs without a memory limit (I copied part of the
log that shows this below). Is the list of failures that is printed in the last
line collected throughout the tests and maybe not referring to the test just
preceding it? If so, how can we find the failed test?



Run astar(merge_and_shrink(...)) on large task:
INFO     Running translator.
INFO     translator stdin: None
INFO     translator time limit: None
INFO     translator memory limit: None
INFO     translator command line string: ...translate.py ...domain.pddl \
    ...p25-HC-pfile5.pddl --sas-file output.sas
...
translate exit code: 0
INFO     Running search (release).
INFO     search stdin: output.sas
INFO     search time limit: 1s
INFO     search memory limit: None
INFO     search command line string: ...builds/release/bin/downward --search
    'astar(...)' --internal-plan-file sas_plan < output.sas
reading input... [t=0.000346632s]
...
Peak memory: 922120 KB
caught signal 24 -- exiting
Time limit has been reached.
Remove intermediate file output.sas
search exit code: 23
Driver aborting after search

Failures:
[... fast-downward.py', '--search-memory-limit', '50M', ...] failed: expected
22, got 127
msg8475 Author: malte Date: 2019-01-18.22:50:11
If it is supposed to use a 50 MB limit, that limit seems not to be working.
The output you linked mentioned a peak memory of 922120 KB.
msg8474 Author: florian Date: 2019-01-18.17:12:46
The buildbot currently fails on some code tests. The tests use a 50 MB memory
limit but run out of time unexpectedly after ~20 seconds.

Here is an example:
http://buildbot.fast-downward.org/#/builders/13/builds/93
History
Date                 User     Action  Args
2019-03-06 12:20:32  jendrik  set     messages: + msg8625
2019-01-23 12:47:21  jendrik  set     status: in-progress -> resolved
                                      messages: + msg8499
2019-01-22 19:28:52  jendrik  set     status: chatting -> in-progress
                                      assignedto: jendrik
                                      messages: + msg8498
2019-01-22 11:28:24  malte    set     messages: + msg8497
2019-01-22 10:09:42  jendrik  set     messages: + msg8496
2019-01-21 10:38:27  malte    set     messages: + msg8483
2019-01-21 09:47:30  florian  set     messages: + msg8482
                                      title: Code tests on buildbot time out -> Code tests on buildbot run out of memory
2019-01-19 12:23:09  malte    set     messages: + msg8481
2019-01-19 11:38:51  florian  set     messages: + msg8480
2019-01-19 11:17:17  jendrik  set     messages: + msg8479
2019-01-19 11:07:42  florian  set     messages: + msg8478
2019-01-19 11:06:58  malte    set     messages: + msg8477
2019-01-19 10:54:40  florian  set     messages: + msg8476
2019-01-18 22:50:11  malte    set     status: unread -> chatting
                                      messages: + msg8475
2019-01-18 17:12:46  florian  create