Issue1015

Title: Try to fix cases where the translator incurs a segfault when running out of memory
Priority: bug
Status: chatting
Superseder: (none)
Nosy List: augusto, florian, jendrik, malte, silvan
Assigned To: (none)
Keywords: (none)
Optional summary: (none)

Created on 2021-02-26.16:07:21 by silvan, last changed by malte.

Files
File name Uploaded Type
out-of-memory-reversed.svg silvan, 2021-02-26.17:06:01 image/svg+xml
out-of-memory.svg silvan, 2021-02-26.17:05:54 image/svg+xml
Messages
msg10306 (view) Author: malte Date: 2021-05-12.11:03:31
> One thing to consider with the second option is that Python usually prints
> messages on stderr when it fails to allocate memory. To avoid having this
> output trigger unexplained errors in Lab reports, we should catch/remove that
> output, if possible. Or think if we shouldn't prefer the first option instead.

We currently catch this in the driver, but indeed the suggestion was to remove the special-case handling in the driver.

(The problem with the first option is that in the future we might terminate the translator for no good reason in cases where it would otherwise have succeeded, so it's not a simple choice to make.)
msg10305 (view) Author: silvan Date: 2021-05-12.08:49:59
One thing to consider with the second option is that Python usually prints messages on stderr when it fails to allocate memory. To avoid having this output trigger unexplained errors in Lab reports, we should catch/remove that output, if possible. Or think if we shouldn't prefer the first option instead.
msg10299 (view) Author: malte Date: 2021-05-07.16:51:14
OK, so it looks like what we'll try to do is implement a wrapper around the translator that runs it under ptrace, detects if the translator runs out of memory and reacts accordingly, and otherwise behaves transparently, i.e., just passes on whatever the translator returns.

One question is how exactly we do want to react. The drastic solution is to terminate the traced process and exit with TRANSLATE_OUT_OF_MEMORY (20) as soon as we see a failed allocation. The less drastic solution is to let the planner continue to run and just set a flag if we see a failed allocation, and then translate a segfault exit code from the translator into a TRANSLATE_OUT_OF_MEMORY exit code if the flag is set. The first one will potentially give us cleaner output (no MemoryError chains, although perhaps they go away if we remove our attempts to catch them), while the second one is less risky if we're not sure that every failed memory allocation actually means that the translator will fail.


This can be done in C using ptrace or using a Python library like python-ptrace. Using C is perhaps the better option because it gives us fewer dependencies and will also induce less overhead.

The driver should use the wrapper only if available. We need to decide what the exact mechanism should be (e.g. use if available; warn if not available; enable/disable by command-line options etc.). It doesn't need to be cross-platform; Linux-only is enough.

We currently already have two special mechanisms for dealing with the translator running out of memory, one inside the translator and one inside the driver. If we implement this new solution, I suggest we remove both of them.
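
To make the wrapper idea concrete, here is a rough sketch of what it could look like in C. This is not a finished implementation: it assumes Linux on x86-64, a single traced process (no forks), no error handling, and that a failed allocation shows up as mmap/mremap returning -ENOMEM (failed brk calls are reported differently and are ignored here). The exit code 20 is TRANSLATE_OUT_OF_MEMORY as discussed above.

/* Rough sketch of the translator wrapper; illustrative only.
 * Compile with e.g.: gcc -O2 -o translate-wrapper translate-wrapper.c */
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

enum { TRANSLATE_OUT_OF_MEMORY = 20 };

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s TRANSLATOR_COMMAND [ARGS...]\n", argv[0]);
        return 2;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], argv + 1);
        perror("execvp");
        _exit(127);
    }

    int status;
    waitpid(child, &status, 0);  /* child stops after execvp */
    /* Mark syscall stops with SIGTRAP|0x80 so we can tell them apart
       from ordinary signal stops. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *) (long) PTRACE_O_TRACESYSGOOD);

    int saw_failed_allocation = 0;
    int in_syscall = 0;
    long deliver_signal = 0;
    for (;;) {
        ptrace(PTRACE_SYSCALL, child, NULL, (void *) deliver_signal);
        deliver_signal = 0;
        waitpid(child, &status, 0);

        if (WIFEXITED(status)) {
            /* Behave transparently: pass the translator's exit code on. */
            return WEXITSTATUS(status);
        }
        if (WIFSIGNALED(status)) {
            if (WTERMSIG(status) == SIGSEGV && saw_failed_allocation)
                return TRANSLATE_OUT_OF_MEMORY;
            /* Any other death by signal is passed on unchanged. */
            raise(WTERMSIG(status));
            return 128 + WTERMSIG(status);
        }
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            in_syscall = !in_syscall;
            if (!in_syscall) {  /* syscall exit: inspect the return value */
                struct user_regs_struct regs;
                ptrace(PTRACE_GETREGS, child, NULL, &regs);
                long nr = (long) regs.orig_rax;
                long ret = (long) regs.rax;
                if ((nr == SYS_mmap || nr == SYS_mremap) && ret == -ENOMEM)
                    saw_failed_allocation = 1;
            }
        } else if (WIFSTOPPED(status)) {
            /* Ordinary signal (e.g. the eventual SIGSEGV): deliver it unchanged. */
            deliver_signal = WSTOPSIG(status);
        }
    }
}

The driver could then call the translator through such a wrapper whenever the binary is available and fall back to the current behavior otherwise.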
msg10298 (view) Author: augusto Date: 2021-05-07.15:06:11
We ran the first experiment on the hard-to-ground tasks. None of the solved tasks (1840 in total) had a failed memory allocation call. This is probably a good indication that we could "look" for this call in particular to identify OOM runs.

The second experiment, on IPC instances, showed the same behavior. However, I did not have any run with a segfault, although some of the tasks produced a segfault when I ran them locally on the login node. Maybe it is also important to point out that in failed tasks, the number of "failed memory allocation" calls had a huge variance (some tasks had only 3 calls, while others had more than 60k).


(I realized I added it as a summary before, my bad.)
msg10251 (view) Author: malte Date: 2021-04-26.14:02:07
In a sprint meeting, we identified strace/ptrace as another diagnostics tool. At first glance it looks quite promising. The plan is to do a grid experiment (after testing things locally) where we see if the strace output contains the information we need. If yes, we could wrap the translator call in a ptrace-based wrapper, either with some manual C code or using a library like python-ptrace (which does not seem to be actively maintained, but perhaps is still workable).
msg10157 (view) Author: malte Date: 2021-02-26.19:30:53
Starting a C++ implementation wouldn't really solve it. Finishing one would. ;-)
msg10156 (view) Author: augusto Date: 2021-02-26.19:18:53
Maybe the best solution is to finally start implementing the translator in C++? :-)
msg10155 (view) Author: malte Date: 2021-02-26.17:32:34
If fil-profiler is able to reliably determine *that* Python runs out of memory, this would already help solve our problem. :-)

>> One thing we could do is add a safety margin to the memory used and add checks
>> against running out of memory to the translator in strategic places.

> Could you explain a bit further, please? The driver already reserves some slack
> memory to be able to exit properly in (some) cases.

I don't think you mean "driver" here. The memory reserve you mention is useful for a graceful shutdown in cases where Python is still able to raise a MemoryError, but it doesn't help in cases where Python crashes on some more "internal" memory allocation.

> What does "checking against
> running out of memory in the translator" mean? From what I saw, the culprit is
> always within the method instantiate(task, model) (this probably needs to be traced
> down more precisely, but still).

Have a "soft" memory limit and a "hard" memory limit. The hard memory limit is the one that we use with "ulimit" (or equivalent), i.e., past this point, malloc will not return more memory.

Periodically in the code, check how much memory we currently use. If it exceeds the soft limit, quit.

This should be fine if we check it often enough and don't allocate too much memory in a single allocation, both of which are under our control.

This is similar to the way that we can gracefully terminate when runtime expires in the search code, where the soft limit triggers SIGXCPU and the hard limit triggers something harsher (perhaps SIGKILL).
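
For concreteness, a sketch of that soft/hard split in C (in the translator itself the periodic check would of course be Python code; the limit values, names and the use of /proc/self/statm are illustrative, and exit code 20 is TRANSLATE_OUT_OF_MEMORY):

/* Sketch of the soft/hard memory limit idea; illustrative only.
 * The hard limit is enforced by the kernel (as with ulimit -v); the soft
 * check is something the program runs itself at strategic places. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

static void set_hard_limit(size_t bytes) {
    struct rlimit rl = { bytes, bytes };
    setrlimit(RLIMIT_AS, &rl);  /* past this point, allocations simply fail */
}

static size_t address_space_in_use(void) {
    /* First field of /proc/self/statm: total program size in pages (Linux). */
    long pages = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        if (fscanf(f, "%ld", &pages) != 1)
            pages = 0;
        fclose(f);
    }
    return (size_t) pages * (size_t) sysconf(_SC_PAGESIZE);
}

/* Call this periodically from code that runs often but never allocates
 * huge chunks in a single step. */
static void check_soft_limit(size_t soft_limit_bytes) {
    if (address_space_in_use() > soft_limit_bytes) {
        fprintf(stderr, "Soft memory limit exceeded, terminating gracefully.\n");
        exit(20);  /* TRANSLATE_OUT_OF_MEMORY */
    }
}

int main(void) {
    set_hard_limit(3800UL * 1024 * 1024);      /* hard limit, e.g. just under 4 GiB */
    size_t soft_limit = 3500UL * 1024 * 1024;  /* soft limit with a safety margin */
    /* ... do the actual work, calling check_soft_limit() at strategic places ... */
    check_soft_limit(soft_limit);
    return 0;
}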


Something that directly interfaces with the memory allocation that Python uses internally would be better, of course. Perhaps replacing or hooking into C's malloc function would be good enough, and perhaps that's what fil does internally. Unfortunately, C has no standard way of doing this, unlike C++'s set_new_handler, which we use in the search code. But it seems straightforward enough on Linux, and I don't think we need a solution for every platform.

See for example here:
http://www.gnu.org/savannah-checkouts/gnu/libc/manual/html_node/Hooks-for-Malloc.html
https://www.stev.org/post/chowtooverridemallocfree
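
For illustration, the override approach from the second link boils down to something like the following (sketch only, Linux/glibc; the file and library names are made up, and a real hook would also need to cover calloc/realloc and be careful because dlsym itself can allocate):

/* Sketch of an LD_PRELOAD malloc override that notices failed allocations.
 * Build:  gcc -shared -fPIC -o malloc_watch.so malloc_watch.c -ldl
 * Run:    LD_PRELOAD=./malloc_watch.so python3 <translator invocation>  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

static void *(*real_malloc)(size_t) = NULL;

void *malloc(size_t size) {
    if (!real_malloc)
        real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
    void *result = real_malloc(size);
    if (!result && size != 0) {
        /* Use write(), not printf(): printf might itself try to allocate. */
        static const char msg[] = "malloc_watch: allocation failed\n";
        (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
        /* Alternatively, _exit() here with a dedicated out-of-memory exit code. */
    }
    return result;
}

Whether CPython routes all the relevant allocations through malloc (rather than, say, mmap for its arenas) would still need to be checked.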
msg10154 (view) Author: silvan Date: 2021-02-26.17:05:54
I briefly tried the fil-profiler, but when the translator runs out of memory, it only creates pretty useless SVG files, which I have attached to this issue.

> One thing we could do is add a safety margin to the memory used and add checks against running out of memory to the translator in strategic places.
Could you explain a bit further, please? The driver already reserves some slack memory to be able to exit properly in (some) cases. What does "checking against running out of memory in the translator" mean? From what I saw, the culprit is always within the method instantiate(task, model) (this probably needs to be traced down more precisely, but still).
msg10153 (view) Author: malte Date: 2021-02-26.16:23:07
Also this: https://pythonspeed.com/articles/python-out-of-memory/
msg10151 (view) Author: malte Date: 2021-02-26.16:20:46
Yes, doing this in a way that avoids falsely interpreting actual segfaults as running out of memory is a bit tricky. (On the other hand, I don't think we ever had Python segfaults other than ones caused by running out of memory. In the absence of extension libraries, this would be a serious bug in Python itself.)

One thing we could do is add a safety margin to the memory used and add checks against running out of memory to the translator in strategic places. Another is to see if there are tools or libraries out there that can help. I searched for this a while ago and found something that looked potentially useful here:

https://pythonspeed.com/articles/memory-profiler-data-scientists/
msg10149 (view) Author: silvan Date: 2021-02-26.16:07:20
In our experiments, we regularly get "unexplained errors" for large tasks of organic-synthesis-opt18-strips because the translator does not exit properly when running out of memory. More rarely, we have had the same issue with our translator tests on GitHub Actions. We want to try catching such errors.

I don't know how to tackle this issue. Trying to catch segfaults when calling the translator in the driver doesn't seem like a clean solution, since the translator could also crash with a segfault unrelated to running out of memory.
History
Date                 User     Action
2021-05-12 11:03:31  malte    set messages: + msg10306
2021-05-12 08:49:59  silvan   set messages: + msg10305
2021-05-07 16:51:14  malte    set messages: + msg10299
2021-05-07 15:06:11  augusto  set messages: + msg10298; cleared summary
2021-05-06 13:40:34  augusto  set summary: "We ran the first experiment on the hard-to-ground tasks. None of the solved tasks (1840 in total) had a failed memory allocation call. This is probably a good indication that we could 'look' for this call in particular to identify OOM runs. We are now running the experiments with all the IPC domains."
2021-04-26 14:02:07  malte    set messages: + msg10251
2021-02-26 19:30:54  malte    set messages: + msg10157
2021-02-26 19:18:53  augusto  set nosy: + augusto; messages: + msg10156
2021-02-26 17:32:34  malte    set messages: + msg10155
2021-02-26 17:06:01  silvan   set files: + out-of-memory-reversed.svg
2021-02-26 17:05:55  silvan   set files: + out-of-memory.svg; messages: + msg10154
2021-02-26 16:23:07  malte    set messages: + msg10153
2021-02-26 16:20:47  malte    set status: unread -> chatting; messages: + msg10151
2021-02-26 16:07:21  silvan   created the issue