Issue1015

Title: Try to fix cases where the translator incurs a segfault when running out of memory
Priority: bug
Status: chatting
Superseder:
Nosy List: augusto, florian, jendrik, malte, silvan
Assigned To:
Keywords:
Optional summary

Created on 2021-02-26.16:07:21 by silvan, last changed by malte.

Files
File name Uploaded Type
out-of-memory-reversed.svg silvan, 2021-02-26.17:06:01 image/svg+xml
out-of-memory.svg silvan, 2021-02-26.17:05:54 image/svg+xml
Messages
msg10157 (view) Author: malte Date: 2021-02-26.19:30:53
Starting a C++ implementation wouldn't really solve it. Finishing one would. ;-)
msg10156 (view) Author: augusto Date: 2021-02-26.19:18:53
Maybe the best solution is to finally start implementing the translator in C++? :-)
msg10155 (view) Author: malte Date: 2021-02-26.17:32:34
If fil-profiler is able to reliably determine *that* Python runs out of memory, this would already help solve our problem. :-)

>> One thing we could do is add a safety margin to the memory used and add checks
>> against running out of memory to the translator in strategic places.

> Could you explain a bit further, please? The driver already reserves some slack
> memory to be able to exit properly in (some) cases.

I don't think you mean "driver" here. The memory reserve you mention is useful for a graceful shutdown in cases where Python is still able to raise a MemoryError, but it doesn't help in cases where Python crashes on some more "internal" memory allocation.
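For illustration, a minimal sketch of the kind of memory reserve discussed here. This is not the actual translator code: the function name, the reserve size and the exit code value are all assumptions made for the example.

```python
import sys
import traceback

# Assumption: this exit code value and all names below are illustrative only.
OUT_OF_MEMORY_EXIT_CODE = 20


def run_with_memory_reserve(main, reserve_bytes=10 * 1024 * 1024):
    """Run main() and return an exit code, handling MemoryError gracefully."""
    # Hold some slack memory that can be released once we are in trouble.
    reserve = [bytearray(reserve_bytes)]
    try:
        main()
    except MemoryError:
        # Free the slack first so that reporting the error can still allocate.
        reserve.clear()
        traceback.print_exc(file=sys.stderr)
        return OUT_OF_MEMORY_EXIT_CODE
    return 0
```

As noted above, this only helps when the interpreter still manages to raise MemoryError. If Python crashes inside an internal allocation, the except clause is never reached.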

> What does "checking against
> running out of memory in the translator" mean? From what I saw, the culprit is
> always within the method instantiate(task, model) (this probably needs to be traced
> down more precisely, but still).

Have a "soft" memory limit and a "hard" memory limit. The hard memory limit is the one that we use with "ulimit" (or equivalent), i.e., past this point, malloc will not return more memory.

Periodically in the code, check how much memory we currently use. If it exceeds the soft limit, quit.

This should be fine if we check it often enough and don't allocate too much memory in a single allocation, both of which are under control.

This is similar to the way that we can gracefully terminate when runtime expires in the search code, where the soft limit triggers SIGXCPU and the hard limit triggers something harsher (perhaps SIGKILL).
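A rough sketch of the soft/hard limit idea, under some assumptions: it is Linux-only (it reads /proc/self/statm), it uses RLIMIT_AS as the "ulimit" equivalent, and the names set_hard_limit, check_memory and SoftMemoryLimitExceeded are made up for this example.

```python
import resource


class SoftMemoryLimitExceeded(Exception):
    """Raised when current memory use passes the soft limit."""


def set_hard_limit(hard_limit_bytes):
    # Similar in spirit to "ulimit -v": cap the address space of this process.
    resource.setrlimit(resource.RLIMIT_AS, (hard_limit_bytes, hard_limit_bytes))


def current_memory_bytes():
    # Linux-specific: the first field of /proc/self/statm is the total
    # program size in pages.
    with open("/proc/self/statm") as statm:
        pages = int(statm.read().split()[0])
    return pages * resource.getpagesize()


def check_memory(soft_limit_bytes, usage_bytes=None):
    """Raise SoftMemoryLimitExceeded if usage exceeds the soft limit."""
    if usage_bytes is None:
        usage_bytes = current_memory_bytes()
    if usage_bytes > soft_limit_bytes:
        raise SoftMemoryLimitExceeded(
            f"using {usage_bytes} bytes, soft limit is {soft_limit_bytes}")
```

The translator would then call check_memory(soft_limit) at strategic places, for example once per iteration of a loop that may allocate a lot, and exit cleanly when the exception is raised. The gap between the soft and the hard limit has to be larger than anything allocated between two checks.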


Something that directly interfaces with the memory allocation that Python uses internally would be better, of course. Perhaps replacing or hooking into C's malloc function would be good enough, and perhaps that's what fil does internally. Unfortunately, C has no standard way of doing this comparable to C++'s set_new_handler, which we use in the search code. But it seems straightforward enough on Linux, and I don't think we need a solution for every platform.

See for example here:
http://www.gnu.org/savannah-checkouts/gnu/libc/manual/html_node/Hooks-for-Malloc.html
https://www.stev.org/post/chowtooverridemallocfree
msg10154 (view) Author: silvan Date: 2021-02-26.17:05:54
I briefly tried fil-profiler, but when the translator runs out of memory, it only creates fairly useless SVG files, which I will attach in the next message.

> One thing we could do is add a safety margin to the memory used and add checks against running out of memory to the translator in strategic places.
Could you explain a bit further, please? The driver already reserves some slack memory to be able to exit properly in (some) cases. What does "checking against running out of memory in the translator" mean? From what I saw, the culprit is always within the method instantiate(task, model) (this probably needs to be traced down more precisely, but still).
msg10153 (view) Author: malte Date: 2021-02-26.16:23:07
Also this: https://pythonspeed.com/articles/python-out-of-memory/
msg10151 (view) Author: malte Date: 2021-02-26.16:20:46
Yes, doing this in a way that avoids falsely interpreting actual segfaults as running out of memory is a bit tricky. (On the other hand, I don't think we ever had Python segfaults other than ones caused by running out of memory. In the absence of extension libraries, this would be a serious bug in Python itself.)

One thing we could do is add a safety margin to the memory used and add checks against running out of memory to the translator in strategic places. Another is to see if there are tools or libraries out there that can help. I searched for this a while ago and found something that looked potentially useful here:

https://pythonspeed.com/articles/memory-profiler-data-scientists/
msg10149 (view) Author: silvan Date: 2021-02-26.16:07:20
In our experiments, we regularly get "unexplained errors" for large tasks of organic-synthesis-opt18-strips because the translator does not exit properly when running out of memory. We rarely had the same issue with our translator tests on GitHub Actions. We want to try catching such errors.

I don't know how to tackle this issue. Catching segfaults when calling the translator in the driver doesn't seem clean, because it would misreport cases where the translator actually crashes with a segfault unrelated to running out of memory.
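To make the concern concrete, here is a sketch of what catching segfaults in the driver could look like. The function names are hypothetical, and the sketch assumes a POSIX system, where subprocess reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess


def classify_translator_exit(returncode):
    """Map a child process return code to a coarse outcome."""
    if returncode == 0:
        return "success"
    if returncode == -signal.SIGSEGV:
        # The driver only sees the signal. It cannot tell whether the
        # segfault was caused by running out of memory or by something else.
        return "segfault (cause unknown)"
    return "error"


def run_translator(cmd):
    # cmd is the translator command line as a list of strings.
    completed = subprocess.run(cmd)
    return classify_translator_exit(completed.returncode)
```

The return code alone carries no information about the cause, so reporting every segfault as "out of memory" would hide genuine crashes.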
History
Date User Action Args
2021-02-26 19:30:54 malte set messages: + msg10157
2021-02-26 19:18:53 augusto set nosy: + augusto; messages: + msg10156
2021-02-26 17:32:34 malte set messages: + msg10155
2021-02-26 17:06:01 silvan set files: + out-of-memory-reversed.svg
2021-02-26 17:05:55 silvan set files: + out-of-memory.svg; messages: + msg10154
2021-02-26 16:23:07 malte set messages: + msg10153
2021-02-26 16:20:47 malte set status: unread -> chatting; messages: + msg10151
2021-02-26 16:07:21 silvan create