Issue984

Title:       segfault in out-of-memory handler
Priority:    bug
Status:      resolved
Superseder:  (none)
Nosy List:   clemens, jendrik, malte, silvan
Assigned To: (none)
Keywords:    (none)
Summary:     (none)

Created on 2020-11-20.14:31:06 by clemens, last changed by clemens.

Files
File name   Uploaded                      Type
driver.log  clemens, 2020-11-20.19:29:32  text/x-log
run.log     clemens, 2020-11-20.19:12:44  text/x-log
Messages
msg11560 (view) Author: clemens Date: 2024-01-31.17:13:14
Alright, then I'll do that. Thanks!
msg11559 (view) Author: malte Date: 2024-01-31.17:07:23
> Regarding cout: Can you point me to where this is currently used and we don't
> want to use it? I can only find the occurrences in `system_windows.cc` and
> thought we discussed that it's okay to ignore the Windows case for now.

You're right, I missed that the place where I saw it was in system_windows.cc.

Looks good to merge.
msg11555 (view) Author: clemens Date: 2024-01-31.14:10:57
I indeed forgot about the end of the main function because I didn't take notes properly. The corresponding change is now checked into the pull request.

Regarding cout: Can you point me to where this is currently used and we don't want to use it? I can only find the occurrences in `system_windows.cc` and thought we discussed that it's okay to ignore the Windows case for now.
msg11554 (view) Author: malte Date: 2024-01-31.13:56:16
I've renamed the issue to reflect where the error is happening for the benefit of future reference (including the change log).

The changes in the pull request look good, but I suggest we also make some further changes we discussed in the morning to make sure that the signal handlers and out-of-memory handlers avoid unsafe operations. These changes would address the same kind of issue (segmentation faults happening in scenarios where we would like to perform an emergency exit). Specifically, we should avoid using cout in these scenarios because it is not reentrant and can use dynamic memory.

We also discussed changing the end of the main function to *not* use the special-purpose reentrant functions. To me, this is related clean-up: it's not strictly necessary for fixing the code problem, but I think it makes sense to include it in this issue.
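For illustration, a minimal sketch of what avoiding cout in these handlers could look like; the function name is made up and this is not Fast Downward's actual code, just the general technique of using async-signal-safe output:

    #include <unistd.h>   // write
    #include <cstring>    // strlen

    // Sketch only: emergency output without std::cout. write() is
    // async-signal-safe and allocates no dynamic memory, so it can be
    // called from signal handlers and from the out-of-memory handler.
    static void print_emergency_message(const char *msg) {
        // Nothing sensible can be done if the write fails, so ignore the result.
        (void)write(STDERR_FILENO, msg, strlen(msg));
    }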
msg11553 (view) Author: clemens Date: 2024-01-31.11:48:44
I have adapted the memory-out handler to use the _Exit() function instead of exit(). The pull request can be found here: https://github.com/aibasel/downward/pull/215

The exit code 15 is listed as an unexplained error, so nothing to worry about here, I guess. I generally think it's a good idea to have these limits.

> I would prefer not mixing changes to our general infrastructure for out-of-memory stuff with h^m code changes, so I'd be happier with the second option, a new issue.

Agreed, I opened issue1135 for this.
msg11549 (view) Author: malte Date: 2024-01-30.16:29:52
Regarding the exit code 15, it looks like you're hitting the hard limit for output that is enforced by lab. Have you checked the "unexplained errors" section of the report? If it doesn't mention this, we have a potential problem.

If it does mention it, then this is our current expected behaviour. The main reason why we enforce these limits is to avoid a single experiment filling up the entire hard disk and thus bringing the grid down for everyone. I think with the new quota system this is less critical than it used to be, so if someone wants to rediscuss this, we can do it. I think it's also easy enough to change the soft and hard limits in lab experiments if the defaults don't work for you.

I think a better solution would be less output in cases like these; I think our default hard limit is 10 MiB per run, which should be generous if we're logging reasonably. In an experiment with 50K runs, this could in principle add up to 500 GiB, although of course in most cases only a small set of runs will write such long logs.

> Finally, for what it's worth: when working through the code of the hm landmark factory I stumbled
> over some small things we could change in the code to make it more readable. Should I address them
> in a pull request for this issue or should I open a new issue for that or should I just forget
> about it?

I would prefer not mixing changes to our general infrastructure for out-of-memory stuff with h^m code changes, so I'd be happier with the second option, a new issue.
msg11548 (view) Author: malte Date: 2024-01-30.16:21:49
A pity we couldn't reproduce it, but these kinds of behaviours are very volatile.


Whether we can reproduce the error or not, I think it is unreasonable to expect that calling std::exit will reliably succeed when called from a "new-handler" (what gets called in these out-of-memory situations).

The usual C++ library documentation (https://en.cppreference.com/w/cpp/memory/new/set_new_handler) suggests that a new-handler should do one of three things:

"1) make more memory available,
2) terminate the program (e.g. by calling std::terminate),
3) throw exception of type std::bad_alloc or derived from std::bad_alloc."

We used to do 3), but moved to writing some output and exiting directly because we could get that to work more reliably. Basically, our code is not prepared to gracefully deal with exceptions everywhere memory could possibly be allocated, while also guaranteeing that no further memory allocations occur during exception handling.

With 1) we can only attempt to shut down more gracefully but not keep going indefinitely. We used to do some of that as well (and still do it in the Python code, but it doesn't work reliably).

This leaves 2), which is more or less what we're doing, except that we're calling exit, not terminate. std::terminate is another flavor of "emergency exit"; it calls std::abort. I think there is a reason why the library documentation suggests std::terminate and not std::exit: there are just too many things that can and will go wrong once you're out of memory. In other words, the proposed change would bring our behaviour in line with what is recommended for new-handlers.

TL;DR: Yes, I think it's wrong to call std::exit in this place, for the same reason that we're not calling it from the signal handlers. I think the suggestion still makes sense. Of the various different options to terminate, my recommendation would be to use the same one we use for signal handlers because I don't think there is a reason to handle these two things differently.
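A rough sketch of what option 2) looks like in code, under the assumption that we print the out-of-memory message with async-signal-safe calls and then terminate immediately (this is an illustration, not the actual Fast Downward implementation; the exit code 22 is the search-out-of-memory code visible in the log further below):

    #include <new>        // std::set_new_handler
    #include <cstdlib>    // std::_Exit
    #include <unistd.h>   // write

    // Option 2) from the list above: the new-handler terminates the program
    // instead of calling std::exit, so no destructors or atexit handlers run
    // while the process is out of memory.
    static void out_of_memory_handler() {
        static const char msg[] = "Failed to allocate memory.\n";
        (void)write(STDOUT_FILENO, msg, sizeof(msg) - 1);
        std::_Exit(22);  // assumed exit code; 22 = search-out-of-memory in the log below
    }

    int main() {
        std::set_new_handler(out_of_memory_handler);
        // ... planner runs here; any failing 'new' now calls the handler ...
    }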
msg11547 (view) Author: clemens Date: 2024-01-30.15:09:42
I have tried to reproduce the segfault, without success. For the problem we considered so far (Airport #50), the planner exits with exit code 22 (search-out-of-memory) no matter whether I run it locally, on the grid login node, or in a grid experiment. I also checked some recent landmark experiments and found that in all these cases the planner is terminated without complaints. So how should we proceed? Should we still consider exiting with _Exit() or abort() in the out_of_memory_handler? Or should we just drop this without taking any further steps?

While looking through the experiments, I have found another strange error code that only seems to occur for lm_hm(m=2). It is exitcode--15 and occurs in parcprinter (e.g., https://ai.dmi.unibas.ch/_experiments/ai/downward/issue383/data/issu383-v5-opt-eval/issu383-v5-opt-issue383-base-issue383-v5-compare.html#error-parcprinter-08-strips). I've briefly looked into it, and it seems to be caused by logging too much; due to parcprinter's high costs there are many f-layers, and there's one log message whenever a new f-layer is reached. The log I've looked at went up to f-layer 871912 and apparently seldom jumped by more than 1 from one output to the next. If we consider this an issue, it's definitely something unrelated to what we're dealing with here.

Finally, for what it's worth: when working through the code of the hm landmark factory I stumbled over some small things we could change in the code to make it more readable. Should I address them in a pull request for this issue or should I open a new issue for that or should I just forget about it?
msg11545 (view) Author: malte Date: 2024-01-29.22:50:29
To answer an older question by Clemens:

> Something I observed is that the memory-consumption as reported by htop
> increases to the point where no memory is left at which point the planner
> terminates. However, the memory usage reported by the logger of Fast Downward
> does not increase during the execution of this function.

This is likely because htop is showing you a different memory statistic from what Fast Downward shows. From what I recall, Fast Downward's output corresponds to peak memory usage measuring how much memory the process allocated, even if it has not been "actively used" yet. This would be "VIRT" in top/htop. What you see going up is perhaps something like "RES" in top/htop, which doesn't count memory that has been reserved but not yet used.
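As a hedged illustration of the difference (not necessarily how Fast Downward measures memory internally): on Linux, /proc/self/status exposes both statistics, and they can diverge by gigabytes when memory has been reserved but not yet touched.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Illustration only: VmPeak is the peak virtual memory size (roughly
    // "VIRT" in top/htop), VmRSS is the resident set size (roughly "RES").
    // A process can have a huge VmPeak while VmRSS stays small, which would
    // explain htop and the planner log telling different stories.
    int main() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmPeak:", 0) == 0 || line.rfind("VmRSS:", 0) == 0)
                std::cout << line << '\n';
        }
        return 0;
    }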
msg11544 (view) Author: malte Date: 2024-01-29.22:39:42
In more detail, the proposed change would be:

1. Rename "exit_after_receiving_signal" to "exit_immediately" (because it now gets a wider scope). Or perhaps "exit_immediately_with" if we want to have the name analogous to "exit_with".

2. In both the Windows and Unix implementation of "out_of_memory_handler", call "exit_immediately" instead of "exit_with".
msg11543 (view) Author: malte Date: 2024-01-29.17:58:22
Next steps: see if we can reproduce the segfault on the grid and if using _Exit solves it.
msg11542 (view) Author: malte Date: 2024-01-29.17:58:00
We found a grid experiment that shows a segmentation fault on this task (Airport #50). It's in the Zenodo experiment logs for our ICAPS 2021 paper "Exploiting Cyclic Dependencies in Landmark Heuristics", run 70 of experiment cycles-210311_hm2-with-LM-log.

However, interestingly, the segmentation fault happens *after* we report that we ran out of memory. We quickly terminate after this output, so I think the explanation must be that a segmentation fault happens in the destructors that are invoked as a consequence of calling "exit".

It's probably impossible or at the very least a fool's errand to try to make sure that the destructors work correctly if we run out of memory at an arbitrary time during code execution. Instead, we should not call "exit" in this scenario, but call "_Exit" or "abort", like we already do when we terminate due to a signal.
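A toy example (mine, not taken from the planner) of why this matters: std::exit runs the destructors of objects with static storage duration and any atexit callbacks, which is exactly the code that can crash after an out-of-memory event, whereas std::_Exit skips all of that.

    #include <cstdio>
    #include <cstdlib>

    // A static object whose destructor does nontrivial work. In the real
    // planner such destructors may touch large data structures or allocate,
    // which can segfault once memory is exhausted.
    struct Logger {
        ~Logger() {
            std::puts("destructor runs");
        }
    };
    static Logger logger;

    int main() {
        std::exit(0);      // prints "destructor runs" before terminating
        // std::_Exit(0);  // would terminate right away, skipping the destructor
    }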
msg11541 (view) Author: malte Date: 2024-01-29.17:28:58
Clemens and I looked into this and confirmed with dmesg that when we try to reproduce this on Clemens's machine, the Fast Downward process is killed by Linux's out-of-memory (OOM) killer.

This happens when Linux overcommits memory: the OS promises a process more memory than it can actually back, and once enough of that memory is actually used, the promise can't be kept and the OS kills some process (usually one using a lot of memory).

For the program being killed, there is really no way to avoid this. Possible mitigations are setting a memory limit or enabling swap so that we're less likely to run out of memory, but of course we often don't want to start swapping either. In a grid experiment this should never happen if we set the memory limits low enough that all concurrent runs on the machine have enough memory available even if they use all of it (also taking into account the memory needs of other processes running on the machine). I think Slurm ensures this too if configured correctly?

But I think this is something we can't really do anything about, and also our program is not really at fault.

However, Clemens recalls seeing segmentation faults (rather than kills) on the grid, so we need to investigate this.
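Regarding the memory-limit mitigation mentioned above, here is a sketch of the underlying Linux mechanism (illustration only; the fast-downward.py driver applies its --overall-memory-limit option in its own way): capping the address space makes allocations fail inside the planner, which triggers the new-handler, instead of letting the kernel's OOM killer step in.

    #include <sys/resource.h>  // setrlimit, RLIMIT_AS
    #include <cstddef>

    // Sketch only: limit the address space of the current process to the
    // given number of bytes. Allocations beyond the limit fail, so the
    // planner sees std::bad_alloc / its new-handler rather than being
    // killed by the OOM killer.
    static bool set_memory_limit_in_bytes(std::size_t bytes) {
        rlimit limit;
        limit.rlim_cur = bytes;  // soft limit
        limit.rlim_max = bytes;  // hard limit
        return setrlimit(RLIMIT_AS, &limit) == 0;
    }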
msg11538 (view) Author: clemens Date: 2024-01-29.15:49:32
So there was a problem in my call regarding the memory limit: I did not specify --overall-memory-limit as a driver argument but only after --search, which is not allowed; however, running the example below crashed before that flaw in the command-line call was fully parsed and detected. So my earlier call didn't actually impose a memory limit, and it is therefore unsurprising that, according to htop, all my computer's memory was exhausted. So at least this curiosity is resolved.

Without a memory limit, the problem occurs in LandmarkFactoryHM::initialize(...) as described below.

Further update on the configuration string: we don't need the reasonable orderings to reproduce this:
./fast-downward.py $BENCHMARKS/airport/p50-airport5MUC-p15.pddl --search "eager_greedy([landmark_sum(lm_hm())])"
msg11516 (view) Author: clemens Date: 2024-01-10.13:59:50
I have tried to figure out where the code breaks, without much success. For the given problem (airport/p50-airport5MUC-p15.pddl), it seems that it fails to allocate memory somewhere in LandmarkFactoryHM::initialize(...). Where exactly seems to depend on factors outside the planner: sometimes it breaks in the call to LandmarkFactoryHM::get_m_sets(...), sometimes it gets through that just fine and then breaks in the for-loop that follows.

Something I observed is that the memory consumption as reported by htop increases to the point where no memory is left, at which point the planner terminates. However, the memory usage reported by the Fast Downward logger does not increase during the execution of this function. Is there a reasonable explanation for this observation? Even if I restrict the overall memory limit of Fast Downward to something small (e.g., 10 MB), executing the planner consumes GBs of memory according to htop.

At this point, I don't have much more insight to share. I don't see where there could be an integer overflow within these lines of code. I don't think I can make progress on this issue without the help of others.

By the way, the updated configuration string is
./fast-downward.py $BENCHMARKS/airport/p50-airport5MUC-p15.pddl --search "eager_greedy([landmark_sum(lm_factory=lm_reasonable_orders_hps(lm_hm()))])"
msg10800 (view) Author: malte Date: 2022-07-19.21:06:47
I think this is probably easy to fix -- my guess is an integer overflow. If someone wants to work on this and needs some help, let me know.
msg10739 (view) Author: malte Date: 2022-07-06.16:42:46
The updated configuration string is

./fast-downward.py $BENCHMARKS/airport/p50-airport5MUC-p15.pddl --search "eager_greedy([lmcount(lm_factory=lm_reasonable_orders_hps(lm_hm()))])"

When I run this with a memory limit, I quickly run out of memory with this (end of) output:

INFO     search command line string: /home/helmert/downward/master/builds/release/bin/downward --search 'eager_greedy([lmcount(lm_factory=lm_reasonable_orders_hps(lm_hm()))])' --internal-plan-file sas_plan < output.sas
[t=1.645e-05s, 9880 KB] reading input...
[t=0.0757357s, 22020 KB] done reading input!
[t=0.0771249s, 22416 KB] Initializing landmark count heuristic...
[t=0.0771588s, 22416 KB] Generating landmark graph...
[t=0.0784749s, 23528 KB] Building a landmark graph with reasonable orders.
[t=0.0797567s, 24448 KB] h^m landmarks m=2
Failed to allocate memory.
Memory limit has been reached.
Peak memory: 3695084 KB
Remove intermediate file output.sas
search exit code: 22

Driver aborting after search
INFO     Planner time: 2.8e+01s
msg10738 (view) Author: malte Date: 2022-07-06.16:39:44
This was previously accidentally posted as a summary rather than a change note. Clemens already answered it in msg9756.

Can you include a logfile? On my computer this does not segfault, it just uses more and more memory (~16 GiB) until I have to kill it. Airport #50 is a giant task, so this would perhaps be a performance issue but not necessarily a bug. I wonder if the segfault is just a symptom of running out of resources or a genuine bug.
msg9759 (view) Author: jendrik Date: 2020-11-20.19:42:55
You can see the meaning of the exit codes here: https://github.com/aibasel/lab/blob/master/downward/outcomes.py

256 - 11 = 245
11 = signal.SIGSEGV
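For context, my reading of where the 245 comes from (an illustration, not from the tracker): exit statuses are truncated to 8 bits, so a wrapper that exits with the negative return code -11 that e.g. Python's subprocess module uses for a SIGSEGV-killed child is observed as 256 - 11 = 245.

    #include <cstdio>
    #include <cstdlib>
    #include <sys/wait.h>
    #include <unistd.h>

    // Demonstration: an exit value of -11 is reported to the parent as 245,
    // because wait statuses only keep the low 8 bits (-11 mod 256 == 245).
    int main() {
        pid_t pid = fork();
        if (pid == 0)
            std::exit(-11);  // child pretends to report signal 11 as a negative exit code
        int status = 0;
        waitpid(pid, &status, 0);
        std::printf("child exit status: %d\n", WEXITSTATUS(status));  // prints 245
        return 0;
    }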
msg9758 (view) Author: clemens Date: 2020-11-20.19:29:32
According to driver.log on the grid, the planner exit code was 245. What does this mean? It's not listed on the page for exit codes on the Fast Downward website.
msg9757 (view) Author: jendrik Date: 2020-11-20.19:19:36
The driver.log file should give you the planner's exit code, which might be helpful.
msg9756 (view) Author: clemens Date: 2020-11-20.19:12:44
Adding the requested log.

As of now, I'm somewhat unsure whether it really is a segfault. It was listed as such in the experiments I ran on the grid, but doesn't say so in the log.
msg9755 (view) Author: silvan Date: 2020-11-20.15:19:21
(moving Clemens's report to the change notes)

A segfault is triggered during computation of the landmark graph (i.e., LandmarkFactory::compute_lm_graph(...)) for exactly one task in the entire benchmark set. The following configuration reproduces this error:

./fast-downward.py $BENCHMARKS/airport/p50-airport5MUC-p15.pddl --search "eager_greedy([lmcount(lm_factory=lm_hm(reasonable_orders=true))])"
History
Date                 User     Action
2024-01-31 17:13:14  clemens  set status: chatting -> resolved; messages: + msg11560
2024-01-31 17:07:23  malte    set messages: + msg11559
2024-01-31 14:10:57  clemens  set messages: + msg11555
2024-01-31 13:56:16  malte    set messages: + msg11554; title: Segfault in h^m Landmark Factory -> segfault in out-of-memory handler
2024-01-31 11:48:44  clemens  set messages: + msg11553
2024-01-30 16:29:52  malte    set messages: + msg11549
2024-01-30 16:21:49  malte    set messages: + msg11548
2024-01-30 15:09:42  clemens  set messages: + msg11547
2024-01-29 22:50:29  malte    set messages: + msg11545
2024-01-29 22:39:42  malte    set messages: + msg11544
2024-01-29 17:58:22  malte    set messages: + msg11543
2024-01-29 17:58:00  malte    set messages: + msg11542
2024-01-29 17:28:58  malte    set messages: + msg11541
2024-01-29 15:49:32  clemens  set messages: + msg11538
2024-01-10 13:59:50  clemens  set messages: + msg11516
2022-07-19 21:06:47  malte    set messages: + msg10800
2022-07-06 16:42:46  malte    set messages: + msg10739
2022-07-06 16:39:44  malte    set messages: + msg10738; summary cleared (its text was reposted as msg10738)
2020-11-20 19:42:55  jendrik  set messages: + msg9759
2020-11-20 19:29:32  clemens  set files: + driver.log; messages: + msg9758
2020-11-20 19:19:36  jendrik  set messages: + msg9757
2020-11-20 19:12:44  clemens  set files: + run.log; messages: + msg9756
2020-11-20 18:48:53  malte    set summary (the logfile question later reposted as msg10738)
2020-11-20 15:20:40  silvan   set nosy: + jendrik
2020-11-20 15:19:22  silvan   set status: unread -> chatting; nosy: + malte, silvan, clemens; messages: + msg9755; summary cleared (Clemens's report moved to msg9755)
2020-11-20 14:31:06  clemens  create