Issue937

Title landmark code chokes on certain unsolvable problems
Priority bug Status resolved
Superseder Nosy List clemens, jendrik, malte, remo, salome, silvan
Assigned To Keywords
Optional summary
Related to issue467 and issue998.

Created on 2019-10-26.21:30:13 by malte, last changed by clemens.

Summary
Related to issue467 and issue998.
Messages
msg10582 (view) Author: clemens Date: 2022-02-10.17:53:52
Alright, thanks everybody! This is merged now.
msg10576 (view) Author: malte Date: 2022-02-10.14:55:22
I don't think the flag needs reviewing. It's a temporary measure, you checked it, and if there's any problem with it, I'm sure we'll figure it out. I'm fine with merging.
msg10573 (view) Author: clemens Date: 2022-02-10.12:52:55
I changed the code again to enforce the dependency of *mk_acyclic_graph* on *calc_achievers* with a flag denoting whether achievers have been calculated yet or not. It's not a long-term solution (mainly because the dependency itself is a bad smell), but it ensures in the code what was documented in comments above the functions prior to this change.

I tested locally that the assertions don't trigger. I don't think this needs another set of experiments, but let me know if you disagree. I would say this is ready to merge. I leave it up to you, Malte, whether you want to take another look at the code before we merge it.
msg10517 (view) Author: malte Date: 2022-02-07.18:38:52
I left a few comments on github, but none that would require code changes. I accidentally submitted one directly and one via a review, but I hope they'll both show up.

Did you try running this with assertions enabled? We should do this more frequently in general, but definitely for cases like these where failing assertions are involved.

Regarding my previous question in msg10515, I was worried for a second by the last four entries, which indicate a not further described failure on these tasks after the change. I think these are all unsolvable tasks. But then I saw that there was the same failure before the change, I just overlooked this first in the list. Presumably this is addressed by one of the other issues we're working on right now?

The (modest) runtime/coverage differences can probably be explained by the sporadic grid errors. It would be good to fix these in the future and perhaps also avoid that we see a wrong error description for them in lab. But these are both unrelated to this issue.
msg10516 (view) Author: salome Date: 2022-02-07.18:38:21
There are 3 types of error:
 - reaching the soft limit for run.log
 - "WARNING: overflow on h^add! Costs clamped to 100000000": happens with lama configurations on two organic-synthesis problems, for both base and v1)
 - exitcode248 for two mystery and two miconic-fulladl tasks for lm_zg, also both for base and v1. This is related to issue998 (when the translator detects unsolvability, lm_zg returns an empty lm-graph and PerStateBitset cannot handle 0-length Bitsets).

So in my opinion these errors should not hinder us from merging.
msg10515 (view) Author: malte Date: 2022-02-07.18:22:57
Did you look into the unexplained errors in the satisficing experiment?
msg10505 (view) Author: silvan Date: 2022-02-04.14:43:13
These GLIBCXX and CXXABI indeed happen sporadic. They are due to a problem with the module system; apparently, in rare cases the "module load GCC..." that I assume you have in your .bashrc or in the setup keyword of Lab's environment doesn't work. I solved this by manually setting up the environment variables via the setup keyword using "export PATH=...". To be more precise: I checked what the effect of "module load ..." is on my PATH and export that effect in all of my Lab experiments. I don't load anything in .bashrc. I do this since at least a year and have not for a single time seen this error again.
msg10504 (view) Author: salome Date: 2022-02-04.14:35:26
I looked into the error a bit more. It's exit signal 1, which Fast Downward doesn't use in its ExitCodes (the driver also complains with an AssertionError). I checked the output from one of the failed runs; run.err complains about missing GLIBCXX and CXXABI. I don't know if this is related to previous GLIBCXX issues; was it also the case there that it only happened seemingly randomly on a very small number of tasks?

In summary it seems that this error has nothing to do with the issue in itself. I would also say it didn't affect the experiments in such a scale that we need to rerun the experiments (in total 15 tasks failed due to this error), but I don't have a strong opinion here.
msg10495 (view) Author: salome Date: 2022-02-03.18:21:50
The results look good to me. The only thing I noticed and found a bit strange is that much of the coverage gain is due to "error-search-plan-found-and-out-of-memory" almost not happening anymore (it reduces two times from 3 to 1 and once from 2 to 1). I'm not familiar with this error at all, can somebody explain when it triggers and if it makes sense that the proposed change gets rid of most occurences of it? (If not I'll take a look at it myself tomorrow.)
msg10492 (view) Author: clemens Date: 2022-02-03.17:04:27
As promised, here are the reports:
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue937-v1-optimal-compare.html
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue937-v1-satisficing-compare.html

In the optimal case, all scores improve consistently. We even improve coverage by 4 tasks in seq-opt-bjolp-opt, something I didn't expect from this change. Only the memory is affected slightly negatively, but this doesn't seem to be systematic. There are some differences in the number of landmarks, but (as usual) this is only due to the trucks-strips domain where the translator is non-deterministic.

In the satisficing case, the change seems to have almost no influence at all. 

Does anybody else find something we should address before we can merge this issue?
msg10491 (view) Author: clemens Date: 2022-02-03.14:39:20
We have implemented a possible solution for this issue. The function *remove_first_weakest_cycle_edge* assumed that a cycle has at least one reasonable ordering. As noted in msg9025, this assuption is wrong because it is not necessarily true for unsolvable tasks, where it is possible to find cycles of only natural (or stronger) orderings. Our solution
(a) changes the function such that it doesn't rely on this wrong assumption, and
(b) signal that the task is unsolvable in a way that does not require any changes on the side of landmark heuristics.

To resolve (a), we adapt *remove_first_weakest_cycle_edge* such that it does what its name suggests even if the weakest ordering is natural (or stronger). Ultimately, this also guarantees that the graph really is acyclic when calling *mk_acyclic_graph*. There was a discussion (between Salomé, Remo, and me) whether it would be better to leave the cycle be if it only consists of natural orderings, as we could not reproduce the argument for unsolvability (or prove it) if we break up the cycle. However, we decided to remove the edge because it corresponds to the behavior communicated by the function name. Furthermore, we accept a loss of information even if the weakest cycle is reasonable in order to make the graph acyclic. In the end, this is just another indicator that not every landmark-based approach requires acyclic graphs and we should eventually move this functionality to a separate (proxy?) landmark factory as discussed in issue996.

To resolve (b), we implement the following approach: if the removed (i.e., weakest) ordering in a cycle is natural (or stronger), we clear the *first_achievers* of all landmarks present in that cycle. The heuristic can later pick up that no operator can (first) achieve one or more landmarks. This requires that (first) achievers are computed (i.e., *calc_achievers is called) before the call to *mk_acyclic_graph*. We had to change the order, but this shouldn't be a problem because the two functions are independent.

A pull request can be found here:
https://github.com/aibasel/downward/pull/87/files

Experiments for the relevant landmark configurations (as discussed in the meta issue987) are finished, but I have problems uploading them to our tmp_files... I'll give an update once I was able to publish them.
msg9025 (view) Author: malte Date: 2019-10-26.21:30:13
When a landmark factory produces a cycle of natural (or stronger) orderings, LandmarkFactory::remove_first_weakest_cycle_edge will choke because it cannot break the cycle by removing a reasonable or obedient-reasonable ordering.

The correct behaviour in such a case is to notice that the task is unsolvable.

An example task that showcases this behaviour was sent to the Fast Downward mailing list on 25.10.2019.

In debug mode, the example task triggers an assertion because there is a landmark without achievers, which is what issue467 is about, and therefore in debug mode it never actually gets to the stage where the cycle issue is encountered. However, in release mode it does, and I get a crash when trying (and failing) to break the cycle.

The two problems are related in the sense that they can both be triggered by unsolvable tasks, but they are unrelated in the sense that they require independent fixes.

In this example task, fixing issue467 by avoiding further landmark processing when landmarks without achievers are encountered would address the problem, but for other tasks all landmarks might have achievers, but we might still detect an unresolvable cycle.
History
Date User Action Args
2022-03-07 16:55:52clemenssetstatus: chatting -> resolved
2022-02-10 17:53:52clemenssetmessages: + msg10582
2022-02-10 14:55:22maltesetmessages: + msg10576
2022-02-10 12:52:55clemenssetmessages: + msg10573
2022-02-07 18:38:52maltesetmessages: + msg10517
2022-02-07 18:38:21salomesetmessages: + msg10516
2022-02-07 18:22:57maltesetmessages: + msg10515
2022-02-04 14:43:13silvansetmessages: + msg10505
2022-02-04 14:35:26salomesetmessages: + msg10504
2022-02-03 18:21:50salomesetmessages: + msg10495
2022-02-03 17:04:27clemenssetmessages: + msg10492
2022-02-03 14:39:20clemenssetmessages: + msg10491
2022-02-03 09:19:08salomesetnosy: + salome
2022-02-03 09:17:42salomesetnosy: + clemens, remo
2021-01-28 12:19:42salomesetsummary: Related to issue467. -> Related to issue467 and issue998.
2019-10-28 10:39:49silvansetnosy: + silvan
2019-10-26 21:30:13maltecreate