Issue1036

Title: Make landmark status manager deterministic and flawless
Priority: bug    Status: resolved
Superseder:    Nosy List: clemens, jendrik, malte, salome, silvan, thomas
Assigned To: clemens    Keywords:
Optional summary:

Created on 2021-09-14.10:01:44 by clemens, last changed by clemens.

Messages
msg11328 Author: clemens Date: 2023-09-04.16:04:21
After merging, we evaluated again whether the alias "seq-opt-bjolp" would benefit from using reasonable orderings. Indeed it does, according to these results: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/bjolp-v1-eval/bjolp-v1-bjolp-base-bjolp-v1-compare.html

We therefore decided to adapt the alias in the main code base, and I will commit the change referencing this issue.
msg11253 Author: clemens Date: 2023-08-02.14:27:07
Adding the latest changes from the sprint was no issue. I've merged this. Thanks everyone!
msg11239 Author: clemens Date: 2023-07-30.21:49:58
I did, and I also ran another set of experiments just to make sure these changes don't affect performance. Here are the results:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v12-satisficing-single-eval/issue1036-v12-satisficing-single-issue1036-base-issue1036-v12-compare.html

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v12-satisficing-anytime-eval/issue1036-v12-satisficing-anytime-issue1036-base-issue1036-v12-compare.html

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v12-optimal-eval/issue1036-v12-optimal-compare.html

Everything still looks pretty much the same, so I think we're still good to merge this. My original plan was to do so before a potential release tomorrow, but I don't want to rush anything now. In particular, I haven't caught up with the progress of the sprint that ended on Friday and whether it affects this issue. (I think it shouldn't, but let's see.) Also, I still need to write a commit message. Furthermore, it might make sense to immediately follow up merging this issue with issue1089, which needs even more work. In any case, I'll wait with merging until I start working again on August 2.
msg11142 Author: malte Date: 2023-07-21.12:42:27
We looked at this together, and my open questions were resolved. (Clemens will make another commit or set of commits as the result of our discussion.)
msg11141 Author: malte Date: 2023-07-21.11:27:45
I left a few comments and would like to discuss three of the recent commits. If you want to merge this today as indicated on Discord, I suggest we talk about these commits on Zoom.
msg11121 Author: clemens Date: 2023-07-17.15:58:25
Thank you for the feedback, Malte; I've incorporated most of your comments. I would like to respond to some of them.

Obedient-reasonable orderings:
I have created issue1089 and have the patch ready to be rebased on the final version of this issue as soon as it is merged.

"progress_basic":
What we implement in this function is what we call "basic progression" in the paper. I don't have a better idea right now for what to call it, but I'm open to discussing the name. What happens in this progression function is straightforward: If the transition *achieves* a landmark (i.e., it does not hold in the starting state and holds in the resulting state), then it should be removed from the "future", and in any case, if it holds in the resulting state, it should be marked "past". If this explanation triggers any good ideas for names, please let me know.
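
For illustration, here is a minimal self-contained C++ sketch of this basic progression; all names and types (LandmarkStatus, progress_basic_sketch, ...) are made up for the sketch and do not match the actual Fast Downward interface:

#include <cstddef>
#include <vector>

// Illustrative model: one flag per landmark in each set.
struct LandmarkStatus {
    std::vector<bool> past; // landmark held at some point on the path
    std::vector<bool> fut;  // landmark still needs to be achieved
};

LandmarkStatus progress_basic_sketch(
        const LandmarkStatus &parent,
        const std::vector<bool> &held_in_parent, // landmark true in parent state?
        const std::vector<bool> &held_in_state) { // landmark true in new state?
    LandmarkStatus result = parent;
    for (std::size_t id = 0; id < result.past.size(); ++id) {
        // The transition *achieves* landmark `id` if it did not hold in the
        // starting state but holds in the resulting state.
        bool achieved = !held_in_parent[id] && held_in_state[id];
        if (achieved)
            result.fut[id] = false; // remove from "future"
        if (held_in_state[id])
            result.past[id] = true; // mark "past"
    }
    return result;
}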

Semantics:
I think we've got it covered. Salomé and I both did our best to match it with the theory, so I don't think it's necessary for somebody else to take a look as long as nothing breaks (and nothing seems to be breaking).

issue202:
I've read the entire thread; most of it is no longer relevant after we merge this issue. The only thing we could reconsider is how / which reasonable orderings are generated (msg1150). With the new theory and implementation, we are on the safe side in terms of dealing with reasonable orderings; as long as they are sound, we're good, no matter whether both landmarks can be achieved simultaneously or not. The question then shifts towards which ones we actually want to generate. More orderings are generally more informative, but they make the landmark progression (and hence heuristic evaluation) more expensive. Adding an ordering A--r-->B where A and B are *always* first achieved simultaneously is never beneficial. In conclusion, I suggest closing issue202 and instead opening a new one dedicated to looking into this topic specifically.

So what's left to do:
(1) Clarify the way we want to deal with const bitsets/bitsetview. (Discussion started on Github.)
(2) Discuss possible alternatives for the name "progress_basic".
(3) Close issue202 and open a new one to look into generation of reasonable orderings.

I think that's it.
msg11107 Author: malte Date: 2023-07-07.19:57:24
I finished looking at the code and left some comments. I'm not a fan of some of the names, such as the frequent use of "fut", which I don't think needs abbreviating, and "progress_basic" is a method name that feels like it could be improved. But that's not sufficient reason not to merge.

I didn't really focus on whether we're computing the right thing because that would require more time and a re-reading of parts of the paper. Instead I focused more on integration aspects and the like; I assume the two of you are already on top of the semantics. If you want an additional set of eyes on this, perhaps ask Thomas? We can also have a more thorough review of the code at a later stage when it's closer to what we want it to be eventually. Like I said a few times, I also think there are many efficiency improvement opportunities, but for now I think we can set these aside and focus on having something correct merged first.
msg11106 Author: malte Date: 2023-07-07.15:03:53
I agree with the interpretation of the experimental results: they are not a reason to avoid merging the issue. It would be good to think about efficiency improvements at some point (including memory efficiency), but correctness comes first, and the landmark code in general needs more changes anyway, and it's better to optimize once the most important changes have landed.

So if I didn't miss anything, what remains is a code review. I'll have a look. (First comment: it looks like all the experiments should be removed from the pull request? This is already mentioned on github.)
msg11105 Author: malte Date: 2023-07-07.14:33:33
I suggest removing anything related to obedient-reasonable orderings from the code once we switch to the new kind of progression, including removal from landmark_graph.h (also renumbering the remaining edge types). We should not forget to update the user-visible documentation, i.e., explain how our code differs from the HPS paper that we cite.

From a code review perspective, I'd prefer doing it in a separate issue, but if whoever does the work prefers integrating the change into this issue, this also works for me. 

Once we merge this, can we look into issue202? Perhaps something we can then close easily.
msg11092 Author: clemens Date: 2023-06-22.14:19:14
An open question that we forgot before is what we want to do about obedient-reasonable orderings. In the current version, they are basically obsolete: they are generated by default, but they are never considered in the progression. This might actually be another source of changed behavior in the search.

Any opinions on that topic? As I don't particularly like obedient-reasonable orderings, I would remove them without hesitation if it were up to me, but I see some historic value in having them stick around. The way they exist in the code right now, though, they're just clutter that helps nobody, neither the planner nor the programmer.
msg11091 Author: clemens Date: 2023-06-22.12:02:06
After several approaches to speed up the heuristic evaluation, we think we might have found a version that is acceptable and aligns properly with the theory. We again did a single-search experiment with several configurations and an anytime experiment for the classic LAMA.

Single-search: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v11-satisficing-single-eval/issue1036-v11-satisficing-single-issue1036-base-issue1036-v11-compare.html

Unexplained errors are mostly due to h^m not supporting axioms. All others I've seen frequently in past landmark issues.

The lama-first configurations (with and without preferred operators) improve their state evaluation, expansion, and generation scores by around 3 without any gain in coverage. Time and memory scores decline slightly. For memory, this is expected, as we store two per-state bitsets of size #landmarks where previously we only stored one. Regarding time, it is also expected to some extent, as progressing the landmarks is now done for every uncovered transition (i.e., when calling *notify_state_transition*) while previously some of the reasoning only happened when evaluating the heuristic.
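
As a rough mental model of this bookkeeping, here is a self-contained C++ sketch; the map-based storage and all names (StatusManagerModel, StateID, ...) are my own simplifications, not the actual Fast Downward classes:

#include <unordered_map>
#include <vector>

using StateID = int;              // stand-in for the real state IDs
using Bitset = std::vector<bool>; // one bit per landmark

struct StatusManagerModel {
    // Before: a single "reached landmarks" bitset per state.
    // Now: two bitsets per state, which explains the higher memory use.
    std::unordered_map<StateID, Bitset> past;
    std::unordered_map<StateID, Bitset> fut;

    // Called once per uncovered transition (cf. notify_state_transition),
    // whereas previously part of this reasoning happened only at heuristic
    // evaluation time; this explains the extra time per generated state.
    void progress(StateID parent, StateID child) {
        past[child] = past.at(parent); // copy, then apply progression rules
        fut[child] = fut.at(parent);   // (omitted here; see the paper)
    }
};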

We solve 7 fewer tasks using the lm_zg configuration. Here, memory is clearly the limiting factor, as all 7 tasks run out of memory. Nevertheless, time scores increase by 5.35 (search time) and 3.47 (total time). One possible explanation is that ZG (in my experience) does not produce the most meaningful orderings. In particular, it might only produce orderings where storing both past and future landmarks provides no benefit at all compared to storing only one of these sets and computing the other on demand. (With meaningful orderings, this is not possible, but with the orderings produced by ZG it might be.) Hence, the additional memory provides little to no benefit and cannot compensate for the performance lost to the increased memory requirements.

With BJOLP we also observe a decline in all scores except time. I don't have much more to say about it, though.

For three out of four configurations, the cost of the plans found increases overall. While this is a pity, I'm not sure there's much we can do about it. Since all configurations use greedy search, it might be that we miss short paths that branch off early, because on the longer paths landmarks are reached sooner and heuristic values are therefore lower. Hence, I wouldn't put too much emphasis on these cost values for the single-search results.


Anytime: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v11-satisficing-anytime-eval/issue1036-v11-satisficing-anytime-issue1036-base-issue1036-v11-compare.html

The overall plan costs also increase for the anytime configurations. However, this is not consistent across all domains: shorter plans are found in 2 domains (grid and scanalyzer) while longer ones are found in 11 (with the worst increase observed in elevators). A possible explanation might be the same as with single-search: LAMA starts with a greedy search, and if it doesn't get to finish the first weighted A* run, it will report the solution found greedily. Looking at the planner time, we can see a slight improvement, indicating that whenever LAMA gets through all search iterations, it is generally faster than before. However, it is a bit disappointing that the more accurate (and in particular safe) heuristic does not improve the overall search behaviour...


Besides making the inadmissible landmark heuristics more accurate, our updated landmark progression comes with the advantage that there are no restrictions on the landmark graph. For example, it would allow cycles in the landmark graph. Consequently, we would have more freedom in landmark generation and could use information that we previously dropped. We expect that allowing cycles will improve performance, but we suggest not testing this within this issue and instead dealing with it in issue994 and issue996.

An additional benefit of the updated progression is that reasonable orderings are available for optimal planning. We have tested the impact of using them in the following experiment: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-base-optimal-eval/issue1036-base-optimal-compare.html

Unexplained errors are mostly due to trying to use unsupported features (axioms and conditional effects) and some planners writing too much output.

In this report, I compare three versions for each configuration:
 - base (old progression which did not allow reasonable orderings)
 - v11 (updated progression without reasonable orderings)
 - v11-reasonable (updated progression with reasonable orderings)
This is to gain a clearer picture of the changes. The report contains pairwise comparisons for all of these variants.
 - base vs. v11 to see how much difference the progression makes when not changing anything else,
 - v11 vs. v11-reasonable to see how adding reasonable orderings to the mix helps without changing anything else, and
 - base vs. v11-reasonable to compare the old recommended setting to what we think would be a sensible new recommended setting.
The three comparisons are ordered in this manner and follow directly one after the other. The leftmost comparison of LM-cut with itself is included to verify that the found plans are indeed optimal.

On the one hand, we observe that the updated progression without reasonable orderings is generally slower (base vs. v11). On the other hand, using reasonable orderings with the new progression has several positive effects (v11 vs. v11-reasonable): mainly, using this additional information improves the heuristic and reduces the number of expanded states before the last f-layer in many problems. (There are domains where the opposite is the case; in our paper we show an example of why this can happen with path-dependent heuristics.) Comparing base with v11-reasonable directly, the change is sometimes justified and sometimes things get slightly worse. The same holds for coverage numbers: in some cases they improve, in others they decline.


Overall, Salomé and I feel strongly that we should merge these changes soon, accepting that in some cases this means taking a step back in terms of performance. Merging this will clean up an important part of the landmark code, and the new version is in line with the recently developed theoretical basis. Furthermore, we see potential to increase performance in follow-up issues such as:
 - making the generation of reasonable orderings more reasonable (issue383)
 - allowing cycles (issue994 and issue996)
 - cleaning up the definition of preferred operators (issue1075)
and potentially more.

We are now looking for a volunteer to do a code review. Maybe Malte would like to take a look himself?
msg11087 Author: clemens Date: 2023-06-02.14:09:20
On another note: we were notified on our Discord server that lama-first doesn't find a plan on a problem where astar with the blind heuristic does. Here is the original message by Javi:

"""
Hi all,

I was running some experiments with FD and I came across an instance where blind search returns a solution where LAMA does not. FYI:
Experiment 1 - blind search command: fast-downward.py domain.pddl instance.pddl --search "astar(blind())"
Experiment 2 - LAMA command: fast-downward.py --alias lama-first domain.pddl instance.pddl
domain.pddl: https://github.com/AI-Planning/pddl-generators/blob/main/satellite/domain.pddl
instance.pddl:
(define (problem satellite-06)
 (:domain satellite)
 (:objects
    sat1 - satellite
    ins1 ins2 - instrument
    mod1 - mode
    dir1 dir2 dir3 - direction
    )
 (:init
    (pointing sat1 dir3)
    (power_avail sat1)
    (calibration_target ins1 dir1)
    (calibration_target ins2 dir1)
    (on_board ins1 sat1)
    (on_board ins2 sat2)
    (supports ins2 mod1)
    (supports ins1 mod1))
 (:goal  (and (pointing sat1 dir1)
   (have_image dir2 mod1)
   (have_image dir1 mod1)
   (have_image dir3 mod1))))


The plan found with blind search is:
(switch_on ins1 sat1)
(turn_to sat1 dir1 dir3)
(calibrate sat1 ins1 dir1)
(take_image sat1 dir1 ins1 mod1)
(turn_to sat1 dir2 dir1)
(take_image sat1 dir2 ins1 mod1)
(turn_to sat1 dir3 dir2)
(take_image sat1 dir3 ins1 mod1)
(turn_to sat1 dir1 dir3)
; cost = 9 (unit cost)


Best,
Javi
"""

While investigating, I discovered that the source of the issue lies in the flawed LAMA progression. The cause is not a buggy implementation of the theory but the theory itself. Consequently, LAMA is incomplete, which might be a new discovery; at least it is new to me, as I wasn't aware of this. It is no longer an issue with the progression we propose in our paper mentioned in msg11086.

To give some more details about the issue: For this problem there exists a pair of landmarks "calibrated" (let's call it C for short) and "not calibrated" (let's call it nC). nC holds in the initial state and until the calibrate action is applied, from which point onward C holds and nC never holds again. More specifically, nC does not have any achievers. Due to a reasonable ordering somewhere else in the graph, the LAMA progression does not accept another parent of C, and when the calibrate action is applied, C is not accepted. Since LAMA then thinks C must still be achieved, and since there is a greedy-necessary ordering nC-->C, LAMA infers that nC must also be achieved again. Since nC has no achievers, though, this is impossible, and the landmark nC contributes infinity to the heuristic value of the state. Hence, the main flaw of the LAMA progression, namely that landmarks are sometimes not accepted (or, in our new terminology, marked past) when they should be, is the reason for wrongly declaring nC as required again, which leads the heuristic to declare this state a dead end. After exploring the rest of the search space only to arrive at the same conclusion, LAMA aborts without finding a solution.
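
To make the chain of inferences concrete, here is a toy C++ trace of the flawed progression (purely illustrative; the variable names and hard-coded flags are mine, not the actual code):

#include <iostream>

int main() {
    // After applying "calibrate": C now holds, but a reasonable ordering
    // elsewhere in the graph keeps C from being a "leaf", so the LAMA
    // progression does not accept it.
    bool c_accepted = false;
    bool nc_has_achievers = false; // nC can never be made true again

    // LAMA: an unaccepted landmark still has to be achieved.
    bool c_required = !c_accepted;
    // Greedy-necessary ordering nC --> C: nC must hold directly before C is
    // achieved, so a still-required C makes nC required again.
    bool nc_required_again = c_required;

    if (nc_required_again && !nc_has_achievers)
        std::cout << "nC contributes infinity -> state declared a dead end\n";
    return 0;
}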

I'm mentioning this here to keep note of the discord conversation somewhere. I think the problem that Javi observed fits into this issue as it should be resolved as soon as this issue is merged.

Side remark: I've made sure FF has nothing to do with the issue here, as the problem is also unsolvable when removing it from the LAMA call string.
msg11086 Author: clemens Date: 2023-06-02.12:08:14
It's been a while, and our understanding of this topic has developed quite a bit. Our paper with the title "Landmark Progression in Heuristic Search" will be published at ICAPS 2023. Salomé and I discussed that it probably makes sense to tackle this issue based on the theory in that paper and scrap what we have so far. I have started to do so, and the code is available under the old pull request https://github.com/aibasel/downward/pull/70.

Since the issue only affects configurations with reasonable orderings, and since reasonable orderings are forbidden in optimal configurations, we have only experimented with satisficing configurations so far. However, with our fixed progression, reasonable orderings should also be allowed in optimal planning, so this is something to test sometime soon.

We split the experiments into a single-search batch and an anytime batch. Let's start with a look at the single-search configurations: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v6-satisficing-single-eval/issue1036-v6-satisficing-single-issue1036-base-issue1036-v6-compare.html
On the positive side, the lama-first and lama-first-pref configurations seem to benefit in terms of expansions/evaluations. These scores go up by ~3, in the case of lama-first-pref despite the coverage going down by 2. This aligns nicely with the theory, which says that the heuristic becomes more accurate. However, this comes at the cost of evaluation time, which we can see clearly for all configurations, but worst of all for bjolp. Furthermore, with lm_zg we lose 7 in coverage, which is a bad sign.

For the full anytime LAMA configuration, results are available here: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1036/data/issue1036-v6-satisficing-anytime-eval/issue1036-v6-satisficing-anytime-issue1036-base-issue1036-v6-compare.html
The main metric to consider here is probably the plan quality, i.e., cost. Plans become more expensive in general. This is somewhat unexpected, because with better heuristic accuracy we would expect better plans. In some domains cheaper plans are found, but not often enough to justify this change, I guess. Slower heuristic evaluation can explain this: with less time, fewer iterations of the search happen, and the weighted A* weights remain higher for the last iteration on average. Thus, higher bounds on plan costs. We did not analyze whether or not this is really the explanation here.

In both experiments we can also observe an increase in memory usage. This is expected from the changes: while we previously stored one bitset of "reached landmarks" per state, we now store two bitsets of "past" and "future" landmarks per state (see the paper for details). This increases the overall memory footprint and also shows up as more memory-outs than before. Accessing these two bitsets instead of just one also contributes to slowing down the code, but the main slowdown seems to lie in computing whether a landmark holds in a state or not; this happens significantly more often than prior to this issue.

Note that in the paper, we report an improvement from our changes. However, this comes from additional changes that we plan to do in separate issues, for example allowing cycles in the landmark graph, leading to more orderings and in turn even more accurate heuristics. These will hopefully improve performance later on, but we would like to keep the changes as separate as possible. This is an argument for accepting some "damage" in this issue with the expectation that some of it will be recovered later. But we have not yet given up on trying to speed up the heuristic evaluation before we declare this ready to merge. In the meantime, Salomé and I keep working on it and would of course appreciate your input if you have any.
msg10451 Author: clemens Date: 2021-10-25.11:10:08
The new experiments ran through; they contain both v1 (now from the correct branch) and v2, which contains the smaller changes discussed with Malte. At first glance, the results don't look too different from what we've looked at before.
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue1036-v1-v2-optimal-compare.html
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue1036-v1-v2-satisficing-compare.html
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue1036-v1-v2-lama-anytime-compare.html

I didn't have time to look into the other things yet, so there's still a lot to do. I'll keep you updated.
msg10448 Author: malte Date: 2021-10-12.11:20:52
Some other things we discussed that I think should be addressed before merging:

- The code doesn't currently explain what it implements. It should point to the workshop paper in the code comments, and perhaps also in the user documentation.

- The implementation differs from what is described in the workshop paper because it defines the L^- landmarks differently, based on a recomputation in every step rather than storing L^- information with the search nodes as described in the paper. It is not clear whether this is semantically equivalent. If yes, we need to explain to ourselves why. If not, we should point this out. Either way, the discrepancy with the literature should be mentioned in the code.

- We need a change log entry. Like all change log entries, it should make clear what changes on the user side. Two important changes are that obedient-reasonable orders are no longer used (but still computed) and that reasonable orders can now be used in admissible heuristics (but we haven't evaluated this). Both of these are a bit hard to communicate. Why spend time computing obedient-reasonable orders if they are subsequently ignored? Why make reasonable orders usable for admissible heuristics but not try them? So I think both of these should be addressed before our next release (in December), though not necessarily in this issue.
msg10447 Author: clemens Date: 2021-10-12.11:13:05
Malte and I had a look at the results and the code together. In summary, Malte didn't find anything too concerning in the experiments. Since this is fixing a bug, we agreed it should be fine to merge even if the experiments show slightly worse performance in general. Some things we would still like to understand first, and which I will have a look at:
 - Why is seq-opt-bjolp consistently worse in the airport domain?
 - Why is lm_zg consistently worse in the openstacks-sat14-strips domain?

However, while going through the code diff, we discussed several smaller changes that I will implement and re-evaluate before I look deeper into these questions. One important thing we found out is that this is actually very closely related to issue202. This was not on our radar when Thomas and I wrote the HSDIP paper.
msg10438 Author: clemens Date: 2021-09-14.11:39:37
I have created a pull request; would somebody be willing to review it?
https://github.com/aibasel/downward/pull/70

I have also conducted experiments to check whether this impacts the performance of the planner significantly:
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue383-v1-optimal-issue383-hsdip-base-issue383-hsdip-v1-compare.html
https://ai.dmi.unibas.ch/_tmp_files/buechner/issue383-v1-lama-anytime-issue383-hsdip-base-issue383-hsdip-v1-compare.html

Unfortunately, the experiments were run on a different branch, albeit with the exact same code version. Should I still rerun them to have the data with the correct revision numbers and algorithm names?

I see some slightly concerning results in the experiments:
- coverage goes down by 2 for "lama" (lama-anytime) and "lama-first" (satisficing), and by 1 for "lama-first-pref" (satisficing)
- time scores go up by 1.5 for "seq-opt-bjolp" (optimal) and by 3 for "lm_zg" (satisficing)
- total costs increase for all LAMA configurations, which is unexpected since we observed a different effect in our HSDIP experiments (the code diff there was more complex, though, because we had fixed some other issues as well)

Other than that, I think the results look fine. I would be interested in other people's opinions: do you think these results are concerning, given we're fixing a bug here? Or could this already be good enough to merge?
msg10436 Author: clemens Date: 2021-09-14.10:01:43
As discussed in issue383, *LandmarkStatusManager::update_accepted_landmarks(..)* is non-deterministic: the function *LandmarkStatusManager::landmark_is_leaf(..)* depends on the acceptance information of other landmarks, so the order in which this information is updated may influence the result.

In our HSDIP 2021 paper (https://ai.dmi.unibas.ch/papers/buechner-keller-icaps2021whsdip.pdf), Thomas and I have worked out another issue with the landmark status manager (which we call landmark graph progression in the paper): in the presence of reasonable orderings, some landmarks might not be accepted even though they were made true for the last time in all (optimal) plans. Hence, the resulting heuristic becomes inadmissible and cannot be used for cost-optimal planning. (This is not really a problem in the code, because reasonable orderings are not permitted in combination with optimal planning.)

In this issue, I would like to change the status manager so that it is (i) deterministic and (ii) flawless. We can achieve this by implementing the landmark graph progression called ARO (admissible reasonable orderings) in our paper.
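
To illustrate the non-determinism of (i), here is a self-contained C++ toy version of such an order-dependent update (names and logic are simplified assumptions, not the actual implementation):

#include <vector>

struct Node {
    std::vector<int> parents; // ordering predecessors in the landmark graph
    bool accepted = false;
};

// Accept a landmark that holds in the state if all its parents are already
// accepted (the "leaf" condition). Because this check reads the acceptance
// flags of other landmarks, the iteration order matters: visiting a child
// before its parent can leave the child unaccepted in this sweep.
void update_accepted(std::vector<Node> &graph,
                     const std::vector<bool> &holds,
                     const std::vector<int> &order) {
    for (int id : order) {
        if (!holds[id])
            continue;
        bool parents_accepted = true;
        for (int parent : graph[id].parents)
            parents_accepted = parents_accepted && graph[parent].accepted;
        if (parents_accepted)
            graph[id].accepted = true;
    }
}

With two landmarks where 0 is a parent of 1 and both hold in the state, the order {0, 1} accepts both, while {1, 0} accepts only landmark 0 in a single sweep, which is exactly the order dependence described above.
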
History
Date                 User     Action  Args
2023-09-04 16:04:21  clemens  set     messages: + msg11328
2023-08-02 14:27:07  clemens  set     status: in-progress -> resolved; messages: + msg11253
2023-07-30 21:49:58  clemens  set     messages: + msg11239
2023-07-21 12:42:27  malte    set     messages: + msg11142
2023-07-21 11:27:45  malte    set     messages: + msg11141
2023-07-17 15:58:25  clemens  set     messages: + msg11121
2023-07-07 19:57:24  malte    set     messages: + msg11107
2023-07-07 15:03:53  malte    set     messages: + msg11106
2023-07-07 14:33:33  malte    set     messages: + msg11105
2023-06-22 14:19:14  clemens  set     messages: + msg11092
2023-06-22 12:02:06  clemens  set     messages: + msg11091
2023-06-02 14:09:20  clemens  set     messages: + msg11087
2023-06-02 12:08:14  clemens  set     messages: + msg11086
2021-10-25 11:10:09  clemens  set     messages: + msg10451
2021-10-12 11:20:52  malte    set     messages: + msg10448
2021-10-12 11:13:06  clemens  set     status: chatting -> in-progress; assignedto: clemens; messages: + msg10447
2021-09-14 11:39:37  clemens  set     status: unread -> chatting; messages: + msg10438
2021-09-14 10:01:44  clemens  create