Issue1140

Title: Sub-optimal plans for A*-inspired eager configuration without re-opening and with consistent heuristic
Priority: bug
Status: resolved
Superseder: (none)
Nosy List: clemens, florian, gabi, jendrik, malte, remo, salome
Assigned To: (none)
Keywords: (none)
Optional summary: Pull request: https://github.com/aibasel/downward/pull/225

Created on 2024-06-27.17:43:41 by remo, last changed 2024-07-22.19:12:10 by remo.

Summary
Pull request: https://github.com/aibasel/downward/pull/225
Files
File name    Uploaded                   Type
example.png  remo, 2024-06-27.17:43:41  image/png
example.sas  remo, 2024-06-27.17:43:58  application/octet-stream
Messages
msg11668 (view) Author: remo Date: 2024-07-22.19:12:10
We decided to move forward with merging this issue despite the significant
impact on run time for configurations where we would not expect this effect
(see the results linked in msg11641, msg11647, and msg11650). We suspect that
these systematic yet unpredictable performance differences may be caused by the
grid, that is, the environment where we run our experiments, and we may
investigate this further as an independent issue.
msg11650 (view) Author: remo Date: 2024-07-19.11:26:19
Salomé noticed an inconsistency with the most recent EHC report.
(https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ehc-v2-eval/ehc-release.html)
In msg11647 I wrote that it shows the numbers for a revision that includes debug
output, but it is in fact based on an undocumented experiment run on the same
revision as the first round of experiments, just on the satisficing benchmark
set. Here is the report for the actual debug print run which should have been in
its place:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ehc-v2-eval/ehc-release-fixed.html
The (correct) second round results for EHC are similar to those of Eager with
FF, with +3 total time score for the new revision.

For the sake of completeness, I started a third round of experiments for these
two configurations based on the revision that would actually be pushed to main,
i.e. without the debug output introduced for the second round of experiments.
The revisions for this third round are therefore the same as for the first round
of experiments (minus some edits in comments), but run on the satisficing
benchmarks instead of the optimal ones. Here are the reports:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-18-issue1140-ehc-v3-eval/ehc-release.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-18-issue1140-ff-v3-eval/ff-release.html

Summarizing the total time score changes from the old to the new revision for
all three rounds, we get:
        EHC    Eager with FF
Round 1 -3.5   -4.57
Round 2 +3.03  +4.33
Round 3 -2.8   -5.09

At a glance it looks like the trends are systematic in all cases, analogous to
Malte's analysis, but I didn't run the numbers.
msg11649 (view) Author: malte Date: 2024-07-18.15:00:04
Thanks Remo, looks good to merge to me. :-)

The EHC results cannot be explained by noise, though. There is a clear systematic trend in the data that must have an explanation other than randomness. It's a bit disconcerting that we cannot understand what the cause of this is, but I think we should move on regardless. (One more wild shot: are we sure the randomization of runs in lab was working correctly in these tests?).

If we don't have an explanation, it's likely we'll see similar things again in the future, so here is one statistical test you can use to see if a performance difference can be adequately explained by randomness:

1. Count in how many domains configuration A has a strictly higher score_total_time than configuration B. Call this M.

2. Count in how many domains configuration B has a strictly higher score_total_time than configuration A. Call this N. (Domains where both score equally are ignored.)

3. For a binomially distributed random variable X with sample size n = M + N and probability p = 0.5, the significance level is min(2 * P[X <= min(M, N)], 1.0).

A low significance level indicates that such a result would be unlikely to occur randomly. Traditionally, people often use 0.05 as a threshold (a result at that level would occur randomly once in twenty experiments); any such threshold is arbitrary.

If I counted the winners per domain correctly, for the first EHC result, the significance level is 4.290179322907761e-13. This means that if the difference is truly random, we'd see such an extreme result randomly once every 2330904898684 experiments.

For the second EHC result, the significance level is 6.209495737562187e-08. In a truly random experiment, we'd see such an extreme result once every 16104367 experiments.

Here's the Python code for those interested. (The test is slightly more general than what I described; you can also test binomial distributions with p != 0.5. Hence the formula looks different from the one above. I hope I didn't mess it up or mess up the domain counts.)

=========================================================================================

#! /usr/bin/env python3
# -*- coding: utf-8 -*-

from math import factorial


def binomial(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def prob_binomial(n, p, k):
    """Returns P[X = k] for a random variable X that is binomially
    distributed with n trials and probability p."""
    return binomial(n, k) * p ** k * (1 - p) ** (n - k)

def prob_binomial_less_eq(n, p, k):
    """Returns P[X <= k] for a random variable X that is binomially
    distributed with n trials and probability p."""
    return sum(prob_binomial(n, p, i) for i in range(k + 1))


def prob_binomial_greater_eq(n, p, k):
    """Returns P[X >= k] for a random variable X that is binomially
    distributed with n trials and probability p."""
    return sum(prob_binomial(n, p, i) for i in range(k, n + 1))


def two_tailed_significance(n, p, k):
    return min(1.0, 2 * min(
        prob_binomial_less_eq(n, p, k),
        prob_binomial_greater_eq(n, p, k)))


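# First EHC result: 57 domains with a decisive winner, split 54 vs. 3;
# the third print is the reciprocal of the significance level.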
print(two_tailed_significance(57, 0.5, 54))
print(two_tailed_significance(57, 0.5, 3))
print(1 / two_tailed_significance(57, 0.5, 3))

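# Second EHC result: 68 domains with a decisive winner, split 56 vs. 12.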
print(two_tailed_significance(68, 0.5, 56))
print(two_tailed_significance(68, 0.5, 12))
print(1 / two_tailed_significance(68, 0.5, 12))
msg11648 (view) Author: remo Date: 2024-07-18.12:12:14
I missed the link to the new Eager with FF configuration results, so here it is:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ff-v2-eval/ff-release.html
msg11647 (view) Author: remo Date: 2024-07-18.12:10:22
The second round of experiments ran through and results seem to have improved.
The only change in the code from the first to the second round of experiments is
the addition of a debug print stating the total number of open list insertions.

The A*-inspired eager configuration with reopen_closed=false shows results very
similar to those of the first round, providing optimal solutions and dropping
some time score in the process. The slowdown is smaller in the second round,
however, going from -0.45 total time score to -0.27. As expected, the number of
open list insertions is higher for the new revision than for the old one.
(See the right-hand side of the report tables.)
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-astar-v2-eval/astar-release.html

For the actual A* configuration, counting the open list insertions shows that
they are identical for the old and new revision, confirming Malte's assumption
that this should be the case. Where we had a slight decrease in time score in
the first round, we now have a slight increase, so the difference can indeed be
attributed to noise. Lastly, the constant 4 KB of additional memory per task we
observed in round 1 is likely due to a larger executable: in the second
experiment, memory usage remains exactly the same, and the absolute values seem
to be identical to those of the new revision in the first experiment. It seems
that the very minor code change between experiments caused the old revision to
also gain the +4 KB in its executable.
(See the left-hand side of the report tables.)
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-astar-v2-eval/astar-release.html

I accidentally ran EHC and Eager with FF on the optimal benchmark set in the
first round; the second round uses the satisficing set.

The EHC configuration still shows the negative trend for total time score:
-3.5 in the first round, -2.85 in the second. The trend looks similarly
systematic, albeit a bit less pronounced. (The -1 in coverage is due to a task
that runs close to the timeout.)
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ehc-v2-eval/ehc-release.html

Lastly, the FF configuration, where we also saw a decrease in total time score
in the first round, has reversed the trend: the new revision now performs
better, with the total time score going from -4.57 to +4.33. Seeing that
rerunning essentially the same experiment can lead to such different comparative
scores, it does not seem far-fetched that this is also the case for EHC and that
another run might swing in favor of the new revision there as well.

In conclusion, to me it looks like the new round of experiments directly
addresses most of the questions left open by the first round and suggests that
the persistent drop in EHC's performance may be due to noise. And we even fixed
the actual issue along the way!
msg11645 (view) Author: remo Date: 2024-07-17.12:15:46
All EHC runs come from a single Lab experiment, so they ran at the same time. I 
split the experiments into the major configurations, so there are four scripts 
that ran separately:

- A* + the A*-equivalent eager call with reopen_closed=false
- LAMA
- Eager with FF
- EHC with FF

This means the runs executed at the same time within these groups, but not
across groups.

I'll rerun experiments with a wiped revision cache and added statistics about 
open list insertions.
msg11644 (view) Author: malte Date: 2024-07-17.11:08:06
I looked a bit more at the experimental results. The ones for A* (with reopening) might also be noise: the variations in time are very small, and the change in memory usage seems to be a constant 4 KB, which could simply be due to a slightly larger executable.

What I find striking are the ehc-release results. Here search seems to be significantly slower. But the only thing we changed in the EHC code itself is code formatting, and otherwise the only relevant changes are the refactoring in search_space.{h,cc}. It is theoretically possible that we suffer from extra function calls here that aren't inlined, but I'd be surprised if the compiler is not smart enough to sort this out, and even if it isn't, I'd be surprised by this having such a large impact.

I think it's more likely that something weird is happening with the experiment (such as an old compile from the revision cache, like Gabi suggested, or some other grid effect). Were all results generated at the same time with a single lab call?
msg11642 (view) Author: malte Date: 2024-07-17.09:28:45
Like I said on Discord or in our live discussions, I suggest we don't call these A* configurations (e.g. in the issue title), as I'm worried that this can confuse people. (It certainly confused me when I first saw the discussion until I understood what was really meant.) The A* configurations are the "astar" plugin, not the configurations of "eager" with "reopen_closed=false" that we test here.

> The configurations A* (reopen_closed=true), eager with FF, and enforced hill
> climbing with FF show a similar picture. The search behavior seems to be the
> same for the old and new version of the code, with the number of expansions
> and plan costs being identical. However, the new version has higher memory
> usage, which can maybe be explained by the fact that it potentially inserts
> more elements to the open list, and is generally slower.

The A* configuration (i.e. the one with reopen_closed) should not be affected by this change. Does it indeed insert more things into open? That suggests we messed something up and should look at the code again closely.
msg11641 (view) Author: remo Date: 2024-07-17.01:19:33
During the ongoing Sprint we decided to implement and test the fix outlined
below in msg11602: when a cheaper path to a state is encountered, we now also
insert the state into the open list with its new g-value if its associated
SearchNode is marked as open, not only if reopen_closed is set to true. The
reopen_closed flag then behaves closer to what its name suggests: it only
controls whether closed states should be reinserted into the open list or not.

We ran experiments for a handful of configurations using the affected search
algorithms:
- A* and A* with reopen_closed=false using the blind heuristic
- LAMA
- Eager search using the FF heuristic
- Enforced hill climbing using the FF heuristic
(I accidentally ran the last two configurations on the optimal benchmark set,
but the more appropriate satisficing set is currently running.)

While making the change above, we simultaneously cleaned up functions in
search_space.cc and made assertions tighter. In order to see whether these
assertions trigger unexpectedly, we ran all tested configurations in debug mode.
No errors related to these assertions occurred, as the following reports show:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-12-issue1140-astar-v1-eval/astar-debug.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-10-issue1140-lama-v1-eval/lama-debug.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-12-issue1140-ff-v1-eval/ff-debug.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ehc-v1-eval/ehc-debug.html

Looking at the A* configurations with reopen_closed=false, we can see that the
original version of the code does indeed return sub-optimal plans for 6 tasks
across 6 different domains (some being alternative formulations).
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-12-issue1140-astar-v1-eval/astar-release.html

The configurations A* (reopen_closed=true), eager with FF, and enforced hill
climbing with FF show a similar picture. The search behavior seems to be the
same for the old and new version of the code, with the number of expansions and
plan costs being identical. However, the new version has higher memory usage,
which can maybe be explained by the fact that it potentially inserts more
elements to the open list, and is generally slower.
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-12-issue1140-astar-v1-eval/astar-release.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-12-issue1140-ff-v1-eval/ff-release.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-16-issue1140-ehc-v1-eval/ehc-release.html

The results for LAMA are similar in that coverage is unaffected, plan costs are
comparable between old and new, and the new version is slightly slower. The
pattern does not match in terms of memory, as the new version has lower memory
usage here.
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1140/data/issue1140-2024-07-10-issue1140-lama-v1-eval/lama-release.html
msg11602 (view) Author: remo Date: 2024-06-27.17:43:41
Clemens, Florian, Gabi, and I came across what seems to be a bug in
EagerSearch. We considered the case of running an A* search without re-opening
and with a consistent heuristic. To achieve this, we used the equivalence
listed under the A* entry on our website, namely that

--search astar(evaluator)

is equivalent to

--evaluator h=evaluator
--search eager(tiebreaking([sum([g(), h]), h], unsafe_pruning=false),
               reopen_closed=true, f_eval=sum([g(), h]))

We expected that setting reopen_closed to false gives us A* without re-opening
and that combining it with the blind heuristic should still return optimal
plans. However, we were able to construct an example where this configuration
returns a sub-optimal plan; see the attached example.png for a drawing of the
task and the attached example.sas for the corresponding SAS file. We would
expect the plan [b, c, e], but we get [b, d, f].

Our understanding of the issue is the following: State 2 is first added to the
open list with a sub-optimal g-value (10). When expanding state 1, we discover a
cheaper path to state 2, so it should be added to the open list again with the
optimal g-value (2). It seems that EagerSearch currently does not do that, as
this only happens when reopen_closed is set to true. Instead, it only updates
the parent pointer of state 2's node, which leads to the search expanding states
3 and 4 and returning the sub-optimal solution before state 2 is expanded,
because the f-value with which state 2 was inserted into the open list is too
high.

A simple potential fix could be to re-insert a node not only when reopen_closed
is true, but also when the successor node in question is still open (line 238 in
eager_search.cc).
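
To make the mechanism concrete outside of the planner code, here is a small,
self-contained Python sketch of an eager best-first search with f = g (blind
heuristic) and without reopening of closed nodes. This is only a toy model: the
names (eager_search, reinsert_open_nodes, SUCCESSORS) are made up for the
illustration, and the graph is a reconstruction in the spirit of the attached
drawing rather than the example.sas task itself. With reinsert_open_nodes=False
it mimics the current behavior (a cheaper path to an open node only updates the
parent pointer) and returns the costlier plan; with True it re-inserts the node
with its improved g-value and finds the cheapest plan, mirroring the [b, c, e]
vs. [b, d, f] observation above.

=========================================================================================

#! /usr/bin/env python3

import heapq
import itertools

# Toy state space in the spirit of the attached drawing (NOT parsed from
# example.sas): SUCCESSORS maps a state to (operator, successor, cost) triples.
# The cheapest plan is [b, c, e] with cost 3; the detour [b, d, f] costs 7.
SUCCESSORS = {
    0: [("a", 2, 10), ("b", 1, 1)],
    1: [("c", 2, 1), ("d", 3, 3)],
    2: [("e", 4, 1)],
    3: [("f", 4, 3)],
}
INIT, GOAL = 0, 4


def eager_search(reinsert_open_nodes):
    """Eager best-first search with f = g (blind heuristic) and no reopening
    of closed nodes. If reinsert_open_nodes is False, a cheaper path to a
    state that is still open only updates its parent pointer (the current
    behavior); if True, the state is also re-inserted into the open list
    with its new g-value (the proposed fix)."""
    counter = itertools.count()  # tie-breaker so heapq never compares states
    g = {INIT: 0}
    parent = {INIT: None}        # state -> (operator, predecessor)
    closed = set()
    open_list = [(0, next(counter), INIT)]
    while open_list:
        f, _, state = heapq.heappop(open_list)
        if state in closed or f > g[state]:
            continue  # stale open list entry
        if state == GOAL:
            # Goal test on expansion, as in eager search: extract the plan.
            plan = []
            while parent[state] is not None:
                op, pred = parent[state]
                plan.append(op)
                state = pred
            return list(reversed(plan)), g[GOAL]
        closed.add(state)
        for op, succ, cost in SUCCESSORS.get(state, []):
            succ_g = g[state] + cost
            if succ not in g:
                g[succ] = succ_g
                parent[succ] = (op, state)
                heapq.heappush(open_list, (succ_g, next(counter), succ))
            elif succ_g < g[succ]:
                # Cheaper path to an already generated state. The current
                # behavior only updates the parent pointer; the stale open
                # list entry keeps its old, too-high f-value.
                parent[succ] = (op, state)
                if reinsert_open_nodes and succ not in closed:
                    # Proposed fix: the node is still open, so also record
                    # the better g-value and re-insert it (closed nodes are
                    # still never reopened).
                    g[succ] = succ_g
                    heapq.heappush(open_list, (succ_g, next(counter), succ))
    return None


print(eager_search(reinsert_open_nodes=False))  # (['b', 'd', 'f'], 7)
print(eager_search(reinsert_open_nodes=True))   # (['b', 'c', 'e'], 3)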
History
Date                 User    Action  Args
2024-07-22 19:12:10  remo    set     status: chatting -> resolved; messages: + msg11668
2024-07-19 11:26:19  remo    set     messages: + msg11650
2024-07-18 15:00:04  malte   set     messages: + msg11649
2024-07-18 12:12:14  remo    set     messages: + msg11648
2024-07-18 12:10:22  remo    set     messages: + msg11647; title: Sub-optimal plans for A* without re-opening and with consistent heuristic -> Sub-optimal plans for A*-inspired eager configuration without re-opening and with consistent heuristic
2024-07-17 12:15:46  remo    set     messages: + msg11645
2024-07-17 11:08:06  malte   set     messages: + msg11644
2024-07-17 09:28:45  malte   set     messages: + msg11642
2024-07-17 01:19:33  remo    set     status: unread -> chatting; messages: + msg11641; summary: Pull request: https://github.com/aibasel/downward/pull/225
2024-07-16 18:02:00  salome  set     nosy: + salome
2024-06-27 17:43:58  remo    set     files: + example.sas
2024-06-27 17:43:41  remo    create