I left some comments on Bitbucket.
Regarding the experimental results, the analysis comes across as a bit biased.
It only gives the good news and doesn't mention that there is also bad news. In
general, it would be great if a basic assessment of an issue could be done based
on what is mentioned in the issue tracker, without having to dig into the
detailed results. This could make many reviewing tasks much faster. But for this
to work, the description would have to be fair, and it looks a bit cherry-picked to me.
(That can very easily happen accidentally; no complaints!)
To illustrate the point, let me focus on the negative outcomes instead:
By the same argument ("it performs worse") with which we discard the two 10M
configurations and focus only on the 1M results, we could instead discard the
two "original" configurations and focus only on "landmarks-goal". That argument
would perhaps be even more valid than the one for discarding the 10M results
because the amount by which "landmarks-goal" outperforms "original" is much
larger than the amount by which 1M outperforms 10M.
The good 1M configuration outperforms 10M by only 2 tasks -- not enough evidence
that increasing the abstraction size is a bad idea in general. After all, if you
increase the amount of time/memory for something, a factor of 10 is a large
jump. 1M might very well be too low even if 10M is too high. If we see good
results (+2 coverage) for landmarks-goal-1M and bad results (-5 coverage) for
landmarks-goal-10M, does the change hurt or help our good landmarks-goal config?
Beyond coverage, if we focus on the two landmarks-goal configs, we see an
overall loss in all scores other than total time. The overall time for finding
flaws and for finding traces increases rather than decreases. None of this is
alarming, but is it not worth mentioning? Heuristic quality becomes generally
worse, including for the best overall config (with 1M). Given how close the
landmarks-goal 1M and 10M configs are in the baseline, that both of them suffer
in heuristic quality with this change, and that the change negatively affects
coverage overall, I don't think this looks like such a clear win.
I think the analysis should also mention that there is an effect on heuristic
quality, and that the effect is negative for the two best configurations. From
the issue discussion alone, one could easily come to the wrong conclusion
that this is just a data structure improvement that does not affect what is
computed, only how fast. But that's not the case; we also change the heuristic
because the transitions end up in a different order than before. (The
description mentions that, but from "just" reading our CEGAR papers it is not
clear why this would affect the heuristic values. I assume it is because we find
different abstract plans depending on the ordering of transitions, as we don't
use any rigorous tie-breaking strategy there.)
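To make the tie-breaking point concrete, here is a minimal sketch in the spirit
of the trace search (hypothetical code and names, not our actual
implementation): a breadth-first search returns the first shortest trace it
happens to reach, so with no explicit tie-breaking, which of several equally
short traces wins is decided purely by the stored order of the transitions.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Transition {
    int op;      // operator/label id
    int target;  // successor abstract state
};

// transitions[s] lists the outgoing transitions of abstract state s.
std::vector<int> find_trace(
        const std::vector<std::vector<Transition>> &transitions,
        int init, int goal) {
    std::vector<int> parent_op(transitions.size(), -1);
    std::vector<int> parent_state(transitions.size(), -1);
    std::vector<bool> seen(transitions.size(), false);
    std::queue<int> open;
    open.push(init);
    seen[init] = true;
    while (!open.empty()) {
        int s = open.front();
        open.pop();
        if (s == goal) {
            // Reconstruct the trace. Which of the equally short traces we
            // return was decided by the iteration order below.
            std::vector<int> trace;
            for (int t = s; t != init; t = parent_state[t])
                trace.push_back(parent_op[t]);
            return {trace.rbegin(), trace.rend()};
        }
        // The first transition reaching a state "wins"; there is no
        // further tie-breaking, so the order of the list matters.
        for (const Transition &t : transitions[s]) {
            if (!seen[t.target]) {
                seen[t.target] = true;
                parent_op[t.target] = t.op;
                parent_state[t.target] = s;
                open.push(t.target);
            }
        }
    }
    return {};  // no trace: the abstract goal is unreachable
}

int main() {
    // Two shortest traces from state 0 to state 2: ops 0,2 via state 1,
    // or ops 1,3 via state 3. Only state 0's transition order differs.
    std::vector<std::vector<Transition>> a = {
        {{0, 1}, {1, 3}}, {{2, 2}}, {}, {{3, 2}}};
    std::vector<std::vector<Transition>> b = {
        {{1, 3}, {0, 1}}, {{2, 2}}, {}, {{3, 2}}};
    for (int op : find_trace(a, 0, 2)) std::printf("%d ", op);  // 0 2
    std::printf("\n");
    for (int op : find_trace(b, 0, 2)) std::printf("%d ", op);  // 1 3
    std::printf("\n");
    return 0;
}
```

If each refinement step then splits abstract states along whichever trace was
found, such tie-breaking differences can compound over the CEGAR iterations,
which would explain why a mere reordering of transitions ends up changing the
heuristic values.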
Of course, all this is playing devil's advocate, and I've overemphasized the
negatives. It is meant mainly as a meta-comment on which points in a result
analysis help get a quick review. :-)
Weighing both the positives and the negatives, to be honest, I don't think the
experiments give all that much support for making the change, but there is also
not much reason to *not* make the change. So from my side, feel free to merge this.
More generally, it looks like there are still some interesting algorithm and
data structure questions around the CEGAR loop that might be worth looking
into at some point if you're interested. If the "total_time_for_..." numbers in
the experiments include all relevant bottlenecks, it might well be worth
considering whether we could cut down the time for finding traces somewhat. But
then I'm not sure how much it matters in the grand scheme of things, where
other aspects such as cost partitioning come into play.