Issue431

Title M&S: greedy shrink bisimulation strategies test for an abstraction's size rather than the abstraction's varset
Priority: bug
Status: resolved
Superseder: (none)
Nosy List: malte, silvan
Assigned To: silvan
Keywords: (none)
Optional summary: (none)

Created on 2014-04-08.20:18:56 by silvan, last changed by silvan.

Files
compare_coverage.py (text/x-python), uploaded by malte on 2014-04-13.15:29:46
ijcai2011_data.py (text/x-python), uploaded by malte on 2014-04-13.15:26:21
Messages
msg3142 Author: silvan Date: 2014-04-25.10:08:02
Resolved.
msg3140 Author: malte Date: 2014-04-24.19:40:10
I agree.
msg3133 Author: silvan Date: 2014-04-22.11:53:25
Okay, I'll add it to the TODO list. Thanks for the investigation!

I conducted further experiments, comparing the baseline data from the ijcai/jacm
experiments with new data for both variants implemented for this issue (no
atomicity test at all, and the corrected test).

IJCAI experiment:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-ijcai2011-abs.html

Here, all versions achieve the same coverage (except one case with a difference
of 1).

JACM experiment:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-jacm-abs.html

Not testing for atomicity and the baseline achieve the same results, and as in
the previous experiment (msg3103), the "corrected test" actually performs worse.
However, looking at the domain-wise coverage, the cgl-10k configuration with the
corrected test performs better than the baseline in many domains and worse in
only a few domains (though by a large margin there).

As no-test and baseline seem to be identical, I created a direct comparison
table of the no-test and the correct-test versions:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-jacm-comp.html
I am not sure whether that helps gain any further insights, but since there is
just *one* configuration which *sometimes* profits from a corrected version of
the test, while all other configurations get worse with it, I think that we
should remove the test.

What do you think?
msg3132 Author: malte Date: 2014-04-21.22:46:29
> Now I believe that there was something else going on there and will
> investigate the old code to try to find out more. (I have one
> hypothesis, but I'm not sure yet.)

OK, I've investigated this as far as I think it makes sense.

As far as I can tell, the reason for the difference in behaviour is that the old
finite-N greedy bisimulation code was a mixture of greedy and non-greedy
bisimulation. In particular, in cases where N is large enough to bisimulate
everything perfectly, it would never enable greedy bisimulation.

Perhaps we want to add such a hybrid strategy to our master merge-and-shrink
TODO list (one that first runs perfect bisimulation, followed by greedy
bisimulation if we're still above the threshold). But I don't think we need to
worry about it in this issue.
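
For concreteness, here is a rough C++ sketch of such a hybrid strategy. All
names below (Abstraction, compute_bisimulation, num_states, apply_abstraction)
are made-up placeholders for this sketch, not the actual merge-and-shrink
interfaces:

#include <vector>

struct Abstraction;  // stands in for a merge-and-shrink abstraction

// Helpers assumed to exist for this sketch only:
std::vector<int> compute_bisimulation(const Abstraction &abs, bool greedy);
int num_states(const Abstraction &abs);
void apply_abstraction(Abstraction &abs, const std::vector<int> &equivalence);

// Hybrid shrinking: apply a perfect (non-greedy) bisimulation first and only
// fall back to greedy bisimulation if the abstraction is still too large.
void hybrid_shrink(Abstraction &abs, int max_states) {
    apply_abstraction(abs, compute_bisimulation(abs, /*greedy=*/false));
    if (num_states(abs) > max_states)
        apply_abstraction(abs, compute_bisimulation(abs, /*greedy=*/true));
}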
msg3124 Author: malte Date: 2014-04-14.11:46:25
> Concerning msg3119, I don't understand why we should get a perfect heuristic
> for any greedy configuration, even with threshold=N. I thought that applying a
> greedy bisimulation does not guarantee to keep perfect information?

You are right; I just thought it would be perfect in Gripper anyway if we only
start applying it after merging in the non-goal variables. But this idea was
mainly based on the old results we had (where we did solve all 20 with some of
the greedy configurations). Now I believe that there was something else going on
there and will investigate the old code to try to find out more. (I have one
hypothesis, but I'm not sure yet.)

> Concerning msg3120, those unexplained errors unfortunately keep occurring for
> certain merge-and-shrink configurations. The error is that vector::reserve() is
> being called with a value that is too large. I have gotten used to ignoring
> errors with exit code 134, which probably isn't the best thing to do either.

Maybe we can address this in the code in some way. It would be nice to only get
errors for actual errors.
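
For illustration, a small stand-alone sketch of one way to do that (not actual
Fast Downward code): vector::reserve() throws std::length_error when asked for
more than max_size(), and if the exception is not caught the process aborts
with exit code 134 (128 + SIGABRT). Catching it lets us print a proper message
and exit with a code of our choosing; the exit code 22 below is an arbitrary
example.

#include <cstddef>
#include <iostream>
#include <new>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> transitions;
    // Stands for a capacity computed from the abstraction sizes; in the
    // failing runs it is apparently larger than the vector can ever hold.
    std::size_t huge_request = transitions.max_size() + 1;
    try {
        transitions.reserve(huge_request);
    } catch (const std::length_error &) {
        // Without this handler, the uncaught exception aborts the planner.
        std::cerr << "capacity request too large, treating as out of memory" << std::endl;
        return 22;  // arbitrary example exit code for this failure mode
    } catch (const std::bad_alloc &) {
        std::cerr << "memory allocation failed" << std::endl;
        return 22;
    }
    return 0;
}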

> msg3122: So that means that we have to find differences between the
> implementation back then and the current one. Should we test some gripper
> instances at the revision at which the code was integrated into the default branch?

Yes, that was what I was planning.
msg3123 Author: silvan Date: 2014-04-14.11:38:01
Thanks for the scripts and the listing in msg3118!

Concerning msg3119, I don't understand why we should get a perfect heuristic for
any greedy configuration, even with threshold=N. I thought that applying a
greedy bisimulation does not guarantee to keep perfect information? I agree
though that there is something wrong with either the old or the new results,
when it comes to greedy-thresholdN-configurations.

Concerning msg3120, those unexplained errors unfortunately keep occurring for
certain merge-and-shrink configurations. The error is that vector::reserve() is
being called with a value that is too large. I have gotten used to ignoring
errors with exit code 134, which probably isn't the best thing to do either.

msg3121: I agree. As stated above, I find it rather strange that any greedy
configuration should yield the perfect heuristic.

msg3122: So that means that we have to find differences between the
implementation back then and the current one. Should we test some gripper
instances at the revision at which the code was integrated into the default branch?
msg3122 Author: malte Date: 2014-04-13.16:30:19
I have to switch to something else now, so a final progress update before I leave.

I've made a new commit to the existing ijcai-2011 branch to make it compile on
modern g++ versions. This old code is a mess, but fortunately I found the
necessary info to reproduce the IJCAI 2011 experiments in
"new-scripts/ijcai-2011.sh".

I can reproduce the perfect coverage on Gripper on this version as follows:

$ cd src
$ ./build_all
(unfortunately doesn't support options in this old version, so no "-j4" or similar)
$ cd search
$ ../translate/translate.py ../../benchmarks/gripper/prob20.pddl && \
  ../preprocess/preprocess < output.sas && \
  ./downward < output --search \
  'astar(mas(shrink_strategy=7,merge_strategy=5,max_states=200000))'
msg3121 Author: malte Date: 2014-04-13.15:59:16
OK, I looked at the output for some gripper runs with various parameters and
merge strategies in a bit of detail. At the moment, the odd thing for me is not
that we don't solve the full 20 tasks with configs like greedy-200K, but rather
that we *used* to solve all of them. So I'll try to reproduce the older results
from the IJCAI paper and if that succeeds, try to find out which code change
caused the change in behaviour.
msg3120 Author: malte Date: 2014-04-13.15:34:21
Also, can you look at the logs for the unexplained errors in
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-ijcai2011-exp-abs.html#unexplained-errors?
I can't see much of a pattern there except that it's always the same config.
msg3119 Author: malte Date: 2014-04-13.15:32:44
We see that the worst cases occur for gripper and to a lesser degree miconic in
certain gop configurations. These are both domains which include a single
"transporting variable" which gop with threshold 1 should essentially throw
away, but gop with a higher threshold shouldn't. My intuition is that we should
get a perfect heuristic in a domain like gripper with DFP-gop-200k, so something
is at odds with the new results. I suggest we find out what is happening there
before we look at the JACM results in more detail.
msg3118 Author: malte Date: 2014-04-13.15:29:46
The second part, which is the main script, is attached to this message. The
script assumes that the properties file of the IJCAI 2011 comparison experiment
is in the current directory under the name "ijcai2011-experiment.properties".
I've tested it with the properties file for 2014-04-12-ijcai2011-exp.

Here is the output:

DFP-bop-200k: 293 (old) vs 317 (new) => +24
DFP-b-200k:   248 (old) vs 271 (new) => +23
DFP-b-100k:   268 (old) vs 290 (new) => +22
DFP-bop-100k: 309 (old) vs 330 (new) => +21
DFP-bop-10k:  333 (old) vs 354 (new) => +21
M&S-gop-inf:  320 (old) vs 336 (new) => +16
hhh-200k:     262 (old) vs 277 (new) => +15
M&S-bop-inf:  192 (old) vs 204 (new) => +12
DFP-b-10k:    317 (old) vs 329 (new) => +12
hhh-10k:      310 (old) vs 322 (new) => +12
hhh-100k:     278 (old) vs 289 (new) => +11
M&S-b-inf:    156 (old) vs 167 (new) => +11
DFP-gop-10k:  333 (old) vs 337 (new) => +4
DFP-gop-100k: 310 (old) vs 302 (new) => -8
DFP-gop-200k: 295 (old) vs 285 (new) => -10

DFP-bop-200k: 293 (old) vs 317 (new) => +24
+10: miconic
+4: airport
+2: blocks, mprime, pipesworld-notankage
+1: mystery, psr-small, trucks-strips, zenotravel

DFP-b-200k: 248 (old) vs 271 (new) => +23
+5: blocks, miconic
+4: airport
+2: pipesworld-notankage
+1: driverlog, grid, pipesworld-tankage, psr-small, rovers, tpp, trucks-strips,
zenotravel
-1: freecell

DFP-b-100k: 268 (old) vs 290 (new) => +22
+7: blocks
+5: miconic
+4: pipesworld-notankage
+3: airport
+1: driverlog, pipesworld-tankage, psr-small, tpp, trucks-strips
-1: mprime, satellite

DFP-bop-100k: 309 (old) vs 330 (new) => +21
+8: miconic
+4: blocks, pipesworld-notankage
+3: airport
+2: mprime, pipesworld-tankage
+1: mystery, psr-small
-1: satellite, trucks-strips
-2: freecell

DFP-bop-10k: 333 (old) vs 354 (new) => +21
+8: miconic
+5: blocks
+4: pipesworld-notankage
+2: freecell
+1: airport, mprime, pipesworld-tankage
-1: trucks-strips

M&S-gop-inf: 320 (old) vs 336 (new) => +16
+7: blocks
+3: freecell
+2: mystery, pipesworld-notankage
+1: airport, miconic

hhh-200k: 262 (old) vs 277 (new) => +15
+6: airport
+4: miconic
+2: blocks, pipesworld-tankage
+1: freecell, grid, mystery, pipesworld-notankage
-1: driverlog
-2: logistics00

M&S-bop-inf: 192 (old) vs 204 (new) => +12
+5: miconic
+3: blocks
+1: logistics98, mystery, trucks-strips, zenotravel

DFP-b-10k: 317 (old) vs 329 (new) => +12
+4: freecell
+2: mprime, mystery
+1: airport, pipesworld-notankage, pipesworld-tankage, rovers

hhh-10k: 310 (old) vs 322 (new) => +12
+6: pipesworld-notankage
+3: freecell
+2: airport
+1: blocks, logistics00
-1: pipesworld-tankage

hhh-100k: 278 (old) vs 289 (new) => +11
+5: airport
+2: blocks, miconic
+1: depot, freecell, pipesworld-tankage
-1: logistics00

M&S-b-inf: 156 (old) vs 167 (new) => +11
+5: miconic
+3: blocks
+2: zenotravel
+1: psr-small

DFP-gop-10k: 333 (old) vs 337 (new) => +4
+7: blocks
+3: freecell
+2: pipesworld-tankage
+1: airport, mprime, mystery, psr-small, tpp
-1: driverlog, rovers, trucks-strips
-2: zenotravel
-4: gripper, logistics00

DFP-gop-100k: 310 (old) vs 302 (new) => -8
+10: blocks
+5: pipesworld-notankage
+4: airport, pipesworld-tankage
+2: mystery
+1: psr-small
-1: driverlog, logistics98, satellite, trucks-strips
-2: grid, mprime, rovers
-3: logistics00, zenotravel
-5: miconic
-13: gripper

DFP-gop-200k: 295 (old) vs 285 (new) => -10
+10: blocks
+4: airport
+3: pipesworld-notankage
+2: mystery
+1: depot, freecell
-1: driverlog, logistics98, satellite, trucks-strips
-2: grid, zenotravel
-3: logistics00
-7: miconic
-13: gripper
msg3117 Author: malte Date: 2014-04-13.15:26:21
I've written a script to simplify the per-domain comparisons. It's not very
polished, but I think the core of it may be useful for future experiments, too,
so I'm attaching it. It has two parts; the first part, which contains the data
from the original IJCAI 2011 experiment, is attached to this message.
msg3116 Author: malte Date: 2014-04-13.13:21:44
> Unfortunately, I don't know how to use a custom order for the columns and the
> lab doc page is not showing at the moment.

One way is to make sure that the alphabetical ordering of nicknames matches the
desired order -- in the extreme case by prefixing the nicknames with "A-", "B-",
"C-" etc.

> I suspect that we might not have used the correct configuration, and that
> also here, we should have used threshold=1 for the gop-configurations
> instead of threshold=N (as done for the greedy-configurations
> of the JACM experiment).

IIRC, the finite-N bisimulations used one set of code and the infinite-N ones
used another one, and the threshold parameter didn't yet exist back then. So I
think threshold=1 was really only used for the infinite-N cases back then.

I only had a brief look so far, but if I recall the IJCAI paper numbers
correctly, we don't get 20 solved problems in all cases where we should
according to the paper. Maybe this is related to why we lose coverage in the gop
configurations. I'll check the per-domain numbers more thoroughly later.
msg3115 Author: silvan Date: 2014-04-13.12:27:55
I ran a big default-branch experiment to reproduce the experiments from the
IJCAI 2011 paper and the JACM 2014 paper. Unfortunately, I don't know how to use
a custom order for the columns and the lab doc page is not showing at the
moment. I'll try to change this later. For the IJCAI experiment, I found the
table to be small enough and ordered enough to compare the values by hand; for
the JACM experiment, I wrote down the coverage numbers below.

http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-ijcai2011-exp-abs.html

We achieve better coverage than the values reported in the paper (always by a
similar amount), with the exception of the gop-configurations, where we
actually perform worse. I suspect that we might not have used the correct
configuration, and that also here, we should have used threshold=1 for the
gop-configurations instead of threshold=N (as done for the greedy-configurations
of the JACM experiment).

http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-jacm-exp-abs.html

First, note that the JACM paper does not use *any* domain from IPC08 (they say
they use benchmarks from IPC98-IPC11, but they exclude IPC08 domains which have
been reused in 2011, and this is true for all of them).

First value: our experiment. Second value: paper experiment.

cgl-b-10k: 493 vs 472
cgl-b-inf: 282 vs 274
cgl-b-inf-nolr: 249 vs 247
cgl-g-10k: 486 vs 460
cgl-h-10k: 401 vs 388
cgl-h-10k-nolr: 379 vs 361

rl-b-10k: 490 vs 463
rl-b-100k: 467 vs 443
rl-b-200k: 457 vs 428
rl-b-inf: 255 vs 252
rl-g-10k: 488 vs 461
rl-g-100k: 488 vs 465
rl-g-200k: 488 vs 466
rl-g-inf: 424 vs 402
rl-h-10k: 372 vs 359
rl-h-100k: 337 vs 330
rl-h-200k: 330 vs 321
rl-h-inf: 209 vs 208
rl-rnd-100k: 179 vs 180

I think all values are better in a "similar fashion", so this should reflect
our Fast Downward changes since the paper's experiments.

I did not yet run the greedy configurations of either experiment on this issue's code.
msg3104 Author: malte Date: 2014-04-09.18:44:18
I also think it should be removed, but I see one problem that we should deal
with before we close this issue.

We may end up with an implementation that differs from what is described in the
papers. So it would be good to check what is described in the papers (the
original one on bisimulation-based M&S strategies at IJCAI and the JACM one),
how that relates to what we have currently implemented, and how the results
reported there map to the results we have now.

Depending on the outcomes we get, we might want to add an erratum to the papers
or a comment to the code that describes how the code differs from the papers.
msg3103 Author: silvan Date: 2014-04-09.10:36:34
You can find the experiments here:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-08-issue431-compare-abs.html

Apparently, the current implementation of the test for atomicity of an
abstraction hardly ever triggered. In terms of coverage, it doesn't make any
difference whether the test is there or not. If we correct the test, however,
coverage drops.

I'd vote for removing the test. What do you think?
msg3102 Author: silvan Date: 2014-04-08.20:18:56
ShrinkBisimulation::shrink(...) tests for abs.size() == 1 when using a greedy
bisimulation shrink strategy to find out whether abs is an atomic abstraction or
not (because atomic abstractions should not be shrunk). This is obviously not
what the comment in the code promises to do.

We want to test the following scenarios:
- test against the number of variables of the abstraction, which must equal one
for atomic abstractions (sketched below)
- do not perform any test at all and also greedily shrink atomic abstractions

These variants will be compared against the current implementation of the
ill-defined test.
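
For illustration, a minimal self-contained sketch of the buggy check versus the
corrected one. The Abstraction struct below is a stand-in with just the fields
needed for the example, not the actual Fast Downward class:

#include <cassert>
#include <vector>

// Stand-in for an abstraction: only the parts relevant to the check.
struct Abstraction {
    std::vector<int> varset;  // variables composed into this abstraction
    int num_states;           // current number of abstract states
};

// Current (buggy) test: treats any one-state abstraction as atomic.
bool is_atomic_buggy(const Abstraction &abs) {
    return abs.num_states == 1;
}

// Corrected test: an abstraction is atomic iff it covers exactly one variable.
bool is_atomic_corrected(const Abstraction &abs) {
    return abs.varset.size() == 1;
}

int main() {
    // Composite abstraction that happens to have a single abstract state:
    // the buggy test wrongly calls it atomic, the corrected test does not.
    Abstraction composite{{0, 1, 2}, 1};
    assert(is_atomic_buggy(composite) && !is_atomic_corrected(composite));

    // Genuine atomic abstraction over one variable with several states.
    Abstraction atomic{{0}, 5};
    assert(!is_atomic_buggy(atomic) && is_atomic_corrected(atomic));
    return 0;
}

The second variant from the list above simply drops the check entirely, so
atomic abstractions are shrunk greedily as well.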
History
Date                 User    Action  Args
2014-04-25 10:08:02  silvan  set     status: in-progress -> resolved; messages: + msg3142
2014-04-24 19:40:10  malte   set     messages: + msg3140
2014-04-22 11:53:25  silvan  set     messages: + msg3133
2014-04-21 22:46:29  malte   set     messages: + msg3132
2014-04-14 11:46:25  malte   set     messages: + msg3124
2014-04-14 11:38:01  silvan  set     messages: + msg3123
2014-04-13 16:30:19  malte   set     messages: + msg3122
2014-04-13 15:59:16  malte   set     messages: + msg3121
2014-04-13 15:34:21  malte   set     messages: + msg3120
2014-04-13 15:32:44  malte   set     messages: + msg3119
2014-04-13 15:29:46  malte   set     files: + compare_coverage.py; messages: + msg3118
2014-04-13 15:26:21  malte   set     files: + ijcai2011_data.py; messages: + msg3117
2014-04-13 13:21:44  malte   set     messages: + msg3116
2014-04-13 12:27:55  silvan  set     messages: + msg3115
2014-04-09 18:44:18  malte   set     messages: + msg3104
2014-04-09 10:36:34  silvan  set     messages: + msg3103
2014-04-08 20:18:56  silvan  create