Created on 2014-04-08.20:18:56 by silvan, last changed 2014-04-25.10:08:02 by silvan.
msg3142
Author: silvan
Date: 2014-04-25.10:08:02

Resolved.

msg3140
Author: malte
Date: 2014-04-24.19:40:10

I agree.

msg3133
Author: silvan
Date: 2014-04-22.11:53:25

Okay, I'll add it to the TODO list. Thanks for the investigation!
I conducted further experiments, comparing the baseline data from the
IJCAI/JACM experiments with new data for both implementations discussed in
this issue (no atomicity test, corrected test).
IJCAI experiment:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-ijcai2011-abs.html
Here, all versions achieve the same coverage (except one case with a difference
of 1).
JACM experiment:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-jacm-abs.html
The version without the atomicity test and the baseline achieve the same
results, and as in the previous experiment (msg3103), the "corrected test"
actually performs worse.
However, when looking at the domain-wise coverage, the cgl-10k configuration
with the corrected test performs better than the baseline in many domains, and
worse only in a few domains (but by a huge margin in those).
As no-test and baseline seem to be identical, I created a direct comparison
table of the no-test and the corrected-test versions:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-22-issue431-compare-jacm-comp.html
I am not sure if that helps gain any further insight, but in any case: as
there is just *one* configuration which *sometimes* profits from a corrected
version of the test, while all other configurations get worse with it, I think
that we should remove the test.
What do you think?

msg3132
Author: malte
Date: 2014-04-21.22:46:29

> Now I believe that there was something else going on there and will
> investigate the old code to try to find out more. (I have one
> hypothesis, but I'm not sure yet.)
OK, I've investigated this as far as I think it makes sense.
As far as I can tell, the reason for the difference in behaviour is that the
old finite-N greedy bisimulation code was a mixture of greedy and non-greedy
bisimulation. In particular, in cases where N is large enough to bisimulate
everything perfectly, it would never enable greedy bisimulation.
Perhaps we want to add such a hybrid strategy to our master merge-and-shrink
TODO list (one that first runs perfect bisimulation, followed by greedy
bisimulation if we're still above the threshold). But I don't think we need to
worry about it in this issue.
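For the TODO list entry, here is a minimal sketch of what such a hybrid
strategy could look like (invented names and stub types, not the actual
merge-and-shrink API):

    // Sketch only: stubs standing in for the real M&S classes.
    struct Abstraction {
        int num_states;
        int size() const { return num_states; }
    };

    // Exact bisimulation: shrinks without losing heuristic information.
    void compute_perfect_bisimulation(Abstraction &abs) { /* stub */ }

    // Greedy bisimulation: approximate, shrinks down to target_size.
    void compute_greedy_bisimulation(Abstraction &abs, int target_size) { /* stub */ }

    void hybrid_shrink(Abstraction &abs, int target_size) {
        // First run perfect bisimulation; it may already shrink enough.
        compute_perfect_bisimulation(abs);
        // Only fall back to the lossy greedy variant if we are still
        // above the size threshold.
        if (abs.size() > target_size)
            compute_greedy_bisimulation(abs, target_size);
    }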

msg3124
Author: malte
Date: 2014-04-14.11:46:25

> Concerning msg3119, I don't understand why we should get a perfect heuristic
> for any greedy configuration, even with threshold=N. I thought that applying
> greedy bisimulation does not guarantee preserving perfect information?
You are right; I just thought it would be perfect in Gripper anyway if we only
start applying it after merging in the non-goal variables. But this idea was
mainly based on the old results we had (where we did solve all 20 with some of
the greedy configurations). Now I believe that there was something else going on
there and will investigate the old code to try to find out more. (I have one
hypothesis, but I'm not sure yet.)
> Concerning msg3120, those unexplained errors unfortunately keep occurring for
> certain merge-and-shrink configurations. The error is that vector::reserve()
> is being called with a value that is too large. I have gotten used to ignoring
> errors with exit code 134, which probably isn't the best thing to do either.
Maybe we can address this in the code in some way. It would be nice to only get
errors for actual errors.
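For illustration, one way to do that would be to catch the failure at the
reserve() call and exit with a dedicated, explained code instead of letting
the uncaught exception abort the process (exit code 134 is 128 + SIGABRT).
This is only a sketch; the exit code and surrounding code are made up:

    #include <cstddef>
    #include <iostream>
    #include <limits>
    #include <new>
    #include <stdexcept>
    #include <vector>

    const int EXIT_OUT_OF_MEMORY = 22;  // illustrative, not Fast Downward's

    int main() {
        std::vector<int> transitions;
        try {
            // A request beyond max_size() throws std::length_error; a huge
            // but representable request may throw std::bad_alloc instead.
            transitions.reserve(std::numeric_limits<std::size_t>::max());
        } catch (const std::length_error &) {
            std::cerr << "reserve() request exceeds vector::max_size()" << std::endl;
            return EXIT_OUT_OF_MEMORY;
        } catch (const std::bad_alloc &) {
            std::cerr << "memory allocation failed in reserve()" << std::endl;
            return EXIT_OUT_OF_MEMORY;
        }
        return 0;
    }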
> msg3122: So that means that we have to find differences between the
> implementation back then and the implementation now. Should we test some
> gripper instances at the revision at which the code was integrated into the
> default branch?
Yes, that was what I was planning.

msg3123
Author: silvan
Date: 2014-04-14.11:38:01

Thanks for the scripts and the listing in msg3118!
Concerning msg3119, I don't understand why we should get a perfect heuristic
for any greedy configuration, even with threshold=N. I thought that applying
greedy bisimulation does not guarantee preserving perfect information? I
agree, though, that there is something wrong with either the old or the new
results when it comes to the greedy-threshold-N configurations.
Concerning msg3120, those unexplained errors unfortunately keep occurring for
certain merge-and-shrink configurations. The error is that vector::reserve()
is being called with a value that is too large. I have gotten used to ignoring
errors with exit code 134, which probably isn't the best thing to do either.
msg3121: I agree. As stated above, I find it rather strange that any greedy
configuration should yield the perfect heuristic.
msg3122: So that means that we have to find differences between the
implementation back then and the implementation now. Should we test some
gripper instances at the revision at which the code was integrated into the
default branch?

msg3122
Author: malte
Date: 2014-04-13.16:30:19

I have to switch to something else now, so here is a final progress update before I leave.
I've made a new commit to the existing ijcai-2011 branch to make it compile on
modern g++ versions. This old code is a mess, but fortunately I found the
necessary info to reproduce the IJCAI 2011 experiments in
"new-scripts/ijcai-2011.sh".
I can reproduce the perfect coverage on Gripper on this version as follows:

    $ cd src
    $ ./build_all
      (unfortunately doesn't support options in this old version, so no "-j4" or similar)
    $ cd search
    $ ../translate/translate.py ../../benchmarks/gripper/prob20.pddl && \
      ../preprocess/preprocess < output.sas && \
      ./downward < output --search \
      'astar(mas(shrink_strategy=7,merge_strategy=5,max_states=200000))'

msg3121
Author: malte
Date: 2014-04-13.15:59:16

OK, I looked at the output for some gripper runs with various parameters and
merge strategies in a bit of detail. At the moment, the odd thing for me is not
that we don't solve the full 20 tasks with configs like greedy-200K, but rather
that we *used* to solve all of them. So I'll try to reproduce the older results
from the IJCAI paper and, if that succeeds, find out which code change caused
the change in behaviour.

msg3120
Author: malte
Date: 2014-04-13.15:34:21

Also, can you look at the logs for the unexplained errors in
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-ijcai2011-exp-abs.html#unexplained-errors ?
I can't see much of a pattern there except that it's always the same config.

msg3119
Author: malte
Date: 2014-04-13.15:32:44

We see that the worst cases occur for gripper and, to a lesser degree, miconic
in certain gop configurations. These are both domains that include a single
"transporting variable" which gop with threshold 1 should essentially throw
away, but gop with a higher threshold shouldn't. My intuition is that we should
get a perfect heuristic in a domain like gripper with DFP-gop-200k, so something
is at odds with the new results. I suggest we find out what is happening there
before we look at the JACM results in more detail.

msg3118
Author: malte
Date: 2014-04-13.15:29:46

The second part, which is the main script, is attached to this message. The
script assumes that the properties file of the IJCAI 2011 comparison experiment
is in the current directory under the name "ijcai2011-experiment.properties".
I've tested it with the properties file for 2014-04-12-ijcai2011-exp.
Here is the output:
DFP-bop-200k: 293 (old) vs 317 (new) => +24
DFP-b-200k: 248 (old) vs 271 (new) => +23
DFP-b-100k: 268 (old) vs 290 (new) => +22
DFP-bop-100k: 309 (old) vs 330 (new) => +21
DFP-bop-10k: 333 (old) vs 354 (new) => +21
M&S-gop-inf: 320 (old) vs 336 (new) => +16
hhh-200k: 262 (old) vs 277 (new) => +15
M&S-bop-inf: 192 (old) vs 204 (new) => +12
DFP-b-10k: 317 (old) vs 329 (new) => +12
hhh-10k: 310 (old) vs 322 (new) => +12
hhh-100k: 278 (old) vs 289 (new) => +11
M&S-b-inf: 156 (old) vs 167 (new) => +11
DFP-gop-10k: 333 (old) vs 337 (new) => +4
DFP-gop-100k: 310 (old) vs 302 (new) => -8
DFP-gop-200k: 295 (old) vs 285 (new) => -10
DFP-bop-200k: 293 (old) vs 317 (new) => +24
+10: miconic
+4: airport
+2: blocks, mprime, pipesworld-notankage
+1: mystery, psr-small, trucks-strips, zenotravel
DFP-b-200k: 248 (old) vs 271 (new) => +23
+5: blocks, miconic
+4: airport
+2: pipesworld-notankage
+1: driverlog, grid, pipesworld-tankage, psr-small, rovers, tpp, trucks-strips, zenotravel
-1: freecell
DFP-b-100k: 268 (old) vs 290 (new) => +22
+7: blocks
+5: miconic
+4: pipesworld-notankage
+3: airport
+1: driverlog, pipesworld-tankage, psr-small, tpp, trucks-strips
-1: mprime, satellite
DFP-bop-100k: 309 (old) vs 330 (new) => +21
+8: miconic
+4: blocks, pipesworld-notankage
+3: airport
+2: mprime, pipesworld-tankage
+1: mystery, psr-small
-1: satellite, trucks-strips
-2: freecell
DFP-bop-10k: 333 (old) vs 354 (new) => +21
+8: miconic
+5: blocks
+4: pipesworld-notankage
+2: freecell
+1: airport, mprime, pipesworld-tankage
-1: trucks-strips
M&S-gop-inf: 320 (old) vs 336 (new) => +16
+7: blocks
+3: freecell
+2: mystery, pipesworld-notankage
+1: airport, miconic
hhh-200k: 262 (old) vs 277 (new) => +15
+6: airport
+4: miconic
+2: blocks, pipesworld-tankage
+1: freecell, grid, mystery, pipesworld-notankage
-1: driverlog
-2: logistics00
M&S-bop-inf: 192 (old) vs 204 (new) => +12
+5: miconic
+3: blocks
+1: logistics98, mystery, trucks-strips, zenotravel
DFP-b-10k: 317 (old) vs 329 (new) => +12
+4: freecell
+2: mprime, mystery
+1: airport, pipesworld-notankage, pipesworld-tankage, rovers
hhh-10k: 310 (old) vs 322 (new) => +12
+6: pipesworld-notankage
+3: freecell
+2: airport
+1: blocks, logistics00
-1: pipesworld-tankage
hhh-100k: 278 (old) vs 289 (new) => +11
+5: airport
+2: blocks, miconic
+1: depot, freecell, pipesworld-tankage
-1: logistics00
M&S-b-inf: 156 (old) vs 167 (new) => +11
+5: miconic
+3: blocks
+2: zenotravel
+1: psr-small
DFP-gop-10k: 333 (old) vs 337 (new) => +4
+7: blocks
+3: freecell
+2: pipesworld-tankage
+1: airport, mprime, mystery, psr-small, tpp
-1: driverlog, rovers, trucks-strips
-2: zenotravel
-4: gripper, logistics00
DFP-gop-100k: 310 (old) vs 302 (new) => -8
+10: blocks
+5: pipesworld-notankage
+4: airport, pipesworld-tankage
+2: mystery
+1: psr-small
-1: driverlog, logistics98, satellite, trucks-strips
-2: grid, mprime, rovers
-3: logistics00, zenotravel
-5: miconic
-13: gripper
DFP-gop-200k: 295 (old) vs 285 (new) => -10
+10: blocks
+4: airport
+3: pipesworld-notankage
+2: mystery
+1: depot, freecell
-1: driverlog, logistics98, satellite, trucks-strips
-2: grid, zenotravel
-3: logistics00
-7: miconic
-13: gripper

msg3117
Author: malte
Date: 2014-04-13.15:26:21

I've written a script to simplify the per-domain comparisons. It's not very
polished, but I think the core of it may be useful for future experiments too,
so I'm attaching it. It has two parts; the first part, which contains the data
from the original IJCAI 2011 experiment, is attached to this message.
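The core of the comparison is just grouping domains by their coverage delta. A
minimal sketch of that idea (in C++ for illustration only; the attached script
itself is Python, and all names here are invented):

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Coverage = std::map<std::string, int>;  // domain -> tasks solved

    // Group domains by coverage delta (new minus old), largest gains first,
    // and print them in the "+N: domain1, domain2, ..." style of msg3118.
    void report_deltas(const Coverage &old_cov, const Coverage &new_cov) {
        std::map<int, std::vector<std::string>, std::greater<int>> by_delta;
        for (const auto &entry : old_cov) {
            int delta = new_cov.at(entry.first) - entry.second;
            if (delta != 0)
                by_delta[delta].push_back(entry.first);
        }
        for (const auto &group : by_delta) {
            std::cout << (group.first > 0 ? "+" : "") << group.first << ": ";
            for (std::size_t i = 0; i < group.second.size(); ++i)
                std::cout << (i ? ", " : "") << group.second[i];
            std::cout << "\n";
        }
    }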

msg3116
Author: malte
Date: 2014-04-13.13:21:44

> Unfortunately, I don't know how to use a custom order for the columns and the
> lab doc page is not showing at the moment.
One way is to make sure that the alphabetical ordering of nicknames matches the
desired order -- in the extreme case by prefixing the nicknames with "A-", "B-",
"C-" etc.
> I suspect that we might not have used the correct configuration, and that
> also here, we should have used threshold=1 for the gop-configurations
> instead of threshold=N (as done for the greedy-configurations
> of the JACM experiment).
IIRC, the finite-N bisimulations used one set of code and the infinite-N ones
used another, and the threshold parameter didn't exist yet back then. So I
think threshold=1 was really only used for the infinite-N cases.
I only had a brief look so far, but if I recall the IJCAI paper numbers
correctly, we don't get 20 solved problems in all cases where we should
according to the paper. Maybe this is related to why we lose coverage in the gop
configurations. I'll check the per-domain numbers more thoroughly later.

msg3115
Author: silvan
Date: 2014-04-13.12:27:55

I ran a big default-branch experiment to reproduce the experiments from the
IJCAI 2011 paper and the JACM 2014 paper. Unfortunately, I don't know how to
use a custom order for the columns, and the lab doc page is not showing at the
moment; I'll try to change this later. For the IJCAI experiment, I found the
table to be small and well-ordered enough to compare the values by hand; for
the JACM experiment, I wrote down the coverage numbers below.
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-ijcai2011-exp-abs.html
We achieve better coverage than the values reported in the paper (always by a
similar amount), with the exception of the gop configurations, where we
actually perform worse. I suspect that we might not have used the correct
configuration, and that here, too, we should have used threshold=1 for the
gop configurations instead of threshold=N (as done for the greedy
configurations of the JACM experiment).
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-12-jacm-exp-abs.html
First, note that the JACM paper does not use *any* domain from IPC08 (they say
they use benchmarks from IPC98-IPC11, but they exclude IPC08 domains that were
reused in 2011, and this is true for all of them).
First value: our experiment. Second value: paper experiment.
cgl-b-10k: 493 vs 472
cgl-b-inf: 282 vs 274
cgl-b-inf-nolr: 249 vs 247
cgl-g-10k: 486 vs 460
cgl-h-10k: 401 vs 388
cgl-h-10k-nolr: 379 vs 361
rl-b-10k: 490 vs 463
rl-b-100k: 467 vs 443
rl-b-200k: 457 vs 428
rl-b-inf: 255 vs 252
rl-g-10k: 488 vs 461
rl-g-100k: 488 vs 465
rl-g-200k: 488 vs 466
rl-g-inf: 424 vs 402
rl-h-10k: 372 vs 359
rl-h-100k: 337 vs 330
rl-h-200k: 330 vs 321
rl-h-inf: 209 vs 208
rl-rnd-100k: 179 vs 180
I think all values are better in a "similar fashion", so this should reflect
our Fast Downward changes since the paper's experiments.
I did not yet run the greedy configurations of either experiment on the
issue's code.

msg3104
Author: malte
Date: 2014-04-09.18:44:18

I also think it should be removed, but I see one problem that we should deal
with before we close this issue.
We may end up with an implementation that differs from what is described in
the papers. So it would be good to check what the papers describe (the
original IJCAI one on bisimulation-based M&S strategies and the JACM one), how
that relates to what we have currently implemented, and how the results
reported there map to the results we have now.
Depending on the outcome, we might want to add an erratum to the papers or a
comment to the code that describes how the code differs from the papers.

msg3103
Author: silvan
Date: 2014-04-09.10:36:34

You can find the experiments here:
http://ai.cs.unibas.ch/_tmp_files/sieverss/2014-04-08-issue431-compare-abs.html
Apparently, the current implementation of the test for atomicity of an
abstraction hardly ever triggered. In terms of coverage, it makes no
difference whether the test is there or not. If we correct the test, however,
coverage drops.
I'd vote for removing the test. What do you think?

msg3102
Author: silvan
Date: 2014-04-08.20:18:56

ShrinkBisimulation::shrink(...) tests for abs.size() == 1 when using a greedy
bisimulation shrink strategy to find out whether abs is an atomic abstraction
or not (because atomic abstractions should not be shrunk). This is obviously
not what the comment in the code promises to do.
We want to test the following scenarios (see the sketch below):
- test against the number of variables of the abstraction (which must equal
one for atomic abstractions)
- do not perform any test at all and greedily shrink atomic abstractions as
well
These variants will be compared against the current implementation of the
ill-defined test.
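As a rough illustration of the difference (hypothetical names, not the actual
Fast Downward code):

    // Sketch only: an abstraction is atomic iff it was built from a single
    // variable, which is unrelated to its number of abstract states.
    #include <vector>

    struct Abstraction {
        std::vector<int> varset;  // variables the abstraction covers
        int num_states;           // number of abstract states
        int size() const { return num_states; }
    };

    // Current (ill-defined) test: a single-state abstraction is not
    // necessarily atomic, and an atomic abstraction usually has more
    // than one state.
    bool is_atomic_buggy(const Abstraction &abs) {
        return abs.size() == 1;
    }

    // Corrected test: check the number of covered variables instead.
    bool is_atomic_corrected(const Abstraction &abs) {
        return abs.varset.size() == 1;
    }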

Date                | User   | Action | Args
--------------------+--------+--------+-----------------------------------------------------
2014-04-25 10:08:02 | silvan | set    | status: in-progress -> resolved; messages: + msg3142
2014-04-24 19:40:10 | malte  | set    | messages: + msg3140
2014-04-22 11:53:25 | silvan | set    | messages: + msg3133
2014-04-21 22:46:29 | malte  | set    | messages: + msg3132
2014-04-14 11:46:25 | malte  | set    | messages: + msg3124
2014-04-14 11:38:01 | silvan | set    | messages: + msg3123
2014-04-13 16:30:19 | malte  | set    | messages: + msg3122
2014-04-13 15:59:16 | malte  | set    | messages: + msg3121
2014-04-13 15:34:21 | malte  | set    | messages: + msg3120
2014-04-13 15:32:44 | malte  | set    | messages: + msg3119
2014-04-13 15:29:46 | malte  | set    | files: + compare_coverage.py; messages: + msg3118
2014-04-13 15:26:21 | malte  | set    | files: + ijcai2011_data.py; messages: + msg3117
2014-04-13 13:21:44 | malte  | set    | messages: + msg3116
2014-04-13 12:27:55 | silvan | set    | messages: + msg3115
2014-04-09 18:44:18 | malte  | set    | messages: + msg3104
2014-04-09 10:36:34 | silvan | set    | messages: + msg3103
2014-04-08 20:18:56 | silvan | create |