Time for an update! We did start working on this issue as part of the most recent sprint and I've continued to work on it ever since. Parts (1) and (2) have already been reviewed by Simon and Salomé, respectively. Part (3) was a hard nut to crack because it required getting into some details of the code, which took a lot of time. I had temporarily messed up some stuff in the landmark and ordering generation, but everything should be fixed now. We even have empirical evidence that the refactored versions are faster overall (more on this below). So part (3) is now also ready and I'm looking for reviewers. To separate the new changes from the already reviewed ones, I've created a pull request on my own branch: https://github.com/ClemensBuechner/downward/pull/4. Since the changes within each factory are mostly independent, it would be possible to split the workload among several reviewers. I would especially appreciate a fast turnaround on this issue because I've put a lot of effort into it and would like to avoid it getting stuck or other changes interfering before we get around to merging.
Now back to the topic of performance. I have data on the difference between the base version and the already reviewed code, but it raises some questions related to the landmark factories, which are only partially touched at that point. So I think it makes more sense to look at the changes after also refactoring the factories, i.e., comparing base with the version for which I'd like reviews at the moment. Here are the corresponding lab reports:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue992/data/issue992-v9-sat-eval/issue992-v9-sat-issue992-base-issue992-factories-v4-compare.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue992/data/issue992-v9-anytime-eval/issue992-v9-anytime-issue992-base-issue992-factories-v4-compare.html
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue992/data/issue992-v9-opt-eval/issue992-v9-opt-issue992-base-issue992-factories-v8-compare.html
Note that they use a timeout of only 5 minutes as opposed to our usual 30 minutes. Within this bound, I would summarize the changes as "all looks good, if not better". Here are some noteworthy observations:
- Landmark graph generation time goes down with almost all landmark factories, which results in higher coverage in some cases.
- LAMA-first solves 3-4 more tasks and yields higher overall scores throughout.
- LAMA (anytime) also solves 3 more tasks but finds slightly more expensive plans in summary. Plan costs for individual tasks increase in some cases and decrease in others, so there's no systematic change in behavior. Also, we've seen before that there is some variation even when rerunning the same configuration twice, so I'm not worried about this.
- Also, the number of landmarks and orderings remains the same for all factories, so the difference in costs within LAMA cannot stem from a change like this. (I've verified for a small sample that indeed the landmark graphs are identical and not only the number of landmarks and orderings.)
- Our optimal lm_hm(m=2) configuration with uniform cost partitioning solves 24 more tasks. This is almost certainly due to the speedup in hm-landmark generation because we're using `vector` instead of `list` now (see the sketch after this list for the intuition). (For reference: https://ai.dmi.unibas.ch/_visualizer/?c=eJyNUEluwzAM_Iqhcy057aXOO3o3GJuxBWgDRW8J8vdSNdq0tx45CznDu8LAZDF3zmZW50rZnGds29d6aWvn68lX1Q90hZ5jUdfL-0Gql0oROmC7oLiZZhRkA5all5kLppwfCdLUjRiQRBlDx9ZjsW7Z3orm1DSNjPt_fftfX6KYkLjUuFr3ZZ6YUz4bA1YP3uo52Atk3U-mw020sidwFtYMcQ0r0GC-W5oBGMzvN8TENS7gzPOOZiC93Y76KRJ_7KncfRNgQcoStqQ46aZI-jlz9J23oVttyMLcH08YxpFwBPnswTw-Ab_9iwE%3D)
- Using CEGAR seeded with abstractions based on landmarks seems to suffer slightly, solving 2 fewer tasks. In one case, time is the issue; in the other, it is memory. So I'm not quite sure what the take-home message is here. To analyze this further, maybe we should first run a full 30-minute experiment to see whether the changes still make a difference or whether we just got unlucky here.
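As a side note on the `vector` vs. `list` point above: the following is just a minimal, self-contained C++ sketch of why contiguous storage tends to traverse faster than a linked list (cache locality, prefetching). It is a made-up micro-benchmark for illustration, not code from the pull request, and the actual speedup in the hm factory will of course depend on its access patterns.

```cpp
// Generic illustration (assumed setup, not code from the pull request):
// iterating a std::vector touches contiguous memory and is cache- and
// prefetch-friendly, whereas a std::list chases scattered heap pointers.
#include <chrono>
#include <iostream>
#include <list>
#include <vector>

// Sum all elements of an int container; the traversal is what we time.
template<typename Container>
long long sum_elements(const Container &container) {
    long long sum = 0;
    for (int value : container)
        sum += value;
    return sum;
}

int main() {
    const int num_elements = 10000000;
    std::vector<int> vec(num_elements, 1);
    std::list<int> lst(num_elements, 1);

    auto time_traversal = [](const char *label, auto &&traverse) {
        auto start = std::chrono::steady_clock::now();
        long long result = traverse();
        auto end = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
        std::cout << label << ": sum = " << result << ", " << ms << " ms\n";
    };

    time_traversal("vector", [&] {return sum_elements(vec);});
    time_traversal("list", [&] {return sum_elements(lst);});
    return 0;
}
```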
I'm happy to discuss the changes further if you think it's necessary. I also don't consider these evaluations the final justification for merging; they're mainly here to convince people that I didn't break the code entirely (because from the diff alone it might look like I did).