Issue 791: Use task interface for StateRegistry

Title	Use task interface for StateRegistry
Priority	feature	Status	resolved
Superseder		Nosy List	florian, jendrik, malte
Assigned To	florian	Keywords
Optional summary	This is part of issue509

Created on 2018-06-04.14:07:08 by florian, last changed by malte.

Summary
This is part of issue509

Messages
msg7603 (view)	Author: malte	Date: 2018-09-19.19:21:15
Excellent! Kill the beast!
msg7588 (view)	Author: florian	Date: 2018-09-19.16:41:48
I merged revision v2 and ignored the debug branch (it still lives in my bitbucket repository if we need it for something).
msg7583 (view)	Author: malte	Date: 2018-09-19.16:16:59
> I have no doubt that the decrease in performance is significant. I meant > unlucky with respect to the number of additional timeouts which is much lower > in the latest experiments on the same revision. I am not worried about that number for experiments like these, I'd always look at score_total_time instead for experiments like these, for which timeouts don't matter because the score converges to 0 anyway as the time limit is reached. (Well, at least for 30 minute timeouts.) > Anyway, I agree that this is the compiler's fault. Should I merge revision v2? > It has the same code as debug8 but without the history of debug commits and > experiments. It also still returns "const TaskProxy &" instead of "TaskProxy". Pick what you like best.
msg7581 (view)	Author: florian	Date: 2018-09-19.16:10:20
I have no doubt that the decrease in performance is significant. I meant unlucky with respect to the number of additional timeouts which is much lower in the latest experiments on the same revision. Anyway, I agree that this is the compiler's fault. Should I merge revision v2? It has the same code as debug8 but without the history of debug commits and experiments. It also still returns "const TaskProxy &" instead of "TaskProxy".
msg7573 (view)	Author: malte	Date: 2018-09-19.15:03:35
> May the variance is just very high and we got unlucky in the first experiment. I don't think so, there is clearly a significant change here not explainable by noise. For example, looking at the 32-bit results for score_total_time in https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug8-compare.html we have a negative change in 43 domains and a positive change in only 7 domains. Generally speaking, I recommend looking at per-domain score_total_time results in cases like these to get a good overall picture and cancel out some noise (by aggregating tasks within one domain). If I calculated this correctly, 43:7 is significant at a p level less than 0.000001. The positive news is that in the 64-bit, things look quite a bit more random, with 36 negative vs. 14 positive changes. This is still highly significant, but at a p level of "only" around 0.025. Based on the 64-bit results and the fact that the "bad" diff clearly looks like it seems to be the compiler's fault rather than ours, I think we should accept these performance numbers.
msg7566 (view)	Author: florian	Date: 2018-09-19.13:46:04
I have more strange results from the 64bit experiment. I added more debug versions to further pinpoint the change that influences performance. debugbase: current version of default branch debug5: initializes StateRegistry with a task (not a proxy) but then stores both the task and the proxy and has accessors for both. The unpack method uses get_task. All other changes of this issue are also contained in this version. debug6: exactly like debug5 but the unpack method uses get_task_proxy. debug7: removes the member "task" and the getter for it. debug8: changes the ctor to use a proxy instead of the pointer Between debugbase and debug7 the code only got faster (both in 32 bit and in 64 bit builds). https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug7-compare.html From debug7 to debug8 performance dropped (the drop is less pronounced in the 64 bit build but still noticable). https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debug7-issue791-debug8-compare.html The diff is tiny and doesn't look critical at all: https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug8..issue791-debug7#diff Interestingly, the baseline also has a lot of noise. While the first experiment had 530 timeouts in the release32 build (increasing to 539 after the patch), this experiment has 538 (increasing to 540). Overall, the change from base to debug8 is not that dramatic in this experiment, so we could just ignore it and merge the issue. May the variance is just very high and we got unlucky in the first experiment. https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug8-compare.html
msg7555 (view)	Author: malte	Date: 2018-09-19.09:11:42
Compiler hiccups aside, the only thing in the diff that might plausibly have an impact on performance is the existence of the attribute StateRegistry::task, which affects how the state registry is laid out in memory. (Normally, of course we would not expect an unnecessary attribute to be useful, but it can conceivably happen due to cache effects.)
msg7554 (view)	Author: malte	Date: 2018-09-19.09:06:30
I would have a look if the behaviour is reproducible with a 64-bit build. If not, we should not bother further.
msg7553 (view)	Author: florian	Date: 2018-09-19.08:32:07
It gets even more curious: I tried what you suggested and saw no effect in the experiment. Then I wanted to confirm that the diff I mentioned in msg7548 was responsible. I created two new debug revisions: debug5: initializes StateRegistry with a task (not a proxy) but then stores both the task and the proxy and has accessors for both. The unpack method uses get_task. All other changes of this issue are also contained in this version. debug6: exactly like debug5 but the unpack method uses get_task_proxy I also included debugbase (last revision on default branch) and debug4 (including all changes) in the experiment. So in total we have the following diffs: debugbase -> debug5: most of the changes debug5 -> debug6: only changes the way unpack accesses the task debug6 -> debug4: switches ctor from task to task proxy and removes unused get_task I expected to see the difference from debug5 to debug6 but surprisingly, debug6 is even a little bit faster than debugbase: http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt3-issue791-debugbase-issue791-debug6-compare.html The actual difference happens when we switch the constructor argument and remove the dead code from debug6 to debug4: http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt3-issue791-debug4-issue791-debug6-compare.html Diff (displayed in the wrong way again: red is new, green is old): https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug6..issue791-debug4#diff This code is only executed once, so I have no idea why it would have any influence on performance.
msg7550 (view)	Author: malte	Date: 2018-09-18.22:57:46
Curious. "const TaskProxy &" is an extra level of indirection compared to "const AbstractTask &". It would be unusual for this to make such a huge difference in performance because I don't think we're using it in an inner loop. (Are we?) But you could try return TaskProxy instead of const TaskProxy &. That should have exactly the same representation in memory as const AbstractTask &.
msg7548 (view)	Author: florian	Date: 2018-09-18.22:48:42
The experiments on the intermediate versions are done. I used 4 revisions: * base: the version on default where the branch started * debug1: introduces the unpack() method of global states * debug2: additionally introduces TaskProxy::create_state * debug3: additionally uses task proxies in the state registry (equivalent to v2 below) I'm focusing on timeouts in the release build but the reports also have the data for out-of-memory errors and the debug build if you are interested. The display of the diffs is not ideal because I committed the revisions in the wrong order. base -> debug1 (adding unpack): -9 timeouts (code got faster) http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-base-issue791-debug1-compare.html actual diff: https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug1..issue791-base#diff diff ignoring the merge of other issues https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug1..f2c6ad7320b7#diff debug1 -> debug2 (adding create_state): +1 timeout (no large change in speed) http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-debug1-issue791-debug2-compare.html diff: https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug2..issue791-debug1#diff debug2 -> debug3 (adding get_task_proxy): +16 timeouts (code got slower) http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-debug2-issue791-debug3-compare.html diff (displayed the wrong way, the red part is in debug3, the green in debug2) https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug2..issue791-debug3#diff Looks like the overhead comes from changing TaskProxy task_proxy(registry->get_task()); to TaskProxy task_proxy = registry->get_task_proxy(); in GlobalState::unpack() and const AbstractTask &task; const AbstractTask &get_task() const { return task; } to TaskProxy task_proxy; const TaskProxy &get_task_proxy() const { return task_proxy; } in StateRegistry. I still do not see why.
msg7533 (view)	Author: malte	Date: 2018-09-18.18:44:13
To be fair, the (positive) effect on the debug config, which is also very consistent, is equally puzzling. This may be one of these cases where the compiler takes a different turn based on something more or less random. I'm interested in the results for the intermediate configurations, but maybe there isn't much we can do here. Another thing you can try (although I don't know how easy it is) is run the experiment with a different compiler.
msg7532 (view)	Author: florian	Date: 2018-09-18.18:26:00
Unfortunately, there was no difference after changing the `const State &&` thing: http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-v2-opt-issue791-base-issue791-v2-compare.html I tried running more local tests with versions of the code that are between the default branch and the tip of the issue branch but the variance is so high that it is hard to see any change. I'll put those intermediate tests on the grid as well.
msg7519 (view)	Author: florian	Date: 2018-09-18.15:16:08
I made a similar local test and reached the same conclusion. I started another experiment on the grid. I have no idea where the high variance comes from, we saw it in issue794 as well.
msg7515 (view)	Author: malte	Date: 2018-09-18.15:00:05
I made a little local test of the `const State &&` thing we discussed on bitbucket, just running ./build.py && ./fast-downward.py ~/downward/benchmarks/barman-opt11-strips/pfile01-001.pddl --search "astar(blind())" locally with and without the "const". Unfortunately, runtime fluctuates a lot on my machine for this task, by more than 10%, so it's hard to tell a difference. But for what it's worth, the best of 5 runs was 20.1368s without const and 21.4119s with const. So maybe we're on to something.
msg7513 (view)	Author: malte	Date: 2018-09-18.14:01:24
I looked at the code again and didn't find anything that should have a major performance impact. There are a few places where we move a State around, rather than a vector<int>, and certainly the overhead of that is non-zero, but also it should be small. A before/after profile might reveal if we do any copying we don't expect, for example by checking how many times we call the copy constructor of State as opposed to the move constructor, etc.
msg7511 (view)	Author: florian	Date: 2018-09-18.13:57:57
I tried analyzing this with callgrind but the differences are so small, its hard to see in a profile. I'm trying a different task now and will also try cachegrind to see if cache efficiency could be responsible.
msg7510 (view)	Author: malte	Date: 2018-09-18.13:35:26
It is a small slowdown, but it is quite consistent. It would be good to figure out what is going on here. Would a profile help?
msg7508 (view)	Author: florian	Date: 2018-09-18.12:37:23
Strangely, the debug mode is faster and the release mode is slower after the change in the issue: http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-v1-opt-issue791-base-issue791-v1-compare.html
msg7504 (view)	Author: malte	Date: 2018-09-18.09:26:25
Code looks clean to me. (I left two comments.) If it works correctly, it can be merged. ;-) I'd appreciate a small experiment (5 minutes sufficient), and given the new assertions, this should include a debug config.
msg7501 (view)	Author: jendrik	Date: 2018-09-18.00:07:52
I left two small comments. Code looks good to me.
msg7493 (view)	Author: florian	Date: 2018-09-17.16:03:25
I updated the pull request with the change from issue824.
msg7181 (view)	Author: florian	Date: 2018-06-04.20:31:32
Thanks Jendrik. I made the changes you suggested.
msg7176 (view)	Author: jendrik	Date: 2018-06-04.18:37:53
I left some minor comments on Bitbucket. I think this is an elegant solution.
msg7170 (view)	Author: florian	Date: 2018-06-04.17:51:35
Pull request #43 shows the difference to branch issue774 because this one should be merged first. Once it is merged, I'll update the pull request. https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/43
msg7167 (view)	Author: florian	Date: 2018-06-04.17:01:52
> The registry should also be generated from a task in the same way successor > generator and similar classes are generated. Maybe this was a bad idea. The successor generator, int packer, and axiom evaluator can all be used through a const reference but the state registry cannot. Returning it as a non-const reference from a task proxy would mean that the method could not be const. Maybe its better to have a constructor of the state registry that takes a task proxy as a parameter and have the search engine create a registry internally.
msg7157 (view)	Author: florian	Date: 2018-06-04.14:07:08
As part of using the task interface everywhere, we want to switch the StateRegistry to the new interface. This will be mostly done with issue774 but some issues will remain. For example, the registry should use a task proxy internally, not a shared pointer to an abstract task. The registry should also be generated from a task in the same way successor generator and similar classes are generated. In this issue, we want to clean up the registry and fix these remaining issues.

History
Date	User	Action	Args
2018-09-19 19:21:15	malte	set	messages: + msg7603
2018-09-19 16:41:49	florian	set	status: reviewing -> resolved messages: + msg7588
2018-09-19 16:16:59	malte	set	messages: + msg7583
2018-09-19 16:10:20	florian	set	messages: + msg7581
2018-09-19 15:03:35	malte	set	messages: + msg7573
2018-09-19 13:46:04	florian	set	messages: + msg7566
2018-09-19 09:11:42	malte	set	messages: + msg7555
2018-09-19 09:06:30	malte	set	messages: + msg7554
2018-09-19 08:32:07	florian	set	messages: + msg7553
2018-09-18 22:57:46	malte	set	messages: + msg7550
2018-09-18 22:48:42	florian	set	messages: + msg7548
2018-09-18 18:44:13	malte	set	messages: + msg7533
2018-09-18 18:26:00	florian	set	messages: + msg7532
2018-09-18 15:16:08	florian	set	messages: + msg7519
2018-09-18 15:00:05	malte	set	messages: + msg7515
2018-09-18 14:01:24	malte	set	messages: + msg7513
2018-09-18 13:57:57	florian	set	messages: + msg7511
2018-09-18 13:35:26	malte	set	messages: + msg7510
2018-09-18 12:37:23	florian	set	messages: + msg7508
2018-09-18 09:26:25	malte	set	messages: + msg7504
2018-09-18 00:07:52	jendrik	set	messages: + msg7501
2018-09-17 16:03:25	florian	set	messages: + msg7493
2018-06-04 20:31:32	florian	set	messages: + msg7181
2018-06-04 18:37:53	jendrik	set	status: in-progress -> reviewing messages: + msg7176
2018-06-04 17:51:35	florian	set	messages: + msg7170
2018-06-04 17:01:52	florian	set	messages: + msg7167
2018-06-04 14:07:08	florian	create

Issue791