Issue791

Title Use task interface for StateRegistry
Priority feature Status resolved
Superseder Nosy List florian, jendrik, malte
Assigned To florian Keywords
Optional summary
This is part of issue509

Created on 2018-06-04.14:07:08 by florian, last changed by malte.

Summary
This is part of issue509
Messages
msg7603 (view) Author: malte Date: 2018-09-19.19:21:15
Excellent! Kill the beast!
msg7588 (view) Author: florian Date: 2018-09-19.16:41:48
I merged revision v2 and ignored the debug branch (it still lives in my
bitbucket repository if we need it for something).
msg7583 (view) Author: malte Date: 2018-09-19.16:16:59
> I have no doubt that the decrease in performance is significant. I meant
> unlucky with respect to the number of additional timeouts which is much lower
> in the latest experiments on the same revision.

I am not worried about that number for experiments like these, I'd always look
at score_total_time instead for experiments like these, for which timeouts don't
matter because the score converges to 0 anyway as the time limit is reached.
(Well, at least for 30 minute timeouts.)

> Anyway, I agree that this is the compiler's fault. Should I merge revision v2?
> It has the same code as debug8 but without the history of debug commits and
> experiments. It also still returns "const TaskProxy &" instead of "TaskProxy".

Pick what you like best.
msg7581 (view) Author: florian Date: 2018-09-19.16:10:20
I have no doubt that the decrease in performance is significant. I meant unlucky
with respect to the number of additional timeouts which is much lower in the
latest experiments on the same revision.

Anyway, I agree that this is the compiler's fault. Should I merge revision v2?
It has the same code as debug8 but without the history of debug commits and
experiments. It also still returns "const TaskProxy &" instead of "TaskProxy".
msg7573 (view) Author: malte Date: 2018-09-19.15:03:35
> May the variance is just very high and we got unlucky in the first experiment.

I don't think so, there is clearly a significant change here not explainable by
noise. For example, looking at the 32-bit results for score_total_time in

https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug8-compare.html

we have a negative change in 43 domains and a positive change in only 7 domains.
Generally speaking, I recommend looking at per-domain score_total_time results
in cases like these to get a good overall picture and cancel out some noise (by
aggregating tasks within one domain). If I calculated this correctly, 43:7 is
significant at a p level less than 0.000001.

The positive news is that in the 64-bit, things look quite a bit more random,
with 36 negative vs. 14 positive changes. This is still highly significant, but
at a p level of "only" around 0.025. Based on the 64-bit results and the fact
that the "bad" diff clearly looks like it seems to be the compiler's fault
rather than ours, I think we should accept these performance numbers.
msg7566 (view) Author: florian Date: 2018-09-19.13:46:04
I have more strange results from the 64bit experiment. I added more debug
versions to further pinpoint the change that influences performance.

debugbase: current version of default branch

debug5: initializes StateRegistry with a task (not a proxy) but then stores both
the task and the proxy and has accessors for both. The unpack method uses
get_task. All other changes of this issue are also contained in this version.

debug6: exactly like debug5 but the unpack method uses get_task_proxy.

debug7: removes the member "task" and the getter for it.

debug8: changes the ctor to use a proxy instead of the pointer

Between debugbase and debug7 the code only got faster (both in 32 bit and in 64
bit builds).
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug7-compare.html

From debug7 to debug8 performance dropped (the drop is less pronounced in the 64
bit build but still noticable).
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debug7-issue791-debug8-compare.html
The diff is tiny and doesn't look critical at all:
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug8..issue791-debug7#diff


Interestingly, the baseline also has a lot of noise. While the first experiment
had 530 timeouts in the release32 build (increasing to 539 after the patch),
this experiment has 538 (increasing to 540). Overall, the change from base to
debug8 is not that dramatic in this experiment, so we could just ignore it and
merge the issue. May the variance is just very high and we got unlucky in the
first experiment.
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue791-debug-opt4-issue791-debugbase-issue791-debug8-compare.html
msg7555 (view) Author: malte Date: 2018-09-19.09:11:42
Compiler hiccups aside, the only thing in the diff that might plausibly have an
impact on performance is the existence of the attribute StateRegistry::task,
which affects how the state registry is laid out in memory. (Normally, of course
we would not expect an unnecessary attribute to be useful, but it can
conceivably happen due to cache effects.)
msg7554 (view) Author: malte Date: 2018-09-19.09:06:30
I would have a look if the behaviour is reproducible with a 64-bit build. If
not, we should not bother further.
msg7553 (view) Author: florian Date: 2018-09-19.08:32:07
It gets even more curious: I tried what you suggested and saw no effect in the
experiment. Then I wanted to confirm that the diff I mentioned in msg7548 was
responsible. I created two new debug revisions:

debug5: initializes StateRegistry with a task (not a proxy) but then stores both
the task and the proxy and has accessors for both. The unpack method uses
get_task. All other changes of this issue are also contained in this version.

debug6: exactly like debug5 but the unpack method uses get_task_proxy

I also included debugbase (last revision on default branch) and debug4
(including all changes) in the experiment. So in total we have the following diffs:

debugbase -> debug5: most of the changes
debug5 -> debug6: only changes the way unpack accesses the task
debug6 -> debug4: switches ctor from task to task proxy and removes unused get_task

I expected to see the difference from debug5 to debug6 but surprisingly, debug6
is even a little bit faster than debugbase:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt3-issue791-debugbase-issue791-debug6-compare.html

The actual difference happens when we switch the constructor argument and remove
the dead code from debug6 to debug4:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt3-issue791-debug4-issue791-debug6-compare.html

Diff (displayed in the wrong way again: red is new, green is old):
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug6..issue791-debug4#diff

This code is only executed once, so I have no idea why it would have any
influence on performance.
msg7550 (view) Author: malte Date: 2018-09-18.22:57:46
Curious. "const TaskProxy &" is an extra level of indirection compared to "const
AbstractTask &". It would be unusual for this to make such a huge difference in
performance because I don't think we're using it in an inner loop. (Are we?)

But you could try return TaskProxy instead of const TaskProxy &. That should
have exactly the same representation in memory as const AbstractTask &.
msg7548 (view) Author: florian Date: 2018-09-18.22:48:42
The experiments on the intermediate versions are done. I used 4 revisions:

* base: the version on default where the branch started
* debug1: introduces the unpack() method of global states
* debug2: additionally introduces TaskProxy::create_state
* debug3: additionally uses task proxies in the state registry (equivalent to v2
below)


I'm focusing on timeouts in the release build but the reports also have the data
for out-of-memory errors and the debug build if you are interested. The display
of the diffs is not ideal because I committed the revisions in the wrong order.

base -> debug1 (adding unpack):
  -9 timeouts (code got faster)
 
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-base-issue791-debug1-compare.html
  actual diff:
   
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug1..issue791-base#diff
  diff ignoring the merge of other issues
   
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug1..f2c6ad7320b7#diff


debug1 -> debug2 (adding create_state):
  +1 timeout (no large change in speed)
 
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-debug1-issue791-debug2-compare.html
  diff:
   
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug2..issue791-debug1#diff

debug2 -> debug3 (adding get_task_proxy):
  +16 timeouts (code got slower)
 
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-debug-opt-issue791-debug2-issue791-debug3-compare.html
  diff (displayed the wrong way, the red part is in debug3, the green in debug2)
   
https://bitbucket.org/FlorianPommerening/downward-issues/compare/issue791-debug2..issue791-debug3#diff


Looks like the overhead comes from changing
    TaskProxy task_proxy(registry->get_task());
to
    TaskProxy task_proxy = registry->get_task_proxy();
in GlobalState::unpack()
and 
    const AbstractTask &task;
    const AbstractTask &get_task() const {
        return task;
    }
to
    TaskProxy task_proxy;
    const TaskProxy &get_task_proxy() const {
        return task_proxy;
    }
in StateRegistry.

I still do not see why.
msg7533 (view) Author: malte Date: 2018-09-18.18:44:13
To be fair, the (positive) effect on the debug config, which is also very
consistent, is equally puzzling. This may be one of these cases where the
compiler takes a different turn based on something more or less random. I'm
interested in the results for the intermediate configurations, but maybe there
isn't much we can do here.

Another thing you can try (although I don't know how easy it is) is run the
experiment with a different compiler.
msg7532 (view) Author: florian Date: 2018-09-18.18:26:00
Unfortunately, there was no difference after changing the `const State &&` thing:

http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-v2-opt-issue791-base-issue791-v2-compare.html

I tried running more local tests with versions of the code that are between the
default branch and the tip of the issue branch but the variance is so high that
it is hard to see any change. I'll put those intermediate tests on the grid as well.
msg7519 (view) Author: florian Date: 2018-09-18.15:16:08
I made a similar local test and reached the same conclusion. I started another
experiment on the grid. I have no idea where the high variance comes from, we
saw it in issue794 as well.
msg7515 (view) Author: malte Date: 2018-09-18.15:00:05
I made a little local test of the `const State &&` thing we discussed on
bitbucket, just running

./build.py && ./fast-downward.py
~/downward/benchmarks/barman-opt11-strips/pfile01-001.pddl --search "astar(blind())"

locally with and without the "const". Unfortunately, runtime fluctuates *a lot*
on my machine for this task, by more than 10%, so it's hard to tell a
difference. But for what it's worth, the best of 5 runs was 20.1368s without
const and 21.4119s with const. So maybe we're on to something.
msg7513 (view) Author: malte Date: 2018-09-18.14:01:24
I looked at the code again and didn't find anything that should have a major
performance impact. There are a few places where we move a State around, rather
than a vector<int>, and certainly the overhead of that is non-zero, but also it
should be small.

A before/after profile might reveal if we do any copying we don't expect, for
example by checking how many times we call the copy constructor of State as
opposed to the move constructor, etc.
msg7511 (view) Author: florian Date: 2018-09-18.13:57:57
I tried analyzing this with callgrind but the differences are so small, its hard
to see in a profile. I'm trying a different task now and will also try
cachegrind to see if cache efficiency could be responsible.
msg7510 (view) Author: malte Date: 2018-09-18.13:35:26
It is a small slowdown, but it is quite consistent. It would be good to figure
out what is going on here. Would a profile help?
msg7508 (view) Author: florian Date: 2018-09-18.12:37:23
Strangely, the debug mode is faster and the release mode is slower after the
change in the issue:

http://ai.cs.unibas.ch/_tmp_files/pommeren/issue791-v1-opt-issue791-base-issue791-v1-compare.html
msg7504 (view) Author: malte Date: 2018-09-18.09:26:25
Code looks clean to me. (I left two comments.) If it works correctly, it can be
merged. ;-) I'd appreciate a small experiment (5 minutes sufficient), and given
the new assertions, this should include a debug config.
msg7501 (view) Author: jendrik Date: 2018-09-18.00:07:52
I left two small comments. Code looks good to me.
msg7493 (view) Author: florian Date: 2018-09-17.16:03:25
I updated the pull request with the change from issue824.
msg7181 (view) Author: florian Date: 2018-06-04.20:31:32
Thanks Jendrik. I made the changes you suggested.
msg7176 (view) Author: jendrik Date: 2018-06-04.18:37:53
I left some minor comments on Bitbucket. I think this is an elegant solution.
msg7170 (view) Author: florian Date: 2018-06-04.17:51:35
Pull request #43 shows the difference to branch issue774 because this one should
be merged first. Once it is merged, I'll update the pull request.

https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/43
msg7167 (view) Author: florian Date: 2018-06-04.17:01:52
> The registry should also be generated from a task in the same way successor
> generator and similar classes are generated.

Maybe this was a bad idea. The successor generator, int packer, and axiom
evaluator can all be used through a const reference but the state registry
cannot. Returning it as a non-const reference from a task proxy would mean that
the method could not be const. Maybe its better to have a constructor of the
state registry that takes a task proxy as a parameter and have the search engine
create a registry internally.
msg7157 (view) Author: florian Date: 2018-06-04.14:07:08
As part of using the task interface everywhere, we want to switch the
StateRegistry to the new interface. This will be mostly done with issue774 but
some issues will remain. For example, the registry should use a task proxy
internally, not a shared pointer to an abstract task. The registry should also
be generated from a task in the same way successor generator and similar classes
are generated.

In this issue, we want to clean up the registry and fix these remaining issues.
History
Date User Action Args
2018-09-19 19:21:15maltesetmessages: + msg7603
2018-09-19 16:41:49floriansetstatus: reviewing -> resolved
messages: + msg7588
2018-09-19 16:16:59maltesetmessages: + msg7583
2018-09-19 16:10:20floriansetmessages: + msg7581
2018-09-19 15:03:35maltesetmessages: + msg7573
2018-09-19 13:46:04floriansetmessages: + msg7566
2018-09-19 09:11:42maltesetmessages: + msg7555
2018-09-19 09:06:30maltesetmessages: + msg7554
2018-09-19 08:32:07floriansetmessages: + msg7553
2018-09-18 22:57:46maltesetmessages: + msg7550
2018-09-18 22:48:42floriansetmessages: + msg7548
2018-09-18 18:44:13maltesetmessages: + msg7533
2018-09-18 18:26:00floriansetmessages: + msg7532
2018-09-18 15:16:08floriansetmessages: + msg7519
2018-09-18 15:00:05maltesetmessages: + msg7515
2018-09-18 14:01:24maltesetmessages: + msg7513
2018-09-18 13:57:57floriansetmessages: + msg7511
2018-09-18 13:35:26maltesetmessages: + msg7510
2018-09-18 12:37:23floriansetmessages: + msg7508
2018-09-18 09:26:25maltesetmessages: + msg7504
2018-09-18 00:07:52jendriksetmessages: + msg7501
2018-09-17 16:03:25floriansetmessages: + msg7493
2018-06-04 20:31:32floriansetmessages: + msg7181
2018-06-04 18:37:53jendriksetstatus: in-progress -> reviewing
messages: + msg7176
2018-06-04 17:51:35floriansetmessages: + msg7170
2018-06-04 17:01:52floriansetmessages: + msg7167
2018-06-04 14:07:08floriancreate