Title make axiom evaluator faster, especially when there are no axioms
Priority feature Status resolved
Superseder Nosy List florian, jendrik, malte
Assigned To jendrik Keywords
Optional summary

Created on 2014-02-10.17:46:39 by malte, last changed by jendrik.

msg3075 (view) Author: jendrik Date: 2014-03-16.13:12:41
Thanks for the reminder, I should have thought of this myself. The documentation 
is now pushed.
msg3070 (view) Author: malte Date: 2014-03-15.22:03:48
experiments/issue420/ doesn't work for me (I'm on the
newest lab version):

2014-03-15 22:00:44,147 INFO     Building 200 runs
2014-03-15 22:00:44,148 CRITICAL The required resource can not be found:

I think I know why, but we cannot close the issue with the experiment file in
this state. The point of having the experiment scripts in the repository is so
that it's possible to understand and ideally repeat them later. So if there are
additional setup steps required to run the experiments, they should be
documented in the experiment scripts. Can you add documentation for this? It's
fine to add it directly in the default branch.
msg3068 (view) Author: jendrik Date: 2014-03-15.21:09:40
Merged and pushed.

I would definitely like to improve some things in lab in order to make 
experiments using easier. Only afterwards we should discuss best 
practices for experiments. I'll move this up in my todo list.

The randomization has been implemented, but needs further testing, before it can 
be enabled by default. You can use it by adding randomize_task_order=True in the 
MaiaEnvironment constructor.
msg3067 (view) Author: malte Date: 2014-03-15.20:13:43
Thanks! I don't mind seeing this merged if you would like to merge it now.

Please make sure to add the regression experiment to the repository, too. I'd
like to start collecting some info on "best practices" experiments for certain
things to help running experiments in the future. (Especially for those of us
who are less familiar with lab.)

It's worrying that the runtime for the tidybot tasks vary so much (56-122
seconds for the same configuration and task). It's also a bit worrying that a
task that was solved in the previous experiment is solved in 0 out of 50 cases
in the new experiment. Either we were extremely lucky there, or we have some
systematic effects going on.

Has the randomization of task order in lab been implemented yet? If yes, it may
be a good idea to activate it by default, because otherwise we can get
systematic errors through interference with other things on the grid, depending
on which other things are run at what time on the grid. If not, I have some
ideas of how to implement it in a minimally invasive way which we might want to
msg3066 (view) Author: jendrik Date: 2014-03-15.18:47:21
I have run an experiment with 50 copies of each of the two tasks:

In both cases where we previously had a regression, the new version now solves 
the task more often, so I'd say that the regressions were due to noise.

I think we don't need to run experiments for your change, since you tested it 
thoroughly on one task already and we probably both think that the change can 
only make things faster.

Can I merge this?
msg3059 (view) Author: malte Date: 2014-03-14.20:10:24
Pushed (but not tagged).

I see a 37.60 sec => 36.22 sec improvement on PSR-Large #9 with blind search.
These numbers cannot be directly compared to my previous message because this is
now with packed states, and only a single task rather than the sum of ten tasks.
The improvement is a bit underwhelming because the packed states slow things
down and may also hurt cache locality. So it could be that they reduce not just
the improvement in relative terms but also in absolute terms. I hope we'll get
rid of this overhead in issue348.

If we want to run experiments on this second version, it's only worth
considering domains with axioms. The main configurations worth checking out are
blind search, and CG heuristic with or without preferred operators and with lazy
or eager search. I'm not sure if it's worth running detailed experiments on
this, though. Your decision. :-)

I'd still be interested in the stuff mentioned in msg3055 before we close this.
msg3058 (view) Author: malte Date: 2014-03-14.19:37:35
Thanks for the Valgrind numbers, Jendrik! They confirm that the axiom evaluator
can be a major bottleneck in domains with axioms, and that the packed states are
a bit problematic here.

As mentioned in the first message for this issue, I had a look at what we might
do to reduce the overhead of reallocations inside the queue used by the
evaluator in the nontrivial case (where we do actually have axioms). I'll
describe the findings in some detail because they are similar to what I've seen
in other time-critical parts of our code (e.g. LM-Cut computations, relaxation
heuristics in general), so maybe there is something interesting to learn.

I evaluated the different variations of the algorithm by running astar(blind())
on 10 of the 12 first PSR-Large problems (those solved within less than a
minute). I was expecting the different optimized versions of the code to give
only small runtime improvements and was worried about noise, so I ran everything
10 times to be able to judge the impact of noise vs. algorithmic differences.
For each run, I looked at the runtime of each individual benchmark and at the
sum of the 10 benchmarks. I report the data as follows:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04

=> For the sum of all 10 benchmarks, with the "old" algorithm (= original axiom
evaluator code), the min runtime in the 10 runs was 63.94 seconds, the average
was 63.98, the maximum 64.04, the span (difference between max and min) 0.10
seconds and the estimated standard deviation was 0.04. It turns out that all
spans and standard deviations were low enough to trust the averages to be
reliable (at least for the tested machine).

All experiments are based on the before-issue214 version of the master branch
because I didn't want the overhead of the packed states to dominate the outcome.
(The hope is that we will be able to eliminate most of this overhead.)

The main idea was to reduce dynamic memory allocations that happen as the queue
grows. A simple way to get rid of this overhead is to make the queue a static
variable of the axiom evaluation method and clear it whenever the method is
entered (or left). We cannot really use such as static variable in the long run
because it's global state we need to get rid eventually; instead we could use an
instance variable of the class. But making things static is easier for a quick
test, since it's only a two-line change in the cc file and none in the header
file. So I stared by using static variables instead of instance variables. (I'm
elaborating this point because it turned out that the static option can be quite
a bit faster than the instance variable.)

I was surprised to see that the queue was implemented as a deque, not a vector,
maybe because FIFO order appears in some way "more logical" than LIFO. But the
outcome is really independent of FIFO vs. LIFO, so there's no particular reason
to use a deque. Deques manage memory quite differently from vectors; they don't
move things around in memory as they grow. Hence there is less memory management
to amortize over time, and the impact of making things static would be expected
to be lower for a deque than for a vector.

Here is the result of making the deque static:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11

So a mild advantage for static-queue, around 1%. But I thought that FIFO would
be a bad idea, since LIFO should generally be more efficient, even when using a
deque, due to more localized memory access patterns. So the next version used
push_back instead of push_front, in addition to making the queue static:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11
sum  push-back                  60.78  60.92  61.08   0.30   0.09

Around 5% faster, quite nice for such a tiny change. But is a deque really a
good idea here? Vectors are generally faster. Let's switch to a vector, but go
back to a local variable instead of a static variable so that we can measure the
impact of moving to a static variable for vectors:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11
sum  push-back                  60.78  60.92  61.08   0.30   0.09
sum  vector                     71.70  71.77  71.86   0.16   0.05

Urgh, that's slow! But this may be because vectors are more heavily affected by
the memory allocations that we won't have to perform when using a static queue.
So let's make it static again:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11
sum  push-back                  60.78  60.92  61.08   0.30   0.09
sum  vector                     71.70  71.77  71.86   0.16   0.05
sum  static-vector              56.40  56.51  56.74   0.34   0.10

Very nice! Even though using a local variable vector is more than 10% slower
than what we started from, using a static vector is actually more than 10%
faster! OK, now clean things up and turn the queue into an instance variable:

TASK ALGORITHM                    MIN    AVG    MAX   SPAN    DEV
sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11
sum  push-back                  60.78  60.92  61.08   0.30   0.09
sum  vector                     71.70  71.77  71.86   0.16   0.05
sum  static-vector              56.40  56.51  56.74   0.34   0.10
sum  instance-vector            59.74  59.93  60.40   0.66   0.18

Oops! I really didn't expect to see such a decline in performance. All that
happens here is one more level of indirection when accessing the vector. It's
still better than what we started with, of course, but it's a pity. I don't
think there's much we can do about it, though. (On the other hand, this does
suggest that we might see benefits by getting rid of another level of
indirection that we have by using a vector instead of a plain C pointer. There
may be more savings here by using a more low-level queue. But I didn't want to
complicate things to this extent.)

I thought I was done here, but then I noticed that on the last axiom layer, the
second half of the main for loop (which seeds the queue for the next iteration
based on the negation by failure info) isn't really needed. I thought there
might be quite many variables in this last layer in something like PSR-Large,
and it might be worth saving this effort by skipping this part of the loop in
the last iteration. The result:

sum  old                        63.94  63.98  64.04   0.10   0.04
sum  static-deque               63.30  63.42  63.72   0.42   0.11
sum  push-back                  60.78  60.92  61.08   0.30   0.09
sum  vector                     71.70  71.77  71.86   0.16   0.05
sum  static-vector              56.40  56.51  56.74   0.34   0.10
sum  instance-vector            59.74  59.93  60.40   0.66   0.18
sum  no-last-nbf                57.58  57.68  57.94   0.36   0.10

Indeed another substantial improvement of around 2% compared to the previous
version obtained by a simple if test! At this point I realized that it would be
cleaner to simply not store the negation-by-failure info for the last layer.
Then, instead of testing whether we are in the last layer, we could simply
iterate over an empty vector in the last iteration, getting rid of the special
case in the loop and avoiding to construct information we will never use. But
then I noticed that this optimization is already present in the code: the vector
over which we iterate in the last loop is empty, but still these iterations take
roughly 2% of the overall planner runtime! That was the second big surprise for
me, but whatever I tried when playing around with this confirmed this, so I left
it at this version.
msg3057 (view) Author: jendrik Date: 2014-03-14.18:03:26
YOu can find the branch at
msg3056 (view) Author: malte Date: 2014-03-14.17:24:51
PS: Can you send me a link to the repository where this issue is hosted? Once
I've done a few more experiments, I'd like to push some things related to the queue.
msg3055 (view) Author: malte Date: 2014-03-14.17:04:14
Well, if ever there was an issue where I expected not to lose performance, it
was this one. :-) Interesting that there is a reduction in coverage. Is there
any explanation for this, other than noisy runtime on the grid? (Can you check
the log files of the four relevant runs?)

For cases like this one, I think it makes sense to run additional tests on the
instances with strange behaviour (in this case, the two instances responsible
for loss in coverage), using multiple runs to reduce the noise.

We did this for another issue in the past (I can't remember which one), but the
results were hard to read because we used duplicate configurations (essentially)
to fake the multiple runs. We would get much nicer results if we could use
duplicated benchmarks instead, since the results would be grouped properly and
we wouldn't get dozens or hundreds of columns. This is probably overkill for
this issue, but as a general functionality, it would be interesting for other
issues in the future.
msg3054 (view) Author: jendrik Date: 2014-03-14.16:15:23
astar(blind()) on psr-large #09:

Reported are the total time (under valgrind) and the percentage of the time in the axiom 

Unpacked states: 1853.16s total time, 80.85% axiom evaluator.
Packed states  : 2407.98s total time, 96.65% axiom evaluator.
msg3053 (view) Author: jendrik Date: 2014-03-14.14:27:44
The experiment is finished and here are the results:

I'll make the requested profile for the last revision without issue214 merged and 
get back to you with the numbers.
msg3052 (view) Author: malte Date: 2014-03-14.13:20:45
OK, there certainly seems to be room for improvement there. I'll gather some
numbers with blind search, where the impact of the axiom evaluator should be
most pronounced. Will take a few hours.

Jendrik, how much work would it be to find out what the total runtime is and how
much time is spend inside the axiom evaluator in PSR-Large #09 with
astar(blind()), using kcachegrind? I'd be most interested in the numbers before
merging issue214 (but before and after is also fine).
msg3051 (view) Author: malte Date: 2014-03-14.12:28:49
It's not the creation of the queue that costs time, but rather the reallocations
inside the push_back calls. In the additive, FF and CG heuristic, IIRC a change
like this made rougly a 10% difference. Of course, 10% of 13.5% would still be
quite little, but then it would only be a tiny change. I'll have a look.
msg3050 (view) Author: jendrik Date: 2014-03-14.02:25:33
I profiled psr-large #30 with the config you proposed. AxiomEvaluator::evaluate() needs 13.5% of the time, 
but it seems the time for creating the queue is negligible since it doesn't even show up in the call graph 
and I cannot find any time for it in the numbers.

I have started an experiment that evaluates the simple change (directly exiting AxiomEvaluator::evaluate() 
on tasks without axioms) for blind() and lmcut() on OPTIMAL_WITH_IPC11.
msg2947 (view) Author: malte Date: 2014-02-10.17:46:39
The axiom evaluator spends quite a bit of time pushing values into a deque even
if there are no axioms. Method AxiomEvaluator::evaluate should grow the
following test at the very start:

if (g_axioms.empty())

(I've seen a method "has_axioms" in an unmerged branch. This would of course be
nicer to use here, so we might want to add it now. But maybe this will give us
more merge conflicts later, so I have no preference whether or not to add this.)

Jendrik and Florian, can one of you make this change and evaluate it on the
optimal benchmark suite with lmcut(), blind() and the CEGAR heuristic (in the
config for which Jendrik produced the profile)? I'd expect that this gives quite
nice time savings for blind search, smaller savings for CEGAR and even smaller
savings for LM-Cut.

While we're at it, we might also try speeding up axiom evaluation for problems
that *do* have axioms by reducing memory management overhead. This could
possibly be done by reusing the queue between calls, in the same way that we
reuse the queue in additive_heuristic.{cc,h} and other places. (Reusing the
queue made quite a difference there.) Before doing this, it would make sense to
first run a profiler on a reasonably large psr-large file with something like
--heuristic h=cg() --search lazy_greedy(h,preferred=h) to see if
AxiomEvaluator::evaluate requires a substantial amount of time. Can someone run
this? (I cannot easily run kcachegrind on my machines.)
Date User Action Args
2014-03-16 13:12:41jendriksetstatus: chatting -> resolved
messages: + msg3075
2014-03-15 22:03:48maltesetstatus: resolved -> chatting
messages: + msg3070
2014-03-15 21:09:40jendriksetstatus: chatting -> resolved
messages: + msg3068
2014-03-15 20:13:43maltesetmessages: + msg3067
2014-03-15 18:47:21jendriksetmessages: + msg3066
2014-03-14 20:10:24maltesetmessages: + msg3059
2014-03-14 19:37:35maltesetmessages: + msg3058
title: make axiom evaluator faster when there are no axioms -> make axiom evaluator faster, especially when there are no axioms
2014-03-14 18:03:26jendriksetmessages: + msg3057
2014-03-14 17:24:51maltesetmessages: + msg3056
2014-03-14 17:04:14maltesetmessages: + msg3055
2014-03-14 16:15:23jendriksetmessages: + msg3054
2014-03-14 14:27:44jendriksetmessages: + msg3053
2014-03-14 13:20:45maltesetmessages: + msg3052
2014-03-14 12:28:49maltesetmessages: + msg3051
2014-03-14 02:25:33jendriksetmessages: + msg3050
2014-03-03 19:53:36jendriksetassignedto: jendrik
2014-02-10 17:46:39maltecreate