Issue 420: make axiom evaluator faster, especially when there are no axioms

Title	make axiom evaluator faster, especially when there are no axioms
Priority	feature	Status	resolved
Superseder		Nosy List	florian, jendrik, malte
Assigned To	jendrik	Keywords
Optional summary

Created on 2014-02-10.17:46:39 by malte, last changed by jendrik.

Messages
msg3075 (view)	Author: jendrik	Date: 2014-03-16.13:12:41
Thanks for the reminder, I should have thought of this myself. The documentation is now pushed.
msg3070 (view)	Author: malte	Date: 2014-03-15.22:03:48
experiments/issue420/issue420-v1-regressions.py doesn't work for me (I'm on the newest lab version): 2014-03-15 22:00:44,147 INFO Building 200 runs 2014-03-15 22:00:44,148 CRITICAL The required resource can not be found: /home/helmert/downward/testme-lab-random-order/benchmarks/tidybot-opt11-strips/p14-00.pddl aborting I think I know why, but we cannot close the issue with the experiment file in this state. The point of having the experiment scripts in the repository is so that it's possible to understand and ideally repeat them later. So if there are additional setup steps required to run the experiments, they should be documented in the experiment scripts. Can you add documentation for this? It's fine to add it directly in the default branch.
msg3068 (view)	Author: jendrik	Date: 2014-03-15.21:09:40
Merged and pushed. I would definitely like to improve some things in lab in order to make experiments using common_setup.py easier. Only afterwards we should discuss best practices for experiments. I'll move this up in my todo list. The randomization has been implemented, but needs further testing, before it can be enabled by default. You can use it by adding randomize_task_order=True in the MaiaEnvironment constructor.
msg3067 (view)	Author: malte	Date: 2014-03-15.20:13:43
Thanks! I don't mind seeing this merged if you would like to merge it now. Please make sure to add the regression experiment to the repository, too. I'd like to start collecting some info on "best practices" experiments for certain things to help running experiments in the future. (Especially for those of us who are less familiar with lab.) It's worrying that the runtime for the tidybot tasks vary so much (56-122 seconds for the same configuration and task). It's also a bit worrying that a task that was solved in the previous experiment is solved in 0 out of 50 cases in the new experiment. Either we were extremely lucky there, or we have some systematic effects going on. Has the randomization of task order in lab been implemented yet? If yes, it may be a good idea to activate it by default, because otherwise we can get systematic errors through interference with other things on the grid, depending on which other things are run at what time on the grid. If not, I have some ideas of how to implement it in a minimally invasive way which we might want to discuss.
msg3066 (view)	Author: jendrik	Date: 2014-03-15.18:47:21
I have run an experiment with 50 copies of each of the two tasks: http://ai.cs.unibas.ch/_tmp_files/seipp/issue420-issue420-v1-regressions- compare.html In both cases where we previously had a regression, the new version now solves the task more often, so I'd say that the regressions were due to noise. I think we don't need to run experiments for your change, since you tested it thoroughly on one task already and we probably both think that the change can only make things faster. Can I merge this?
msg3059 (view)	Author: malte	Date: 2014-03-14.20:10:24
Pushed (but not tagged). I see a 37.60 sec => 36.22 sec improvement on PSR-Large #9 with blind search. These numbers cannot be directly compared to my previous message because this is now with packed states, and only a single task rather than the sum of ten tasks. The improvement is a bit underwhelming because the packed states slow things down and may also hurt cache locality. So it could be that they reduce not just the improvement in relative terms but also in absolute terms. I hope we'll get rid of this overhead in issue348. If we want to run experiments on this second version, it's only worth considering domains with axioms. The main configurations worth checking out are blind search, and CG heuristic with or without preferred operators and with lazy or eager search. I'm not sure if it's worth running detailed experiments on this, though. Your decision. :-) I'd still be interested in the stuff mentioned in msg3055 before we close this.
msg3058 (view)	Author: malte	Date: 2014-03-14.19:37:35
Thanks for the Valgrind numbers, Jendrik! They confirm that the axiom evaluator can be a major bottleneck in domains with axioms, and that the packed states are a bit problematic here. As mentioned in the first message for this issue, I had a look at what we might do to reduce the overhead of reallocations inside the queue used by the evaluator in the nontrivial case (where we do actually have axioms). I'll describe the findings in some detail because they are similar to what I've seen in other time-critical parts of our code (e.g. LM-Cut computations, relaxation heuristics in general), so maybe there is something interesting to learn. I evaluated the different variations of the algorithm by running astar(blind()) on 10 of the 12 first PSR-Large problems (those solved within less than a minute). I was expecting the different optimized versions of the code to give only small runtime improvements and was worried about noise, so I ran everything 10 times to be able to judge the impact of noise vs. algorithmic differences. For each run, I looked at the runtime of each individual benchmark and at the sum of the 10 benchmarks. I report the data as follows: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 => For the sum of all 10 benchmarks, with the "old" algorithm (= original axiom evaluator code), the min runtime in the 10 runs was 63.94 seconds, the average was 63.98, the maximum 64.04, the span (difference between max and min) 0.10 seconds and the estimated standard deviation was 0.04. It turns out that all spans and standard deviations were low enough to trust the averages to be reliable (at least for the tested machine). All experiments are based on the before-issue214 version of the master branch because I didn't want the overhead of the packed states to dominate the outcome. (The hope is that we will be able to eliminate most of this overhead.) The main idea was to reduce dynamic memory allocations that happen as the queue grows. A simple way to get rid of this overhead is to make the queue a static variable of the axiom evaluation method and clear it whenever the method is entered (or left). We cannot really use such as static variable in the long run because it's global state we need to get rid eventually; instead we could use an instance variable of the class. But making things static is easier for a quick test, since it's only a two-line change in the cc file and none in the header file. So I stared by using static variables instead of instance variables. (I'm elaborating this point because it turned out that the static option can be quite a bit faster than the instance variable.) I was surprised to see that the queue was implemented as a deque, not a vector, maybe because FIFO order appears in some way "more logical" than LIFO. But the outcome is really independent of FIFO vs. LIFO, so there's no particular reason to use a deque. Deques manage memory quite differently from vectors; they don't move things around in memory as they grow. Hence there is less memory management to amortize over time, and the impact of making things static would be expected to be lower for a deque than for a vector. Here is the result of making the deque static: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 So a mild advantage for static-queue, around 1%. But I thought that FIFO would be a bad idea, since LIFO should generally be more efficient, even when using a deque, due to more localized memory access patterns. So the next version used push_back instead of push_front, in addition to making the queue static: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 sum push-back 60.78 60.92 61.08 0.30 0.09 Around 5% faster, quite nice for such a tiny change. But is a deque really a good idea here? Vectors are generally faster. Let's switch to a vector, but go back to a local variable instead of a static variable so that we can measure the impact of moving to a static variable for vectors: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 sum push-back 60.78 60.92 61.08 0.30 0.09 sum vector 71.70 71.77 71.86 0.16 0.05 Urgh, that's slow! But this may be because vectors are more heavily affected by the memory allocations that we won't have to perform when using a static queue. So let's make it static again: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 sum push-back 60.78 60.92 61.08 0.30 0.09 sum vector 71.70 71.77 71.86 0.16 0.05 sum static-vector 56.40 56.51 56.74 0.34 0.10 Very nice! Even though using a local variable vector is more than 10% slower than what we started from, using a static vector is actually more than 10% faster! OK, now clean things up and turn the queue into an instance variable: TASK ALGORITHM MIN AVG MAX SPAN DEV sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 sum push-back 60.78 60.92 61.08 0.30 0.09 sum vector 71.70 71.77 71.86 0.16 0.05 sum static-vector 56.40 56.51 56.74 0.34 0.10 sum instance-vector 59.74 59.93 60.40 0.66 0.18 Oops! I really didn't expect to see such a decline in performance. All that happens here is one more level of indirection when accessing the vector. It's still better than what we started with, of course, but it's a pity. I don't think there's much we can do about it, though. (On the other hand, this does suggest that we might see benefits by getting rid of another level of indirection that we have by using a vector instead of a plain C pointer. There may be more savings here by using a more low-level queue. But I didn't want to complicate things to this extent.) I thought I was done here, but then I noticed that on the last axiom layer, the second half of the main for loop (which seeds the queue for the next iteration based on the negation by failure info) isn't really needed. I thought there might be quite many variables in this last layer in something like PSR-Large, and it might be worth saving this effort by skipping this part of the loop in the last iteration. The result: sum old 63.94 63.98 64.04 0.10 0.04 sum static-deque 63.30 63.42 63.72 0.42 0.11 sum push-back 60.78 60.92 61.08 0.30 0.09 sum vector 71.70 71.77 71.86 0.16 0.05 sum static-vector 56.40 56.51 56.74 0.34 0.10 sum instance-vector 59.74 59.93 60.40 0.66 0.18 sum no-last-nbf 57.58 57.68 57.94 0.36 0.10 Indeed another substantial improvement of around 2% compared to the previous version obtained by a simple if test! At this point I realized that it would be cleaner to simply not store the negation-by-failure info for the last layer. Then, instead of testing whether we are in the last layer, we could simply iterate over an empty vector in the last iteration, getting rid of the special case in the loop and avoiding to construct information we will never use. But then I noticed that this optimization is already present in the code: the vector over which we iterate in the last loop is empty, but still these iterations take roughly 2% of the overall planner runtime! That was the second big surprise for me, but whatever I tried when playing around with this confirmed this, so I left it at this version.
msg3057 (view)	Author: jendrik	Date: 2014-03-14.18:03:26
YOu can find the branch at https://bitbucket.org/jendrikseipp/downward/branch/issue420
msg3056 (view)	Author: malte	Date: 2014-03-14.17:24:51
PS: Can you send me a link to the repository where this issue is hosted? Once I've done a few more experiments, I'd like to push some things related to the queue.
msg3055 (view)	Author: malte	Date: 2014-03-14.17:04:14
Well, if ever there was an issue where I expected not to lose performance, it was this one. :-) Interesting that there is a reduction in coverage. Is there any explanation for this, other than noisy runtime on the grid? (Can you check the log files of the four relevant runs?) For cases like this one, I think it makes sense to run additional tests on the instances with strange behaviour (in this case, the two instances responsible for loss in coverage), using multiple runs to reduce the noise. We did this for another issue in the past (I can't remember which one), but the results were hard to read because we used duplicate configurations (essentially) to fake the multiple runs. We would get much nicer results if we could use duplicated benchmarks instead, since the results would be grouped properly and we wouldn't get dozens or hundreds of columns. This is probably overkill for this issue, but as a general functionality, it would be interesting for other issues in the future.
msg3054 (view)	Author: jendrik	Date: 2014-03-14.16:15:23
astar(blind()) on psr-large #09: Reported are the total time (under valgrind) and the percentage of the time in the axiom evaluator. Unpacked states: 1853.16s total time, 80.85% axiom evaluator. Packed states : 2407.98s total time, 96.65% axiom evaluator.
msg3053 (view)	Author: jendrik	Date: 2014-03-14.14:27:44
The experiment is finished and here are the results: http://ai.cs.unibas.ch/_tmp_files/seipp/issue420-issue420-v1-compare.html I'll make the requested profile for the last revision without issue214 merged and get back to you with the numbers.
msg3052 (view)	Author: malte	Date: 2014-03-14.13:20:45
OK, there certainly seems to be room for improvement there. I'll gather some numbers with blind search, where the impact of the axiom evaluator should be most pronounced. Will take a few hours. Jendrik, how much work would it be to find out what the total runtime is and how much time is spend inside the axiom evaluator in PSR-Large #09 with astar(blind()), using kcachegrind? I'd be most interested in the numbers before merging issue214 (but before and after is also fine).
msg3051 (view)	Author: malte	Date: 2014-03-14.12:28:49
It's not the creation of the queue that costs time, but rather the reallocations inside the push_back calls. In the additive, FF and CG heuristic, IIRC a change like this made rougly a 10% difference. Of course, 10% of 13.5% would still be quite little, but then it would only be a tiny change. I'll have a look.
msg3050 (view)	Author: jendrik	Date: 2014-03-14.02:25:33
I profiled psr-large #30 with the config you proposed. AxiomEvaluator::evaluate() needs 13.5% of the time, but it seems the time for creating the queue is negligible since it doesn't even show up in the call graph and I cannot find any time for it in the numbers. I have started an experiment that evaluates the simple change (directly exiting AxiomEvaluator::evaluate() on tasks without axioms) for blind() and lmcut() on OPTIMAL_WITH_IPC11.
msg2947 (view)	Author: malte	Date: 2014-02-10.17:46:39
The axiom evaluator spends quite a bit of time pushing values into a deque even if there are no axioms. Method AxiomEvaluator::evaluate should grow the following test at the very start: if (g_axioms.empty()) return; (I've seen a method "has_axioms" in an unmerged branch. This would of course be nicer to use here, so we might want to add it now. But maybe this will give us more merge conflicts later, so I have no preference whether or not to add this.) Jendrik and Florian, can one of you make this change and evaluate it on the optimal benchmark suite with lmcut(), blind() and the CEGAR heuristic (in the config for which Jendrik produced the profile)? I'd expect that this gives quite nice time savings for blind search, smaller savings for CEGAR and even smaller savings for LM-Cut. While we're at it, we might also try speeding up axiom evaluation for problems that do have axioms by reducing memory management overhead. This could possibly be done by reusing the queue between calls, in the same way that we reuse the queue in additive_heuristic.{cc,h} and other places. (Reusing the queue made quite a difference there.) Before doing this, it would make sense to first run a profiler on a reasonably large psr-large file with something like --heuristic h=cg() --search lazy_greedy(h,preferred=h) to see if AxiomEvaluator::evaluate requires a substantial amount of time. Can someone run this? (I cannot easily run kcachegrind on my machines.)

History
Date	User	Action	Args
2014-03-16 13:12:41	jendrik	set	status: chatting -> resolved messages: + msg3075
2014-03-15 22:03:48	malte	set	status: resolved -> chatting messages: + msg3070
2014-03-15 21:09:40	jendrik	set	status: chatting -> resolved messages: + msg3068
2014-03-15 20:13:43	malte	set	messages: + msg3067
2014-03-15 18:47:21	jendrik	set	messages: + msg3066
2014-03-14 20:10:24	malte	set	messages: + msg3059
2014-03-14 19:37:35	malte	set	messages: + msg3058 title: make axiom evaluator faster when there are no axioms -> make axiom evaluator faster, especially when there are no axioms
2014-03-14 18:03:26	jendrik	set	messages: + msg3057
2014-03-14 17:24:51	malte	set	messages: + msg3056
2014-03-14 17:04:14	malte	set	messages: + msg3055
2014-03-14 16:15:23	jendrik	set	messages: + msg3054
2014-03-14 14:27:44	jendrik	set	messages: + msg3053
2014-03-14 13:20:45	malte	set	messages: + msg3052
2014-03-14 12:28:49	malte	set	messages: + msg3051
2014-03-14 02:25:33	jendrik	set	messages: + msg3050
2014-03-03 19:53:36	jendrik	set	assignedto: jendrik
2014-02-10 17:46:39	malte	create

Issue420