Issue814

Title: make h^add/h^max/h^FF more efficient, especially in 64-bit builds
Priority: feature
Status: resolved
Superseder:
Nosy List: cedric, florian, guillem, jendrik, malte, silvan, thomas
Assigned To: jendrik
Keywords:
Optional summary:

Created on 2018-09-12.11:18:22 by malte, last changed by jendrik.

Files
File name: massif-profiles.tar.gz (application/gzip), uploaded by jendrik on 2018-09-12.13:54:57
Messages
msg8222 (view) Author: jendrik Date: 2018-12-10.22:24:03
Merged.

I opened issue877 for tracking ideas related to relaxation heuristics.
msg8219 (view) Author: malte Date: 2018-12-10.20:02:00
If nobody else objects, we can merge this.
msg8218 (view) Author: jendrik Date: 2018-12-10.19:22:20
I took care of your comments. Can we merge this?

I will start collecting possible performance enhancements in a new meta issue.
msg8216 (view) Author: malte Date: 2018-12-10.17:12:09
Thanks! I left two comments on bitbucket; no further comments.

Once this is merged, we should rerun the experiments of issue213 (I suggest
rerunning all of them; the landmark-based ones need rerunning, too, and other
things have changed in the code since).

We should also make sure we keep any ideas regarding performance enhancements or
other improvements that we discussed in the context of this issue and might
still want to implement.
msg8215 (view) Author: jendrik Date: 2018-12-10.16:39:41
Done. Here is the pull request:
https://bitbucket.org/jendrikseipp/downward/pull-requests/111
msg8214 (view) Author: malte Date: 2018-12-10.15:41:26
Thanks! :-)

v14 vs. v16: no clear useful effect of the alignment. However, I noticed that
UnaryOperator is only 28 bytes naturally, so aligning it to 32 bytes wastes
memory, which very likely eats up any positive effect we might otherwise get.
For now the alignment is not worth using, but we may want to revisit this
if/when we add non-unary operators for more correct support of conditional
effects in h^FF (as discussed), in which case the operator data structure
would naturally have a size of 32 bytes.

Given that we are very deliberate about the memory layout of these data
structures, I would suggest adding the static_asserts for the sizes even
without using the alignment, i.e., asserting sizes of 16 and 28 bytes. This is
part of the desired performance characteristics of the code.

v16 vs. v17: looks good

default (v12-base) vs. v17: looks good! :-)


Suggested next steps:

1. Remove the __aligned__ attributes, but keep the static assertions, asserting
16 and 28 bytes.

2. Merge from default.

3. Prepare a pull request. I'd like to have a final look before we merge this. 

If anyone else wants to do a code review, please speak up, as we will otherwise
merge this soon.
msg8212 (view) Author: jendrik Date: 2018-12-09.08:59:42
Here are the results for the alignment experiment:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-align-sat-issue814-v14-issue814-v16-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-align-hmax-issue814-v14-issue814-v16-compare.html
msg8210 (view) Author: jendrik Date: 2018-12-08.18:12:40
Here is the evaluation of the clean up (renaming ArrayChain to ArrayPool, v17):

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v17-sat-issue814-v16-issue814-v17-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v17-hmax-issue814-v16-issue814-v17-compare.html

And here is the comparison of the default and the issue branch:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v17-sat-issue814-v12-base-issue814-v17-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v17-hmax-issue814-v12-base-issue814-v17-compare.html
msg8202 (view) Author: jendrik Date: 2018-12-08.12:59:05
You're right. I'll start two new experiments.
msg8201 (view) Author: malte Date: 2018-12-08.12:57:42
Hi Jendrik, great!

But I think to decide whether we like the alignment it would be good to have a
direct "with or without" experiment, i.e., a comparison between two versions
whose only difference is the alignment: one version forces alignment for
neither of the data structures, and the other one forces alignment for both.
If I understand correctly, we don't currently have that, only experiments for
0 vs. 1 and 1 vs. 2 of the data structures. Or did I get confused about the
versions?
msg8199 (view) Author: jendrik Date: 2018-12-08.12:52:24
Here are the results for the version that aligns the propositions on 16-byte boundaries (v16):

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v16-sat-issue814-v15-issue814-v16-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v16-hmax-issue814-v15-issue814-v16-compare.html

I made the cleanup you suggested (v17). Experiments comparing v17 to v16 and the default branch are 
running.
msg8195 (view) Author: malte Date: 2018-12-07.23:10:51
I didn't want to do that because I don't think it's sufficiently polished. I
think it would need one or two more users to get a sufficiently rounded-off
interface for real reuse. For example, the pool should probably grow a reserve
method, and it would make sense to templatize it, but it's generally not a good
idea to nail down such API details without knowing what the users actually need.
Leaving it local to the relaxation heuristics sends a clearer message of "this
is a one-off tool we designed for one particular user; it isn't a general tool yet".
msg8194 (view) Author: jendrik Date: 2018-12-07.23:05:52
I agree.

Regarding the cleanup, I propose to move the new array_pool.{h,cc} files into the 
"algorithms" directory. What do you think?
msg8193 (view) Author: malte Date: 2018-12-07.23:01:03
Thanks! Again, a clear win for v14, and this time v15 also seems to be quite an
improvement. (In absolute numbers, there is not much of a gain, but if you look
at score_total_time by domain, it does seem to be a statistically significant,
though not very large, effect.)
msg8192 (view) Author: jendrik Date: 2018-12-07.22:42:43
Here are the results for h^max:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-hmax-issue814-v13-issue814-v14-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-hmax-issue814-v14-issue814-v15-compare.html
msg8188 (view) Author: malte Date: 2018-12-07.17:20:49
Regarding the cleanup: perhaps it's better to get this merged soon and worry
about making it more beautiful later, as it wasn't beautiful before this issue
either. In other words, I have no objections to merging this without a lot
more cleanup.

It is clear that the relaxation heuristic code could be prettier, getting rid
of code duplication etc., and also that there are further ideas for
performance improvements. But I think they can wait.

Two changes I'd like to see before merging:

- Get rid of array_chain.cc (see comment inside the file).

- Find a better name for "ArrayChain". There seems to be no universally agreed
name for this sort of thing, but the word "pool" is often used when a bunch of
objects can be allocated individually but only deallocated collectively, which
is what this is about. So I suggest changing "ArrayChain" to "ArrayPool", of
course also renaming the files and adjusting variable names, related class
names, etc.
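
For illustration, here is a minimal sketch of these pool semantics (int-valued
and index-based; the interface of the actual class in the branch may differ):

#include <vector>

// All arrays share one backing vector: they can be appended ("allocated")
// individually, but memory is only ever released for all of them at once.
class ArrayPool {
    std::vector<int> data;
public:
    // Appends an array and returns the offset of its first element.
    int append(const std::vector<int> &array) {
        int position = static_cast<int>(data.size());
        data.insert(data.end(), array.begin(), array.end());
        return position;
    }

    const int *begin_of(int position) const {
        return data.data() + position;
    }

    // Collective deallocation is the only way to free memory.
    void release() {
        std::vector<int>().swap(data);
    }
};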

Once we have something we would like to merge, I would also like to see a final
experiment comparing our latest version to whatever is the most appropriate
baseline revision from default.
msg8187 (view) Author: malte Date: 2018-12-07.17:16:15
The results for the no-sentinel change look very good.

The forced alignment makes no big difference in either this or the previous
experiment. I tested if we get 32-byte alignment automatically even without
forcing it (after all, all unary operators live in one large vector, and very
large objects are often handled specially by memory allocators), but this does
not seem to be the case. On the example I tested, without the forced alignment
the unary operators were not 32-byte aligned. So the alignment doesn't seem to
happen automatically. But experimentally it doesn't seem to have much of a
positive impact here.

Before we abandon the idea, I would like to see one more experiment, with and
without alignment, but also aligning the propositions with a 16-byte alignment.
(Please also add the corresponding assert about the size being 16 bytes.)

If the forced alignment doesn't help in this experiment, I suggest we get rid of
it again.
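
For reference, a compilable sketch of what the forced alignment does (stand-in
struct, not the real UnaryOperator; the branch uses the GCC/Clang
__attribute__((__aligned__(32))), for which C++11 alignas is the portable
equivalent):

// 28 bytes of payload, forced onto 32-byte boundaries. Note that sizeof
// grows to 32: the 4 bytes of padding per object are the memory waste
// discussed in msg8214.
struct alignas(32) AlignedUnaryOperatorStandIn {
    int fields[7];
};
static_assert(alignof(AlignedUnaryOperatorStandIn) == 32,
              "objects start on 32-byte boundaries");
static_assert(sizeof(AlignedUnaryOperatorStandIn) == 32,
              "28 bytes of payload are padded to 32 bytes");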
msg8185 (view) Author: jendrik Date: 2018-12-07.16:27:21
The last message actually holds the results for h^add.
msg8184 (view) Author: jendrik Date: 2018-12-07.15:59:13
Here are the corresponding results for h^max (see msg8175):

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-hadd-issue814-v13-issue814-v14-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-hadd-issue814-v14-issue814-v15-compare.html
msg8181 (view) Author: jendrik Date: 2018-12-07.14:15:46
To simplify discussing what to clean up, I created a pull request:

https://bitbucket.org/jendrikseipp/downward/pull-requests/111
msg8178 (view) Author: malte Date: 2018-12-07.09:13:20
Looks good! Can you add experiments for the other two heuristics as well? (In
the case of h^max, using A*.)
msg8177 (view) Author: jendrik Date: 2018-12-07.09:13:11
During live discussions we talked about another performance-related idea (that 
doesn't need to be tackled within this issue):

O. Make data needed for setup_exploration_queue() trivially copyable. This 
should make the compiler use a memcpy and thus speed up the initialization of 
the queue.
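
A minimal sketch of idea O (type and function names hypothetical; the real data
lives in the relaxation heuristic classes): with trivially copyable per-state
data, the reset can be a single memcpy from a precomputed initial image instead
of a per-element loop.

#include <cstring>
#include <type_traits>
#include <vector>

struct PerStateData {
    int unsatisfied_preconditions;
    int cost;
};
static_assert(std::is_trivially_copyable<PerStateData>::value,
              "resetting via memcpy requires trivially copyable data");

void reset_exploration_data(std::vector<PerStateData> &working,
                            const std::vector<PerStateData> &initial) {
    working.resize(initial.size());
    std::memcpy(working.data(), initial.data(),
                initial.size() * sizeof(PerStateData));
}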
msg8175 (view) Author: jendrik Date: 2018-12-07.08:51:23
We have made two more changes to the code:

* Remove sentinel from ArrayChain (v14)
* Align UnaryOperator on 32 byte boundaries (v15)

Here are the results:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-issue814-v13-issue814-v14-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v15-issue814-v14-issue814-v15-compare.html
msg8159 (view) Author: malte Date: 2018-12-05.20:14:22
Looks very good!
msg8157 (view) Author: jendrik Date: 2018-12-05.20:08:45
[Assigning this to myself while I'm working on it.]

I merged with the default branch (v12) and then implemented idea J. (v13). Here are the results:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue814-v13-issue814-v12-issue814-v13-compare.html
msg7753 (view) Author: malte Date: 2018-09-21.17:37:45
Note mostly to self: I merged from default after v11, so future experiments
should compare to an updated baseline.
msg7738 (view) Author: malte Date: 2018-09-21.13:37:51
To tie up some loose ends before the end of the sprint, here is the status of
the items A.-N. from the earlier messages:

A. Use integers with indices instead of pointers.
   => Done. This helped a lot.

B. Use direct pointers to the data instead of vectors (for example with
   some kind of terminator entry instead of a size field).
   => Done, although they are not exactly pointers. Also helped a lot.

C. Something like the small string optimization. (Store small vectors
   "inline" -- almost all our vectors will be small.)
   => not worth doing for the main loop as the "precondition_of" vectors
      are not small; may provide some benefit for the preferred operators
      calculation, but unclear if it's worth trying

D. Change the order of attributes in these classes to reduce losses
   due to alignment.
   => Done. Helped a lot at some point, but not at the point where it was
      actually done in the issue branch because at that point the most
      important class (UnaryOperator) only had a single 4-byte gap, which
      could only be moved around, not removed.

E. Use bitfields for elements with small ranges. (For example, we have
   some booleans.)
   => Done. If it helped, it didn't help much, but that is explainable
      because it only applied to Proposition, the less critical data
      structure. Our critical data structure, UnaryOperator, did not have
      such attributes.

F. Get rid of data that is only needed during initialization and does not
   need to be part of the working set.
   => Done.

G. Use struct-of-vector representation instead of vector-of-struct
   representation, keeping related data together.
   => Not done, might be worth trying out.

H. Separate rarely accessed data, frequently read data, and frequently
   read+written data.
   => Not done, might be worth trying out.

I. Compute h^FF as a side effect of marking preferred operators rather than
   actually computing the relaxed plan explicitly.
   => Not done, worth doing because it also simplifies the code.

J. Replace the current test of whether a marked operator is a preferred operator
   => Not done. Worth trying because it will make the code cleaner.

K. Compile the goal into an operator.
   => Not done. Would allow some simplifications while making the setup more
      complicated. Could be worth trying out, but not top priority.

L. Have all propositions live in a single vector rather than a 2D data
   structure.
   => Done. This was necessary for several other optimizations replacing
      pointers by ints, which were critical.

M. Introduce a num_preconditions attribute.
   => Done. We did this together with another change (because we had to do it
      for that change), so the performance was not evaluated separately.

N. Precompute the set of no-precondition actions.
   => Not done. Probably won't change too much, but would also be quite
      easy to do. Not high priority.
msg7706 (view) Author: malte Date: 2018-09-21.11:47:00
This won't get merged for the current sprint. Even if the cleanup could be done
very quickly, there would be no time to run the necessary experiments in time.
So the question is when we can continue working on this.

I have two other urgent commitments coming up (a project proposal presentation
and a project report), but hopefully I can work on this about a week from today.
If not, we should perhaps think of alternative ways to proceed, so that this
won't languish for a long time.
msg7705 (view) Author: malte Date: 2018-09-21.11:41:05
Corrected link for the first experiment mentioned in the previous message:

https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v10-base-30min-issue814-base-issue814-v10-compare.html
msg7696 (view) Author: malte Date: 2018-09-21.09:26:26
Experimental results for 30 minutes and for h^max are in.

30-minute experiments:

    issue814-v10-base-30min-issue814-base-issue814-v10-compare

They look good. The coverage differences between base and v10 are much smaller
than for 5 minutes, but that is to be expected because with the larger time
limit, the memory limit becomes more of an issue.

We get the following changes in coverage between base and v10:

h^add, 32-bit: 2115 => 2121 (+6)
h^add, 64-bit: 2105 => 2121 (+16)

h^FF,  32-bit: 2068 => 2074 (+6)
h^FF,  64-bit: 2060 => 2073 (+13)

So still some nice improvements, and the 32- vs. 64-bit comparison also looks
fine to me.

experiments with h^max:

   
https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v11-base-hmax-30min-issue814-base-issue814-v11-compare.html

A*+h^max, 32-bit: 800 => 797 (-3)
A*+h^max, 64-bit: 793 => 796 (+3)

So we get the desired effect of 64-bit becoming closer to 32-bit, but we
actually lose some 32-bit coverage. The h^max code is very similar to the h^add
code, so I wondered what might be going on here.

Some of the changes we made to get rid of pointers require a bit more work in
the heuristic computation than before, and I suspect this is the main cause. In
h^add and h^FF this was compensated by a smaller memory footprint during the
computation, which helps a lot with cache performance. With A* + h^max, the
large tasks where this could matter typically cannot be solved either way
because this configuration is almost never able to solve large problems. So
the overhead is more pronounced.

In all three tasks we lose, the solution times for the base revision are in
excess of 1600 seconds, sometimes more than 1750 seconds. Perhaps we are also a
bit unlucky with grid noise.

So all in all I think this is mostly fine. We might claw back a bit of runtime
during cleanup. For the preconditions data structure, we currently use a size
field and a sentinel, and we only need one of the two. There are also some
things in the data structure that we don't need for h^max, so we could also
improve performance a bit by giving it its own data structures or separating out
the data structures only needed for preferred operators. But I think for the
overall goal of allowing a 64-bit default build, these numbers already look fine.

So the main thing that needs to be done next is to clean up again.
msg7670 (view) Author: malte Date: 2018-09-20.19:41:16
Experiments with h^max have been started.

Note to self: v11 is identical to v10 except that it includes h^max, so no need
for experiments with the other heuristics.
msg7667 (view) Author: malte Date: 2018-09-20.19:01:27
I found 5-minute data for the base revision (from the base vs. v4 experiment),
so we can actually have a look at 5-minute coverage for base vs. v10:

h^add, 32-bit: 2034 => 2045 (+11)
h^add, 64-bit: 2017 => 2045 (+28)

h^FF,  32-bit: 2001 => 2006 (+5)
h^FF,  64-bit: 1998 => 2008 (+10)
msg7663 (view) Author: malte Date: 2018-09-20.18:29:04
I combined the data from some of the old experiments and was disappointed to see
that v10 has significantly worse coverage than the base revision -- until I
realized that the early experiments used a 30-minute timeout and we later
switched to a 5-minute timeout, so I was comparing the two revisions with
different timeouts. :-)

An apples-to-apples comparison looks much nicer. This table includes a bunch of
different revisions:

    https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v1_7_10.html

base and v1 use a 30-minute timeout, while v6, v7, v9 and v10 use a 5-minute
timeout. Both v6 and v7 should be reasonably good revisions, while v9 is broken.
Going from v6 to v10 we see the following changes in coverage for the release
builds:

h^add, 32-bit: 2039 => 2045 (+6)
h^add, 64-bit: 2029 => 2045 (+16)

h^FF,  32-bit: 2005 => 2006 (+1)
h^FF,  64-bit: 2000 => 2008 (+8)

In particular, in v10 the 64-bit build has caught up with the 32-bit build, at
least for the 5-minute timeout. If we look at the scores, the time scores are
slightly better for the 64-bit build and the memory scores are slightly worse,
which is more or less what we would expect/hope.


The new thing in v10 is that it uses a custom data structure to reduce the
vector overhead: instead of 12 bytes in 32-bit and 24 bytes in 64-bit, each
vector now uses 4 bytes in the UnaryOperator/Proposition data structures
themselves, plus 4 bytes in UnaryOperator for storing the number of
preconditions per operator because we want to be able to access that quickly.

All the vector contents are stored in one big vector, so there are some
additional memory savings per vector for each UnaryOperator/Proposition due to
less memory management overhead. On the negative side, we pay 4 additional bytes
per vector for a sentinel object. This could be changed: given that we store the
number of preconditions per operator anyway, we could replace the sentinel
object by a length field for the usage in UnaryOperator, and the usage in
Proposition is less critical, so we could add a size field there, too.
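
A minimal sketch of this representation (field names assumed; shown with the
length-based variant suggested above rather than the sentinel):

#include <vector>

struct CompactUnaryOperator {
    int preconditions_start;   // 4-byte offset into the shared vector below
    int num_preconditions;     // stored so it can be read without scanning
    // ... remaining attributes (effect, cost, ...) omitted ...
};

struct RelaxationData {
    std::vector<int> precondition_data;    // all precondition lists, concatenated
    std::vector<CompactUnaryOperator> unary_operators;

    // Replaces a per-operator std::vector (12/24 bytes in 32-/64-bit builds).
    const int *preconditions_of(const CompactUnaryOperator &op) const {
        return precondition_data.data() + op.preconditions_start;
    }
};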


I have started a 30-minute baseline vs. v10 experiment to see what the results
are for the longer timeout.

Even if the results are good, there are two major drawbacks with v10 that need
to be addressed before we can consider merging this.

- h^max is currently disabled because it has not yet been adapted to the
  changes in the base class.

- The new code is too ugly to merge because all changes after v4 were written
  as quick prototypes. So even if performance turns out to be good enough,
  there is still a lot of cleanup work to do.

I think I can address the first of these two today and start an experiment
overnight. I won't get the second of these done today, but perhaps I can make at
least a bit of progress.
msg7657 (view) Author: malte Date: 2018-09-20.17:23:44
OK, the good news is that the new revision (v10 = v9 minus the segfault plus
space-optimizing UnaryOperator) solves far more problems than v9. However,
that's largely because v9 produces all these segfaults. ;-)

I'll try to cook up a comparison between v10 and some older non-broken revisions
without rerunning the experiment. If that works out, the data will be slightly
less reliable because the data were aggregated at different times on the grid,
and we know that the other running experiments etc. affect the outcomes.
msg7623 (view) Author: jendrik Date: 2018-09-20.11:14:54
I agree that the safe-STL mode sounds interesting.
msg7619 (view) Author: malte Date: 2018-09-20.11:05:45
Anticlimax time! The experiments for v6-v9 are done, and there are no
substantial improvements. However, I have excuses:

v6:
https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v6-issue814-v5-issue814-v6-compare.html

This is where we reorder the attributes of Proposition and UnaryOperator for
better 64-bit memory layout. We noticed before that doing this for Proposition
doesn't really help because there are typically so many more unary operators
than propositions.

On closer inspection, it turns out that the 64-bit memory layout for
UnaryOperator was already as good as it could get with the current collection of
attributes. The change merely moves a 4-byte gap from the middle of the object
to the end, which does not really help. If I looked at it correctly, we do save
8 bytes for the propositions, but they don't matter much. (Reordering unary
operators would have helped if we had done it earlier, but with the move from
pointers to ints, the other gaps we had have already been closed.)

v7:
https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v7-issue814-v6-issue814-v7-compare.html

This is where we change some attributes to bitfields to save space. However,
this is for Proposition, not for UnaryOperator, which I overlooked yesterday,
and as said above, that class doesn't matter so much. There are some minor
improvements that may or may not be noise.

v8 and v9:
https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v8-issue814-v7-issue814-v8-compare.html
https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v9-issue814-v8-issue814-v9-compare.html

This is where we replace vectors with a more compact data structure, but it
turns out I optimized the wrong data structure again, Proposition rather than
UnaryOperator. So no real impact. Of the two variants, the vector-based one (v9)
is slightly better in terms of time and the deque-based one (v8) is slightly
better in terms of memory, as expected, with the vector-based one looking
slightly better overall to me. But perhaps both variants are still worth
checking once I've actually applied this to the correct class.

There are also lots of segmentation faults here because I resized a vector to
the number of unary operators where I should have used the number of
propositions, so it crashes on (the few) tasks where there are more facts than
operator effects. In the end it wasn't too hard to diagnose this, but it would
have been even easier if vector::operator[] performed bounds checking in debug
mode. I found this link, which is something we might want to consider:

    https://stackoverflow.com/questions/38958523/safe-versions-of-stl-on-linux
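
For reference, a small self-contained example of one of the options discussed
at that link: libstdc++'s debug mode adds bounds checking to
vector::operator[] when compiling with -D_GLIBCXX_DEBUG.

#include <vector>

// Build with: g++ -D_GLIBCXX_DEBUG -g example.cc
// With the flag, the out-of-bounds access below aborts with a diagnostic;
// without it, it is silent undefined behaviour.
int main() {
    std::vector<int> v(3);
    return v[5];
}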

The seg fault is fixed. Next step: apply the optimization of v8/v9 to the
correct data structure.
msg7613 (view) Author: malte Date: 2018-09-19.21:49:08
Results for v5 are in:

https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v5-issue814-v4-issue814-v5-compare.html

We hoped/expected performance to remain similar from v4 to v5 in 32-bit mode,
but to improve in 64-bit mode, although we did not expect 64-bit mode to catch
up with 32-bit mode. The coverage numbers more or less bear this out:

h^add, 32-bit: 2038 => 2038 (unchanged)
h^add, 64-bit: 2018 => 2031 (+13)

h^FF,  32-bit: 2006 => 2003 (-3)
h^FF,  64-bit: 1990 => 1998 (+8)

(All results for release builds; the debug builds are only included in the
experiment to test if they trip an assertion.)

So h^FF has suffered a bit in 32-bit mode, but we got the desired improvements
in 64-bit mode. As expected, there is still a non-trivial gap between 32-bit and
64-bit, but the numbers are moving in the right direction.

In addition to v6 and v7, I have submitted two quite speculative new revisions
v8 and v9 that are supposed to help with the vector space overhead. They are
almost identical, except that v8 uses a deque and v9 uses a vector in an
internal data structure. The vector should have faster access, but the deque
might have a nicer peak memory profile because it grows in a more stable way.

All of these are 5-minute experiments, so we should have results for v6-v9 by
tomorrow morning.
msg7605 (view) Author: malte Date: 2018-09-19.19:40:32
I started experiments for v6, which reorders the attributes of Proposition and
UnaryOperator in a way that should help with 64-bit performance.

I also started experiments for v7, which should reduce the size of Proposition
by 8 bytes by using bitfields. The disadvantage of bitfields is that they require
additional computation, and they don't always pay off. I tried to combine
attributes that aren't commonly accessed in the first place and that don't
require arithmetic to mitigate these issues.

The expectation for v6 is that it makes essentially no difference for 32 bits
and that it helps a bit for 64 bits on tasks that are large but simple.

The expectation for v7 is that it may help a little bit for both versions on
tasks that are large but simple, but this is more speculative.

This exhausts the low-hanging fruit for reducing the 32-bit/64-bit memory layout
gap. More things can certainly be done, but they require a bit more code.
msg7604 (view) Author: malte Date: 2018-09-19.19:26:29
I started experiments for v5, which uses (integer) indices instead of pointers
everywhere in the core data structures of h^add, h^max and h^FF.

I hope/expect that performance in 32-bit mode will remain quite similar to the
way it currently is, and that performance in 64-bit mode will become closer to
32-bit.

However, I expect a substantial gap to remain between the two because the 64-bit
memory layout is still unfavourably affected by alignment issues and its vectors
are twice as large (the vector itself, not its contents). But with indices
instead of pointers used everywhere, further changes will be much simpler.
msg7600 (view) Author: malte Date: 2018-09-19.19:08:39
OK, so in all the SIGKILL cases, the translator was killed after running out of
time. It is supposed to be aborted more gracefully, and in all cases except for
these 10 this also worked, so I'm not sure what to do but am happy to leave this
alone for now. If someone wants to have a look, here is a way to look at the
run.log files; vary as necessary for driver.log, properties, etc.

Can be copy-pasted into bash:

cd /infai/helmert/downward/malte/experiments/issue814/data/issue814-v4
BADRUNS="
00191
11493
12559
18140
23684
25142
25348
29191
30683
30686"

for x in $BADRUNS; do
    echo $x
    GROUP=$(echo $x | head -c 3)
    RUNDIR=runs-$GROUP*/$x
    less $RUNDIR/run.log
done
msg7599 (view) Author: malte Date: 2018-09-19.18:57:42
Results for v4 are in:

https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v4-issue814-base-issue814-v4-compare.html

Good news:
- Simplifying is still much faster in v4 than in base.
- We still have a very nice improvement in the memory score.
- We see a coverage increase in settlers, which may or may not be caused
  by the filtering of duplicate preconditions.
- We now simplify to the same number of unary operators as in base, so the
  bugs in the previous versions are fixed. (The numbers only differ in
  trucks_strips, where the translator behaves unpredictably.)
- No more failed assertions.

Bad news:
- There are some unexplained errors in the report which I have not seen
  before, namely sigkills. They seem to happen mostly, though perhaps not
  exclusively (hard to say at the moment), in cases where the translator
  failed to complete in the given time/memory limits (300 seconds, however
  much memory). They occur both for base and the new code, so they do not
  seem to be related to our changes. Still, probably something that
  shouldn't happen. I'll take a closer look.
msg7598 (view) Author: malte Date: 2018-09-19.18:38:11
I am removing the old summary, which comes from the time before we knew what was
going on here.
msg7569 (view) Author: malte Date: 2018-09-19.14:30:49
Duplicated preconditions are now filtered in the relaxation heuristics. I have
tagged v4 with this change and started a new 5-minute experiment. I'm comparing
to base rather than v3 because v1, v2 and v3 all misbehave in various ways.

In the affected domains (= only settlers, I think), the change should improve
h^add, and it may lead to more simplification because there are now more cases
with 5 or fewer preconditions, which is the magic number for simplification.

I left a note about these duplicates and their special treatment in
RelaxationHeuristic at issue497, which is about canonicalizing operator
definitions and similar things.
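
For reference, a minimal sketch of the filtering (assuming preconditions are
represented as sortable ids; the actual change lives in the relaxation
heuristic's setup code):

#include <algorithm>
#include <vector>

// Sort the precondition ids and drop exact repeats, covering the settlers
// case where a fact occurs both as a precondition and as an effect condition.
void filter_duplicate_preconditions(std::vector<int> &preconditions) {
    std::sort(preconditions.begin(), preconditions.end());
    preconditions.erase(
        std::unique(preconditions.begin(), preconditions.end()),
        preconditions.end());
}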
msg7567 (view) Author: malte Date: 2018-09-19.14:00:59
Indeed, there are duplicated preconditions in Settlers, and it's related to the
stupid way our file format includes actual preconditions (not just effect
conditions) as part of effect specifications.

For example, in settlers-sat18-adl/p01.pddl the action "build-docks p1" contains
71 effects (most of them conditional), including for example these:

===============
1 902 0 893 0 1
1 900 0 893 0 1
===============

Both of these encode the precondition (not effect condition) var893 = 0. (The
last three numbers of an effect line are <var> <pre> <post>; here it is the
<pre> value 0 for var893 that smuggles in the precondition.)
msg7565 (view) Author: malte Date: 2018-09-19.13:37:13
Results for v1 vs. v2 here:

https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v2-issue814-v1-issue814-v2-compare.html

Results for v2 vs. v3 here:

https://ai.dmi.unibas.ch/_tmp_files/helmert/issue814-v3-issue814-v2-issue814-v3-compare.html

The debug experiments uncovered a violated assumption. We assume that the set of
unary operator preconditions is sorted and unique, but it is not. Because we
sort the preconditions ourselves, this must mean (unless there is a bug) that it
is not unique, i.e., the same fact is listed multiple times as a precondition.
This may include the case where it is listed once as a precondition and once as
an effect condition. All examples of this are in the settlers domain.

If we indeed have replicated conditions, then getting rid of them could also
benefit other heuristics, so it is perhaps something to keep in mind beyond the
scope of this issue.
msg7558 (view) Author: malte Date: 2018-09-19.10:23:12
I tagged v3 with the fix for RelaxationHeuristic::simplify and started an
experiment comparing v2 and v3.

In the interest of not clogging the grid, I wanted to use a 1-minute timeout,
but then saw this is not enough to translate the first task in agricola, which
is one of the affected domains. So I'm using a 5-minute timeout instead.
msg7556 (view) Author: malte Date: 2018-09-19.09:33:33
Experiments for base vs. v1 are in:

http://ai.cs.unibas.ch/_tmp_files/helmert/issue814-v1-issue814-base-issue814-v1-compare.html

Observations:

* We see some improvements in coverage. However, all or almost all of them
  will be due to different tie-breaking in determining best achievers, which
  is just us being luckier now. So nothing to see here.

* We reach the h^add clamp in a number of domains, which is unfortunate,
  but I cannot think of a good and easy solution for this. For now, I'm
  fine with not doing anything about it.

* Memory scores are noticeably improved. This makes sense because we are
  now more careful about memory usage in the heuristic setup. For example,
  RelaxationHeuristic::simplify no longer maintains three copies of the
  operator data.

* RelaxationHeuristic::simplify is much faster. For the additive heuristic,
  its runtime is cut in half.

* The negative news: the behaviour of simplification has changed in a few
  domains. I now see what could be causing this. For unary operators with
  more than 5 preconditions, the old code was able to filter out exact
  duplicates (in terms of preconditions and effects), while the new code
  ignores them entirely. (Number of unary operator preconditions is the
  sum of operator preconditions and effect conditions.)

  This is easy to fix, and fixing it will make simplify slower again in the
  affected domains, but there are only a few such domains, so it should still
  be much faster overall than the current code.

Experiments for v1 vs. v2 are still running, but should be done soon.
msg7552 (view) Author: malte Date: 2018-09-19.00:10:00
Experiments for v2 have been started. Results for both v1 and v2 should be
available by tomorrow morning.

The difference between v1 and v2 is mainly code cleanup and preparation for the
following (presumably) more performance-critical changes. The idea is to isolate
the larger, but less critical changes, so that we can verify the assumption that
they are indeed less critical. This will then help evaluate the impact of the
changes in v3 (which is yet to come) in isolation. Of course, we've been wrong
before about what is and isn't critical. :-)

On another note, in msg7496 I wrote that "the expectation is that h^add
behaviour should not change". This is wrong: the code should not change the
h^add heuristic values, but it will affect the h^add preferred operators, and
these are important for performance. h^add preferred operators are the same as
h^FF preferred operators, so they are subject to the same tie-breaking dependencies.

What this means is that we should expect arbitrary performance fluctuations
already in v1, which is a bit unfortunate, but of course that is also the reason
why we want to get the changes that affect tie-breaking out of the way at the
start of the branch. Unless I'm missing something at the moment, tie-breaking
should remain stable from v1 onwards.
msg7530 (view) Author: malte Date: 2018-09-18.18:02:25
The experiments for v1 (which only affects RelaxationHeuristic::simplify) have
been started.

These are the first experiments we are running. It is most definitely not the
case that someone rewrote revision control history to hide the fact that their
first implementation had a subtle bug.
msg7496 (view) Author: malte Date: 2018-09-17.19:38:34
We started with a cleanup of RelaxationHeuristic::simplify.
The code is in my bitbucket repository:

    https://bitbucket.org/malte/downward/branch/issue814

This is not the most obvious part in need of improvement, but our study of the
relaxation heuristic code shows that some of the things going on there are
silly, and now that we actually understand the code deeply, this might be a
good time to make it less silly. (For what it's worth, we have had issues with
this code multiple times.)


Changing simplify can change the order of unary operators and their
preconditions, which in turn can affect tie-breaking in the FF heuristic in
unpredictable ways. So this can have rather wild effects on FF, and it would be
good to run an experiment. It would also be good to run an experiment to see
that the new code is not broken because it is quite hairy (though less hairy
than the old code, in my opinion).


@Jendrik: I will need your help with this. We can either set up an experiment
together, or if you can already start it yourself, that would of course also be
great.

The revisions to compare are issue814-base and issue814-v1. I added output to
show the time for simplify, so issue814-base is not actually the revision from
the default but "default + this output".

I would suggest running "--evaluator h=add() --search
lazy_greedy([h],preferred=[h])" plus the same configuration with ff() instead of
add(), on the full satisficing suite.

In addition, we should run debug configurations. I think it is enough to run a
debug configuration for h^add, not also h^ff.

I suggest a 30-minute timeout. It is not really needed for this experiment, but
I think it may help to have the same setup for this and our later experiments in
this issue, and later ones will want 30 minutes.

The expectation is that h^add behaviour should not change, but h^FF behaviour
might. The new code should simplify away the same number of unary operators as
the old one.

As additional attributes, we should parse the number of unary operators before
and after simplification and the time to simplify. Example output:

========================================================================
Simplifying 29928 unary operators... done! [17292 unary operators]
time to simplify: 0.022623s
========================================================================

=> Here we want to extract 29928 (simplify_before), 17292 (simplify_after) and
0.022623s (simplify_time).
msg7471 (view) Author: malte Date: 2018-09-15.22:53:22
I had a closer look at how the code works, what data is used where, and so on.

The critical data structure seems to be "UnaryOperator". The main loop of the
heuristic only needs the attributes "unsatisfied_preconditions", "cost" and
"effect" of this data structure, so it would make sense to have the main data
structure only consist of these three, so that 5 unary operators fit into a
single cache line. (A cache line is 64 bytes, and these three attributes can be
stored in 12 bytes if we change "effect" from a pointer into an index, which
would require reorganizing how the propositions are stored.)

The three attributes don't quite play the same role: "unsatisfied_preconditions"
and "cost" are touched more frequently (whenever a precondition of an operator
is reached) and are read and written, while "effect" is only accessed after all
preconditions of an operator have been reached, and is read-only. I think it is
still a good idea to not split them up further because the three are frequently
accessed together.

For propositions, the key attributes are "cost", "reached_by" (both accessed
whenever an operator with the proposition as an effect is triggered; read/write
access; the latter only needed when we want preferred operators or for the FF
heuristic), "is_goal" and "precondition_of" (accessed only once when the
proposition is processed; read access). When the latter two are accessed, "cost"
is also accessed, so it may or may not be a good idea to keep all four together.

All other attributes are not accessed in the main heuristic computation. The
ones that are involved in the per-state setup or in the preferred operator
computation are still performance-critical, but less so than the ones involved
in the main heuristic computation, and I think it would make sense to store them
separately.

The attributes of UnaryOperator are used as follows:

operator_no => needed for preferred operator computation
precondition => needed for preferred operator computation;
  *size* of the vector needed for per-state initialization;
  may be a good idea to store this size separately to speed up this initialization
effect => needed for heuristic computation
base_cost => needed for per-state initialization;
  also needed for preferred operator computation, but that could be solved
  differently (see point J. below)
unsatisfied_preconditions => needed for heuristic computation;
  needed for per-state initialization
cost => needed for heuristic computation;
  needed for per-state initialization;
  also needed for preferred operator computation, but that could be solved
  differently (see point J. below)

The attributes of Proposition are used as follows:

is_goal => needed for heuristic computation
id => not needed after one-time setup of heuristic is done
precondition_of => needed for heuristic computation
cost => needed for heuristic computation;
  needed for per-state initialization
reached_by => needed for heuristic computation (only when using pref. ops.
  or for FF heuristic);
  needed for preferred operator computation
marked => needed for preferred operator computation

Some possible changes of how things are computed that could help improve data
organization, continuing the numbering from the earlier messages:

    I. Compute h^FF as a side effect of marking preferred operators rather than
       actually computing the relaxed plan explicitly.

This is a simple optimization that could make the code both faster and leaner.

    J. Replace the current test of whether a marked operator is a preferred operator
       (= is applicable) by something more principled.

We used to test cost == 0, which was fine until we added axioms and operators
with cost 0. Now we test cost == 0 as a quick necessary criterion and then check
whether we actually have an operator (not an axiom) and test for applicability.
It should be possible not to test "cost" here by checking whether all
preconditions have reached_by == nullptr. (We have to iterate over the
preconditions anyway.) We still need to test that it is an operator, not an
axiom. A nice side-effect is that we no longer need to test for applicability.
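
A minimal sketch of this test (simplified stand-in types; the real classes live
in relaxation_heuristic.h):

#include <vector>

struct UnaryOperator;

struct Proposition {
    const UnaryOperator *reached_by = nullptr;  // nullptr: true in the state
};

struct UnaryOperator {
    int operator_no = -1;                       // -1 means axiom
    std::vector<const Proposition *> precondition;
};

// A marked unary operator is preferred iff it stems from a real operator and
// all its preconditions hold in the evaluated state, i.e., were reached at no
// cost. This makes the separate applicability test unnecessary.
bool is_preferred(const UnaryOperator &op) {
    if (op.operator_no == -1)
        return false;  // axioms are never preferred operators
    for (const Proposition *pre : op.precondition)
        if (pre->reached_by != nullptr)
            return false;
    return true;
}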

    K. Compile the goal into an operator.

This is frequently useful, and we already do this in other heuristics. The main
advantage is that we can get rid of the is_goal attribute.

    L. Have all propositions live in a single vector rather than a 2D data
structure.

This is necessary for replacing "Proposition *" by "int" (indices) in our main
data structures, which we need to do to make 64-bit builds as compact as 32-bit
builds.

    M. Introduce a num_preconditions attribute.

This could further speed up the setup step of the heuristic computation.
Depending on how we split the data, we could go to the extreme of having the
initialization be a simple memcpy, but we would perhaps have to pay for that
elsewhere.

    N. Precompute the set of no-precondition actions.

This could further speed up the setup step of the heuristic computation.

An alternative, in some ways more elegant way of achieving the same effect would
be to introduce an artificial precondition, like we do in the "normal form" of
delete-relaxed tasks we use in our lecture (e.g. for LM-Cut). This could be
combined with compiling the current state into an action, analogously to
compiling the goal into an action (having an artificial "initial state" atom "i"
and a 0-cost action with precondition "i" and effects representing the actual
state). But while this would in some sense be conceptually more elegant, I don't
think the code would be cleaner. (Also, it would probably be a bit slower,
though perhaps not enough to matter.) So it's probably not worth following up
on this idea.
msg7417 (view) Author: guillem Date: 2018-09-12.21:45:07
Looks like the real optimization workshop was in Basel, not Bern :-)
msg7416 (view) Author: malte Date: 2018-09-12.21:34:20
I played around a bit with A.+D.+E.

Using vectors of ints instead of pointers had close to no effect in my
unscientific experiment, as did the changes I made to make the Proposition class
more compact. But making UnaryOperator a bit more compact using D had a very
noticeable effect, reducing 64-bit runtime from 21 seconds to 17 seconds in my
example task.

All that needs to be done for this is to make "base_cost" the second rather than
fourth attribute of UnaryOperator. If it is the fourth attribute, as currently,
we lose 8 bytes per object to alignment, as pointers (including vectors, which
are three pointers) are aligned on 8-byte boundaries. Currently, UnaryOperator
has a size of 56 bytes in 64-bit mode. With the change, it has 48 bytes in
64-bit mode. In either version, it has a size of 32 bytes in 32-bit mode,
suggesting there is further scope for improvement by making the 64-bit
representation more compact than it currently is.
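
A compilable illustration of the effect (generic names, not the real class;
size comments assume a typical 64-bit ABI where an int is 4 bytes and a
std::vector is three pointers, i.e., 24 bytes):

#include <vector>

struct PaddedLayout {    // 4 + pad(4) + 24 + 4 + pad(4) + 24 = 64 bytes
    int cost;
    std::vector<int> preconditions;
    int base_cost;       // fourth attribute: padded to an 8-byte boundary
    std::vector<int> effects;
};

struct PackedLayout {    // 4 + 4 + 24 + 24 = 56 bytes
    int cost;
    int base_cost;       // second attribute: shares an 8-byte slot with cost
    std::vector<int> preconditions;
    std::vector<int> effects;
};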

Beyond the points A-F, our current data organization is cache-inefficient
because it mixes data that is rarely accessed with data that is commonly
accessed, and it mixes data that is frequently read and written with data that
is frequently read but not written. This is not ideal, and it could be
addressed by organizing the data differently. We could also make more of an
effort to keep data close together that is frequently accessed together. So
two more items for this:

    G. Use struct-of-vector representation instead of vector-of-struct
representation, keeping related data together.
    H. Separate rarely accessed data, frequently read data, and frequently
read+written data.
msg7414 (view) Author: malte Date: 2018-09-12.18:54:25
Also F., because this is unnecessary data in the *operators*, and we have so
many more operators than propositions.
msg7413 (view) Author: malte Date: 2018-09-12.18:53:02
Let me correct myself regarding how good we expect the small-vector optimization
(point C.) to be: I realized that the vectors we have are "which operators is a
given proposition a precondition of?", and that is often *not* a small number.

On a task with many operators, it's common that the ratio between the number of
operators and number of propositions is in the hundreds, and then (if, as is
typical, operators have few effects), these vectors will actually be large.

So I think the key improvement will really be to use ints instead of pointers
inside these vectors. Everything else will probably help, too, but I expect it
will help much less so.

Also, Guillem and I looked at the number of L1 cache misses and all-levels cache
misses for an example task with cachegrind. (We used the default parameters of
cachegrind; no idea which cache sizes this actually simulates, but let's for the
moment assume it is reasonable.)

The task is parking-sat11-strips/pfile10-037.pddl.

L1 cache misses:

32-bit: 1015778432
64-bit: 1509012539 (so ~50% more)

Last-level cache misses:

32-bit: 246781261
64-bit: 972779999 (so roughly 4 times as many)

Also, the overwhelming majority of cache misses happens at the same places in
the inner loop of the heuristic computation. In fact, the line

    cost += amount;

in additive_heuristic.h accounts for more than half of the cache misses of the
whole planner execution.


In summary, point A. from msg7405 should hopefully get 64-bit performance much
closer to 32-bit performance, and all the other performance enhancements will
perhaps have less of an impact.
msg7405 (view) Author: malte Date: 2018-09-12.15:49:03
Silvan and I had a closer look. We agree with what Jendrik said in issue213:
this is very likely due to cache efficiency.

The major slowdown happens in the inner loops of the computation of h^add, which
is algorithmically very simple code not using fancy data structures. (Nothing
but vectors, I think, and they are not resized during the run, only used like
arrays.)

Even before we had a closer look at the code, my impression was that the
affected domains are all ones with unusually large numbers of operators or
axioms. This supports the hypothesis that the working set of the heuristic
computation fits well into the cache in the 32-bit build, but not in the 64-bit
build, and that this is what is causing the slowdown.

The important classes are UnaryOperator and Proposition.
Possible ways of making them more compact:

A. Use integers with indices instead of pointers.
B. Use direct pointers to the data instead of vectors (for example with some
kind of terminator entry instead of a size field).
C. Something like the small string optimization. (Store small vectors "inline"
-- almost all our vectors will be small.)
D. Change the order of attributes in these classes to reduce losses due to
alignment.
E. Use bitfields for elements with small ranges. (For example, we have some
booleans; see the sketch after this list.)
F. Get rid of data that is only needed during initialization and does not need
to be part of the working set.
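
For point E, a small sketch (illustrative field names): bitfields can pack a
boolean and other small-range values into a single 32-bit word, where separate
bool and int members would occupy 8 or more bytes.

#include <cstdint>

struct PropositionFlags {
    uint32_t is_goal : 1;
    uint32_t marked : 1;
    uint32_t id : 30;
};
// Holds on mainstream ABIs (the standard leaves bitfield packing
// implementation-defined):
static_assert(sizeof(PropositionFlags) == 4,
              "three fields packed into one 32-bit word");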

All these are conceptually straightforward, but some of them are a bit of work
to implement. The most complicated one is C., but I also expect significant
gains from it because almost all our operators tend to have few preconditions.
This is similar to what we saw with successor generators, except that there it
was even more extreme because the overwhelming majority of vectors was of
size 1.

We looked at h^add, but the situation should be the same for a large set of
heuristics, including at least:

h^add, h^max, h^FF, h^CG, h^cea.

So some follow-up work is perhaps needed here. The first three in the list
share significant parts of their implementation, so they should probably be
dealt with at the same time.
msg7400 (view) Author: malte Date: 2018-09-12.14:37:09
Looks like time efficiency might be more critical than memory efficiency based
on the mentioned experiment.
msg7395 (view) Author: jendrik Date: 2018-09-12.13:54:57
Here are two memory profiles for tetris-sat14:p020.pddl comparing the two builds. They can be 
opened with massif-visualizer (available in Ubuntu repos).

From a quick inspection, I suspect the extra memory usage stems from the larger
size of the Proposition pointers stored in UnaryOperator.
msg7386 (view) Author: malte Date: 2018-09-12.11:18:22
This is part of issue213.

One specific configuration to test is
--heuristic h=add() --search lazy_greedy([h], preferred=[h])

See the experiments linked from msg7383.
History
Date User Action Args
2018-12-10 22:24:03  jendrik  set  status: reviewing -> resolved; messages: + msg8222
2018-12-10 20:02:00  malte    set  messages: + msg8219
2018-12-10 19:22:20  jendrik  set  status: in-progress -> reviewing; messages: + msg8218
2018-12-10 17:12:09  malte    set  messages: + msg8216
2018-12-10 16:39:41  jendrik  set  messages: + msg8215
2018-12-10 15:41:26  malte    set  messages: + msg8214
2018-12-09 08:59:42  jendrik  set  messages: + msg8212
2018-12-08 18:12:40  jendrik  set  messages: + msg8210
2018-12-08 12:59:05  jendrik  set  messages: + msg8202
2018-12-08 12:57:42  malte    set  messages: + msg8201
2018-12-08 12:52:24  jendrik  set  messages: + msg8199
2018-12-07 23:10:51  malte    set  messages: + msg8195
2018-12-07 23:05:52  jendrik  set  messages: + msg8194
2018-12-07 23:01:03  malte    set  messages: + msg8193
2018-12-07 22:42:43  jendrik  set  messages: + msg8192
2018-12-07 17:20:49  malte    set  messages: + msg8188
2018-12-07 17:16:16  malte    set  messages: + msg8187
2018-12-07 16:27:21  jendrik  set  messages: + msg8185
2018-12-07 15:59:13  jendrik  set  messages: + msg8184
2018-12-07 14:15:46  jendrik  set  messages: + msg8181
2018-12-07 09:13:20  malte    set  messages: + msg8178
2018-12-07 09:13:11  jendrik  set  messages: + msg8177
2018-12-07 08:51:23  jendrik  set  messages: + msg8175
2018-12-05 20:14:22  malte    set  messages: + msg8159
2018-12-05 20:08:45  jendrik  set  status: chatting -> in-progress; assignedto: jendrik; messages: + msg8157
2018-09-21 17:37:45  malte    set  messages: + msg7753
2018-09-21 13:37:51  malte    set  messages: + msg7738
2018-09-21 11:47:00  malte    set  messages: + msg7706
2018-09-21 11:41:05  malte    set  messages: + msg7705
2018-09-21 09:26:27  malte    set  messages: + msg7696
2018-09-20 19:41:16  malte    set  messages: + msg7670
2018-09-20 19:01:27  malte    set  messages: + msg7667
2018-09-20 18:29:04  malte    set  messages: + msg7663
2018-09-20 17:23:44  malte    set  messages: + msg7657
2018-09-20 11:14:54  jendrik  set  messages: + msg7623
2018-09-20 11:05:45  malte    set  messages: + msg7619
2018-09-20 10:11:53  thomas   set  nosy: + thomas
2018-09-19 21:49:08  malte    set  messages: + msg7613
2018-09-19 19:40:32  malte    set  messages: + msg7605
2018-09-19 19:26:29  malte    set  messages: + msg7604
2018-09-19 19:08:39  malte    set  messages: + msg7600
2018-09-19 18:57:42  malte    set  messages: + msg7599
2018-09-19 18:38:11  malte    set  messages: + msg7598; summary: "Possible steps: 1. Do a memory profile of --heuristic h=add() --search lazy_greedy([h], preferred=[h]). 2. Improve memory usage. 3. Rerun the affected experiments from msg7383." -> (removed)
2018-09-19 18:36:53  malte    set  messages: - msg7597
2018-09-19 18:36:50  malte    set  messages: - msg7585
2018-09-19 18:36:45  malte    set  messages: + msg7597
2018-09-19 16:19:48  malte    set  messages: + msg7585
2018-09-19 14:30:49  malte    set  messages: + msg7569
2018-09-19 14:00:59  malte    set  messages: + msg7567
2018-09-19 13:37:13  malte    set  messages: + msg7565
2018-09-19 10:23:12  malte    set  messages: + msg7558
2018-09-19 09:33:33  malte    set  messages: + msg7556
2018-09-19 00:10:00  malte    set  messages: + msg7552
2018-09-18 18:02:25  malte    set  messages: + msg7530
2018-09-17 19:38:34  malte    set  messages: + msg7496
2018-09-15 22:53:22  malte    set  messages: + msg7471; title: "make h^add/h^max/h^FF more efficient in 64-bit builds" -> "make h^add/h^max/h^FF more efficient, especially in 64-bit builds"
2018-09-14 08:40:45  cedric   set  nosy: + cedric
2018-09-13 11:03:21  florian  set  nosy: + florian
2018-09-12 21:45:07  guillem  set  messages: + msg7417
2018-09-12 21:41:12  guillem  set  nosy: + guillem
2018-09-12 21:34:20  malte    set  messages: + msg7416
2018-09-12 18:54:25  malte    set  messages: + msg7414
2018-09-12 18:53:02  malte    set  messages: + msg7413
2018-09-12 15:49:03  malte    set  messages: + msg7405; title: "make lazy greedy search more efficient in 64-bit builds" -> "make h^add/h^max/h^FF more efficient in 64-bit builds"
2018-09-12 14:37:09  malte    set  messages: + msg7400; title: "make lazy greedy search more memory-efficient in 64-bit builds" -> "make lazy greedy search more efficient in 64-bit builds"
2018-09-12 13:54:57  jendrik  set  files: + massif-profiles.tar.gz; status: unread -> chatting; messages: + msg7395
2018-09-12 12:45:14  silvan   set  nosy: + silvan
2018-09-12 11:31:14  jendrik  set  nosy: + jendrik
2018-09-12 11:18:22  malte    create