Title make h^add/h^max/h^FF more efficient, especially in 64-bit builds
Priority feature Status resolved
Superseder Nosy List cedric, florian, guillem, jendrik, malte, silvan, thomas
Assigned To jendrik Keywords
Optional summary

Created on 2018-09-12.11:18:22 by malte, last changed by jendrik.

File name Uploaded Type Edit Remove
massif-profiles.tar.gz jendrik, 2018-09-12.13:54:57 application/gzip
msg8222 (view) Author: jendrik Date: 2018-12-10.22:24:03

I opened issue877 for tracking ideas related to relaxation heuristics.
msg8219 (view) Author: malte Date: 2018-12-10.20:02:00
If nobody else objects, we can merge this.
msg8218 (view) Author: jendrik Date: 2018-12-10.19:22:20
I took care of your comments. Can we merge this?

I will start collecting possible performance enhancements in a new meta issue.
msg8216 (view) Author: malte Date: 2018-12-10.17:12:09
Thanks! I left two comments on bitbucket; no further comments.

Once this is merged, we should rerun the experiments of issue213 (I suggest
rerunning all of them; the landmark-based ones need rerunning, too, and other
things have changed in the code since).

We should also make sure we keep any ideas regarding performance enhancements or
other improvements that we discussed in the context of this issue and might
still want to implement.
msg8215 (view) Author: jendrik Date: 2018-12-10.16:39:41
Done. Here is the pull request:
msg8214 (view) Author: malte Date: 2018-12-10.15:41:26
Thanks! :-)

v14 vs. v16: no clear useful effect of the alignment. However, I noticed that
UnaryOperator is only 28 bytes naturally, so aligning it to 32 bytes wastes
memory, which is very likely to eat any positive effects we might otherwise
perhaps get. For now it's not worth using the alignment, but we may want to
revisit this if/when we add non-unary operators for more correct support of
conditional effects in h^FF (as discussed), in which case the operator data
structure would naturally have a size of 32 bytes.

Given that we are very deliberate about the memory for these data structures, I
would suggest adding the static_asserts for the size even without using the
alignment, i.e., static_assert the sizes of 16 and 28 bytes. This is part of the
desired performance characteristics of the code.

v16 vs. v17: looks good

default (v12-base) vs. v17: looks good! :-)

Suggested next steps:

1. Remove the __aligned__ attributes, but keep the static assertions, asserting
16 and 28 bytes.

2. Merge from default.

3. Prepare a pull request. I'd like to have a final look before we merge this. 

If anyone else wants to do a code review, please speak up, as we will otherwise
merge this soon.
msg8212 (view) Author: jendrik Date: 2018-12-09.08:59:42
Here are the results for the alignment experiment:
msg8210 (view) Author: jendrik Date: 2018-12-08.18:12:40
Here is the evaluation of the clean up (renaming ArrayChain to ArrayPool, v17):

And here is the comparison of the default and the issue branch:
msg8202 (view) Author: jendrik Date: 2018-12-08.12:59:05
You're right. I'll start two new experiments.
msg8201 (view) Author: malte Date: 2018-12-08.12:57:42
Hi Jendrik, great!

But I think to decide whether we like the alignment it would be good to have a
direct "with or without" experiment, i.e., a comparison betwene two versions
where the only difference is on the alignment, where one version forces
alignment for none of the data structures, and the other one forces alignment
for both. If I understand correctly, we don't currently have that, only
experiments for 0 vs. 1 and 1 vs. 2 of the data structures. Or did I get
confused about the versions?
msg8199 (view) Author: jendrik Date: 2018-12-08.12:52:24
Here are the results for the version that aligns the propositions to a 16-byte alignment (v16):

I made the cleanup you suggested (v17). Experiments comparing v17 to v16 and the default branch are 
msg8195 (view) Author: malte Date: 2018-12-07.23:10:51
I didn't want to do that because I don't think it's sufficiently polished. I
think it would need one or more two users to get a sufficiently rounded-off
interface for real reuse. For example, the pool should probably grow a reserve
method, and it would make sense to templatize it, but it's generally not a good
idea to nail down such API details without knowing what the users actually need.
Leaving it local to the relaxation heuristics sends a clearer message of "this
is a one-off tool we designed for one particular user; it isn't a general tool yet".
msg8194 (view) Author: jendrik Date: 2018-12-07.23:05:52
I agree.

Regarding the cleanup, I propose to move the new array_pool.{h,cc} files into the 
"algorithms" directory. What do you think?
msg8193 (view) Author: malte Date: 2018-12-07.23:01:03
Thanks! Again, clear win for v14, and this time also v15 seems to be quite an
improvement. (In absolute numbers, there is not much of a gain, but if you look
at the score_total_time by domain, it does seem to be a statistically
significant, though not very significant effect.)
msg8192 (view) Author: jendrik Date: 2018-12-07.22:42:43
Here are the results for h^max:
msg8188 (view) Author: malte Date: 2018-12-07.17:20:49
Regarding the cleanup: perhaps it's better to get this merged soon and worry
about making it more beautiful later, as it wasn't beautiful before this issue
either. In other words, I have no objections towards merging this without a lot
more cleanup.

It is clear that the relaxation heuristic code could be prettier, getting rid of
code duplications, etc., and also that there are further ideas for performance
improvements. But I think they can wait.

Two changes I'd like to see before merging:

- Get rid of (see comment inside the file).

- Find a better name for "ArrayChain". There seems to be no universally agreed
name for this sort of thing, but the word "pool" is often used when a bunch of
objects can be allocated individually but only deallocated collectively, which
is what this is about. So I suggest changing "ArrayChain" to "ArrayPool", of
course also making the file name changes and changes of variable names, related
class names etc.

Once we have something we would like to merge, I would also like to see a final
experiment comparing our latest version to whatever is the most appropriate
baseline revision from default.
msg8187 (view) Author: malte Date: 2018-12-07.17:16:15
The results for the no-sentinel change look very good.

The forced alignment makes no big difference in either this or the previous
experiment. I tested if we get 32-byte alignment automatically even without
forcing it (after all, all unary operators live in one large vector, and very
large objects are often handled specially by memory allocators), but this does
not seem to be the case. On the example I tested, without the forced alignment
the unary operators were not 32-byte aligned. So the alignment doesn't seem to
happen automatically. But experimentally it doesn't seem to have much of a
positive impact here.

Before we abandon the idea, I would like to see one more experiment, with and
without alignment, but also aligning the propositions with a 16-byte alignment.
(Please also add the corresponding assert about the size being 16 bytes.)

If the forced alignment doesn't help in this experiment, I suggest we get rid of
it again.
msg8185 (view) Author: jendrik Date: 2018-12-07.16:27:21
The last message actually holds the results for h^add.
msg8184 (view) Author: jendrik Date: 2018-12-07.15:59:13
Here are the corresponding results for h^max (see msg8175):
msg8181 (view) Author: jendrik Date: 2018-12-07.14:15:46
To simplify discussing what to clean up, I created a pull request:
msg8178 (view) Author: malte Date: 2018-12-07.09:13:20
Looks good! Can you add experiments for the other two heuristics as well? (In
the case of h^max, using A*.)
msg8177 (view) Author: jendrik Date: 2018-12-07.09:13:11
During live discussions we talked about another performance-related idea (that 
doesn't need to be tackled within this issue):

O. Make data needed for setup_exploration_queue() trivially copyable. This 
should make the compiler use a memcpy and thus speed up the initialization of 
the queue.
msg8175 (view) Author: jendrik Date: 2018-12-07.08:51:23
We have made two more changes to the code:

* Remove sentinel from ArrayChain (v14)
* Align UnaryOperator on 32 byte boundaries (v15)

Here are the results:
msg8159 (view) Author: malte Date: 2018-12-05.20:14:22
Looks very good!
msg8157 (view) Author: jendrik Date: 2018-12-05.20:08:45
[Assigning this to myself while I'm working it.]

I merged with the default branch (v12) and then implemented idea J. (v13). Here are the results:
msg7753 (view) Author: malte Date: 2018-09-21.17:37:45
Note mostly to self: I merged from default after v11, so future experiments
should compare to an updated baseline.
msg7738 (view) Author: malte Date: 2018-09-21.13:37:51
To close some open ends for the end of the sprint, here is the status regarding
the items A.-N. from the earlier messages:

A. Use integers with indices instead of pointers.
   => Done. This helped a lot.

B. Use direct pointers to the data instead of vectors (for example with
   some kind of terminator entry instead of a size field).
   => Done, although they are not exactly pointers. Also helped a lot.

C. Something like the small string optimization. (Store small vectors
   "inline" -- almost all our vectors will be small.)
   => not worth doing for the main loop as the "precondition_of" vectors
      are not small; may provide some benefit for the preferred operators
      calculation, but unclear if it's worth trying

D. Change the order of attributes in these classes to reduce losses
   due to alignment.
   => Done. Helped a lot at some point, but not at the point where it was
      actually done in the issue branch because at that point the most
      important class (UnaryOperator) only had a single 4-byte gap, which
      could only be moved around, not removed.

E. Use bitfields for elements with small ranges. (For example, we have
   some booleans.)
   => Done. If it helped, it didn't help much, but that is explainable
      because it only applied to Proposition, the less critical data
      structure. Our critical data structure, UnaryOperator, did not have
      such attributes.

F. Get rid of data that is only needed during initialization and does not
   need to be part of the working set.
   => Done.

G. Use struct-of-vector representation instead of vector-of-struct
   representation, keeping related data together.
   => Not done, might be worth trying out.

H. Separate rarely accessed data, frequently read data, and frequently
   read+written data.
   => Not done, might be worth trying out.

I. Compute h^FF as a side effect of marking preferred operators rather than
   actually computing the relaxed plan explicitly.
   => Not done, worth doing because it also simplifies the code.

J. Replace the current test of whether a marked operator is a preferred operator
   => Not done. Worth trying because it will make the code cleaner.

K. Compile the goal into an operator.
   => Not done. Would allow some simplifications while making the setup more
      complicated. Could be worth trying out, but not top priority.

L. Have all propositions live in a single vector rather than a 2D data
   => Done. This was necessary for several other optimizations replacing
      pointers by ints, which were critical.

M. Introduce a num_preconditions attribute.
   => Done. We did this together with another change (because we had to do it
      for that change), so the performance was not evaluated separately.

N. Precompute the set of no-precondition actions.
   => Not done. Probably won't change too much, but would also be quite
      easy to do. Not high priority.
msg7706 (view) Author: malte Date: 2018-09-21.11:47:00
This won't get merged for the current sprint. Even if the cleanup could be done
very quickly, there would be no time to run the necessary experiments in time.
So the question is when we can continue working on this.

I have two other urgent commitments coming up (a project proposal presentation
and a project report), but hopefully I can work on this about a week from today.
If not, we should perhaps think of alternative ways to proceed, so that this
won't languish for a long time.
msg7705 (view) Author: malte Date: 2018-09-21.11:41:05
Corrected link for the first experiment mentioned in the previous message:
msg7696 (view) Author: malte Date: 2018-09-21.09:26:26
Experimental results for 30 minutes and for h^max are in.

30-minute experiments:


They look good. The coverage differences between base and v10 are much smaller
than for 5 minutes, but that is to be expected because with the larger time
limit, the memory limit becomes more of an issue.

We get the following changes in coverage between base and v10:

h^add, 32-bit: 2115 => 2121 (+6)
h^add, 64-bit: 2105 => 2121 (+16)

h^FF,  32-bit: 2068 => 2074 (+6)
h^FF,  64-bit: 2060 => 2073 (+13)

So still some nice improvements, and the 32- vs. 64-bit comparison also looks
fine to me.

experiments with h^max:

A*+h^max, 32-bit: 800 => 797 (-3)
A*+h^max, 64-bit: 793 => 796 (+3)

So we get the desired effect of 64-bit becoming closer to 32-bit, but we
actually lose some 32-bit coverage. The h^max code is very similar to the h^add
code, so I wondered what might be going on here.

Some of the changes we made to get rid of pointers require a bit more work in
the heuristic computation than before, and I suspect this is the main cause. In
h^add and h^FF that was compensated by a smaller memory footprint during the
computation helping a lot with cache performance. With A* + h^max, large tasks
where this could matter typically cannot be solved anyway because this
configuration is almost never able to solve large problems anyway. So the
overhead is more pronounced.

In all three tasks we lose, the solution times for the base revision are in
excess of 1600 seconds, somethimes more than 1750 seconds. Perhaps we are also a
bit unlucky with grid noise.

So all in all I think this is mostly fine. We might claw back a bit of runtime
during cleanup. For the preconditions data structure, we currently use a size
field and a sentinel, and we only need one of the two. There are also some
things in the data structure that we don't need for h^max, so we could also
improve performance a bit by giving it its own data structures or separating out
the data structures only needed for preferred operators. But I think for the
overall goal of allowing a 64-bit default build, these numbers already look fine.

So the main thing that needs to be done next is to clean up again.
msg7670 (view) Author: malte Date: 2018-09-20.19:41:16
Experiments with h^max have been started.

Note to self: v11 is identical to v10 except that it includes h^max, so no need
for experiments with the other heuristics.
msg7667 (view) Author: malte Date: 2018-09-20.19:01:27
I found 5-minute data for the base revision (from the base vs. v4 experiment),
so we can actually have a look at 5-minute coverage for base vs. v10:

h^add, 32-bit: 2034 => 2045 (+6)
h^add, 64-bit: 2017 => 2045 (+28)

h^FF,  32-bit: 2001 => 2006 (+5)
h^FF,  64-bit: 1998 => 2008 (+10)
msg7663 (view) Author: malte Date: 2018-09-20.18:29:04
I combined the data from some of the old experiments and was disappointed to see
that v10 has significantly worse coverage than the base revision -- until I
realized that the early experiments used a 30-minute timeout and we later
switched to a 5-minute timeout, so I was comparing the two revisions with
different timeouts. :-)

An apples-to-apples comparison looks much nicer. This table includes a bunch of
different revisions:

base and v1 use a 30-minute timeout, while v6, v7, v9 and v10 use a 5-minute
timeout. Both v6 and v7 should be reasonably good revisions, while v9 is broken.
Going from v6 to v10 we see the following changes in coverage for the release

h^add, 32-bit: 2039 => 2045 (+6)
h^add, 64-bit: 2029 => 2045 (+16)

h^FF,  32-bit: 2005 => 2006 (+1)
h^FF,  64-bit: 2000 => 2008 (+8)

In particular, in v10 the 64-bit build has caught up with the 32-bit build, at
least for the 5-minute timeout. If we look at the scores, the time scores are
slightly better for the 64-bit build and the memory scores are slightly worse,
which is more or less what we would expect/hope.

The new thing in v10 is that it uses a custom data structure to reduce the
vector overhead: instead of 12 bytes in 32-bit and 24 bytes in 64-bit, each
vector now uses 4 bytes in the UnaryOperator/Proposition data structures
themselves, plus 4 bytes in UnaryOperator for storing the number of
preconditions per operator because we want to be able to access that quickly.

All the vector contents are stored in one big vector, so there are some
additional memory savings per vector for each UnaryOperator/Proposition due to
less memory management overhead. On the negative side, we pay 4 additional bytes
per vector for a sentinel object. This could be changed: given that we store the
number of preconditions per operator anyway, we could replace the sentinel
object by a length field for the usage in UnaryOperator, and the usage in
Proposition is less critical, so we could add a size field there, too.

I have started a 30-minute baseline vs. v10 experiment to see what the results
are for the longer timeout.

Even if the results are good, there are two major drawbacks with v10 that need
to be addressed before we can consider merging this.

- h^max is currently disabled because it has not yet been adapted to the
  changes in the base class.

- The new code is too ugly to merge because all changes after v4 were written
  as quick prototypes. So even if performance turns out to be good enough,
  there is still a lot of cleanup work to do.

I think I can address the first of these two today and start an experiment
overnight. I won't get the second of these done today, but perhaps I can make at
least a bit of progress.
msg7657 (view) Author: malte Date: 2018-09-20.17:23:44
OK, the good news is that the new revision (v10 = v9 minus the segfault plus
space-optimizing UnaryOperator) solves far more problems than v9. However,
that's largely because v9 produces all these segfaults. ;-)

I'll try to cook up a comparison between v10 and some older non-broken revisions
without rerunning the experiment. If that works out, the data will be slightly
less reliable because the data were aggregated at different times on the grid,
and we know that the other running experiments etc. affect the outcomes.
msg7623 (view) Author: jendrik Date: 2018-09-20.11:14:54
I agree that the safe-STL mode sounds interesting.
msg7619 (view) Author: malte Date: 2018-09-20.11:05:45
Anticlimax time! The experiments for v6-v9 are done, and there are no sustantial
improvements. However, I have excuses:


This is where we reorder the attributes of Proposition and UnaryOperator for
better 64-bit memory layout. We noticed before that doing this for Proposition
doesn't really help because there are typically so many more unary operators
than propositions.

On closer inspection, it turns out that the 64-bit memory layout for
UnaryOperator was already as good as it could get with the current collection of
attributes. The change merely moves a 4-byte gap from the middle of the object
to the end, which does not really help. If I looked at it correctly, we do save
8 bytes for the propositions, but they don't matter much. (Reordering unary
operators would have helped if we had done it earlier, but with the move from
pointers to ints, the other gaps we had have already been closed.)


This is where we change some attributes to bitfields to save space. However,
this is for Proposition, not for UnaryOperator, which I overlooked yesterday,
and as said above, that class doesn't matter so much. There are some minor
improvements that may or may not be noise.

v8 and v9:

This is where we replace vectors with a more compact data structure, but it
turns out I optimized the wrong data structure again, Proposition rather than
UnaryOperator. So no real impact. Of the two variants, the vector-based one (v9)
is slightly better in terms of time and the deque-based one (v8) is slightly
better in terms of memory, as expected, with the vector-based one looking
slightly better overall to me. But perhaps both variants are still worth
checking once I've actually applied this to the correct class.

There are also lots of segmentation faults here because I resized a vector to
the number of unary operators where I should have used the number of
propositions, so it crashes on (the few) tasks where there are more facts than
operator effects. In the end it wasn't too hard to diagnose this, but it would
have been even easier if vector::operator[] performed bounds checking in debug
mode. I found this link, which is something we might want to consider:

The seg fault is fixed. Next step: apply the optimization of v8/v9 to the
correct data structure.
msg7613 (view) Author: malte Date: 2018-09-19.21:49:08
Results for v5 are in:

We hoped/expected performance to remain similar from v4 to v5 in 32-bit mode,
but to improve in 64-bit mode, although we did not expect 64-bit mode to catch
up with 32-bit mode. The coverage numbers more or less bear this out:

h^add, 32-bit: 2038 => 2038 (unchanged)
h^add, 64-bit: 2018 => 2031 (+13)

h^FF,  32-bit: 2006 => 2003 (-3)
h^FF,  64-bit: 1990 => 1998 (+8)

(All results for release builds; the debug builds are only included in the
experiment to test if they trip an assertion.)

So h^FF has suffered a bit in 32-bit mode, but we got the desired improvements
in 64-bit mode. As expected, there is still a non-trivial gap between 32-bit and
64-bit, but the numbers are moving in the right direction.

In addition to v6 and v7, I have submitted two quite speculative new revisions
v8 and v9 that are supposed to help with the vector space overhead. They are
almost identical, except that v8 uses a deque and v9 uses a vector in an
internal data structure. The vector should have faster access, but the deque
might have a nicer peak memory profile because it grows in a more stable way.

All of these are 5-minute experiments, so we should have results for v6-v9 by
tomorrow morning.
msg7605 (view) Author: malte Date: 2018-09-19.19:40:32
I started experiments for v6, which reorders the attributes of Proposition and
UnaryOperator in a way that should help with 64-bit performance.

I also started experiments for v7, which should the size of Proposition by 8
bytes by using bitfields. The disadvantage of bitfields is that they require
additional computation, and they don't always pay off. I tried to combine
attributes that aren't commonly accessed in the first place and that don't
require arithmetic to mitigate these issues.

The expectation for v6 is that it makes essentially no difference for 32 bits
and that it helps a bit for 64 bits on tasks that are large but simple.

The expectation for v7 is that it may help a little bit for both versions on
tasks that are large but simple, but this is more speculative.

This exhausts the low-hanging fruit for reducing the 32-bit/64-bit memory layout
gap. More things can certainly be done, but they require a bit more code.
msg7604 (view) Author: malte Date: 2018-09-19.19:26:29
I started experiments for v5, which uses (integer) indices instead of pointers
everywhere in the core data structures of h^add, h^max and h^FF.

I hope/expect that performance in 32-bit mode will remain quite similar to the
way it currently is, and that performance in 64-bit mode will become closer to

However, I expect a substantial gap to remain between the two because the 64-bit
memory layout is still unfavourably affected by alignment issues and its vectors
are twice as large (the vector itself, not its contents). But with indices
instead of pointers used everywhere, further changes will be much simpler.
msg7600 (view) Author: malte Date: 2018-09-19.19:08:39
OK, so in all the SIGKILL cases, the translator was killed after running out of
time. It is supposed to be aborted more gracefully, and in all cases except for
these 10 this also worked, so I'm not sure what to do but am happy to leave this
alone for now. If someone wants to have a look, here is a ways to log at the
run.log files; vary as necessary for driver.log, properties etc.

Can be copy-pasted into bash:

cd /infai/helmert/downward/malte/experiments/issue814/data/issue814-v4

for x in $BADRUNS; do
    echo $x
    GROUP=$(echo $x | head -c 3)
    less $RUNDIR/run.log
msg7599 (view) Author: malte Date: 2018-09-19.18:57:42
Results for v4 are in:

Good news:
- Simplifying is still much faster in v4 than in base.
- We still have a very nice improvement in the memory score.
- We see a coverage increase in settlers, which may or may not be caused
  by the filtering of duplicate preconditions.
- We now simplify to the same number of unary operators as in base, so the
  bugs in the previous versions are fixed. (The numbers only differ in
  trucks_strips, where the translator behaves unpredictably.)
- No more failed assertions.

Bad news:
- There are some unexplained errors in the report which I have not seem
  before, namely sigkills. They seem to happen mostly, but perhaps (hard to
  say at the moment) in cases where the translator failed to complete in the
  given time/memory limits (300 seconds, however much memory).
  They occur both for base and the new code, so it does not seem to be related.
  Still, probably something that shouldn't happen. I'll take a closer look.
msg7598 (view) Author: malte Date: 2018-09-19.18:38:11
I am removing the old summary, which comes from the time before we knew what was
going on here.
msg7569 (view) Author: malte Date: 2018-09-19.14:30:49
Duplicated preconditions are now filtered in the relaxation heuristics. I have
tagged v4 with this change and started a new 5-minute experiment. I'm comparing
to base rather than v3 because v1, v2 and v3 all misbehave in various ways.

In the affected domains (= only settlers, I think), the change should improve
h^add, and it may lead to more simplification because there are now more cases
with 5 or fewer preconditions, which is the magic number for simplification.

I left a note about these duplicates and their special treatment in
RelaxationHeuristic at issue497, which is about canonicalizing operator
definitions and similar things.
msg7567 (view) Author: malte Date: 2018-09-19.14:00:59
Indeed, there are duplicated preconditions in Settlers, and it's related to the
stupid way our file format includes actual preconditions (not just effect
conditions) as part of effect specifications.

For example, in settlers-sat18-adl/p01.pddl the action "build-docks p1" contains
71 effects (most of them conditional), including for example these:

1 902 0 893 0 1
1 900 0 893 0 1

Both of these encode the precondition (not effect condition) var893 = 0.
msg7565 (view) Author: malte Date: 2018-09-19.13:37:13
Results for v1 vs. v2 here:

Results for v2 vs. v3 here:

The debug experiments uncovered a violated assumption. We assume that the set of
unary operator preconditions is sorted and unique, but it is not. Because we
sort the preconditions ourselves, this must mean (unless there is a bug) that it
is not unique, i.e., the same fact is listed multiple times as a precondition.
This may include the case where it is listed once as a precondition and once as
an effect condition. All examples of this are in the settlers domain.

If we indeed have replicated conditions, then getting rid of them could also
benefit other heuristics, so it is perhaps something to keep in mind beyond the
scope of this issue.
msg7558 (view) Author: malte Date: 2018-09-19.10:23:12
I tagged v3 with the fix for RelaxationHeuristic::simplify and started an
experiment comparing v2 and v3.

In the interest of not clogging the grid, I wanted to use a 1-minute timeout,
but then saw this is not enough to translate the first task in agricola, which
is one of the affected domains. So I'm using a 5-minute timeout instead.
msg7556 (view) Author: malte Date: 2018-09-19.09:33:33
Experiments for base vs. v1 are in:


* We see some improvements in coverage. However, all or almost all of them
  will be due to different tie-breaking in determing best achievers, which
  is just us being more lucky now. So nothing to see here.

* We reach the h^add clamp in a number of domains, which is unfortunate,
  but I cannot think of a good and easy solution for this. For now, I'm
  fine with not doing anything about it.

* Memory scores are noticeably improved. This makes sense because we are
  now more careful about memory usage in the heuristic setup. For example.
  RelaxationHeuristic::simplify no longer maintains three copies of the
  operator data.

* RelaxationHeuristic::simplify is much faster. For the additive heuristic,
  its runtime is cut in half.

* The negative news: the behaviour of simplification has changed in a few
  domains. I now see what could be causing this. For unary operators with
  more than 5 preconditions, the old code was able to filter out exact
  duplicates (in terms of preconditions and effects), while the new code
  ignores them entirely. (Number of unary operator preconditions is the
  sum of operator preconditions and effect conditions.)

  This is easy to fix, and fixing it will make simplify slower again in the
  affected domains, but there are only a few such domains, so it should still
  be much faster overall than the current code.

Experiments for v1 vs. v2 are still running, but should be done soon.
msg7552 (view) Author: malte Date: 2018-09-19.00:10:00
Experiments for v2 have been started. Results for both v1 and v2 should be
available by tomorrow morning.

The difference between v1 and v2 is mainly code cleanup and preparation for the
following (presumably) more performance-critical changes. The idea is to isolate
the larger, but less critical changes, so that we can verify the assumption that
they are indeed less critical. This will then help evaluate the impact of the
changes in v3 (which is yet to come) in isolation. Of course, we've been wrong
before about what is and isn't critical. :-)

On another note, in msg7496 I wrote that "the expectation is that h^add
behaviour should not change". This is wrong: the code should not change the
h^add heuristic values, but it will affect the h^add preferred operators, and
these are important for performance. h^add preferred operators are the same as
h^FF preferred operators, so they are subject to the same tie-breaking dependencies.

What this means is that we should expect arbitrary performance fluctuations
already in v1, which is a bit unfortunate, but of course that is also the reason
why we want to get the changes that affect tie-breaking out of the way at the
start of the branch. Unless I'm missing something at the moment, tie-breaking
should remain stable from v1 onwards.
msg7530 (view) Author: malte Date: 2018-09-18.18:02:25
The experiments for v1 (which only affects RelaxationHeuristic::simplify) have
been started.

These are the first experiments we are running. It is most definitely not the
case that someone rewrote revision control history to hide the fact that their
first implementation had a subtle bug.
msg7496 (view) Author: malte Date: 2018-09-17.19:38:34
We started with a cleanup of RelaxationHeuristic::simplify.
The code is in my bitbucket repository:

This is not the most obvious part in need of improvement, but our study of the
relaxation heuristic code shows that some things going there are silly, and now
that we actually understand the code deeply might be a good time to make it less
silly. (For what it's worth, we have had issues with this code multiple times.)

Changing simplify can change the order of unary operators and their
preconditions, which in turn can affect tie-breaking in the FF heuristic in
unpredictable ways. So this can have rather wild effects on FF, and it would be
good to run an experiment. It would also be good to run an experiment to see
that the new code is not broken because it is quite hairy (though less hairy
than the old code, in my opinion).

@Jendrik: I will need your help with this. We can either set up an experiment
together, or if you can already start it yourself, that would of course also be

The revisions to compare are issue814-base and issue814-v1. I added output to
show the time for simplify, so issue814-base is not actually the revision from
the default but "default + this output".

I would suggest running "--evaluator h=add() --search
lazy_greedy([h],preferred=[h])" plus the same configuration with ff() instead of
add(), on the full satisficing suite.

In addition, we should run debug configurations. I think it is enough to run a
debug configuration for h^add, not also h^ff.

I suggest a 30-minute timeout. It is not really needed for this experiment, but
I think it may help to have the same setup for this and our later experiments in
this issue, and later ones will want 30 minutes.

The expectation is that h^add behaviour should not change, but h^FF behaviour
might. The new code should simplify away the same number of unary operators as
the old one.

As additional attributes, we should parse the number of unary operators before
and after simplification and the time to simplify. Example output:

Simplifying 29928 unary operators... done! [17292 unary operators]
time to simplify: 0.022623s

=> Here we want to extract 29928 (simplify_before), 17292 (simplify_after) and
0.022623s (simplify_time).
msg7471 (view) Author: malte Date: 2018-09-15.22:53:22
I had a closer look at how the code works, what data is used where, and so on.

The critical data structure seems to be "UnaryOperator". The main loop of the
heuristic only needs the attributes "unsatisfied_preconditions", "cost" and
"effect" of this data structure, so it would make sense to have the main data
structure only consist of these three, so that 5 unary operators fit into a
single cache line. (A cache line is 64 bytes, and these three attributes can be
stored in 12 bytes if we change "effect" from a pointer into an index, which
would require reorganizing how the propositions are stored.)

The three attributes don't quite play the same role: "unsatisfied_preconditions"
and "cost" are touched more frequently (whenever a precondition of an operator
is reached) and are read and written, while "effect" is only accessed after all
preconditions of an operator have been reached, and is read-only. I think it is
still a good idea to not split them up further because the three are frequently
accessed together.

For propositions, the key attributes are "cost", "reached_by" (both accessed
whenever an operator with the proposition as an effect is triggered; read/write
access; the latter only needed when we want preferred operators or for the FF
heuristic), "is_goal" and "precondition_of" (accessed only once when the
proposition is processed; read access). When the latter two are accessed, "cost"
is also accessed, so it may or may not be a good idea to keep all four together.

All other attributes are not accessed in the main heuristic computation. The
ones that are involved in the per-state setup or in the preferred operator
computation are still performance-critical, but less so than the ones involved
in the main heuristic computation, and I think it would make sense to store them

The attributes of UnaryOperator are used as follows:

operator_no => needed for preferred operator computation
precondition => needed for preferred operator computation;
  *size* of the vector needed for per-state initialization;
  may be a good idea to store this size separately to speed up this initialization
effect => needed for heuristic computation
base_cost => needed for per-state initialization;
  also needed for preferred operator computation, but that could be solved
  (see point J. below)
unsatisfied_preconditions => needed for heuristic computation;
  needed for per-state initialization
cost => needed for heuristic computation;
  needed for per-state initialization;
  also needed for preferred operator computation, but that could be solved
  (see point J. below)

The attributes of Proposition are used as follows:

is_goal => needed for heuristic computation
id => not needed after one-time setup of heuristic is done
precondition_of => needed for heuristic computation
cost => needed for heuristic computation;
  needed for per-state initialization
reached_by => needed for heuristic computation (only when using pref. ops.
  or for FF heuristic);
  needed for preferred operator computation
marked => needed for preferred operator computation

Some possible changes of how things are computed that could help improve data
organization, continuing the numbering from the earlier messages:

    I. Compute h^FF as a side effect of marking preferred operators rather than
       actually computing the relaxed plan explicitly.

This is a simple optimization that could make the code both faster and leaner.

    J. Replace the current test of whether a marked operator is a preferred operator
       (= is applicable) by something more principled.

We used to test cost == 0, which was fine until we added axioms and operators
with cost 0. Now we test cost == 0 as a quick necessary criterion and then check
whether we actually have an operator (not an axiom) and test for applicability.
It should be possible not to test "cost" here by checking whether all
preconditions have reached_by == nullptr. (We have to iterate over the
preconditions anyway.) We still need to test that it is an operator, not an
axiom. A nice side-effect is that we no longer need to test for applicability.

    K. Compile the goal into an operator.

This is frequently useful, and we already do this in other heuristics. The main
advantage is that we can rid of the is_goal attribute.

    L. Have all propositions live in a single vector rather than a 2D data

This is necessary for replacing "Proposition *" by "int" (indices) in our main
data structures, which we need to do to make 64-bit builds as compact as 32-bit

    M. Introduce a num_preconditions attribute.

This could further speed up the setup step of the heuristic computation.
Depending on how we split the data, we could go to the extreme of having the
initialization be a simple memcpy, but we would perhaps have to pay for that

    N. Precompute the set of no-precondition actions.

This could further speed up the setup step of the heuristic computation.

An alternative, in some ways more elegant way of achieving the same effect would
be to introduce an artificial precondition, like we do in the "normal form" of
delete-relaxed tasks we use in our lecture (e.g. for LM-Cut). This could be
combined with compiling the current state into an action, analogously to
compiling the goal into an action (having an artificial "initial state" atom "i"
and a 0-cost action with precondition "i" and effects representing the actual
state. But I think while this would in some sense be conceptually more elegant,
I don't think the code would be cleaner. (Also, it would probably be a bit
slower, though perhaps not enough to matter.) So it's probably not worth
following up on this idea.
msg7417 (view) Author: guillem Date: 2018-09-12.21:45:07
Looks like the real optimization workshop was in Basel, not Bern :-)
msg7416 (view) Author: malte Date: 2018-09-12.21:34:20
I played around a bit with A.+D.+E.

Using vectors of ints instead of pointers had close to no effect in my
unscientific experiment, as did the changes I made to make the Proposition class
more compact. But making UnaryOperator a bit more compact using D had a very
noticeable effect, reducing 64-bit runtime from 21 seconds to 17 seconds in my
example task.

All that needs to be done for this is to make "base_cost" the second rather than
fourth attribute of Proposition. If it is the fourth attribute, as currently, we
lose 8 bytes per object to alignment, as pointers (including vectors, which are
three pointers) are aligned on 8-byte boundaries. Currently, Proposition has a
size of 56 bytes in 64-bit mode. With the change, it has 48 bytes in 64-bit
mode. In either version, it has a size of 32 bytes in 32-bit mode, suggesting
there is further scope for improvement by making the 64-bit representation more
compact than it currently is.

Beyond the points A-F, our current data organization is cache-inefficient
because it mixes data that is rarely access with data this commonly accessed,
and it mixes data that is frequently read and written with data this frequently
read but not written. This is not ideal, and it could be addressed by organizing
the data differently. We could also make more of an effort to keep data close
together that is frequently accessed together. So two more items for this:

    G. Use struct-of-vector representation instead of vector-of-struct
representation, keeping related data together.
    H. Separate rarely accessed data, frequently read data, and frequently
read+written data.
msg7414 (view) Author: malte Date: 2018-09-12.18:54:25
Also F., because this is unnecessary data in the *operators*, and we have so
many more operators than propositions.
msg7413 (view) Author: malte Date: 2018-09-12.18:53:02
Let me correct myself regarding how good we expect the small-vector optimization
(point C.) to be: I realized that the vectors we have are "which operators is a
given proposition a precondition of?", and that is often *not* a small number.

On a task with many operators, it's common that the ratio between the number of
operators and number of propositions is in the hundreds, and then (if, as is
typical, operators have few effects), these vectors will actually be large.

So I think the key improvement will really be to use ints instead of pointers
inside these vectors. Everything else will probably help, too, but I expect it
will help much less so.

Also, Guillem and I looked at the number of L1 cache misses and all-levels cache
misses for an example task with cachegrind. (We used the default parameters of
cachegrind; no idea which cache sizes this actually simulates, but let's for the
moment assume it is reasonable.)

The task is parking-sat11-strips/pfile10-037.pddl.

L1 cache misses:

32-bit: 1015778432
64-bit: 1509012539 (so ~50% more)

Last-level cache misses:

32-bit: 246781261
64-bit: 972779999 (so roughly 4 times as many)

Also, the overwhelming majority of cache misses happens at the same places in
the inner loop of the heuristic computation. In fact, the line

    cost += amount;

in additive_heuristic.h account for more than half of the cache misses of the
whole planner execution.

In summary, point A. from msg7405 should hopefully get 64-bit performance much
closer to 32-bit performance, and all the other performance enhancements will
perhaps have less of an impact.
msg7405 (view) Author: malte Date: 2018-09-12.15:49:03
Silvan and I had a closer look. We agree with what Jendrik said in issue213:
this is very likely due to cache efficiency.

The major slowdown happens in the inner loops of the computation of h^add, which
is algorithmically very simple code not using fancy data structures. (Nothing
but vectors, I think, and they are not resized during the run, only used like

Even before we had a closer look at the code, my impression was that the
affected domains are all ones with unusually large numbers of operators or
axioms. This supports the hypothesis that the working set of the heuristic
computation fits well into the cache in the 32-bit build, but not in the 64-bit
build, and that this is what is causing the slowdown.

The important classes are UnaryOperator and Proposition.
Possible ways of making them more compact:

A. Use integers with indices instead of pointers.
B. Use direct pointers to the data instead of vectors (for example with some
kind of terminator entry instead of a size field).
C. Something like the small string optimization. (Store small vectors "inline"
-- almost all our vectors will be small.)
D. Change the order of attributes in these classes to reduce losses due to
E. Use bitfields for elements with small ranges. (For example, we have some
F. Get rid of data that is only needed during initialization and does not need
to be part of the working set.

All these are conceptually straight-forward, but some of them are a bit of work
to implement. The most complicated one is C., but I also expect significant
gains from this because almost all our 
operators tend to have few preconditions. This is similar to what we saw with
successor generators, except that for these it is even more extreme because the
overwhelming majority of vectors was of size 1.

We looked at h^add, but the situation should be the same for a large set of
heuristics, including at least:

h^add, h^max, h^FF, h^CG, h^cea.

So perhaps some follow-up work needed here. The first three in the list share
significant parts of the implementation, so should probably be dealt with at the
same time.
msg7400 (view) Author: malte Date: 2018-09-12.14:37:09
Looks like time efficiency might be more critical than memory efficiency based
on the mentioned experiment.
msg7395 (view) Author: jendrik Date: 2018-09-12.13:54:57
Here are two memory profiles for tetris-sat14:p020.pddl comparing the two builds. They can be 
opened with massif-visualizer (available in Ubuntu repos).

From a quick introspection, I suspect the extra memory usage stems from the larger size of 
the Proposition pointers stored in UnaryOperator.
msg7386 (view) Author: malte Date: 2018-09-12.11:18:22
This is part of issue213.

One specific configuration to test is
--heuristic h=add() --search lazy_greedy([h], preferred=[h])

See the experiments linked from msg7383.
Date User Action Args
2018-12-10 22:24:03jendriksetstatus: reviewing -> resolved
messages: + msg8222
2018-12-10 20:02:00maltesetmessages: + msg8219
2018-12-10 19:22:20jendriksetstatus: in-progress -> reviewing
messages: + msg8218
2018-12-10 17:12:09maltesetmessages: + msg8216
2018-12-10 16:39:41jendriksetmessages: + msg8215
2018-12-10 15:41:26maltesetmessages: + msg8214
2018-12-09 08:59:42jendriksetmessages: + msg8212
2018-12-08 18:12:40jendriksetmessages: + msg8210
2018-12-08 12:59:05jendriksetmessages: + msg8202
2018-12-08 12:57:42maltesetmessages: + msg8201
2018-12-08 12:52:24jendriksetmessages: + msg8199
2018-12-07 23:10:51maltesetmessages: + msg8195
2018-12-07 23:05:52jendriksetmessages: + msg8194
2018-12-07 23:01:03maltesetmessages: + msg8193
2018-12-07 22:42:43jendriksetmessages: + msg8192
2018-12-07 17:20:49maltesetmessages: + msg8188
2018-12-07 17:16:16maltesetmessages: + msg8187
2018-12-07 16:27:21jendriksetmessages: + msg8185
2018-12-07 15:59:13jendriksetmessages: + msg8184
2018-12-07 14:15:46jendriksetmessages: + msg8181
2018-12-07 09:13:20maltesetmessages: + msg8178
2018-12-07 09:13:11jendriksetmessages: + msg8177
2018-12-07 08:51:23jendriksetmessages: + msg8175
2018-12-05 20:14:22maltesetmessages: + msg8159
2018-12-05 20:08:45jendriksetstatus: chatting -> in-progress
assignedto: jendrik
messages: + msg8157
2018-09-21 17:37:45maltesetmessages: + msg7753
2018-09-21 13:37:51maltesetmessages: + msg7738
2018-09-21 11:47:00maltesetmessages: + msg7706
2018-09-21 11:41:05maltesetmessages: + msg7705
2018-09-21 09:26:27maltesetmessages: + msg7696
2018-09-20 19:41:16maltesetmessages: + msg7670
2018-09-20 19:01:27maltesetmessages: + msg7667
2018-09-20 18:29:04maltesetmessages: + msg7663
2018-09-20 17:23:44maltesetmessages: + msg7657
2018-09-20 11:14:54jendriksetmessages: + msg7623
2018-09-20 11:05:45maltesetmessages: + msg7619
2018-09-20 10:11:53thomassetnosy: + thomas
2018-09-19 21:49:08maltesetmessages: + msg7613
2018-09-19 19:40:32maltesetmessages: + msg7605
2018-09-19 19:26:29maltesetmessages: + msg7604
2018-09-19 19:08:39maltesetmessages: + msg7600
2018-09-19 18:57:42maltesetmessages: + msg7599
2018-09-19 18:38:11maltesetmessages: + msg7598
summary: Possible steps: 1. Do a memory profile of --heuristic h=add() --search lazy_greedy([h], preferred=[h]). 2. Improve memory usage. 3. Rerun the affected experiments from msg7383. ->
2018-09-19 18:36:53maltesetmessages: - msg7597
2018-09-19 18:36:50maltesetmessages: - msg7585
2018-09-19 18:36:45maltesetmessages: + msg7597
2018-09-19 16:19:48maltesetmessages: + msg7585
2018-09-19 14:30:49maltesetmessages: + msg7569
2018-09-19 14:00:59maltesetmessages: + msg7567
2018-09-19 13:37:13maltesetmessages: + msg7565
2018-09-19 10:23:12maltesetmessages: + msg7558
2018-09-19 09:33:33maltesetmessages: + msg7556
2018-09-19 00:10:00maltesetmessages: + msg7552
2018-09-18 18:02:25maltesetmessages: + msg7530
2018-09-17 19:38:34maltesetmessages: + msg7496
2018-09-15 22:53:22maltesetmessages: + msg7471
title: make h^add/h^max/h^FF more efficient in 64-bit builds -> make h^add/h^max/h^FF more efficient, especially in 64-bit builds
2018-09-14 08:40:45cedricsetnosy: + cedric
2018-09-13 11:03:21floriansetnosy: + florian
2018-09-12 21:45:07guillemsetmessages: + msg7417
2018-09-12 21:41:12guillemsetnosy: + guillem
2018-09-12 21:34:20maltesetmessages: + msg7416
2018-09-12 18:54:25maltesetmessages: + msg7414
2018-09-12 18:53:02maltesetmessages: + msg7413
2018-09-12 15:49:03maltesetmessages: + msg7405
title: make lazy greedy search more efficient in 64-bit builds -> make h^add/h^max/h^FF more efficient in 64-bit builds
2018-09-12 14:37:09maltesetmessages: + msg7400
title: make lazy greedy search more memory-efficient in 64-bit builds -> make lazy greedy search more efficient in 64-bit builds
2018-09-12 13:54:57jendriksetfiles: + massif-profiles.tar.gz
status: unread -> chatting
messages: + msg7395
2018-09-12 12:45:14silvansetnosy: + silvan
2018-09-12 11:31:14jendriksetnosy: + jendrik
2018-09-12 11:18:22maltecreate