Issue839

Title: Remove FF-landmark synergy
Priority: wish
Status: resolved
Superseder: (none)
Nosy List: cedric, jendrik, malte, silvan
Assigned To: jendrik
Keywords: (none)
Optional summary

Created on 2018-09-20.14:48:25 by jendrik, last changed by malte.

Messages
msg7799 (view) Author: malte Date: 2018-09-25.21:25:07
Many thanks, Jendrik!
msg7798 (view) Author: jendrik Date: 2018-09-25.18:25:47
I asked around and everybody was happy to see the synergy go away. So I went 
ahead and merged this. I think we can announce the change together with the 
changes we made during the sprint.

We finally did it, the synergy is gone :-)
msg7797 (view) Author: jendrik Date: 2018-09-25.15:42:19
Add TODO to summary.
msg7795 (view) Author: malte Date: 2018-09-25.13:01:09
I also left some minor comments on bitbucket.

From my side this looks good to merge, but I would suggest that you discuss it
with the others in the group to see that nobody minds seeing this go.

This is also the kind of change that would warrant an email to the fast-downward
mailing list. Actually, we have had quite a few of these in the sprint without
telling people (e.g. the evaluator/heuristic merge and what it means for the doc
pages and command line options, or the elimination of lm_cost_type). I think we
should rectify this by going over the issues merged during and after the sprint
and have a look at which things should be announced on the mailing list.
msg7794 (view) Author: malte Date: 2018-09-25.12:57:41
It is quite possible that issue846 won't have much of an impact on code size
either because we will still need the Exploration class (or something like it)
inside the lm_rhw factory and possibly other factories.

But I think code complexity (as opposed to size) is already reduced quite
significantly once there are fewer interactions between complicated pieces of
code. Hopefully, after issue846, the use of the Exploration class will be
localized to the landmark factories. This should make it much easier to
understand and improve the landmark heuristic itself.

And of course removing the synergy classes itself already simplifies the
landmark code by getting rid of ~250 lines of code, even if nothing else
currently depends on it.
msg7793 (view) Author: silvan Date: 2018-09-25.12:39:02
I had a look at the pull request and left one minor comment. The synergy doesn't
seem to interact at all with the other classes, except for two lines; so
removing it indeed doesn't reduce the complexity of the other code at all, as
Malte already said. So let's hope that issue846 will simplify the landmark
heuristic significantly; otherwise, removing the synergy does not really help,
does it?
msg7791 (view) Author: jendrik Date: 2018-09-25.11:41:48
I agree. Maybe we can just parse which planner component was run before the 
segfault, i.e., the line "Driver aborting after (component)".
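That idea could be sketched roughly like this (a minimal illustration; the
component names and the exit code values are made up here, not what the driver
actually defines):

```python
import re

# Hypothetical exit codes for illustration only; the real driver
# defines its own exit-code scheme.
SEGFAULT_EXIT_CODES = {"translate": 40, "search": 41}
GENERIC_SEGFAULT_EXIT_CODE = 42

def exit_code_for_segfault(driver_output):
    """Map a segfault to an exit code based on which component the
    driver reported aborting after."""
    match = re.search(r"Driver aborting after (\w+)", driver_output)
    if match:
        component = match.group(1)
        return SEGFAULT_EXIT_CODES.get(component, GENERIC_SEGFAULT_EXIT_CODE)
    return GENERIC_SEGFAULT_EXIT_CODE

print(exit_code_for_segfault("... Driver aborting after translate\n"))  # → 40
```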
msg7789 (view) Author: silvan Date: 2018-09-25.11:13:24
Regarding the driver: I think having the real signal values passed to the user
also has some merits, and changing this for only one or some signals that would
be converted into a (positive integer) exit code would be a bit strange. This is
particularly true since these "abnormal" memory errors of Python are not
reliably reproducible. What we would really like is to turn these kinds of
segfaults into a memory exit code and ignore output on stderr, but this is
probably impossible to achieve.
msg7786 (view) Author: jendrik Date: 2018-09-24.23:14:15
I opened issue846 for changing the way the lmcount heuristic computes
preferred operators.
msg7784 (view) Author: jendrik Date: 2018-09-24.22:24:55
I tried to replicate the Python segfault locally, but failed to do so. Even on 
the grid the segfault doesn't happen every time.
msg7783 (view) Author: jendrik Date: 2018-09-24.17:01:33
I think that it would be good if the driver returned different exit codes 
depending on whether the segfault happens in the translator or the search.
msg7781 (view) Author: silvan Date: 2018-09-24.09:26:28
Thanks for the great explanation in msg7768 of why the synergy exists and what
it actually does. I think I never really understood it from that high-level
perspective :-)

Regarding the segfaults: I've seen the same problem with the translator before,
even with the new emergency memory. Like Malte said, Python sometimes doesn't
seem to be able to cleanly exit when running out of memory, so it may still
throw a MemoryError or (rarely, in my experience) just segfault.
msg7780 (view) Author: jendrik Date: 2018-09-24.00:14:53
Yes, the experiments are based on the most recent version of Fast Downward. I 
just forgot to convert --heuristic to --evaluator.

I have prepared a pull request that removes the synergy:
https://bitbucket.org/jendrikseipp/downward/pull-requests/100
msg7775 (view) Author: malte Date: 2018-09-23.14:44:58
Many thanks for running the experiments, Jendrik!

Are they based on a recent enough default branch where lm_cost_type is no longer
necessary? I'm asking because the option strings used are partly new (not using
lm_cost_type) and partly old (using the deprecated --heuristic).

Here are my thoughts on the lama-first results, focusing on coverage, quality
and the scores for total time, memory and evaluations as proxies of the main
things we care about. Let's abbreviate the configurations as "synergy", "pref"
and "nopref".

1. In terms of coverage and quality, nopref >> synergy > pref. This suggests that
   the preferred operators of the landmark heuristic are not useful enough to use,
   in which case we no longer need the synergy. But synergy > pref also shows that
   the synergy actually helps when we want to compute preferred operators.

2. For coverage, the trends are fairly clear also on a per-domain level, with
   the nurikabe domain the major exception.

3. For quality, the nopref vs. synergy picture is very mixed, with neither having
   a strong advantage over the other if we ask the question "given a random
   domain, which one should I prefer?". The overall quality difference is almost
   entirely due to one domain (maintenance-sat14-adl), where we would probably
   find out something interesting if we dug deeper into the failure of synergy
   and pref.

4. For the evaluations and memory scores, we have exactly the opposite picture:
   pref > synergy > nopref. If we bear in mind that these scores always also take
   coverage into account to some extent (because unsolved tasks score 0), this
   means that using preferred operators (in either pref or synergy) is clearly
   good for focusing the search, i.e., it leads to fewer state evaluations and
   hence also less memory use. So the explanation for their worse coverage seems
   to be that they are better informed, but substantially slower. This is
   corroborated by the total time scores, where we again have
   nopref >> synergy > pref, and the gap between nopref and the others is even
   more substantial.

5. An interesting question is why we have pref > synergy in terms of evaluations
   and memory even though they both use preferred operators and synergy is the
   faster one. I think the answer is that the heuristics are not actually fully
   equivalent. For example, there are differences in how exactly the relaxed
   exploration is performed. Given that the worse implementation here is the one
   we want to get rid of, I don't think we have to look into this too deeply.

For the full lama experiment, we should look at quality. (Coverage ought to be
the same as in lama-first, and it is, more or less.) We see
nopref >> synergy > pref again, and the advantage of nopref is more pronounced,
which makes sense because being faster helps find better plans in the anytime
planner. While nopref is still not dominant compared to synergy on a per-domain
level, it is clearly the better choice overall.

So, based on this, we can clearly get rid of the synergy and switch to
pref=false as the recommended setting for the lama configuration. Because this
is a parameter switch, I wonder if we should introduce a new alias, e.g. "lama",
and leave the existing alias alone, because the modified configuration would no
longer reflect the IPC 2011 version of LAMA, making seq-sat-lama-2011 an
inappropriate name.


As a further step, as said before, I think it would be interesting to see what
happens if we only considered the first kind of landmarks from msg7768, but not
the second one.
msg7774 (view) Author: malte Date: 2018-09-23.14:11:53
Regarding the segfaults, this may be a Python issue. I think it's not the first
time we see that Python doesn't manage to exit cleanly when running out of
memory. We could try to diagnose it. This link:

    https://stackoverflow.com/questions/16731115/how-to-debug-a-python-segmentation-fault?rq=1

suggests using the "faulthandler" module to help diagnose this.
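A minimal sketch of how that would look (faulthandler is part of the standard
library since Python 3.3, so the translator could enable it unconditionally):

```python
import faulthandler
import sys

# Once enabled, faulthandler dumps a Python traceback for every thread
# when the process receives SIGSEGV, SIGFPE, SIGABRT or SIGBUS, which
# would tell us where in the translator the crash happens.
faulthandler.enable(file=sys.stderr)
assert faulthandler.is_enabled()

# The same effect without touching the code:
#   python -X faulthandler translate.py ...
# or setting the environment variable PYTHONFAULTHANDLER=1.
```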

But we could also leave it alone for now. If we can reproduce it, if it's
Python's fault, and if it happens with current Python versions, we should report
it to the Python developers.
msg7773 (view) Author: malte Date: 2018-09-23.14:03:27
No objections to updating the memory limit, but as said in the lab repository
yesterday (or was it the day before?), I think we should explain where the
numbers come from. After a bit of digging, I assume that the 3872 MiB number is
based on /etc/slurm/slurm.conf:

NodeName=ase[01-24] CPUs=16 RealMemory=61964 Sockets=2 CoresPerSocket=8
ThreadsPerCore=1 TmpDisk=410000 State=UNKNOWN
NodeName=ase[31-54] CPUs=20 RealMemory=127080 Sockets=2 CoresPerSocket=10
ThreadsPerCore=1 TmpDisk=169204 State=UNKNOWN

For the first entry (the infai_1 partition), we can calculate:

    61964 / 16 = 3872.75

For the second entry (the infai_2 partition), it looks like we could afford a
lot more memory:

    127080 / 20 = 6354

Why isn't 3872 good enough then? In /etc/slurm/cgroup.conf, we have

    AllowedRAMSpace=99  # in %

which means that the jobs are actually only allowed 99% of the overall memory,
so maybe that is part of the reason why 3872 MiB is too much.
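For reference, the arithmetic above can be checked directly (a throwaway sketch
using only the numbers quoted from slurm.conf and cgroup.conf in this message):

```python
# Per-core memory on the two partitions, from the quoted slurm.conf.
infai_1_per_core = 61964 / 16    # 3872.75 MiB
infai_2_per_core = 127080 / 20   # 6354.0 MiB

# cgroup.conf grants jobs only 99% of the node's memory
# (AllowedRAMSpace=99), so the effective per-core allowance on
# infai_1 is below the 3872 MiB ulimit:
allowed_per_core = 0.99 * infai_1_per_core

print(infai_1_per_core, infai_2_per_core, round(allowed_per_core))
```

Note that 99% of 3872.75 MiB is about 3834 MiB, so a 3872 MiB ulimit would
already exceed the cgroup allowance on infai_1 even before counting the job
script and driver processes.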

However, it is also the case that the ulimit limits we set are enforced at the level
of individual tasks, whereas I assume cgroup considers something more
comprehensive based on all processes in the group, which would include at
minimum the slurm job script, the driver and the planner component added together.

And then there is added uncertainty because we don't know if the
/etc/slurm/*.conf config files we see are actually the ones in use, as we found
out recently.

So perhaps there is no better way to figure this out than experimenting with
what works and what doesn't, but I think we should at least explain on which
basis we came up with the number, even if at some point it's a ballpark estimate.
msg7772 (view) Author: jendrik Date: 2018-09-23.10:20:41
Here are the results:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue839-lama-no-syn-pref-True-vs-lama-no-syn-pref-False.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue839-lama-first-no-syn-pref-false-vs-lama-first-no-syn-pref-true.html

The lama-first experiment has some unusual unexplained errors. The segfault happens in the translator and I don't 
know what could be causing it. There are also problems with exceeding the Slurm job memory limit (even with our new 
ulimit setting). I suggest lowering our ulimit value from 3872 MiB to 3800 MiB. (The experiment data is at 
/infai/seipp/projects/jendrik-downward/experiments/issue839 if you'd like to have a closer look.)
msg7771 (view) Author: jendrik Date: 2018-09-23.00:52:48
Experiments described in msg7767 are running.
msg7769 (view) Author: malte Date: 2018-09-22.10:16:43
PS: Of course all suggested changes don't have to be done in the same issue. We
can first get rid of the landmark synergy, then think about the preferred
operators of the landmark heuristic in a separate issue.

But from what I remember, ripping out the landmark synergy will only simplify
the code slightly, while ripping out the relaxed exploration parts from the
landmark heuristic will simplify it a lot, so we should definitely also use that
opportunity if we can.

Unfortunately, I don't think we will be able to physically remove the relaxed
exploration code from the landmark code because it is also needed internally
inside the lm_rhw landmark factory. But I think having a big mess only in that
factory and not also in the lmcount heuristic would already be a huge step forward.
msg7768 (view) Author: malte Date: 2018-09-22.10:09:43
Background for why the pref=true option of the landmark heuristic exists:

In our code, heuristics don't know whether preferred operators are going to be
needed or not, so heuristics like h^FF and h^add compute them whether or not
they are going to be used. This is OK because the overhead is not very large,
especially for h^FF, and it is very unusual not to use preferred operators anyway.

For the landmark heuristic, computing preferred operators can be expensive, and
hence we only want to do it if they end up being used. Hence the pref=true
option to enable the computation of preferred operators. If and for what the
preferred operators are going to be *used* is a different matter controlled by
the search algorithm, not the heuristic. It is currently easily possible to
compute the preferred operators and not use them, or to use them but not compute
them (= using the empty set of preferred operators), which is what we did in the
initial no-synergy experiment. That is why we consider this an option caveat.

The basic idea of what preferred operators for the landmark heuristic are (not
sure if it is implemented exactly like that) is as follows:

1. If there is an applicable operator that reaches a landmark, all such
operators are preferred.

2. If no such operators exist, conduct an FF-style relaxed exploration towards
the nearest landmarks (according to the landmark orderings we have) and use the
preferred operators of that.

Case #1 should be cheap, but case #2 is expensive because it needs to conduct a
relaxed exploration that would otherwise not be necessary. It also carries a
high cost in code maintenance. The landmark heuristic needs to include a way to
compute something like the FF heuristic internally *just* for #2 of the
preferred operator computation. So it would be great to be able to get rid of
that. Rather than ripping out preferred operators entirely, I think it would
make a lot more sense conceptually to only rip out #2, but keep #1 and have it
done always, like other preferred operator computations.
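To make the two-case scheme concrete, here is a toy sketch (the function names
and data structures are made up for illustration and do not match the actual
lmcount implementation):

```python
def landmark_preferred_operators(applicable_ops, achieves_landmark,
                                 relaxed_exploration_preferred_ops):
    """Case #1: if any applicable operator directly achieves a landmark,
    prefer exactly those operators (cheap).

    Case #2: otherwise, fall back to the preferred operators of an
    FF-style relaxed exploration towards the nearest landmarks,
    supplied here as a callback (expensive, and the part the message
    above suggests ripping out)."""
    direct = [op for op in applicable_ops if achieves_landmark(op)]
    if direct:
        return direct                              # case #1
    return relaxed_exploration_preferred_ops()     # case #2

# Toy usage: "op2" achieves a landmark, so case #1 applies.
ops = ["op1", "op2", "op3"]
print(landmark_preferred_operators(
    ops,
    achieves_landmark=lambda op: op == "op2",
    relaxed_exploration_preferred_ops=lambda: ["op1"]))  # → ['op2']
```

Keeping only case #1 would reduce this to the list comprehension, with no
relaxed exploration machinery needed in the heuristic at all.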


The synergy heuristics exist to reduce the computation cost of #2. The idea is
that if you want to compute h^FF and h^lm, you need to do a relaxed exploration
anyway for h^FF, so the information you're interested in for case #2 of the h^lm
preferred operators can reuse the internal data structures that have already
been computed for h^FF. If we don't want #2 any more, then there no longer is
any synergy to be exploited between the two heuristics.
msg7767 (view) Author: malte Date: 2018-09-22.09:52:47
Thanks, Jendrik!

That's roughly what I expected after the previous experiment. The differences in
the previous experiment seemed too large to be explained by the different
implementations, so it seemed likely that it was due to the different settings
(the synergy using preferred operators for the landmark heuristics and the
separate configuration accidentally not doing that).

The fair comparison is between the synergy-based method and the pref=true
no-synergy method, because the synergy always has the equivalent of pref=true.
(If you don't want preferred operators out of the landmarks, there is no reason
to compute them at the same time as the FF heuristic.)

So in the initial experiment we accidentally did an apples and oranges
comparison, which led us to conclude that the synergy implementation performs
worse. That's almost exactly what happened the first time we considered this
(issue123). Going over these old comments, especially the first 13 or so, there
are lots of parallels, and I found it interesting to reread these. Lots of
points raised there about "option caveats" are still equally valid. It's great
that at least we managed to clean up the lm_cost_type mess now.

Now, the very interesting thing here is that based on the data we have now,
pref=true just seems to be a poor setting, which would mean we could get rid of
the synergy and also simplify the basic lmcount heuristic a lot, which would be
great. In the old experiment (issue123) we came to a different conclusion, but
the code and benchmark set have evolved since then. (You can see that in this
experiment, almost all the differences come from "new" domains, i.e. IPC 2011 or
later.)

BTW, the experiments don't include the IPC 2018 domains. Can they be included?

Ripping out the synergies and the pref option would be a rather significant
change, so before we do something like that, I think we should have

1) the previous experiment extended to also cover the IPC 2018 domains, and
2) an equivalent experiment using the full anytime configurations rather than
just lama-first.

If we want to save time, I'm fine with doing the experiments just with the
32-bit builds, since I assume no matter what we end up with, we will be able to
make the 64-bit performance reasonable.

I'll try to explain some of the technical background (why we actually have the
synergy heuristics and the pref option) in a separate message.
msg7766 (view) Author: jendrik Date: 2018-09-22.09:07:29
Here are the results:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue213-release32-vs-release64-v7-lama-first-pref-30min.html

Interestingly, using pref=false leads to a higher total coverage than pref=true.
msg7764 (view) Author: jendrik Date: 2018-09-22.00:53:12
Experiments are running.
msg7693 (view) Author: malte Date: 2018-09-21.08:11:11
Looks like we should have set "pref=true" in the experiments for "lama-separate"
in issue213. See http://www.fast-downward.org/OptionCaveats. Perhaps the best
thing is to redo the lama-related experiments from issue213 with and without
that setting.
msg7656 (view) Author: silvan Date: 2018-09-20.16:51:51
Yay, another attempt!
msg7651 (view) Author: malte Date: 2018-09-20.15:58:28
> if we can get rid of the synergy without a loss in coverage

...or quality, in the case of the full lama configuration
msg7644 (view) Author: jendrik Date: 2018-09-20.14:48:24
In issue123 we tried to get rid of the synergy, but at the time using the
synergy achieved better results than using the ff() and lmcount() heuristics by
themselves. Experiments for issue213 (msg7629) suggest that this has changed
now, and we'd like to see if we can get rid of the synergy without a loss in
coverage.
History
Date                 User     Action
2018-09-25 21:25:07  malte    set messages: + msg7799
2018-09-25 18:25:47  jendrik  set status: reviewing -> resolved; messages: + msg7798; summary: See also issue306 if we end up getting rid of the "pref" option of lmcount. After merging this, announce the change on the mailing list. ->
2018-09-25 15:42:19  jendrik  set messages: + msg7797; summary: See also issue306 if we end up getting rid of the "pref" option of lmcount. -> See also issue306 if we end up getting rid of the "pref" option of lmcount. After merging this, announce the change on the mailing list.
2018-09-25 13:01:10  malte    set messages: + msg7795
2018-09-25 12:57:41  malte    set messages: + msg7794
2018-09-25 12:39:02  silvan   set messages: + msg7793
2018-09-25 11:41:48  jendrik  set messages: + msg7791
2018-09-25 11:13:24  silvan   set messages: + msg7789
2018-09-24 23:14:15  jendrik  set messages: + msg7786
2018-09-24 22:24:55  jendrik  set messages: + msg7784
2018-09-24 17:01:33  jendrik  set messages: + msg7783
2018-09-24 09:26:28  silvan   set messages: + msg7781
2018-09-24 00:14:53  jendrik  set status: in-progress -> reviewing; messages: + msg7780
2018-09-23 14:44:58  malte    set messages: + msg7775
2018-09-23 14:11:53  malte    set messages: + msg7774
2018-09-23 14:03:27  malte    set messages: + msg7773
2018-09-23 10:20:42  jendrik  set messages: + msg7772
2018-09-23 00:52:48  jendrik  set status: chatting -> in-progress; assignedto: jendrik; messages: + msg7771
2018-09-22 10:16:43  malte    set messages: + msg7769
2018-09-22 10:09:43  malte    set messages: + msg7768
2018-09-22 09:52:47  malte    set messages: + msg7767; summary: See also issue306 if we end up getting rid of the "pref" option of lmcount.
2018-09-22 09:07:29  jendrik  set messages: + msg7766
2018-09-22 00:53:12  jendrik  set messages: + msg7764
2018-09-21 08:24:01  cedric   set nosy: + cedric
2018-09-21 08:11:11  malte    set messages: + msg7693
2018-09-20 16:51:52  silvan   set nosy: + silvan; messages: + msg7656
2018-09-20 15:58:28  malte    set status: unread -> chatting; messages: + msg7651
2018-09-20 14:48:25  jendrik  create