Title Recover losses in total time during refactoring LandmarkGraph
Priority urgent Status chatting
Superseder Nosy List clemens, jendrik, malte, salome, silvan, thomas
Assigned To Keywords
Optional summary
This is part of meta issue987.

Created on 2021-01-28.18:43:35 by silvan, last changed by clemens.

This is part of meta issue987.
msg10240 (view) Author: clemens Date: 2021-04-01.09:12:15
I have repeated the experiment for v14 several times to get more data to eventually pinpoint the reason for the strange results in the landmark experiments. They reveal interesting results. I summarize some things that I found interesting below.

Satisficing (lama-first):

 - The score_search_time values for the base revision range from 2181.62 (report A) to 2185.38 (report E). This is a difference of 3.76, which I found surprisingly high. There seems to be a lot of noise, if this is the range of difference we have to expect from the same configuration.
 - The score_search_time values for the v14 revision range from 2178.09 (report A) to 2181.59 (report C). This is a difference of 3.50, which is in the same range as for the base revision.
 - Interestingly, the *difference* in score_search_time between base and v14 ranges from -3.53 to -4.35, so there's not a large variation there.

Optimal (seq-opt-bjolp):

 - The time scores seem to be much more consistent for the optimal configurations.
 - score_search_time with base ranges from 798.50 (report E) to 799.51 (report A).
 - score_search_time with v14 ranges from 799.75 (report D) to 800.83 (report A)
 - What I found very interesting was the variation in the reasons for errors (timeout vs. out of memory): between 483 and 487 out of memory errors for base, 485-488 for v14, 343-347 timeouts for base, 341-345 timeouts for v14.

I would be happy to discuss these results further in the near future if someone's interested. Maybe it could lead to a broader understanding of the unexpected variations of code changes in the landmark part of Fast Downward that are also mentioned in issue995.
msg10117 (view) Author: clemens Date: 2021-02-22.11:33:58
As discussed, I've merged the pull request and started an experiment run for the final changes.

Once more, there's a significant difference between the optimal and the satisficing configuration. In satisficing, the search-time score decreases by 4.18 which looks really bad. I was worried that something like this was going to happen before I merged. We discussed this on Discord and Malte said I should merge anyways, and if the results are unexpected, we have one more reason to look into the strange things going on with the landmark experiments. This is what we should do now, I suppose...
msg10109 (view) Author: clemens Date: 2021-02-19.14:46:27
I have created a pull request:

Since Malte, Thomas, and I have already discussed the diff, I'd go ahead and directly merge this issue.

Experiments on a final version v14 are running right now.
msg10107 (view) Author: clemens Date: 2021-02-19.12:56:56
I discussed the most recent results with Malte and Thomas today. We decided to revert the changes from v13 because it influences the results negatively. Apart from that, the results look fine and the diff between base and v12 looks fine. It seems that the changes in issue348 (fundamental changes for States) have already recovered some part of the losses in issue988. Some more recovery can be observed in v12 here. Even though the comparison is wanky, we had a look at the score difference for issue988-base and issue1000-v12:
                  search-time | total-time
 - lama-first:           -1.6 |       -2.4
 - seq-opt-bjolp:        -1.6 |       -1.0

These values look a lot better than what we have lost in issue988 in the first place. Though, a lot of other stuff has happened in the meantime, before the branch for issue1000 started.

We decided to merge and resolve this issue. There is still a lot of potential to improve the performance of the landmark code. Since they were not introduced in issue988 we don't deem it sensible to put more effort in this issue.
msg10103 (view) Author: clemens Date: 2021-02-19.08:34:45
Yes, the drop in time score for satisficing is the most worrying thing I'd say. (Apart from that, it is still unclear how to recover all loss observed in issue988.)

The change of v13 is something that I wouldn't do intuitively. However, before issue988, this was exactly the structure of the code: there was no "dead_end_exists" function because it was integrated in "update_lm_status". Hence, I thought it could have an effect that recovers some loss since before issue988 there was one fewer loop over all landmarks.
Looking at the results here, I'd rather stick with v12.

Concerning the change from "int" to "const LandmarkNode &": So what happend before was the following:

for (LandmarkNode &node : nodes) {
  foo(node->get_id()); // parameter is "int"

The first line of "foo" was:
LandmarkNode *node = lm_graph.get_landmark(id); 

After the change, the loop looks as follows

for (LandmarkNode &node : nodes) {
  foo(node); // parameter is "const LandmarkNode &"

and the first line of "foo" is not necessary anymore. My intention was to get rid of the unnecessary detour using the ID.

Now that I've pointed this out: do you still think this could have a negative impact? I have no deep understanding of what C++ does with this, so the change was based on a gut feeling, nothing else.
msg10102 (view) Author: malte Date: 2021-02-18.21:42:05
Sorry I didn't have time for a meeting today. :-( I am only now done with my admin duties for today.

From the issue1000 experiments, it looks to me like the part that most needs explanation is the drop in time score for the satisficing configuration between v12 and v13. Is this correct?

One algorithmic difference is that in the old code, "dead_end_exists" returns as soon as a reason for a dead end has been found. In the new code, the equivalent code always iterates over all landmarks.

If this is not it, the change from "int" to "const LandmarkNode &" in the argument to landmark_needed_again could also have some negative impact if that function runs very quickly and is part of a tight inner loop. If this is the case, inlining the function could probably help. But I'd look at the dead-end thing first.
msg10064 (view) Author: clemens Date: 2021-02-13.09:31:11
So I have branched off the current main-branch and implemented the changes which seemed to have an impact based on the previous experiments and the profiling. I have again done it in stages and evaluated different versions.
 - v11: use hash-map, inline getter function, and loop over IDs to avoid getting ID from landmark node pointers
 - v12: when computing heuristic value: only trace pointer if status is unreached or needed again (should make no difference for optimal BJOLP)
 - v13: merge function for which we introduced an additional loop over all landmarks in issue988 (so one loop less in v13)

I am once more surprised by the outcome of the experiments.

I have also rerun the experiments for issue988 because back then, there was an issue with hyperthreading on one of the nodes.

The changes have a small positive impact on optimal. Though, it doesn't completely cancel out what we have lost in issue988. I'm not sure how applicable this direct comparison is anyways, a lot has changed between the merge of issue988 and the split-point for the branch I'm currently working on.

I am completely clueless about what happens in the satisficing case. There, the changes of v11 and v12 have almost no impact. However, when sparing the additional loop over all landmarks in every state, the time scores go down by 2.5, which is counterintuitive... Also in the other versions, there's no sign of compensation for what we've lost in issue988. 

In conclusion: I'm not sure we're much closer to the goal of this issue than we were one week ago. We should probably have a discussion and decide on the next steps together on Monday.
msg10052 (view) Author: clemens Date: 2021-02-12.09:29:49
I once more ran an experiment comparing several versions, including two new ones:
- issue1000-v6: similar to v4 (hence based on v3) but using a reference to the landmark node instead of ID in the function header for needed again.
- issue1000-v7: (based on v6) when computing LM-count, check the status of a landmark first to avoid tracing pointers to LandmarkNodes that are not needed (because they are reached and not needed again)

Once more I'm surprised that the changes have different impact on the optimal and the satisficing configuration...

Looking at the optimal report, from v3 to v6 it becomes a bit worse, and even worse to v7. However, the numbers are not so shocking, could partly also be due to noise I'd say.

For satisficing, however, we can observe similar changes than in the report from yesterday. From v3 to v6 (yesterday to v4), specifically moving the needed-again check to a separate function, significantly impacts the time-scores negatively (around -7.5 in total). The step to v7 improves this again a little bit, which is what I'd expect from having to trace fewer pointers (which was not reflected in the optimal case).

I still don't know what would be the best thing to do here. What changes should I implement when branching off of the current main where the state changes from issue348 are included?

I'd also like to point out the difference between the first time I ran the experiment of several versions (msg10029) and this time in the versions 1-3. Looking at the step from v2 to v3 in the optimal reports for example, the score_search_time dropped by 0.7 before and now goes up by 0.2. So in this case, noise alone lead to a change of ~0.9 in the score. 
This surprises me, because before we were saying that a score change between 0 and 1 is reason enough to have a closer look at the code. Don't get me wrong, I'm all in for using these opportunities to look at the code and find small things to improve it! Given how much time I've spent on this issue, however, it is kinda frustrating that some parts of it might as well be noise... (Clearly, the changes of -7 in the satisficing configurations are a different story.)
msg10048 (view) Author: malte Date: 2021-02-11.21:39:43
> A question we could not answer ourselves is whether the satisficing
> scores could be much more prone to noise than the optimal configuration?

Yes for certain kinds of changes, but only really when we also see substantial changes in the numbers of evaluations/expansions. If this isn't the case, they might be somewhat more prone to noise but not much more.
msg10030 (view) Author: clemens Date: 2021-02-11.11:31:54
Our conclusion was more or less the following: we can understand the results in the optimal report, but the don't correlate with those of the satisficing report. We cannot find an explanation why this is not the case, and we don't understand the satisficing results on their own either.

Salomé had ideas how to fix some problems introduced with v4 and v5. I will implement those and run another experiment. 
A question we could not answer ourselves is whether the satisficing scores could be much more prone to noise than the optimal configuration?
msg10029 (view) Author: clemens Date: 2021-02-11.10:41:39
I ran another experiment where I compare several versions at once.
 - issue988-base: before working on issue988
 - issue1000-v1:  mainly moving stuff to more appropriate places, changing how lmcount is calculated (without affecting heuristic values)
 - issue1000-v2:  move landmark status from LandmarkNode to LandmarkStatusManager
 - issue1000-v3:  move forward_orders to RHW landmark factory
 - issue1000-v4:  make structure of computing which landmarks are needed again clearer
 - issue1000-v5:  split dead-end check away from status update function

If you want a clearer view of the changes between two versions, I suggest to have a look at<tag1>...ClemensBuechner:<tag2>. The only change where I'd expecte a significant difference is the step from v4 to v5 because it introduces an additional loop over all landmarks.

You can find the experiment report here:

Only looking at the report for the optimal case, I would have an idea where to have a closer look, because the performance gets worse only for the last two changes.
In the satisficing report, however, the time scores jump all over the place between the rather small changes between each version...

I will have a look at it with Salomé right now. If someone else has insightful thoughts, they'd be much appreciated!
msg10027 (view) Author: malte Date: 2021-02-10.20:20:38
We discussed this a bit further. I hope I summarize this correctly, otherwise Clemens can me correctly.

- We want to run another experiment for the current branch that branches off from issue988-base where we compares base, v1, v2 and v3, where v1 and v2 are the things we have already looked at, and v3 also incorporates the remaining changes we haven't looked at yet.

If there is a performance loss between v2 and v3, we will also look at this diff a bit closer with an eye on performance.

- Afterwards, we will not profile further based on the current issue1000 branch because there have been several relevant changes in the main branch since then that could significantly affect where the bottlenecks are. So the next step would be to make a new branch for this issue off the current head of main that includes the performance improvements that we found (so far, only some inlining). Then we do an experiment for this and compare the performance improvement to the loss before/after issue988. (For this indirect comparison we cannot use the old issue988 experiment because since then the grid became faster due to the hyperthreading change. We saw this effect quite clearly in another indirect comparison we made. So the issue988 experiment should be rerun.) Ideally, these four configurations (before/after issue988 and before after the "new" issue1000 branch) should be in the same experiment to minimize the effect of noise.

At this point, we may or may not do some more profiling. We found a few places where it looks like the performance of the landmark heuristic can be significantly improved, so we can also defer further improvements for later, when we make more algorithmic changes to how the heuristic is computed.
msg10025 (view) Author: clemens Date: 2021-02-10.17:14:07
Malte and I had a meeting in which we had a close look at the diff of issue988-base and my issue1000 branch right before v2. We did not find find anything that looks particularly suspicious (correct me if I'm wrong, Malte). One thing that made a somewhat significant difference, was that the getter for a landmark status was not inlined before, but is in v2. I started an experiment including that change and here are the results:

The overall scores in the optimal configuration are still improved in this version. For the satisficing case, however, the scores are worse than in the last report. Should we look even closer once more?
msg10021 (view) Author: malte Date: 2021-02-10.12:39:32
I am not sure how a pull request behaves on github, but I think on bitbucket it would have shown the same diff: from the revision you branched off to the head of the branch. I expect it would do the same on github, but I'm not sure. (Whether this applies cleanly to the head of the main branch is a separate consideration, I think.) We often use pull requests even if we don't really want to merge. The advantage is that they allow commenting on the code.

> I compare it to the head of my issue1000 branch, which corresponds to
> the reports from msg10007 below. If I understand Malte and Silvan
> correctly, they look fine. As a reminder: in this version I reverted the
> biggest changes from what I now call issue988-final.

I wanted to be a bit more non-committal ("okay-ish"). ;-) Performance losses in this range can be OK, but it's often still a good idea to have a look at the code to see if there is a bottleneck that was accidentally introduced.
msg10014 (view) Author: clemens Date: 2021-02-09.18:10:41
I'm not sure whether a pull request is the right thing to do at the moment; a lot has changed in the landmark code since we merged issue988 and maybe this also affects the benchmarks. I have branched away from before all these changes to have a clearer comparison of before and after issue988. Let me know if you'd prefer a comparison with the current master and an actual pull request.

What I suggest instead, is to look at the diff of the relevant revisions:

"issue988-final" is what we have merged into main, so it (more or less) corresponds to the results reported for issue988:

I compare it to the head of my issue1000 branch, which corresponds to the reports from msg10007 below. If I understand Malte and Silvan correctly, they look fine. As a reminder: in this version I reverted the biggest changes from what I now call issue988-final.
msg10009 (view) Author: malte Date: 2021-02-09.12:20:05
I think the numbers in Silvan's message of what is okay-ish and what is a large difference are in the right ballpark. In the previous experiment, we had -4.78 for lama-first. That is a large difference, but I don't think there is a universal number for when we want to take a closer look and when we don't. It depends on which part of the code we are changing and what we expect the performance impact to be.

If something that shouldn't be a bottleneck becomes a bottleneck for some instances, we may want to take a closer look. If a certain part of the code is 10 times slower than before but it doesn't need to be, we may want to take a closer look, even if it's only a 10% increase in overall runtime. In the example experiments, there are some tasks where runtime increases by around 50%, but there are also tasks where runtime drops a lot. It's hard to tell from these numbers alone whether we want to change something. We should also look at the code, and if necessary do some profiling.

Perhaps a good next step is to create a pull request, so that others can have a closer look at the code changes so far to see if there are any warning flags.
msg10008 (view) Author: silvan Date: 2021-02-09.11:36:31
Malte can probably comment more on this, but I would say that a difference in search time and memory scores of less than 1 could still be okay-ish. If the score difference is more than 1, this means that the relative difference in search time and memory can be noticeable for several tasks.

Here, for bjolp, the scores get better, but for lama, they already decrease a bit. For me, this is still acceptable. I think that the change of scores in the original issue988 was around 5-6 maybe, wasn't it? This would certainly be a "large difference".
msg10007 (view) Author: malte Date: 2021-02-09.10:14:42
I must have accidentally clicked on something that deleted Clemens's message, apologies.
Here it is again:


I started with the investigation. My first approach was to revert specific changes and see whether this has significant impact on the runtime profile. Doing so did not reveal any helpful insights.

What I did next, was to revert all changes from issue988. Then, I added them again step by step, looking at the runtime profile after every step. I have now arrived at a point where it seems, every further change impacts the runtime significantly. Before deciding the next step, I started the experiments for which you can find the reports here:

To me, the results look fine at this point. The optimal BJOLP configuration yields results which are at least as good as before the changes of issue988. In the satisficing configuration (lama-first), there is a tendency to get worse runtimes, but the numbers are quite small.
Unfortunately, I am not yet familiar with what changes in the score are acceptable and what is not. Can someone who is give some input whether it looks okay? Or should I already worry at this point?

The major changes (e.g., storing the status of a landmark in the status manager instead of landmark nodes) are not yet added back in. Hence, the experiment really only covers trivial changes.
msg9898 (view) Author: silvan Date: 2021-01-28.18:43:34
In issue988, we refactored the LandmarkGraph graph to be "constant after creation", i.e., landmark graph objects are only modified by the landmark factories building them but not anymore by the landmark heuristic (such as, e.g., for tracking the status of landmark nodes). Malte mentioned in msg9897 that we overlooked that the drop in search/total time is quite large for BJOLP and LAMA-first. In this issue, we want to investigate the change of issue988 again to find out where we lost the performance and how to restore it.
Date User Action Args
2021-04-01 09:12:16clemenssetmessages: + msg10240
2021-02-22 11:33:59clemenssetmessages: + msg10117
2021-02-19 14:46:27clemenssetmessages: + msg10109
2021-02-19 12:56:57clemenssetmessages: + msg10107
2021-02-19 08:34:45clemenssetmessages: + msg10103
2021-02-18 21:42:05maltesetmessages: + msg10102
2021-02-13 09:31:12clemenssetmessages: + msg10064
2021-02-12 09:29:49clemenssetmessages: + msg10052
2021-02-11 21:39:43maltesetmessages: + msg10048
2021-02-11 11:31:56clemenssetmessages: + msg10030
2021-02-11 10:41:39clemenssetmessages: + msg10029
2021-02-10 20:20:39maltesetmessages: + msg10027
2021-02-10 17:14:07clemenssetmessages: + msg10025
2021-02-10 12:39:33maltesetmessages: + msg10021
2021-02-09 18:10:41clemenssetmessages: + msg10014
2021-02-09 12:20:05maltesetmessages: + msg10009
2021-02-09 11:36:31silvansetmessages: + msg10008
2021-02-09 10:14:42maltesetmessages: + msg10007
2021-02-09 10:07:45maltesetmessages: - msg10005
2021-02-09 07:48:56clemenssetmessages: + msg10005
2021-01-28 18:45:34silvansetsummary: This is part of meta issue987.
2021-01-28 18:43:35silvancreate