Issue1058

Title: fix performance regression introduced when refactoring the pruning method interface
Priority: urgent
Status: resolved
Superseder: (none)
Nosy List: gabi, jendrik, malte, silvan
Assigned To: silvan
Keywords: (none)
Optional summary: (none)

Created on 2022-07-12.12:46:03 by silvan, last changed by malte.

Messages
msg10781 Author: malte Date: 2022-07-15.14:33:48
Great, thanks! :-)
msg10779 Author: silvan Date: 2022-07-15.14:28:14
Ok. I opened issue1060 to address the verbosity issue and merged this issue.
msg10777 Author: malte Date: 2022-07-15.14:08:58
Fine by me. "Silent" is a poor name then, but if it's like this elsewhere, it doesn't need to concern us in this issue.
msg10776 Author: silvan Date: 2022-07-15.14:05:21
When we introduced these levels, I vaguely remember a discussion about whether silent should mean strictly silent or not. Currently, we have *lots* of places in the planner where we log something unconditionally, e.g., the names of generators/strategies/heuristics/searches, as with the pruning method in this case.

I'm fine with changing this in the future, e.g., requiring every "log << ..." line to be guarded by some "if (log.is_at_least...)", but I don't think we should start doing this here and now.
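
For illustration, here is a minimal, self-contained sketch of that guard pattern with a mock log object; the class and the is_at_least_normal() method are made up for this example and are not the planner's actual logging interface.

    #include <iostream>

    // Mock log object with verbosity levels, only to illustrate the guard
    // pattern discussed above; not the planner's actual logging class.
    enum class Verbosity {SILENT, NORMAL, VERBOSE};

    struct MockLog {
        Verbosity verbosity;
        bool is_at_least_normal() const {
            return verbosity >= Verbosity::NORMAL;
        }
        template<typename T>
        const MockLog &operator<<(const T &value) const {
            std::cout << value;
            return *this;
        }
    };

    int main() {
        MockLog log{Verbosity::SILENT};
        // Guarded logging: nothing is printed at the silent level.
        if (log.is_at_least_normal()) {
            log << "pruning method initialized" << "\n";
        }
        return 0;
    }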
msg10775 Author: malte Date: 2022-07-15.13:19:04
Beyond these small comments, generally looks good to merge.
msg10774 Author: malte Date: 2022-07-15.13:18:29
I left a review on github. The main point is that I would expect less/no output with "silent" log level. If the current behaviour is what we prefer elsewhere in the code, feel free to leave it, although I think we should make silent more silent everywhere eventually in that case.
msg10767 Author: silvan Date: 2022-07-13.11:51:20
(I fixed the verbosity issue of h^m landmarks in the main branch.)
msg10766 Author: silvan Date: 2022-07-13.11:25:32
Yes for the first two. issue1042-base is the implementation from before issue1042, which introduced the regression.
msg10764 Author: gabi Date: 2022-07-13.10:01:25
I am not sure whether I fully understand which configuration corresponds to which situation. Is issue1058-v1 the current implementation just with the call to the timer removed? Is issue1058-base the current implementation? What is issue1042-base?
msg10762 Author: silvan Date: 2022-07-13.09:42:33
Thanks for the investigation! I suggest asking at today's daily meeting whether someone wants to work on this. I could also work on it, but I don't know whether we should do this as part of our sprint, and I wanted to work on other tasks, too.

Experiments with the calls to the timer removed are in:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1.html
There seems to be a problem with BJOLP producing too much output. Looking at some local output, I assume that the code dumps some form of debug output which should be disabled. I'll look into this.

Comparison reports:

https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1-issue1042-base-issue1058-base-compare.html
This confirms the regression I found on my code base.

https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1-issue1042-base-issue1058-v1-compare.html
This confirms that removing the timer solves the performance issue.

What remains to be decided for this issue is whether we simply want to remove the timer, as in the current pull request (https://github.com/aibasel/downward/pull/116), or whether we want to get rid of the statistics entirely, going back to a single "prune_operators" method that does nothing for the null pruning method and can be arbitrarily overridden by inheriting subclasses. Since I think this issue should simply address the performance regression, and the timer statistics seem to be the least important aspect, I'm fine with simply removing the timer from the prune_operators method.
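
For reference, here is a rough sketch of the two options; the placeholder types and the class/method names only approximate the actual Fast Downward code and are not meant to be its exact interface.

    #include <vector>

    // Placeholder types standing in for the planner's State and OperatorID.
    class State {};
    using OperatorID = int;

    // Option 1 (current pull request): keep the non-virtual wrapper in the
    // base class, but drop the timer calls from it.
    class PruningMethod {
        virtual void prune(const State &state,
                           std::vector<OperatorID> &op_ids) = 0;
    public:
        virtual ~PruningMethod() = default;
        void prune_operators(const State &state,
                             std::vector<OperatorID> &op_ids) {
            // Previously: timer.resume(); ...; timer.stop(); -- now removed.
            prune(state, op_ids);
            // Other statistics (e.g., counting pruned operators) could stay.
        }
    };

    // Option 2: drop the statistics entirely and go back to a single virtual
    // method that does nothing for the null pruning method and is overridden
    // by the other pruning methods.
    class PruningMethodWithoutStatistics {
    public:
        virtual ~PruningMethodWithoutStatistics() = default;
        virtual void prune_operators(const State &,
                                     std::vector<OperatorID> &) {
            // Null pruning behaviour: prune nothing.
        }
    };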
msg10759 Author: malte Date: 2022-07-12.17:39:03
> I'd be interested in seeing some microbenchmarks on timers to gather some
> information on this for the future. I don't think it's necessary for this
> issue, but if someone is interested in this topic, I'd like to discuss it.
> I think it could be useful for the future, and it could be useful to have
> a look while this is fresh on our minds.

I did write a little microbenchmark, and perhaps we can include it in the experiments for this issue like we did with other microbenchmarks.

Our timer currently wraps clock_gettime, which can be used with many different clocks that have different properties. We use the CLOCK_PROCESS_CPUTIME_ID clock, which conceptually fits well with our reporting of CPU time in other places. We also have overhead from the way we process the results of clock_gettime within the Timer class, so I measured:

- how long it takes to repeatedly resume and stop our current Timer class many times
- how this compares to just calling clock_gettime directly and doing the minimal work necessary to keep track of how much time has passed between calls
- how the time for clock_gettime is affected by the type of clock used
- how all this compares to using the CPU's time stamp counter, which is not possible in standard C++, but still easier than I thought (some context: https://en.wikipedia.org/wiki/Time_Stamp_Counter and https://stackoverflow.com/questions/13772567/how-to-get-the-cpu-cycle-count-in-x86-64-from-c/51907627)
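
For illustration, here is a minimal sketch of the kind of timing loops involved; the iteration count and bookkeeping are my assumptions rather than the benchmark code actually used, and __rdtsc is specific to x86-64 with GCC/Clang.

    #include <time.h>
    #include <iostream>
    #include <x86intrin.h>  // __rdtsc; x86-64 with GCC/Clang only

    int main() {
        const long long NUM_ITERATIONS = 10000000;

        // 2 * NUM_ITERATIONS calls to clock_gettime, accumulating the
        // elapsed nanoseconds between each pair (a bare-bones resume/stop).
        long long total_ns = 0;
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            timespec start, end;
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
            total_ns += (end.tv_sec - start.tv_sec) * 1000000000LL +
                (end.tv_nsec - start.tv_nsec);
        }
        std::cout << "clock_gettime: " << total_ns << " ns measured" << std::endl;

        // Same loop with the CPU's time stamp counter; the result is in
        // CPU cycles rather than nanoseconds.
        unsigned long long total_cycles = 0;
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            unsigned long long before = __rdtsc();
            unsigned long long after = __rdtsc();
            total_cycles += after - before;
        }
        std::cout << "RDTSC: " << total_cycles << " cycles measured" << std::endl;
        return 0;
    }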

Here is one representative run of the benchmarks. Each measurement is for 10 million resumes and stops, for a total of 20 million accesses to the clock:

Running utils::Timer: 8.92593s
Running clock_gettime(CLOCK_REALTIME, ...): 7.46282s
Running clock_gettime(CLOCK_REALTIME_COARSE, ...): 7.03064s
Running clock_gettime(CLOCK_MONOTONIC, ...): 7.37465s
Running clock_gettime(CLOCK_MONOTONIC_COARSE, ...): 7.04013s
Running clock_gettime(CLOCK_MONOTONIC_RAW, ...): 7.327s
Running clock_gettime(CLOCK_BOOTTIME, ...): 7.42522s
Running clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...): 8.58128s
Running clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...): 8.50268s
Running RDTSC: 0.126442s

Conclusions:

1. utils::Timer has some overhead over a leaner use of CLOCK_PROCESS_CPUTIME_ID, but it is modest (less than 10%). The bulk of the time is really spent in the calls to clock_gettime.

2. The other clocks for clock_gettime are faster than CLOCK_PROCESS_CPUTIME_ID, but not massively so.

3. RDTSC is a lot faster, about 70 times faster than utils::Timer. It manages roughly 160 million clock accesses per second. If we want to do things that require checking the clock frequently in the future, such as the kind of instrumentation-based profiling we discussed on the AI Basel Discord a while ago, we probably want to base it on this.
msg10758 Author: malte Date: 2022-07-12.14:37:28
I did another local test, same task as discussed for issue1042 (barman-opt11-strips/pfile01-001.pddl), but with blind search.

- 1st revision in Silvan's experiment: 14.19s, 14.13s, 14.17s
- 2nd revision in Silvan's experiment: 20.16s, 19.93s, 20.14s
- 2nd revision without timer calls:    14.30s, 14.30s, 14.41s
- 2nd revision, only prune call:       14.22s, 14.32s, 14.41s

Moreover, I used the "time" utility to check, for the 1st vs. 2nd revision, how much of the runtime is spent inside the kernel ("system time"), with one run only:

- 1st revision in Silvan's experiment: 0.100s
- 2nd revision in Silvan's experiment: 2.756s

So it looks like a lot of time is spent inside the system call that obtains the clock. This is a pity, because it would be nice to have a fast clock for many of the things we would like to do, but it's not clear whether we can do something about this. It may be worth trying out other clocks in our calls to clock_gettime; I'm surprised by the amount of overhead. With a bit of searching, I couldn't find information on ways of checking the elapsed time that are substantially faster.

There is also wasted time of our own devising. For example, we do unnecessary floating-point calculations every time we pause and resume the clock. We could instead accumulate elapsed time and convert once at the end. But this won't help with the time spent inside syscalls.
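
A hypothetical sketch of that idea, accumulating raw nanoseconds and converting to a floating-point value only when the result is requested; the names are made up and this is not the actual utils::Timer implementation.

    #include <time.h>

    // Sketch of a timer that accumulates elapsed time in integer
    // nanoseconds and converts to seconds only on demand.
    class AccumulatingTimer {
        long long accumulated_ns = 0;
        timespec last_start;
    public:
        void resume() {
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &last_start);
        }
        void stop() {
            timespec now;
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now);
            // Integer arithmetic only; no floating-point work per stop.
            accumulated_ns += (now.tv_sec - last_start.tv_sec) * 1000000000LL +
                (now.tv_nsec - last_start.tv_nsec);
        }
        double elapsed_seconds() const {
            // Single conversion to floating point, done on demand.
            return accumulated_ns / 1e9;
        }
    };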

[later]

I've gone down a little rabbit hole of timer functions, and there may be some hope of something substantially faster if we're willing to go through some pain for it. We would probably like to use something like Windows's QueryPerformanceCounter for speed of access (we'd have to say goodbye to the concept of CPU time of a given process/thread, but that's probably unavoidable for a really fast timer). But there seems to be no direct (non-assembly language) way of doing this on Linux.

I'd be interested in seeing some microbenchmarks on timers to gather some information on this for the future. I don't think it's necessary for this issue, but if someone is interested in this topic, I'd like to discuss it. I think it could be useful for the future, and it could be useful to have a look while this is fresh on our minds.
msg10756 Author: silvan Date: 2022-07-12.12:46:03
When refactoring the pruning method interface in issue1042, we thought that the slowdown in the search was acceptable; however, we only tested LM-cut and none of the faster heuristics such as the blind heuristic or abstraction-based ones. Experiments (msg10736, msg10751) showed that the slowdown is a lot worse with these configurations, so we want to revisit the change introduced by that issue.

Malte did some local analysis (msg10754) that hints at the calls to the timer class being the culprit of the slowdown. As a first step, I'll therefore remove the timer statistics from the base class to see whether that already addresses the issue.
History
Date                 User    Action  Args
2022-07-15 14:33:48  malte   set     messages: + msg10781
2022-07-15 14:28:14  silvan  set     status: in-progress -> resolved; messages: + msg10779
2022-07-15 14:08:58  malte   set     messages: + msg10777
2022-07-15 14:05:21  silvan  set     messages: + msg10776
2022-07-15 13:19:04  malte   set     messages: + msg10775
2022-07-15 13:18:29  malte   set     messages: + msg10774
2022-07-13 11:51:20  silvan  set     messages: + msg10767
2022-07-13 11:25:32  silvan  set     messages: + msg10766
2022-07-13 10:01:25  gabi    set     messages: + msg10764
2022-07-13 09:52:04  gabi    set     nosy: + gabi
2022-07-13 09:42:33  silvan  set     messages: + msg10762
2022-07-12 17:39:03  malte   set     messages: + msg10759
2022-07-12 14:37:28  malte   set     messages: + msg10758
2022-07-12 12:46:03  silvan  create