Issue1058

Title: fix performance regression introduced when refactoring the pruning method interface
Priority: urgent
Status: resolved
Superseder: (none)
Nosy List: gabi, jendrik, malte, silvan
Assigned To: silvan
Keywords: (none)
Optional summary: (none)

Created on 2022-07-12.12:46:03 by silvan, last changed by malte.

Messages
msg10781 Author: malte Date: 2022-07-15.14:33:48
Great, thanks! :-)
msg10779 Author: silvan Date: 2022-07-15.14:28:14
Ok. I opened issue1060 to address the verbosity issue and merged this issue.
msg10777 Author: malte Date: 2022-07-15.14:08:58
Fine by me. "Silent" is a poor name then, but if it's like this elsewhere, it doesn't need to concern us in this issue.
msg10776 Author: silvan Date: 2022-07-15.14:05:21
When we introduced these levels, I vaguely remember a discussion about whether silent should mean strictly silent or not. Currently, we have *lots* of places in the planner where we log something unconditionally, e.g., the names of generators/strategies/heuristics/searches, as with the pruning method in this case.

I'm fine with changing this in the future, e.g., requiring every "log << ..." line to be guarded by some "if (log.is_at_least...)", but I don't think we should start doing this here and now.
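
For illustration, here is a minimal, self-contained sketch of that guard pattern with a mock log object; the class and the is_at_least_normal() method are made up for this example and are not the planner's actual logging interface.

    #include <iostream>

    // Mock log object with verbosity levels, only to illustrate the guard
    // pattern discussed above; not the planner's actual logging class.
    enum class Verbosity {SILENT, NORMAL, VERBOSE};

    struct MockLog {
        Verbosity verbosity;
        bool is_at_least_normal() const {
            return verbosity >= Verbosity::NORMAL;
        }
        template<typename T>
        const MockLog &operator<<(const T &value) const {
            std::cout << value;
            return *this;
        }
    };

    int main() {
        MockLog log{Verbosity::SILENT};
        // Guarded logging: nothing is printed at the silent level.
        if (log.is_at_least_normal()) {
            log << "pruning method initialized" << "\n";
        }
        return 0;
    }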
msg10775 Author: malte Date: 2022-07-15.13:19:04
Beyond these small comments, generally looks good to merge.
msg10774 Author: malte Date: 2022-07-15.13:18:29
I left a review on github. The main point is that I would expect less/no output with "silent" log level. If the current behaviour is what we prefer elsewhere in the code, feel free to leave it, although I think we should make silent more silent everywhere eventually in that case.
msg10767 Author: silvan Date: 2022-07-13.11:51:20
(I fixed the verbosity issue of h^m landmarks in the main branch.)
msg10766 Author: silvan Date: 2022-07-13.11:25:32
Yes for the first two. issue1042-base is the implementation from before issue1042, which introduced the regression.
msg10764 Author: gabi Date: 2022-07-13.10:01:25
I am not sure whether I fully understand which configuration corresponds to which situation. Is issue1058-v1 the current implementation just with the call to the timer removed? Is issue1058-base the current implementation? What is issue1042-base?
msg10762 Author: silvan Date: 2022-07-13.09:42:33
Thanks for the investigation! I suggest asking at today's daily meeting whether someone wants to work on this. I could also work on it, but I don't know whether we should do this as part of our sprint, and I wanted to work on other tasks, too.

Experiments with the calls to the timer removed are in:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1.html
There seems to be a problem with BJOLP producing too much output. Looking at some local output, I assume that the code dumps some form of debug output which should be disabled. I'll look into this.

Comparison reports:

https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1-issue1042-base-issue1058-base-compare.html
This confirms the regression I found on my code base.

https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue1058-v1-issue1042-base-issue1058-v1-compare.html
This confirms that removing the timer solves the performance issue.

What remains to be decided for this issue is whether we simply want to remove the timer, as in the current pull request (https://github.com/aibasel/downward/pull/116), or whether we want to get rid of the statistics entirely, going back to a single "prune_operators" method that does nothing for the null pruning method and can be arbitrarily overridden by inheriting subclasses. Since I think this issue should simply address the performance regression, and the timer statistics seem to be the least important aspect, I'm fine with simply removing the timer from the prune_operators method.
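
For reference, here is a rough sketch of the two options; the placeholder types and the class/method names only approximate the actual Fast Downward code and are not meant to be its exact interface.

    #include <vector>

    // Placeholder types standing in for the planner's State and OperatorID.
    class State {};
    using OperatorID = int;

    // Option 1 (current pull request): keep the non-virtual wrapper in the
    // base class, but drop the timer calls from it.
    class PruningMethod {
        virtual void prune(const State &state,
                           std::vector<OperatorID> &op_ids) = 0;
    public:
        virtual ~PruningMethod() = default;
        void prune_operators(const State &state,
                             std::vector<OperatorID> &op_ids) {
            // Previously: timer.resume(); ...; timer.stop(); -- now removed.
            prune(state, op_ids);
            // Other statistics (e.g., counting pruned operators) could stay.
        }
    };

    // Option 2: drop the statistics entirely and go back to a single virtual
    // method that does nothing for the null pruning method and is overridden
    // by the other pruning methods.
    class PruningMethodWithoutStatistics {
    public:
        virtual ~PruningMethodWithoutStatistics() = default;
        virtual void prune_operators(const State &,
                                     std::vector<OperatorID> &) {
            // Null pruning behaviour: prune nothing.
        }
    };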
msg10759 Author: malte Date: 2022-07-12.17:39:03
> I'd be interested in seeing some microbenchmarks on timers to gather some
> information on this for the future. I don't think it's necessary for this
> issue, but if someone is interested in this topic, I'd like to discuss it.
> I think it could be useful for the future, and it could be useful to have
> a look while this is fresh on our minds.

I did write a little microbenchmark, and perhaps we can include it in the experiments for this issue like we did with other microbenchmarks.

Our timer currently wraps clock_gettime, which can be used with many different clocks that have different properties. We use the CLOCK_PROCESS_CPUTIME_ID clock, which conceptually fits well with our reporting of CPU time in other places. We also have overhead from the way we process the results of clock_gettime within the Timer class, so I measured:

- how long it takes to repeatedly resume and stop our current Timer class many times
- how this compares to just calling clock_gettime directly and doing the minimal work necessary to keep track of how much time has passed between calls
- how the time for clock_gettime is affected by the type of clock used
- how all this compares to using the CPU's time stamp counter, which is not possible in standard C++, but still easier than I thought (some context: https://en.wikipedia.org/wiki/Time_Stamp_Counter and https://stackoverflow.com/questions/13772567/how-to-get-the-cpu-cycle-count-in-x86-64-from-c/51907627)
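
For illustration, here is a minimal sketch of the kind of timing loops involved; the iteration count and bookkeeping are my assumptions rather than the benchmark code actually used, and __rdtsc is specific to x86-64 with GCC/Clang.

    #include <time.h>
    #include <iostream>
    #include <x86intrin.h>  // __rdtsc; x86-64 with GCC/Clang only

    int main() {
        const long long NUM_ITERATIONS = 10000000;

        // 2 * NUM_ITERATIONS calls to clock_gettime, accumulating the
        // elapsed nanoseconds between each pair (a bare-bones resume/stop).
        long long total_ns = 0;
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            timespec start, end;
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
            total_ns += (end.tv_sec - start.tv_sec) * 1000000000LL +
                (end.tv_nsec - start.tv_nsec);
        }
        std::cout << "clock_gettime: " << total_ns << " ns measured" << std::endl;

        // Same loop with the CPU's time stamp counter; the result is in
        // CPU cycles rather than nanoseconds.
        unsigned long long total_cycles = 0;
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            unsigned long long before = __rdtsc();
            unsigned long long after = __rdtsc();
            total_cycles += after - before;
        }
        std::cout << "RDTSC: " << total_cycles << " cycles measured" << std::endl;
        return 0;
    }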

Here is one representative run of the benchmarks. Each measurement is for 10 million resumes and stops, for a total of 20 million accesses to the clock:

Running utils::Timer: 8.92593s
Running clock_gettime(CLOCK_REALTIME, ...): 7.46282s
Running clock_gettime(CLOCK_REALTIME_COARSE, ...): 7.03064s
Running clock_gettime(CLOCK_MONOTONIC, ...): 7.37465s
Running clock_gettime(CLOCK_MONOTONIC_COARSE, ...): 7.04013s
Running clock_gettime(CLOCK_MONOTONIC_RAW, ...): 7.327s
Running clock_gettime(CLOCK_BOOTTIME, ...): 7.42522s
Running clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...): 8.58128s
Running clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...): 8.50268s
Running RDTSC: 0.126442s

Conclusions:

1. utils::Timer has some overhead over a leaner use of CLOCK_PROCESS_CPUTIME_ID, but it is modest (less than 10%). The bulk of the time is really spent in the calls to clock_gettime.

2. The other clocks for clock_gettime are faster than CLOCK_PROCESS_CPUTIME_ID, but not massively so.

3. RDTSC is a lot faster, about 70 times faster than utils::Timer. It manages roughly 160 million clock accesses per second. If we want to do things that require checking the clock frequently in the future, such as the kind of instrumentation-based profiling we discussed on the AI Basel Discord a while ago, we probably want to base it on this.
msg10758 Author: malte Date: 2022-07-12.14:37:28
I did another local test, same task as discussed for issue1042 (barman-opt11-strips/pfile01-001.pddl), but with blind search.

- 1st revision in Silvan's experiment: 14.19s, 14.13s, 14.17s
- 2nd revision in Silvan's experiment: 20.16s, 19.93s, 20.14s
- 2nd revision without timer calls:    14.30s, 14.30s, 14.41s
- 2nd revision, only prune call:       14.22s, 14.32s, 14.41s

Moreover, I used the "time" utility to check, for the 1st vs. 2nd revision, how much of the runtime is spent inside the kernel ("system time"), with one run only:

- 1st revision in Silvan's experiment: 0.100s
- 2nd revision in Silvan's experiment: 2.756s

So it looks like a lot of time is spent inside the system call that obtains the clock. This is a pity, because it would be nice to have a fast clock for many of the things we would like to do, but it's not clear whether we can do something about this. It may be worth trying out other clocks in our calls to clock_gettime; I'm surprised by the amount of overhead. With a bit of searching, I couldn't find information on ways of checking the elapsed time that are substantially faster.

There is also wasted time of our own devising. For example, we do unnecessary floating-point calculations every time we pause and resume the clock. We could instead accumulate elapsed time and convert once at the end. But this won't help with the time spent inside syscalls.
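
A hypothetical sketch of that idea, accumulating raw nanoseconds and converting to a floating-point value only when the result is requested; the names are made up and this is not the actual utils::Timer implementation.

    #include <time.h>

    // Sketch of a timer that accumulates elapsed time in integer
    // nanoseconds and converts to seconds only on demand.
    class AccumulatingTimer {
        long long accumulated_ns = 0;
        timespec last_start;
    public:
        void resume() {
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &last_start);
        }
        void stop() {
            timespec now;
            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now);
            // Integer arithmetic only; no floating-point work per stop.
            accumulated_ns += (now.tv_sec - last_start.tv_sec) * 1000000000LL +
                (now.tv_nsec - last_start.tv_nsec);
        }
        double elapsed_seconds() const {
            // Single conversion to floating point, done on demand.
            return accumulated_ns / 1e9;
        }
    };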

[later]

I've gone down a little rabbit hole of timer functions, and there may be some hope of something substantially faster if we're willing to go through some pain for it. We would probably like to use something like Windows's QueryPerformanceCounter for speed of access (we'd have to say goodbye to the concept of CPU time of a given process/thread, but that's probably unavoidable for a really fast timer). But there seems to be no direct (non-assembly language) way of doing this on Linux.

I'd be interested in seeing some microbenchmarks on timers to gather some information on this for the future. I don't think it's necessary for this issue, but if someone is interested in this topic, I'd like to discuss it. I think it could be useful for the future, and it could be useful to have a look while this is fresh on our minds.
msg10756 Author: silvan Date: 2022-07-12.12:46:03
When refactoring the pruning method interface in issue1042, we thought that the slowdown in the search was acceptable; however, we only tested LM-cut and none of the faster heuristics such as the blind heuristic or abstraction-based ones. Experiments (msg10736, msg10751) showed that the slowdown is a lot worse with these configurations, so we want to revisit the change introduced by that issue.

Malte did some local analysis (msg10754) that hints at the calls to the timer class being the culprit of the slowdown. As a first step, I'll therefore remove the timer statistics from the base class to see whether that already addresses the issue.
History
Date                 User    Action  Args
2022-07-15 14:33:48  malte   set     messages: + msg10781
2022-07-15 14:28:14  silvan  set     status: in-progress -> resolved; messages: + msg10779
2022-07-15 14:08:58  malte   set     messages: + msg10777
2022-07-15 14:05:21  silvan  set     messages: + msg10776
2022-07-15 13:19:04  malte   set     messages: + msg10775
2022-07-15 13:18:29  malte   set     messages: + msg10774
2022-07-13 11:51:20  silvan  set     messages: + msg10767
2022-07-13 11:25:32  silvan  set     messages: + msg10766
2022-07-13 10:01:25  gabi    set     messages: + msg10764
2022-07-13 09:52:04  gabi    set     nosy: + gabi
2022-07-13 09:42:33  silvan  set     messages: + msg10762
2022-07-12 17:39:03  malte   set     messages: + msg10759
2022-07-12 14:37:28  malte   set     messages: + msg10758
2022-07-12 12:46:03  silvan  create