Issue851

Title: M&S refactoring part 2: more efficient computation of transitions of irrelevant operators
Priority: wish
Status: resolved
Superseder:
Nosy List: malte, silvan
Assigned To: silvan
Keywords:
Optional summary
Possible further improvements: do not store information for all labels in each
transition system, but only for the relevant ones. This requires either a
concept of local label numbers and a mapping between global and local labels,
or using hash maps instead of vectors whenever storing information per label,
leaving out irrelevant labels.

Created on 2018-10-03.15:55:11 by silvan, last changed by silvan.

Summary
Possible further improvements: do not store information for all labels in each
transition system, but only for the relevant ones. This requires either a
concept of local label numbers and a mapping between global and local labels,
or using hash maps instead of vectors whenever storing information per label,
leaving out irrelevant labels.
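The first option from the summary, a dense local label numbering per transition system with a mapping to and from global label numbers, could look roughly like this sketch (class and method names are illustrative, not the actual Fast Downward API):

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: each transition system only registers the labels
// relevant to it, assigning them consecutive local numbers, so per-label
// data structures can be sized by the number of relevant labels only.
class LocalLabelMapping {
    std::vector<int> local_to_global;
    std::unordered_map<int, int> global_to_local;
public:
    // Register a globally numbered label as relevant; returns its local number.
    int add_relevant_label(int global_label) {
        int local = static_cast<int>(local_to_global.size());
        local_to_global.push_back(global_label);
        global_to_local[global_label] = local;
        return local;
    }
    bool is_relevant(int global_label) const {
        return global_to_local.count(global_label) != 0;
    }
    int get_local(int global_label) const {
        return global_to_local.at(global_label);
    }
    int get_global(int local_label) const {
        return local_to_global[local_label];
    }
    int num_relevant() const {
        return static_cast<int>(local_to_global.size());
    }
};
```

The trade-off is an extra indirection on every label lookup in exchange for memory proportional to the number of relevant labels.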
Files
issue851-base-issue851-v1.tar.gz (uploaded by silvan, 2018-10-19.11:27:01, application/gzip)
Messages
msg8391 Author: silvan Date: 2018-12-18.16:30:54
Yes, that corresponds to the latest version in the issue branch. I also
addressed the cosmetic comment and we now use the latest utils::sort_unique for
sorting. Thanks for reviewing! Marking this one as resolved.
msg8389 Author: malte Date: 2018-12-18.15:31:25
I slightly prefer the variant that reserves to the max size because I find it a
bit cleaner. If I understood and saw things correctly, this is how it is done in
the current head of the branch, so nothing to do there.

I left another cosmetic comment on bitbucket and otherwise have nothing else to
say that would prevent this issue from being merged. :-)
msg8370 Author: silvan Date: 2018-12-17.11:02:02
> Should we still stick with this variant?
Should I decide on answering this myself, since you were fine with merging
already ("From my side, I wouldn't mind merging it without me looking further
into the details."), or does this need more discussion and should it go to the
queue?
msg8313 Author: silvan Date: 2018-12-14.09:28:51
For completeness, also the comparison against the baseline:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v4-issue851-base-v2-issue851-v4-compare.html
msg8312 Author: silvan Date: 2018-12-14.08:37:30
Reserving more memory leads to more out-of-memory cases and 7 fewer tasks for
which the atomic FTS can be constructed. The variant is a bit faster, though (no
more reallocation, I guess):
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v4-issue851-v3-issue851-v4-compare.html


Should we still stick with this variant?
msg8309 Author: silvan Date: 2018-12-13.22:40:32
I'm fine with not looking at the FTS factory code further :-)

Results for v3, where I removed vector.reserve calls or reduced the space we
reserve. Compared to v2, there is no visible change:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v3-issue851-v2-issue851-v3-compare.html

Comparison against baseline:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v3-issue851-base-v2-issue851-v3-compare.html

(I already fixed the source for the two unexplained errors.)

The debug experiment ran through without triggering any assertions.

Next, I'll revert the change regarding how much vector capacity we reserve.
msg8269 Author: malte Date: 2018-12-12.18:31:48
I'm done with my comments. I did not have time to check the new FTS factory code
very carefully, which in my opinion would require rereading it as a whole. If
you want to see that done, you would need to find another volunteer. :-) From my
side, I wouldn't mind merging it without me looking further into the details.
msg8151 Author: silvan Date: 2018-12-05.15:12:02
Regardless of the previous message, I still changed quite a few things in
this issue, to at least slightly reduce memory pressure and make the code
cleaner: atomic transition systems are now created with their local label
equivalence relations computed during construction instead of computing and
regrouping labels only afterwards. The results are quite good:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v2-issue851-base-v2-issue851-v2-compare.html

We can finish the computation of the atomic FTS in 12 more tasks compared to the
baseline (and +5 compared to v1, in which the new feature was to only compute
irrelevant transitions once), and the overall abstraction in 12 more tasks
compared to the baseline (+10 compared to v1).

I'd now be happy to have a code review:
https://bitbucket.org/SilvanS/fd-dev/pull-requests/42/issue851/diff
msg8150 Author: silvan Date: 2018-12-05.15:09:09
In an offline discussion, we came to the conclusion that to really address the
high memory consumption, we cannot afford storing an entry per operator and
variable. Instead, we should (re-)introduce a special handling of labels
irrelevant to a transition system. To achieve this, there are mainly two
options: either each transition system uses its own contiguous local label
numbers and maps between global and local ones, or all data structures that
need to store information for each label (transitions_by_group_id,
label_to_positions) are turned into hash maps, thus potentially slowing down
the runtime. I added this to the summary.
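The second option, turning per-label vectors such as transitions_by_group_id into hash maps, might look roughly like the following sketch (names are illustrative, not the actual code):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: instead of a vector indexed by group id and sized
// for all labels, only relevant group ids appear as keys, so irrelevant
// labels cost no memory at all. Transition = (source, target) state pair.
using Transition = std::pair<int, int>;

class SparseTransitionsByGroup {
    std::unordered_map<int, std::vector<Transition>> transitions_by_group_id;
public:
    void add_transition(int group_id, int src, int target) {
        transitions_by_group_id[group_id].emplace_back(src, target);
    }
    bool has_group(int group_id) const {
        return transitions_by_group_id.count(group_id) != 0;
    }
    const std::vector<Transition> &get(int group_id) const {
        return transitions_by_group_id.at(group_id);
    }
};
```

As noted above, the hashing overhead on every access is the price paid for skipping the irrelevant labels.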
msg8133 Author: malte Date: 2018-11-29.13:21:31
>> (BTW, this would be useful information to include in the output in the future, I
>> think. It's not just a question of parsing it. The information isn't printed out
>> by the search component or driver.)
>
> I think printing this from the search would be quite difficult, since we have a
> memory handler that does not know anything about time, and printing happens
> carefully, using only re-entrant functions. The driver could print the time
> consumed by each component and also total time. (We then would have some of the
> time information duplicated if the component also prints it itself.)

Yes, this would make sense in the driver, not the search component. While we
should make sure not to print too much redundant information, it's just wrong
not to print *any* information about how long the search component ran in cases
like these. Logging the amount of time taken is perhaps the most basic thing one
always needs to do when running computational experiments.

Regarding the suggested performance changes, I don't think either
SegmentedVector or avoiding the reserve call will make a meaningful difference.
If I remember the implementation correctly, the best we can get from this is a
small constant factor, but the real issue seems to be that for a planning task
with N state variables and M operators, we use O(NM) memory, which is simply
prohibitive for very large tasks. For example, the largest Satellite task, p33,
has 974711 operators and 326 variables. Even if we only store 32 bytes of
information for every pair of variable and operator, that's already more than 9 GiB.

We should look into something that only requires O(N + M) memory for tasks where
the number of preconditions/effects per operator is bounded by a constant, as in
Satellite. With the same numbers, we would then be at roughly 30 MiB instead.

I don't know which of the data structures you mention is needed at which time,
but hopefully we could manage this by only creating the per-operator data
structures temporarily during creation, handling one variable at a time. I don't
think this requires giving up on the current strategy of producing all
transitions with only one pass through the operators; it should be possible to
do this by computing some of the information in an initial pass that goes over
all operators only once, and the rest of the information later, one state
variable at a time. So the data would need to be organized a bit differently
during construction.

I have a vague idea how this could be done, but I don't know the current code
well enough to relate this to how we currently do this. I would suggest
discussing this live rather than on the tracker.
msg8132 Author: silvan Date: 2018-11-29.12:47:04
> (BTW, this would be useful information to include in the output in the future, I
> think. It's not just a question of parsing it. The information isn't printed out
> by the search component or driver.)
I think printing this from the search would be quite difficult, since we have a
memory handler that does not know anything about time, and printing happens
carefully, using only re-entrant functions. The driver could print the time
consumed by each component and also total time. (We then would have some of the
time information duplicated if the component also prints it itself.)

> I think a logical next step would be trying to find out what is eating all the
> memory in this Satellite task, but I'll stop here and pass over to you. :-)
Even before computing transitions of each label (i.e., operator), we run out of
memory during the initialization of data structures related to transition
systems: for the vector "transitions_by_label" (more precisely, it will later
store the transitions of each label *group*, but initially, groups are
singletons), we reserve space for 2n-1 entries if there are n operators, which is
the maximum number of labels that we can ever have using label reduction (since
we never re-use old label numbers but add new ones). Further, the label
equivalence relation has a vector of label groups (lists of label numbers) which
also reserves space for 2n-1 entries, and a vector mapping each label to its
group id, which has size 2n-1.

Of course, we could get rid of reserving memory up front and have dynamic
reallocation when reducing labels. This would allow us to finish the initialization
of Satellite p30, and actually the entire computation of the atomic FTS. (It
seems like computing the full transitions first and only then computing the
label equivalence relation is not too harmful.) Alternatively, we could think of
using the SegmentedVector class to avoid costly reallocations. Would that make
sense or does this class come with different overhead?

Furthermore, if we were able to compute the label equivalence relations during
the computation of transitions (which I'm not sure would be beneficial because I
don't see how this could be easily done without giving up computing transitions
of all transition systems simultaneously, going over the operators only once),
we could save some memory by having fewer label groups from the start.

I think a natural thing to test is to give up reserving space for the maximum
possible number of labels. What do you think of the other options?
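For illustration, the two reservation strategies discussed here could be sketched as follows. The 2n-1 bound holds because label reduction only introduces fresh label numbers: starting from n labels, each reduction replaces at least two old labels by one new one, so at most n-1 new numbers can ever be created. (Function names are illustrative, not the actual factory code; the per-label element type is simplified.)

```cpp
#include <cassert>
#include <vector>

// Strategy 1: reserve the maximum number of labels up front, so no
// reallocation ever happens during label reduction, at the cost of
// peak memory during atomic FTS construction.
std::vector<std::vector<int>> make_reserved(int num_ops) {
    std::vector<std::vector<int>> transitions_by_label;
    transitions_by_label.reserve(2 * num_ops - 1);  // max labels ever possible
    transitions_by_label.resize(num_ops);           // initial singleton groups
    return transitions_by_label;
}

// Strategy 2: no up-front reserve; the vector grows on demand when label
// reduction adds new labels, trading occasional reallocation for a lower
// memory footprint during construction.
std::vector<std::vector<int>> make_on_demand(int num_ops) {
    return std::vector<std::vector<int>>(num_ops);
}
```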
msg8017 Author: malte Date: 2018-10-20.02:29:35
Ah, I forgot to comment on the plots. Atomic construction time is not affected
by the M&S setting, so all three should look basically the same, right? To me
they look very good, i.e., if the new code isn't too complicated, I think it
looks quite worthwhile. 

I still haven't looked at the code yet. Would prefer to do that later when it is
clearer where we're heading with this.
msg8016 Author: malte Date: 2018-10-20.02:19:23
PS: In case this is useful, here is the script that generated the data, along
with its output. I've also cleaned it up a bit and added the number of facts and
the run_dir to the output. Before running the script, I cherry-picked the run
dirs where the translator succeeded, the algorithm config is the one we want,
and the atomic transition system construction failed, but I don't have a script
for that.

Script:

=============================================================================
#! /bin/bash
# For each run, print the product VARS * OPS together with the variable,
# operator and fact counts reported by the translator and the run dir.
printf '%10s %4s %6s %5s %s\n' PROD VARS OPS FACTS FILE
for x in */run.log; do
    VARS=$(grep 'Translator variables' "$x" | awk '{ print $3 }')
    OPS=$(grep 'Translator operators' "$x" | awk '{ print $3 }')
    FACTS=$(grep 'Translator facts' "$x" | awk '{ print $3 }')
    printf '%10d %4d %6d %5s %s\n' $((VARS * OPS)) "$VARS" "$OPS" "$FACTS" "$x"
done
=============================================================================

Output:

=============================================================================
      PROD VARS    OPS FACTS FILE
  46759400  200 233797  2245 06815/run.log
  47460912 1226  38712  2452 06391/run.log
  49469888 1924  25712  3848 07025/run.log
  53618300 4093  13100 12472 05551/run.log
  59451600 1850  32136  3776 07028/run.log
  64589416 1982  32588  4040 07022/run.log
  77952000  232 336000  2088 06820/run.log
  79193400  220 359970  2735 06816/run.log
 101204335  241 419935  2512 06821/run.log
 126402526 1909  66214  3818 06393/run.log
 168674644  268 629383  3581 06817/run.log
 215306700 2516  85575  5032 06404/run.log
 299651001 2797 107133  5594 06405/run.log
 317755786  326 974711  4447 06818/run.log
 365607648 2973 122976  5946 06398/run.log
 464544340 3406 136390  6812 06400/run.log
 851134998 4207 202314  8414 06403/run.log
1022845148 4771 214388  9542 06401/run.log
1799681000 5750 312988 11500 06402/run.log
=============================================================================
msg8015 Author: malte Date: 2018-10-20.02:12:46
I had a look at the 19 failing cases where the translator succeeds but the
atomic transition systems cannot be constructed (new code, first config).

In all cases, the algorithm runs out of memory, not time, and it happens
quickly. The output doesn't contain information about the search runtime for
these failing tasks, so we don't see *when* it runs out of memory.

(BTW, this would be useful information to include in the output in the future, I
think. It's not just a question of parsing it. The information isn't printed out
by the search component or driver.)

But we can compute an upper bound based on the translator CPU time (which is
included) and the overall planner wall-clock time (which is also included).
There are only two cases where it is more than 10 seconds until the planner runs
out of memory, and the maximum is around 13 seconds.

So I think a natural next question would be if we are wasting memory somewhere
or if the planner can legitimately run out of memory for these instances. In
principle, due to the way that operators without preconditions "multiply out",
the atomic transition systems can be of size Omega(N^2) in the size of the input
representation if there exist state variables with very large domains, and at
least airport (one of the affected domains) has state variables with a domain of
several hundred values. But I think that due to the way our invariant synthesis
works, this cannot actually happen for the SAS+ representations that our
translator generates.

So another hypothesis could be that the atomic abstractions need some memory
*per operator of the task*, even for the operators that are irrelevant to the
given atomic abstraction. If there are K state variables and M operators,
obviously that would mean Omega(K * M) memory usage, and these are tasks where K
and M are large. They have between 200 and 5750 state variables and between
13100 and 974711 operators. The products K * M fall into the range 46759400 to
1799681000.

The lowest of these products arises in the Satellite domain, problem 30, which
has 200 variables and 233797 operators. If there is something which is O(M^2),
that would of course also be a problem, or perhaps there is something which is
O(L * M), where L is the number of facts (variable/value pairs). While these
Satellite tasks don't have as many variables as the larger tasks in this list,
they have many variables with large domains, so they have many facts.

I think a logical next step would be trying to find out what is eating all the
memory in this Satellite task, but I'll stop here and pass over to you. :-)
msg7998 Author: silvan Date: 2018-10-19.11:27:01
I see your point; what I meant was "no impact on coverage", which is
short-sighted of course. I think that for the large tasks where we cannot
compute the atomic FTS, we cannot deal with them anyway (e.g. the 13 large
organic-synthesis tasks).

The log files are in:
/infai/sieverss/repos/downward/dev/experiments/issue851/data/issue851-v1

1) I attached the plots. It looks like we are indeed constantly faster with the
new version, but only by some small amount in most cases.

2) I don't know how to do this easily (plotting 2 different attributes on the
two axes), but I think that Manuel and Jendrik had the same problem before and I
sent them an email.
msg7989 Author: malte Date: 2018-10-19.03:29:51
> Unfortunately, the change has basically no impact.

It does reduce the number of cases where we cannot compute the atomic transition
systems from 39 to 32, right? Sure we'd like to see something better, but I
think that is some impact.

If you mean "no impact on coverage", I think that's because we're optimizing a
part of the algorithm that is not a bottleneck. I scrolled through all the
individual numbers for atomic transition system construction time, and we see a
substantial speedup there. But there are very few examples where it takes more
than a second to start with, and very very few where it takes more than three
seconds. I would hypothesize that almost all of these are way, way, way beyond
the scalability of optimal algorithms. (For logistics98 and satellite, which are
two of the domains where we have missing entries for the old code, I know this
for sure.) For example, the largest Satellite tasks are so large that we could
not solve them with A* even with a perfect heuristic that takes 0 time to
compute. (Just storing the states along the optimal solution would exhaust memory.)


Can you tell me where I can find the run files? I know some of these instances
(e.g. in Logistics, Airport and Satellite) are way, way beyond what is
reasonable for an optimal planner. For some of the 2018 domains which are
challenging to ground, I would like to have a look at how much time is actually
spent in the search component as opposed to the translator, given that we now
have an overall time limit, not a search time limit. For example, if I interpret
the experiment HTML file correctly, the 13 failing cases in organic-synthesis
are all ones where the translator runs out of memory. Even a zero-time
merge-and-shrink initialization cannot help with that. :-) So that would be 13
of the 32 cases already covered.


To have a better picture of the overall impact, I would be interested in

1) a before-after scatterplot of atomic transition system construction time, to
get an idea of how much we speed this up and how much time this typically takes, and

2) a scatterplot of total_time vs. atomic transition system construction time,
to see how much of the overall runtime on the tasks we can actually solve is
spent on constructing the atomic transition systems.

To make sure that constructing the atomic transition systems isn't a bottleneck,
we should also check the cases where it doesn't complete, as neither of these
two plots can give us information about that. For that, I think looking at the
individual run logs could be useful.
msg7877 Author: silvan Date: 2018-10-04.10:28:51
I opened a pull-request:
https://bitbucket.org/SilvanS/fd-dev/pull-requests/42/issue851/diff
(just a few new lines in one method)

Results of the change:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v1-issue851-base-issue851-v1-compare.html

Unfortunately, the change has basically no impact. The relevant attributes to
look at are ms_atomic_construction_time and ms_atomic_fts_constructed. We can
now complete the atomic FTS for 7 more tasks than before, being able to do so for
1795 out of 1827 tasks. The involved domains after this issue are: airport (1),
organic-synthesis (13), organic-synthesis-split (9), satellite (6), tetris (3).

Regarding the usefulness of adding a time limit also for the atomic FTS factory
(the discussion we had for issue802), this issue does not really help... Unless,
of course, we deem these large tasks to be so large that we cannot handle them
anyway.
msg7847 Author: silvan Date: 2018-10-03.15:55:11
This is part of meta issue567. Quoting item 1 in msg4462: 

   In the atomic transition system for variable v, it looks like we
   generate self-loops for each value of v and each operator o that does not
   mention v. So if there are N values and M irrelevant operators, this will
   take time and space O(NM). I assume that the self-loops for the different
   irrelevant operators are then combined into one in label normalization,
   reducing the representation to O(N). It might be worth checking out if we
   can get a performance boost from avoiding the O(NM) intermediate
   representation, instead generating a "normalized" representation for the
   irrelevant operators from the start.
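A minimal sketch of the "normalized" representation suggested in the quote, assuming the set of labels irrelevant to a variable is known up front (struct and function names are illustrative, not the actual code):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// For variable v with N values, all M operators that do not mention v induce
// identical self-loops, so instead of materializing O(NM) transitions we can
// emit one shared group of N self-loops and remember which labels belong to it.
struct IrrelevantLoops {
    std::vector<int> irrelevant_labels;            // the M irrelevant operators
    std::vector<std::pair<int, int>> self_loops;   // N transitions (s, s)
};

IrrelevantLoops build_irrelevant_group(
        const std::vector<int> &irrelevant_labels, int domain_size) {
    IrrelevantLoops group;
    group.irrelevant_labels = irrelevant_labels;
    group.self_loops.reserve(domain_size);
    for (int value = 0; value < domain_size; ++value)
        group.self_loops.emplace_back(value, value);  // O(N), not O(NM)
    return group;
}
```

This produces directly the representation that label normalization would otherwise have to recover from the O(NM) intermediate one.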
History
Date User Action Args
2018-12-18 16:30:55  silvan  set     status: reviewing -> resolved; messages: + msg8391
2018-12-18 15:31:25  malte   set     messages: + msg8389
2018-12-17 11:02:02  silvan  set     messages: + msg8370
2018-12-14 09:28:51  silvan  set     messages: + msg8313
2018-12-14 08:37:30  silvan  set     messages: + msg8312
2018-12-13 22:40:32  silvan  set     messages: + msg8309
2018-12-12 18:31:48  malte   set     messages: + msg8269
2018-12-05 15:12:02  silvan  set     messages: + msg8151
2018-12-05 15:09:09  silvan  set     status: in-progress -> reviewing; messages: + msg8150; summary: Possible further improvements: do not store information for all labels in each transition system, but only for the relevant ones. This requires either a concept of local label numbers and a mapping between global and local labels, or using hash maps instead of vectors whenever storing information per label, leaving out irrelevant labels.
2018-11-29 13:21:31  malte   set     messages: + msg8133
2018-11-29 12:47:04  silvan  set     status: reviewing -> in-progress; messages: + msg8132
2018-10-20 02:29:35  malte   set     messages: + msg8017
2018-10-20 02:19:23  malte   set     messages: + msg8016
2018-10-20 02:12:46  malte   set     messages: + msg8015
2018-10-19 11:27:01  silvan  set     files: + issue851-base-issue851-v1.tar.gz; messages: + msg7998
2018-10-19 03:29:51  malte   set     messages: + msg7989
2018-10-04 10:28:51  silvan  set     status: in-progress -> reviewing; messages: + msg7877
2018-10-03 15:55:11  silvan  create