Issue851

Title: M&S refactoring part 2: more efficient computation of transitions of irrelevant operators
Priority: wish
Status: resolved
Superseder:
Nosy List: malte, silvan
Assigned To: silvan
Keywords:
Optional summary
Possible further improvements: do not store information for all labels in each
transition system, but only for the relevant ones. This requires either a
concept of local label numbers and a mapping between global and local labels,
or using hash maps instead of vectors whenever storing information per label,
leaving out irrelevant labels.

Created on 2018-10-03.15:55:11 by silvan, last changed by silvan.

Summary
Possible further improvements: do not store information for all labels in each
transition system, but only for the relevant ones. This requires either a
concept of local label numbers and a mapping between global and local labels,
or using hash maps instead of vectors whenever storing information per label,
leaving out irrelevant labels.
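The first option from the summary, a dense local label numbering per transition system with a mapping to and from global label numbers, could look roughly like this sketch (class and method names are illustrative, not the actual Fast Downward API):

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: each transition system only registers the labels
// relevant to it, assigning them consecutive local numbers, so per-label
// data structures can be sized by the number of relevant labels only.
class LocalLabelMapping {
    std::vector<int> local_to_global;
    std::unordered_map<int, int> global_to_local;
public:
    // Register a globally numbered label as relevant; returns its local number.
    int add_relevant_label(int global_label) {
        int local = static_cast<int>(local_to_global.size());
        local_to_global.push_back(global_label);
        global_to_local[global_label] = local;
        return local;
    }
    bool is_relevant(int global_label) const {
        return global_to_local.count(global_label) != 0;
    }
    int get_local(int global_label) const {
        return global_to_local.at(global_label);
    }
    int get_global(int local_label) const {
        return local_to_global[local_label];
    }
    int num_relevant() const {
        return static_cast<int>(local_to_global.size());
    }
};
```

The trade-off is an extra indirection on every label lookup in exchange for memory proportional to the number of relevant labels.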
Files
issue851-base-issue851-v1.tar.gz (uploaded by silvan, 2018-10-19.11:27:01, application/gzip)
Messages
msg8391 Author: silvan Date: 2018-12-18.16:30:54
Yes, that corresponds to the latest version in the issue branch. I also
addressed the cosmetic comment and we now use the latest utils::sort_unique for
sorting. Thanks for reviewing! Marking this one as resolved.
msg8389 Author: malte Date: 2018-12-18.15:31:25
I slightly prefer the variant that reserves to the max size because I find it a
bit cleaner. If I understood and saw things correctly, this is how it is done in
the current head of the branch, so nothing to do there.

I left another cosmetic comment on bitbucket and otherwise have nothing else to
say that would prevent this issue from being merged. :-)
msg8370 Author: silvan Date: 2018-12-17.11:02:02
> Should we still stick with this variant?
Should I decide on answering this myself, since you were fine with merging
already ("From my side, I wouldn't mind merging it without me looking further
into the details."), or does this need more discussion and should it go to the
queue?
msg8313 Author: silvan Date: 2018-12-14.09:28:51
For completeness, also the comparison against the baseline:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v4-issue851-base-v2-issue851-v4-compare.html
msg8312 Author: silvan Date: 2018-12-14.08:37:30
Reserving more memory leads to more out-of-memory cases and 7 fewer tasks for
which the atomic FTS can be constructed. The variant is a bit faster, though (no
more reallocation, I guess):
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v4-issue851-v3-issue851-v4-compare.html


Should we still stick with this variant?
msg8309 Author: silvan Date: 2018-12-13.22:40:32
I'm fine with not looking at the FTS factory code further :-)

Results for v3, where I removed vector.reserve calls or reduced the space we
reserve. Compared to v2, there is no visible change:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v3-issue851-v2-issue851-v3-compare.html

Comparison against baseline:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v3-issue851-base-v2-issue851-v3-compare.html

(I already fixed the source for the two unexplained errors.)

The debug experiment ran through without triggering any assertions.

Next, I'll revert the change regarding how much vector capacity we reserve.
msg8269 Author: malte Date: 2018-12-12.18:31:48
I'm done with my comments. I did not have time to check the new FTS factory code
very carefully, which in my opinion would require rereading it as a whole. If
you want to see that done, you would need to find another volunteer. :-) From my
side, I wouldn't mind merging it without me looking further into the details.
msg8151 Author: silvan Date: 2018-12-05.15:12:02
Regardless of the previous message, I still changed quite a few things in
this issue, to at least slightly reduce memory pressure and make the code
cleaner: atomic transition systems are now created with their local label
equivalence relations computed during construction instead of computing and
regrouping labels only afterwards. The results are quite good:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v2-issue851-base-v2-issue851-v2-compare.html

We can finish the computation of the atomic FTS in 12 more tasks compared to the
baseline (and +5 compared to v1, in which the new feature was to only compute
irrelevant transitions once), and the overall abstraction in 12 more tasks
compared to the baseline (+10 compared to v1).

I'd now be happy to have a code review:
https://bitbucket.org/SilvanS/fd-dev/pull-requests/42/issue851/diff
msg8150 Author: silvan Date: 2018-12-05.15:09:09
In an offline discussion, we came to the conclusion that to really address the
high memory consumption, we cannot afford storing an entry per operator and
variable. Instead, we should (re-)introduce a special handling of labels
irrelevant to a transition system. To achieve this, there are mainly two
options: either each transition system uses its own contiguous local label
numbers and maps between global and local ones, or all data structures that
need to store information for each label (transitions_by_group_id,
label_to_positions) are turned into hash maps, thus potentially slowing down
the runtime. I added this to the summary.
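The second option, turning per-label vectors such as transitions_by_group_id into hash maps, might look roughly like the following sketch (names are illustrative, not the actual code):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: instead of a vector indexed by group id and sized
// for all labels, only relevant group ids appear as keys, so irrelevant
// labels cost no memory at all. Transition = (source, target) state pair.
using Transition = std::pair<int, int>;

class SparseTransitionsByGroup {
    std::unordered_map<int, std::vector<Transition>> transitions_by_group_id;
public:
    void add_transition(int group_id, int src, int target) {
        transitions_by_group_id[group_id].emplace_back(src, target);
    }
    bool has_group(int group_id) const {
        return transitions_by_group_id.count(group_id) != 0;
    }
    const std::vector<Transition> &get(int group_id) const {
        return transitions_by_group_id.at(group_id);
    }
};
```

As noted above, the hashing overhead on every access is the price paid for skipping the irrelevant labels.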
msg8133 Author: malte Date: 2018-11-29.13:21:31
>> (BTW, this would be useful information to include in the output in the future, I
>> think. It's not just a question of parsing it. The information isn't printed out
>> by the search component or driver.)
>
> I think printing this from the search would be quite difficult, since we have a
> memory handler that does not know anything about time, and printing happens
> carefully, using only re-entrant functions. The driver could print the time
> consumed by each component and also total time. (We then would have some of the
> time information duplicated if the component also prints it itself.)

Yes, this would make sense in the driver, not the search component. While we
should make sure not to print too much redundant information, it's just wrong
not to print *any* information about how long the search component ran in cases
like these. Logging the amount of time taken is perhaps the most basic thing one
always needs to do when running computational experiments.

Regarding the suggested performance changes, I don't think either
SegmentedVector or avoiding the reserve call will make a meaningful difference.
If I remember the implementation correctly, the best we can get from this is a
small constant factor, but the real issue seems to be that for a planning task
with N state variables and M operators, we use O(NM) memory, which is simply
prohibitive for very large tasks. For example, the largest Satellite task, p33,
has 974711 operators and 326 variables. Even if we only store 32 bytes of
information for every pair of variable and operator, that's already more than 9 GiB.

We should look into something that only requires O(N + M) memory for tasks where
the number of preconditions/effects per operator is bounded by a constant, as in
Satellite. With the same numbers, we would then be at roughly 30 MiB instead.

I don't know which of the data structures you mention is needed at which time,
but hopefully we could manage this by only creating the per-operator data
structures temporarily during creation, handling one variable at a time. I don't
think this requires giving up on the current strategy of producing all
transitions with only one pass through the operators; it should be possible to
do this by computing some of the information in an initial pass that goes over
all operators only once, and the rest of the information later, one state
variable at a time. So the data would need to be organized a bit differently
during construction.

I have a vague idea how this could be done, but I don't know the current code
well enough to relate this to how we currently do this. I would suggest
discussing this live rather than on the tracker.
msg8132 Author: silvan Date: 2018-11-29.12:47:04
> (BTW, this would be useful information to include in the output in the future, I
> think. It's not just a question of parsing it. The information isn't printed out
> by the search component or driver.)
I think printing this from the search would be quite difficult, since we have a
memory handler that does not know anything about time, and printing happens
carefully, using only re-entrant functions. The driver could print the time
consumed by each component and also total time. (We then would have some of the
time information duplicated if the component also prints it itself.)

> I think a logical next step would be trying to find out what is eating all the
> memory in this Satellite task, but I'll stop here and pass over to you. :-)
Even before computing transitions of each label (i.e., operator), we run out of
memory during the initialization of data structures related to transition
systems: for the vector "transitions_by_label" (more precisely, it will later
store the transitions of each label *group*, but initially, groups are
singletons), we reserve space for 2n-1 entries if there are n operators, which is
the maximum number of labels that we can ever have using label reduction (since
we never re-use old label numbers but add new ones). Further, the label
equivalence relation has a vector of label groups (lists of label numbers) which
also reserves space for 2n-1 entries, and a vector mapping each label to its
group id, which has size 2n-1.

Of course, we could get rid of reserving memory up front and have dynamic
reallocation when reducing labels. This would allow us to finish the initialization
of Satellite p30, and actually the entire computation of the atomic FTS. (It
seems like computing the full transitions first and only then computing the
label equivalence relation is not too harmful.) Alternatively, we could think of
using the SegmentedVector class to avoid costly reallocations. Would that make
sense or does this class come with different overhead?

Furthermore, if we were able to compute the label equivalence relations during
the computation of transitions (which I'm not sure would be beneficial because I
don't see how this could be easily done without giving up computing transitions
of all transition systems simultaneously, going over the operators only once),
we could save some memory by having fewer label groups from the start.

I think a natural thing to test is to give up reserving space for the maximum
possible number of labels. What do you think of the other options?
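For illustration, the two reservation strategies discussed here could be sketched as follows. The 2n-1 bound holds because label reduction only introduces fresh label numbers: starting from n labels, each reduction replaces at least two old labels by one new one, so at most n-1 new numbers can ever be created. (Function names are illustrative, not the actual factory code; the per-label element type is simplified.)

```cpp
#include <cassert>
#include <vector>

// Strategy 1: reserve the maximum number of labels up front, so no
// reallocation ever happens during label reduction, at the cost of
// peak memory during atomic FTS construction.
std::vector<std::vector<int>> make_reserved(int num_ops) {
    std::vector<std::vector<int>> transitions_by_label;
    transitions_by_label.reserve(2 * num_ops - 1);  // max labels ever possible
    transitions_by_label.resize(num_ops);           // initial singleton groups
    return transitions_by_label;
}

// Strategy 2: no up-front reserve; the vector grows on demand when label
// reduction adds new labels, trading occasional reallocation for a lower
// memory footprint during construction.
std::vector<std::vector<int>> make_on_demand(int num_ops) {
    return std::vector<std::vector<int>>(num_ops);
}
```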
msg8017 Author: malte Date: 2018-10-20.02:29:35
Ah, I forgot to comment on the plots. Atomic construction time is not affected
by the M&S setting, so all three should look basically the same, right? To me
they look very good, i.e., if the new code isn't too complicated, I think it
looks quite worthwhile. 

I still haven't looked at the code yet. Would prefer to do that later when it is
clearer where we're heading with this.
msg8016 Author: malte Date: 2018-10-20.02:19:23
PS: In case this is useful, here is the script that generated the data, along
with its output. I've also cleaned it up a bit and added the number of facts and
the run_dir to the output. Before running the script, I cherry-picked the run
dirs where the translator succeeded, the algorithm config is the one we want,
and the atomic transition system construction failed, but I don't have a script
for that.

Script:

=============================================================================
#! /bin/bash
# For each run, print the product VARS * OPS together with the variable,
# operator and fact counts reported by the translator and the run dir.
printf '%10s %4s %6s %5s %s\n' PROD VARS OPS FACTS FILE
for x in */run.log; do
    VARS=$(grep 'Translator variables' "$x" | awk '{ print $3 }')
    OPS=$(grep 'Translator operators' "$x" | awk '{ print $3 }')
    FACTS=$(grep 'Translator facts' "$x" | awk '{ print $3 }')
    printf '%10d %4d %6d %5s %s\n' $((VARS * OPS)) "$VARS" "$OPS" "$FACTS" "$x"
done
=============================================================================

Output:

=============================================================================
      PROD VARS    OPS FACTS FILE
  46759400  200 233797  2245 06815/run.log
  47460912 1226  38712  2452 06391/run.log
  49469888 1924  25712  3848 07025/run.log
  53618300 4093  13100 12472 05551/run.log
  59451600 1850  32136  3776 07028/run.log
  64589416 1982  32588  4040 07022/run.log
  77952000  232 336000  2088 06820/run.log
  79193400  220 359970  2735 06816/run.log
 101204335  241 419935  2512 06821/run.log
 126402526 1909  66214  3818 06393/run.log
 168674644  268 629383  3581 06817/run.log
 215306700 2516  85575  5032 06404/run.log
 299651001 2797 107133  5594 06405/run.log
 317755786  326 974711  4447 06818/run.log
 365607648 2973 122976  5946 06398/run.log
 464544340 3406 136390  6812 06400/run.log
 851134998 4207 202314  8414 06403/run.log
1022845148 4771 214388  9542 06401/run.log
1799681000 5750 312988 11500 06402/run.log
=============================================================================
msg8015 Author: malte Date: 2018-10-20.02:12:46
I had a look at the 19 failing cases where the translator succeeds but the
atomic transition systems cannot be constructed (new code, first config).

In all cases, the algorithm runs out of memory, not time, and it happens
quickly. The output doesn't contain information about the search runtime for
these failing tasks, so we don't see *when* it runs out of memory.

(BTW, this would be useful information to include in the output in the future, I
think. It's not just a question of parsing it. The information isn't printed out
by the search component or driver.)

But we can compute an upper bound based on the translator CPU time (which is
included) and the overall planner wall-clock time (which is also included).
There are only two cases where it is more than 10 seconds until the planner runs
out of memory, and the maximum is around 13 seconds.

So I think a natural next question would be if we are wasting memory somewhere
or if the planner can legitimately run out of memory for these instances. In
principle, due to the way that operators without preconditions "multiply out",
the atomic transition systems can be of size Omega(N^2) in the size of the input
representation if there exist state variables with very large domains, and at
least airport (one of the affected domains) has state variables with a domain of
several hundred values. But I think that due to the way our invariant synthesis
works, this cannot actually happen for the SAS+ representations that our
translator generates.

So another hypothesis could be that the atomic abstractions need some memory
*per operator of the task*, even for the operators that are irrelevant to the
given atomic abstraction. If there are K state variables and M operators,
obviously that would mean Omega(K * M) memory usage, and these are tasks where K
and M are large. They have between 200 and 5750 state variables and between
13100 and 974711 operators. The products K * M fall into the range 46759400 to
1799681000.

The lowest of these products arises in the Satellite domain, problem 30, which
has 200 variables and 233797 operators. If there is something which is O(M^2),
that would of course also be a problem, or perhaps there is something which is
O(L * M), where L is the number of facts (variable/value pairs). While these
Satellite tasks don't have as many variables as the larger tasks in this list,
they have many variables with large domains, so they have many facts.

I think a logical next step would be trying to find out what is eating all the
memory in this Satellite task, but I'll stop here and pass over to you. :-)
msg7998 Author: silvan Date: 2018-10-19.11:27:01
I see your point; what I meant was "no impact on coverage", which is
short-sighted of course. I think that for the large tasks where we cannot
compute the atomic FTS, we cannot deal with them anyway (e.g. the 13 large
organic-synthesis tasks).

The log files are in:
/infai/sieverss/repos/downward/dev/experiments/issue851/data/issue851-v1

1) I attached the plots. It looks like we are indeed constantly faster with the
new version, but only by some small amount in most cases.

2) I don't know how to do this easily (plotting 2 different attributes on the
two axes), but I think that Manuel and Jendrik had the same problem before and I
sent them an email.
msg7989 Author: malte Date: 2018-10-19.03:29:51
> Unfortunately, the change has basically no impact.

It does reduce the number of cases where we cannot compute the atomic transition
systems from 39 to 32, right? Sure we'd like to see something better, but I
think that is some impact.

If you mean "no impact on coverage", I think that's because we're optimizing a
part of the algorithm that is not a bottleneck. I scrolled through all the
individual numbers for atomic transition system construction time, and we see a
substantial speedup there. But there are very few examples where it takes more
than a second to start with, and very very few where it takes more than three
seconds. I would hypothesize that almost all of these are way, way, way beyond
the scalability of optimal algorithms. (For logistics98 and satellite, which are
two of the domains where we have missing entries for the old code, I know this
for sure.) For example, the largest Satellite tasks are so large that we could
not solve them with A* even with a perfect heuristic that takes 0 time to
compute. (Just storing the states along the optimal solution would exhaust memory.)


Can you tell me where I can find the run files? I know some of these instances
(e.g. in Logistics, Airport and Satellite) are way, way beyond what is
reasonable for an optimal planner. For some of the 2018 domains which are
challenging to ground, I would like to have a look at how much time is actually
spent in the search component as opposed to the translator, given that we now
have an overall time limit, not a search time limit. For example, if I interpret
the experiment HTML file correctly, the 13 failing cases in organic-synthesis
are all ones where the translator runs out of memory. Even a zero-time
merge-and-shrink initialization cannot help with that. :-) So that would be 13
of the 32 cases already covered.


To have a better picture of the overall impact, I would be interested in

1) a before-after scatterplot of atomic transition system construction time, to
get an idea of how much we speed this up and how much time this typically takes, and

2) a scatterplot of total_time vs. atomic transition system construction time,
to see how much of the overall runtime on the tasks we can actually solve is
spent on constructing the atomic transition systems.

To make sure that constructing the atomic transition systems isn't a bottleneck,
we should also check the cases where it doesn't complete, as neither of these
two plots can give us information about that. For that, I think looking at the
individual run logs could be useful.
msg7877 Author: silvan Date: 2018-10-04.10:28:51
I opened a pull-request:
https://bitbucket.org/SilvanS/fd-dev/pull-requests/42/issue851/diff
(just a few new lines in one method)

Results of the change:
https://ai.dmi.unibas.ch/_tmp_files/sieverss/issue851-v1-issue851-base-issue851-v1-compare.html

Unfortunately, the change has basically no impact. The relevant attributes to
look at are ms_atomic_construction_time and ms_atomic_fts_constructed. We can
now complete the atomic FTS for 7 more tasks than before, being able to do so for
1795 out of 1827 tasks. The involved domains after this issue are: airport (1),
organic-synthesis (13), organic-synthesis-split (9), satellite (6), tetris (3).

Regarding the usefulness of adding a time limit also for the atomic FTS factory
(the discussion we had for issue802), this issue does not really help... Unless,
of course, we deem these large tasks to be so large that we cannot handle them
anyway.
msg7847 Author: silvan Date: 2018-10-03.15:55:11
This is part of meta issue567. Quoting item 1 in msg4462: 

   In the atomic transition system for variable v, it looks like we
   generate self-loops for each value of v and each operator o that does not
   mention v. So if there are N values and M irrelevant operators, this will
   take time and space O(NM). I assume that the self-loops for the different
   irrelevant operators are then combined into one in label normalization,
   reducing the representation to O(N). It might be worth checking out if we
   can get a performance boost from avoiding the O(NM) intermediate
   representation, instead generating a "normalized" representation for the
   irrelevant operators from the start.
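A minimal sketch of the "normalized" representation suggested in the quote, assuming the set of labels irrelevant to a variable is known up front (struct and function names are illustrative, not the actual code):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// For variable v with N values, all M operators that do not mention v induce
// identical self-loops, so instead of materializing O(NM) transitions we can
// emit one shared group of N self-loops and remember which labels belong to it.
struct IrrelevantLoops {
    std::vector<int> irrelevant_labels;            // the M irrelevant operators
    std::vector<std::pair<int, int>> self_loops;   // N transitions (s, s)
};

IrrelevantLoops build_irrelevant_group(
        const std::vector<int> &irrelevant_labels, int domain_size) {
    IrrelevantLoops group;
    group.irrelevant_labels = irrelevant_labels;
    group.self_loops.reserve(domain_size);
    for (int value = 0; value < domain_size; ++value)
        group.self_loops.emplace_back(value, value);  // O(N), not O(NM)
    return group;
}
```

This produces directly the representation that label normalization would otherwise have to recover from the O(NM) intermediate one.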
History
Date User Action Args
2018-12-18 16:30:55  silvan  set     status: reviewing -> resolved; messages: + msg8391
2018-12-18 15:31:25  malte   set     messages: + msg8389
2018-12-17 11:02:02  silvan  set     messages: + msg8370
2018-12-14 09:28:51  silvan  set     messages: + msg8313
2018-12-14 08:37:30  silvan  set     messages: + msg8312
2018-12-13 22:40:32  silvan  set     messages: + msg8309
2018-12-12 18:31:48  malte   set     messages: + msg8269
2018-12-05 15:12:02  silvan  set     messages: + msg8151
2018-12-05 15:09:09  silvan  set     status: in-progress -> reviewing; messages: + msg8150; summary: Possible further improvements: do not store information for all labels in each transition system, but only for the relevant ones. This requires either a concept of local label numbers and a mapping between global and local labels, or using hash maps instead of vectors whenever storing information per label, leaving out irrelevant labels.
2018-11-29 13:21:31  malte   set     messages: + msg8133
2018-11-29 12:47:04  silvan  set     status: reviewing -> in-progress; messages: + msg8132
2018-10-20 02:29:35  malte   set     messages: + msg8017
2018-10-20 02:19:23  malte   set     messages: + msg8016
2018-10-20 02:12:46  malte   set     messages: + msg8015
2018-10-19 11:27:01  silvan  set     files: + issue851-base-issue851-v1.tar.gz; messages: + msg7998
2018-10-19 03:29:51  malte   set     messages: + msg7989
2018-10-04 10:28:51  silvan  set     status: in-progress -> reviewing; messages: + msg7877
2018-10-03 15:55:11  silvan  create