Issue213

Title: inefficient memory use with -m64 setting
Priority: meta
Status: chatting
Superseder:
Nosy List: andrew.coles, erez, florian, jendrik, malte, silvia
Assigned To: jendrik
Keywords:

Created on 2011-01-23.11:11:44 by erez, last changed by jendrik.

Summary
Subissues:

- Use operator IDs instead of generating operator pointers (issue692)
- Implement new hash function (issue693)
- Implement new hash table (issue694) (blocked by issue693)
- Store landmarks more efficiently (issue695)

After this is done, we should consider changing the build system (e.g., make
64-bit the default build, or remove the option altogether). This is related to
issue687.
Files
File name                         Uploaded                            Type
chunk_allocator.h                 andrew.coles, 2014-10-27.00:45:25   text/x-chdr
hash_sizes.py                     malte, 2016-12-12.23:21:13          text/x-python
state_registry_low_memory.tar.gz  andrew.coles, 2014-10-27.14:01:11   application/binary
Messages
msg6258 Author: jendrik Date: 2017-04-27.23:33:47
Converted to meta-issue and updated the summary.
msg6253 Author: malte Date: 2017-04-27.19:13:15
This issue is next in the review queue, but it really looks like a meta-issue
rather than a regular one. Can you

1) convert it into a meta-issue,
2) update the summary to include the current status and open TODOs, and
3) let me know which of the subissues should be reviewed next and update the
entry in the review queue accordingly?
msg5905 Author: florian Date: 2016-12-19.14:08:13
Added a reference to issue687.
msg5890 Author: jendrik Date: 2016-12-14.21:25:33
Of course we can test the new hash function on multiple optimal configurations. I 
don't expect any surprises if we're happy with the performance of blind search, 
but you never know :-)
msg5887 Author: malte Date: 2016-12-14.18:28:50
> Regarding the first item: bjolp uses uniform cost partitioning, so CPLEX is not
> involved. Probably you meant seq instead of bjolp. I suggest rerunning the
> experiment for seq once we've merged the new hash table implementation.

Yes, I meant seq. Regarding rerunning the experiments, does this mean that you
advocate against testing the new hash table implementation on our standard
optimal planning configurations? Since it's a rather performance-critical
change, I'd be happier to have it evaluated on more than just blind search
before merging. (But perhaps this discussion should be in issue694.)

> Regarding the second item: I think it would be good to defer experiments until
> we're happy with optimal configurations.

Fine with me. I don't think we need to test the satisficing configurations for
issue692, but I think we should for issue693 (some domains have massively larger
states, so hash function performance could be interesting) and for issue694 (for
similar reasons). But of course we can focus on optimal configurations for these
issues first.

> We could convert the lazy open list 
> operator pointers in issue688. Or do you prefer a separate issue for that?

Separate issues are usually easier to review and hence faster to be merged. This
is especially true when we expect a performance impact from some of the changes
and want to know where it comes from. So I think in this case a separate issue
would be better.
msg5886 Author: jendrik Date: 2016-12-14.18:20:58
Regarding the first item: bjolp uses uniform cost partitioning, so CPLEX is not 
involved. Probably you meant seq instead of bjolp. I suggest rerunning the
experiment for seq once we've merged the new hash table implementation. 

Regarding the second item: I think it would be good to defer experiments until 
we're happy with optimal configurations. We could convert the lazy open list 
operator pointers in issue688. Or do you prefer a separate issue for that?
msg5885 Author: malte Date: 2016-12-14.18:07:36
Great! For this issue, I can think of two more TODO items:

- Rerun the BJOLP comparison with identical OSI/CPLEX versions. (Ideally we
should use the most recent versions for both, but IIRC, this doesn't work
because the newest CPLEX version has 32-bit issues. But perhaps we can still use
our recommended OSI version for both and couple it with an older CPLEX version
that still supports both architectures. Florian can hopefully advise on this.)

- Run experiments for satisficing configurations. If we do this later when we're
happy with the changes affecting optimal planners, we'll have to run fewer
experiments overall. If we do it earlier, we might be quicker to notice other
parts of the code we need to change before moving to 64-bit compiles. Both
options are fine for me.

Right now I can only think of one obvious change we will want to make for the
satisficing search configurations: replace the generating operator pointer in
lazy open lists with an operator ID, similar to the change for SearchNodeInfo.
We might then also want to change the interface so that it works with these IDs
and we don't have to convert back and forth between operator pointers and IDs
when we actually use IDs on both ends.
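
For illustration, a minimal sketch of such an entry (names hypothetical; the
actual open list code will differ):

#include <vector>

// Hypothetical sketch: a lazy open list entry storing a 4-byte operator
// index instead of an 8-byte GlobalOperator pointer (on 64-bit builds).
struct LazyOpenListEntry {
    int parent_state_id;    // StateID, stored as a plain int here.
    int creating_operator;  // Index into the operator list; -1 if none.
};

// The operator would only be looked up where it is actually needed, e.g.:
//     const GlobalOperator &op = g_operators[entry.creating_operator];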
msg5882 Author: jendrik Date: 2016-12-14.15:47:41
Yes, the experiment only compares the old hash function and hash table to the new 
hash function and hash table. On the instance I tested locally, the old hash 
function basically produced ascending hash values, which is very bad for 
hopscotch hashing. 

I have split off the issues as you suggested:

A) issue692: operator IDs instead of generating operator pointers
B) issue693: new hash function
C) issue694: new hash table implementation
D) issue695: space improvement for LM-Count
msg5873 Author: malte Date: 2016-12-14.03:55:26
Those numbers certainly look better. :-) There still seems to be a bit of a
slowdown in most tasks, but I think that would be an acceptable penalty given
the memory improvements.

I still think we should split this into separate issues, though. For example, if
we change the hash function and hash table at the same time (which I think the
experiment does?), the changes become harder to evaluate.

> In light of these results, I suggest we polish and merge this hash table
> implementation, before we try to come up with fancier versions :-) What do
> you think?

Sounds fine to me. (Regarding alternative hash table implementations, I was
really hoping that we could use something simpler rather than something more
complex. But either way, this can wait.)
msg5872 Author: jendrik Date: 2016-12-14.01:33:41
I analyzed and tweaked the code a bit and was able to speed it up while preserving the same memory 
usage. The most important change was to not sort the old buckets during a resize. Previously, I 
was trying to be smart about this to achieve a better layout, but the extra effort didn't pay off. 
In fact, experiments showed that the order in which buckets are reinserted doesn't affect memory
usage or runtime.

The resulting code version is tagged revision v5. Here are the results comparing v5 to v1, the 
version using std::unordered_map:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-issue213-v1-vs-issue213-v5-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-issue213-v1-vs-issue213-v5-release64.html

-> Memory usage and runtime decrease for both 32- and 64-bit mode. The effect is more pronounced 
for memory. Coverage goes up by 7 and 18 tasks.

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-release32-vs-release64-issue213-v1.html

-> With std::unordered_set, m64 solves 11 fewer tasks than m32. As we know, this is due to m64 
using much more memory than m32 in v1.

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-release32-vs-release64-issue213-v5.html

-> m64 still uses more memory than m32, but this can't be due to the hash set, since it should use
the same amount of memory in both settings. The runtimes are about the same on average. Both build 
types solve the same number of tasks.

In light of these results, I suggest we polish and merge this hash table implementation, before we 
try to come up with fancier versions :-) What do you think?
msg5871 Author: malte Date: 2016-12-13.17:17:10
Works for me, but the grid will be crowded then, and in an ideal world it would
be great to have some of the aspects of this already merged before then.
msg5870 Author: jendrik Date: 2016-12-13.16:53:43
I think it's best to discuss in person during the sprint how we want to move 
forward with this.
msg5868 Author: malte Date: 2016-12-12.23:28:01
Yet another point: it looks like so far we have only experimented with optimal
planning configurations? Before merging, we also need data for satisficing planning.
msg5867 Author: malte Date: 2016-12-12.23:27:37
Taking the wider view: I am somewhat reluctant to rush merging the new hash
implementation because of its time overhead. But I don't think this should keep
us from merging other, less critical aspects of this that affect other parts of
the code.

Suggestion: we split off the different code changes (e.g. replacing generating
operator pointers by operator IDs, new hash function, new hash table
implementation, space improvement for lmcount heuristic) into their own issues,
and we use this issue for keeping track of the relative performance of m32 vs.
m64 and the eventual switch (hopefully) to m64 in the build configs etc.

I see the following largely independent changes:

A) operator IDs instead of generating operator pointers
B) new hash function
C) new hash table implementation
D) space improvement for LM-Count

If we split these off, hopefully we can at least merge A, B and D soon.
(Of course, D still needs to be done.)

What do you think?
msg5866 Author: malte Date: 2016-12-12.23:21:13
Thanks!

I don't think tweaking the parameter is the best solution at the moment. Unless
there is a performance issue in the implementation that we still have to
identify, perhaps this hash table implementation just is a bit slower than the
standard library one. It stands to reason that the standard library
implementation is optimized for something. Perhaps it is optimized for speed. ;-)

There are a few alternatives we can try. We wanted to avoid separate chaining
because of its space requirements, but we can also try bringing the space
requirements of separate chaining down for our particular use case by
compressing the information. Currently, the values -2 to -2^31 are unused, and
we could exploit this to use a single 4-byte cell to store either a state ID or
(the equivalent of) a pointer to a chain of collided values.
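
A minimal sketch of this tagged 4-byte cell (hypothetical names):

#include <cassert>
#include <cstdint>

// Non-negative values are state IDs stored directly in the bucket; -1 marks
// an empty bucket; values <= -2 encode an index into a separate arena that
// holds the chains of collided entries.
struct Cell {
    int32_t value = -1;

    bool is_empty() const {return value == -1;}
    bool holds_state_id() const {return value >= 0;}
    bool holds_chain_index() const {return value <= -2;}

    void set_state_id(int32_t id) {assert(id >= 0); value = id;}
    void set_chain_index(int32_t index) {value = -2 - index;}  // 0 -> -2, 1 -> -3, ...
    int32_t chain_index() const {assert(holds_chain_index()); return -2 - value;}
};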

Assuming a load factor of 1.0, which is not too unreasonable when using separate
chaining, a "basic" but space-efficient implementation of separate chaining
takes 12 bytes per entry if there is no dynamic allocation overhead (which I
think is reasonable for a hash table that does not support deletion): 4 bytes
per bucket plus 8 bytes per linked list entry that holds the data.

I think that with some compression tricks, we can bring this down to less than 7
bytes per entry, i.e., the same amount of memory as with closed hashing at a
load factor of 0.57. (Closed hashing of course needs load factors to be lower
than what we can use with open hashing.)

This assumes that we don't need to store the hash keys -- if we do need to store
them, the comparison becomes more favourable for closed hashing because the
relative overhead of the "pointers" becomes smaller. I'm attaching a Python
script that allows estimating the size requirements of different hash table
implementations.
msg5865 Author: jendrik Date: 2016-12-12.16:29:02
I profiled blind search on pegsol-08-strips:p24.pddl. In 32-bit mode revision v1 
uses 7.5s and v3 uses 9.6s.

v1:
3.13% of the time is spent in StateRegistry::get_successor_state().
  StateRegistry::insert_id_or_pop_state() seems to be inlined.
    3.13% of the time is spent inserting states into the closed list.

v3:
18.18% of the time is spent in StateRegistry::get_successor_state().
  15.16% of the time is spent in StateRegistry::insert_id_or_pop_state().
    10.30% of the time is spent in IntHashSet::reserve().
      3.68% of the time is spent sorting the temporary vector.

One parameter we could try changing is "max_distance". Currently each
key can be at most 32 buckets from its ideal bucket. Raising this
number would allow for higher load factors (and to a lesser extent
fewer resizes) at the cost of longer lookups.

For v3 in 64-bit mode changing this parameter for the pegsol task produced:

max_distance:     32      64     128
load factor:    0.43    0.86    0.86
peak memory:  100 MB   70 MB   70 MB  
resizes:          22      21      21
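
For illustration, a minimal sketch of the bounded-window lookup that
"max_distance" controls (this simplifies away the actual displacement logic):

#include <cstddef>
#include <vector>

const int EMPTY = -1;
const std::size_t MAX_DISTANCE = 32;

// Returns the position of `key` or of the first empty bucket within the
// window, or table.size() if the window is exhausted (caller must resize).
std::size_t find_slot(const std::vector<int> &table, int key, std::size_t ideal) {
    for (std::size_t distance = 0; distance < MAX_DISTANCE; ++distance) {
        std::size_t pos = (ideal + distance) % table.size();
        if (table[pos] == key || table[pos] == EMPTY)
            return pos;
    }
    return table.size();
}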
msg5864 Author: florian Date: 2016-12-12.16:19:14
We have seen huge performance differences between OSI 0.103.0 and OSI 0.107.8.
If you want to compare 32 and 64 bit results, both revisions should use the same
version.
msg5863 Author: jendrik Date: 2016-12-12.14:36:05
Regarding "v3: m32-vs-m64": Yes, exactly.

The experiments use the following library versions:

32-bit: OSI 0.103.0, CPLEX 12.5.1
64-bit: OSI 0.107.8, CPLEX 12.5.1

http://www.fast-downward.org/LPBuildInstructions mentions OSI 0.107.8 and CPLEX 12.6.3.
msg5862 Author: malte Date: 2016-12-12.13:14:08
So some good news and some bad news, then!

I assume the second "v1: m32-vs-m64" line should be "v3: m32-vs-m64"?

I understand the desire to make the new m64 version look good, but a base-m32
vs. v3-m64 comparison is not meaningful for the decision we have to make in this
issue. We want to make the m64 compile competitive with the m32 compile. It
doesn't really matter if the m64 compile of revision X is competitive with the
m32 compile of revision Y.

Regarding the coverage drops for bjolp and seq:

The drop for seq may be CPLEX's fault, so not sure if we can do much about it.
Still, this probably warrants a closer look. I'd be a bit surprised to see the
32-bit version of CPLEX do massively better than the 64-bit version. Are these
otherwise identical versions of CPLEX and OSI? Are these the versions we recommend?

The drop for bjolp surprised me a bit, so I had a look at how it stores its
landmark data, and it uses a PerStateInformation<vector<bool>>, which is very wasteful
because we pay the vector overhead for every single state. For our state
representation, we moved from SegmentedVector to SegmentedArrayVector for this
reason, and it made a big difference. We should probably do a similar thing
here, i.e., have something like PerStateInformation optimized for same-size
arrays. (And because this is a vector<bool>, we additionally need to pack
individual bits.)
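
A rough sketch of that idea (hypothetical; the real implementation would sit
on top of the segmented-array machinery and the state registry):

#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bit arrays stored contiguously: one block of
// ceil(bits_per_state / 32) words per state instead of one heap-allocated
// vector<bool> object per state.
class PerStateBitset {
    std::size_t words_per_state;
    std::vector<uint32_t> data;
public:
    explicit PerStateBitset(std::size_t bits_per_state)
        : words_per_state((bits_per_state + 31) / 32) {}

    // Append a zero-initialized entry for a newly registered state.
    void add_state() {data.resize(data.size() + words_per_state, 0);}

    bool test(std::size_t state_id, std::size_t bit) const {
        return (data[state_id * words_per_state + bit / 32] >> (bit % 32)) & 1;
    }
    void set(std::size_t state_id, std::size_t bit) {
        data[state_id * words_per_state + bit / 32] |= uint32_t(1) << (bit % 32);
    }
};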
msg5861 Author: jendrik Date: 2016-12-12.11:56:48
Sure, here we go:

m32:

base-vs-v1: pointers and ints have the same size, so memory and runtime
don't change much. Coverage drops by 1 in total.

v1-vs-v3: memory usage is significantly reduced. Runtime results are
mixed. There's a slight negative trend for bjolp, cegar and ipdb and a
stronger negative trend for blind. Overall, all configurations benefit
from the change: coverage stays the same or increases in all domains.
Together, the configurations solve 61 more tasks.

m64:

base-vs-v1: both memory and total_time scores go up leading to 15
additional solved tasks.

v1-vs-v3: The picture is very similar to the 32-bit v1-vs-v3
comparison. Memory usage goes down, runtime goes up. The change in
memory is much more significant than the one for runtime though. In
total, coverage increases by 122 tasks.


base: m32-vs-m64 - Coverage drops for all configurations. Total coverage diff = -142
v1:   m32-vs-m64 - Coverage drops for all configurations. Total coverage diff = -136
v3:   m32-vs-m64 - Coverage drops by 20 and 32 for bjolp and seq, respectively. It
                   drops by at most five for divpot, ipdb, lmcut and mas. It remains
                   the same for blind and cegar. Total coverage diff = -65


base-m32 vs. v3-m64: Memory usage increases significantly for most
configurations. Results for runtime are mixed. Coverage decreases by 12
and 33 for bjolp and seq, respectively. It increases for all other
configurations (+40 in total). Total coverage diff = -5.
msg5860 Author: malte Date: 2016-12-11.23:42:54
That's a lot of data to go through. Can you summarize the results?
msg5859 Author: jendrik Date: 2016-12-11.23:15:29
Thanks! Since I didn't expect you to look into this so soon, I went ahead and ran 
an experiment for more configurations:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-v3-opt.tar.gz

The reports make the following comparisons: 

- compare revisions base, v1 and v3 for a fixed build type.
- compare the two build types (m32 and m64) for a fixed revision.
- compare base-m32 to v3-m64.
msg5858 Author: malte Date: 2016-12-10.12:03:43
I added some comments on bitbucket.
msg5856 Author: malte Date: 2016-12-09.11:17:37
Sounds good, but it may take a while. Lots on my plate at the moment.
msg5855 Author: jendrik Date: 2016-12-09.09:51:58
I went over the IntHashMap code again and made some adjustments that speed 
things up while using the same amount of memory. For example, each key is now 
only hashed once. I'm sure there are further speed optimizations that I haven't 
discovered.

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-v2-vs-v3-blind.html

Before I test other configurations, I think it would be good if you could have 
a look at the code, Malte.
msg5854 Author: jendrik Date: 2016-12-08.17:56:16
Regarding 1) in msg5851: I completely agree, this experiment was just meant to be a quick way of assessing where 
we stand. I'll run more extensive experiments before merging.

Regarding 2) in msg5851: I have repeated the experiment for 32-bit builds:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-v2-blind-m32-issue213-v1-issue213-v2-compare.html

Here is the complete coverage table for blind search:

version   coverage
------------------
base-m32:      617
  v1-m32:      617
  v2-m32:      624
base-m64:      602
  v1-m64:      606
  v2-m64:      624
msg5853 Author: malte Date: 2016-12-08.17:14:10
Regarding Andrew's suggestion, this is indeed another optimization we discussed
and should test at some point. It's already part of the longer-term plan (see
the SearchNodeInfo row and "ideal" column in msg5843).

I generally like doing this one optimization at a time because I think that
makes it easier to evaluate the impact of the various changes on memory,
coverage and runtime, but we should certainly not forget about this optimization
opportunity.

Let me point out that for certain search algorithms (greedy BFS without
reopening), we can also get rid of the g values altogether.
msg5852 Author: malte Date: 2016-12-08.17:10:25
Hi Andrew, you're not the first (or second, or third ;-)) one to be tripped up
by the new "optional summary" field. Pasting your comment as a change note:

=========

Jendrik - re the changes for v1 - is there a need to store the parent operator
at all?  A while back I stripped this out of a fork of FD, and when a goal state
was reached, derived what the operators must have been by looking at the
differences between states on the path to the goal.  This has a small
postprocessing cost, but saves memory.
msg5851 Author: malte Date: 2016-12-08.15:02:34
So far, so good! It would be nice to reduce the runtime penalty, but if it turns
out that we cannot, I don't think it would be a deal-breaker. Hopefully the
overhead will be much less noticeable with non-blind search settings.

I have two issues with the experiment:

1) If I understand correctly, the "base" numbers come from a separate
experiment. For the same reason that we randomize run IDs in our experiments, we
can get systematic errors if we combine data from two separate experiments. This
is less of an issue here than usual because these errors only affect runtime,
not memory usage. But I think we should still run a "clean" (all-in-one)
experiment before merging.

2) For a complete picture, we should have 32-bit and 64-bit data for all
configurations. That is, we should have data for v1-m32 and v2-m64, too.
msg5850 Author: jendrik Date: 2016-12-08.14:53:00
Two changes for reducing memory consumption have been made in the pull request so far: 

issue213-v1 uses ints instead of GlobalOperator pointers for storing the parent operator.
issue213-v2 (in addition to the changes in v1) uses IntHashSet instead of std::unordered_set for storing the closed list.

I have run an experiment comparing v1 and v2 for blind search in a 64-bit build:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-v2-blind-m64-issue213-v1-issue213-v2-compare.html

Memory consumption is significantly reduced, but runtimes go up.

v2 recovers the loss in total coverage for blind search incurred by switching to 64-bit builds:  

version   coverage
------------------
base-m32:      617
base-m64:      602
  v1-m64:      606 
  v2-m64:      624
msg5849 Author: jendrik Date: 2016-12-05.11:13:42
Concerning the CPLEX error: we decided to postpone fixing this until we find a 
reliable way to test the fix and opened issue691 for this.
msg5847 Author: florian Date: 2016-12-05.10:51:20
> We could exit with EXIT_OUT_OF_MEMORY if the error message matches "CPXlpopt
> returned error 1001" (1001 stands for out of memory in the CPLEX lib). Not
> sure if that's the best solution though.

Sounds good. This is also the way we handle other error messages that come
through unexpected channels (e.g., OSI reports some CPLEX errors as warnings).
msg5846 Author: jendrik Date: 2016-12-05.01:01:48
Minor remark: the task tested in msg5843 is actually elevators-opt11-strips:p04.pddl. In 
revision issue213-base it uses 102 MB in a 32-bit build and 152 MB in a 64-bit build.

I have implemented a new hash table class using open addressing instead of chaining. It 
decreases peak memory usage to 80 MB in 32-bit builds and to 82 MB in 64-bit builds. 

A critical ingredient was a new hash function. The old hash function 
(utils/hash.h:hash_sequence()) produced too many collisions, resulting in a final load 
factor of 0.05 and using ~400 MB (peak memory). Switching to Jenkins' one-at-a-time hash 
function (https://en.wikipedia.org/wiki/Jenkins_hash_function) increased the final load 
factor to 0.79.
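
For reference, a sketch of the one-at-a-time scheme, applied here per 32-bit
value of the packed state rather than per byte as in Jenkins' original
formulation (the code in the pull request may differ in details):

#include <cstdint>
#include <vector>

uint32_t one_at_a_time_hash(const std::vector<uint32_t> &data) {
    uint32_t hash = 0;
    for (uint32_t value : data) {
        hash += value;
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}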

Since coming up with his one-at-a-time hash function, Jenkins has released multiple 
improved hash functions into the public domain. I will try hashword() from
http://www.burtleburtle.net/bob/c/lookup3.c tomorrow. This seems to be one of the best 
non-cryptographic hash functions for variable-length integer sequences. He has also 
released SpookyHash, but it only works for 64-bit builds.

It would be great if someone could have a look at the implementation of IntHashMap. I 
opened a pull request at https://bitbucket.org/jendrikseipp/downward/pull-requests/63 .
msg5844 Author: malte Date: 2016-11-30.19:16:01
One small addition: the total numbers in the last row of Jendrik's table assume
that the number of open list entries and the number of generated states are of
the same order of magnitude. For tasks with many duplicates, it is possible that
there are arbitrarily more open list entries, so that the open lists can become
the dominating factor for memory usage.

If we pursue the goal of minimizing memory usage, it would make sense to
introduce measurements to the planner that compare the maximum number of open
list entries with the number of generated states. If it turns out that open
lists are sometimes the bottleneck, we can rethink our current (quite lazy)
duplicate elimination scheme.
msg5843 Author: jendrik Date: 2016-11-30.19:10:59
Malte and I analyzed the memory profiles of A* + blind search for pegsol-opt11-strips:p03.pddl today and noted 
where we use how much memory in a 32-bit, 64-bit and ideal version. Here are the numbers:


A* + blind search                                  32-bit          64-bit           ideal

---------------------------------------------------------------------------------------------------------------

PerStateInformation<SearchNodeInfo>              16 Bytes        24 Bytes         8 Bytes (parent state + g)
(parent state, creating operator, g, real g)

PerStateInformation<HEntry>                       4 Bytes         4 Bytes         0 Bytes (for blind search)
(h-cache)

State registry - state data                     4*B Bytes       4*B Bytes       4*B Bytes (B = #buckets)

State registry - hash table                     ~17 Bytes       ~34 Bytes        ~5 Bytes (4 Bytes + overhead)

Open list                                         4 Bytes         4 Bytes         4 Bytes (per open-list entry)

---------------------------------------------------------------------------------------------------------------

Total                                           ~45 Bytes       ~70 Bytes       ~21 Bytes (in example with B=1)


We have already started to reduce the size of SearchNodeInfo by storing operator IDs (ints) instead of 
GlobalOperator pointers. Next, we will be looking into more memory-efficient hash table implementations, probably 
using open addressing instead of separate chaining like std::unordered_set.
msg5842 Author: jendrik Date: 2016-11-30.00:28:34
I agree. After looking at the code again, my fix makes no sense. The error
message apparently comes from COIN, not CPLEX. Our code detects this, calls 
handle_coin_error() which prints "Coin threw exception: ...". We could exit 
with EXIT_OUT_OF_MEMORY if the error message matches "CPXlpopt returned error 
1001" (1001 stands for out of memory in the CPLEX lib). Not sure if that's the 
best solution though.
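
A sketch of what that special case could look like (the handler signature is
simplified and the exit code values are placeholders; the real code would use
the planner's exit_with(EXIT_OUT_OF_MEMORY)):

#include <cstdlib>
#include <iostream>
#include <string>

const int EXIT_OUT_OF_MEMORY = 6;   // Placeholder value.
const int EXIT_CRITICAL_ERROR = 1;  // Placeholder value.

void handle_coin_error_message(const std::string &message) {
    // CPLEX error 1001 means "out of memory"; report it as such.
    if (message.find("CPXlpopt returned error 1001") != std::string::npos) {
        std::cerr << "CPLEX ran out of memory." << std::endl;
        std::exit(EXIT_OUT_OF_MEMORY);
    }
    std::cerr << "Coin threw exception: " << message << std::endl;
    std::exit(EXIT_CRITICAL_ERROR);
}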
msg5841 Author: florian Date: 2016-11-29.21:34:00
Hard to say. The other error messages look more like the warning you saw on
stdout. So, maybe this is communicated over a different channel (wouldn't
surprise me). Is the "Coin threw exception:" stuff from our code? It would be
better if we were sure that the error actually passes through the error handler
that you modified. If you cannot reproduce the error, I would rather leave the
handler as it is than change it and hope that the change is correct.
msg5840 Author: jendrik Date: 2016-11-29.21:18:41
I failed to reproduce the error on both my machine and maia. So I'm not sure how to test my fix. I pushed it to my 
bitbucket repo:

https://bitbucket.org/jendrikseipp/downward/commits/d19cd04fa4bb1547524573669ac83170c050cd5b?at=default

Florian, do you think this fix is correct?
msg5838 Author: florian Date: 2016-11-29.14:45:21
The message "Compressing row and column files." is unrelated (that is just
CPLEX's last effort to conserve memory). We would have to catch error 1001 in
our custom error handler before it throws the exception.
msg5837 Author: jendrik Date: 2016-11-29.14:38:37
Concerning the critical error: 

The output on stdout is

CPX0000  Compressing row and column files.

The output on stderr is

Coin threw exception: CPXlpopt returned error 1001
 from method resolve
 from class OsiCpxSolverInterface
Unexplained error occurred.

(The last line is from Fast Downward.)

Interestingly, the file "OsiDefaultName.xxx" is written to disk, containing a 
textual representation of an LP.
msg5835 Author: malte Date: 2016-11-29.14:14:50
It's a pity that the data isn't better. It's hard to justify moving to 64 bits
with such a large increase in memory usage (and loss in coverage). I think with
these numbers, we should do something about memory usage.

I suggest we focus on blind search first. There may be other memory-wasters
hidden inside certain heuristics etc., but whatever is making blind search use
much more memory will affect all configurations that expand/evaluate/generate
many states. 

Of course we should measure things before proceeding, but my best guess at the
moment is that the hash table (unordered_set) inside StateRegistry is to blame
since I cannot think of any other major data structures for which it is
plausible that they require much more memory in a 64-bit compile.

I would suggest that we
1) do a memory profile on some representative blind search test cases, in 32-bit
mode and 64-bit mode
2) look more closely at the main memory users and their underlying
implementation to see how they differ in 32-bit and 64-bit mode
3) think of more memory-efficient implementations

Perhaps the answer is simply to use another ready-made hash table
implementation, but I think it may be worthwhile to understand this more deeply
before we design a solution.
msg5834 Author: florian Date: 2016-11-29.14:03:22
Looks like a huge satellite task running out of memory in an LP configuration. I
suspect CPLEX throws an error that we don't recognize as an out-of-memory
error. This happened before, and we already have special casing for some of
these errors. Jendrik, if you can figure out the exact error message from the logs,
we can add a special case for it.
msg5831 Author: malte Date: 2016-11-29.13:57:52
Should we look into the critical error in that experiment?
msg5828 Author: jendrik Date: 2016-11-29.11:54:22
Due to newer CPLEX versions removing support for 32-bit builds, the discussion 
about 64-bit builds came up again. I repeated Florian's experiment on the bigger 
benchmark set and with some new optimal configurations. The general picture 
remains the same though. Coverage decreases in many domains for all configs.

http://ai.cs.unibas.ch/_tmp_files/seipp/issue213-opt-comparison.html

The question is whether and how we want to act on these results. This may be 
something we should discuss in a Fast Downward meeting.
msg4128 Author: jendrik Date: 2015-03-31.13:38:22
issue436 (removing GNU hash sets and maps) has been addressed now, so work on 
this issue could continue.
msg3882 Author: andrew.coles Date: 2014-10-27.14:01:11
(Background:

The standard hash_set implementation is a hashtable.  The hashtable is stored as
a std::vector, where each index contains a singly-linked list: an item of data,
and a 'next' pointer to the next item in the list.)

It's also possible to represent the lists in each bucket of a hash set by moving
the next pointer into the state.  Normally this wouldn't be done -- we can only
have one hash set implemented in this way -- but the state registry is special.
 In implementation terms, this can be done by making the PackedStateBin one int
larger, and instead of storing the 'next' entry as a pointer, store it as the
StateID of the next state.  The hash buckets themselves are then StateIDs: -1 if
empty, or the StateID of the first state on the list of states in that bucket.

Storing 'next' as part of the PackedStateBin gives a 25% reduction in memory
usage on 32-bit builds - see attached state_registry_low_memory.tar.gz.  The
downside is a drop in performance: blind search is a factor of 2 to 3 slower. 
(This surprised me, and might be improvable with some profiling.)  The % time
overhead would of course be lower with something other than blind search.
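
For illustration, a self-contained sketch of the bucket/next-link bookkeeping
(in the real code the next link lives inside the enlarged PackedStateBin
rather than in a separate vector):

#include <cstddef>
#include <vector>

class IntrusiveStateSet {
    std::vector<int> buckets;         // First StateID in each bucket, or -1.
    std::vector<int> next_in_bucket;  // Per-state next link (the extra int).
public:
    explicit IntrusiveStateSet(std::size_t num_buckets)
        : buckets(num_buckets, -1) {}

    // Insert the state with the next fresh ID (== next_in_bucket.size()).
    // Returns the ID of an equal existing state, or the fresh ID if new.
    template<typename EqualsNewState>
    int insert(std::size_t hash, EqualsNewState equals_new_state) {
        std::size_t b = hash % buckets.size();
        for (int id = buckets[b]; id != -1; id = next_in_bucket[id])
            if (equals_new_state(id))
                return id;
        int new_id = static_cast<int>(next_in_bucket.size());
        next_in_bucket.push_back(buckets[b]);
        buckets[b] = new_id;
        return new_id;
    }
};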
msg3881 Author: florian Date: 2014-10-27.11:20:12
Jendrik already added them (hash sets etc. are handled in issue436).
msg3880 Author: malte Date: 2014-10-27.11:18:00
Worth adding cross-references between the issues?
msg3879 Author: florian Date: 2014-10-27.11:17:39
Ahh, interesting. This looks very similar to the way we store state data in the
state registry. Maybe we can reuse some code here. But I think we should wait
until we have changed the hash sets to the C++11 variants.
msg3878 Author: andrew.coles Date: 2014-10-27.00:45:25
I've been looking into memory management recently*, and various hash set/hash
table options, and recalled there was a thread here on that topic.

You might want to try the attached replacement for the standard C++ allocator,
for use where objects will be persistent - e.g. nodes inside data structures. 
In state_registry.h switch the definition of StateIDSet to:

    typedef __gnu_cxx::hash_set<StateID,
                                StateIDSemanticHash,
                                StateIDSemanticEqual,
                                ChunkAllocator<StateID>  > StateIDSet;

A quick unscientific experiment doing this on a 32-bit build shows a 17% or so
reduction in memory usage.


* In IPC2011, POPF used 6GB of memory in 25s, so this is long overdue
msg3730 Author: jendrik Date: 2014-10-08.21:30:42
I added a note in issue436.
msg3727 Author: florian Date: 2014-10-08.20:37:02
I ran some memory profiles a while back and if I remember correctly, the hash
set entries contributed a large part to the total memory usage, so this seems
plausible. I vote to leave this issue open, add a reference to it on the issue
for hash sets (does it exist yet?) and do a memory profile once we finished
working on the hash sets.
msg3714 Author: malte Date: 2014-10-08.12:10:23
Interesting and mildly surprising. The only reasonable explanation I can come up with
for something like astar(blind()) is that this is due to the central hash_set in
the state registry. We might want to leave this one open and look at it again
when we revisit the hash tables, which is something we had on the screen anyway.
Or we might just close it as something that is the way it is. Preferences?
msg3702 Author: florian Date: 2014-10-08.09:04:47
Seems like this is still an issue with the new state representation:

http://ai.cs.unibas.ch/_tmp_files/pommeren/issue213-base-issue213-base64-compare.html
msg3669 Author: malte Date: 2014-10-06.12:20:19
I suggest blind(), BJOLP, lmcut(), ipdb() and our two common merge-and-shrink
configurations, on the optimal (incl. ipc11) benchmark suite.
msg3668 Author: florian Date: 2014-10-06.12:16:55
> Florian, how much work would it be to investigate the time and memory of -m32
> vs. -m64 for some standard configurations?

Not too much. I could create two experimental branches, modify the Makefile in
them and set up an experiment for this. Which configs/tasks do you want to test?
msg3636 Author: malte Date: 2014-10-04.19:42:03
I'm adding Florian to this issue as our "master state memory manager". I wonder
if this is still an issue with our new way of storing state information. Perhaps
we can get rid of "-m32"?

Florian, how much work would it be to investigate the time and memory of -m32
vs. -m64 for some standard configurations?
msg1608 Author: malte Date: 2011-08-13.20:17:45
I made some more tests, which show that the behaviour already shows up with
blind search. Here's the relevant data; everything else (no. expanded states,
no. hash buckets in the closed list etc.) was reported identically with -m32 and
-m64.

For reference, this is from my new desktop machine (Xeon E31270 running 64-bit
Ubuntu 10.10 with gcc 4.4.4-14ubuntu5).

./downward-1 --search 'astar(blind())' < output:

woodworking-opt #12:

-m32:
Total time: 20.28s
Peak memory: 345372 KB

-m64:
Total time: 18.25s
Peak memory: 501048 KB

woodworking-opt #23:

-m32:
Total time: 11.67s
Peak memory: 214460 KB

-m64:
Total time: 10.69s
Peak memory: 309812 KB

blocks #9-0:

-m32:
Total time: 24.07s
Peak memory: 497516 KB

-m64:
Total time: 22.61s
Peak memory: 866388 KB
msg1607 Author: malte Date: 2011-08-13.19:12:29
For future reference, here's the relevant excerpt of the original email that
discussed -m32 vs. -m64 memory usage:

===========================================================================
as a sanity test for our packaging, I did some experiments with the
final seq-opt-bjolp package this night, and noticed that it lost 17
(IIRC) problems compared to my previous tests, which is rather a lot.

Looking into this a bit more deeply, this seems to be due to -m64 using
a lot more memory than -m32. My tests were run with 2 GB (so this would
not be such an issue with the 6 GB in the competition), but still the
memory usage difference is rather large, and maybe we should do
something about this in the long run. I don't know if this is typical
behaviour or specific to LM configurations.

Example for the four toughest solved tasks in woodworking:

 #04: uses  737 MB with -m32, 1192 MB with -m64
 #14: uses 1277 MB with -m32, runs out of memory with -m64
 #15: uses 1583 MB with -m32, runs out of memory with -m64
 #24: uses  322 MB with -m32,  514 MB with -m64

I think there's nothing we could or should do about this for the
competition, but maybe it's worth opening an issue on this for later.
===========================================================================
msg1220 Author: malte Date: 2011-01-23.11:29:53
It's true that the LM graph storage is not particularly efficient, but that
cannot explain memory usage in the gigabytes. In fact, I'd be surprised if the
landmark graph contributed even a single megabyte here. We should really
optimize where it matters first, and that means reducing *per-state* memory
cost, not per-problem memory cost. That means we can safely ignore 1) and 2),
although 3) is indeed an issue.

Also, going from vector<int> to vector<state_t> etc. on vectors that only
usually have one or two elements as in point 1) saves next to nothing (or even
exactly nothing due to memory alignment issues); the vector and dynamic-memory
management overhead dwarfs everything else here. Feel free to try the
optimizations in 1) and 2) out, since seeing is believing :-).

Note that so far we have no evidence that this memory explosion only happens for
the landmark configurations; they might happen everywhere. Landmark
configurations were the only ones I had the chance to look at so far.
msg1219 Author: erez Date: 2011-01-23.11:11:44
The landmarks code is not very efficient with memory, and this is compounded in 
64-bit mode. Examples with BJOLP configuration that Malte ran:

Woodworking #04: uses  737 MB with -m32, 1192 MB with -m64
Woodworking #14: uses 1277 MB with -m32, runs out of memory with -m64
Woodworking #15: uses 1583 MB with -m32, runs out of memory with -m64
Woodworking #24: uses  322 MB with -m32,  514 MB with -m64

Here are some places where the landmarks code uses lots of memory:

1. LandmarkNode has:

   vector<int> vars
   vector<int> vals
   set<int> first_achievers
   set<int> possible_achievers
   hash_set<pair<int, int>, hash_int_pair> forward_orders

Should be - 

   vector<var_num_t>  vars
   vector<state_var_t> vals
   set<operator_num_t> first_achievers
   set<operator_num_t> possible_achievers
   hash_set<pair<var_num_t, state_var_t>, hash_int_pair> forward_orders

If var_num_t, operator_num_t and state_var_t are all 16 bits, that would reduce
the memory footprint of the landmarks by almost a half (there are some other
members).

2. LandmarksGraph has:
    vector<int> empty_pre_operators
    vector<vector<vector<int> > > operators_eff_lookup
    vector<vector<vector<int> > > operators_pre_lookup
    vector<vector<set<pair<int, int> > > > inconsistent_facts
    hash_map<pair<int, int>, LandmarkNode *, hash_int_pair> simple_lms_to_nodes
    hash_map<pair<int, int>, LandmarkNode *, hash_int_pair> disj_lms_to_nodes
    hash_map<pair<int, int>, Pddl_proposition, hash_int_pair> pddl_propositions

should be - 

    vector<operator_num_t> empty_pre_operators
    vector<vector<vector<operator_num_t> > > operators_eff_lookup
    vector<vector<vector<operator_num_t> > > operators_pre_lookup
    vector<vector<set<pair<var_num_t, state_var_t> > > > inconsistent_facts
    hash_map<pair<var_num_t, state_var_t>, LandmarkNode *, hash_int_pair> simple_lms_to_nodes
    hash_map<pair<var_num_t, state_var_t>, LandmarkNode *, hash_int_pair> disj_lms_to_nodes
    hash_map<pair<var_num_t, state_var_t>, Pddl_proposition, hash_int_pair> pddl_propositions

3. LandmarkStatusManager uses StateProxy. If we add an ID for each state, we can 
replace this with a number, which might use less memory.
History
Date                 User          Action  Args
2017-04-27 23:33:47  jendrik       set     priority: feature -> meta
    messages: + msg6258
    summary: After this is done, we should consider changing the build system (e.g., make 64-bit the default build, or remove the option altogether). This is related to issue687. -> Subissues: - Use operator IDs instead of generating operator pointers (issue692) - Implement new hash function (issue693) - Implement new hash table (issue694) (blocked by issue693) - Store landmarks more efficiently (issue695) After this is done, we should consider changing the build system (e.g., make 64-bit the default build, or remove the option altogether). This is related to issue687.
2017-04-27 19:13:15  malte         set     messages: + msg6253
2016-12-19 14:08:13  florian       set     messages: + msg5905
    summary: After this is done, we should consider changing the build system (e.g., make 64-bit the default build, or remove the option altogether). This is related to issue687.
2016-12-14 21:25:33  jendrik       set     messages: + msg5890
2016-12-14 18:28:50  malte         set     messages: + msg5887
2016-12-14 18:20:58  jendrik       set     status: reviewing -> chatting
    messages: + msg5886
2016-12-14 18:07:36  malte         set     messages: + msg5885
2016-12-14 15:47:41  jendrik       set     messages: + msg5882
2016-12-14 03:55:26  malte         set     messages: + msg5873
2016-12-14 01:33:41  jendrik       set     messages: + msg5872
2016-12-13 17:17:10  malte         set     messages: + msg5871
2016-12-13 16:53:43  jendrik       set     messages: + msg5870
2016-12-12 23:28:01  malte         set     messages: + msg5868
2016-12-12 23:27:37  malte         set     messages: + msg5867
2016-12-12 23:21:13  malte         set     files: + hash_sizes.py
    messages: + msg5866
2016-12-12 16:29:02  jendrik       set     messages: + msg5865
2016-12-12 16:19:14  florian       set     messages: + msg5864
2016-12-12 14:36:05  jendrik       set     messages: + msg5863
2016-12-12 13:14:08  malte         set     messages: + msg5862
2016-12-12 11:56:48  jendrik       set     messages: + msg5861
2016-12-11 23:42:54  malte         set     messages: + msg5860
2016-12-11 23:15:29  jendrik       set     messages: + msg5859
2016-12-10 12:03:43  malte         set     messages: + msg5858
2016-12-09 11:17:37  malte         set     messages: + msg5856
2016-12-09 09:51:58  jendrik       set     messages: + msg5855
2016-12-08 17:56:16  jendrik       set     messages: + msg5854
2016-12-08 17:14:10  malte         set     messages: + msg5853
2016-12-08 17:10:25  malte         set     messages: + msg5852
    summary: Jendrik - re the changes for v1 - is there a need to store the parent operator at all? A while back I stripped this out of a fork of FD, and when a goal state was reached, derived what the operators must have been by looking at the differences between states on the path to the goal. This has a small postprocessing cost, but saves memory. -> (no value)
2016-12-08 17:06:09  andrew.coles  set     summary: Jendrik - re the changes for v1 - is there a need to store the parent operator at all? A while back I stripped this out of a fork of FD, and when a goal state was reached, derived what the operators must have been by looking at the differences between states on the path to the goal. This has a small postprocessing cost, but saves memory.
2016-12-08 15:02:34  malte         set     messages: + msg5851
2016-12-08 14:53:00  jendrik       set     messages: + msg5850
2016-12-05 11:13:42  jendrik       set     messages: + msg5849
2016-12-05 10:51:20  florian       set     messages: + msg5847
2016-12-05 01:01:48  jendrik       set     status: chatting -> reviewing
    messages: + msg5846
2016-11-30 19:16:01  malte         set     messages: + msg5844
2016-11-30 19:10:59  jendrik       set     assignedto: jendrik
    messages: + msg5843
2016-11-30 00:28:34  jendrik       set     messages: + msg5842
2016-11-29 21:34:00  florian       set     messages: + msg5841
2016-11-29 21:18:41  jendrik       set     messages: + msg5840
2016-11-29 14:45:21  florian       set     messages: + msg5838
2016-11-29 14:38:37  jendrik       set     messages: + msg5837
2016-11-29 14:14:50  malte         set     messages: + msg5835
2016-11-29 14:03:22  florian       set     messages: + msg5834
2016-11-29 13:57:52  malte         set     messages: + msg5831
2016-11-29 11:54:22  jendrik       set     messages: + msg5828
2015-03-31 13:38:22  jendrik       set     messages: + msg4128
2014-10-27 14:01:11  andrew.coles  set     files: + state_registry_low_memory.tar.gz
    messages: + msg3882
2014-10-27 11:20:12  florian       set     messages: + msg3881
2014-10-27 11:18:00  malte         set     messages: + msg3880
2014-10-27 11:17:39  florian       set     messages: + msg3879
2014-10-27 00:45:25  andrew.coles  set     files: + chunk_allocator.h
    nosy: + andrew.coles
    messages: + msg3878
2014-10-08 21:30:42  jendrik       set     nosy: + jendrik
    messages: + msg3730
2014-10-08 20:37:02  florian       set     messages: + msg3727
2014-10-08 12:10:23  malte         set     messages: + msg3714
2014-10-08 09:04:47  florian       set     messages: + msg3702
2014-10-06 12:20:19  malte         set     messages: + msg3669
2014-10-06 12:16:55  florian       set     messages: + msg3668
2014-10-04 19:42:03  malte         set     nosy: + florian
    messages: + msg3636
2011-08-13 20:17:45  malte         set     messages: + msg1608
2011-08-13 19:12:29  malte         set     messages: + msg1607
2011-01-23 11:30:08  malte         set     title: Inefficient Memory Use of Landmarks -> inefficient memory use with -m64 setting
2011-01-23 11:29:53  malte         set     messages: + msg1220
2011-01-23 11:11:44  erez          create