Issue 731

Title: remove legacy hash functions
Priority: feature
Status: resolved
Superseder:
Nosy List: jendrik, malte
Assigned To: jendrik
Keywords:
Optional summary:

Created on 2017-07-06.00:35:45 by jendrik, last changed by malte.

Files
File name      Uploaded                      Type
test_hash.cc   malte, 2017-09-01.14:52:31    text/x-c++src
Messages
msg7824 Author: malte Date: 2018-10-01.13:08:47
Looks good, thanks! :-)
msg7823 Author: jendrik Date: 2018-10-01.12:57:10
Done. Thanks for catching this!
msg7822 Author: malte Date: 2018-10-01.12:47:45
Hi Jendrik, that change is not the right thing on 64-bit Linux and breaks the
hashing of very large vectors. ("Break" is a strong word, but it no longer
satisfies what we describe regarding prefix codes etc.)

It would be more future-proof to cast to uint64_t instead of unsigned int. Can
we do that? I would also add a TODO observing that using uint64_t is wasteful on
32-bit platforms and point to msg7812 for an explanation of why we do it like this.
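
For illustration, the suggested change might look roughly like this, assuming a
feed-style hash function for vectors as in the new utils/hash.h (HashState and
the exact feed() signatures here are assumptions, not the actual code):

#include <cstdint>
#include <vector>

class HashState;  // assumed: the hash accumulator type from utils/hash.h
void feed(HashState &hash_state, std::uint64_t value);  // assumed overload

template<typename T>
void feed(HashState &hash_state, const std::vector<T> &vec) {
    // Feed the size first so that the encoding stays a prefix code. Casting
    // to uint64_t (rather than unsigned int) keeps this correct for vectors
    // with more than 2^32 elements on 64-bit Linux.
    // TODO: uint64_t is wasteful on 32-bit platforms; see msg7812 for why
    // we do it like this anyway.
    feed(hash_state, static_cast<std::uint64_t>(vec.size()));
    for (const T &item : vec)
        feed(hash_state, item);  // assumes a feed() overload exists for T
}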
msg7818 Author: jendrik Date: 2018-09-30.22:37:58
I had to make one more change (http://hg.fast-downward.org/rev/e7a0a2510be7), but 
now the MacOS buildbot is happy again. We should look into issue847 soon to 
remove the need for leaking the private values of TaskID and OperatorID.
msg7817 Author: jendrik Date: 2018-09-30.20:57:17
I like your "feed_hash" idea. I think this is similar to what the abseil guys 
recommend for their hashing mechanism (https://abseil.io/tips/152). I opened 
issue847 for this.

For a short-term solution, I have changed OperatorID::hash() to return an int 
and TaskID::hash() to return uint64_t.
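
For illustration, the short-term change might look like this (the classes'
internals shown here are assumptions):

#include <cstdint>

class OperatorID {
    int index;
public:
    explicit OperatorID(int index) : index(index) {}
    // Was size_t; returning the int directly avoids the size_t overload
    // problem described in msg7812.
    int hash() const {return index;}
};

class TaskID {
    std::uint64_t value;  // assumed representation
public:
    explicit TaskID(std::uint64_t value) : value(value) {}
    std::uint64_t hash() const {return value;}
};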
msg7815 Author: malte Date: 2018-09-30.17:41:39
One question worth asking is why OperatorID::hash returns a size_t when
internally it uses an int, and our hashers can actually deal with ints directly.
This may be worth changing. With our own hashing scheme, "hash" is not a method
that needs to have the same interface for all classes because it is *not* part
of the protocol. The function that is part of the protocol is "feed".

In fact, the amount of boilerplate we get because we currently have to implement
both a feed function and a hash method (because computing the hash requires
access to the internals) is not good. We decided to go with a global
function because this gives more generality, but this function often refers to
an actual method to have access to private state. In the std library, this is
somewhat analogous to the role of the function std::begin vs. the begin methods
of many classes.

One idea, following the stdlib, is to have a default template implementation of
feed for typename T that delegates to a method T::feed_hash. Then implementors
only need to provide that method, and in the case of OperatorID, we could
directly feed the int.

We would then no longer implement any "hash()" methods for our classes or global
"feed()" functions, only the "feed_hash" methods, reducing boilerplate and
having more choice in which types to use.

If we still need to feed a size_t somewhere, I suppose we have to come up with a
solution, but I would wait if we need that first. If we do, we probably need an
#ifdef or similar kind of conditional compilation.
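
To make the idea concrete, here is a sketch (names and signatures assumed, for
illustration only):

class HashState;                              // assumed hash accumulator type
void feed(HashState &hash_state, int value);  // assumed primitive overload

// Default implementation: delegate to a feed_hash method, analogous to how
// std::begin delegates to the begin() method of a container.
template<typename T>
void feed(HashState &hash_state, const T &value) {
    value.feed_hash(hash_state);
}

class OperatorID {
    int index;
public:
    explicit OperatorID(int index) : index(index) {}

    // The only hashing-related code we would still write per class:
    void feed_hash(HashState &hash_state) const {
        feed(hash_state, index);  // feed the int directly, no size_t detour
    }
};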
msg7812 Author: jendrik Date: 2018-09-29.20:07:22
I had a brief look and think the problem is exactly the same as the one described 
here: https://stackoverflow.com/questions/11603818/ , i.e., neither uint32_t nor 
uint64_t is defined using the same type as size_t. Unfortunately, the suggested 
solution (adding an overload for "unsigned long") doesn't work on Linux since on 
64-bit Linux "uint64_t" is defined as "unsigned long", meaning we would define 
the same function twice there (resulting in a compilation error). Also, we want 
to know the size of the input type, which can differ between platforms for 
"unsigned long", if I remember correctly. I am not sure how to best handle this 
problem. Do you have an idea? Here are the types that I looked up using the 
method from https://stackoverflow.com/a/12490756/1277298 .

Linux 32bit:
typedef unsigned int size_t;
typedef unsigned int uint32_t;
typedef unsigned long long int uint64_t;

Linux 64bit:
typedef long unsigned int size_t;
typedef unsigned int uint32_t;
typedef unsigned long int uint64_t;

Darwin from stackoverflow (probably 64bit):
typedef unsigned long __darwin_size_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
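
The clash is easy to reproduce in isolation (hypothetical feed() overloads, for
illustration only):

#include <cstddef>
#include <cstdint>

void feed(std::uint32_t) {}
void feed(std::uint64_t) {}
// Adding the suggested overload for "unsigned long" breaks 64-bit Linux,
// where uint64_t *is* unsigned long, so it would redefine feed(uint64_t):
// void feed(unsigned long) {}

int main() {
    std::size_t n = 42;
    // feed(n);  // exact match on Linux (size_t is uint32_t or uint64_t
    //           // there), but ambiguous on Darwin, where size_t is
    //           // unsigned long, which matches neither overload exactly
    feed(static_cast<std::uint64_t>(n));  // an explicit cast works everywhere
    return 0;
}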
msg7811 Author: malte Date: 2018-09-29.16:03:21
Sounds good, but for what it's worth, I don't think access to the Mac is needed
to understand the error messages, at least the first ones. It seems to be
something reasonably straightforward about ambiguous overloads. (If I recall
this correctly.) Let me know if you need any support there.
msg7810 Author: jendrik Date: 2018-09-28.22:29:15
I'll look into the buildbot issue on Monday in the office when I can directly 
test things on the Mac.
msg7809 Author: malte Date: 2018-09-28.20:20:35
Looks like there are some issues with the buildbot that need to be addressed.

Once this is all resolved, I think it would be good to let the rest of the group
know about the new hash functionality and how/when to use it. (I'd recommend
sending an internal email, not something to the fast-downward list.)
msg7808 Author: jendrik Date: 2018-09-28.19:59:00
I made one more change to hash the new TaskID objects with the new hash function 
and merged the issue branch.
msg7751 Author: malte Date: 2018-09-21.16:30:08
Done, looks good to merge. I'm slightly in two minds about some of the
score_total_time penalties in satisficing configurations, but that's not a
reason to stand in the way of progress. ;-)
msg7750 Author: jendrik Date: 2018-09-21.16:27:59
Sounds good.
msg7749 Author: malte Date: 2018-09-21.16:20:59
I am looking at this now because it is at the front of the review queue.

Due to the many changes on default, this has some merge conflicts now in the
pull request, but nothing that looks too complicated to resolve, so I'll just
have a look at the code again, and then another look at the experiments.

The way I understand the history, v4 and v5 are identical except that v5
includes a merge from default that affects iPDB, so it's OK that we don't have
satisficing experiments for v5.
msg6729 Author: jendrik Date: 2017-12-06.08:43:26
Now that issue735 is merged, I ran experiments for the PDB configurations again:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue731-v5-opt-issue731-v5-base-issue731-v5-compare.html

Do you think we can merge this?
msg6496 Author: malte Date: 2017-09-01.16:55:10
I've looked at the dominance pruning code, and I think there are a few things
that could be tried. I'll comment some more over at issue735.
msg6494 Author: jendrik Date: 2017-09-01.16:49:46
Interesting! I pushed your code to the repo. Where do we go from here?
msg6493 Author: malte Date: 2017-09-01.15:52:24
> The new test code should already be pushed.

Thanks! With larger hash sets (10M entries), I get a much larger effect, and it
is again larger if after insertions we do 10 rounds of lookups for every key.
For the insert-then-read tests, the old hash table is more than 10 times faster
on my machine:

Running nothing 1 times: 1e-06s

Running insert sequential int with BoostHash 1 times: 0.885316s
Running insert sequential int with BurtleFeed 1 times: 5.48213s

Running insert scrambled int with BoostHash 1 times: 6.22216s
Running insert scrambled int with BurtleFeed 1 times: 6.28391s

Running insert, then read sequential int with BoostHash 1 times: 3.14857s
Running insert, then read sequential int with BurtleFeed 1 times: 34.718s

Running insert, then read scrambled int with BoostHash 1 times: 20.511s
Running insert, then read scrambled int with BurtleFeed 1 times: 25.3854s

Running nothing 1 times: 0s

Running insert sequential int with BoostHash 1 times: 1.85816s
Running insert sequential int with BurtleFeed 1 times: 5.49561s

Running insert scrambled int with BoostHash 1 times: 6.29328s
Running insert scrambled int with BurtleFeed 1 times: 6.40746s

Running insert, then read sequential int with BoostHash 1 times: 3.15231s
Running insert, then read sequential int with BurtleFeed 1 times: 35.194s

Running insert, then read scrambled int with BoostHash 1 times: 20.461s
Running insert, then read scrambled int with BurtleFeed 1 times: 25.3585s


For reference, here is the experiment code I used (no changes outside main):

int main(int, char **) {
    const int REPETITIONS = 2;
    const int NUM_CALLS = 1;
    const int NUM_INSERTIONS = 10000000;
    const int NUM_READ_PASSES = 10;

    for (int i = 0; i < REPETITIONS; ++i) {
        benchmark("nothing", NUM_CALLS, [] () {});
        cout << endl;

        benchmark("insert sequential int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
        });
        benchmark("insert sequential int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
        });
        cout << endl;

        benchmark("insert scrambled int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
        });
        benchmark("insert scrambled int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
        });
        cout << endl;

            benchmark("insert, then read sequential int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        benchmark("insert, then read sequential int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        cout << endl;

        benchmark("insert, then read scrambled int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        benchmark("insert, then read scrambled int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        cout << endl;
    }

    return 0;
}
msg6492 Author: jendrik Date: 2017-09-01.15:32:39
The new test code should already be pushed.
msg6491 Author: malte Date: 2017-09-01.15:31:07
That's not nearly a large enough effect, though. I expect that the effect is
much larger for much larger hash sets. Can you push the test code to the repository?
msg6490 Author: jendrik Date: 2017-09-01.15:30:37
Regarding msg6487, I opened issue735 for analyzing dominance pruning and added 
your second point to the meeting agenda.
msg6488 Author: jendrik Date: 2017-09-01.15:25:54
Your prediction holds. The Boost hash benefits from sequential inserts:

Running insert sequential int with BoostHash 100000 times: 0.797894s
Running insert sequential int with BurtleFeed 100000 times: 0.924217s

Running insert scrambled int with BoostHash 100000 times: 0.858256s
Running insert scrambled int with BurtleFeed 100000 times: 0.922044s
msg6487 Author: malte Date: 2017-09-01.15:07:11
PS: the fact that we have long-running configurations that spend more than 90%
of their time on dominance pruning suggests that we should look a bit more
closely whether dominance pruning is always a good idea, whether we should set a
time limit for it, etc. Can you open a separate issue for this? We might be
crippling our PDB-based planners in some domains. Woodworking with
astar(cpdbs(systematic(2))) would make a good test domain for this.

More generally, it is not ideal that we find out such bottlenecks more or less
by accident. Perhaps we can discuss how we can do things like these better in
the future, e.g. by having each run's results include a categorical parameter that
explains what the bottleneck was and a numerical parameter that explains how
much time was spent on this bottleneck. To be really useful, this should also
adequately cover the case where we don't solve a task, i.e., where the
bottleneck leads to running out of time or memory. I think this shouldn't be
too difficult as long as the log output is clear enough to tell us at regular
checkpoints what the code is currently doing, at which point it started doing
it, and perhaps also what the peak memory usage at this point was.
msg6486 Author: malte Date: 2017-09-01.14:52:31
The benchmark code looks fine.

One possible explanation is that the old hash function leads to better cache
locality for certain access patterns.


For hashing pair<X *, Y *> , the following functions in the old hash code are
relevant (removing comments and some namespace stuff, otherwise taken from the
pull request):

template<typename T>
inline void hash_combine(size_t &hash, const T &value) {
    std::hash<T> hasher;
    hash ^= hasher(value) + 0x9e3779b9 + (hash << 6) + (hash >> 2);
}

template<typename TA, typename TB>
struct hash<std::pair<TA, TB>> {
    size_t operator()(const std::pair<TA, TB> &pair) const {
        size_t hash = 0;
        utils::hash_combine(hash, pair.first);
        utils::hash_combine(hash, pair.second);
        return hash;
    }
};

This is more-or-less equivalent to the following computation for `pair<X *, Y *>
p`, assuming "using namespace std":

        size_t h = 0;
        h ^= hash<X *>()(pair.first) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= hash<Y *>()(pair.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;

This can be computed quickly, but the micro-benchmark shows that this shouldn't
matter very much, so this looks like a red herring. I think the performance
difference is not because of *how fast* the hash values are computed but because
of *what* the hash values are.

There are two main ways in which the distribution of hash values can affect
performance:
1) If there are many collisions, performance degrades.
2) If many similar hash values are considered shortly after one another, we can
get better cache locality when accessing the hash buckets.

Our new code tries to scramble the hash values much more, which means that 1)
isn't a likely explanation, but 2) is.

I checked that with gcc (at least on my machine, code attached),
hash<void *>()(x) just returns x itself. Ditto for ints. So the code above
simplifies to:

        size_t h = 0;
        h ^= reinterpret_cast<size_t>(p.first) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= reinterpret_cast<size_t>(p.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;

For many kinds of code, it's quite common to have consecutive accesses to
"consecutive pairs", i.e., something like the pair <p, q> followed by the pair
<p, q + 1> (where "+ 1" is pointer arithmetic, so the numerical value actually
increases by sizeof Y). The code above doesn't really scramble p.second very
much, so these will often end up with very similar hash values. For example, if
sizeof Y = 4 and (0x9e3779b9 + (h << 6) + (h >> 2)) doesn't have the 4-bit set,
the two final hash values will only differ by 4, and hence we'll sequentially
access hash buckets in close proximity, giving us good memory locality.
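
The proximity of the resulting hash values can be checked directly; for
example (standalone illustration, not part of the pull request):

#include <cstddef>
#include <cstdint>
#include <iostream>

// The simplified old hash from above, for a pair of pointers.
std::size_t old_pair_hash(const void *first, const void *second) {
    std::size_t h = 0;
    h ^= reinterpret_cast<std::size_t>(first) + 0x9e3779b9 + (h << 6) + (h >> 2);
    h ^= reinterpret_cast<std::size_t>(second) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}

int main() {
    int x = 0;
    int ys[4] = {};
    // Hashing "consecutive pairs" <p, q>, <p, q + 1>, ... tends to yield
    // values that lie close together, so consecutive table accesses hit
    // nearby hash buckets.
    for (int i = 0; i < 4; ++i)
        std::cout << old_pair_hash(&x, &ys[i]) << "\n";
    return 0;
}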


To test this, I think it would be interesting to test hash tables with
sequential access patterns. For example, a hash set where we insert consecutive
integers from 0 to N-1, unscrambled, and then a separate experiment where we
insert consecutive integers from 0 to N-1, scrambled.

I haven't looked at the PDB code yet to see how the hashes are actually used
there, so this could be a next step. Reasons why Woodworking might be more
heavily affected than other domains could be either that it has more work than
usually to do in dominance pruning, or that locality of access is more important
there than in other domains, for example because in other domains the working
set of this code is often small enough to more or less completely fit whatever
cache (L1, L2, L3) is causing this, so that locality of access is less
important. If that is the case, it might be difficult to reproduce the behaviour
on machines with different cache sizes.
msg6484 Author: jendrik Date: 2017-09-01.13:59:46
Let's try number 2.

I compared the "before" and "after" logs for systematic(2) on a woodworking task and it seems that all of the additional runtime 
comes from dominance pruning. Dominance pruning hashes pointer pairs. To analyze our old and new hashing algorithm for hashing 
pointer pairs, I created a new microbenchmark (with the new hash.h file). Here are the results (of the second benchmark run):


[...]/experiments/issue731/hash-microbenchmark$ export DOWNWARD_BITWIDTH=32 && make && ./benchmark${DOWNWARD_BITWIDTH}

Running nothing 100000 times: 0.000215s

Running insert pair<void *, void *> with BoostHash 100000 times: 0.9221s
Running insert pair<void *, void *> with BurtleFeed 100000 times: 0.996635s

Running count pair<void *, void *> with BoostHash 100000 times: 0.232802s
Running count pair<void *, void *> with BurtleFeed 100000 times: 0.341353s


These numbers don't explain the change in runtime we're seeing. I'm not sure if my approach to generating random pointer pairs 
is valid. You can inspect the code in the pull request (https://bitbucket.org/jendrikseipp/downward/pull-requests/73).
msg6474 Author: malte Date: 2017-08-31.23:47:21
If it weren't for the systematic PDB results, I think this would be fine to merge.

I'd like to look a bit more carefully at what goes on in this configuration. The
results for woodworking-opt08-strips in this configuration, where runtime
doubles for all solved tasks, give me pause. Plus 700 seconds just because of
the new hash function? That's a lot. How does this happen when in our
microbenchmarks the new hash function was at worst a few percentage points
slower than the old one? (And of course the planner does other things than
computing hash values, too.)

So two paths to proceed:
1) Remove the changes that make the difference in systematic PDBs, verify that
this configuration is back to old performance, merge, then look at the
systematic PDB stuff in a separate issue.
2) Look at the systematic PDB stuff right now, merge everything in one go.
msg6473 Author: malte Date: 2017-08-31.15:58:32
> (The unexplained errors for the satisficing experiment are due to VAL
> running out of memory. We should probably discuss what to do about that
> in our next Fast Downward meeting.)

We could try the INVAL validator instead
(https://github.com/patrikhaslum/INVAL). I'm not sure if anyone ever compared
its validation time and memory requirements to VAL, but unless it's very
difficult to get to run, perhaps we can at least manually run it on 2-3 cases
where VAL is having difficulties and see how it behaves. It's still worth
discussing this in the Fast Downward meeting, but having some data on INVAL
beforehand could be useful for the discussion.
msg6472 Author: jendrik Date: 2017-08-31.11:00:58
The results are in and I think they look better than expected:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue731-v4-opt-issue731-base-issue731-v4-compare.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue731-v4-sat-issue731-base-issue731-v4-compare.html

(The unexplained errors for the satisficing experiment are due to VAL running out of memory. We should 
probably discuss what to do about that in our next Fast Downward meeting.)

As usual we're seeing quite a bit of noise: blind search is among the configurations with the highest 
overall difference for score_search_time (-1.37), but I don't see any place where the new hash functions 
are used in this configuration.

The most critical configuration is probably the one using systematic PDBs. Its total score_total_time 
decreases by 3.41.

What are your thoughts on this?
msg6471 Author: malte Date: 2017-08-28.15:28:34
Code looks good. Based on our earlier experiments, I'm skeptical about
performance, so we might end up needing results for the intermediate commits as
well. Let's wait and see.
msg6470 Author: jendrik Date: 2017-08-28.11:55:52
I have removed the remaining legacy hash functions and created a pull request 
at https://bitbucket.org/jendrikseipp/downward/pull-requests/73 . Experiments 
are running.

I have divided the PR into 4 commits. The current experiment compares the 
fourth commit (v4) to the base revision. If we can't interpret the combined 
revisions, we can make more granular experiments later.
msg6416 Author: jendrik Date: 2017-07-07.18:29:40
You're right, thanks for correcting this! I updated the comment in the code for 
issue693 accordingly.
msg6415 Author: malte Date: 2017-07-07.18:24:37
Do we really have evidence that they cause more collisions? I thought the reason
you looked into other hash functions was that the existing ones often hashed to
adjacent keys, which is bad for closed hashing (but good for open hashing
because of better memory locality).

For me, a main reason to get rid of them is because it's simpler to only use and
maintain one set of user-defined hash functions rather than two.
msg6411 Author: jendrik Date: 2017-07-06.00:35:45
We plan to remove the legacy hash functions in utils/hash.h since they produce
too many collisions and implementing std::hash<T> for non-user-defined types T
causes undefined behaviour
(http://en.cppreference.com/w/cpp/language/extending_std).

We want to use the new utils::HashMap and utils::HashSet for hashing types that 
have no specialization for std::hash.
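
For example, instead of specializing std::hash for std::pair (exactly the
undefined-behaviour case above), client code would do something like this
(sketch; assumes utils/hash.h provides a feed() overload for std::pair, as in
the pull request):

#include <utility>

#include "utils/hash.h"  // assumed to define utils::HashSet and feed()

void example() {
    // No std::hash<std::pair<int, int>> specialization needed, hence no
    // undefined behaviour: utils::HashSet hashes pairs via feed().
    utils::HashSet<std::pair<int, int>> seen;
    seen.insert({1, 2});
}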
History
Date                 User     Action  Args
2018-10-01 13:08:47  malte    set     messages: + msg7824
2018-10-01 12:57:10  jendrik  set     messages: + msg7823
2018-10-01 12:47:45  malte    set     messages: + msg7822
2018-09-30 22:37:58  jendrik  set     messages: + msg7818
2018-09-30 20:57:17  jendrik  set     messages: + msg7817
2018-09-30 17:41:39  malte    set     messages: + msg7815
2018-09-29 20:07:22  jendrik  set     messages: + msg7812
2018-09-29 16:03:21  malte    set     messages: + msg7811
2018-09-28 22:29:15  jendrik  set     messages: + msg7810
2018-09-28 20:20:36  malte    set     messages: + msg7809
2018-09-28 19:59:00  jendrik  set     status: chatting -> resolved; messages: + msg7808
2018-09-21 16:30:08  malte    set     messages: + msg7751
2018-09-21 16:27:59  jendrik  set     messages: + msg7750
2018-09-21 16:20:59  malte    set     messages: + msg7749
2017-12-06 08:43:26  jendrik  set     status: deferred -> chatting; messages: + msg6729; summary: Blocked by issue735. ->
2017-09-01 22:18:45  jendrik  set     status: reviewing -> deferred; summary: Blocked by issue735.
2017-09-01 16:55:10  malte    set     messages: + msg6496
2017-09-01 16:49:46  jendrik  set     messages: + msg6494
2017-09-01 15:52:24  malte    set     messages: + msg6493
2017-09-01 15:32:39  jendrik  set     messages: + msg6492
2017-09-01 15:31:07  malte    set     messages: + msg6491
2017-09-01 15:30:37  jendrik  set     messages: + msg6490
2017-09-01 15:25:54  jendrik  set     messages: + msg6488
2017-09-01 15:07:11  malte    set     messages: + msg6487
2017-09-01 14:52:31  malte    set     files: + test_hash.cc; messages: + msg6486
2017-09-01 14:52:08  malte    set     messages: - msg6485
2017-09-01 14:52:03  malte    set     messages: + msg6485
2017-09-01 13:59:46  jendrik  set     messages: + msg6484
2017-08-31 23:47:21  malte    set     messages: + msg6474
2017-08-31 15:58:32  malte    set     messages: + msg6473
2017-08-31 11:00:58  jendrik  set     messages: + msg6472
2017-08-28 15:28:34  malte    set     messages: + msg6471
2017-08-28 11:55:52  jendrik  set     status: chatting -> reviewing; assignedto: jendrik; messages: + msg6470
2017-07-07 18:29:40  jendrik  set     messages: + msg6416
2017-07-07 18:24:37  malte    set     status: unread -> chatting; messages: + msg6415
2017-07-06 00:35:45  jendrik  create