Issue731

Title: remove legacy hash functions
Priority: feature
Status: deferred
Superseder:
Nosy List: jendrik, malte
Assigned To: jendrik
Keywords:

Created on 2017-07-06.00:35:45 by jendrik, last changed by jendrik.

Summary
Blocked by issue735.
Files
test_hash.cc, uploaded by malte on 2017-09-01.14:52:31 (text/x-c++src)
Messages
msg6496 (view) Author: malte Date: 2017-09-01.16:55:10
I've looked at the dominance pruning code, and I think there are a few things
that could be tried. I'll comment some more over at issue735.
msg6494 (view) Author: jendrik Date: 2017-09-01.16:49:46
Interesting! I pushed your code to the repo. Where do we go from here?
msg6493 (view) Author: malte Date: 2017-09-01.15:52:24
> The new test code should already be pushed.

Thanks! With larger hash sets (10M entries), I get a much larger effect, and it
is again larger if we do 10 rounds of lookups for every key after the
insertions. For the sequential insert-then-read test, the old hash function is
more than 10 times faster on my machine:

Running nothing 1 times: 1e-06s

Running insert sequential int with BoostHash 1 times: 0.885316s
Running insert sequential int with BurtleFeed 1 times: 5.48213s

Running insert scrambled int with BoostHash 1 times: 6.22216s
Running insert scrambled int with BurtleFeed 1 times: 6.28391s

Running insert, then read sequential int with BoostHash 1 times: 3.14857s
Running insert, then read sequential int with BurtleFeed 1 times: 34.718s

Running insert, then read scrambled int with BoostHash 1 times: 20.511s
Running insert, then read scrambled int with BurtleFeed 1 times: 25.3854s

Running nothing 1 times: 0s

Running insert sequential int with BoostHash 1 times: 1.85816s
Running insert sequential int with BurtleFeed 1 times: 5.49561s

Running insert scrambled int with BoostHash 1 times: 6.29328s
Running insert scrambled int with BurtleFeed 1 times: 6.40746s

Running insert, then read sequential int with BoostHash 1 times: 3.15231s
Running insert, then read sequential int with BurtleFeed 1 times: 35.194s

Running insert, then read scrambled int with BoostHash 1 times: 20.461s
Running insert, then read scrambled int with BurtleFeed 1 times: 25.3585s


For reference, here is the experiment code I used (no changes outside main):

int main(int, char **) {
    const int REPETITIONS = 2;
    const int NUM_CALLS = 1;
    const int NUM_INSERTIONS = 10000000;
    const int NUM_READ_PASSES = 10;

    for (int i = 0; i < REPETITIONS; ++i) {
        benchmark("nothing", NUM_CALLS, [] () {});
        cout << endl;

        benchmark("insert sequential int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
        });
        benchmark("insert sequential int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
        });
        cout << endl;

        benchmark("insert scrambled int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
        });
        benchmark("insert scrambled int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
        });
        cout << endl;

            benchmark("insert, then read sequential int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        benchmark("insert, then read sequential int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(i);
            }
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        cout << endl;

        benchmark("insert, then read scrambled int with BoostHash", NUM_CALLS,
                  [&]() {
            unordered_set<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
            // Note: the lookups use the unscrambled keys.
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        benchmark("insert, then read scrambled int with BurtleFeed", NUM_CALLS,
                  [&]() {
            utils::HashSet<int> s;
            for (int i = 0; i < NUM_INSERTIONS; ++i) {
                s.insert(scramble(i));
            }
            // Note: the lookups use the unscrambled keys.
            for (int j = 0; j < NUM_READ_PASSES; ++j) {
                for (int i = 0; i < NUM_INSERTIONS; ++i) {
                    s.count(i);
                }
            }
        });
        cout << endl;
    }

    return 0;
}
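
For readers without the attachment: the helpers assumed by this code could look
roughly as follows. This is a hypothetical reconstruction; the real definitions
are in the attached test_hash.cc and may differ in detail.

#include <chrono>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical reconstruction of the helpers from test_hash.cc.
// Run "func" num_calls times and report the total wall-clock time.
static void benchmark(const std::string &desc, int num_calls,
                      const std::function<void()> &func) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_calls; ++i)
        func();
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "Running " << desc << " " << num_calls << " times: "
              << std::chrono::duration<double>(stop - start).count()
              << "s" << std::endl;
}

// Map consecutive ints to non-consecutive values so that insertion order
// no longer correlates with the key values (Knuth's multiplicative constant).
static int scramble(int i) {
    return static_cast<int>(static_cast<unsigned int>(i) * 2654435761u);
}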
msg6492 (view) Author: jendrik Date: 2017-09-01.15:32:39
The new test code should already be pushed.
msg6491 (view) Author: malte Date: 2017-09-01.15:31:07
That's not nearly a large enough effect, though. I expect that the effect is
much larger for much larger hash sets. Can you push the test code to the repository?
msg6490 (view) Author: jendrik Date: 2017-09-01.15:30:37
Regarding msg6487, I opened issue735 for analyzing dominance pruning and added 
your second point to the meeting agenda.
msg6488 (view) Author: jendrik Date: 2017-09-01.15:25:54
Your prediction holds. The Boost hash benefits from sequential inserts:

Running insert sequential int with BoostHash 100000 times: 0.797894s
Running insert sequential int with BurtleFeed 100000 times: 0.924217s

Running insert scrambled int with BoostHash 100000 times: 0.858256s
Running insert scrambled int with BurtleFeed 100000 times: 0.922044s
msg6487 (view) Author: malte Date: 2017-09-01.15:07:11
PS: The fact that we have long-running configurations that spend more than 90%
of their time on dominance pruning suggests that we should look a bit more
closely at whether dominance pruning is always a good idea, whether we should
set a time limit for it, etc. Can you open a separate issue for this? We might
be crippling our PDB-based planners in some domains. Woodworking with
astar(cpdbs(systematic(2))) would make a good test domain for this.

More generally, it is not ideal that we discover such bottlenecks more or less
by accident. Perhaps we can discuss how we can do things like these better in
the future, e.g. by having each run in the experiment results include a
categorical attribute that names the bottleneck and a numerical attribute that
says how much time was spent on it. To be really useful, this should also
adequately cover the case where we don't solve a task, i.e., where the
bottleneck leads to running out of time or memory. I think this shouldn't be
too difficult as long as the log output is clear enough to tell us at regular
checkpoints what the code is currently doing, when it started doing it, and
perhaps also what the peak memory usage at that point was.
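
A minimal sketch of what such checkpoint logging could look like (the names
are made up; nothing like this exists in the planner yet):

#include <iostream>
#include <string>

// Hypothetical helper: one line per checkpoint, stating what the planner is
// doing, when it started, and the peak memory observed so far.
void log_checkpoint(const std::string &phase, double start_time_seconds,
                    long peak_memory_kb) {
    std::cout << "[checkpoint] phase=" << phase
              << " started_at=" << start_time_seconds << "s"
              << " peak_memory=" << peak_memory_kb << " KB" << std::endl;
}

// Example call: log_checkpoint("dominance pruning", 12.3, 204800);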
msg6486 (view) Author: malte Date: 2017-09-01.14:52:31
The benchmark code looks fine.

One possible explanation is that the old hash function leads to better cache
locality for certain access patterns.


For hashing pair<X *, Y *>, the following functions in the old hash code are
relevant (with comments and some namespace boilerplate removed; otherwise
taken from the pull request):

template<typename T>
inline void hash_combine(size_t &hash, const T &value) {
    std::hash<T> hasher;
    hash ^= hasher(value) + 0x9e3779b9 + (hash << 6) + (hash >> 2);
}

template<typename TA, typename TB>
struct hash<std::pair<TA, TB>> {
    size_t operator()(const std::pair<TA, TB> &pair) const {
        size_t hash = 0;
        utils::hash_combine(hash, pair.first);
        utils::hash_combine(hash, pair.second);
        return hash;
    }
};

This is more-or-less equivalent to the following computation for
`pair<X *, Y *> p`, assuming "using namespace std":

        size_t h = 0;
        h ^= hash<X *>()(p.first) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= hash<Y *>()(p.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;

This can be computed quickly, but the micro-benchmark shows that this shouldn't
matter very much, so this looks like a red herring. I think the performance
difference is not because of *how fast* the hash values are computed but because
of *what* the hash values are.

There are two main ways in which the distribution of hash values can affect
performance:
1) If there are many collisions, performance degrades.
2) If many similar hash values are considered shortly after one another, we can
get better cache locality when accessing the hash buckets.

Our new code tries to scramble the hash values much more, which means that 1)
isn't a likely explanation, but 2) is.

I checked that with gcc (at least on my machine, code attached),
hash<void *>()(x) just returns x itself. Ditto for ints. So the code above
simplifies to:

        size_t h = 0;
        h ^= reinterpret_cast<size_t>(p.first) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= reinterpret_cast<size_t>(p.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;

For many kinds of code, it's quite common to have consecutive accesses to
"consecutive pairs", i.e., something like the pair <p, q> followed by the pair
<p, q + 1> (where "+ 1" is pointer arithmetic, so the numerical value actually
increases by sizeof Y). The code above doesn't really scramble p.second very
much, so these will often end up with very similar hash values. For example,
if sizeof Y = 4 and (0x9e3779b9 + (h << 6) + (h >> 2)) doesn't have the 4-bit
set, the two final hash values will only differ by 4, and hence we'll
sequentially access hash buckets in close proximity, giving us good memory
locality.
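
To see the effect numerically, here is a tiny standalone program that applies
the simplified formula above to two "consecutive" pairs (the addresses are
made up for illustration):

#include <cstddef>
#include <iostream>

// Simplified old pair hash from above, applied to raw address values.
std::size_t old_pair_hash(std::size_t first, std::size_t second) {
    std::size_t h = 0;
    h ^= first + 0x9e3779b9 + (h << 6) + (h >> 2);
    h ^= second + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}

int main() {
    // Made-up addresses standing in for p and q; q + 4 simulates the
    // "next" pair <p, q + 1> with sizeof Y = 4.
    std::size_t p = 0x5000;
    std::size_t q = 0x6000;
    std::cout << old_pair_hash(p, q) << "\n"
              << old_pair_hash(p, q + 4) << "\n";  // typically differs by 4
    return 0;
}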


To test this hypothesis, it would be interesting to benchmark hash tables with
sequential access patterns: for example, a hash set where we insert
consecutive integers from 0 to N-1, unscrambled, and a separate experiment
where we insert consecutive integers from 0 to N-1, scrambled.

I haven't looked at the PDB code yet to see how the hashes are actually used
there, so this could be a next step. Reasons why Woodworking might be more
heavily affected than other domains could be either that it has more work than
usual to do in dominance pruning, or that locality of access matters more
there than in other domains, for example because in other domains the working
set of this code is often small enough to more or less completely fit into
whatever cache (L1, L2, L3) is causing this, so that locality of access is
less important. If that is the case, it might be difficult to reproduce the
behaviour on machines with different cache sizes.
msg6484 (view) Author: jendrik Date: 2017-09-01.13:59:46
Let's try number 2.

I compared the "before" and "after" logs for systematic(2) on a Woodworking
task, and it seems that all of the additional runtime comes from dominance
pruning. Dominance pruning hashes pointer pairs. To compare our old and new
hash functions on pointer pairs, I created a new microbenchmark (with the new
hash.h file). Here are the results (of the second benchmark run):


[...]/experiments/issue731/hash-microbenchmark$ export DOWNWARD_BITWIDTH=32 && make && ./benchmark${DOWNWARD_BITWIDTH}

Running nothing 100000 times: 0.000215s

Running insert pair<void *, void *> with BoostHash 100000 times: 0.9221s
Running insert pair<void *, void *> with BurtleFeed 100000 times: 0.996635s

Running count pair<void *, void *> with BoostHash 100000 times: 0.232802s
Running count pair<void *, void *> with BurtleFeed 100000 times: 0.341353s


These numbers don't explain the change in runtime we're seeing. I'm not sure if my approach to generating random pointer pairs 
is valid. You can inspect the code in the pull request (https://bitbucket.org/jendrikseipp/downward/pull-requests/73).
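
For reference, one plausible way to generate such pointer pairs is sketched
below. This is only an assumption about the approach; the actual code is in
the pull request and may differ.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch: take addresses inside a real allocation so that the values
// resemble genuine heap pointers.
std::vector<std::pair<void *, void *>> make_pointer_pairs(int num_pairs) {
    static std::vector<int> arena(1 << 20);  // backing storage
    std::mt19937 rng(2017);                  // fixed seed for reproducibility
    std::uniform_int_distribution<std::size_t> dist(0, arena.size() - 1);
    std::vector<std::pair<void *, void *>> pairs;
    pairs.reserve(num_pairs);
    for (int i = 0; i < num_pairs; ++i)
        pairs.emplace_back(&arena[dist(rng)], &arena[dist(rng)]);
    return pairs;
}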
msg6474 (view) Author: malte Date: 2017-08-31.23:47:21
If it weren't for the systematic PDB results, I think this would be fine to merge.

I'd like to look a bit more carefully at what goes on in this configuration. The
results for woodworking-opt08-strips in this configuration, where runtime
doubles for all solved tasks, give me pause. Plus 700 seconds just because of
the new hash function? That's a lot. How does this happen when in our
microbenchmarks the new hash function was at worst a few percentage points
slower than the old one? (And of course the planner does other things than
computing hash values, too.)

So there are two paths to proceed:
1) Remove the changes that make the difference in systematic PDBs, verify that
this configuration is back to old performance, merge, then look at the
systematic PDB stuff in a separate issue.
2) Look at the systematic PDB stuff right now, merge everything in one go.
msg6473 (view) Author: malte Date: 2017-08-31.15:58:32
> (The unexplained errors for the satisficing experiment are due to VAL
> running out of memory. We should probably discuss what to do about that
> in our next Fast Downward meeting.)

We could try the INVAL validator instead
(https://github.com/patrikhaslum/INVAL). I'm not sure if anyone has ever
compared its validation time and memory requirements to VAL's, but unless it's
very difficult to get running, perhaps we can at least manually run it on 2-3
cases where VAL is having difficulties and see how it behaves. It's still
worth discussing this in the Fast Downward meeting, but having some data on
INVAL beforehand could be useful for the discussion.
msg6472 (view) Author: jendrik Date: 2017-08-31.11:00:58
The results are in and I think they look better than expected:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue731-v4-opt-issue731-base-issue731-v4-compare.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue731-v4-sat-issue731-base-issue731-v4-compare.html

(The unexplained errors for the satisficing experiment are due to VAL running out of memory. We should 
probably discuss what to do about that in our next Fast Downward meeting.)

As usual, we're seeing quite a bit of noise: blind search is among the
configurations with the highest overall difference for score_search_time
(-1.37), but I don't see any place where the new hash functions are used in
this configuration.

The most critical configuration is probably the one using systematic PDBs. Its
total score_total_time decreases by 3.41.

What are your thoughts on this?
msg6471 (view) Author: malte Date: 2017-08-28.15:28:34
Code looks good. Based on our earlier experiments, I'm skeptical about the
performance, so we might end up needing results for the intermediate commits
as well. Let's wait and see.
msg6470 (view) Author: jendrik Date: 2017-08-28.11:55:52
I have removed the remaining legacy hash functions and created a pull request 
at https://bitbucket.org/jendrikseipp/downward/pull-requests/73 . Experiments 
are running.

I have divided the PR into 4 commits. The current experiment compares the
fourth commit (v4) to the base revision. If we can't interpret the combined
revisions, we can run more granular experiments later.
msg6416 (view) Author: jendrik Date: 2017-07-07.18:29:40
You're right, thanks for correcting this! I updated the comment in the code for 
issue693 accordingly.
msg6415 (view) Author: malte Date: 2017-07-07.18:24:37
Do we really have evidence that they cause more collisions? I thought the reason
you looked into other hash functions was that the existing ones often hashed to
adjacent keys, which is bad for closed hashing (but good for open hashing
because of better memory locality).

For me, a main reason to get rid of them is because it's simpler to only use and
maintain one set of user-defined hash functions rather than two.
msg6411 (view) Author: jendrik Date: 2017-07-06.00:35:45
We plan to remove the legacy hash functions in utils/hash.h since they produce
too many collisions and since implementing std::hash<T> for non-user-defined
types T causes undefined behaviour
(http://en.cppreference.com/w/cpp/language/extending_std).

We want to use the new utils::HashMap and utils::HashSet for hashing types
that have no specialization of std::hash.
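
For context, the new containers are essentially thin aliases over the standard
ones with the project's own hash functor plugged in. A minimal sketch,
assuming a functor named utils::Hash (see utils/hash.h for the real
definitions):

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <unordered_set>

namespace utils {
// Stand-in for the project's hash functor; the real one feeds values
// through a Burtle-style hash, whereas this sketch just delegates to
// std::hash.
template<typename T>
struct Hash {
    std::size_t operator()(const T &value) const {
        return std::hash<T>()(value);
    }
};

// The new containers are thin aliases over the standard ones:
template<typename T>
using HashSet = std::unordered_set<T, Hash<T>>;

template<typename Key, typename Value>
using HashMap = std::unordered_map<Key, Value, Hash<Key>>;
}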
History
Date                 User     Action  Args
2017-09-01 22:18:45  jendrik  set     status: reviewing -> deferred; summary: Blocked by issue735.
2017-09-01 16:55:10  malte    set     messages: + msg6496
2017-09-01 16:49:46  jendrik  set     messages: + msg6494
2017-09-01 15:52:24  malte    set     messages: + msg6493
2017-09-01 15:32:39  jendrik  set     messages: + msg6492
2017-09-01 15:31:07  malte    set     messages: + msg6491
2017-09-01 15:30:37  jendrik  set     messages: + msg6490
2017-09-01 15:25:54  jendrik  set     messages: + msg6488
2017-09-01 15:07:11  malte    set     messages: + msg6487
2017-09-01 14:52:31  malte    set     files: + test_hash.cc; messages: + msg6486
2017-09-01 14:52:08  malte    set     messages: - msg6485
2017-09-01 14:52:03  malte    set     messages: + msg6485
2017-09-01 13:59:46  jendrik  set     messages: + msg6484
2017-08-31 23:47:21  malte    set     messages: + msg6474
2017-08-31 15:58:32  malte    set     messages: + msg6473
2017-08-31 11:00:58  jendrik  set     messages: + msg6472
2017-08-28 15:28:34  malte    set     messages: + msg6471
2017-08-28 11:55:52  jendrik  set     status: chatting -> reviewing; assignedto: jendrik; messages: + msg6470
2017-07-07 18:29:40  jendrik  set     messages: + msg6416
2017-07-07 18:24:37  malte    set     status: unread -> chatting; messages: + msg6415
2017-07-06 00:35:45  jendrik  create