Author malte
Recipients jendrik, malte
Date 2017-09-01.14:52:31
The benchmark code looks fine.

One possible explanation is that the old hash function leads to better cache
locality for certain access patterns.

For hashing pair<X *, Y *> , the following functions in the old hash code are
relevant (removing comments and some namespace stuff, otherwise taken from the
pull request):

template<typename T>
inline void hash_combine(size_t &hash, const T &value) {
    std::hash<T> hasher;
    hash ^= hasher(value) + 0x9e3779b9 + (hash << 6) + (hash >> 2);

template<typename TA, typename TB>
struct hash<std::pair<TA, TB>> {
    size_t operator()(const std::pair<TA, TB> &pair) const {
        size_t hash = 0;
        utils::hash_combine(hash, pair.first);
        utils::hash_combine(hash, pair.second);
        return hash;

This is more-or-less equivalent to the following computation for `pair<X *, Y *>
p`, assuming "using namespace std":

        size_t h = 0;
        h ^= hash<X *>()(pair.first) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= hash<Y *>()(pair.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;

This can be computed quickly, but the micro-benchmark shows that this shouldn't
matter very much, so this looks like a red herring. I think the performance
difference is not because of *how fast* the hash values are computed but because
of *what* the hash values are.

There are two main ways in which the distribution of hash values can affect
1) If there are many collisions, performance degrades.
2) If many similar hash values are considered shortly after one another, we can
get better cache locality when accessing the hash buckets.

Our new code tries to scramble the hash values much more, which means that 1)
isn't a likely explanation, but 2) is.

I checked that with gcc (at least on my machine, code attached), hash<void
*>()(x) just returns x itself. Ditto for ints. So the code above simplifies to:

        size_t h = 0;
        h ^= reinterpret_cast<size_t>(pair.first) + 0x9e3779b9 + (h << 6) + (h
>> 2);
        h ^= reinterpret_cast<size_t>(pair.second) + 0x9e3779b9 + (h << 6) + (h
>> 2);
        return h;

For many kinds of code, it's quite to have consecutive accesses to "consecutive
pairs", i.e., something like the pair <p, q> followed by the pair <p, q + 1>
(where "+ 1" is pointer arithmetic, so the numerical value actually increases by
sizeof Y). The code above doesn't really scramble pair.second very much, so
these will often end up with very similar hash values. For example, if sizeof Y
= 4 and (0x9e3779b9 + (h << 6) + (h >> 2) doesn't have the 4-bit set, the two
final hash values will only differ by 4, and hence we'll sequentially access
hash buckets in close proximity, giving us good memory locality.

To test this, I think it would be interesting to test hash tables with
sequential access patterns. For example, a hash set where we insert consecutive
integers from 0 to N-1, unscrambled, and then a separate experiment where we
insert consecutive integers from 0 to N-1, scrambled.

I haven't looked at the PDB code yet to see how the hashes are actually used
there, so this could be a next step. Reasons why Woodworking might be more
heavily affected than other domains could be either that it has more work than
usually to do in dominance pruning, or that locality of access is more important
there than in other domains, for example because in other domains the working
set of this code is often small enough to more or less completely fit whatever
cache (L1, L2, L3) is causing this, so that locality of access is less
important. If that is the case, it might be difficult to reproduce the behaviour
on machines with different cache sizes.
File name Uploaded malte, 2017-09-01.14:52:31
Date User Action Args
2017-09-01 14:52:31maltesetmessageid: <>
2017-09-01 14:52:31maltesetrecipients: + malte, jendrik
2017-09-01 14:52:31maltelinkissue731 messages
2017-09-01 14:52:31maltecreate