msg6425 (view) |
Author: malte |
Date: 2017-07-09.12:11:50 |
|
Thanks!
|
msg6424 (view) |
Author: jendrik |
Date: 2017-07-09.12:05:05 |
|
Here are the results; I don't see a big effect:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v8-blind-issue693-v7-issue693-v8-compare.html
For completeness, here is the latest branch version (v8) vs. the default branch (v7-base):
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v8-blind-issue693-v7-base-issue693-v8-compare.html
Merged and pushed.
|
msg6423 (view) |
Author: malte |
Date: 2017-07-08.15:38:25 |
|
Looks good. I'd be interested in seeing another set of experiments (blind search
only is enough) to see if the change without the wrapper class led to a speedup.
But you can merge either way.
|
msg6419 (view) |
Author: jendrik |
Date: 2017-07-07.19:45:28 |
|
I'm done handling your comments. Can I merge this?
|
msg6418 (view) |
Author: malte |
Date: 2017-07-07.18:44:59 |
|
Experimental results look good. I'm done with my bitbucket comments.
|
msg6412 (view) |
Author: jendrik |
Date: 2017-07-06.09:18:50 |
|
The results are in:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v7-opt-issue693-v7-base-issue693-v7-compare.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v7-sat-issue693-v7-base-issue693-v7-compare.html
|
msg6410 (view) |
Author: jendrik |
Date: 2017-07-06.00:08:56 |
|
I have made the changes we discussed offline and will run experiments now:
https://bitbucket.org/jendrikseipp/downward/pull-requests/66
|
msg6349 (view) |
Author: malte |
Date: 2017-05-07.17:54:51 |
|
I'd like to have a closer look at the code and the experimental results, and also see
how they differ from machine to machine. There are a few other things ahead of
this in the review queue, though.
|
msg6347 (view) |
Author: jendrik |
Date: 2017-05-07.17:48:21 |
|
How do you propose to continue here?
|
msg6295 (view) |
Author: jendrik |
Date: 2017-04-29.09:23:41 |
|
You can find the code under
experiments/issue693/hash-microbenchmark
in the repo linked below.
|
msg6286 (view) |
Author: malte |
Date: 2017-04-28.21:35:00 |
|
Some very good-looking numbers mixed with some I find hard to understand.
Can I get the code for the microbenchmarks somewhere? I wonder how much it
changes from computer to computer.
|
msg6284 (view) |
Author: jendrik |
Date: 2017-04-28.21:09:42 |
|
I implemented a version of HashState using the Boost hash function. Here are the
results (for each bitwidth I report the second iteration to account for
cache effects):
std::unordered_set: hash_combine + Boost hash
FastHashSet: HashState + Boost hash
BurtleBurtleHashSet: hashword() from lookup3
utils::HashSet: HashState + feed()
HashWordHash: HashState + feed_ints()
SpookyV2HashSet: SpookyHashV2
32-bit:
Running nothing 100000 times: 0.000201s
Running insert int into std::unordered_set 100000 times: 0.794704s
Running insert int into FastHashSet 100000 times: 0.788237s
Running insert int into utils::HashSet 100000 times: 0.886164s
Running insert int into SpookyHash 100000 times: 1.15538s
Running insert pair<int, int> into std::unordered_set 100000 times: 0.806538s
Running insert pair<int, int> into FastHashSet 100000 times: 0.807989s
Running insert pair<int, int> into utils::HashSet 100000 times: 0.849372s
Running insert vector<int> of size 1 into std::unordered_set 100000 times: 0.416037s
Running insert vector<int> of size 1 into FastHashSet 100000 times: 0.421279s
Running insert vector<int> of size 1 into BurtleBurtleHashSet 100000 times: 0.445918s
Running insert vector<int> of size 1 into utils::HashSet 100000 times: 0.450099s
Running insert vector<int> of size 1 into HashWordHash 100000 times: 0.445355s
Running insert vector<int> of size 1 into SpookyV2HashSet 100000 times: 0.735016s
Running insert vector<int> of size 10 into std::unordered_set 100000 times: 0.712605s
Running insert vector<int> of size 10 into FastHashSet 100000 times: 0.672364s
Running insert vector<int> of size 10 into BurtleBurtleHashSet 100000 times: 0.695451s
Running insert vector<int> of size 10 into utils::HashSet 100000 times: 0.735414s
Running insert vector<int> of size 10 into HashWordHash 100000 times: 0.695138s
Running insert vector<int> of size 10 into SpookyV2HashSet 100000 times: 1.23303s
Running insert vector<int> of size 100 into std::unordered_set 100000 times: 3.91018s
Running insert vector<int> of size 100 into FastHashSet 100000 times: 3.89545s
Running insert vector<int> of size 100 into BurtleBurtleHashSet 100000 times: 3.72564s
Running insert vector<int> of size 100 into utils::HashSet 100000 times: 3.92713s
Running insert vector<int> of size 100 into HashWordHash 100000 times: 3.75994s
Running insert vector<int> of size 100 into SpookyV2HashSet 100000 times: 4.67119s
64-bit:
Running nothing 100000 times: 0.000199s
Running insert int into std::unordered_set 100000 times: 0.767982s
Running insert int into FastHashSet 100000 times: 0.75542s
Running insert int into utils::HashSet 100000 times: 0.796044s
Running insert int into SpookyHash 100000 times: 0.837655s
Running insert pair<int, int> into std::unordered_set 100000 times: 0.774631s
Running insert pair<int, int> into FastHashSet 100000 times: 0.798033s
Running insert pair<int, int> into utils::HashSet 100000 times: 0.808774s
Running insert vector<int> of size 1 into std::unordered_set 100000 times: 0.440595s
Running insert vector<int> of size 1 into FastHashSet 100000 times: 0.418915s
Running insert vector<int> of size 1 into BurtleBurtleHashSet 100000 times: 0.445275s
Running insert vector<int> of size 1 into utils::HashSet 100000 times: 0.462981s
Running insert vector<int> of size 1 into HashWordHash 100000 times: 0.472938s
Running insert vector<int> of size 1 into SpookyV2HashSet 100000 times: 0.510124s
Running insert vector<int> of size 10 into std::unordered_set 100000 times: 0.662004s
Running insert vector<int> of size 10 into FastHashSet 100000 times: 0.728206s
Running insert vector<int> of size 10 into BurtleBurtleHashSet 100000 times: 0.686017s
Running insert vector<int> of size 10 into utils::HashSet 100000 times: 0.827219s
Running insert vector<int> of size 10 into HashWordHash 100000 times: 0.718663s
Running insert vector<int> of size 10 into SpookyV2HashSet 100000 times: 0.739045s
Running insert vector<int> of size 100 into std::unordered_set 100000 times: 3.78722s
Running insert vector<int> of size 100 into FastHashSet 100000 times: 4.2276s
Running insert vector<int> of size 100 into BurtleBurtleHashSet 100000 times: 3.89874s
Running insert vector<int> of size 100 into utils::HashSet 100000 times: 4.2328s
Running insert vector<int> of size 100 into HashWordHash 100000 times: 4.02751s
Running insert vector<int> of size 100 into SpookyV2HashSet 100000 times: 2.97786s
|
msg6131 (view) |
Author: malte |
Date: 2017-01-17.17:00:21 |
|
Thanks! For the next such experiment, I'd recommend running separate rounds of
the whole benchmark rather than repeating each individual test back to back. So instead of:
for each test:
    run the test 10 times
do:
10 times:
    run each test
Otherwise we still run the risk of CPU frequency scaling kicking in too late for
the first tests.
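A minimal sketch of that structure (the type and helper names are made up for illustration):
    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    using Benchmark = std::pair<std::string, std::function<void()>>;

    // Outer loop over rounds: after the first round, every test runs on a
    // warmed-up CPU, so the later rounds can be compared fairly.
    void run_benchmarks(const std::vector<Benchmark> &tests, int num_rounds) {
        for (int round = 0; round < num_rounds; ++round) {
            for (const Benchmark &test : tests) {
                auto start = std::chrono::steady_clock::now();
                test.second();
                std::chrono::duration<double> seconds =
                    std::chrono::steady_clock::now() - start;
                std::cout << test.first << ": " << seconds.count() << "s" << std::endl;
            }
        }
    }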
But this is not necessary here, I think, as the data looks fine to me.
Agreed on SpookyHash. I think the next thing we could try is to speed up our
32-bit Jenkins hash by reducing its overhead, evaluate it within the
context of this mini-benchmark, and, if we're successful, put it back into the
planner.
Discussing detailed suggestions on how to try to reduce the overhead would take
a lot of time, I think, so you can either go ahead with some ideas you have, or
we can discuss it next week when I'm back from Dagstuhl.
|
msg6130 (view) |
Author: jendrik |
Date: 2017-01-17.15:50:05 |
|
I addressed your comments and repeated the experiment:
$ export DOWNWARD_BITWIDTH=32 && make && ./benchmark32
Running nothing 10000 times: 2.7e-05s 1.6e-05s 1.6e-05s 1.7e-05s 1.6e-05s 1.8e-05s 1.8e-05s 1.7e-05s 1.7e-05s 1.7e-05s
Running insert int into std::unordered_set 10000 times: 0.081036s 0.079106s 0.079247s 0.084152s 0.08487s 0.08151s 0.083751s 0.080348s 0.079144s 0.07914s
Running insert int into utils::HashSet 10000 times: 0.084762s 0.085265s 0.084493s 0.0838s 0.08397s 0.085129s 0.085078s 0.086782s 0.104818s 0.087344s
Running insert pair<int, int> into std::unordered_set 10000 times: 0.08351s 0.082174s 0.081757s 0.083469s 0.085102s 0.084095s 0.08828s 0.086784s 0.087166s 0.08912s
Running insert pair<int, int> into utils::HashSet 10000 times: 0.094726s 0.097081s 0.093484s 0.09684s 0.105848s 0.088336s 0.085695s 0.087439s 0.092227s 0.098264s
Running insert vector<int> of size 1 into std::unordered_set 10000 times: 0.04596s 0.056363s 0.048646s 0.047186s 0.04645s 0.0438s 0.042544s 0.042643s 0.042583s 0.042839s
Running insert vector<int> of size 1 into utils::HashSet 10000 times: 0.048968s 0.04695s 0.054449s 0.048542s 0.047915s 0.047623s 0.056844s 0.063033s 0.057115s 0.057683s
Running insert vector<int> of size 10 into std::unordered_set 10000 times: 0.078425s 0.094143s 0.091946s 0.101197s 0.068526s 0.069803s 0.069481s 0.070252s 0.07039s 0.069613s
Running insert vector<int> of size 10 into utils::HashSet 10000 times: 0.079197s 0.078231s 0.078407s 0.081361s 0.077901s 0.075849s 0.075296s 0.074961s 0.075143s 0.076078s
Running insert vector<int> of size 100 into std::unordered_set 10000 times: 0.389071s 0.389709s 0.39034s 0.392658s 0.388745s 0.411826s 0.389423s 0.406438s 0.395687s 0.389923s
Running insert vector<int> of size 100 into utils::HashSet 10000 times: 0.41582s 0.417067s 0.414103s 0.413918s 0.411115s 0.415822s 0.413745s 0.411823s 0.41271s 0.438373s
Running insert vector<int> of size 101 into std::unordered_set 10000 times: 0.400385s 0.392617s 0.394046s 0.39738s 0.39478s 0.391783s 0.393181s 0.399569s 0.393019s 0.392476s
Running insert vector<int> of size 101 into utils::HashSet 10000 times: 0.419522s 0.418543s 0.423371s 0.423999s 0.427279s 0.433636s 0.423573s 0.423132s 0.445995s 0.431476s
$ export DOWNWARD_BITWIDTH=64 && make && ./benchmark64
Running nothing 10000 times: 3.2e-05s 3.5e-05s 2.5e-05s 2.5e-05s 2e-05s 2.5e-05s 2.2e-05s 2.4e-05s 2.5e-05s 2.5e-05s
Running insert int into std::unordered_set 10000 times: 0.077637s 0.075893s 0.075891s 0.076898s 0.075936s 0.076199s 0.075817s 0.075498s 0.076068s 0.082677s
Running insert int into utils::HashSet 10000 times: 0.079573s 0.079173s 0.078947s 0.078562s 0.078708s 0.079013s 0.079192s 0.078822s 0.079377s 0.078879s
Running insert pair<int, int> into std::unordered_set 10000 times: 0.077939s 0.077992s 0.077781s 0.079007s 0.077732s 0.077526s 0.077522s 0.077508s 0.078121s 0.078726s
Running insert pair<int, int> into utils::HashSet 10000 times: 0.083359s 0.081937s 0.081923s 0.081819s 0.090081s 0.08283s 0.081909s 0.082625s 0.081634s 0.082201s
Running insert vector<int> of size 1 into std::unordered_set 10000 times: 0.043533s 0.042524s 0.042537s 0.042486s 0.041758s 0.041832s 0.042071s 0.042023s 0.041996s 0.042037s
Running insert vector<int> of size 1 into utils::HashSet 10000 times: 0.052521s 0.049945s 0.050969s 0.05069s 0.05067s 0.050303s 0.050282s 0.050177s 0.050549s 0.05069s
Running insert vector<int> of size 10 into std::unordered_set 10000 times: 0.06806s 0.067649s 0.067984s 0.068276s 0.070085s 0.067969s 0.067595s 0.06761s 0.068054s 0.067954s
Running insert vector<int> of size 10 into utils::HashSet 10000 times: 0.085893s 0.084267s 0.084279s 0.084266s 0.084104s 0.084591s 0.084925s 0.084917s 0.084119s 0.084324s
Running insert vector<int> of size 100 into std::unordered_set 10000 times: 0.390046s 0.389598s 0.389963s 0.390719s 0.391212s 0.391607s 0.389507s 0.389408s 0.388558s 0.393362s
Running insert vector<int> of size 100 into utils::HashSet 10000 times: 0.429928s 0.429662s 0.431475s 0.428931s 0.430573s 0.431571s 0.430258s 0.432028s 0.429097s 0.431196s
Running insert vector<int> of size 101 into std::unordered_set 10000 times: 0.39213s 0.395047s 0.392967s 0.392531s 0.393602s 0.413045s 0.407416s 0.395272s 0.397204s 0.396622s
Running insert vector<int> of size 101 into utils::HashSet 10000 times: 0.435027s 0.435036s 0.434441s 0.438296s 0.435236s 0.436028s 0.442919s 0.434608s 0.435859s 0.43567s
I had a brief look at SpookyHash and think that due to its complexity it should only be a last resort. How do you propose to continue here?
|
msg6129 (view) |
Author: malte |
Date: 2017-01-17.14:42:29 |
|
Oops, I seem to have accidentally deleted my message. Trying again:
Very useful, thanks!
1) I'm surprised that HashSet is faster for pair<int,int> than for int (in
32-bit mode). I think this is because the processor isn't warmed up at the start
of execution (CPU frequency scaling, cold caches). I suggest running 5-10 rounds
of everything (in one process) and checking that the numbers have stabilized
towards the end. It's also important to keep the background load the same
throughout execution.
2) I think the "bad" hash function benefits from the fact that the inserted
numbers are sequential, in terms of cache locality. Instead of using "i", I
suggest using something scrambled like (0xdeadbeef * i) ^ 0xfeedcafe (see the sketch at the end of this message).
3) It's alright that the new hash function is slower than the old one; it's not
unexpected, as it scrambles the input better. We should of course still check if
there's overhead we can reduce.
4) Regarding 32-bit vs. 64-bit: if we want to be faster on 64-bit compiles, an
obvious avenue is to use Jenkins's 64-bit hash function (SpookyHash), which he
writes is ~3 times faster than lookup3. It's much more complex, though, and I
expect it will only be useful for longer hash keys, not for int, pair<int,int>
and similar types. (How useful it will be for vector<int> will depend on the
size of the vector -- the longer, the better.) I'd expect that we'd be able to
get much better performance for these by using special-purpose functions for
small keys that avoid much of the constant-time overhead of our generic functions.
> Interestingly, the time for hashing int vectors with utils::HashSet
> decreases from 3.3 seconds to 2.9 seconds if we cast
> the vector size to int before feeding it to the HashState.
This might simply be because it throws 4 bytes of information away, but for
something to work with arbitrarily large vectors, this isn't the right thing to
do. What are the runtimes if the vector is always one element longer?
For our use cases, I'd also be interested in seeing how hashing vectors of fixed
length behaves for some sample lengths, say 10, 100, 1000.
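For illustration, a minimal sketch of the scrambling suggested in point 2 (the container and function names are made up):
    #include <unordered_set>

    void fill_set(std::unordered_set<unsigned int> &s, unsigned int num_insertions) {
        for (unsigned int i = 0; i < num_insertions; ++i) {
            // 0xdeadbeef * i wraps around (well-defined for unsigned), so the
            // inserted keys are scattered instead of sequential.
            s.insert((0xdeadbeefu * i) ^ 0xfeedcafeu);
        }
    }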
|
msg6128 (view) |
Author: jendrik |
Date: 2017-01-17.12:18:23 |
|
I suspected that our Jenkins hashing code might be optimized for 32-bit compiles. To test
this, I wrote a microbenchmark (added to the repo):
$ export DOWNWARD_BITWIDTH=32 && make && ./benchmark32
Running nothing 100000 times: 0.000481 seconds
Running insert int into std::unordered_set 100000 times: 0.716415 seconds
Running insert int into utils::HashSet 100000 times: 0.836888 seconds
Running insert pair<int, int> into std::unordered_set 100000 times: 0.818349 seconds
Running insert pair<int, int> into utils::HashSet 100000 times: 0.831098 seconds
Running insert vector<int> into std::unordered_set 100000 times: 3.2735 seconds
Running insert vector<int> into utils::HashSet 100000 times: 3.46468 seconds
$ export DOWNWARD_BITWIDTH=64 && make && ./benchmark64
Running nothing 100000 times: 0.000596 seconds
Running insert int into std::unordered_set 100000 times: 0.746506 seconds
Running insert int into utils::HashSet 100000 times: 0.791947 seconds
Running insert pair<int, int> into std::unordered_set 100000 times: 0.778041 seconds
Running insert pair<int, int> into utils::HashSet 100000 times: 0.80977 seconds
Running insert vector<int> into std::unordered_set 100000 times: 2.70406 seconds
Running insert vector<int> into utils::HashSet 100000 times: 3.31512 seconds
Our new hash functions are slower than the old hash functions in all test cases on both 32-bit and 64-bit. Focusing on
the last test case (inserting vector<int>, which is probably comparable to inserting PackedStateWrapper) we see that the
runtimes of both hash functions are almost equal in 32-bit mode, but the old hash functions benefit much more from
switching to 64-bit than the new ones. Do you think we can do something about this?
Interestingly, the time for hashing int vectors with utils::HashSet decreases from 3.3 seconds to 2.9 seconds if we cast
the vector size to int before feeding it to the HashState.
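For context, a rough sketch (not the actual hash.h code) of how a vector is fed to the HashState; the cast mentioned above affects the first feed call:
    // Sketch only: feeding a vector means feeding its size and then each element.
    // vec.size() is a size_t (8 bytes in 64-bit mode); the observed speedup came
    // from feeding static_cast<int>(vec.size()) instead.
    template<typename T>
    void feed(HashState &hash_state, const std::vector<T> &vec) {
        feed(hash_state, vec.size());
        for (const T &item : vec)
            feed(hash_state, item);
    }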
|
msg6127 (view) |
Author: jendrik |
Date: 2017-01-16.11:43:56 |
|
All data comes from a single experiment.
|
msg6126 (view) |
Author: malte |
Date: 2017-01-16.11:34:50 |
|
Did you generate all data that you mention in your previous message in the same
experiment, or are they combined from different experiments? I'm asking because,
looking at several experiments that test the same revision, it looks like we're
seeing huge grid noise effects in some domains.
|
msg6125 (view) |
Author: jendrik |
Date: 2017-01-16.11:12:00 |
|
I think the tables comparing v5 and v6 show
* peak memory is almost unaffected by the change
* runtime is more or less unaffected in 32-bit mode: 28 domains have higher total_time score and 17 have lower total_time score.
* runtime increases in 64-bit mode: 11 domains have higher total_time score and 34 have lower total_time score.
Here are some relative scatter plots comparing "total_time" for v5 and v6:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-relative-scatter-blind-release32-issue693-v5-vs-issue693-v6-total_time.png
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-relative-scatter-blind-release64-issue693-v5-vs-issue693-v6-total_time.png
I'm surprised that only 64-bit is adversely affected, since we use utils::get_hash64() in both versions. The only difference I can see
is that we need to copy a bigger pointer when constructing PackedStateWrapper in 64-bit mode. However, I doubt that this can explain
the timing effects.
|
msg6111 (view) |
Author: jendrik |
Date: 2017-01-13.11:10:49 |
|
Here are the results for blind search using a 1 minute timeout. The only difference to "default" is
that the state registry uses the new hash function:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v5-vs-issue693-v6-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v5-vs-issue693-v6-release64.html
|
msg6109 (view) |
Author: malte |
Date: 2017-01-13.00:09:48 |
|
In fact, even a 1-minute timeout would be enough.
|
msg6108 (view) |
Author: malte |
Date: 2017-01-13.00:09:25 |
|
A short timeout is enough. 32-bit and 64-bit sounds good, though.
|
msg6107 (view) |
Author: jendrik |
Date: 2017-01-13.00:06:47 |
|
For the following experiments, should we test both 32-bit and 64-bit or only one of
the two? Should I use a 5-minute or a 30-minute timeout? I'm asking because it
sounds like a lot of experiments.
|
msg6101 (view) |
Author: malte |
Date: 2017-01-12.22:31:13 |
|
Sounds good! I had a look at the experiment and at the pull request for v5; both
look good.
|
msg6099 (view) |
Author: jendrik |
Date: 2017-01-12.22:18:46 |
|
I chose a slightly different path, but ended up with the desired result. I have added the old hash
function to v4 and kept the new hash functions. The new hash functions are unused for now. This code
version is tagged "v5".
I have run an experiment with a search timeout of 5 minutes checking that nothing changes and I think
the results confirm this:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v5-base-vs-issue693-v5-release32.html
I'll change the hash function for the state registry next, since it is one of the most important
places where we hash.
|
msg6087 (view) |
Author: jendrik |
Date: 2017-01-12.00:06:58 |
|
Your plan sounds good, I'll try it tomorrow.
|
msg6086 (view) |
Author: malte |
Date: 2017-01-11.20:58:40 |
|
OK, I've now stared at the profiles for an hour or so; time to come up with a
plan. (I'm not sure if the kcachegrind numbers will be much more insightful; I
think any profile will be hard to analyze because we've changed so many places.)
To the extent that we can trust instruction counts to accurately reflect
runtime, we slow down from 18.8 billion units to 20.4 billion units.
Of the 1.6 billion difference, about 0.47 billion (so around 30% of the
additional time) seem to be attributable to hashing states in the state
registry, and 0.3 billion (around 20%) to hashing ScalarEvaluator pointers in
the evaluation context. But I wouldn't really trust these numbers because they
ignore cache locality, and also they account for only half of the penalty.
I'd really like to study this a bit further, and for this I think it would be
good not to change so many things at the same time. Looking at the diff between
v4 and v4-base, there are three main changes:
1. getting rid of the old hash functions
2. introducing the new hash functions and HashSet/HashMap aliases
3. using the new hash tables everywhere
I would like to split this up a bit so that we can test the performance impact
in smaller increments. My suggestion is to have both hash table implementations
available in this branch simultaneously, so that we can switch them out
individually more easily.
One possibility is to split off a new experimental branch from issue693-v4-base
where we proceed as follows:
1. Introduce the new hash functions, not by overwriting the existing hash.h
but by adding to it. Let's call the new aliases utils::JenkinsHashSet and
utils::JenkinsHashMap, and also introduce aliases in hash.h for our old hash
maps, say utils::OldHashSet and utils::OldHashMap or whatever (a rough sketch is at the end of this message).
2. Use OldHashSet and OldHashMap everywhere where we currently use hash_set and
hash_map.
3. Do a short experiment/profile to verify that this has no performance impact.
It shouldn't, but better safe than sorry because the above is quite a large code
change.
We're then in a position to test the impact of the new hash tables with
individual changes (one line in most cases) for each major point where we use
hash tables.
What do you think? Going from what we currently have to the state after step 2
shouldn't be too difficult, I hope. We'd basically need to globally search and
replace utils::HashMap and utils::HashSet by utils::OldHashMap and
utils::OldHashSet everywhere outside hash.h, and in hash.h we'd need to re-add
the old code and add the OldHashMap and OldHashSet aliases.
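A rough sketch of what the step-1 additions to hash.h could look like; the functor name JenkinsHash is a placeholder for whatever we end up defining:
    #include <unordered_map>
    #include <unordered_set>

    namespace utils {
    // New tables: explicitly pick a functor wrapping the Jenkins-based hashing
    // code ("JenkinsHash" is a placeholder name).
    template<typename T>
    using JenkinsHashSet = std::unordered_set<T, JenkinsHash<T>>;
    template<typename Key, typename Value>
    using JenkinsHashMap = std::unordered_map<Key, Value, JenkinsHash<Key>>;

    // Old tables: plain unordered containers that fall back on the re-added
    // std::hash specializations from the old hash.h.
    template<typename T>
    using OldHashSet = std::unordered_set<T>;
    template<typename Key, typename Value>
    using OldHashMap = std::unordered_map<Key, Value>;
    }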
|
msg6083 (view) |
Author: malte |
Date: 2017-01-11.19:49:38 |
|
I had a look. We're optimizing rather low-level code here, so I'm worried that
the simulation options used for these profiles were too coarse. I assume you
used callgrind? (qcachegrind showed me that it estimated the number of clock
cycles as the number of instructions read. I think it would only do that if
callgrind was used, not if cachegrind was used.) Can you create updated profiles
with cachegrind?
|
msg6082 (view) |
Author: malte |
Date: 2017-01-11.19:35:00 |
|
I'm looking at the profiles at the moment -- don't know yet.
|
msg6081 (view) |
Author: jendrik |
Date: 2017-01-11.19:34:27 |
|
What's your suggestion for how we continue here?
|
msg6080 (view) |
Author: malte |
Date: 2017-01-11.19:16:37 |
|
One factor that influences whether a function is inlined is how large it is. I
assume the old hash functions were much smaller than the new ones.
|
msg6069 (view) |
Author: jendrik |
Date: 2017-01-11.11:49:04 |
|
I have profiled miconic:s8-4.pddl for both v4 and v4-base for blind
search. I'll attach the resulting profiles. Looking at the children of
EagerSearch::step(), which uses almost all of the time, we find that
only the following functions use more relative time in v4 compared to
v4-base:
1) get_successor_state: 24.15% -> 24.55%
hash packed states
2) OpenList::insert: 4.27% -> 5.02%
hash ScalarEvaluator pointers for HeuristicCache
3) utils::HashSet::count(): ?% -> 3.39%
hash GlobalOperator pointers for testing preferredness
It seems for 3) the function is inlined in the old code version, but
not in the new one. I have no idea why that happens though. Since the
hash set of preferred operators is always empty, it shouldn't make a
(big) difference anyway.
|
msg6062 (view) |
Author: malte |
Date: 2017-01-10.18:34:29 |
|
Thanks for the update! I think this confirms the suspicion that for very simple
types and hash functions used in inner loops, with little going on apart
from adding entries to the hash tables, the overhead of the new hash functions can
hurt us, which makes sense. We hash everything as if it were a generic sequence,
and for very small keys, the relative overhead for this is high.
At the end, we may have to make a decision between hash quality and hashing
speed. But let's gather more data first.
|
msg6060 (view) |
Author: jendrik |
Date: 2017-01-10.18:29:12 |
|
Before reading your last comment, I tried the following:
I had a closer look at the results. The slowdown is very pronounced for the
systematic patterns. E.g., the total_time for parcprinter-opt11-strips tasks gets
doubled in 32-bit mode. I profiled the systematic pattern generation on p05.pddl
from that domain. It turns out that almost all of the time before the search starts
is spent doing dominance pruning, i.e., in the function collection_dominates() in
dominance_pruning.cc. The only hashing that is done there is to hash pairs of
pointers. If it had been a primitive type with an obvious hash value (identity)
that was hashed, I would have brought up adding template specializations for
primitive types for utils::get_hash(), as sketched at the end of this message (but I wouldn't have liked that solution).
Since it's hashing pointer pairs that's causing a slowdown, I'm not sure how to
proceed.
I'll profile a miconic task for blind search now.
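For clarity, the rejected specialization idea would look roughly like this, assuming utils::get_hash() is a plain function template taking a const reference:
    #include <cstddef>

    namespace utils {
    // Hypothetical sketch of the rejected idea: make get_hash the identity for
    // primitive types to skip the generic feed/combine machinery.
    template<>
    inline std::size_t get_hash(const int &value) {
        return static_cast<std::size_t>(value);
    }
    }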
|
msg6059 (view) |
Author: malte |
Date: 2017-01-10.18:15:21 |
|
The performance is a bit worse than I'd hope for. I'd rather investigate this a
bit more closely before merging. I suggest focusing on blind search first, since
it should be the configuration where it should be easiest to pinpoint where the
performance change comes from (as it only uses few planner components).
For blind search, a good domain to focus on is miconic because it is the domain
with the strongest negative effect on score_total_time by far (although this
also has to do with the fact that it has so many tasks), and because the domain
is quite simple and the tasks scale in a straightforward way.
On the larger tasks, blind search runtime increases by around 10-15% in Miconic.
My first instinct is that with blind search, the only important place where the
new hash code comes into play is hashing for the state registry, and the most
obvious explanation would be that the new hash function is slower than the old
one. I'd like to confirm whether or not this is true, and if this is the case, I
would like to see if this is because the Jenkins hash is intrinsically slower,
or because there is an overhead due to our design with HashState, feed and
friends. (I'd hope that almost all of this, but not quite all of this, would be
optimized away.)
Can you run a profile on a suitable Miconic task and see if that generates any
insights?
|
msg6034 (view) |
Author: jendrik |
Date: 2017-01-08.10:37:03 |
|
Here are the results of the experiment comparing the new utils::HashMap and utils::HashSet (v4) to
the ancestor revision on the default branch (v4-base):
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v4-base-vs-issue693-v4-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v4-base-vs-issue693-v4-release64.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-sat-issue693-v4-base-vs-issue693-v4-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-sat-issue693-v4-base-vs-issue693-v4-release64.html
I'd say memory looks ok for 32-bit and 64-bit. However, the code seems to get a little slower on average.
|
msg6008 (view) |
Author: malte |
Date: 2017-01-03.17:09:18 |
|
Sounds good! I've added a few more comments.
|
msg6007 (view) |
Author: jendrik |
Date: 2017-01-03.16:36:11 |
|
OK, I'll wait with the experiments until we're happy with the code.
|
msg6006 (view) |
Author: malte |
Date: 2017-01-03.15:52:35 |
|
I think a new experiment comparing v3 to its ancestor on default would be good;
I don't think we can meaningfully interpret these numbers.
BTW, we should really use different configurations for satisficing search, as
none of these configurations is a good one. lmcount is a poor heuristic on its
own, and no configuration without preferred operators is competitive with a
configuration with preferred operators. I would include --alias lama-first and
then 2-3 variations of eager_greedy or lazy_greedy with preferred operators. For
example, eager_greedy with FF and preferred operators, lazy_greedy with FF and
preferred operators, and perhaps one more configuration using cg, cea or add.
|
msg6005 (view) |
Author: jendrik |
Date: 2017-01-03.15:43:46 |
|
I have run experiments comparing the new hashing functions before (v2) and after the addition of
the HashState class (v3). Unfortunately, I missed that in between the two versions the default
branch was merged into the issue branch. If you want, I can run experiments again comparing the
last "default" revision that was merged into the issue branch against the last revision on the
issue branch.
In any case, here are the results:
opt:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v2-vs-issue693-v3-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-opt-issue693-v2-vs-issue693-v3-release64.html
sat:
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-sat-issue693-v2-vs-issue693-v3-release32.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-sat-issue693-v2-vs-issue693-v3-release64.html
Here is the diff between the two revisions:
https://bitbucket.org/jendrikseipp/downward/branches/compare/issue693-v3..issue693-v2#diff
|
msg6004 (view) |
Author: malte |
Date: 2017-01-03.15:42:15 |
|
Sounds good!
In the message, perhaps you can show what the alternatives are. For example, we
cannot write "unordered_set<vector<int>>" because std::hash doesn't exist for
std::vector<int>.
One alternative is to define our own template and write
"utils::hash_set<vector<int>>". The other alternative is to manually specify a
hash function for unordered_set. It would be good to include what this would
look like in the message to downward-dev.
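A rough illustration of the two alternatives at a use site; utils::hash_set and the explicit functor utils::Hash are placeholders for whatever we end up defining:
    #include <unordered_set>
    #include <vector>

    void example() {
        // Alternative 1: our own alias picks the right hash functor for us.
        utils::hash_set<std::vector<int>> closed;
        closed.insert({1, 2, 3});

        // Alternative 2: spell the hash functor out at every use site.
        std::unordered_set<std::vector<int>, utils::Hash<std::vector<int>>> closed_explicit;
        closed_explicit.insert({1, 2, 3});
    }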
|
msg6003 (view) |
Author: jendrik |
Date: 2017-01-03.15:28:25 |
|
OK, then the only question is whether we want to add utils::hash_set and
utils::hash_map templates (see msg5998), right? I'll write to downward-dev about
this.
|
msg6002 (view) |
Author: malte |
Date: 2017-01-03.15:23:23 |
|
Yes, this rules out defining std::hash for types like vector<int> or
pair<int,int>. I've also looked at other sources to confirm this.
|
msg6001 (view) |
Author: jendrik |
Date: 2017-01-03.15:11:43 |
|
http://en.cppreference.com/w/cpp/language/extending_std: "It is allowed to add
template specializations for any standard library template to the namespace std
only if the declaration depends on a user-defined type ..."
If I understand this correctly, we're not allowed to specialize std::hash for
types like vector<int>. Do you agree?
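For example, the kind of specialization this rules out, since it depends only on standard library types (the body is just an arbitrary combiner):
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Not covered by the exception quoted above: no user-defined type involved.
    namespace std {
    template<>
    struct hash<std::vector<int>> {
        size_t operator()(const std::vector<int> &vec) const {
            size_t result = 0;
            for (int value : vec)
                result = result * 31 + hash<int>()(value);
            return result;
        }
    };
    }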
|
msg6000 (view) |
Author: malte |
Date: 2017-01-02.19:38:53 |
|
> Feel free to add the high-level documentation.
Done. If we stop providing a catch-all implementation for std::hash (see
msg5998), then the last comment in hash.h needs to be updated.
|
msg5999 (view) |
Author: jendrik |
Date: 2017-01-02.19:12:14 |
|
All I could find is http://en.cppreference.com/w/cpp/language/extending_std . Not
sure whether it explains the error we're seeing though.
|
msg5998 (view) |
Author: malte |
Date: 2017-01-02.19:01:36 |
|
I was wondering earlier whether it would be preferable not to inject our hash functions
into the std namespace. Then the error should go away, and we can be sure that
we don't accidentally use a potentially poor stdlib hash in places where we want
to use our own hash.
Not using std::hash would mean that we have to specify the extra template
arguments when using std::unordered_set or std::unordered_map, but we could
encapsulate this by providing our own template that picks the right arguments.
For example, we could call our template utils::hash_set and utils::hash_map.
If we want to go this route, I'd suggest discussing it briefly on downward-dev.
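A minimal sketch of that encapsulation, with the suggested alias names; the Hash functor and the get_hash() entry point are assumptions about how our own hashing code would be exposed:
    #include <cstddef>
    #include <unordered_map>
    #include <unordered_set>

    namespace utils {
    // Functor that forwards to our own hashing code instead of std::hash.
    template<typename T>
    struct Hash {
        std::size_t operator()(const T &value) const {
            return get_hash(value);  // assumed entry point of our hash code
        }
    };

    // Aliases pick the functor as the extra template argument, so call sites
    // stay as short as with the std containers.
    template<typename T>
    using hash_set = std::unordered_set<T, Hash<T>>;

    template<typename Key, typename Value>
    using hash_map = std::unordered_map<Key, Value, Hash<Key>>;
    }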
|
msg5997 (view) |
Author: jendrik |
Date: 2017-01-02.18:54:44 |
|
It compiles on gcc 4.8 (bitbucket pipelines) and gcc 5.4 (my laptop). Judging by
the error message, it seems that we're not allowed to overwrite the std::hash struct
in newer gcc versions. I'll do a Google search for this.
|
msg5996 (view) |
Author: malte |
Date: 2017-01-02.18:45:10 |
|
The current code (700f56c30ef6) doesn't compile on my machine at home (gcc
6.2.0). Does it compile for you? I'm attaching what the build process writes to
stderr.
|
msg5968 (view) |
Author: jendrik |
Date: 2016-12-21.21:00:41 |
|
Experiments are running. Feel free to add the high-level documentation.
|
msg5964 (view) |
Author: jendrik |
Date: 2016-12-21.16:59:10 |
|
I updated the pull request with the implementation we discussed. Once you're happy
with the implementation I can run experiments.
|
msg5899 (view) |
Author: malte |
Date: 2016-12-17.03:50:21 |
|
Are you sure it only works well on 64-bit architectures? From the blog
post/description, I only see that it requires 64-bit math, but that's generally
supported in C++11. 64-bit math can be quite inefficient on a 32-bit
architecture, and we would probably see this in the runtime with a 32-bit
compile, but we're planning to discontinue 32-bit compiles, and x64 has been
around for 13 years now.
But actually, reading more about this, I wonder whether all this (either lookup3
or SpookyHash) isn't complete overkill anyway, and whether we might be better served
with something simple such as the one-at-a-time hash, which would have much
lower implementation complexity and could fit well into our current framework of
combining hash functions for different types.
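For reference, the one-at-a-time hash is only a few lines (public domain, from burtleburtle.net):
    #include <cstddef>

    // Jenkins's one-at-a-time hash: consumes the key one byte at a time.
    unsigned int one_at_a_time_hash(const unsigned char *key, std::size_t length) {
        unsigned int hash = 0;
        for (std::size_t i = 0; i < length; ++i) {
            hash += key[i];
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }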
> I'd propose to focus on the hash function for the state registry in this issue
> and consider 64-bit alternatives once we've made the jump to 64-bit compiles.
> What do you think?
It's not particularly clean to use two completely different approaches to
hashing in two different parts of the code. If we think our current default hash
distributes keys badly, I'm not sure it's a good idea to use it everywhere in
the code. So this would make me a bit sad, but I would probably survive it. ;-)
|
msg5897 (view) |
Author: jendrik |
Date: 2016-12-16.19:18:46 |
|
Here is the link for comparing v1 and v2:
https://bitbucket.org/jendrikseipp/downward/branches/compare/issue693-v2..issue693-v1#diff
I already saw SpookyHash when looking for suitable hash functions. Unfortunately, it only works well on 64-bit
architectures (see SpookyHash homepage and http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html).
I'd propose to focus on the hash function for the state registry in this issue and consider 64-bit alternatives once
we've made the jump to 64-bit compiles. What do you think?
|
msg5896 (view) |
Author: malte |
Date: 2016-12-16.16:32:00 |
|
Regarding the unsigned int vs. size_t question, after some more thought I think
it would actually be much more future-proof if we computed a 64-bit (or larger)
hash rather than a 32-bit hash. A 64-bit hash value can be used for a 32-bit
hash table, but not vice versa.
I'm sorry to suggest extra work, but I think it would be good if our hash
functions could at least in principle support large hash tables, even if our
state registries are currently limited to 2^31 - 1 states. It is not that
unlikely that a few years down the road we will look into external search
algorithms or will occasionally want to run experiments with significantly more
RAM. And a 64-bit hash function would also enable us to continue using the
standard library's hash tables in places where we don't want to customize.
The author of the hash function we're using writes "More recently I published
SpookyHash, which is many times faster than lookup3 on 64-bit platforms, even
for short keys." SpookyHash provides a 128-bit hash that can also be used as a
32-bit hash or 64-bit hash. See
http://www.burtleburtle.net/bob/hash/index.html#lookup.
As an added bonus, it looks like SpookyHash can support potentially
non-contiguous data nicely via the Init/Update/Final methods. This would allow
us to implement hash functions for something like vector<FactPair> without
non-portable reinterpret_casts, and I think we could not do that with lookup3.
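A sketch of what that could look like for vector<FactPair>, assuming the SpookyHash class interface from the linked page and FactPair's int members var and value:
    #include <cstdint>
    #include <vector>
    #include "SpookyV2.h"  // Bob Jenkins's SpookyHash V2

    uint64_t hash_fact_pairs(const std::vector<FactPair> &facts) {
        SpookyHash state;
        state.Init(0, 0);  // two 64-bit seeds
        for (const FactPair &fact : facts) {
            // Feed the members individually; the data does not have to be one
            // contiguous byte buffer, so no reinterpret_cast is needed.
            state.Update(&fact.var, sizeof(fact.var));
            state.Update(&fact.value, sizeof(fact.value));
        }
        uint64_t hash1, hash2;
        state.Final(&hash1, &hash2);
        return hash1;
    }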
|
msg5895 (view) |
Author: malte |
Date: 2016-12-16.16:11:37 |
|
Is there a way to look at a diff of v1 vs. v2 in bitbucket?
The numbers in the experiment look OK to me. I'm not sure the experimental
trends are more than noise. For example, if you compare score_total_time of base
and v2 domain by domain, then base wins 26 times and v2 wins 18 times. That's not
a big deal.
|
msg5889 (view) |
Author: jendrik |
Date: 2016-12-14.21:21:21 |
|
The old hash_combine() function is currently used to hash
- State objects (used by diverse_potentials())
- patterns
- vector<FactPair> in DTGFactory
Our old hash functions in hash.h operate on size_t, but the new hash function
uses unsigned int. How should we go about this?
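For reference, the old hash_combine() is essentially the well-known Boost recipe:
    #include <cstddef>
    #include <functional>

    // Mixes one more value into a running size_t seed, delegating the
    // per-element hash to std::hash.
    template<typename T>
    inline void hash_combine(std::size_t &hash, const T &value) {
        std::hash<T> hasher;
        hash ^= hasher(value) + 0x9e3779b9 + (hash << 6) + (hash >> 2);
    }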
|
msg5888 (view) |
Author: jendrik |
Date: 2016-12-14.18:59:33 |
|
I'll look into where our hash function is currently used.
Here are the experiment results:
v1: hash function as in issue213-v5.
v2: hash function as in issue213-v5, but adapted to our code style.
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v2-blind-issue693-base-issue693-v1-compare.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v2-blind-issue693-base-issue693-v2-compare.html
http://ai.cs.unibas.ch/_tmp_files/seipp/issue693-v2-blind-issue693-v1-issue693-v2-compare.html
|
msg5884 (view) |
Author: malte |
Date: 2016-12-14.17:50:43 |
|
I'm done commenting on bitbucket for now, but one more major comment: rather
than adding a new hash function, I'd prefer getting rid of our old hash function
entirely, replacing it with the new one. Is that possible? (I guess it depends
on which types of objects we currently hash.)
|
msg5879 (view) |
Author: jendrik |
Date: 2016-12-14.10:12:40 |
|
I'll run experiments now.
|
msg5878 (view) |
Author: jendrik |
Date: 2016-12-14.10:12:19 |
|
I made a pull request at https://bitbucket.org/jendrikseipp/downward/pull-requests/66
|
msg5877 (view) |
Author: jendrik |
Date: 2016-12-14.09:37:54 |
|
Split off from issue213.
|