msg9964 (view) |
Author: silvan |
Date: 2021-01-30.18:32:59 |
|
Regarding the Github actions: I don't understand the source of the error on the last commit of the branch, but the actions went through on the merge commit, if I'm seeing this right.
Congrats for merging this old issue!
|
msg9962 (view) |
Author: jendrik |
Date: 2021-01-30.17:57:52 |
|
More than eight years after its creation, the issue is finally done. Congrats, that took quite some stamina!
|
msg9961 (view) |
Author: malte |
Date: 2021-01-30.17:45:25 |
|
Hooray! :-)
|
msg9958 (view) |
Author: florian |
Date: 2021-01-30.17:38:21 |
|
It took a few more iterations but I got rid of the warnings and finally merged the issue. Thanks everyone.
|
msg9957 (view) |
Author: malte |
Date: 2021-01-30.15:57:58 |
|
Feel free to merge, but we should also seek to make the Github actions green. One of them looks related to the CPLEX setup and probably needs input from the Github actions experts, but at least one of them seems to just require a basic code change (a missing "override").
|
msg9956 (view) |
Author: jendrik |
Date: 2021-01-30.08:56:42 |
|
I approved the pull request now.
|
msg9955 (view) |
Author: florian |
Date: 2021-01-30.00:18:52 |
|
I added a change log entry, merged the last changes from default into the issue branch, fixed some comments that still mentioned GlobalState, and did some quick local tests on {lazy_greedy, EHC, A*} x {release, debug}.
The new Github actions still complain and Jendrik's review status is still "requested changes", but I think this is ready to merge now.
|
msg9950 (view) |
Author: malte |
Date: 2021-01-29.14:12:47 |
|
Florian pushed some more changes after the code review. They look good to me, but probably require at least some local tests with lazy and eager search with and without debug mode to see that the EvaluationContext changes don't crash.
|
msg9948 (view) |
Author: malte |
Date: 2021-01-29.13:51:39 |
|
Experimental results look good, no reason not to merge.
Unexplained errors:
1) We have the usual segmentation faults that we attribute to memory-related crashes in the translator. (Would be good to deal with this in some way at some point, but not here and now.)
2) We have too much output in parcprinter and in a few other cases for iPDB, but not exceeding the hard limit.
Interpretation of results:
Overall, we're faster than base and Dominik's previous version. Compared to base, our score_total_time differences are:
+1.65 for lama-first
+2.13 for EHC-FF
-0.51 for iPDB
+0.39 for LM-Cut
+0.90 for blind
+0.47 for lazy-greedy with h^FF and h^cea and preferred ops
The improvement of blind search is unexpected, but we'll take it. We can swallow the slight penalty for iPDB.
No substantial changes in memory usage, as expected.
Overall, we lose a bit of coverage:
+1 for lama-first
-2 for EHC-FF
But looking at the details, this may very well be noise. We lose the following tasks with EHC-FF:
- depot-p09, solved by the baseline in 1768.27 seconds
- mprime-p13, solved by the baseline in 1390.25 seconds
- pipesworld-tankage-p40, solved by the baseline in 1782.17 seconds
We gain the following task with EHC-FF:
- mprime-p14, solved by v24 in 1554.73 seconds
We observed recently that in some cases we see more dramatic noise than we're used to, and we don't know why, but it may very well be a grid issue. The mprime tasks are hard to explain otherwise; we don't see that much variation in this domain for the other tasks. The other two tasks would be explainable by "normal" noise.
|
msg9945 (view) |
Author: malte |
Date: 2021-01-29.13:20:10 |
|
Florian, Patrick, Silvan and I also reviewed the pull request, leading to various smaller local change suggestions, some for now and some as TODOs for later. Apart from the ones that should be done now (like making fewer attributes of State mutable), this looks fine.
|
msg9944 (view) |
Author: florian |
Date: 2021-01-29.13:18:44 |
|
Results for the experiment of v24 (the code we reviewed) are in. I also re-ran the code Dominik ran on the Freiburg Grid on our cluster so it can be compared to our results (version2-v3 in the data).
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v24-eval.zip
|
msg9938 (view) |
Author: jendrik |
Date: 2021-01-29.10:54:07 |
|
I looked at the code and made some suggestions.
|
msg9886 (view) |
Author: florian |
Date: 2021-01-28.15:52:53 |
|
Point 6 in the summary is done and we decided that the remaining TODO in the summary will not be handled here. The performance of the goalcount heuristic should improve again when we complete our plan for the component interaction problem. I moved that point to issue559.
|
msg9884 (view) |
Author: florian |
Date: 2021-01-28.15:47:52 |
|
We discussed the results and found them acceptable. I cleaned up the commit history and the code a bit and added documentation. The changes I made are small compared to the revision that we ran the tests on. I would suggest that we review the code under the assumption that the results of v23 also apply to this version, while I run another experiment in parallel to confirm this.
|
msg9876 (view) |
Author: florian |
Date: 2021-01-28.12:05:35 |
|
I ran experiments for the version of the code that uses a vector<int> in pdb::get_value and in the successor generator. For the former, we call State::unpack outside of loops when evaluating multiple PDBs; for the latter, we unpack within the successor generator (the public interface still takes a State).
There are only small (positive) differences in coverage and memory is unaffected as expected. The search becomes faster for LAMA (the total time score summed over all instances increases by +1.99), EHC with h^FF (+2.62), and A* with LM-cut (+0.5). It becomes slower for blind search (-0.51), iPDB (-1.76), and lazy greedy search with h^FF and h^CEA (-0.4).
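To make the calling convention concrete, here is a minimal sketch of the pattern described above (the names unpack, get_unpacked_values and the vector<int> overload of get_value are assumptions based on this description, not necessarily the branch's exact interface):
state.unpack();
const std::vector<int> &values = state.get_unpacked_values();
int max_h = 0;
for (const PatternDatabase &pdb : pdbs) {
    // Each lookup now reads a few entries of a plain vector<int>
    // instead of going through the State interface.
    max_h = std::max(max_h, pdb.get_value(values));
}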
Full Report:
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-compare.html
Relative scatter plots for total time:
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-blind.png
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-ehc-ff.png
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-ipdb.png
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-lama.png
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-lazy.png
https://ai.dmi.unibas.ch/_tmp_files/pommeren/issue348-v23-issue348-base-issue348-v23-total_time-lmcut.png
|
msg9872 (view) |
Author: florian |
Date: 2021-01-27.23:29:39 |
|
I think changing the interface in a way that the subscript operator of State returns an int is a good idea regardless of performance. I tried this out and found that most users immediately call get_value() and there is only one slightly dodgy use of the actual FactProxy. I created issue997 for this in case it doesn't fit here.
On the performance side of things, there is no big difference. I implemented this on top of version2-v3 and the profile of our airport instance improved a little bit but not by much (18330 vs. 18734 in v3 and 11979 in base). You can see the patch here (the iterator is a bit hacky but not used in this configuration):
https://github.com/FlorianPommerening/downward/compare/issue348-version2...issue348-debug-no-fact-proxy
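To illustrate the direction (a sketch of the assumed interface change; get_fact is a hypothetical name for the explicit FactProxy accessor):
class State {
    std::shared_ptr<std::vector<int>> values;
public:
    // operator[] returns the value directly; most callers only need the int.
    int operator[](std::size_t var_id) const {
        return (*values)[var_id];
    }
    // The few callers that really need a FactProxy would ask for it explicitly.
    FactProxy get_fact(std::size_t var_id) const;
};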
|
msg9870 (view) |
Author: florian |
Date: 2021-01-27.21:21:12 |
|
The use-vector-and-unpack version of the code required 12003 * 10^6 simulation steps in valgrind, very close to the base version. This was expected for iPDB. The real test is to see how it performs in other configs.
|
msg9869 (view) |
Author: florian |
Date: 2021-01-27.21:00:25 |
|
We discussed the performance loss in iPDB and saw the fundamental problem that we have classes (SuccessorGenerator and PDB in this case) where we ideally want better performance than what we would get from a general interface covering both packed and unpacked states.
We discussed some alternatives to a polymorphic class:
A) Using templates for the classes to support anything that has a state-like interface. This could cover vector<int> and State for example. Because State::operator[] currently returns a FactProxy, we would have to change this to int and (if needed) offer a different way of getting FactProxies. The templating would then introduce the usual problems that templates come with, e.g., that we have to implement the classes in the header, etc.
B) Have a simple class UnpackedState and have the classes that require unpacked values for efficiency be explicit about this. This is similar to the situation in the main branch right now but we would try to make the choice of registered/unregistered independent of the choice of packed/unpacked. The downside of this would be that it may require duplicate implementations again of the kind we got rid of in this issue.
C) In the longer run, a simple class containing unpacked data directly might be all we need. To make this efficient, we would have to take an even closer look at where states are copied and avoid copies more or less completely. This would require some work on the EvaluationContexts.
D) Some classes that currently work on states could also work on a lower-level representation (like vector<int>). Switching the SuccessorGenerator to this would mean that states always have to be unpacked before successors are generated (see the sketch below).
We want to evaluate if option D) is viable and also test what Jendrik suggested in msg9866.
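To make option D) concrete, a sketch of what the interface could look like (all signatures here are assumptions for illustration):
class SuccessorGenerator {
public:
    // The core implementation works on plain unpacked values.
    void generate_applicable_ops(
        const std::vector<int> &state_values,
        std::vector<OperatorID> &applicable_ops) const;

    // Convenience overload: states always have to be unpacked
    // before successors are generated.
    void generate_applicable_ops(
        const State &state,
        std::vector<OperatorID> &applicable_ops) const {
        state.unpack();
        generate_applicable_ops(state.get_unpacked_values(), applicable_ops);
    }
};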
|
msg9866 (view) |
Author: jendrik |
Date: 2021-01-27.19:55:59 |
|
Apart from the if-statement, the version pairs you mention also differ in whether they return a FactProxy or an int. Maybe it would be good to test a variant of v3 that returns an int?
|
msg9856 (view) |
Author: florian |
Date: 2021-01-27.13:15:52 |
|
I ran more profiles on different ways of accessing the state. The numbers I report here are from the profile of Airport/p08 with iPDB and measure the simulation steps in PatternCollectionGenerator::hillclimbing, where the bulk of the time is spent and where the difference between the versions shows. All numbers are divided by 10^6 to make things more readable. The difference in the code can be seen in the commits on the debug branch (https://github.com/FlorianPommerening/downward/commits/issue348-debug) but I also have some mock code at the end of this message.
base: 11979
int *: 11779 (base - 200)
vector: 12095 (base + 116)
shared_ptr: 12405 (base + 426)
int* in State: 18344 (base + 6365)
v3: 18734 (base + 6755)
The main difference between versions "int *" and "int* in State" is that the if statement checks within operator[] whether unpacked data is available (and that the access happens inside the State class). The same is true for the difference between versions "shared_ptr" and "v3". So I assume that the if statement blocks some compiler optimization.
### base
class State {
    vector<int> values;
    FactProxy operator[](size_t var_id) {
        return FactProxy(..., values[var_id]);
    }
    ...
};
pdb::get_values(const State &) {...}

### int *
class State {
    shared_ptr<vector<int>> values;
    const int *get_data() { return values->data(); }
    ...
};
pdb::get_values(const int *) {...}

### vector
class State {
    shared_ptr<vector<int>> values;
    const vector<int> &get_values() { return *values; }
    ...
};
pdb::get_values(const vector<int> &) {...}

### shared_ptr
class State {
    shared_ptr<vector<int>> values;
    const shared_ptr<vector<int>> &get_values() { return values; }
    ...
};
pdb::get_values(const shared_ptr<vector<int>> &) {...}

### int* in State
class State {
    shared_ptr<vector<int>> values;
    const int *data; // initialized to values->data() once
    FactProxy operator[](size_t var_id) {
        if (data) {
            return FactProxy(..., data[var_id]);
        } else {...}
    }
    ...
};
pdb::get_values(const State &) {...}

### v3
class State {
    shared_ptr<vector<int>> values;
    FactProxy operator[](size_t var_id) {
        if (values) {
            return FactProxy(..., (*values)[var_id]);
        } else {...}
    }
    ...
};
pdb::get_values(const State &) {...}
|
msg9814 (view) |
Author: malte |
Date: 2021-01-26.18:42:02 |
|
> Do you have some time to look at this together?
Not today, possibly tomorrow after the exam.
> I don't really understand this: the old code had one indirection, this code has
> none and the version of the new code that had one indirection (msg9807) still
> had the overhead. I would have expected the code from msg9807 to have the same
> performance as the base and this code to be faster.
I think it also matters *where* the indirection happens, but I cannot articulate this very clearly. In some cases the compiler will be better able to determine that the pointer cannot be changed from the outside. This may also be related to inlining.
> I also don't know where to go from here: the efficient code requires all the code
> duplication again which we probably do not want to reintroduce.
Understanding what is going on is a good starting point. We don't necessarily need to get maximum performance everywhere immediately, but I think now is a good time to look at things carefully.
|
msg9813 (view) |
Author: florian |
Date: 2021-01-26.16:27:36 |
|
I implemented the version that Malte suggested in 993b9ed6
(https://github.com/FlorianPommerening/downward/commit/993b9ed657e5f2e4944078dbb0a823e1b27ebe20).
This re-introduces the code duplication that we got rid of during the issue, but I think this is OK since we only plan to use the code for debugging. The profile for this was clear again and showed the same performance as the base code.
I don't really understand this: the old code had one indirection, this code has none and the version of the new code that had one indirection (msg9807) still had the overhead. I would have expected the code from msg9807 to have the same performance as the base and this code to be faster.
I also don't know where to go from here: the efficient code requires all the code duplication again which we probably do not want to reintroduce.
|
msg9807 (view) |
Author: florian |
Date: 2021-01-26.14:47:50 |
|
> If I understand this correctly, it only saves one indirection and makes the
> State class more expensive.
This is true but it also makes the code more comparable to the base implementation because that also has one indirection for accessing the vector<int>.
I would have expected the difference to go away, but today I cannot even replicate the 1.9 seconds I measured yesterday. I undid the change yesterday and redid it today and got different results. I looked closely at the diff and am convinced it is the same (but I cannot check). Profiling the patched code produced results that look almost identical to the unpatched version, still showing a similar difference in the number of profiling samples overall and in the individual methods. I double-checked that I profiled the new code and could see the patched source code in the profiler. If the results are true, the indirection should not be the problem, but then I do not see what else could be the cause. Do you have some time to look at this together?
|
msg9805 (view) |
Author: malte |
Date: 2021-01-25.14:37:14 |
|
If I understand this correctly, it only saves one indirection and makes the State class more expensive. I would be interested in trying out the alternative solution to see what its impact is. A quick-and-dirty test that disables everything that isn't necessary for iPDB would be good enough for this. I'm happy to do this myself if it helps, but not today.
|
msg9804 (view) |
Author: florian |
Date: 2021-01-25.14:23:35 |
|
> For example, you could add a temporary get_data() method that returns an int *.
> It would make sure that the vector of state values exists and then return
> values.data() (assuming the vector is called "values"). Then use this int *
> instead of the state reference to access the state contents.
I did something similar locally. Instead of a method get_data, I added a
mutable const int *unpacked_data;
to the State class and initialized it to values->data() at the time values is initialized. In operator[] I then swapped out (*values) for unpacked_data. The issue with your suggestion is that it would require a lot of changes in all PDB classes and sampling methods if we want to save the overhead across more than one PDB.
My change reduced the overhead in airport/p08 somewhat but not completely (eyeball average over 5 runs each):
base ~1.75s
version2-v3 ~2.00s
unpacked_data ~1.90s
I don't have time to profile this and look into more details right now.
|
msg9803 (view) |
Author: malte |
Date: 2021-01-25.13:33:22 |
|
I doubt this is related to the cache, but it could certainly be related to the additional indirection, especially if the compiler cannot prove that the pointers involved remain constant throughout the execution of the critical loop. If we are accessing the state by reference, I count three levels of indirection to get at the data: from the state reference to the state, which is a shared_ptr; from the shared_ptr to its contents, which include a vector; from the data pointer of this vector to the actual data.
I can think of a number of ways to make "bulk access" cheaper by reducing this to one level of indirection. The question is what the beauty/speed trade-off is here. One thing we could do is experiment with something potentially ugly to see if it has a performance impact, so that we can make a more informed decision.
For example, you could add a temporary get_data() method that returns an int *. It would make sure that the vector of state values exists and then return values.data() (assuming the vector is called "values"). Then use this int * instead of the state reference to access the state contents.
|
msg9802 (view) |
Author: florian |
Date: 2021-01-25.12:41:03 |
|
Patrick and I profiled iPDB on Airport/p08 (I'll attach the files, don't confuse them with Airport/p09 which is already there from a previous code version).
The results clearly showed that the only significant difference was in the pattern collection generator. Within that code, the difference was distributed over two methods: SuccessorGenerator::generate_applicable_ops and PatternDatabase::get_value. Both of them take a const reference to a state and access its values. In the contexts in which they are called, the state is always unpacked and unregistered, i.e., it consists of a shared pointer to a vector<int> and a bunch of nullptrs for registry, buffer, ID, state_packer. No part related to creating or copying states showed a difference in the profile, only accessing the unpacked data showed up.
In the main branch, the State class has a vector<int> and the operator[] is implemented like this:
FactProxy operator[](std::size_t var_id) const {
assert(var_id < size());
return FactProxy(*task, var_id, values[var_id]);
}
In the new code (version 2), the state has a shared_ptr<vector<int>> and is implemented like this:
inline FactProxy State::operator[](std::size_t var_id) const {
assert(var_id < size());
if (values) {
return FactProxy(*task, var_id, (*values)[var_id]);
} else {
// access to packed state. Not used in this profile.
}
}
Both methods are inlined (one in-place, the other with the inline keyword further down in the header). We don't suspect the additional if statement to be the problem, especially since it will have the same result for a lot of calls in a row, which should be good for branch prediction.
That leaves the additional indirection when accessing values. Could it be that this uses up more space in the cache because the shared pointer has to be cached in addition to the actual data? If this is the issue, we don't know an easy way around this. One complicated way would be to write a class analogous to the StateRegistry to store values of unpacked states. This way, we could store a pointer to the data directly in the state, rather than behind a shared_ptr.
|
msg9801 (view) |
Author: florian |
Date: 2021-01-24.13:24:21 |
|
The results for version 2 look good in blind search. There is still a negative effect of the patch but it is now very small and I think we could live with this. The results for LAMA and EHC still look OK (the effect size is even smaller than for blind search). But the iPDB configuration still shows a stronger negative effect. Airport/p08 or p09 look like good tasks for profiling this.
|
msg9800 (view) |
Author: DominikDrexler |
Date: 2021-01-23.10:34:43 |
|
Here are the experiments where version1-v3 and version2-v3 only unpack a state in the successor generation if axioms are used.
https://mega.nz/file/0eZlgCaQ#0QmOc74ifkk4cX-MgE6AIhsYXQ6LbPGM96OQXX5zixg
In comparison to the previous experiment (v2) where we always unpacked in the successor generation, the search time now gets much closer to the baseline.
|
msg9799 (view) |
Author: malte |
Date: 2021-01-22.15:33:56 |
|
> The next step we discussed is to change the code of get_successor_state
> in a way that it will operate only on the packed data in tasks with no
> axioms, while unpacking the data in tasks with axioms.
Let me add that this different behavior with and without axioms is of course ugly, and perhaps it can eventually go away. As discussed elsewhere, it is wasteful in terms of memory that we store the values of derived variables in our permanent state storage.
Florian's numbers shed light on another aspect of this: we currently need to evaluate axioms for all generated states because we need all variable values to look the state up in the hash table. But more than 90% of states are duplicates at least in the task we looked at here (duplicates: (#generated - #evaluated) / #generated). For the duplicate test, evaluating the axioms is not logically necessary. If two states agree on the non-derived variables, they must be the same. If the 90% duplicate ratio carries over to other tasks (a big if: I think this has a lot to do with the use of blind search and with the domain), this suggests that more than 90% of the calls to the axiom evaluator could be removed.
This is somewhat orthogonal to the question whether or not to store derived variables in the state storage. We currently store and hash all variables. One option we previously discussed is to store and hash only the non-derived variables. But there is also the option of storing all variables but only using the non-derived ones for the hash.
The same analysis that Florian did in msg9794 regarding the number of memory allocations could also be done regarding the number of calls to the axiom evaluator. If we did this currently, I assume we would see that we have #generated calls (at least). I think this could be reduced to #evaluated if we continue to store the derived variables but don't use them for lookup, and to #evaluated + #expanded if we don't store them. (#evaluated because the heuristics may need to look at the derived variables and #expanded because the successor generator might need to look, too -- plus the heuristic again in some configurations, for example with eager search using preferred operators). In tasks like this one, both of these alternatives look much better than the current behaviour.
(This discussion should probably be eventually moved to an issue that deals with the axioms, but the analysis fits well with what we looked at for msg9794.)
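As a purely hypothetical sketch of the "store all variables but hash only the non-derived ones" option mentioned above (assuming, for illustration only, that non-derived variables are ordered before derived ones):
struct NonDerivedStateHash {
    int num_nonderived;
    std::size_t operator()(const std::vector<int> &values) const {
        std::size_t hash = 0;
        for (int var = 0; var < num_nonderived; ++var)
            hash = hash * 31 + static_cast<std::size_t>(values[var]); // placeholder combiner
        return hash;
    }
};
// The matching equality test would also compare only the first
// num_nonderived entries: if two states agree on the non-derived
// variables, axiom evaluation must produce the same derived values.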
|
msg9796 (view) |
Author: florian |
Date: 2021-01-22.13:40:11 |
|
I updated the TODO list for the summary. What my previous message didn't mention was that we expect to see a benefit mostly in version 2 (the one that uses shared_ptr<vector<int>> instead of shared_ptr<StateData>) because in this case, slim states that do not have unpacked data do not require dynamic allocation. We still plan to implement the change in both versions to confirm this.
|
msg9795 (view) |
Author: florian |
Date: 2021-01-22.13:35:53 |
|
I'm attaching the profiles of the miconic task discussed in the previous message.
|
msg9794 (view) |
Author: florian |
Date: 2021-01-22.13:32:57 |
|
We looked closely at the results and saw that memory is unaffected as expected. For the running time, results varied with the configuration and domain. For configurations other than blind search, we saw no large difference. In those cases, the heuristic evaluation may outweigh and hide the overhead of states. In blind search, we saw a positive impact for domains with axioms and a slowdown of roughly 20% in other domains.
We profiled miconic/s8-4 in the base version and in version 1 (using shared_ptr<StateData>). The profile showed that roughly half of the performance difference was caused by dynamic memory allocations (malloc/free), so we investigated where the calls to new come from. We could clearly see that some calls happen once per evaluated state, some once per generated state, and so on.
In most cases the number of calls matched (exactly) between the base and the issue version:
~6 * #evaluated: reallocate size for evaluation cache
~5 * #expanded: reallocate vector for applicable operators
#evaluated + #expanded: CombiningEvaluator::compute_result (allocating vector for results?)
#evaluated + #expanded: Hashtable::rehash (adding elements to the cache of an evaluation context)
#evaluated: vector<>::reserve (vector<int> key from TieBreakingOpenList)
#evaluated: SearchProgress::process_evaluator_value (adding entries to a hash map)
#evaluated: Heuristic::convert_global_state (copying the state for the task transformation)
All of the above have the same number of calls in both versions. While they cannot cause the performance difference we saw, they can be targets for other issues to save some of the allocations. For example, reusing the same vector for applicable operators sounds like low-hanging fruit.
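A hypothetical sketch of that low-hanging fruit (not part of this issue; the class and method names are made up):
class SearchAlgorithm {
    // Member instead of a local: clear() keeps the allocated capacity, so
    // the vector is only reallocated while it is still growing.
    std::vector<OperatorID> applicable_ops;

    void expand(const State &state) {
        applicable_ops.clear();
        successor_generator.generate_applicable_ops(state, applicable_ops);
        // ... generate successor states from applicable_ops ...
    }
};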
The only difference between the versions was in the allocations caused by state unpacking and successor state generation. The base version unpacks global states before converting them along a task transformation. As we only used one heuristic, this resulted in #evaluated dynamic allocations. The new code, on the other hand, unpacks in the state registry and creates a new vector<int> for the unpacked copy of the data. This leads to #generated dynamic allocations. (Actually the number is higher. We assume this is caused by delayed duplicate detection, where we generate more states in the registry than the ones that are counted as "generated states".) The new code also has to construct states, which needs one dynamic allocation for the shared pointer. This happens once per generated state, plus some additional times (for example when looking up states during deferred duplicate detection).
So in the end, the new code has at least (2*#generated - #evaluated) dynamic allocations more than the old one. This is significant since #generated is much higher than #evaluated (factor of 15 in the task we looked at).
The next step we discussed is to change the code of get_successor_state in a way that it will operate only on the packed data in tasks with no axioms, while unpacking the data in tasks with axioms.
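A sketch of that next step (the helper names apply_effects, copy_packed_buffer and register_* are hypothetical):
State StateRegistry::get_successor_state(
        const State &predecessor, const OperatorProxy &op) {
    if (task_has_axioms) {
        // Axioms need unpacked data anyway: unpack, modify, repack.
        std::vector<int> values = predecessor.get_unpacked_values();
        apply_effects(op, values);
        axiom_evaluator.evaluate(values);
        return register_unpacked(std::move(values));
    } else {
        // Without axioms, modify a copy of the packed buffer directly
        // and avoid the extra allocations counted above.
        PackedStateBin *buffer = copy_packed_buffer(predecessor);
        apply_effects_packed(op, buffer);
        return register_packed(buffer);
    }
}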
|
msg9788 (view) |
Author: DominikDrexler |
Date: 2021-01-22.09:39:43 |
|
Experimental results for a comparison of version1, version2, and base:
https://mega.nz/file/kCgUAL6A#WGkdBm_MHLDlaAqNnIIsOzDLz5qDWn2xmePHwJw6yJM
In summary: base performs better than version1 and version2 on most benchmarks.
The difference between version1 and version2 is rather small.
|
msg9786 (view) |
Author: malte |
Date: 2021-01-21.16:03:05 |
|
Great link, thanks! The first thing the article discusses (constant-sized vectors) is something that has bothered me many times, and there are a few places in the code where we avoid vector because of it. (I think one of them is the abstraction representation for Cartesian abstractions? Another is inside the delete relaxation. And of course the way we store things in state registries and PerStateInformations.)
Also interesting that they point to Compiler Explorer. Before I did my benchmark, I started to look there at the assembly of the two variants too, but quickly decided that it was too much. :-)
|
msg9785 (view) |
Author: jendrik |
Date: 2021-01-21.15:40:33 |
|
Your microbenchmark results are in line with a recent blog article: https://wolchok.org/posts/cxx-trap-1-constant-size-vector/
|
msg9783 (view) |
Author: malte |
Date: 2021-01-21.13:29:04 |
|
In my microbenchmark (attached, easy to run for yourself; check the first lines), resize-then-assign is much faster than reserve-then-push on my two machines, at least for the scenario I tested. It seems that the compiler cannot fully optimize away all the "test against capacity, increment the size" shenanigans that push_back involves, which is fair enough.
For completeness, I also tested whether something like "vector<int> v; v.resize(100);" behaves significantly differently from "vector<int> v(100)" (it doesn't) and whether push_back behaves significantly differently from emplace_back (it doesn't).
So the conclusion seems to be to favor the resize-then-assign version, and I would suggest using that in both cases. I would say it is slightly less in the spirit of what counts as clean C++ these days (because we first assign a wrong value and then the correct one), so I would add a comment stating that this was faster than reserve-then-push_back experimentally.
Of course, even better would be to actually measure this, rather than rely on a microbenchmark. But for the time being I think it's enough to use a consistent method in both cases.
(In case it's not obvious, we are bothering with these low-level things here because this can be one of the critical paths based on some things we've seen in the past. Perhaps it isn't, but these state-copying or state-converting loops were among the performance-critical bits of code in the past. We clearly shouldn't worry about this kind of low-level performance in 99% of the code.)
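For readers without the attachment, a minimal stand-in for such a microbenchmark could look as follows (this is not the attached program; the checksum only exists to keep the compiler from optimizing the loops away, and results will vary by machine and compiler):
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 100;
    const int ITERATIONS = 1000000;
    long long sum = 0;
    using Clock = std::chrono::steady_clock;

    auto start = Clock::now();
    for (int i = 0; i < ITERATIONS; ++i) {
        std::vector<int> v(N);           // resize-then-assign
        for (int j = 0; j < N; ++j)
            v[j] = j;
        sum += v[N - 1];
    }
    auto mid = Clock::now();
    for (int i = 0; i < ITERATIONS; ++i) {
        std::vector<int> v;              // reserve-then-push_back
        v.reserve(N);
        for (int j = 0; j < N; ++j)
            v.push_back(j);
        sum += v[N - 1];
    }
    auto end = Clock::now();

    std::cout << "resize+assign:     "
              << std::chrono::duration<double>(mid - start).count() << "s\n"
              << "reserve+push_back: "
              << std::chrono::duration<double>(end - mid).count() << "s\n"
              << "(checksum " << sum << ")\n";
    return 0;
}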
|
msg9780 (view) |
Author: malte |
Date: 2021-01-21.13:08:13 |
|
I'm running a benchmark, and perhaps we should change it the other way around, so perhaps wait a bit with experiments.
|
msg9779 (view) |
Author: malte |
Date: 2021-01-21.12:51:23 |
|
> In one case the vector already exists (but is empty) in the other case, we
> create it there.
I understand, but I think the reserve then push version might be preferable in both cases because it only has to write to every vector element once. If you create a vector of size N (rather than reserving capacity), the default constructor has to be invoked and assigned to each element, only to be immediately overwritten. (A good compiler may or may not be able to compile this away.)
> Isn't there a default capacity for a vector if it is created or are they
> created without reserving space as long as they are unused?
I'd be surprised if any specific behavior was mandated by the standard. But I'd also be surprised for a common implementation to reserve nonzero capacity because vectors can frequently remain empty, and then you never need to do a dynamic memory allocation if you begin with capacity 0.
> Should we switch to creating an empty vector and using reserve/push_back
> in the second case?
I think this would be better in terms of reducing spurious performance differences between the two versions. I don't want to claim that either version will be better than the other without benchmarking. :-) But I'm leaning towards reserve/push_back being no worse at least, and then I think it's best to make both the same.
Here is a small test program to show the capacity of an empty vector. It is 0 for me, but like I said I doubt this is a language guarantee:
===========================================================================
#include <iostream>
#include <vector>
int main(int, char **) {
std::cout << std::vector<int>().capacity() << std::endl;
}
|
msg9778 (view) |
Author: florian |
Date: 2021-01-21.12:36:00 |
|
> Isn't there a default capacity for a vector if it is created
Answering my own question: the default seems to be 0 for all common compilers. Makes sense. So we'll keep the reserve/push_back strategy in both cases. I will update the code (the links will stay the same).
|
msg9777 (view) |
Author: florian |
Date: 2021-01-21.12:21:17 |
|
In one case the vector already exists (but is empty); in the other case, we create it there. Isn't there a default capacity for a vector when it is created, or are vectors created without reserving space as long as they are unused? Should we switch to creating an empty vector and using reserve/push_back in the second case?
|
msg9775 (view) |
Author: malte |
Date: 2021-01-21.12:00:00 |
|
I didn't find a way on github to comment directly on the diff, so one comment here:
Why the two different strategies (presize vector and assign vs. reserve and push_back) in State/StateData::unpack()? These can have different performance characteristics. For example, the resize-then-assign version has to touch every element twice (once to assign 0, then again to assign the final value).
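For reference, the two strategies side by side (two alternative bodies for unpack(), not meant to coexist in one scope; state_packer.get(buffer, var) stands in for whatever reads a single packed value):
// (a) presize-then-assign: every element is written twice
// (value-initialized to 0, then overwritten with the real value).
std::vector<int> values(num_variables);
for (int var = 0; var < num_variables; ++var)
    values[var] = state_packer.get(buffer, var);

// (b) reserve-then-push_back: every element is written once, but each
// push_back also updates the size and checks the capacity.
std::vector<int> values;
values.reserve(num_variables);
for (int var = 0; var < num_variables; ++var)
    values.push_back(state_packer.get(buffer, var));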
|
msg9774 (view) |
Author: florian |
Date: 2021-01-21.11:58:45 |
|
I updated the summary:
- mentioned the meta issue (issue401)
- point 3 is done in the current prototypes
- added points 4 and 5 as next steps
|
msg9773 (view) |
Author: florian |
Date: 2021-01-21.11:53:19 |
|
Dominik and I implemented the two variants mentioned in msg9771. You can see them on Github
Variant 1 (State has shared_ptr<StateData>):
https://github.com/FlorianPommerening/downward/compare/main...issue348-v21#files_bucket
Variant 2 (State has shared_ptr<vector<int>>):
https://github.com/FlorianPommerening/downward/compare/main...issue348-v22#files_bucket
Difference between the versions:
https://github.com/FlorianPommerening/downward/compare/issue348-v21...issue348-v22
Dominik is now setting up experiments for both variants.
|
msg9772 (view) |
Author: DominikDrexler |
Date: 2021-01-20.19:09:16 |
|
Aspects to discuss about the current prototype for version 1:
- How to implement StateData in the header file?
- How to access the StatePacker inside the State class?
It would be good if someone could review the code.
|
msg9771 (view) |
Author: florian |
Date: 2021-01-20.14:25:04 |
|
We discussed the patch and some of the more ugly parts of it this morning. The result was that we are going to try an additional iteration of the prototype where we try a different approach:
This iteration will use a state class that can have packed and unpacked, registered and unregistered data inside. It will offer methods to access the data, and create missing representations (e.g. unpack packed data). We will use indirection with shared pointers to keep the actual state class light-weight and easy to copy. The state will remain immutable in the sense that the data inside the state can only change by adding an equivalent representation (unpacking, registering) or by the assignment operator.
One advantage is that in the future, we can try things like not storing derived variables in the packed data and treating evaluated axioms as another (optional) piece of the state that is created on demand. However, this is not part of this issue.
We also discussed the way states are currently copied when transforming them from the task used in the search to the task used in the heuristic (both in the current main branch and the issue branch). This will hopefully become obsolete with our planned solution to the component interaction problem, so we decided to leave it like it is in this issue.
As a next step, we want to try two variants of the idea described above. Here is some mock code:
### Variant 1:
class StateData {
    mutable const PackedStateBin *buffer;
    mutable StateID id;
    mutable vector<int> unpacked_data;
    mutable const StateRegistry *registry;
public:
    void unpack() const;
    operator[];
};

class State {
    shared_ptr<StateData> data;
public:
    void unpack() const;
    operator[];
};

### Variant 2:
class State {
    mutable const PackedStateBin *buffer;
    mutable StateID id;
    mutable shared_ptr<vector<int>> unpacked_data;
    mutable const StateRegistry *registry;
public:
    void unpack() const;
    operator[];
};
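A possible unpack() for Variant 2 might look like this (a sketch under the assumptions above; the registry accessors are hypothetical names):
void State::unpack() const {
    if (!unpacked_data) {
        const IntPacker &packer = registry->get_state_packer();
        int num_variables = registry->get_num_variables();
        auto values = std::make_shared<std::vector<int>>();
        values->reserve(num_variables);
        for (int var = 0; var < num_variables; ++var)
            values->push_back(packer.get(buffer, var));
        // unpacked_data is mutable, so the cache can be filled in a
        // const method without breaking the immutability contract.
        unpacked_data = std::move(values);
    }
}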
|
msg9764 (view) |
Author: DominikDrexler |
Date: 2021-01-19.11:07:47 |
|
Here is a comparison of the branch tip denoted by issue348-v20 with issue348-base using blind search:
ComparativeReport:
https://mega.nz/file/xDwEFJqb#QTGQyjN4wiz69Rk4fi5jpQikn31-ABCH5FsqcMXV5ZM
Scatterplot adl:
https://mega.nz/file/lXo1zSxR#AbALakrN3hB1Uy8arM0jsQvVUaqpASJuJ0o0BVQJE3I
Scatterplot strips:
https://mega.nz/file/hLhFEa6C#d_KX1vpU2NORLPhcEJleCfpezU1CLGW9mdAe5vy0qCs
The branch tip yields better performance on adl and worse on strips.
|
msg9762 (view) |
Author: florian |
Date: 2021-01-18.16:33:37 |
|
To bring this issue back up to date, I re-recreated the pull request on GitHub and rebased it to the current main branch.
https://github.com/aibasel/downward/pull/17
Since the patch is fairly large, I created a summary of the most important changes:
* States (objects of the class State) are always unpacked and can be registered or unregistered. The only place packed state values occur is inside the StateRegistry.
* States internally use a shared_ptr<vector<int>> for the data, so they remain cheap to move around and copy. This could also be used to unpack the values on demand (there is a second branch that tries that here: https://github.com/FlorianPommerening/downward/compare/issue348...FlorianPommerening:issue348-experimental).
* Axioms are always evaluated on unpacked data.
* State creation is still in the StateRegistry (we once planned to separate it but couldn't find a good solution here).
* Successor state creation copies the packed parent state, then unpacks it and evaluates effects and axioms on the unpacked data, then repacks everything again for storage. The unpacked data is returned (without unpacking it again).
* Unregistered successors of unpacked states can also be created through the task interface. This copies the unpacked data and then evaluates effects and axioms on it.
* SearchNode now stores a mutable unique_ptr<State> instead of a GlobalState.
* (Unpacked) state is now stored in EvaluationContext rather than EvaluationCache (was only accessed through EvaluationContext).
* We reintroduced the StateHandle class to represent a state (StateHandle = state ID + reference to registry).
* PerStateInformation is now indexed with StateHandles.
* Duplicated code in the successor generator was removed (overloads for States and GlobalStates are not necessary anymore).
* Heuristics no longer have two overloads for compute_heuristic.
* The AxiomEvaluator no longer needs two overloads to evaluate axioms on packed and unpacked data.
|
msg8920 (view) |
Author: jendrik |
Date: 2019-06-19.09:18:50 |
|
Here are the results.
base (05d98c9): start of the branch
v13 (2f7831a): "rename interface for unpacked states"
v14 (cba70832): "keep unpacked and packed data synchronized instead of unpacking at the end of successor generation"
v19 (540e85c): branch tip ("Allow to generate unregistered successor states on tasks with axioms")
v13 vs. v14:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v14-blind-issue348-v13-issue348-v14-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v14-blind-issue348-v13-issue348-v14-total_time-blind-strips.png
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v14-blind-issue348-v13-issue348-v14-total_time-blind-adl.png
base vs. v19:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v19-blind-issue348-base-issue348-v19-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v19-blind-issue348-base-issue348-v19-total_time-blind-strips.png
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v19-blind-issue348-base-issue348-v19-total_time-blind-adl.png
|
msg8919 (view) |
Author: jendrik |
Date: 2019-06-18.15:10:28 |
|
Experiments are running (see msg8911).
|
msg8915 (view) |
Author: florian |
Date: 2019-06-17.11:14:54 |
|
The pull request at
https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/59
is the one that we made for review. It contains the changes from the prototype
branches that we want to keep. The other pull requests were for the prototype
branches and we only used them while working on them to keep track of the
changes. There is also pull request 60 that was just to see if we missed
anything important in the pull request above (see msg8904).
|
msg8914 (view) |
Author: malte |
Date: 2019-06-17.10:59:20 |
|
Which pull request have you been reviewing? There seem to be multiple open ones.
I'm not entirely sure which one the comments refer to.
> * Check how "lightweight" classes such as StateHandle are passed around as
> function parameters. Didn't see any reference to that in the coding guidelines,
> but StateHandle is ATM being passed in most (all?) of the new code by value, not
> by const ref.
I don't think we have a written-down policy for this. In principle, very small
objects can be more efficiently passed by value than by const reference, both
because there is less to pass, but more importantly because indirection can be
reduced and more optimization assumptions are possible. For example, we wouldn't
want to pass an int by const reference. But it is not easy to come up with
general rules that are easy to understand and avoid code churn.
> * The SearchNode class now holds a unique_ptr to the unpacked State (which acts
> as a cache to the calls to SearchNode::get_state). We should perhaps evaluate
> the memory impact of this (I would expect 8 bytes times the number of nodes, which
> could be manageable, but still)
I didn't see such a change (which is one reason why I'm asking which pull
request you are referring to), but for what it's worth: SearchNode is a
short-lived temporary class. I don't think we have more than five or so
SearchNode instances alive at the same time. An overhead of 8 bytes per state is
not something we would introduce without a very strong need.
|
msg8913 (view) |
Author: guillem |
Date: 2019-06-17.10:37:44 |
|
Cedric and I went through the proposed changes in the prototype and wrote down a few notes that might (or might not :-)) be of some use:
* The State class now contains some redundancy: StateHandle contains a pointer to the registry, and the AbstractTask pointer
contains that as well.
* Florian was not too happy about the introduction of the StateHandle class - perhaps this could be avoided.
* The State class seems to have too many constructors; some could probably be refactored in terms of others.
* The comments about design decisions related to the State class in task_proxy.h should be revised / updated.
* The modifications in StateRegistry::get_successor_state might be introducing some performance overhead; this should be checked.
In particular, both the parallel update of the packed and unpacked state, and the way axioms are evaluated.
* Check that the class EvaluatorCache is still needed - i.e., it seems like a very thin wrapper around EvaluationResults that
provides very little extra functionality.
* Check how "lightweight" classes such as StateHandle are passed around as function parameters. Didn't see any reference
to that in the coding guidelines, but StateHandle is ATM being passed in most (all?) of the new code by value, not
by const ref.
* The SearchNode class now holds a unique_ptr to the unpacked State (which acts as a cache to the calls to
SearchNode::get_state). We should perhaps evaluate the memory impact of this (I would expect 8 bytes times the number of nodes, which could be manageable, but still)
|
msg8911 (view) |
Author: florian |
Date: 2019-06-14.21:55:35 |
|
The first two points are about the performance and surprised me a lot because
they seem to directly contradict our previous experiments. I ran blind search on
gripper/prob07 locally and it took 27 seconds, roughly 10% slower than on the
default branch.
Jendrik, could you run an experiment comparing revision cba70832 to its parent
(2f7831a) and create relative scatter plots for tasks with axioms and tasks
without axioms separately? This way we could have a look at the change for
axioms in isolation.
Could you also run an experiment for the tip of the issue (540e85c) vs. the
start of the branch (05d98c9) to see the overall impact?
I'll start another experimental branch off of the issue branch to try out the
unpacking on demand.
|
msg8910 (view) |
Author: florian |
Date: 2019-06-14.21:50:27 |
|
I went through the changes again and tried to collect the rough edges and open questions. I came up with the following list:
1) Tasks with axioms now do an additional copy of the unpacked state data. msg7500 concluded that unpacking the state for the axiom evaluation was not worth it. The new code doesn't unpack but it copies the full state data, evaluates the axioms on it, then packs it again. From my understanding of packing and unpacking, this should take longer than unpacking. I don't see how we save time with this change.
2) msg8880 showed a performance drop when unpacking every generated state but we currently do that. Why is this not a problem anymore?
3) Successor state creation is implemented twice, once for packed and once for unpacked data. Can we avoid this duplication?
4) State::operator[] returns a FactProxy where GlobalState::operator[] returned an int, do we want to change this?
5) There are now two ways to get to the ID of a state: s.get_id() and s.get_handle().get_id(). This smells of bad design but using the second one everywhere also seems bad.
6) We use unique_ptr<T> for data that is created on demand (SearchNode::state and State::values). Should we use optional<T> instead?
|
msg8906 (view) |
Author: jendrik |
Date: 2019-06-14.15:07:55 |
|
I reviewed the individual commits and only had a few minor comments. I think
unpacking the state data on demand could be deferred to a follow-up issue.
|
msg8904 (view) |
Author: florian |
Date: 2019-06-14.13:58:46 |
|
I turned the prototype branch into a pull request for this issue by splitting it
into 20 commits that are hopefully easier to review. I left out some stuff that
I think we don't need any more.
Here is the pull request from the issue branch to default. I recommend looking
at the individual commits instead of the full diff.
https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/59
Here is a pull request from the prototype to the issue branch. It shows the
left-overs, i.e., the parts I didn't include in the new issue branch. Guillem
and Cedric are currently looking over it to see if I missed anything important.
https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/60
I think there is still quite a number of open points to address before we can consider merging them. The main things for me are:
* We currently have two functions that compute successor states, with very
similar code. We should try to get rid of one of them.
* The StateHandle class is not that nice, especially if we keep getters for both
handle and ID in the state.
* I'm not convinced that copying the full state, then evaluating the axioms on
the unpacked data, and packing everything again is worth it (over evaluating
axioms on packed data and then unpacking). We should test that.
We might be able to remove the StateHandle again if we switch the state class to
contain only the handle as required data and treat the buffer lookup and the
state unpacking as optional cached data.
|
msg8891 (view) |
Author: jendrik |
Date: 2019-06-12.18:18:38 |
|
For completeness, here are some additional results:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-all-unpacked-exp1.html
v1: store state data packed in registry and unpacked in State
v2: remove copy and assignment constructors for State
v3: store state in SearchNode to avoid unpacking it again
v4: compute successors in registry
v5: set effects in packed and unpacked state data simultaneously
total search_time scores:
v1: 568.65
v2: 568.38
v3: 570.45
v4: 571.56
v5: 577.54
|
msg8888 (view) |
Author: jendrik |
Date: 2019-06-12.17:25:30 |
|
I'm done making the changes we discussed offline. The new pull request is at
https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/57 .
|
msg8880 (view) |
Author: jendrik |
Date: 2019-06-12.13:53:23 |
|
We suspect that the problem slowing down blind search is that all generated
states have to be unpacked. The following experiment confirms this hypothesis:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-base-exp1.html
v1: default
v2: make dummy copy of each evaluated State
v3: add dummy unpacking for each expanded GlobalState
v4: add dummy unpacking for each generated GlobalState
total search_time scores:
v1: 578.82
v2: 578.01
v3: 576.58
v4: 568.57
|
msg8822 (view) |
Author: jendrik |
Date: 2019-06-05.21:26:08 |
|
I ran an experiment evaluating the changes we made today, but the changes have
no noticeable effect:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v3-opt.html
|
msg8809 (view) |
Author: jendrik |
Date: 2019-06-05.12:37:42 |
|
Update summary.
|
msg8806 (view) |
Author: florian |
Date: 2019-06-05.10:22:07 |
|
I'm attaching two profiles of gripper/prob06 in the default branch and in our
prototype. It looks like the main place where we lose time is in
get_successor_state. In the default branch, this was handled by the state
registry and it created the successor state in the packed buffer. The packed
buffer of the predecessor was copied and then modified. In the new code, we move
the successor generation out of the registry and now copy the unpacked state,
modify the copy, then pack it (copies it again) and register it (copies the
packed buffer again). So instead of one copy we now have three.
|
msg8805 (view) |
Author: jendrik |
Date: 2019-06-04.23:40:37 |
|
We made some progress on the prototype and ran some experiments comparing the
prototype to the default branch. Here are the results:
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v1-opt-issue348-v1-base-issue348-v1-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue348-v1-sat-issue348-v1-base-issue348-v1-compare.html
The search time score decreases for most configurations, so we should profile
the runtime to see where the slowdown comes from.
|
msg8791 (view) |
Author: cedric |
Date: 2019-06-03.17:46:14 |
|
The current plan is to store the state in the state registry in its packed form
and unpack it everywhere else. This means that we will merge the functionality
of the classes State and GlobalState into one class (called State). State
objects can be registered or unregistered but they are always unpacked. This has
some wide-ranging implications because copying a state is now an expensive
operation. We thus have to think carefully about how to pass states to where
they are used. The current plan is to store them inside an EvaluationContext and
move them where possible.
We started working on a separate branch to test the effects of this change. Once
we are happy with it, we can decide which parts of it we want to use.
https://bitbucket.org/FlorianPommerening/downward-issues/pull-requests/51
|
msg7500 (view) |
Author: florian |
Date: 2018-09-17.22:00:21 |
|
In issue794 we tested if it's worth unpacking the state just for the axiom
evaluation. The answer apparently is "no" (msg7498, msg7499). While the axiom
evaluator is much faster on unpacked data, the unpacking and packing offsets
this benefit. Since we already unpack the state in most heuristics, it would
probably be worth it if we can unpack every state when it is taken out of
storage and pack it when it is stored.
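As a sketch of what this storage boundary could look like (assumed names; IntPacker and PackedStateBin mirror the packed-state machinery from issue214):
// Unpack once when a state is taken out of storage ...
std::vector<int> unpack_state(const PackedStateBin *buffer,
                              const IntPacker &packer, int num_variables) {
    std::vector<int> values;
    values.reserve(num_variables);
    for (int var = 0; var < num_variables; ++var)
        values.push_back(packer.get(buffer, var));
    return values;
}

// ... and pack once when it is stored.
void pack_state(const std::vector<int> &values, const IntPacker &packer,
                PackedStateBin *buffer) {
    for (int var = 0; var < static_cast<int>(values.size()); ++var)
        packer.set(buffer, var, values[var]);
}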
|
msg4337 (view) |
Author: florian |
Date: 2015-07-07.22:20:58 |
|
@Gabi: I'm adding you to the nosy list because you are working on issue545 and
this might be relevant. Feel free to undo this.
This issue has been lying around for a while. In the meantime, the situation has changed once again. We currently have two state classes:
There is GlobalState, which has an ID, is used in the search and by StateRegistry and PerStateInformation, but does not know anything about the new task interface.
GlobalState objects are immutable and hold a pointer to their data (stored in a
StateRegistry). Each access to variable values is packed/unpacked on-the-fly
individually. In the long run, we want the global task to disappear, which means
we also have to get rid of GlobalStates.
Then there is the class State which contains unpacked state data. State objects
know about the task interface and can generate FactProxy objects for it. Access
to variable values is a direct array access, but copying States is expensive.
States can currently be created from GlobalStates by the intermediary helper
method convert_global_state. States are currently used within heuristics (those
that are already updated to the new task interface). Operator application is
implemented for them, but does not support axioms.
The problem of packing/unpacking currently exists for both state classes:
GlobalState is always packed, which makes evaluating axioms slow when generating
successors in the search algorithms. State is always unpacked which is a problem
for heuristics like the blind heuristic or PDBs that only access very few facts.
For example, if we currently generate a successor and then evaluate a PDB
heuristic for it, the following happens:
1. StateRegistry::get_successor_state reserves space for packed state data and
copies the current packed state data there.
2. It then loops through the effects of the operator and checks for each one if
it fires. To do that, it unpacks each value individually. The effects are then
set in the new packed state individually as well.
3. The AxiomEvaluator is called on the new packed state data and again accesses
each value individually (read and write access).
4. A GlobalState object is created that points to the new packed state data.
5. A State object is created by accessing/unpacking each value individually and
storing the result in an int array.
6. The PDB accesses the values for variables in its pattern.
This is inefficient mostly in step 3. (evaluating axioms would be faster on
unpacked states) and step 5./6. (The PDB only needs a few values and should not
unpack the state).
Malte mentioned in msg4326:
> One possibility is to replace the current concrete State class with
> an abstract interface with unpacking and non-unpacking implementations.
Gabi currently also works on issue545 which will change the state class(es) once
again.
|
msg3741 (view) |
Author: malte |
Date: 2014-10-09.10:48:52 |
|
Sure, sounds good.
|
msg3728 (view) |
Author: florian |
Date: 2014-10-08.20:54:46 |
|
What do you suggest as a next step? You pointed out a problem in the code review
which accounts for some of the lost runtime, but it cannot account for the whole
difference. Should we have a more detailed look at the profile together?
|
msg3721 (view) |
Author: malte |
Date: 2014-10-08.17:40:59 |
|
It makes a certain amount of sense that this causes overhead, but it's
surprising that it's so large and that it varies so much from task to task
(little difference for some tasks, more than five times the runtime for others).
Perhaps there's something special about these airport tasks, such as a high rate
of duplicates.
|
msg3720 (view) |
Author: florian |
Date: 2014-10-08.17:34:34 |
|
I ran callgrind on airport/p09 but did not have a close look at the results. On
first glance it looks like pack/unpack just takes too much time. With the blind
heuristic, we unpack every value even though we do not look at it during heuristic calculation.
|
msg3717 (view) |
Author: florian |
Date: 2014-10-08.13:04:42 |
|
It might take me a while to get to this but I'll look at the profile for the
Airport tasks.
The code is in my bitbucket repository. Merging in the new default should have
updated the existing pull request, so it now shows the diff of the versions we
compared:
https://bitbucket.org/flogo/downward-issues-issue348/pull-request/1/always-unpack-state/diff
|
msg3713 (view) |
Author: malte |
Date: 2014-10-08.12:02:03 |
|
The slowdown in, for example, some Airport tasks with blind search is quite
large. Perhaps you can profile one of these? I'd be interested in having a look
at the code, ideally a comparison of the two versions compared in the
experiment. Can you set up a code review for this delta?
|
msg3700 (view) |
Author: florian |
Date: 2014-10-07.21:25:08 |
|
Results for the optimal suite are in (40MB):
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-v3-base-issue348-v3-compare.html
Coverage and memory look good, but everything seems to run a little slower, and
in some cases a lot slower. In the last reports we also compared to the code
before issue214 (packed states); this is not done here.
What do you think would be a good next step (more reports, experiments for sat, profiling, ...)?
|
msg3670 (view) |
Author: florian |
Date: 2014-10-06.12:30:49 |
|
I will merge in the new code and re-run the experiments.
|
msg3638 (view) |
Author: malte |
Date: 2014-10-04.19:49:35 |
|
It would be good to make progress on this issue -- I think it's an important one.
Can I do anything to help get this one moving again? Also, perhaps we should run
some new experiments that give before/after numbers against the newest code
version? The original experiments are now 4-5 months old, and we've had many
changes since then.
|
msg3407 (view) |
Author: johannes |
Date: 2014-09-03.13:23:02 |
|
This issue has implications for my numeric Fast Downward extension too. I
currently tag variables in preprocess with "constant" (contains the value of a
primitive numeric expression such as a road length), "derived" (result of a
numeric axiom), "instrumentation" (the variable never occurs in preconditions
and is therefore only relevant for computing the metric), and "state-relevant".
The state-relevant variables have to be packed, while states that only differ
in the instrumentation variables can be considered the same state. Constants
can be stored globally once (they never change), and the handling of derived
variables has several possible solutions. Following Florian, the current
solution would be to evaluate all axioms when unpacking the state and not store
them in the packed buffer (and also not evaluate them on the fly).
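A rough sketch of that classification and its storage consequences (the names
are hypothetical; the tags follow the message above):

    // Hypothetical tags for the four variable categories described above.
    enum class VariableTag {
        CONSTANT,        // e.g. a road length: store once globally, never pack
        DERIVED,         // result of a (numeric) axiom: recompute on unpacking
        INSTRUMENTATION, // only feeds the metric: irrelevant for state identity
        STATE_RELEVANT   // everything else: must go into the packed buffer
    };

    // Only state-relevant variables take up space in the packed buffer; two
    // states that differ only in instrumentation variables compare as equal.
    inline bool needs_packing(VariableTag tag) {
        return tag == VariableTag::STATE_RELEVANT;
    }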
|
msg3188 (view) |
Author: florian |
Date: 2014-05-23.16:55:50 |
|
We still have an open issue in the code; maybe you could have a look. It's the
TODO in state_registry.cc: the question is whether we can avoid copying the
state data by reserving uninitialized memory in the pool and then packing the
state directly into this memory.
(https://bitbucket.org/flogo/downward-issues-issue348/pull-request/1)
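The idea behind that TODO, as a sketch (hypothetical pool API; note that this
toy pool invalidates earlier pointers when it grows, which a segmented pool
would avoid):

    #include <vector>

    // Placeholder for the state packer's shift/mask write.
    inline void set_packed_value(unsigned int *buffer, int var, int value) {
        buffer[var] = static_cast<unsigned int>(value);
    }

    struct ToyPool {
        std::vector<unsigned int> data;
        // Reserve 'size' uninitialized words and return a pointer to them.
        unsigned int *reserve(int size) {
            data.resize(data.size() + size);
            return data.data() + data.size() - size;
        }
    };

    void insert_state(ToyPool &pool, const std::vector<int> &values,
                      int packed_size) {
        // Pack straight into pool memory instead of packing into a
        // temporary buffer that is then copied.
        unsigned int *buffer = pool.reserve(packed_size);
        for (int var = 0; var < static_cast<int>(values.size()); ++var)
            set_packed_value(buffer, var, values[var]);
        // The tricky part: if the state turns out to be a duplicate, the
        // reservation has to be rolled back or reused for the next state.
    }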
Apart from that, we wanted to have a closer look (i.e. profile) at iPDB. It
seems to be the config where the state packing has the largest negative effect.
|
msg3187 (view) |
Author: malte |
Date: 2014-05-23.15:55:06 |
|
Thanks! What's the next step?
|
msg3185 (view) |
Author: florian |
Date: 2014-05-23.15:22:07 |
|
I ran full experiments after Jendrik's review for optimal domains:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-v2-opt.html
The full results for satisficing domains are for the code version before the
review, but this should be fine because we mostly changed minor things in the
review:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-v1full-sat.html
Both reports show a lot of failed merge-and-shrink configs because the
configuration file in lab was out of date.
Finally, I tried to create a report that only shows runs with a significant
change in total time. It compares the code before issue214 to the code after
this issue. It excludes a run if it wasn't solved in both configs, or if it was
solved in both configs in under 0.5 seconds. The remaining runs are only
included if their runtime difference is at least 20%.
http://ai.cs.unibas.ch/_tmp_files/pommeren/notablechangesbyrevisionreport.html#total_time
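For concreteness, the filter amounts to something like the following sketch (in
particular, which runtime the 20% is measured against is my assumption):

    #include <algorithm>
    #include <cmath>

    // A run appears in the report iff this returns true.
    bool is_notable(bool solved_before, bool solved_after,
                    double time_before, double time_after) {
        if (!solved_before || !solved_after)
            return false;                 // not solved in both configs
        if (time_before < 0.5 && time_after < 0.5)
            return false;                 // both runs too fast to matter
        // At least 20% runtime difference, relative to the faster run.
        return std::abs(time_after - time_before)
               >= 0.2 * std::min(time_before, time_after);
    }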
|
msg3180 (view) |
Author: florian |
Date: 2014-05-21.17:53:47 |
|
Argh, yes! The reports had the same name and the second one replaced the first.
Looked ok in my browser because I checked the first before uploading the second.
Let me try again:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-v1-opt.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-v1-sat.html
|
msg3179 (view) |
Author: malte |
Date: 2014-05-21.17:49:49 |
|
Numbers look good! Did you want to give two different experiment links?
|
msg3178 (view) |
Author: florian |
Date: 2014-05-21.17:42:44 |
|
I did a first round of experiments for unpacking every state and invited
Jendrik to the following pull request:
https://bitbucket.org/flogo/downward-issues-issue348/pull-request/1
If anyone else wants to have a look, let me know.
The experiments only test the domains and configs Malte suggested in msg3049. We
should run a larger set of benchmarks before the final merge.
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-base-issue348-v1-compare.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue348-base-issue348-v1-compare.html
|
msg3092 (view) |
Author: florian |
Date: 2014-03-28.12:00:38 |
|
Some notes from our last meeting:
As a first step, we want to completely unpack the state everywhere outside the
state registry, i.e., only pack it for storage.
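In other words (a toy sketch with hypothetical names, not the real registry):
packing happens on the way into the registry, unpacking on the way out, and
nothing else ever sees packed data:

    #include <cstddef>
    #include <utility>
    #include <vector>

    class ToyRegistry {
        std::vector<std::vector<unsigned int>> packed_states;

        // Placeholders for the real shift/mask packing.
        static unsigned int pack_value(int value) {
            return static_cast<unsigned int>(value);
        }
        static int unpack_value(unsigned int word) {
            return static_cast<int>(word);
        }

    public:
        // Pack on the way in; returns an id usable for duplicate detection.
        int insert(const std::vector<int> &values) {
            std::vector<unsigned int> buffer(values.size());
            for (std::size_t var = 0; var < values.size(); ++var)
                buffer[var] = pack_value(values[var]);
            packed_states.push_back(std::move(buffer));
            return static_cast<int>(packed_states.size()) - 1;
        }

        // Unpack on the way out; callers never see the packed buffer.
        std::vector<int> lookup(int id) const {
            const std::vector<unsigned int> &buffer = packed_states[id];
            std::vector<int> values(buffer.size());
            for (std::size_t var = 0; var < buffer.size(); ++var)
                values[var] = unpack_value(buffer[var]);
            return values;
        }
    };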
As I said offline, I'll finish some other things first and will then work on
this issue if it is still open. I'll leave it unassigned until I actually have
time for it, in case anyone else wants to pick it up in the meantime.
|
msg3082 (view) |
Author: malte |
Date: 2014-03-24.13:32:26 |
|
Any volunteers to lead this? I'm happy to give advice regarding the design etc.
|
msg3065 (view) |
Author: malte |
Date: 2014-03-15.16:31:52 |
|
Andrew reports (in msg3061 and msg3063 of issue214) a nice speed improvement for
blind search in the tidybot domain by inlining the get/set methods for the
packed states. See msg3061 for details.
We should consider the question of inlining or not inlining in the context of
this issue, since
1) inlining speeds up access to packed states, making unpacking less necessary
2) inlining might be less beneficial if we unpack states in one go, since there
will be fewer calls to get/set.
My suspicion regarding 1) is that we will want to unpack even if we inline calls
to get/set. Regarding 2), we should measure and look at particularly bad
domains/configurations.
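For reference, get/set on a packed state boil down to a few shifts and masks
per variable, which is why the call overhead matters; a hypothetical sketch
(the real packer's data layout may differ):

    // Per-variable packing info, precomputed once for the task.
    struct PackedField {
        int word_index;     // which word of the buffer holds the variable
        int shift;          // bit position of the value within that word
        unsigned int mask;  // covers the value's bits (before shifting)
    };

    // Point 1) above: inlining lets the compiler fold these few
    // instructions into the caller, so each single access becomes cheap.
    inline int get_value(const unsigned int *buffer, const PackedField &f) {
        return static_cast<int>((buffer[f.word_index] >> f.shift) & f.mask);
    }

    inline void set_value(unsigned int *buffer, const PackedField &f,
                          int value) {
        unsigned int &word = buffer[f.word_index];
        word = (word & ~(f.mask << f.shift))
               | ((static_cast<unsigned int>(value) & f.mask) << f.shift);
    }

Point 2) above then follows: unpacking in one go replaces many such calls with
a single loop over all variables, so the per-call savings from inlining shrink.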
|
msg3049 (view) |
Author: malte |
Date: 2014-03-14.01:33:48 |
|
For these and the related issues on the state space, I suggest we identify a set
of domains and configurations that are particularly worth watching, and limit
our initial experiments to this smaller subset. (Or, if you prefer to run all
experiments on all benchmarks and configs, that's also fine, but I'd like to
additionally have result tables that only talk about the "interesting subset".)
I find it very hard to find problem cases in the 100 MB tables, since the
summaries don't nearly tell the whole story. For example, we know that for this
issue we mainly need to watch total time, and the total time summaries are
rather useless with a large set of configs because the set of commonly solved
instances becomes tiny. I would never have found h^CG + PSR-Large as a problem
case if I hadn't been looking for it specifically in the detailed tables.
Here are my candidates for configs to watch:
- A* + blind
- A* + h^2
- A* + ipdb
- eager_greedy + h^CG (presumably with preferred ops? whatever the config in our
satisficing experiment for issue214 was)
- LM-Cut
- BJOLP
(The last two are not really problem cases, but it's good to have something not
expected to be problematic to compare against.)
Here are my candidates for domains:
- tidybot (e.g. the opt variant)
- pegsol (e.g. opt-11, but I expect pegsol-2008 should work equally well)
- psr-large
There may be other critical cases I am missing. Ideally, I'd be interested in a
list of cases (config, instance) where total time is at least 0.5 seconds for
the old code and increases by at least 20% in the new code in the experiments of
issue214. Can someone prepare this, or if not, can someone make the properties
files for the two main experiments (opt and sat) from this issue available?
(Actually, it'd be nice to have the properties files either way.)
|
msg3043 (view) |
Author: jendrik |
Date: 2014-03-13.17:02:57 |
|
Rephrased title to make this issue more general. Originally this was only about the axiom evaluator.
Quoting Malte from issue214:
"""
There is one outstanding point ...: the
question if, when and where to unpack the state information. We already have
issue348 related to the axiom evaluator that discusses this point, but this
should not be limited to the axiom evaluator, as our results for heuristics that
access state information many times show.
I suggest we [...] deal with it soon, either straight
away or after issue420, which should be quick. We can take the information from
msg2994 and msg3004 to identify the most interesting configurations and
benchmarks for this, and msg3001 has some additional considerations regarding
IDA* etc. that might help guide the code design for the new issue.
"""
|
msg3041 (view) |
Author: jendrik |
Date: 2014-03-13.01:10:08 |
|
Other related things we should take care of:
1) The initial state representation contains unevaluated axiom values.
2) Update the comment above the declaration of g_initial_state_data in globals.h
once 1) is done.
|
msg3033 (view) |
Author: florian |
Date: 2014-03-06.16:10:08 |
|
Once we have finished issue214 and issue420, we should try to refactor the axiom
evaluator. States should be unpacked before axioms are evaluated. This would
mean that the axiom evaluator works on unpacked state data and we could move the
typedef for PackedStateEntry into the state class and make it private. We should
then also make sure that axioms are evaluated on the global variable holding the
initial state data. Currently this is done only when the initial state object is
created by the state registry, which is confusing because the global variable
name is not what it seems.
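A toy version of an evaluator that works on unpacked data only (hypothetical;
it ignores axiom layers and default values, which the real evaluator has to
handle):

    #include <utility>
    #include <vector>

    // A rule sets head_var to head_value once all (var, value) conditions hold.
    struct AxiomRule {
        std::vector<std::pair<int, int>> conditions;
        int head_var;
        int head_value;
    };

    // Naive fixpoint iteration over unpacked values; no packed-state access,
    // so PackedStateEntry could become a private detail of the state class.
    void evaluate_axioms(const std::vector<AxiomRule> &rules,
                         std::vector<int> &values) {
        bool changed = true;
        while (changed) {
            changed = false;
            for (const AxiomRule &rule : rules) {
                bool fires = true;
                for (const std::pair<int, int> &cond : rule.conditions)
                    if (values[cond.first] != cond.second)
                        fires = false;
                if (fires && values[rule.head_var] != rule.head_value) {
                    values[rule.head_var] = rule.head_value;
                    changed = true;
                }
            }
        }
    }

Calling something like this on the unpacked initial state data before it is
handed to the registry would also address the confusion mentioned above.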
|
msg2327 (view) |
Author: florian |
Date: 2012-08-30.13:28:12 |
|
From Malte's comment in Rietveld:
> The way that the axiom evaluator accesses the state at the moment will
probably need some additional thought at some point.
At the moment the evaluator accesses the State class directly and modifies its
state variables. In the patch for issue344 this is changed to a solution
where the evaluation works directly on the state_var_t array and is called in
the constructor for successor states and in the named constructor for the
initial state.
|