Issue22

Title: make translator deterministic
Priority: wish
Status: resolved
Superseder:
Nosy List: gabi, jendrik, malte
Assigned To: gabi
Keywords:
Optional summary:

Created on 2009-10-08.12:04:34 by silvia, last changed by malte.

Files
File name Uploaded Type
js-issue22-sat-preprocess-eval.tar.gz jendrik, 2012-05-15.00:08:07 application/x-gzip
output.sas.1 gabi, 2013-11-17.14:06:06 application/octet-stream
output.sas.2 gabi, 2013-11-17.14:06:31 application/octet-stream
Messages
msg2718 (view) Author: malte Date: 2013-11-22.21:24:58
I've moved this over to the separate issue398. For old enough issues, I'd prefer
to let them rest, even if we observe similar symptoms again later, mainly for
two reasons:

1) no need to read all the old messages again to see what the issue is about
2) easier to identify what relates to what in the repository (based on the
branch names)
msg2714 (view) Author: malte Date: 2013-11-21.15:06:05
An invariant candidate with a "heavy action" is discarded immediately, so I
think you're misinterpreting the causality here. It's *good* to find the heavy
actions that refute an invariant candidate as soon as possible.

Let's discuss the possible ways to address this offline.
msg2713 (view) Author: jendrik Date: 2013-11-21.14:59:22
We think that the invariant synthesis can be made deterministic by sorting the threatening actions of an invariant deterministically. If we don't do that, the synthesis
might find different groups in different runs. I tried to sort the actions by the key (name, parameters, precondition, effects, cost), but this introduced a performance
degradation in openstacks-strips. After some experiments we think that the degradation comes from the changed action order rather than from the cost
of reordering the actions. For different random action orderings, the number of heavy_actions that have to be generated and also the time to check them vary greatly:

openstacks-strips:p06.pddl
#heavy_actions ranges from ~688K to ~931K
time ranges from ~40s to ~54s
time = time for invariant synthesis (other times are almost 0)
time correlates with #heavy_actions
(numbers obtained from 9 runs)

Do you have an idea for a deterministic action ordering that yields a low number of heavy actions? If not, how should we go about this?
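
A minimal sketch of the kind of sort key we mean; the attribute names
(name, parameters, precondition, effects, cost) are assumptions about the
action objects, not the actual translator API:

def action_sort_key(action):
    # str() makes compound preconditions/effects comparable as text.
    return (action.name,
            tuple(action.parameters),
            str(action.precondition),
            str(action.effects),
            action.cost)

def sort_threatening_actions(actions):
    # Deterministic order no matter how the underlying set was built.
    return sorted(actions, key=action_sort_key)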
msg2707 (view) Author: gabi Date: 2013-11-20.09:25:42
There is some significant performance degradation in openstacks-strips.
I assume that the problem is that the operator descriptions in this domain are
already grounded, so there are lots of actions to be processed by the
invariance analysis.

Jendrik, I think we should try to catch this non-determinism earlier so that we
do not need to sort the threatening actions so often. Can you check whether the
difference first occurs in the actions_to_check set or already earlier, in the
sets of threatening actions for a predicate?

If someone is interested, here are the problematic translator times:

<task> <old translator time> -> <new translator time>

openstacks-strips/p06.pddl 40.6s -> 60.8s
openstacks-strips/p07.pddl 44.4s -> 60.8s
openstacks-strips/p08.pddl 56.8s -> 300.6s
openstacks-strips/p09.pddl 53.6s -> 300.6s
openstacks-strips/p10.pddl 41.0s -> 300.8s
openstacks-strips/p11.pddl 40.0s -> 300.7s
openstacks-strips/p12.pddl 45.3s -> 300.8s
openstacks-strips/p13.pddl 36.9s -> 300.8s
openstacks-strips/p14.pddl 41.0s -> 300.8s
openstacks-strips/p15.pddl 44.9s -> 300.8s
openstacks-strips/p16.pddl 40.3s -> 300.8s
openstacks-strips/p17.pddl 45.3s -> 301.0s
openstacks-strips/p18.pddl 45.6s -> 302.2s
openstacks-strips/p19.pddl 38.1s -> 301.2s
openstacks-strips/p20.pddl 38.4s -> 301.7s
openstacks-strips/p21.pddl 40.1s -> 301.7s
openstacks-strips/p22.pddl 38.7s -> 301.8s
openstacks-strips/p23.pddl 38.4s -> 301.8s
openstacks-strips/p24.pddl 51.6s -> 309.5s
openstacks-strips/p25.pddl 54.3s -> 308.9s
openstacks-strips/p26.pddl 42.9s -> 305.2s
openstacks-strips/p27.pddl 47.9s -> 312.2s
openstacks-strips/p28.pddl 116.7s -> 357.8s
openstacks-strips/p29.pddl 107.8s -> 352.1s
openstacks-strips/p30.pddl 62.5s -> 321.3s
msg2693 (view) Author: gabi Date: 2013-11-19.14:58:40
Jendrik fixed the issue. I am currently running experiments to make sure that
there is no negative performance impact.
msg2685 (view) Author: gabi Date: 2013-11-17.14:06:06
I encountered a few trucks-strips tasks where the translator output is not
deterministic. I attached two output files for p17.pddl. Of 10 translator runs,
8 led to output.sas.1 and 2 to output.sas.2.
msg2224 (view) Author: jendrik Date: 2012-05-22.00:28:43
There seem to be no objections, so I will go ahead and merge tomorrow.
msg2222 (view) Author: malte Date: 2012-05-18.12:48:17
Can you write a mail to downward-dev to discuss the rationale of the change and
the performance impact? If there are no objections, we can merge.
msg2214 (view) Author: jendrik Date: 2012-05-15.00:08:07
Here are the results for the translator experiment. How do we proceed?
msg2212 (view) Author: jendrik Date: 2012-05-14.12:38:52
The experiment showed that it takes too much time to sort axioms in
axiom_rules.py. It seems that those sorts aren't necessary though (the first
problem of each domain yields the same SAS+ file either way). I have just
started a new experiment.
msg2200 (view) Author: jendrik Date: 2012-05-10.16:05:03
Indeed, this fixes it. I'll run another experiment comparing the 
preprocess performance.
msg2199 (view) Author: malte Date: 2012-05-10.15:39:42
Does this fix it? A cleaner change would be to change the computation of
connected components instead. I haven't tested it, but adding

result[-1].sort()

in the while loop after the call to dfs(node) in graph.py:24 should do the trick.

--- a/src/translate/split_rules.py	Thu May 10 14:58:06 2012 +0200
+++ b/src/translate/split_rules.py	Thu May 10 15:35:35 2012 +0200
@@ -22,7 +22,7 @@
     for var, conds in var_to_conditions.items():
         for cond in conds[1:]:
             agraph.connect(conds[0], cond)
-    return agraph.connected_components()
+    return sorted(map(sorted, agraph.connected_components()))
 
 def project_rule(rule, conditions, name_generator):
     predicate = next(name_generator)
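
For comparison, a minimal self-contained sketch of the graph.py variant; this
is a simplified stand-in for src/translate/graph.py, not the actual file, and
it assumes the nodes are sortable:

class Graph(object):
    def __init__(self, nodes):
        self.nodes = nodes
        self.neighbours = dict((u, set()) for u in nodes)

    def connect(self, u, v):
        self.neighbours[u].add(v)
        self.neighbours[v].add(u)

    def connected_components(self):
        remaining_nodes = set(self.nodes)
        result = []

        def dfs(node):
            # Recursive for brevity; collects one component into result[-1].
            result[-1].append(node)
            remaining_nodes.remove(node)
            for neighbour in self.neighbours[node]:
                if neighbour in remaining_nodes:
                    dfs(neighbour)

        while remaining_nodes:
            node = next(iter(remaining_nodes))  # arbitrary start node
            result.append([])
            dfs(node)
            result[-1].sort()  # the suggested fix: sort each component
        return result

Note that result[-1].sort() pins down the order within each component; the
split_rules.py patch above additionally sorts the list of components, which
also pins down the order in which the components are returned.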
msg2198 (view) Author: jendrik Date: 2012-05-10.14:50:02
If you are in the issue22 branch, in misc/test-translator.py, comment 
out the lines containing

r'\d+ auxiliary atoms', r'\d+ final queue length',
                 r'\d+ total queue pushes'

to include those values in the diff and set DEBUG to True at the top of 
the file.

Then run misc/test-translator.py.

For a nicer view of the diff:

cd /tmp/sas-files/airport-adl-p01-airport1-p1.pddl
meld *
msg2197 (view) Author: malte Date: 2012-05-10.14:37:38
> But if the outcome of this part is always deterministic I think we can
> live with this non-determinism.

The result will be the same, but there may be huge performance differences. The
reason we are reporting these numbers is to have a chance to spot performance
issues. Can you give me an example where we get different results? Then I'll see
what can be done without digging too deeply.
msg2196 (view) Author: jendrik Date: 2012-05-10.14:25:03
As you said, it would make spotting non-deterministic code easier later. 
But if the outcome of this part is always deterministic I think we can 
live with this non-determinism.
msg2195 (view) Author: malte Date: 2012-05-10.14:14:17
> The log output is deterministic except for the auxiliary atoms and the
> queue pushes. My attempts at making them deterministic were futile.

Do you think it would be good to have those deterministic, too?
Then I would have a look.
msg2194 (view) Author: jendrik Date: 2012-05-10.12:15:03
I uploaded another patch set on Rietveld.

The log output is deterministic except for the auxiliary atoms and the 
queue pushes. My attempts at making them deterministic were futile.

The code now produces the same SAS files for all three Python versions
even if we turn on random seed hashing.
msg2192 (view) Author: malte Date: 2012-05-09.17:27:32
I would say that output-deterministic (in the sense of output.sas) is good
enough, although identical log output would have the advantage that it might be
easier to spot when we make changes that cause things to go wrong
performance-wise. So this might be even better.

What do you think?

(The code in question that produces this log output has always been
deterministic in terms of the final result it produces -- modulo bugs -- if this
result is treated as a set, i.e., we don't care about order. The differences
observed are due to differences in intermediate results.)
msg2190 (view) Author: jendrik Date: 2012-05-09.16:30:03
I am currently writing an extension to the translator test that checks, at
least, that all three Python versions produce the same results for the test
instances.

Some non-deterministic cases have already been fixed, but there's at 
least one peculiarity: For the first airport problem the output.sas 
files are identical but the log output is different:

2.6:
141 auxiliary atoms
311 final queue length
364 total queue pushes

2.7:
156 auxiliary atoms
326 final queue length
379 total queue pushes

3.2:
155 auxiliary atoms
325 final queue length
378 total queue pushes

The number of relevant atoms is 170 for all three Python versions.

Do we need to make that deterministic as well?
msg2183 (view) Author: malte Date: 2012-05-08.14:06:22
I think a decent way to test is to create a dictionary with entries of the form
sha_hash(domain_file_text), sha_hash(problem_file_text) ->
sha_hash(output_sas_text). Maybe also add the file names of the domain and
problem file as additional data somewhere for convenience. In addition to the
entries, we should also store on which planner revision the data was generated.

We could then have a script that allows us to populate such a dictionary from an
experiment (every now and then, we will need to repopulate due to intentional
changes to the translator) and a diff script (or better: have the data in a form
where regular diff gives us useful output).

Then we can, at regular intervals or on demand, compare if things are still as
they used to be, and if not, find out easily exactly on which tasks we have
changes. That would also be generally useful to find out the impact of intended
changes such as bug fixes.
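
A minimal sketch of what one such entry generator could look like (sha_hash as
named above; everything else, including the line format, is an assumption):

import hashlib

def sha_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def file_hash(path):
    with open(path) as f:
        return sha_hash(f.read())

def make_entry(domain_file, problem_file, output_sas_file, revision):
    # One line per task so that regular diff gives useful output; file
    # names and planner revision are kept as additional data.
    return "%s %s -> %s  # %s %s @ %s" % (
        file_hash(domain_file), file_hash(problem_file),
        file_hash(output_sas_file), domain_file, problem_file, revision)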

Regarding merging: I think the test code could be a separate issue.
msg2182 (view) Author: jendrik Date: 2012-05-08.13:55:03
I was striving for complete determinism regardless of the hardware and 
environment, but don't know a good way to test this. Shall I merge this 
already or should we try to come up with a test first?
msg2181 (view) Author: malte Date: 2012-05-08.13:49:21
OK, I think this is ready to merge. One question is how we can detect in the
future when the planner becomes nondeterministic again. In the long run, it
might be worth adding test cases that verify that output hashes don't change
(without good reason).

Which kind of determinism guarantees do we give, by the way? Do we want the code
to always give the same result independent of hardware and Python version, or do
we just go for reproducibility in the same environment?
msg2180 (view) Author: jendrik Date: 2012-05-08.13:25:02
I have added the score_* attributes to the tables.
msg2176 (view) Author: malte Date: 2012-05-07.16:00:16
The score_... attributes would be important (attributes like "total_time",
"expansions" etc. cannot be meaningfully interpreted as summary scores since
they ignore problems not solved by everyone).
msg2175 (view) Author: jendrik Date: 2012-05-07.15:09:46
The results for the optimal experiment are in:

http://www.informatik.uni-freiburg.de/~downward/js-issue22-opt-abs-d.html
http://www.informatik.uni-freiburg.de/~downward/js-issue22-opt-abs-p.html
msg2172 (view) Author: malte Date: 2012-05-03.01:23:30
> Automatically detecting what needs to be rerun seems tricky to me.

I guess there's no need to go there any time soon, but I don't think it would be
particularly difficult:

1) store hashes that identify the code revision
2) store hashes that identify the inputs
3) store the command-line options
4) store timeouts, memory limits and execution environment (e.g. queue used)

If all four match an old result, fetch the old result rather than recomputing.

Allow invalidating caches to account for things like hardware errors that were
fixed, new compiler versions that warrant rerunning identical code, and similar
cases.

(Not a priority of course.)
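
In code, the lookup could be as simple as this sketch (all names illustrative):

def cache_key(revision, input_hashes, options, limits):
    # The four ingredients above: (1) code revision, (2) inputs,
    # (3) command-line options, (4) timeout/memory limit/queue.
    return (revision, tuple(input_hashes), tuple(options), tuple(limits))

def fetch_or_run(cache, key, run_experiment):
    if key not in cache:               # recompute only on a cache miss
        cache[key] = run_experiment()
    return cache[key]

Invalidating a cache then just means dropping the affected keys.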
msg2170 (view) Author: jendrik Date: 2012-05-02.21:35:02
The optimal experiment has been started.

No, I don't know an easy way to find out this limit. I have made it part of
the GkiGrid class, though, so we are warned if we use that class for
experiments that are too big.
> Long-term, we should think of a way to avoid rerunning unnecessary experiments.
> For example, if later we want to evaluate issue214, it would be a pity if we had
> to reproduce baseline experiments if we already have all the data we need from
> this experiment (i.e. the issue22 one).
I think the easiest way here would be to run the baseline experiment 
separately and store it somewhere. With lab it's very easy to combine 
those results with a new experiment for analysis. Of course it's also 
easy to extract the baseline results from the issue22 experiment. 
Automatically detecting what needs to be rerun seems tricky to me.
msg2168 (view) Author: malte Date: 2012-05-02.18:51:12
If we run this in Basel, the limit may be 75000, lower, or higher. I think this
is an arbitrary setting that is configured by the queue admin. Hence, I wouldn't
be too worried about it and would start the process by running the optimal
experiment.

Do you know of an easy way to find out this limit?

Long-term, we should think of a way to avoid rerunning unnecessary experiments.
For example, if later we want to evaluate issue214, it would be a pity if we had
to reproduce baseline experiments if we already have all the data we need from
this experiment (i.e. the issue22 one).
msg2164 (view) Author: jendrik Date: 2012-05-02.15:00:02
The task numbers already account for the fact that we have to run each 
config twice on each task (default and issue22 branch).
msg2163 (view) Author: jendrik Date: 2012-05-02.14:55:02
I have separated this into two experiments.

opt: 7 configs, 2792 tasks -> 19544 -> maybe ok.
sat: 25 configs, 4504 tasks -> 112600 -> too much

What do you suggest?

Sure, we can set a low grid priority in the experiment scripts.
msg2158 (view) Author: malte Date: 2012-05-01.20:41:57
We need separate experiments for the satisficing and optimal configurations. The
optimal configurations (apart from blind search) don't work for domains that
have nontrivial conditional effects or axioms. Moreover, there is no need to
test the satisficing configs on the "opt" domains from 2008 and 2011.

I think the correct suites are "OPTIMAL_WITH_IPC11" for the optimal
configurations and a to-be-defined "SATISFICING_WITH_IPC11" suite that should
look like the current "ALL" but changing the line
    domains += suite_ipc08_all() + suite_ipc11_all()
to
    domains += suite_ipc08_sat() + suite_ipc11_sat()

It would be interesting to test blind search in both the satisficing and the
optimal experiment.

What are the numbers for these two experiments (no. configs, no. tasks)?

Shorter time limits are OK for some experiments of this kind, but not for
others. I suggest we make it possible to use different timeouts (e.g. 60
seconds, 300 seconds, 1800 seconds). For some experiments, such as the one for
issue214, we need the full 1800 seconds to see the desired effect.

It would be good to run such very large experiments on the grid in Basel rather
than the one in Freiburg (but of course it's fine to use the grid in Freiburg to
set things up and test things out, especially with a short timeout).

Can we set a low grid priority in the experiment scripts? Then these
large-scale experiments should probably default to a low priority (e.g. -3).
msg2157 (view) Author: jendrik Date: 2012-05-01.20:01:45
If we take all tuned sat configs plus nine IPC configs, that makes 30 configs.
Combined with the complete suite, that results in more than the maximum of 75000
tasks per job. I will therefore select only some of the sat configs and run the
experiment on the IPC_ALL suite (sat and opt configs) with a time limit of 900
seconds, OK?
msg2137 (view) Author: malte Date: 2012-04-27.18:49:41
*loose* definition, not *lose* definition
msg2136 (view) Author: malte Date: 2012-04-27.18:49:19
The main thing that is needed to merge the change would be a before/after
comparison that shows

1) that translator performance is not severely affected

2) that the main search configurations perform equally well after the change.

A lose definition of 2) would be "the configurations used at IPC 2012",
including those used as part of a portfolio. (Maybe instead of the sequential
satisficing portfolio configs we can use the configs you tuned in your Karl
Steinbuch project.)

More generally speaking, I think we need a test of this kind for all sorts of
global changes, so it would be great to develop one.
msg2135 (view) Author: jendrik Date: 2012-04-27.17:44:27
I have uploaded another patch set to http://codereview.appspot.com/3438045/. It
compares my issue22 branch with master. It is, however, difficult to tell whether
those changes are necessary and sufficient.
msg2104 (view) Author: jendrik Date: 2012-04-13.17:50:03
You're right, using addresses for hashing is still not deterministic.
msg2099 (view) Author: malte Date: 2012-04-13.16:21:13
Randomization in the hash function is only one aspect.
Another one is using addresses for hashing.
Have you tested that we get deterministic results with this?
msg2098 (view) Author: jendrik Date: 2012-04-13.14:40:51
It seems the recent addition of the -R command-line flag and the PYTHONHASHSEED
environment variable for Python makes testing for determinism easier:
http://mail.python.org/pipermail/python-announce-list/2012-April/009420.html

See this example program:

d = {}

for i in range(10):
    d[str(i)] = 1

for key, val in d.items():
    print key,


$ python python-hash-random.py
8 9 0 1 2 3 4 5 6 7
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
8 9 0 1 2 3 4 5 6 7
jendrik@SAM:~/projects/Tests$ export PYTHONHASHSEED=random
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 1 0 3 2 5 4 7 6
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 3 2 1 0 7 6 5 4
jendrik@SAM:~/projects/Tests$ export PYTHONHASHSEED=54321
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 7 6 5 4 3 2 1 0
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 7 6 5 4 3 2 1 0

This raises the question of whether it wouldn't be easier and faster (both to
implement and at run time) to just set the PYTHONHASHSEED variable to achieve
determinism.
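
For example, a wrapper could pin the seed before invoking the translator; the
script name and arguments here are illustrative:

import os
import subprocess

env = dict(os.environ)
env["PYTHONHASHSEED"] = "0"  # a fixed seed makes string hashing reproducible
subprocess.check_call(
    ["python", "translate.py", "domain.pddl", "problem.pddl"], env=env)

As noted above, this only addresses hash randomization, not hashing by id.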
msg1926 (view) Author: jendrik Date: 2011-11-11.14:30:06
I am currently focusing on the scripts, but will see if I can give you
something to review concerning the determinism of the translator soonish.
msg1893 (view) Author: malte Date: 2011-11-08.12:48:17
This has been stalled since January. Can I do anything to make it move?
msg1232 (view) Author: jendrik Date: 2011-01-27.03:38:54
Just some notes: 

Sort the results of the following parts:
instantiate.explore
fact_groups.compute_groups
translate_task
msg785 (view) Author: malte Date: 2010-12-07.21:48:27
> It would be good if you could tell me if sorting makes sense there before I
> look into how to sort these instances.

Can you use upload.py to get this into Rietveld?

> I looked for occurrences of (iter)items(), (iter)values() and (iter)keys()
> instead of dict/{} because this is the place where nondeterminism occurs,
> right? 

Not the only place -- all iterations over dicts are problematic, including "for
x in mydict" and things like "mylist = list(mydict)".
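
For example (with a made-up dict):

mydict = {"b": 1, "a": 2}

for x in mydict:         # order comes from the hash table, not the program
    pass
mylist = list(mydict)    # the same hidden iteration

mylist = sorted(mydict)  # fix: impose an explicit order on the keys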
msg779 (view) Author: jendrik Date: 2010-12-07.17:44:45
This is the current state:
I looked at the fd-temporal log and added the changes made there that seemed
reasonable.
I looked for occurrences of set/frozenset and marked the places. It would be good
if you could tell me if sorting makes sense there before I look into how to sort
these instances.
I looked for occurrences of (iter)items(), (iter)values() and (iter)keys()
instead of dict/{} because this is the place where nondeterminism occurs, right?
Then I added some sorting code for those occurrences.

I haven't looked at "sorting without defining a suitable __cmp__ method" yet.
msg758 (view) Author: malte Date: 2010-11-30.03:15:59
Sounds good. The three most common sources of nondeterminism are:

 * sets (search for set/frozenset)
 * dicts (search for dict/"{")
 * sorting without defining a suitable __cmp__ method (search for sort/sorted)

(Actually, sets and dicts should also produce reproducible, though arbitrary and
Python-version-dependent results unless their hash method is itself
nondeterministic, e.g. id-dependent. So I would expect that the root cause for
more or less all existing nondeterminism would be related to hashing or
comparing by id.)
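
A small example of the id-dependent case (the Atom class is made up for
illustration):

class Atom(object):
    def __init__(self, name):
        self.name = name
    # Without __eq__/__hash__, hashing falls back to the object's id
    # (its memory address), so set order can differ from run to run.

atoms = set(Atom(name) for name in "abc")
print([atom.name for atom in atoms])         # may vary between runs
print(sorted(atom.name for atom in atoms))   # always ['a', 'b', 'c']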
msg757 (view) Author: jendrik Date: 2010-11-30.02:01:44
I have looked at the temporal fd log and found two changesets that speak about 
making the translator deterministic. The first one has been incorporated into 
branch issue22. Additionally I have scanned the translator dir for occurrences of
set() and marked the ones where the ordering of a set might be crucial for 
making the translator deterministic. I thought this way you could provide me 
with some feedback first, before I actually sort the sets and look for dicts 
too. You find the branch on bitbucket/digitaldump/downward.

A good testcase might be trucks-strips, especially p18.pddl since the translator 
frequently produces different results there.
msg698 (view) Author: malte Date: 2010-11-08.16:03:20
When this is done, we should contact Jörg (who asked about this).
History
Date User Action Args
2013-11-22 21:24:58 malte set status: testing -> resolved
    messages: + msg2718
2013-11-21 15:06:05 malte set messages: + msg2714
2013-11-21 14:59:22 jendrik set messages: + msg2713
2013-11-20 09:25:42 gabi set messages: + msg2707
2013-11-19 14:58:40 gabi set status: chatting -> testing
    assignedto: jendrik -> gabi
    messages: + msg2693
2013-11-17 14:06:31 gabi set files: + output.sas.2
2013-11-17 14:06:06 gabi set files: + output.sas.1
    status: resolved -> chatting
    messages: + msg2685
    nosy: + gabi
2012-05-22 12:06:35 jendrik set status: in-progress -> resolved
2012-05-22 00:28:43 jendrik set messages: + msg2224
2012-05-18 12:48:18 malte set messages: + msg2222
2012-05-15 00:08:08 jendrik set files: + js-issue22-sat-preprocess-eval.tar.gz
    messages: + msg2214
2012-05-14 12:38:52 jendrik set messages: + msg2212
2012-05-10 16:05:03 jendrik set messages: + msg2200
2012-05-10 15:39:42 malte set messages: + msg2199
2012-05-10 14:50:02 jendrik set messages: + msg2198
2012-05-10 14:37:38 malte set messages: + msg2197
2012-05-10 14:25:03 jendrik set messages: + msg2196
2012-05-10 14:14:17 malte set messages: + msg2195
2012-05-10 12:15:03 jendrik set messages: + msg2194
2012-05-09 17:27:33 malte set messages: + msg2192
2012-05-09 16:30:03 jendrik set messages: + msg2190
2012-05-08 14:06:22 malte set messages: + msg2183
2012-05-08 13:55:03 jendrik set messages: + msg2182
2012-05-08 13:49:21 malte set messages: + msg2181
2012-05-08 13:25:02 jendrik set messages: + msg2180
2012-05-07 16:00:16 malte set messages: + msg2176
2012-05-07 15:09:47 jendrik set messages: + msg2175
2012-05-03 01:23:30 malte set messages: + msg2172
2012-05-02 21:35:03 jendrik set messages: + msg2170
2012-05-02 18:51:12 malte set messages: + msg2168
2012-05-02 15:00:03 jendrik set messages: + msg2164
2012-05-02 14:55:02 jendrik set messages: + msg2163
2012-05-01 20:41:57 malte set messages: + msg2158
2012-05-01 20:01:45 jendrik set messages: + msg2157
2012-04-27 18:49:41 malte set messages: + msg2137
2012-04-27 18:49:19 malte set messages: + msg2136
2012-04-27 17:44:27 jendrik set messages: + msg2135
2012-04-13 17:50:03 jendrik set messages: + msg2104
2012-04-13 16:21:13 malte set messages: + msg2099
2012-04-13 14:40:51 jendrik set messages: + msg2098
2011-11-11 14:30:06 jendrik set messages: + msg1926
2011-11-08 12:48:18 malte set messages: + msg1893
2011-01-27 03:38:54 jendrik set messages: + msg1232
2010-12-07 21:48:27 malte set messages: + msg785
2010-12-07 17:44:46 jendrik set status: chatting -> in-progress
    messages: + msg779
2010-11-30 03:15:59 malte set messages: + msg758
2010-11-30 02:01:44 jendrik set messages: + msg757
2010-11-08 16:03:20 malte set assignedto: jendrik
    messages: + msg698
    nosy: + jendrik
2009-10-11 15:47:33 malte set nosy: + malte
2009-10-08 12:04:34 silvia create