Issue22

Title: make translator deterministic
Priority: wish
Status: resolved
Superseder:
Nosy List: gabi, jendrik, malte
Assigned To: gabi
Keywords:
Optional summary:

Created on 2009-10-08.12:04:34 by silvia, last changed by malte.

Files
File name Uploaded Type
js-issue22-sat-preprocess-eval.tar.gz jendrik, 2012-05-15.00:08:07 application/x-gzip
output.sas.1 gabi, 2013-11-17.14:06:06 application/octet-stream
output.sas.2 gabi, 2013-11-17.14:06:31 application/octet-stream
Messages
msg2718 (view) Author: malte Date: 2013-11-22.21:24:58
I've moved this over to the separate issue398. For old enough issues, I'd prefer
to let them rest, even if we observe similar symptoms again later, mainly for
two reasons:

1) no need to read all the old messages again to see what the issue is about
2) easier to identify what relates to what in the repository (based on the
branch names)
msg2714 (view) Author: malte Date: 2013-11-21.15:06:05
An invariant candidate with a "heavy action" is discarded immediately, so I
think you're misinterpreting the causality here. It's *good* to find the heavy
actions that refute an invariant candidate as soon as possible.

Let's discuss the possible ways to address this offline.
msg2713 (view) Author: jendrik Date: 2013-11-21.14:59:22
We think that the invariant synthesis can be made deterministic by sorting the threatening actions of an invariant deterministically. If we don't do that, the synthesis
might find different groups in different runs. I tried to sort the actions by the key (name, parameters, precondition, effects, cost), but this introduced a performance
degradation in openstacks-strips. After some experiments we think that the degradation comes from the changed action order rather than from the cost
of reordering the actions. For different random action orderings, the number of heavy_actions that have to be generated and also the time to check them vary greatly:

openstacks-strips:p06.pddl
#heavy_actions ranges from ~688K to ~931K
time ranges from ~40s to ~54s
time = time for invariant synthesis (other times are almost 0)
time correlates with #heavy_actions
(numbers obtained from 9 runs)

Do you have an idea for a deterministic action ordering that yields a low number of heavy actions? If not, how should we go about this?
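
A minimal sketch of the kind of sort key we mean; the attribute names
(name, parameters, precondition, effects, cost) are assumptions about the
action objects, not the actual translator API:

def action_sort_key(action):
    # str() makes compound preconditions/effects comparable as text.
    return (action.name,
            tuple(action.parameters),
            str(action.precondition),
            str(action.effects),
            action.cost)

def sort_threatening_actions(actions):
    # Deterministic order no matter how the underlying set was built.
    return sorted(actions, key=action_sort_key)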
msg2707 (view) Author: gabi Date: 2013-11-20.09:25:42
There is some significant performance degradation in openstacks-strips.
I assume that the problem is that the operator descriptions in this domain are
already grounded, so there are lots of actions to be processed by the
invariance analysis.

Jendrik, I think we should try to catch this non-determinism earlier so that we
do not need to sort the threatening actions so often. Can you check whether the
difference first occurs in the actions_to_check set or already earlier, in the
sets of threatening actions for a predicate?

If someone is interested, here are the problematic translator times:

<task> <old translator time> -> <new translator time>

openstacks-strips/p06.pddl 40.6s -> 60.8s
openstacks-strips/p07.pddl 44.4s -> 60.8s
openstacks-strips/p08.pddl 56.8s -> 300.6s
openstacks-strips/p09.pddl 53.6s -> 300.6s
openstacks-strips/p10.pddl 41.0s -> 300.8s
openstacks-strips/p11.pddl 40.0s -> 300.7s
openstacks-strips/p12.pddl 45.3s -> 300.8s
openstacks-strips/p13.pddl 36.9s -> 300.8s
openstacks-strips/p14.pddl 41.0s -> 300.8s
openstacks-strips/p15.pddl 44.9s -> 300.8s
openstacks-strips/p16.pddl 40.3s -> 300.8s
openstacks-strips/p17.pddl 45.3s -> 301.0s
openstacks-strips/p18.pddl 45.6s -> 302.2s
openstacks-strips/p19.pddl 38.1s -> 301.2s
openstacks-strips/p20.pddl 38.4s -> 301.7s
openstacks-strips/p21.pddl 40.1s -> 301.7s
openstacks-strips/p22.pddl 38.7s -> 301.8s
openstacks-strips/p23.pddl 38.4s -> 301.8s
openstacks-strips/p24.pddl 51.6s -> 309.5s
openstacks-strips/p25.pddl 54.3s -> 308.9s
openstacks-strips/p26.pddl 42.9s -> 305.2s
openstacks-strips/p27.pddl 47.9s -> 312.2s
openstacks-strips/p28.pddl 116.7s -> 357.8s
openstacks-strips/p29.pddl 107.8s -> 352.1s
openstacks-strips/p30.pddl 62.5s -> 321.3s
msg2693 (view) Author: gabi Date: 2013-11-19.14:58:40
Jendrik fixed the issue. I am currently running experiments to make sure that
there is no negative performance impact.
msg2685 (view) Author: gabi Date: 2013-11-17.14:06:06
I encountered a few trucks-strips tasks where the translator output is not
deterministic. I attached two output files for p17.pddl. Of 10 translator runs,
8 led to output.sas.1 and 2 to output.sas.2.
msg2224 (view) Author: jendrik Date: 2012-05-22.00:28:43
There seem to be no objections, so I will go ahead and merge tomorrow.
msg2222 (view) Author: malte Date: 2012-05-18.12:48:17
Can you write a mail to downward-dev to discuss the rationale of the change and
the performance impact? If there are no objections, we can merge.
msg2214 (view) Author: jendrik Date: 2012-05-15.00:08:07
Here are the results for the translator experiment. How do we proceed?
msg2212 (view) Author: jendrik Date: 2012-05-14.12:38:52
The experiment showed that it takes too much time to sort axioms in
axiom_rules.py. It seems that those sorts aren't necessary though (the first
problem of each domain yields the same SAS+ file either way). I have just
started a new experiment.
msg2200 (view) Author: jendrik Date: 2012-05-10.16:05:03
Indeed, this fixes it. I'll run another experiment comparing the 
preprocess performance.
msg2199 (view) Author: malte Date: 2012-05-10.15:39:42
Does this fix it? A cleaner change would be to change the computation of
connected components instead. I haven't tested it, but adding

result[-1].sort()

in the while loop after the call to dfs(node) in graph.py:24 should do the trick.

--- a/src/translate/split_rules.py	Thu May 10 14:58:06 2012 +0200
+++ b/src/translate/split_rules.py	Thu May 10 15:35:35 2012 +0200
@@ -22,7 +22,7 @@
     for var, conds in var_to_conditions.items():
         for cond in conds[1:]:
             agraph.connect(conds[0], cond)
-    return agraph.connected_components()
+    return sorted(map(sorted, agraph.connected_components()))
 
 def project_rule(rule, conditions, name_generator):
     predicate = next(name_generator)
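
For comparison, a minimal self-contained sketch of the graph.py variant; this
is a simplified stand-in for src/translate/graph.py, not the actual file, and
it assumes the nodes are sortable:

class Graph(object):
    def __init__(self, nodes):
        self.nodes = nodes
        self.neighbours = dict((u, set()) for u in nodes)

    def connect(self, u, v):
        self.neighbours[u].add(v)
        self.neighbours[v].add(u)

    def connected_components(self):
        remaining_nodes = set(self.nodes)
        result = []

        def dfs(node):
            # Recursive for brevity; collects one component into result[-1].
            result[-1].append(node)
            remaining_nodes.remove(node)
            for neighbour in self.neighbours[node]:
                if neighbour in remaining_nodes:
                    dfs(neighbour)

        while remaining_nodes:
            node = next(iter(remaining_nodes))  # arbitrary start node
            result.append([])
            dfs(node)
            result[-1].sort()  # the suggested fix: sort each component
        return result

Note that result[-1].sort() pins down the order within each component; the
split_rules.py patch above additionally sorts the list of components, which
also pins down the order in which the components are returned.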
msg2198 (view) Author: jendrik Date: 2012-05-10.14:50:02
If you are in the issue22 branch, in misc/test-translator.py, comment 
out the lines containing

r'\d+ auxiliary atoms', r'\d+ final queue length',
                 r'\d+ total queue pushes'

to include those values in the diff and set DEBUG to True at the top of 
the file.

Then run misc/test-translator.py.

For a nicer view of the diff:

cd /tmp/sas-files/airport-adl-p01-airport1-p1.pddl
meld *
msg2197 (view) Author: malte Date: 2012-05-10.14:37:38
> But if the outcome of this part is always deterministic I think we can
> live with this non-determinism.

The result will be the same, but there may be huge performance differences. The
reason we are reporting these numbers is to have a chance to spot performance
issues. Can you give me an example where we get different results? Then I'll see
what can be done without digging too deeply.
msg2196 (view) Author: jendrik Date: 2012-05-10.14:25:03
As you said, it would make spotting non-deterministic code easier later. 
But if the outcome of this part is always deterministic I think we can 
live with this non-determinism.
msg2195 (view) Author: malte Date: 2012-05-10.14:14:17
> The log output is deterministic except for the auxiliary atoms and the
> queue pushes. My attempts at making them deterministic were futile.

Do you think it would be good to have those deterministic, too?
Then I would have a look.
msg2194 (view) Author: jendrik Date: 2012-05-10.12:15:03
I uploaded another patch set on Rietveld.

The log output is deterministic except for the auxiliary atoms and the 
queue pushes. My attempts at making them deterministic were futile.

The code now produces the same SAS files for all three Python versions
even if we turn on random seed hashing.
msg2192 (view) Author: malte Date: 2012-05-09.17:27:32
I would say that output-deterministic (in the sense of output.sas) is good
enough, although identical log output would have the advantage that it might be
easier to spot when we make changes that cause things to go wrong
performance-wise. So this might be even better.

What do you think?

(The code in question that produces this log output has always been
deterministic in terms of the final result it produces -- modulo bugs -- if this
result is treated as a set, i.e., we don't care about order. The differences
observed are due to differences in intermediate results.)
msg2190 (view) Author: jendrik Date: 2012-05-09.16:30:03
I am currently writing an extension to the translator test that checks, at
least, that all three Python versions produce the same results for the test
instances.

Some non-deterministic cases have already been fixed, but there's at 
least one peculiarity: For the first airport problem the output.sas 
files are identical but the log output is different:

2.6:
141 auxiliary atoms
311 final queue length
364 total queue pushes

2.7:
156 auxiliary atoms
326 final queue length
379 total queue pushes

3.2:
155 auxiliary atoms
325 final queue length
378 total queue pushes

The number of relevant atoms is 170 for all three Python versions.

Do we need to make that deterministic as well?
msg2183 (view) Author: malte Date: 2012-05-08.14:06:22
I think a decent way to test is to create a dictionary with entries of the form
sha_hash(domain_file_text), sha_hash(problem_file_text) ->
sha_hash(output_sas_text). Maybe also add the file names of the domain and
problem file as additional data somewhere for convenience. In addition to the
entries, we should also store on which planner revision the data was generated.

We could then have a script that allows us to populate such a dictionary from an
experiment (every now and then, we will need to repopulate due to intentional
changes to the translator) and a diff script (or better: have the data in a form
where regular diff gives us useful output).

Then we can, at regular intervals or on demand, compare if things are still as
they used to be, and if not, find out easily exactly on which tasks we have
changes. That would also be generally useful to find out the impact of intended
changes such as bug fixes.
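
A minimal sketch of what one such entry generator could look like (sha_hash as
named above; everything else, including the line format, is an assumption):

import hashlib

def sha_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def file_hash(path):
    with open(path) as f:
        return sha_hash(f.read())

def make_entry(domain_file, problem_file, output_sas_file, revision):
    # One line per task so that regular diff gives useful output; file
    # names and planner revision are kept as additional data.
    return "%s %s -> %s  # %s %s @ %s" % (
        file_hash(domain_file), file_hash(problem_file),
        file_hash(output_sas_file), domain_file, problem_file, revision)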

Regarding merging: I think the test code could be a separate issue.
msg2182 (view) Author: jendrik Date: 2012-05-08.13:55:03
I was striving for complete determinism regardless of the hardware and 
environment, but don't know a good way to test this. Shall I merge this 
already or should we try to come up with a test first?
msg2181 (view) Author: malte Date: 2012-05-08.13:49:21
OK, I think this is ready to merge. One question is how we can detect in the
future when the planner becomes nondeterministic again. In the long run, it
might be worth adding test cases that verify that output hashes don't change
(without good reason).

Which kind of determinism guarantees do we give, by the way? Do we want the code
to always give the same result independent of hardware and Python version, or do
we just go for reproducibility in the same environment?
msg2180 (view) Author: jendrik Date: 2012-05-08.13:25:02
I have added the score_* attributes to the tables.
msg2176 (view) Author: malte Date: 2012-05-07.16:00:16
The score_... attributes would be important (attributes like "total_time",
"expansions" etc. cannot be meaningfully interpreted as summary scores since
they ignore problems not solved by everyone).
msg2175 (view) Author: jendrik Date: 2012-05-07.15:09:46
The results for the optimal experiment are in:

http://www.informatik.uni-freiburg.de/~downward/js-issue22-opt-abs-d.html
http://www.informatik.uni-freiburg.de/~downward/js-issue22-opt-abs-p.html
msg2172 (view) Author: malte Date: 2012-05-03.01:23:30
> Automatically detecting what needs to be rerun seems tricky to me.

I guess there's no need to go there any time soon, but I don't think it would be
particularly difficult:

1) store hashes that identify the code revision
2) store hashes that identify the inputs
3) store the command-line options
4) store timeouts, memory limits and execution environment (e.g. queue used)

If all four match an old result, fetch the old result rather than recomputing.

Allow invalidating caches to account for things like hardware errors that were
fixed, new compiler versions that warrant rerunning identical code, and similar
cases.

(Not a priority of course.)
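
In code, the lookup could be as simple as this sketch (all names illustrative):

def cache_key(revision, input_hashes, options, limits):
    # The four ingredients above: (1) code revision, (2) inputs,
    # (3) command-line options, (4) timeout/memory limit/queue.
    return (revision, tuple(input_hashes), tuple(options), tuple(limits))

def fetch_or_run(cache, key, run_experiment):
    if key not in cache:               # recompute only on a cache miss
        cache[key] = run_experiment()
    return cache[key]

Invalidating a cache then just means dropping the affected keys.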
msg2170 (view) Author: jendrik Date: 2012-05-02.21:35:02
The optimal experiment has been started.

No, I don't know an easy way to find out this limit. I have made it part of
the GkiGrid class, though, so we are warned if we use that class for
experiments that are too big.
> Long-term, we should think of a way to avoid rerunning unnecessary experiments.
> For example, if later we want to evaluate issue214, it would be a pity if we had
> to reproduce baseline experiments if we already have all the data we need from
> this experiment (i.e. the issue22 one).
I think the easiest way here would be to run the baseline experiment 
separately and store it somewhere. With lab it's very easy to combine 
those results with a new experiment for analysis. Of course it's also 
easy to extract the baseline results from the issue22 experiment. 
Automatically detecting what needs to be rerun seems tricky to me.
msg2168 (view) Author: malte Date: 2012-05-02.18:51:12
If we run this in Basel, the limit may be 75000, lower, or higher. I think this
is an arbitrary setting that is configured by the queue admin. Hence, I wouldn't
be too worried about it and would start the process by running the optimal
experiment.

Do you know of an easy way to find out this limit?

Long-term, we should think of a way to avoid rerunning unnecessary experiments.
For example, if later we want to evaluate issue214, it would be a pity if we had
to reproduce baseline experiments if we already have all the data we need from
this experiment (i.e. the issue22 one).
msg2164 (view) Author: jendrik Date: 2012-05-02.15:00:02
The task numbers already account for the fact that we have to run each 
config twice on each task (default and issue22 branch).
msg2163 (view) Author: jendrik Date: 2012-05-02.14:55:02
I have separated this into two experiments.

opt: 7 configs, 2792 tasks -> 19544 -> maybe ok.
sat: 25 configs, 4504 tasks -> 112600 -> too much

What do you suggest?

Sure, we can set a low grid priority in the experiment scripts.
msg2158 (view) Author: malte Date: 2012-05-01.20:41:57
We need separate experiments for the satisficing and optimal configurations. The
optimal configurations (apart from blind search) don't work for domains that
have nontrivial conditional effects or axioms. Moreover, there is no need to
test the satisficing configs on the "opt" domains from 2008 and 2011.

I think the correct suites are "OPTIMAL_WITH_IPC11" for the optimal
configurations and a to-be-defined "SATISFICING_WITH_IPC11" suite that should
look like the current "ALL" but changing the line
    domains += suite_ipc08_all() + suite_ipc11_all()
to
    domains += suite_ipc08_sat() + suite_ipc11_sat()

It would be interesting to test blind search in both the satisficing and the
optimal experiment.

What are the numbers for these two experiments (no. configs, no. tasks)?

Shorter time limits are OK for some experiments of this kind, but not for
others. I suggest we make it possible to use different timeouts (e.g. 60
seconds, 300 seconds, 1800 seconds). For some experiments, such as the one for
issue214, we need the full 1800 seconds to see the desired effect.

It would be good to run such very large experiments on the grid in Basel rather
than the one in Freiburg (but of course it's fine to use the grid in Freiburg to
set things up and test things out, especially with a short timeout).

Can we set a low grid priority in the experiment scripts? Then these
large-scale experiments should probably default to a low priority (e.g. -3).
msg2157 (view) Author: jendrik Date: 2012-05-01.20:01:45
If we take all tuned sat configs plus nine IPC configs, that makes 30 configs.
Combined with the complete suite, that results in more than the maximum of 75000
tasks per job. I will therefore select only some of the sat configs and run the
experiment on the IPC_ALL suite (sat and opt configs) with a time limit of 900
seconds, OK?
msg2137 (view) Author: malte Date: 2012-04-27.18:49:41
*loose* definition, not *lose* definition
msg2136 (view) Author: malte Date: 2012-04-27.18:49:19
The main thing that is needed to merge the change would be a before/after
comparison that shows

1) that translator performance is not severely affected

2) that the main search configurations perform equally well after the change.

A lose definition of 2) would be "the configurations used at IPC 2012",
including those used as part of a portfolio. (Maybe instead of the sequential
satisficing portfolio configs we can use the configs you tuned in your Karl
Steinbuch project.)

More generally speaking, I think we need a test of this kind for all sorts of
global changes, so it would be great to develop one.
msg2135 (view) Author: jendrik Date: 2012-04-27.17:44:27
I have uploaded another patch set to http://codereview.appspot.com/3438045/. It
compares my issue22 branch with master. It is, however, difficult to tell whether
those changes are necessary and sufficient.
msg2104 (view) Author: jendrik Date: 2012-04-13.17:50:03
You're right, using addresses for hashing is still not deterministic.
msg2099 (view) Author: malte Date: 2012-04-13.16:21:13
Randomization in the hash function is only one aspect.
Another one is using addresses for hashing.
Have you tested that we get deterministic results with this?
msg2098 (view) Author: jendrik Date: 2012-04-13.14:40:51
It seems the recent addition of the -R command-line flag and the PYTHONHASHSEED
environment variable for Python makes testing for determinism easier:
http://mail.python.org/pipermail/python-announce-list/2012-April/009420.html

See this example program:

d = {}

for i in range(10):
    d[str(i)] = 1

for key, val in d.items():
    print key,


$ python python-hash-random.py
8 9 0 1 2 3 4 5 6 7
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
8 9 0 1 2 3 4 5 6 7
jendrik@SAM:~/projects/Tests$ export PYTHONHASHSEED=random
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 1 0 3 2 5 4 7 6
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 3 2 1 0 7 6 5 4
jendrik@SAM:~/projects/Tests$ export PYTHONHASHSEED=54321
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 7 6 5 4 3 2 1 0
jendrik@SAM:~/projects/Tests$ python python-hash-random.py
9 8 7 6 5 4 3 2 1 0

This raises the question of whether it wouldn't be easier and faster (both to
implement and at run time) to just set the PYTHONHASHSEED variable to achieve
determinism.
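
For example, a wrapper could pin the seed before invoking the translator; the
script name and arguments here are illustrative:

import os
import subprocess

env = dict(os.environ)
env["PYTHONHASHSEED"] = "0"  # a fixed seed makes string hashing reproducible
subprocess.check_call(
    ["python", "translate.py", "domain.pddl", "problem.pddl"], env=env)

As noted above, this only addresses hash randomization, not hashing by id.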
msg1926 (view) Author: jendrik Date: 2011-11-11.14:30:06
I am currently focusing on the scripts, but will see if I can give you
something to review concerning the determinism of the translator soonish.
msg1893 (view) Author: malte Date: 2011-11-08.12:48:17
This has been stalled since January. Can I do anything to make it move?
msg1232 (view) Author: jendrik Date: 2011-01-27.03:38:54
Just some notes: 

Sort the results of the following parts:
instantiate.explore
fact_groups.compute_groups
translate_task
msg785 (view) Author: malte Date: 2010-12-07.21:48:27
> It would be good if you could tell me if sorting makes sense there before I
> look into how to sort these instances.

Can you use upload.py to get this into Rietveld?

> I looked for occurrences of (iter)items(), (iter)values() and (iter)keys()
> instead of dict/{} because this is the place where nondeterminism occurs,
> right? 

Not the only place -- all iterations over dicts are problematic, including "for
x in mydict" and things like "mylist = list(mydict)".
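
For example (with a made-up dict):

mydict = {"b": 1, "a": 2}

for x in mydict:         # order comes from the hash table, not the program
    pass
mylist = list(mydict)    # the same hidden iteration

mylist = sorted(mydict)  # fix: impose an explicit order on the keys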
msg779 (view) Author: jendrik Date: 2010-12-07.17:44:45
This is the current state:
I looked at the fd-temporal log and added the changes made there that seemed
reasonable.
I looked for occurrences of set/frozenset and marked the places. It would be good
if you could tell me if sorting makes sense there before I look into how to sort
these instances.
I looked for occurrences of (iter)items(), (iter)values() and (iter)keys()
instead of dict/{} because this is the place where nondeterminism occurs, right?
Then I added some sorting code for those occurrences.

I haven't looked at "sorting without defining a suitable __cmp__ method" yet.
msg758 (view) Author: malte Date: 2010-11-30.03:15:59
Sounds good. The three most common sources of nondeterminism are:

 * sets (search for set/frozenset)
 * dicts (search for dict/"{")
 * sorting without defining a suitable __cmp__ method (search for sort/sorted)

(Actually, sets and dicts should also produce reproducible, though arbitrary and
Python-version-dependent results unless their hash method is itself
nondeterministic, e.g. id-dependent. So I would expect that the root cause for
more or less all existing nondeterminism would be related to hashing or
comparing by id.)
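
A small example of the id-dependent case (the Atom class is made up for
illustration):

class Atom(object):
    def __init__(self, name):
        self.name = name
    # Without __eq__/__hash__, hashing falls back to the object's id
    # (its memory address), so set order can differ from run to run.

atoms = set(Atom(name) for name in "abc")
print([atom.name for atom in atoms])         # may vary between runs
print(sorted(atom.name for atom in atoms))   # always ['a', 'b', 'c']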
msg757 (view) Author: jendrik Date: 2010-11-30.02:01:44
I have looked at the temporal fd log and found two changesets that speak about 
making the translator deterministic. The first one has been incorporated into 
branch issue22. Additionally I have scanned the translator dir for occurrences of
set() and marked the ones where the ordering of a set might be crucial for 
making the translator deterministic. I thought this way you could provide me 
with some feedback first, before I actually sort the sets and look for dicts 
too. You find the branch on bitbucket/digitaldump/downward.

A good testcase might be trucks-strips, especially p18.pddl since the translator 
frequently produces different results there.
msg698 (view) Author: malte Date: 2010-11-08.16:03:20
When this is done, we should contact Jörg (who asked about this).
History
Date User Action Args
2013-11-22 21:24:58 malte set status: testing -> resolved
    messages: + msg2718
2013-11-21 15:06:05 malte set messages: + msg2714
2013-11-21 14:59:22 jendrik set messages: + msg2713
2013-11-20 09:25:42 gabi set messages: + msg2707
2013-11-19 14:58:40 gabi set status: chatting -> testing
    assignedto: jendrik -> gabi
    messages: + msg2693
2013-11-17 14:06:31 gabi set files: + output.sas.2
2013-11-17 14:06:06 gabi set files: + output.sas.1
    status: resolved -> chatting
    messages: + msg2685
    nosy: + gabi
2012-05-22 12:06:35 jendrik set status: in-progress -> resolved
2012-05-22 00:28:43 jendrik set messages: + msg2224
2012-05-18 12:48:18 malte set messages: + msg2222
2012-05-15 00:08:08 jendrik set files: + js-issue22-sat-preprocess-eval.tar.gz
    messages: + msg2214
2012-05-14 12:38:52 jendrik set messages: + msg2212
2012-05-10 16:05:03 jendrik set messages: + msg2200
2012-05-10 15:39:42 malte set messages: + msg2199
2012-05-10 14:50:02 jendrik set messages: + msg2198
2012-05-10 14:37:38 malte set messages: + msg2197
2012-05-10 14:25:03 jendrik set messages: + msg2196
2012-05-10 14:14:17 malte set messages: + msg2195
2012-05-10 12:15:03 jendrik set messages: + msg2194
2012-05-09 17:27:33 malte set messages: + msg2192
2012-05-09 16:30:03 jendrik set messages: + msg2190
2012-05-08 14:06:22 malte set messages: + msg2183
2012-05-08 13:55:03 jendrik set messages: + msg2182
2012-05-08 13:49:21 malte set messages: + msg2181
2012-05-08 13:25:02 jendrik set messages: + msg2180
2012-05-07 16:00:16 malte set messages: + msg2176
2012-05-07 15:09:47 jendrik set messages: + msg2175
2012-05-03 01:23:30 malte set messages: + msg2172
2012-05-02 21:35:03 jendrik set messages: + msg2170
2012-05-02 18:51:12 malte set messages: + msg2168
2012-05-02 15:00:03 jendrik set messages: + msg2164
2012-05-02 14:55:02 jendrik set messages: + msg2163
2012-05-01 20:41:57 malte set messages: + msg2158
2012-05-01 20:01:45 jendrik set messages: + msg2157
2012-04-27 18:49:41 malte set messages: + msg2137
2012-04-27 18:49:19 malte set messages: + msg2136
2012-04-27 17:44:27 jendrik set messages: + msg2135
2012-04-13 17:50:03 jendrik set messages: + msg2104
2012-04-13 16:21:13 malte set messages: + msg2099
2012-04-13 14:40:51 jendrik set messages: + msg2098
2011-11-11 14:30:06 jendrik set messages: + msg1926
2011-11-08 12:48:18 malte set messages: + msg1893
2011-01-27 03:38:54 jendrik set messages: + msg1232
2010-12-07 21:48:27 malte set messages: + msg785
2010-12-07 17:44:46 jendrik set status: chatting -> in-progress
    messages: + msg779
2010-11-30 03:15:59 malte set messages: + msg758
2010-11-30 02:01:44 jendrik set messages: + msg757
2010-11-08 16:03:20 malte set assignedto: jendrik
    messages: + msg698
    nosy: + jendrik
2009-10-11 15:47:33 malte set nosy: + malte
2009-10-08 12:04:34 silvia create