Issue69

Title: A* with LM-CUT needs performance comparison between r3612/r3613 and HEAD
Priority: bug
Status: resolved
Superseder:
Nosy List: erez, jendrik, malte
Assigned To: malte
Keywords: 1.0
Optional summary:

Created on 2010-01-13.14:07:19 by erez, last changed by malte.

Files
File name                       Uploaded by  Date                 Type
exp-js-issue69-eval-abs-d.html  jendrik      2011-08-12.17:39:04  text/html
exp-js-issue69-eval-abs-p.html  jendrik      2011-08-12.17:43:23  text/html
lmcut.tar.gz                    erez         2010-01-21.19:17:31  application/x-gzip
plots-v2.tar.gz                 jendrik      2011-08-12.21:21:07  application/x-gzip
Messages
msg1610 (view) Author: malte Date: 2011-08-13.22:10:52
Thanks a lot, everyone! I think we're done here; the scatter plots tell the
story quite well.

r3613 is better than r3612, so it's sufficient to compare r3613 to the newer
revision. This comparison shows a bit of a loss of quality (measured in
expansions) but a bit of a gain in speed in most domains when using the newer code.

Overall the new code is a bit worse due to the loss of coverage in Airport. At
this point, I wouldn't roll back the change because it's only a single domain
and it's not clear at all whether the better performance of the old code there
isn't only an accident.

But I've opened a new issue, issue263, to make a note of the fact that it'd be
good to study different tie-breaking methods for LM-Cut at some point.
msg1598 (view) Author: jendrik Date: 2011-08-12.22:00:03
Done.
msg1597 (view) Author: malte Date: 2011-08-12.21:28:12
Wonderful! How do I generate my own scatter plots? I'd like to create some "all
problems in one plot" plots that exclude the problematic domains for the
respective comparison.

Can you add the instructions to the wiki?
msg1595 (view) Author: jendrik Date: 2011-08-12.21:21:07
Here are the new plots. Missing values are now rendered on their own line and the 
bug has been fixed.
msg1593 (view) Author: malte Date: 2011-08-12.19:52:46
Thanks!

Some preliminary notes on the results after looking at the logs:

- The 3612-ou planner ignored action costs, so its results in the 2008
  domains should be disregarded. In domains with zero-cost actions
  (openstacks, sokoban) they are also occasionally suboptimal. The
  results already show this directly in Sokoban, where other planners
  find better plans; one example in Openstacks 2008 is Opt #06, where
  a cost 3 plan is found but a cost 2 plan exists.

- The 3613-ou planner segfaults on Openstacks 2008 and PegSol, and
  occasionally on Sokoban. So it should not be considered on those
  domains. It doesn't attempt to solve ParcPrinter because of the high
  action costs.

- The 5272-ou planner doesn't attempt to solve ParcPrinter because of
  the high action costs.

- 3612-ou doesn't appear to have a meaningful advantage over 3613-ou,
  but I'd still like to have a look at the expansions and time plots
  for these.

- The main comparison should be 3613-ou vs. 5272-ou, though.
msg1589 (view) Author: jendrik Date: 2011-08-12.17:57:43
Sure, the experiment is located on habakuk:
/home/downward/jendrik/downward/new-scripts/exp-js-issue69*
msg1588 (view) Author: malte Date: 2011-08-12.17:55:34
Can you give me access to the full logs? Some of the results look suspicious;
the 0s probably make sense because the relevant configs refuse to work on some
of these domains, but I'd like to investigate a few things (e.g. the 30 problems
solved by r3612 in openstacks-opt08-strips -- probably using an inadmissible
heuristic?).
msg1587 (view) Author: jendrik Date: 2011-08-12.17:49:00
Yes, done.
msg1586 (view) Author: malte Date: 2011-08-12.17:48:15
Should the attached old issue69-all-plots.tar.gz (version from
2011-08-11.21:23:52) be deleted? I assume they're affected by the bug?
msg1585 (view) Author: jendrik Date: 2011-08-12.17:43:23
Here are the html reports.
msg1583 (view) Author: jendrik Date: 2011-08-12.17:36:32
I found a bug in the scatter plot code. Will fix it and upload a new set of plots 
soon.
msg1574 (view) Author: malte Date: 2011-08-12.11:19:03
> coverage:
> - same set of tasks solved

Actually, that is probably not true. That was based on the information in file
issue69ouSTRIPSeval-p-abs.html, but I guess that was a different experiment? It
doesn't include data for the 2008 domains. Do we have the complete coverage
information for the recent experiment somewhere? I'd be interested in both the
high-res (-p-abs.html) and low-res (-d-abs.html) version.
msg1573 (view) Author: malte Date: 2011-08-12.11:10:42
comparison r3612 vs. r3613
==========================

coverage:
- same set of tasks solved

expansions:
- no differences:
  blocks, depots, driverlog, grid, gripper, logistics00,
  logistics98, miconic, openstacks-strips, pathways-noneg,
  pipesworld-notankage, rovers, satellite, tpp, trucks, zenotravel

- tiny, ignorable differences:
  freecell

- few tasks with difference, hence no clear trend:
  airport, elevators-sat08, mprime, mystery, pipesworld-tankage,
  psr-small

- differences, no clear winner:
  sokoban-sat08

- r3612 appears to be somewhat better:
  sokoban-opt08

- r3613 appears to be somewhat better:
  scanalyzer-08

- r3613 clearly better:
  elevators-opt08, transport-opt08, transport-sat08,
  woodworking-opt08, woodworking-sat08

time:

picture very similar to expansions, but advantage of r3613 over r3612
is greater in domains where it is already clearly better, and it seems
to have a slight speed advantage across the board (i.e. in all
domains)
msg1572 (view) Author: malte Date: 2011-08-12.11:06:46
Wonderful! That I can work with. :-) One more feature request though: the plots
would be more useful if they also included data points for the tasks that are
solved by one, but not both of the configurations. Here's one way of doing this:
Assume that the range of values is 0...10^6. Then pick something suitably larger
(say, 10^7) and treat "unsolved" as that value. On the axis, don't write 10^7
but "fail" or some such. If the "fail" point can be placed exactly on the
rightmost/topmost line of the plot (which would be great), then I think no
axis label is necessary. Tasks where neither configuration finds a solution
should not be included.
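A minimal matplotlib sketch of this "fail" idea, with invented expansion counts (None marks an unsolved task) and the 10^7 sentinel placed exactly on the topmost/rightmost axis line:

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical expansion counts; None marks a task the config did not solve.
xs = [12, 450, 30000, None, 700, None]
ys = [10, 600, None, 25000, 650, None]

FAIL = 10 ** 7  # suitably larger than the largest real value
# Map "unsolved" to FAIL; drop tasks that neither config solved.
paired = [(x if x is not None else FAIL, y if y is not None else FAIL)
          for x, y in zip(xs, ys)
          if x is not None or y is not None]

fig, ax = plt.subplots()
ax.scatter([p[0] for p in paired], [p[1] for p in paired])
ax.set_xscale("log")
ax.set_yscale("log")
# Make FAIL coincide with the plot border, then label it "fail".
ax.set_xlim(1, FAIL)
ax.set_ylim(1, FAIL)
ticks = [10, 10 ** 3, 10 ** 5, FAIL]
labels = ["10^1", "10^3", "10^5", "fail"]
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
ax.set_yticks(ticks)
ax.set_yticklabels(labels)
fig.savefig("scatter-with-fail.png")
```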

To warm up, I compared the expansions for r3612 to r3613 in all-plots, although
that is not terribly interesting. The results are as expected: r3613 more or
less dominates r3612 on the tasks that have action costs > 1, and otherwise they
behave essentially the same. One thing is strange though (or I don't understand
it). If I extract issue69-all-plots.tar.gz and do
$ display *expansions*3612*3613*.png
to go through the plots, you see that there are very few points above the
diagonal in any domain. However, the file
"exp-js-issue69-eval-scatter-expansions-3612-ou+3613-ou-p.png" in the middle
gives a qualitatively very different picture.

Is that the plot for *all* tasks combined? If yes, then I think the data it uses
is wrong -- there's no way this is the union of the per-domain plots. Or am I
misunderstanding something?
msg1567 (view) Author: jendrik Date: 2011-08-11.21:23:52
Here you go.
msg1559 (view) Author: jendrik Date: 2011-08-11.13:55:53
> I haven't checked everything, but I would go by unit. If it measures time, nodes
> or states, it should be logarithmic. If it measures length or cost, it should be
> linear. Do we have any other units of measurement?
For now I set
linear_attributes = ['cost', 'coverage', 'plan_length']
>
>> With the two attributes search_time and expansions and the three code
>> versions ('3612-ou', '3613-ou', '5272-ou') generating all possible plots
>> would yield 35 * 3 * 2 = 210 images. Do you want them all (7.5 MB)?
> Yes, please! Plus a plot with all data points, with one data point per task (not
> per domain). Or do we have that already? I'm a bit confused at the moment
> regarding what is what. (Actually, it'd be great if the plots could be more
> self-explanatory, including info like "Each data point represents XYZ",
> basically what you wrote in the issue comment.)
That's true. I already added that.
>
> Related to this, there was a video in the IJCAI video competition that
> highlighted a result exploration tool with some very neat capabilities. Check
> video 15 ("Using Multiple Models to Understand Data") on
> http://ijcai-11.iiia.csic.es/program/video_track. The "Interactive
> Visualizations" section (about halfway in) is impressive and looks *really* useful.
You're right, that looks very useful indeed. The author doesn't provide any links to binaries or source code on his 
homepage though.

I'll prepare the new plots ASAP.
msg1557 (view) Author: malte Date: 2011-08-11.12:08:20
> I made logarithmic scaling the default for now. Can you list the attributes
> that should not have logarithmic scaling?

I haven't checked everything, but I would go by unit. If it measures time, nodes
or states, it should be logarithmic. If it measures length or cost, it should be
linear. Do we have any other units of measurement?

> With the two attributes search_time and expansions and the three code
> versions ('3612-ou', '3613-ou', '5272-ou') generating all possible plots
> would yield 35 * 3 * 2 = 210 images. Do you want them all (7.5 MB)?

Yes, please! Plus a plot with all data points, with one data point per task (not
per domain). Or do we have that already? I'm a bit confused at the moment
regarding what is what. (Actually, it'd be great if the plots could be more
self-explanatory, including info like "Each data point represents XYZ",
basically what you wrote in the issue comment.)

Related to this, there was a video in the IJCAI video competition that
highlighted a result exploration tool with some very neat capabilities. Check
video 15 ("Using Multiple Models to Understand Data") on
http://ijcai-11.iiia.csic.es/program/video_track. The "Interactive
Visualizations" section (about halfway in) is impressive and looks *really* useful.

Maybe we should set aside some time to discuss the
analysis/summarization/visualization aspect of our scripts, and where we want to
go with it? It's a lot of work to reinvent the wheel here, and I think there are
a few tools around that we should learn about. (Even if we don't end up using
one of them, it should be inspirational.)
msg1551 (view) Author: jendrik Date: 2011-08-11.02:25:02
For which domains do you want plots? The experiment has values for 35 domains:

domains = ['airport', 'blocks', 'depot', 'driverlog', 'elevators-opt08-strips',
           'elevators-sat08-strips', 'freecell', 'grid', 'gripper',
           'logistics00', 'logistics98', 'miconic', 'mprime', 'mystery',
           'openstacks-opt08-strips', 'openstacks-sat08-strips',
           'openstacks-strips', 'parcprinter-08-strips', 'pathways-noneg',
           'pegsol-08-strips', 'pipesworld-notankage', 'pipesworld-tankage',
           'psr-small', 'rovers', 'satellite', 'scanalyzer-08-strips',
           'sokoban-opt08-strips', 'sokoban-sat08-strips', 'tpp',
           'transport-opt08-strips', 'transport-sat08-strips', 'trucks-strips',
           'woodworking-opt08-strips', 'woodworking-sat08-strips',
           'zenotravel']

With the two attributes search_time and expansions and the three code versions ('3612-ou', '3613-ou', '5272-ou') generating all possible plots would yield 35 * 
3 * 2 = 210 images. Do you want them all (7.5 MB)?
msg1550 (view) Author: jendrik Date: 2011-08-11.01:44:45
> One minor niggle:
> 
> - The 0s on the axes appear to be set in a different font and at a different
> height than the other numbers. It should be the same font and location as the 10
> in 10^1, 10^2 etc. (Not important for us, but would need to be changed if we
> want to eventually use these plots in papers.)

I found a rather elegant hack to do this.

> One more important niggle:
> 
> - I assume that these are for a single domain now? If so, can the domain be
> mentioned somewhere, e.g. "evaluations (parcprinter)" or some such instead of
> "evaluations" in the caption at the top?

Ah, I added different functionality: each scatter plot point can now be the sum
(or the average for scores, or the geometric mean for times) of a domain's values.
This behaviour can be changed by setting --res problem or --res domain.
I guess --res problem should be the default?
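A sketch of the geometric-mean aggregation mentioned above (the runtime values are invented):

```python
import math

def geometric_mean(values):
    # Computed via logs so a product of many large runtimes cannot overflow.
    assert values and all(v > 0 for v in values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-task search times (in seconds) for one domain:
print(geometric_mean([0.5, 2.0, 8.0]))  # 2.0, since 0.5 * 2 * 8 = 2**3
```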

> 
> Runtimes should be displayed on a logarithmic
> scale even if we run an experiment where all runtimes are less than 10 seconds.

I made logarithmic scaling the default for now. Can you list the attributes that should not have logarithmic scaling?
msg1539 (view) Author: malte Date: 2011-08-10.13:21:23
Very pretty!

One minor niggle:

- The 0s on the axes appear to be set in a different font and at a different
height than the other numbers. It should be the same font and location as the 10
in 10^1, 10^2 etc. (Not important for us, but would need to be changed if we
want to eventually use these plots in papers.)

One more important niggle:

- I assume that these are for a single domain now? If so, can the domain be
mentioned somewhere, e.g. "evaluations (parcprinter)" or some such instead of
"evaluations" in the caption at the top?

> I added the requested features and made the scales logarithmic if the max
> value is bigger than 10^5.

Magnitude is not a good criterion. It depends on what is measured, not what the
values are. Plan costs in ParcPrinter should not be displayed on a logarithmic
scale despite their being >= 10^6. Runtimes should be displayed on a logarithmic
scale even if we run an experiment where all runtimes are less than 10 seconds.

If you check out the runtime scatter plots, you see why you don't want a linear
scale here: almost always all the data points are crowded in the far bottom left
corner, so the space is not used well.
msg1537 (view) Author: jendrik Date: 2011-08-10.13:12:23
I added the requested features and made the scales logarithmic if the max value 
is bigger than 10^5.
msg1526 (view) Author: malte Date: 2011-08-09.19:56:42
PS: http://www.scipy.org/Cookbook/Matplotlib has an example ("Custom log plot
labels") that has logarithmic axes for a scatter plot with nice axis labels
starting at 0 (so no need for transforming values to 1? Although I'd test that
value pairs like (0, 8) or (0, 0) actually show up in the plot.)

Regarding the diagonal line, the next example does that, but I'm not sure if and
how they could be overlaid. If there's a way to access the max x (and y) value
that is displayed inside the plot, drawing a line from (0,0) to (max, max) would
do the trick, of course.
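One way this might look in matplotlib (a sketch with invented data): overlaying a scatter plot with the y=x line drawn from the queried axis limits. Since the axes are logarithmic, the line starts at the lower limit rather than at 0.

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical expansion counts for two configurations.
xs = [3, 40, 500, 6000]
ys = [5, 30, 700, 4000]

fig, ax = plt.subplots()
ax.scatter(xs, ys)
ax.set_xscale("log")
ax.set_yscale("log")
# Query the displayed axis limits and overlay the y=x reference line.
lower = min(ax.get_xlim()[0], ax.get_ylim()[0])
upper = max(ax.get_xlim()[1], ax.get_ylim()[1])
ax.plot([lower, upper], [lower, upper], color="gray", linewidth=1)
fig.savefig("scatter-diagonal.png")
```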
msg1525 (view) Author: malte Date: 2011-08-09.19:51:39
> 3) I have searched for but couldn't find an easy way to add an infinite
> diagonal line.

In other plotting packages, you'd do that by plotting the function f(x) = x. Is
that difficult to do in matplotlib? Or is it difficult to combine a scatter plot
with a function plot?
msg1524 (view) Author: jendrik Date: 2011-08-09.19:48:50
> What plotting package/library do you use to generate these?
I used matplotlib.

3) I have searched for but couldn't find an easy way to add an infinite diagonal 
line.
msg1523 (view) Author: malte Date: 2011-08-09.18:35:25
Very nice! What plotting package/library do you use to generate these?

Some wishes:

1) labels for expansions should not start at -100000 or some such -- they won't
go below 0. ;-)

2) for things growing exponentially like expansions, it's more common (and much
more useful) to make both scales logarithmic. If that causes a problem because
of zero entries, one simple fix is to change all 0 entries to 1.

3) Scatter plots that compare two comparable things along the same dimension
often contain the y=x line to easily distinguish the "good" from the "bad"
region, like here:
http://www.cawcr.gov.au/projects/verification/scatterplot.gif. That would be
useful here, too, although the gray grid lines already help.

4) most importantly: per-domain results would be very very useful!
msg1522 (view) Author: jendrik Date: 2011-08-09.17:53:00
I got the hint and added a ScatterPlotReport class ;) Currently the scatter 
plots can only report by suite and not by domain. Results are attached.
msg1420 (view) Author: malte Date: 2011-07-18.11:29:32
I think we should still do a definitive comparison of coverage, expansion count
and solution speed. I don't have time to prepare it myself, though.

The old experiment is probably already fine in terms of what it measures, but we
need to prepare the data in a way that makes it more accessible (e.g.
scatter-plots per domain).
msg1419 (view) Author: erez Date: 2011-07-18.08:01:46
I doubt we're going to change the LM-CUT implementation back to the way it was, 
but it's Malte's call.
msg1418 (view) Author: jendrik Date: 2011-07-17.21:10:07
I would like to remove issue69.py from the scripts directory. Do you still need a 
comparison experiment for this issue or is it fixed?
msg611 (view) Author: malte Date: 2010-10-28.00:00:36
I still haven't completely analyzed the code, but it looks like this might
indeed be a tie-breaking issue. I can try switching the code back to the old way
of breaking ties, but it might easily break again when implementing proper
action cost support for LM-cut, so my preference is to address this at the same
time as implementing proper action cost support for LM-cut.

For the action cost support, it would be good to run some before/after
experiments on the IPC-2008 instances, which I've now added to the repository
(apart from cybersec, since this domain alone would more than double the repo
size, I think).
msg608 (view) Author: jendrik Date: 2010-10-26.20:00:03
The easiest way would be to pass the desired config on the commandline

cd /home/downward/{trunk/master}/new-scripts
./downward-reports.py issue69-ou-STRIPS-{trans3613-}eval/ -c 5038-5038-3613-ou --res problem
msg607 (view) Author: malte Date: 2010-10-26.19:35:16
That was indeed the problem. Is there a way to find out the search time (or
total time, or number of expansions) for particular instances of particular
configurations where the instance was not solved by all?
msg606 (view) Author: jendrik Date: 2010-10-26.19:30:49
The experiment yielded the same results as earlier. Maybe you have looked at the 
wrong subtable? For HEAD-HEAD-3612-ou and HEAD-HEAD-3613-ou Airport 25 is solved 
and appears in the solved table correctly. It doesn't appear in the tables where 
only commonly solved problems are shown.
msg605 (view) Author: jendrik Date: 2010-10-26.17:45:03
I just ran an experiment with HEAD-HEAD-3612-ou and HEAD-HEAD-3613-ou 
and airport 25 and now it seems to work. I'll submit the whole domain 
for another experiment and get back to you.

Maybe there were some errors in the older version of the new-scripts.

msg604 (view) Author: malte Date: 2010-10-26.13:22:05
Small correction: it looks like the old code solved tasks #1-#38 without any
gaps (which is a bit strange, since the paper reports 37 solved tasks?)

I made a few more tests with #25 with translator and preprocessor from HEAD and
various versions of the search component. Up to r3636, it is solved fine. In
r3637, I get a bug leading to an invalid plan, but this is easy to fix, and after
fixing it, it again works fine. In r3638, the "getting stuck" behaviour starts,
which makes some sense since this is the revision that actually changed the
behaviour of the heuristic.

I'll look into this more thoroughly and see if there is a way to avoid these
problems without losing the performance advantages of the r3638 (and subsequent)
algorithm changes.

One thing is strange: as far as I can tell, versions HEAD-HEAD-3612-ou and
HEAD-HEAD-3613-ou should easily solve Airport #25, but they don't in Jendrik's
data. (These names do mean HEAD for translate and preprocess and 3612/3613 for
release-search, right?) Jendrik, can you check that these results were indeed
generated with the proper code versions etc.?
msg603 (view) Author: malte Date: 2010-10-26.12:11:09
OK, I looked at the airport results, old (= the ICAPS paper) and new (=
Jendrik's results) a bit more closely. The old results solved all tasks from
#1-#38 except #18 and #20. The new ones solve all tasks from #1-#24 except #18
and they solve #36. So there's a big gap in the range #25-#35.

I tried the smallest of these, #25, with everything@2449 and the current hg tip.
In both cases, we get h = 212 for the initial state, which is also the optimal
plan length. With the old code, we immediately zip down to a solution with no
real search, i.e., there are exactly 213 expansions which is optimal since it's
the length of the solution. With the new code, we zip down immediately to
h=93/g=118 in the first 118 expansions and then get stuck there. There are
various possible explanations for this; it could be a bug, or it could be worse
luck with tie-breaking decisions.

The new code shows the same behaviour no matter whether we use the "old"
translator and preprocessor or the "new" ones, so it looks like the search
code is the culprit.

That narrows it down sufficiently for now, so I don't think we'll need these
experiments at the moment, Erez.
msg602 (view) Author: malte Date: 2010-10-26.11:50:01
I think we used svn+ssh://downward-svn/branches/everything@2449.
msg601 (view) Author: erez Date: 2010-10-26.11:16:45
I'll see what I can do when the current running experiment finishes.
msg600 (view) Author: malte Date: 2010-10-26.10:57:18
Would there be enough time available to run *only* Airport maybe?

Unfortunately I'm not sure what exactly was run for the ICAPS paper, since
Carmel ran those experiments and the logs are not available any more. (I asked
him about them last week.) I can check my notes for some best guesses.
msg599 (view) Author: erez Date: 2010-10-26.10:29:38
I'm afraid the Technion machines are going to be busy until the ICAPS deadline.

Summarizing the difference in solved tasks between the latest version and what 
was reported in the paper:
Airport: -10
Driverlog: -1
Gripper: +1
Miconic: +1
Mprime: -2
pw-tankage: -1

So except for airport, the differences are fairly small.
Malte - do you know which translator version was used for the experiments 
reported in the paper?
msg598 (view) Author: malte Date: 2010-10-26.03:00:27
Hmmm, very close in terms of solved problems to the current translator version.
I guess this is good because it means that our good results were not solely due
to a buggy translator illicitly dropping actions, but it means we still don't
know what is going on here.

This is much worse than either Erez's results or what we report in the ICAPS
2009 paper. Both Erez's results and the ICAPS paper results were run on Technion
machines which I know are faster than what we are using here, so this *might*
explain it. Still, we should double-check this -- it'd be a pity for the trunk
to have so much worse coverage than what we report in the ICAPS paper.

Erez, would it be possible for you to repeat Jendrik's experiments on the
Technion machines? I think it'd be enough just to drop the 3613 version from the
experiment, but it'd be good to have both translator versions. So we want:

 * 3612-HEAD-3612-ou
 * 3612-HEAD-HEAD-ou
 * HEAD-HEAD-3612-ou
 * HEAD-HEAD-HEAD-ou
msg597 (view) Author: jendrik Date: 2010-10-26.02:42:09
To distract you from the fishy code, I present to you the latest grid results ;)

We should definitely have a look at the run code at the next meeting.
msg588 (view) Author: malte Date: 2010-10-25.02:52:09
Hmmm, this looks a bit fishy.

I don't think that time.clock() takes into account the time used in the called
processes, so my guess would be that the "passed_time" computation does not make
a lot of sense. We should probably log that. About setrlimit(), I don't really
know -- it might depend on what the RUN_COMMAND is -- but I would assume that
it's a per-process limit. We could test that at our next meeting.

Also, subprocess.call with shell=True should generally be avoided in clean code,
since it opens all kinds of cans of worms w.r.t. shell syntax and quoting.
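A small sketch of both points, in modern Python for illustration (the command run here is a placeholder, not the actual RUN_COMMAND): os.times() captures CPU time of finished child processes, which per-process clocks like time.clock() miss, and the command is passed as a list instead of shell=True:

```python
import os
import subprocess
import sys

# Per-process clocks only measure THIS process; os.times() additionally
# reports CPU time consumed by finished child processes.
before = os.times()
# Passing the command as a list (no shell=True) avoids quoting pitfalls.
returncode = subprocess.call([sys.executable, "-c", "sum(range(10**6))"])
after = os.times()
child_cpu = ((after.children_user - before.children_user)
             + (after.children_system - before.children_system))
print("child CPU seconds:", child_cpu)
```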
msg587 (view) Author: jendrik Date: 2010-10-25.02:15:02
The gkigrid jobs get a default timeout of 1830 seconds. The default 
timeout for the run of 1800 seconds is set in the file 
data/run-template.py. I am guessing that the timeout is set for the 
whole run (including preprocess command), but I'm unsure about the 
workings of resource.setrlimit(). It would be good if you could have a 
look at the file and see what can be improved there. That file is 
probably the most important bit of the scripts, but hasn't gotten much 
attention...

The CHECK_INTERVAL is as low as 0.5 sec, because that makes local 
testing a little faster, but I should probably increase that.
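For reference, a minimal sketch of how such a CPU limit might be installed with resource.setrlimit() (the command is a placeholder, not the actual run-template.py code; POSIX only):

```python
import resource
import subprocess
import sys

TIMEOUT = 1800  # CPU seconds allowed

def set_cpu_limit():
    # RLIMIT_CPU is a per-process limit; child processes inherit it,
    # but each gets its own counter -- it is not cumulative for the run.
    resource.setrlimit(resource.RLIMIT_CPU, (TIMEOUT, TIMEOUT))

# Hypothetical run command; preexec_fn installs the limit in the child
# between fork and exec, so only the run is constrained.
returncode = subprocess.call([sys.executable, "-c", "print('run')"],
                             preexec_fn=set_cpu_limit)
```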
msg586 (view) Author: malte Date: 2010-10-24.17:14:14
Thanks!

By the way, how do the experiments manage the timeouts? Since the vast majority
of our experiments care only about the search component, we usually *only* apply
the 30 minute timeout to search. Is that what is happening here, or is there an
overall 30 minute timeout? Shouldn't make a huge difference since
translation/preprocessing tends to be fast for the tasks that we can solve with
an optimal planner, but still this should not be neglected.
msg585 (view) Author: jendrik Date: 2010-10-24.16:11:25
I have just submitted a job to the grid with the following revs:

combinations = [
    (TranslatorSvnCheckout(rev=3613), PreprocessorSvnCheckout(),
     PlannerSvnCheckout(rev=3612)),
    (TranslatorSvnCheckout(rev=3613), PreprocessorSvnCheckout(),
     PlannerSvnCheckout(rev=3613)),
    (TranslatorSvnCheckout(rev=3613), PreprocessorSvnCheckout(),
     PlannerSvnCheckout(rev='HEAD')),
]
msg584 (view) Author: malte Date: 2010-10-24.12:35:33
Yes, a run with an equally old translator version would be useful.

The preprocessor version can/should be the most recent one; all that happened
there since r3612 are some bug fixes that shouldn't affect LM-cut.
msg583 (view) Author: erez Date: 2010-10-24.10:33:45
The logs from my runs are in the lmcut.tar.gz attached here.
I'm afraid I ran these on my old computer, so I have no more data, and no way to 
recreate these.
It's very possible that there is also some connection to the translator here, so 
we might want to try to use the same version of the translator as in r3612, and 
run this comparison again.
msg582 (view) Author: jendrik Date: 2010-10-24.04:27:01
I'm attaching the detailed results for a better analysis.
msg581 (view) Author: malte Date: 2010-10-24.03:57:08
Hmm... the results for the trunk version are best and the difference between
r3612 and r3613 are negligible, but the results are consistently worse than what
Erez reports in msg226.

Erez, do you still have the logs for these runs available or can you reproduce
them in some way?
msg580 (view) Author: jendrik Date: 2010-10-24.03:39:06
I have taken this issue as an example for the new comparisons module and ran 
some tests: ou on all STRIPS domains for r3612/r3613 and HEAD (HEAD being ~5038, 
one of the latest SVN revisions).

The results are attached.

BTW, the scripts now convert HEAD etc. to the actual revision number.
msg228 (view) Author: erez Date: 2010-01-21.19:17:31
The results are attached
msg227 (view) Author: malte Date: 2010-01-21.15:13:37
If you still have the raw data, I'd need to have a look at the runtime and
expansion numbers for the individual instances.
msg226 (view) Author: erez Date: 2010-01-21.08:58:33
I ran a comparison on: airport, blocks, depot, freecell, logistics, mprime,
pathways, psr-small, pipesworld-tankage, pipesworld-notankage and zenotravel.

In terms of solved problems, these are the differences (for the new LM-CUT):
airport: -11  (38 old, 27 new)
mprime: -1 (25 old, 24 new)
pipesworld-notankage: +1 (17 old, 18 new)
pipesworld-tankage: +1 (11 old, 12 new)
zenotravel: +1 (12 old, 13 new)

So overall there was an improvement in 3 domains, and a decrease in 2.
However, the decrease in airport is huge.
I ran it using the scripts, so if there are some additional reports you want me
to run, just let me know.
msg225 (view) Author: malte Date: 2010-01-14.01:06:04
I had a little bit of time left, so I checked the behaviour. It's not 100% clear to
me that there is a fixable problem. Revision r3638 made some algorithmic changes to
LM-cut that probably affect which h_max supporter is chosen in case of ties.
Since LM-cut is only well-defined up to tie breaking, this can adversely affect
h values.

It could just be the case that we were lucky with tie-breaking before and are
not so lucky anymore, and unfortunately in domains like Airport this can mean
all or nothing. But it's equally possible that we're much better in other
domains, or even other Airport instances.

So I think the right approach here is a thorough performance comparison between
r3612 (for unit-cost problems) or r3613 (for general cost problems) and HEAD
across all domains. If there's a systematic decrease of heuristic quality, we
can look into possible causes for this -- maybe there is a bug, but then the
performance study will hopefully tell us where to look.
msg224 (view) Author: malte Date: 2010-01-13.15:10:33
PS: If anyone wants to do any performance experiments with h^LM-cut for a paper,
*don't* use the current trunk. I made some major unpublished experimental
algorithmic changes there recently, and the reference version for use in papers
should be the version described in the LM-cut paper.

I think r3304 would be a good version to use at least w.r.t. the heuristic
implementation. (I don't know about the search algorithms without checking).
msg223 (view) Author: erez Date: 2010-01-13.15:09:40
I found this by running selective-max again (after the changes I made to
selective-max). Airport was the most obvious example, but other domains also
suffered in performance (both pipesworld domains for example). I'm not sure if
this is all due to lm-cut, but it's worth looking into those as well.
msg222 (view) Author: malte Date: 2010-01-13.15:07:25
That's what I feared. The problem is that the code now computes the cuts
slightly differently, which can affect the h values (since h^LM-cut, like h^FF,
has some arbitary choice points). So it's possible that we're now simply unlucky
and get slightly worse cuts for obscure reasons, or it may be a genuine bug.
Will be a bit of work to find out which one it is. Oh well. :-(

Anyway, thanks! I'll look into it.
msg221 (view) Author: erez Date: 2010-01-13.15:04:15
Here it is:

r3636 - no bug
r3637 - different bug (wrong solution of length 1)
r3638 - bug
msg220 (view) Author: malte Date: 2010-01-13.14:47:07
Thanks for reporting! We should try to isolate the revision in which the
performance first degraded (e.g. using some kind of binary search).

Looks like I won't have time for this in the near future, so if you could do it,
great! Otherwise we can just let this one sit for a while, of course.
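The binary search over revisions suggested here can be sketched as a small driver. This is a minimal sketch, not project code: the `is_bad` predicate is hypothetical and would in practice check out the revision, build the planner, and run a benchmark such as Airport 25 under a time limit. It also assumes the regression is monotone (every revision before the culprit is good, every one after is bad), which the "different bug" in r3637 reported in msg221 above shows need not hold exactly.

```python
def first_bad_revision(good, bad, is_bad):
    """Binary search for the first revision in (good, bad] for which
    is_bad(rev) is True, assuming is_bad is monotone over revisions:
    False up to some revision, True from then on."""
    assert good < bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # regression introduced at mid or earlier
        else:
            good = mid   # regression introduced after mid
    return bad

# Hypothetical predicate simulating a regression introduced in r3638;
# a real one would build and benchmark the checked-out revision.
print(first_bad_revision(3612, 3700, lambda r: r >= 3638))  # -> 3638
```

With a good and a bad endpoint this needs only O(log n) builds instead of testing every revision in between.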
msg219 (view) Author: erez Date: 2010-01-13.14:07:19
There is a major difference in performance between the current A* + LM-CUT and
the first results we had with LM-CUT.
Example: Airport 25 - previously solved in 96 seconds with 213 expanded
states. Now it times out after 30 minutes.

I'm not sure if this is because of changes in LM-CUT, search engine, or maybe
something else. I don't think it's the recent translator changes, since I did
not re-run the translator after these changes.
History
Date                 User     Action  Args
2011-08-13 22:10:52  malte    set     status: chatting -> resolved; messages: + msg1610
2011-08-12 22:00:03  jendrik  set     messages: + msg1598
2011-08-12 21:28:12  malte    set     messages: + msg1597
2011-08-12 21:21:07  jendrik  set     files: + plots-v2.tar.gz; messages: + msg1595
2011-08-12 19:52:46  malte    set     messages: + msg1593
2011-08-12 17:57:43  jendrik  set     messages: + msg1589
2011-08-12 17:55:34  malte    set     messages: + msg1588
2011-08-12 17:49:00  jendrik  set     messages: + msg1587
2011-08-12 17:48:46  jendrik  set     files: - issue69-all-plots.tar.gz
2011-08-12 17:48:15  malte    set     messages: + msg1586
2011-08-12 17:43:24  jendrik  set     files: + exp-js-issue69-eval-abs-p.html; messages: + msg1585
2011-08-12 17:39:56  jendrik  set     files: - issue69-scatter-domain.tar.gz
2011-08-12 17:39:51  jendrik  set     files: - issue69-scatter.tar.gz
2011-08-12 17:39:39  jendrik  set     files: - issue69ouSTRIPSeval-d-abs.html
2011-08-12 17:39:36  jendrik  set     files: - issue69ouSTRIPSeval-p-abs.html
2011-08-12 17:39:23  jendrik  set     files: - issue69ouSTRIPStrans3613eval-d-abs.html
2011-08-12 17:39:04  jendrik  set     files: + exp-js-issue69-eval-abs-d.html
2011-08-12 17:36:32  jendrik  set     messages: + msg1583
2011-08-12 11:19:03  malte    set     messages: + msg1574
2011-08-12 11:10:42  malte    set     messages: + msg1573
2011-08-12 11:06:47  malte    set     messages: + msg1572
2011-08-11 21:23:52  jendrik  set     files: + issue69-all-plots.tar.gz; messages: + msg1567
2011-08-11 13:55:53  jendrik  set     messages: + msg1559
2011-08-11 12:08:20  malte    set     messages: + msg1557
2011-08-11 02:25:02  jendrik  set     messages: + msg1551
2011-08-11 01:44:45  jendrik  set     messages: + msg1550
2011-08-10 13:21:23  malte    set     messages: + msg1539
2011-08-10 13:12:23  jendrik  set     files: + issue69-scatter-domain.tar.gz; messages: + msg1537
2011-08-09 19:56:42  malte    set     messages: + msg1526
2011-08-09 19:51:39  malte    set     messages: + msg1525
2011-08-09 19:48:50  jendrik  set     messages: + msg1524
2011-08-09 18:35:25  malte    set     messages: + msg1523
2011-08-09 17:53:00  jendrik  set     files: + issue69-scatter.tar.gz; messages: + msg1522
2011-07-18 11:29:32  malte    set     messages: + msg1420
2011-07-18 08:01:46  erez     set     messages: + msg1419
2011-07-17 21:10:07  jendrik  set     messages: + msg1418
2010-10-28 00:00:36  malte    set     messages: + msg611
2010-10-26 20:00:03  jendrik  set     messages: + msg608
2010-10-26 19:35:16  malte    set     messages: + msg607
2010-10-26 19:30:49  jendrik  set     messages: + msg606
2010-10-26 17:45:04  jendrik  set     messages: + msg605
2010-10-26 13:22:05  malte    set     messages: + msg604
2010-10-26 12:11:09  malte    set     messages: + msg603
2010-10-26 11:50:01  malte    set     messages: + msg602
2010-10-26 11:16:45  erez     set     messages: + msg601
2010-10-26 10:57:18  malte    set     messages: + msg600
2010-10-26 10:29:38  erez     set     messages: + msg599
2010-10-26 03:00:27  malte    set     messages: + msg598
2010-10-26 02:42:09  jendrik  set     files: + issue69ouSTRIPStrans3613eval-d-abs.html; messages: + msg597
2010-10-25 02:52:09  malte    set     messages: + msg588
2010-10-25 02:15:03  jendrik  set     messages: + msg587
2010-10-24 17:14:15  malte    set     messages: + msg586
2010-10-24 16:11:25  jendrik  set     messages: + msg585
2010-10-24 12:35:33  malte    set     messages: + msg584
2010-10-24 10:33:45  erez     set     messages: + msg583
2010-10-24 04:27:03  jendrik  set     files: + issue69ouSTRIPSeval-p-abs.html; messages: + msg582
2010-10-24 03:57:08  malte    set     messages: + msg581
2010-10-24 03:39:06  jendrik  set     files: + issue69ouSTRIPSeval-d-abs.html; nosy: + jendrik; messages: + msg580
2010-03-22 14:34:28  malte    set     keyword: + 1.0
2010-01-21 19:17:31  erez     set     files: + lmcut.tar.gz; messages: + msg228
2010-01-21 15:13:38  malte    set     messages: + msg227
2010-01-21 08:58:34  erez     set     messages: + msg226
2010-01-14 01:06:04  malte    set     messages: + msg225; title: A* with LM-CUT degradation -> A* with LM-CUT needs performance comparison between r3612/r3613 and HEAD
2010-01-13 15:10:33  malte    set     messages: + msg224
2010-01-13 15:09:40  erez     set     messages: + msg223
2010-01-13 15:07:25  malte    set     assignedto: malte; messages: + msg222
2010-01-13 15:04:15  erez     set     messages: + msg221
2010-01-13 14:47:07  malte    set     status: unread -> chatting; messages: + msg220
2010-01-13 14:07:19  erez     create