Issue869

Title: skip hard-to-translate tasks in daily/weekly tests
Priority: feature
Status: resolved
Superseder:
Nosy List: florian, jendrik, malte
Assigned To: jendrik
Keywords:
Optional summary:

Created on 2018-11-26.13:52:37 by jendrik, last changed by malte.

Messages
msg8122 Author: malte Date: 2018-11-28.12:51:31
Agreed, such a comparison would be nice. Ideally with different levels of
granularity: something fast, something moderately comprehensive, and everything.

I suggest we open an issue for more tests iff someone wants to work on it in the
near or medium future.
msg8121 Author: florian Date: 2018-11-28.12:49:27
I see. I thought we would check that the translator output doesn't change from
one revision to the next. I think this would be a useful test to add to the
nightly/weekly build and could also take a bit longer. The local tests should
remain fast.
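
To illustrate the revision-to-revision comparison suggested here, a minimal sketch follows. It assumes that a digest of each task's translator output has been stored for a reference revision; the function names and the dictionary layout (task name mapped to digest) are hypothetical and not part of the existing test scripts.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a translator output file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def changed_tasks(reference, current):
    """Return the sorted names of tasks whose digest differs between the
    two revisions, or that are present on only one side.

    Both arguments map a task name to the digest of its translator output
    (an assumed layout, chosen for illustration).
    """
    tasks = set(reference) | set(current)
    return sorted(task for task in tasks if reference.get(task) != current.get(task))
```

An empty result would mean the translator output is unchanged for every task in the suite. Such a check only detects changes; whether a change is intended still has to be judged by a person.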
msg8119 Author: malte Date: 2018-11-28.12:39:13
The original intention was that these tests can be run in 10 seconds, so that
everyone can quickly run this test before they push.

We should also be clear what we are testing here: unless I'm mistaken, we only
test that Python 2.7 and Python 3 produce the same translator output, not that
this translator output matches any reference translator output or similar. So
we're not comparing to something objective, but only looking out for Python
version dependencies. For these, I think one task in (almost) every domain
should give us sufficient coverage.

So I guess what I'm saying is: before we extend this test to more tasks, I would
prefer us to think about:

1) What do we want to test?
2) What is the purpose of these tests?
3) Who should run these tests and in which situations?
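
For readers unfamiliar with the check described in this message (the same translator output under two Python versions), here is a minimal sketch of the idea. It is not the project's actual test script: the helper names are hypothetical, and it assumes each command writes its result to a file named output.sas in its working directory.

```python
import os
import shutil
import subprocess
import tempfile


def run_and_read(command, output_name):
    """Run command in a fresh temporary directory and return the bytes of
    the output file it is assumed to write there."""
    workdir = tempfile.mkdtemp()
    try:
        subprocess.check_call(command, cwd=workdir)
        with open(os.path.join(workdir, output_name), "rb") as f:
            return f.read()
    finally:
        shutil.rmtree(workdir)


def outputs_agree(commands, output_name="output.sas"):
    """Return True if all commands produce byte-identical output files."""
    contents = [run_and_read(command, output_name) for command in commands]
    return all(content == contents[0] for content in contents)
```

A caller would pass one command per interpreter, for example the same translator script and task files prefixed once with a Python 2.7 executable and once with a Python 3 executable. Paths need to be absolute, since each command runs in its own temporary directory.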
msg8117 Author: florian Date: 2018-11-28.12:29:51
I'm a bit late to the party but I would suggest that if we use a specific
selection of tasks for the translator tests anyway, we test a few more tasks
than just the first in each domain. How about testing all tasks that take less
than 10 seconds and 2GB or some similar selection?
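
A selection like this could be sketched as follows. The record format (keys "domain", "problem", "runtime" in seconds, "memory_kib" in KiB) is an assumption made for illustration, as are the function names. Runs without a recorded value are excluded, because a missing runtime may mean the translator ran out of time or memory.

```python
# Limits from the suggestion above; the units are assumptions.
MAX_RUNTIME_SECONDS = 10
MAX_MEMORY_KIB = 2 * 1024 * 1024  # 2 GiB expressed in KiB


def is_cheap(run, max_runtime=MAX_RUNTIME_SECONDS, max_memory=MAX_MEMORY_KIB):
    """Return True if the run finished and stayed below both limits."""
    runtime = run.get("runtime")
    memory = run.get("memory_kib")
    if runtime is None or memory is None:
        # No measurement: the run may have hit a time or memory limit.
        return False
    return runtime < max_runtime and memory < max_memory


def select_cheap_tasks(runs):
    """Return sorted 'domain:problem' names of all cheap runs."""
    return sorted(
        "{}:{}".format(run["domain"], run["problem"])
        for run in runs
        if is_cheap(run)
    )
```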
msg8102 Author: jendrik Date: 2018-11-27.08:21:30
I had pushed only to the master repo, not my own repo. The new revisions are now 
in both repos.

Regarding your earlier question about memory: I had overlooked that tasks
with runtime=None may have used too much memory or too much time.
msg8099 Author: malte Date: 2018-11-27.00:53:51
Awesome, thanks! I see no new commits on the pull request, though?
msg8097 Author: jendrik Date: 2018-11-26.19:17:43
Thanks for the comments on Bitbucket! I took care of them and merged this. The 
nightly test is green now and the weekly test should be in a few minutes.
msg8096 Author: malte Date: 2018-11-26.17:55:09
Thanks, Jendrik!

I've gone through the experimental results, and this list of domains looks good
to me.

The pull request looks good; I left two small comments.

> Memory usage is not a problem as all first tasks from IPC 2018
> use less than 2 GB.

As I wrote, I think the first task of organic-synthesis-sat18-strips needs more
than 10 GiB; I aborted it at that point. I've now run it a bit longer, and it
used roughly 15 GiB within 7 minutes. At that point I had to abort because the
machine started swapping too much. In your experiments it looks like this one
failed. Or perhaps you meant all first tasks apart from the ones in the domains
you mentioned?
msg8095 Author: jendrik Date: 2018-11-26.15:18:34
I made a pull request at https://bitbucket.org/jendrikseipp/downward/pull-requests/108 .
msg8093 Author: jendrik Date: 2018-11-26.14:45:19
Here are the times and peak memory usages for translating all IPC benchmarks up 
to 2018:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue869-base-translate-all.html

Based on the runtime results, I propose to ignore the first tasks from the 
following directories:

agricola-sat18-strips
organic-synthesis-sat18-strips
organic-synthesis-split-opt18-strips
organic-synthesis-split-sat18-strips

Memory usage is not a problem as all first tasks from IPC 2018 use less than 2 
GB.
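
The proposed suite ("the first task in every domain except the four listed") could look roughly like the sketch below. It is an illustration only and not the code from the pull request: the function names, the directory layout (one subdirectory per domain) and the rule for recognizing task files are assumptions.

```python
import os

# Domains whose first task is too expensive to translate (list from above).
EXCLUDED_DOMAINS = frozenset([
    "agricola-sat18-strips",
    "organic-synthesis-sat18-strips",
    "organic-synthesis-split-opt18-strips",
    "organic-synthesis-split-sat18-strips",
])


def is_task_file(filename):
    """Assumption: every .pddl file without 'domain' in its name is a task."""
    return filename.endswith(".pddl") and "domain" not in filename


def first_tasks(benchmarks_dir, excluded=EXCLUDED_DOMAINS):
    """Return 'domain:task' names, one per non-excluded domain directory."""
    suite = []
    for domain in sorted(os.listdir(benchmarks_dir)):
        domain_dir = os.path.join(benchmarks_dir, domain)
        if domain in excluded or not os.path.isdir(domain_dir):
            continue
        tasks = sorted(f for f in os.listdir(domain_dir) if is_task_file(f))
        if tasks:
            suite.append("{}:{}".format(domain, tasks[0]))
    return suite
```

Keeping the exclusions in one named constant makes it easy to extend the list when future benchmark sets add further hard-to-translate first tasks.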
msg8089 Author: jendrik Date: 2018-11-26.13:52:37
Quoting Malte:

"The problem seems to be that the translator test is supposed to only use
tasks that are quick to complete. We have implemented this requirement
as "use the first task in each domain", but since we've added the IPC
2018 benchmarks, this no longer works. For example, the task
"organic-synthesis-sat18-strips-p01.pddl" uses more than 18 GiB in the
translator on my machine, and I aborted it after 10 minutes.

I think the solution for this would be to define a different suite for
these tests, for example something as simple as "The first task in all
domains except X, Y and Z.""
History
Date                 User     Action  Args
2018-11-28 12:51:31  malte    set     messages: + msg8122
2018-11-28 12:49:27  florian  set     messages: + msg8121
2018-11-28 12:39:13  malte    set     messages: + msg8119
2018-11-28 12:29:51  florian  set     nosy: + florian, messages: + msg8117
2018-11-27 08:21:30  jendrik  set     messages: + msg8102
2018-11-27 00:53:51  malte    set     messages: + msg8099
2018-11-26 19:17:43  jendrik  set     status: reviewing -> resolved, messages: + msg8097
2018-11-26 17:55:10  malte    set     messages: + msg8096
2018-11-26 15:18:34  jendrik  set     status: in-progress -> reviewing, messages: + msg8095
2018-11-26 14:45:19  jendrik  set     status: unread -> in-progress, messages: + msg8093
2018-11-26 13:52:37  jendrik  create