Created on 2017-09-01.13:01:23 by jendrik, last changed by jendrik.
msg10184 |
Author: jendrik |
Date: 2021-03-18.14:14:08 |
|
I'm closing this issue since it's now tracked in the VAL repo.
|
msg6538 |
Author: jendrik |
Date: 2017-09-21.15:30:25 |
|
This is probably related to https://github.com/KCL-Planning/VAL/issues/10 .
|
msg6528 |
Author: jendrik |
Date: 2017-09-04.19:16:19 |
|
I think fixing the bug would be best, yes.
|
msg6527 |
Author: malte |
Date: 2017-09-04.19:03:34 |
|
(Assuming that the behaviour that Jendrik observed is a bug and not a feature. :-))
|
msg6526 |
Author: malte |
Date: 2017-09-04.19:02:35 |
|
All sounds good, thanks!
How do we proceed with this issue? Try to fix the VAL bug?
|
msg6524 |
Author: jendrik |
Date: 2017-09-04.15:44:24 |
|
Malte, I think your explanation regarding the small number of failed tasks makes sense.
I removed the if-block for now.
Lab now also requires that the "error" attribute is set for each run. Its value is either
an explained error such as "none" or "out-of-memory", or an unexplained one such as
"unexplained:output-to-stderr". This catches the case where the parser (which normally
sets the "error" attribute) is not run at all (e.g., when the whole run is never started
by Slurm).
I agree that it would be useful to store all observed errors. I opened
https://bitbucket.org/jendrikseipp/lab/issues/41/store-all-observed-errors-instead-of-a
for this.
However, we should still inspect the Slurm logs, which requires some preparation.
Currently, all tasks write into the same driver.err and driver.log files, which causes
these files to contain mostly garbage, since the access is not synchronized. We will have
to separate the task-wise logs from the global logs before we can parse them. I opened
https://bitbucket.org/jendrikseipp/lab/issues/40/move-task-wise-slurm-logs-into-run and
https://bitbucket.org/jendrikseipp/lab/issues/39/check-slurm-logs to keep track of this.
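The "error"-attribute convention described above could be sketched like this. This is a hypothetical illustration, not lab's actual code: the set of explained error values, the function name, and the property format are all assumptions.

```python
# Hypothetical sketch of checking the "error" attribute for each run.
# The explained error values below are assumptions for illustration.
EXPLAINED_ERRORS = {"none", "out-of-memory", "out-of-time"}

def check_error_attribute(properties):
    """Return a list of warnings for one run's parsed properties."""
    error = properties.get("error")
    if error is None:
        # The parser normally sets "error"; if it is missing, the run
        # was probably never started (e.g., by Slurm) at all.
        return ["run has no 'error' attribute; was it ever started?"]
    if error in EXPLAINED_ERRORS or error.startswith("unexplained:"):
        return []
    return ["unknown 'error' value: {}".format(error)]

print(check_error_attribute({"error": "none"}))
print(check_error_attribute({}))
```

This makes a run with no "error" attribute at all (a run the parser never saw) stand out just as loudly as one with an unexplained error.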
|
msg6519 |
Author: malte |
Date: 2017-09-03.19:24:47 |
|
Makes sense, but what I don't understand is why there are so *few* failing tasks
then. If I understand it correctly, as soon as one task fails in this way, all
tasks started afterwards will fail. But the experiment only shows very few
failures. If I hadn't scrolled through all tasks manually to find outliers, we
would have missed this.
Maybe it's because it's a small experiment and the tasks causing failures are
quite long-running ones, so by the time the failures occur, almost everything
has been started.
> I suggest we remove the if-block as a short-term solution but think about adding
> it back in once we have task-specific logs (I think whether we want to have the
> check there is a separate question).
I think slurm writing something to stderr is not something we should ignore, so
we should definitely have some kind of check of slurm's output. I think we
should treat every slurm error as either a lab bug (in this case, failure to set
a memory limit for VAL) or a sign that something serious has gone wrong that
needs investigating (such as grid nodes not working properly). In either case, I
guess the user deserves a big warning in the reports (if we even want to produce
a report).
If I understand correctly, we currently don't check the slurm output except at
this place in the code, so the current code will only catch errors that happen
before a job runs, but we won't get notified of errors that happen after the
last task has started. Is this correct? Would it be possible for the fetcher
code to have a look at the driver.err file?
Switching to individual driver.log/driver.err files will mitigate this, but I
think we won't be able to avoid having a global file because we can only
redirect inside the scripts, and there is always stuff that can happen outside
the scripts. I expect in the future the global logs will almost always be empty,
but as far as I understand, they will still exist, so we should check them.
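The suggestion that the fetcher inspect the global driver.err could look roughly like this. This is a sketch under the assumption that the file lives at grid-steps/driver.err inside the experiment directory; the function name and warning format are made up.

```python
import os

def check_global_driver_err(exp_dir):
    """Return a warning string if the job-wide stderr log is non-empty.

    This mirrors the `[ -s "driver.err" ]` test from the sbatch script,
    but runs at fetch time instead of aborting subsequent tasks.
    Hypothetical sketch; not lab's actual fetcher code.
    """
    path = os.path.join(exp_dir, "grid-steps", "driver.err")
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as f:
            return "Slurm wrote to {}:\n{}".format(path, f.read())
    return None
```

A non-None result would then trigger the "big warning in the reports" mentioned above, without silently skipping any runs.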
|
msg6518 |
Author: florian |
Date: 2017-09-03.19:07:09 |
|
I don't think this is something going wrong on a slurm level. The bash code
Jendrik posted is part of lab's main experiment. It is included there to stop a
"run" step of an experiment if the "build" step failed, but this is probably too
extreme. Jendrik's explanation makes sense to me: one task runs out of memory,
this triggers slurm to kill the task and write a warning to stderr. The warning
ends up in driver.err, and the next task that starts will trigger the if-block and
exit without generating run.{log,err}.
To fix this, we should either make sure that task-specific warnings end up in
task-specific logs or, if they can end up in job-wide logs, stop treating
this as a critical failure of the whole job.
I suggest we remove the if-block as a short-term solution but think about adding
it back in once we have task-specific logs (I think whether we want to have the
check there is a separate question).
In any case, I'd add something to the parser that treats a non-existent run.log
as a critical error.
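The last point, treating a missing run.log as a critical error in the parser, could be sketched as follows. The error value "unexplained:run-log-missing" and the function name are made up for illustration.

```python
import os

def check_run_log(run_dir):
    """Flag a pristine run dir (no run.log) as an unexplained error.

    A missing run.log means the run script was never executed, e.g.,
    because an earlier task's stderr output aborted the whole job.
    Hypothetical sketch; the property value is not lab's actual name.
    """
    if not os.path.exists(os.path.join(run_dir, "run.log")):
        return {"error": "unexplained:run-log-missing"}
    return {}
```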
|
msg6513 |
Author: malte |
Date: 2017-09-03.16:26:01 |
|
OK, runs that fail silently are of course a serious problem if they mean we
can't trust our experiments any more. Could this also happen with a more
"regular" lab experiment, or will lab complain about missing attributes or
something like that as an unexpected error in other circumstances where we have
a more normal attribute list? Either way, I find this disconcerting -- one of
our first priorities should be to make sure that we notice if things go wrong.
Are the runs that don't get started something we should report to the Slurm
admins as well?
|
msg6512 |
Author: jendrik |
Date: 2017-09-03.16:20:29 |
|
1) I copied our code from Fast Downward for measuring peak memory usage and print
it just before the program exits.
2) It appears these tasks have never been executed by Slurm. The run_dirs are
pristine, i.e., they contain the run script etc. but no logs or output files.
Florian, didn't you have a similar issue recently? The problem is probably that
the validator uses too much memory, which makes slurm log "slurmstepd: error:
Exceeded step memory limit at some point." to "grid-steps/driver.err", which
makes all runs that are started afterwards exit early. Here is the
relevant part of the Slurm sbatch script:
if [ -s "driver.err" ]; then
    echo "There was output on stderr. Please check 'driver.err'. Aborting."
    exit 1
fi
cd ../issue734-base
RUN="$SLURM_ARRAY_TASK_ID"
./run-dispatcher.py "$RUN"
As long as we don't have Slurm logs per run, maybe we should remove the if-test.
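The peak-memory measurement mentioned in point 1 above can be approximated on Linux by reading the VmPeak entry from /proc/self/status. This is a sketch of that approach; Fast Downward's actual code may differ in details and error handling.

```python
def get_peak_memory_in_kb():
    """Read the process's peak virtual memory (VmPeak) on Linux.

    Sketch of the /proc-based approach; assumes a Linux /proc
    filesystem and may differ from Fast Downward's real code.
    """
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmPeak:"):
                # Line format: "VmPeak:     12345 kB"
                return int(line.split()[1])
    raise RuntimeError("VmPeak entry not found")

print("VmPeak: {} KiB".format(get_peak_memory_in_kb()))
```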
|
msg6510 |
Author: malte |
Date: 2017-09-03.15:49:27 |
|
Thanks! It seems the memory usage is almost always under 20 MiB, so one would
hope that the very large values in PSR-Large could be reduced. (Reasoning with
axioms does make validation harder, and in PSR-Large, a substantial number of
derived atoms need to be inferred. But of course this should not reach multiple
gigabytes in a reasonable implementation.)
Two questions:
1) How did you measure validator peak memory?
2) What do the "None" entries for validation time and validation memory mean?
For example, they occur in philosophers, schedule, transport-opt08,
visitall-opt11, visitall-opt14 and visitall-sat11.
|
msg6509 |
Author: jendrik |
Date: 2017-09-03.15:11:28 |
|
Here are the results: http://ai.cs.unibas.ch/_tmp_files/seipp/issue734-base.html
The unexplained errors are due to the translator and the validator running out of
memory.
|
msg6499 |
Author: jendrik |
Date: 2017-09-01.17:57:16 |
|
The all-domains experiment is queued.
|
msg6498 |
Author: jendrik |
Date: 2017-09-01.17:56:28 |
|
I'm attaching the PSR-Large plans.
|
msg6483 |
Author: malte |
Date: 2017-09-01.13:24:46 |
|
PS: I think it would be interesting to also do at least one all-domains
experiment that measures validation time and peak memory for future reference.
The same cea_lazy config might be a good one to use; it's not our greatest, but
it does seem to do well in some problematic domains.
|
msg6482 |
Author: malte |
Date: 2017-09-01.13:23:22 |
|
Thanks, Jendrik! Sounds like INVAL is too slow for us to use in PSR-Large, but
perhaps the "Environment" thing can be changed.
To make working on this issue easier, can you attach some large plans that
people can use to test this? Perhaps the set of all PSR-Large plans for config
issue731-v4-cea_lazy-release32 from issue731 would be a good test set.
|
msg6481 |
Author: jendrik |
Date: 2017-09-01.13:01:23 |
|
In experiments for issue731, we noticed that VAL uses more than 4000 MiB (and thus
leads to non-validated tasks on the grid) for the following tasks:
psr-large:p37-s193-n50-l2-f30.pddl
psr-large:p38-s195-n50-l3-f30.pddl
psr-large:p41-s201-n60-l2-f30.pddl
psr-large:p43-s205-n70-l2-f30.pddl
psr-large:p45-s209-n80-l2-f30.pddl
psr-large:p47-s213-n90-l2-f30.pddl
The algorithm from the experiment that solves these tasks the fastest is
--heuristic "hlm=lmcount(lm_zg())" --search "lazy_greedy([hlm], preferred=[hlm])".
One option for validating these tasks using less memory is using INVAL
(https://github.com/patrikhaslum/INVAL). However, INVAL is much slower:
psr-large:p01
    VAL:   finished       VmPeak: 5 MiB      Time: 0s
    INVAL: finished       VmPeak: 48 MiB     Time: 72s
psr-large:p30
    VAL:   finished       VmPeak: 1.5 GiB    Time: 2m
    INVAL: out-of-time    VmPeak: 316 MiB    Time: > 53m
psr-large:p37
    VAL:   out-of-memory  VmPeak: > 4 GiB    Time: > 11m
    INVAL: out-of-time    VmPeak: 641 MiB    Time: > 80m
Instructions for obtaining and compiling INVAL:
    sudo apt-get install ecl libffi-dev
    git clone git@github.com:patrikhaslum/INVAL.git
    cd INVAL
    ecl -shell compile-with-ecl
I profiled the memory usage of VAL for a psr-large task (p27) and it seems that the
main reason for the high memory usage is that VAL constantly copies Environment
objects, puts the copies on the heap and stores pointers to them in a std::map
(^/include/Environment.h). The copies are only deleted when the program exits. The
memory profile is attached.
Discussing this issue is on the agenda for the next Fast Downward meeting.
|
Date | User | Action | Args
2021-03-18 14:14:08 | jendrik | set | status: chatting -> resolved; messages: + msg10184
2017-09-21 15:30:25 | jendrik | set | messages: + msg6538
2017-09-04 19:16:19 | jendrik | set | messages: + msg6528
2017-09-04 19:03:34 | malte | set | messages: + msg6527
2017-09-04 19:02:35 | malte | set | messages: + msg6526
2017-09-04 15:44:24 | jendrik | set | messages: + msg6524
2017-09-03 19:24:47 | malte | set | messages: + msg6519
2017-09-03 19:07:10 | florian | set | messages: + msg6518
2017-09-03 16:26:02 | malte | set | messages: + msg6513
2017-09-03 16:20:29 | jendrik | set | messages: + msg6512
2017-09-03 15:49:27 | malte | set | messages: + msg6510
2017-09-03 15:11:28 | jendrik | set | messages: + msg6509
2017-09-01 23:41:22 | florian | set | nosy: + florian
2017-09-01 17:57:16 | jendrik | set | messages: + msg6499
2017-09-01 17:56:28 | jendrik | set | files: + psr-large-plans.tar.gz; messages: + msg6498
2017-09-01 13:24:46 | malte | set | messages: + msg6483
2017-09-01 13:23:22 | malte | set | status: unread -> chatting; messages: + msg6482
2017-09-01 13:01:23 | jendrik | create |