Issue 734

Title: validator uses too much memory
Priority: wish
Status: resolved
Superseder: (none)
Nosy List: florian, jendrik, malte
Assigned To: (none)
Keywords: (none)
Summary: (none)

Created on 2017-09-01.13:01:23 by jendrik, last changed by jendrik.

Files
File name                         Uploaded                      Type
massif.out.val-psr-large-p27.gz   jendrik, 2017-09-01.13:01:23  application/gzip
psr-large-plans.tar.gz            jendrik, 2017-09-01.17:56:28  application/gzip
Messages
msg10184 Author: jendrik Date: 2021-03-18.14:14:08
I'm closing this issue since it's now tracked in the VAL repo.
msg6538 Author: jendrik Date: 2017-09-21.15:30:25
This is probably related to https://github.com/KCL-Planning/VAL/issues/10 .
msg6528 Author: jendrik Date: 2017-09-04.19:16:19
I think fixing the bug would be best, yes.
msg6527 Author: malte Date: 2017-09-04.19:03:34
(Assuming that the behaviour that Jendrik observed is a bug and not a feature. :-))
msg6526 Author: malte Date: 2017-09-04.19:02:35
All sounds good, thanks!

How do we proceed with this issue? Try to fix the VAL bug?
msg6524 Author: jendrik Date: 2017-09-04.15:44:24
Malte, I think your explanation regarding the small number of failed tasks makes sense.

I removed the if-block for now.

Lab now also requires that the "error" attribute is set for each run. It can either be
"none", "out-of-memory", etc., or "unexplained:output-to-stderr", etc. This catches the
case where the parser (which normally sets the "error" attribute) is not run at all
(e.g., when the whole run isn't started by Slurm).

I agree that it would be useful to store all observed errors. I opened 
https://bitbucket.org/jendrikseipp/lab/issues/41/store-all-observed-errors-instead-of-a
for this.

However, we should still inspect the Slurm logs. This requires some preparation.
Currently, all tasks write into the same driver.err and driver.log files, which causes 
these files to contain mostly garbage, since the access is not synchronized. We will have 
to separate the task-wise logs from the global logs before we can parse them. I opened 
https://bitbucket.org/jendrikseipp/lab/issues/40/move-task-wise-slurm-logs-into-run and 
https://bitbucket.org/jendrikseipp/lab/issues/39/check-slurm-logs to keep track of this.
msg6519 Author: malte Date: 2017-09-03.19:24:47
Makes sense, but what I don't understand is why there are so *few* failing tasks
then. If I understand it correctly, as soon as one task fails in this way, all
tasks started afterwards will fail. But the experiment only shows very few
failures. If I hadn't scrolled through all tasks manually to find outliers, we
would have missed this.

Maybe it's because it's a small experiment and the tasks causing failures are
quite long-running ones, so by the time the failures occur, almost everything
has been started.

> I suggest we remove the if-block as a short-term solution but think about adding
> it back in once we have task-specific logs (I think whether we want to have the
> check there is a separate question).

I think slurm writing something to stderr is not something we should ignore, so
we should definitely have some kind of check of slurm's output. I think we
should treat every slurm error as either a lab bug (in this case, failure to set
a memory limit for VAL) or a sign that something serious has gone wrong that
needs investigating (such as grid nodes not working properly). In either case, I
guess the user deserves a big warning in the reports (if we even want to produce
a report).

If I understand correctly, we currently don't check the slurm output except in
this place in the code, so the current code will only catch errors that happen
before a job runs, but we won't get notified of errors that happen after the
last task has started. Is this correct? Would it be possible for the fetcher
code to have a look at the driver.err file?

Switching to individual driver.log/driver.err files will mitigate this, but I
think we won't be able to avoid having a global file because we can only
redirect inside the scripts, and there is always stuff that can happen outside
the scripts. I expect in the future the global logs will almost always be empty,
but as far as I understand, they will still exist, so we should check them.
msg6518 Author: florian Date: 2017-09-03.19:07:09
I don't think this is something going wrong on the slurm level. The bash code
Jendrik posted is part of lab's main experiment. It is included there to stop the
"run" step of an experiment if the "build" step failed, but this is probably too
extreme. Jendrik's explanation makes sense to me: one task runs out of memory, which
triggers slurm to kill the task and write a warning to stderr. The warning ends
up in driver.err, and the next task that starts will trigger the if-block and
exit without generating run.{log,err}.

To fix this, we should either make sure that task-specific warnings end up in
task-specific logs or, if they can end up in job-wide logs, not treat
this as a critical failure of the whole job.

I suggest we remove the if-block as a short-term solution but think about adding
it back in once we have task-specific logs (I think whether we want to have the
check there is a separate question).

In any case, I'd add something to the parser that treats a non-existing run.log
as a critical error.
msg6513 Author: malte Date: 2017-09-03.16:26:01
OK, runs that fail silently are of course a serious problem if they mean we
can't trust our experiments any more. Could this also happen with a more
"regular" lab experiment, or will lab complain about missing attributes or
something like that as an unexpected error in other circumstances where we have
a more normal attribute list? Either way, I find this disconcerting -- one of
our first priorities should be to make sure that we notice if things go wrong.

Are the runs that don't get started something we should report to the Slurm
admins as well?
msg6512 Author: jendrik Date: 2017-09-03.16:20:29
1) I copied our code from Fast Downward for measuring peak memory usage and print
the value just before the program exits (see the sketch at the end of this message).

2) It appears these tasks have never been executed by Slurm. The run_dirs are 
pristine, i.e., they contain the run script etc. but no logs or output files. 
Florian, didn't you have a similar issue recently? The problem is probably that 
the validator uses too much memory, which makes slurm log "slurmstepd: error: 
Exceeded step memory limit at some point." to "grid-steps/driver.err", which 
makes all runs that are started afterwards exit early. Here is the
relevant part from the Slurm sbatch script:


if [ -s "driver.err" ]; then
    echo "There was output on stderr. Please check 'driver.err'. Aborting."
    exit 1
fi

cd ../issue734-base
RUN="$SLURM_ARRAY_TASK_ID"
./run-dispatcher.py "$RUN"


As long as we don't have Slurm logs per run, maybe we should remove the if-test.
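
For reference, here is a minimal sketch of the kind of peak-memory measurement meant in 1):
on Linux, the peak virtual memory of a process can be read from the "VmPeak" entry in
/proc/self/status. This is only an illustration and may differ in detail from the actual
Fast Downward code.

#include <fstream>
#include <iostream>
#include <string>

// Return the peak virtual memory usage of this process in KiB, or -1 if the
// value cannot be determined (e.g., when /proc is not available).
static long get_peak_memory_in_kb() {
    std::ifstream status("/proc/self/status");
    std::string token;
    while (status >> token) {
        if (token == "VmPeak:") {
            long kb;
            if (status >> kb)
                return kb;
            break;
        }
    }
    return -1;
}

int main() {
    // ... run the actual work (e.g., the validator) here ...
    std::cout << "Peak memory: " << get_peak_memory_in_kb() << " KB" << std::endl;
    return 0;
}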
msg6510 Author: malte Date: 2017-09-03.15:49:27
Thanks! It seems the memory usage is almost always under 20 MiB, so one would
hope that the very large values in PSR-Large could be reduced. (Reasoning with
axioms does make validation harder, and in PSR-Large, a substantial number of
derived atoms need to be inferred. But of course this should not reach multiple
gigabytes in a reasonable implementation.)

Two questions:

1) How did you measure validator peak memory?
2) What do the "None" entries for validation time and validation memory mean?
For example, they occur in philosophers, schedule, transport-opt08,
visitall-opt11, visitall-opt14 and visitall-sat11.
msg6509 Author: jendrik Date: 2017-09-03.15:11:28
Here are the results: http://ai.cs.unibas.ch/_tmp_files/seipp/issue734-base.html

The unexplained errors are due to the translator and the validator running out of 
memory.
msg6499 Author: jendrik Date: 2017-09-01.17:57:16
The all-domains experiment is queued.
msg6498 Author: jendrik Date: 2017-09-01.17:56:28
I'm attaching the PSR-Large plans.
msg6483 Author: malte Date: 2017-09-01.13:24:46
PS: I think it would be interesting to also do at least one all-domains
experiment that measures validation time and peak memory for future reference.
The same cea_lazy config might be a good one to use; it's not our greatest, but
it does seem to do well in some problematic domains.
msg6482 Author: malte Date: 2017-09-01.13:23:22
Thanks, Jendrik! Sounds like INVAL is too slow for us to use in PSR-Large, but
perhaps the "Environment" thing can be changed.

To make working on this issue easier, can you attach some large plans that
people can use to test this? Perhaps the set of all PSR-Large plans for config
issue731-v4-cea_lazy-release32 from issue731 would be a good test set.
msg6481 Author: jendrik Date: 2017-09-01.13:01:23
In experiments for issue731, we noticed that VAL uses more than 4000 MiB (and thus 
leads to non-validated tasks on the grid) for the following tasks:

  psr-large:p37-s193-n50-l2-f30.pddl
  psr-large:p38-s195-n50-l3-f30.pddl
  psr-large:p41-s201-n60-l2-f30.pddl
  psr-large:p43-s205-n70-l2-f30.pddl
  psr-large:p45-s209-n80-l2-f30.pddl
  psr-large:p47-s213-n90-l2-f30.pddl

The algorithm from the experiment that solves these tasks the fastest is 
--heuristic "hlm=lmcount(lm_zg())" --search "lazy_greedy([hlm], preferred=[hlm])".

One option for validating these tasks with less memory is to use INVAL
(https://github.com/patrikhaslum/INVAL). However, INVAL is much slower:

  psr-large:p01
    VAL:   finished       VmPeak:     5 MiB  Time:    0s
    INVAL: finished       VmPeak:    48 MiB  Time:   72s

  psr-large:p30
    VAL:   finished       VmPeak:   1.5 GiB  Time:    2m
    INVAL: out-of-time    VmPeak:   316 MiB  Time: > 53m

  psr-large:p37
    VAL:   out-of-memory  VmPeak:   > 4 GiB  Time: > 11m
    INVAL: out-of-time    VmPeak:   641 MiB  Time: > 80m

Instructions for obtaining and compiling INVAL:
  sudo apt-get install ecl libffi-dev
  git clone git@github.com:patrikhaslum/INVAL.git
  cd INVAL
  ecl -shell compile-with-ecl

I profiled the memory usage of VAL for a psr-large task (p27) and it seems that the 
main reason for the high memory usage is that VAL constantly copies Environment 
objects, puts the copies on the heap and stores pointers to them in a std::map 
(^/include/Environment.h). The copies are only deleted when the program exits. The 
memory profile is attached.
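
To make that pattern concrete, here is a small stand-alone sketch of the behaviour described
above. It is illustrative only: the class name is borrowed from VAL, but the data layout,
the map and the key type are made up and do not match the actual code in
^/include/Environment.h.

#include <map>
#include <string>

// Stand-in for VAL's Environment class; the real class differs.
struct Environment {
    std::map<std::string, std::string> bindings;
};

int main() {
    // Long-lived map of heap-allocated copies, keyed by an arbitrary id here.
    std::map<int, const Environment *> copies;
    Environment env;
    for (int step = 0; step < 100000; ++step) {
        env.bindings["x"] = std::to_string(step);
        copies[step] = new Environment(env);  // one fresh heap copy per step
    }
    // The copies are never deleted before the program exits, so peak memory
    // grows with the number of copies -- the same pattern the attached massif
    // profile shows for VAL.
    return 0;
}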

Discussing this issue is on the agenda for the next Fast Downward meeting.
History
Date                 User     Action  Args
2021-03-18 14:14:08  jendrik  set     status: chatting -> resolved; messages: + msg10184
2017-09-21 15:30:25  jendrik  set     messages: + msg6538
2017-09-04 19:16:19  jendrik  set     messages: + msg6528
2017-09-04 19:03:34  malte    set     messages: + msg6527
2017-09-04 19:02:35  malte    set     messages: + msg6526
2017-09-04 15:44:24  jendrik  set     messages: + msg6524
2017-09-03 19:24:47  malte    set     messages: + msg6519
2017-09-03 19:07:10  florian  set     messages: + msg6518
2017-09-03 16:26:02  malte    set     messages: + msg6513
2017-09-03 16:20:29  jendrik  set     messages: + msg6512
2017-09-03 15:49:27  malte    set     messages: + msg6510
2017-09-03 15:11:28  jendrik  set     messages: + msg6509
2017-09-01 23:41:22  florian  set     nosy: + florian
2017-09-01 17:57:16  jendrik  set     messages: + msg6499
2017-09-01 17:56:28  jendrik  set     files: + psr-large-plans.tar.gz; messages: + msg6498
2017-09-01 13:24:46  malte    set     messages: + msg6483
2017-09-01 13:23:22  malte    set     status: unread -> chatting; messages: + msg6482
2017-09-01 13:01:23  jendrik  create