Issue 734

Title: validator uses too much memory
Priority: wish
Status: resolved
Superseder: (none)
Nosy List: florian, jendrik, malte
Assigned To: (none)
Keywords: (none)
Summary: (none)

Created on 2017-09-01.13:01:23 by jendrik, last changed by jendrik.

Files
File name                         Uploaded                      Type
massif.out.val-psr-large-p27.gz   jendrik, 2017-09-01.13:01:23  application/gzip
psr-large-plans.tar.gz            jendrik, 2017-09-01.17:56:28  application/gzip
Messages
msg10184 Author: jendrik Date: 2021-03-18.14:14:08
I'm closing this issue since it's now tracked in the VAL repo.
msg6538 Author: jendrik Date: 2017-09-21.15:30:25
This is probably related to https://github.com/KCL-Planning/VAL/issues/10 .
msg6528 Author: jendrik Date: 2017-09-04.19:16:19
I think fixing the bug would be best, yes.
msg6527 Author: malte Date: 2017-09-04.19:03:34
(Assuming that the behaviour that Jendrik observed is a bug and not a feature. :-))
msg6526 Author: malte Date: 2017-09-04.19:02:35
All sounds good, thanks!

How do we proceed with this issue? Try to fix the VAL bug?
msg6524 Author: jendrik Date: 2017-09-04.15:44:24
Malte, I think your explanation regarding the small number of failed tasks makes sense.

I removed the if-block for now.

Lab now also requires that the "error" attribute is set for each run. It can either be
"none", "out-of-memory", etc., or "unexplained:output-to-stderr", etc. This catches the
case where the parser (which normally sets the "error" attribute) is not run at all
(e.g., when the whole run isn't started by Slurm).

I agree that it would be useful to store all observed errors. I opened 
https://bitbucket.org/jendrikseipp/lab/issues/41/store-all-observed-errors-instead-of-a
for this.

However, we should still inspect the Slurm logs. This requires some preparation.
Currently, all tasks write into the same driver.err and driver.log files, which causes 
these files to contain mostly garbage, since the access is not synchronized. We will have 
to separate the task-wise logs from the global logs before we can parse them. I opened 
https://bitbucket.org/jendrikseipp/lab/issues/40/move-task-wise-slurm-logs-into-run and 
https://bitbucket.org/jendrikseipp/lab/issues/39/check-slurm-logs to keep track of this.
msg6519 Author: malte Date: 2017-09-03.19:24:47
Makes sense, but what I don't understand is why there are so *few* failing tasks
then. If I understand it correctly, as soon as one task fails in this way, all
tasks started afterwards will fail. But the experiment only shows very few
failures. If I hadn't scrolled through all tasks manually to find outliers, we
would have missed this.

Maybe it's because it's a small experiment and the tasks causing failures are
quite long-running ones, so by the time the failures occur, almost everything
has been started.

> I suggest we remove the if-block as a short-term solution but think about adding
> it back in once we have task-specific logs (I think whether we want to have the
> check there is a separate question).

I think slurm writing something to stderr is not something we should ignore, so
we should definitely have some kind of check of slurm's output. I think we
should treat every slurm error as either a lab bug (in this case, failure to set
a memory limit for VAL) or a sign that something serious has gone wrong that
needs investigating (such as grid nodes not working properly). In either case, I
guess the user deserves a big warning in the reports (if we even want to produce
a report).

If I understand correctly, we currently don't check the slurm output except in
this place in the code, so the current code will only catch errors that happen
before a job runs, but we won't get notified of errors that happen after the
last task has started. Is this correct? Would it be possible for the fetcher
code to have a look at the driver.err file?

Switching to individual driver.log/driver.err files will mitigate this, but I
think we won't be able to avoid having a global file because we can only
redirect inside the scripts, and there is always stuff that can happen outside
the scripts. I expect in the future the global logs will almost always be empty,
but as far as I understand, they will still exist, so we should check them.
msg6518 Author: florian Date: 2017-09-03.19:07:09
I don't think this is something going wrong on the slurm level. The bash code
Jendrik posted is part of lab's main experiment. It is included there to stop the
"run" step of an experiment if the "build" step failed, but this is probably too
extreme. Jendrik's explanation makes sense to me: one task runs out of memory, which
triggers slurm to kill the task and write a warning to stderr. The warning ends
up in driver.err, and the next task that starts will trigger the if-block and
exit without generating run.{log,err}.

To fix this, we should either make sure that task-specific warnings end up in
task-specific logs or, if they can end up in job-wide logs, not treat
this as a critical failure of the whole job.

I suggest we remove the if-block as a short-term solution but think about adding
it back in once we have task-specific logs (I think whether we want to have the
check there is a separate question).

In any case, I'd add something to the parser that treats a non-existing run.log
as a critical error.
msg6513 Author: malte Date: 2017-09-03.16:26:01
OK, runs that fail silently are of course a serious problem if they mean we
can't trust our experiments any more. Could this also happen with a more
"regular" lab experiment, or will lab complain about missing attributes or
something like that as an unexpected error in other circumstances where we have
a more normal attribute list? Either way, I find this disconcerting -- one of
our first priorities should be to make sure that we notice if things go wrong.

Are the runs that don't get started something we should report to the Slurm
admins as well?
msg6512 Author: jendrik Date: 2017-09-03.16:20:29
1) I copied our code from Fast Downward for measuring peak memory usage and print
the value just before the program exits (see the sketch at the end of this message).

2) It appears these tasks have never been executed by Slurm. The run_dirs are 
pristine, i.e., they contain the run script etc. but no logs or output files. 
Florian, didn't you have a similar issue recently? The problem is probably that 
the validator uses too much memory, which makes slurm log "slurmstepd: error: 
Exceeded step memory limit at some point." to "grid-steps/driver.err", which 
makes all runs that are started afterwards exit early. Here is the
relevant part from the Slurm sbatch script:


if [ -s "driver.err" ]; then
    echo "There was output on stderr. Please check 'driver.err'. Aborting."
    exit 1
fi

cd ../issue734-base
RUN="$SLURM_ARRAY_TASK_ID"
./run-dispatcher.py "$RUN"


As long as we don't have Slurm logs per run, maybe we should remove the if-test.
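
For reference, here is a minimal sketch of the kind of peak-memory measurement meant in 1):
on Linux, the peak virtual memory of a process can be read from the "VmPeak" entry in
/proc/self/status. This is only an illustration and may differ in detail from the actual
Fast Downward code.

#include <fstream>
#include <iostream>
#include <string>

// Return the peak virtual memory usage of this process in KiB, or -1 if the
// value cannot be determined (e.g., when /proc is not available).
static long get_peak_memory_in_kb() {
    std::ifstream status("/proc/self/status");
    std::string token;
    while (status >> token) {
        if (token == "VmPeak:") {
            long kb;
            if (status >> kb)
                return kb;
            break;
        }
    }
    return -1;
}

int main() {
    // ... run the actual work (e.g., the validator) here ...
    std::cout << "Peak memory: " << get_peak_memory_in_kb() << " KB" << std::endl;
    return 0;
}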
msg6510 Author: malte Date: 2017-09-03.15:49:27
Thanks! It seems the memory usage is almost always under 20 MiB, so one would
hope that the very large values in PSR-Large could be reduced. (Reasoning with
axioms does make validation harder, and in PSR-Large, a substantial number of
derived atoms need to be inferred. But of course this should not reach multiple
gigabytes in a reasonable implementation.)

Two questions:

1) How did you measure validator peak memory?
2) What do the "None" entries for validation time and validation memory mean?
For example, they occur in philosophers, schedule, transport-opt08,
visitall-opt11, visitall-opt14 and visitall-sat11.
msg6509 Author: jendrik Date: 2017-09-03.15:11:28
Here are the results: http://ai.cs.unibas.ch/_tmp_files/seipp/issue734-base.html

The unexplained errors are due to the translator and the validator running out of 
memory.
msg6499 Author: jendrik Date: 2017-09-01.17:57:16
The all-domains experiment is queued.
msg6498 Author: jendrik Date: 2017-09-01.17:56:28
I'm attaching the PSR-Large plans.
msg6483 Author: malte Date: 2017-09-01.13:24:46
PS: I think it would be interesting to also do at least one all-domains
experiment that measures validation time and peak memory for future reference.
The same cea_lazy config might be a good one to use; it's not our greatest, but
it does seem to do well in some problematic domains.
msg6482 Author: malte Date: 2017-09-01.13:23:22
Thanks, Jendrik! Sounds like INVAL is too slow for us to use in PSR-Large, but
perhaps the "Environment" thing can be changed.

To make working on this issue easier, can you attach some large plans that
people can use to test this? Perhaps the set of all PSR-Large plans for config
issue731-v4-cea_lazy-release32 from issue731 would be a good test set.
msg6481 Author: jendrik Date: 2017-09-01.13:01:23
In experiments for issue731, we noticed that VAL uses more than 4000 MiB (and thus 
leads to non-validated tasks on the grid) for the following tasks:

  psr-large:p37-s193-n50-l2-f30.pddl
  psr-large:p38-s195-n50-l3-f30.pddl
  psr-large:p41-s201-n60-l2-f30.pddl
  psr-large:p43-s205-n70-l2-f30.pddl
  psr-large:p45-s209-n80-l2-f30.pddl
  psr-large:p47-s213-n90-l2-f30.pddl

The algorithm from the experiment that solves these tasks the fastest is 
--heuristic "hlm=lmcount(lm_zg())" --search "lazy_greedy([hlm], preferred=[hlm])".

One option for validating these tasks with less memory is to use INVAL
(https://github.com/patrikhaslum/INVAL). However, INVAL is much slower:

  psr-large:p01
    VAL:   finished       VmPeak:     5 MiB  Time:    0s
    INVAL: finished       VmPeak:    48 MiB  Time:   72s

  psr-large:p30
    VAL:   finished       VmPeak:   1.5 GiB  Time:    2m
    INVAL: out-of-time    VmPeak:   316 MiB  Time: > 53m

  psr-large:p37
    VAL:   out-of-memory  VmPeak:   > 4 GiB  Time: > 11m
    INVAL: out-of-time    VmPeak:   641 MiB  Time: > 80m

Instructions for obtaining and compiling INVAL:
  sudo apt-get install ecl libffi-dev
  git clone git@github.com:patrikhaslum/INVAL.git
  cd INVAL
  ecl -shell compile-with-ecl

I profiled the memory usage of VAL for a psr-large task (p27) and it seems that the 
main reason for the high memory usage is that VAL constantly copies Environment 
objects, puts the copies on the heap and stores pointers to them in a std::map 
(^/include/Environment.h). The copies are only deleted when the program exits. The 
memory profile is attached.
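
To make that pattern concrete, here is a small stand-alone sketch of the behaviour described
above. It is illustrative only: the class name is borrowed from VAL, but the data layout,
the map and the key type are made up and do not match the actual code in
^/include/Environment.h.

#include <map>
#include <string>

// Stand-in for VAL's Environment class; the real class differs.
struct Environment {
    std::map<std::string, std::string> bindings;
};

int main() {
    // Long-lived map of heap-allocated copies, keyed by an arbitrary id here.
    std::map<int, const Environment *> copies;
    Environment env;
    for (int step = 0; step < 100000; ++step) {
        env.bindings["x"] = std::to_string(step);
        copies[step] = new Environment(env);  // one fresh heap copy per step
    }
    // The copies are never deleted before the program exits, so peak memory
    // grows with the number of copies -- the same pattern the attached massif
    // profile shows for VAL.
    return 0;
}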

Discussing this issue is on the agenda for the next Fast Downward meeting.
History
Date                 User     Action  Args
2021-03-18 14:14:08  jendrik  set     status: chatting -> resolved; messages: + msg10184
2017-09-21 15:30:25  jendrik  set     messages: + msg6538
2017-09-04 19:16:19  jendrik  set     messages: + msg6528
2017-09-04 19:03:34  malte    set     messages: + msg6527
2017-09-04 19:02:35  malte    set     messages: + msg6526
2017-09-04 15:44:24  jendrik  set     messages: + msg6524
2017-09-03 19:24:47  malte    set     messages: + msg6519
2017-09-03 19:07:10  florian  set     messages: + msg6518
2017-09-03 16:26:02  malte    set     messages: + msg6513
2017-09-03 16:20:29  jendrik  set     messages: + msg6512
2017-09-03 15:49:27  malte    set     messages: + msg6510
2017-09-03 15:11:28  jendrik  set     messages: + msg6509
2017-09-01 23:41:22  florian  set     nosy: + florian
2017-09-01 17:57:16  jendrik  set     messages: + msg6499
2017-09-01 17:56:28  jendrik  set     files: + psr-large-plans.tar.gz; messages: + msg6498
2017-09-01 13:24:46  malte    set     messages: + msg6483
2017-09-01 13:23:22  malte    set     status: unread -> chatting; messages: + msg6482
2017-09-01 13:01:23  jendrik  create