msg4656 (view) |
Author: florian |
Date: 2015-10-15.18:05:35 |
|
Thank you! I just merged.
|
msg4655 (view) |
Author: jendrik |
Date: 2015-10-13.20:17:39 |
|
I also had a look at the code and believe it's fine.
|
msg4654 (view) |
Author: malte |
Date: 2015-10-13.17:34:40 |
|
Thanks, Florian! I looked over the code very briefly. It would be nice if
someone else could double-check, but I wouldn't consider it essential.
|
msg4653 (view) |
Author: florian |
Date: 2015-10-13.17:24:52 |
|
Now that issue67 is merged, we could test this on Windows. I merged the current
default branch and updated the pull request. The experiments still seem fine: no
hanging processes and the error codes are reported correctly. The build also
works on Windows, but some features are not available there: SIGXCPU does not
exist, and we have to use signal() instead of sigaction(), which can cause
strange behaviour when too many signals are received in a short time.
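For illustration, the kind of platform split this implies could look roughly
like the sketch below (register_signal_handlers and the exact list of signals
are placeholders, not the code from the pull request):

    #include <csignal>     // signal()
    #include <cstdlib>     // std::_Exit()
    #if !defined(_WIN32)
    #include <signal.h>    // sigaction(), sigemptyset()
    #endif

    extern "C" void signal_handler(int signal_number) {
        // Placeholder: the real handler prints peak memory, then exits/re-raises.
        std::_Exit(128 + signal_number);
    }

    static void register_signal_handlers() {
    #if defined(_WIN32)
        // Windows has neither sigaction() nor SIGXCPU, so fall back to signal().
        signal(SIGABRT, signal_handler);
        signal(SIGTERM, signal_handler);
        signal(SIGSEGV, signal_handler);
        signal(SIGINT, signal_handler);
    #else
        struct sigaction action;
        action.sa_handler = signal_handler;
        sigemptyset(&action.sa_mask);
        action.sa_flags = 0;
        sigaction(SIGABRT, &action, nullptr);
        sigaction(SIGTERM, &action, nullptr);
        sigaction(SIGSEGV, &action, nullptr);
        sigaction(SIGINT, &action, nullptr);
        sigaction(SIGXCPU, &action, nullptr);
    #endif
    }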
There seems to be no way around this on Windows, so I suggest we leave it like
this. Can someone do a review? The merge was a bit complicated, so the code
should be reviewed again.
Pull request:
https://bitbucket.org/flogo/downward-issues-issue479/pull-requests/1
Reports:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479-5min.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479.html
|
msg3851 (view) |
Author: malte |
Date: 2014-10-20.15:37:13 |
|
I don't think we need larger experiments.
|
msg3849 (view) |
Author: florian |
Date: 2014-10-20.15:14:02 |
|
No more hanging processes in an experiment with our previous setting (airport,
blind and M&S, release compiled). The same experiment with a time limit of 5
minutes also did not have any hanging processes. Should we run larger
experiments now? If so, which configs?
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479-5min.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479.html
|
msg3819 (view) |
Author: malte |
Date: 2014-10-13.22:04:55 |
|
There are many ways to refactor the code to avoid this, but there is also no
problem with x.cc depending on y.h and y.cc depending on x.h. I'm sure we have
hundreds of examples of this in the current codebase.
|
msg3818 (view) |
Author: florian |
Date: 2014-10-13.21:57:39 |
|
How about the non-reentrant version of print_peak_memory? It should stay in
utilities, but the reentrant version calls it on systems without a reentrant
implementation. Wouldn't this cause a cyclic dependency?
|
msg3817 (view) |
Author: malte |
Date: 2014-10-13.21:47:53 |
|
The division into files should generally be governed by the interface, i.e., by
the users. ABORT() is a generally useful facility and should remain in
utilities.h, I would say. The other low-level system code could go into a
separate file that is then only used from a high-level function in
utilities.{h,cc}.
That is, utilities.{h,cc} would contain the interface for the rest of the
planner (ABORT, exit_with, some function for registering exit functions, out of
memory and signal handlers), and everything that would only be used as tools to
implement these could go into a separate file.
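A rough sketch of that split, with purely illustrative file and function names
(the real exit codes and ABORT are more elaborate), might look like this:

    // utilities.h (sketch): the interface used by the rest of the planner.
    #include <cstdlib>
    #include <iostream>

    enum ExitCode {
        EXIT_PLAN_FOUND,
        EXIT_CRITICAL_ERROR,
        EXIT_OUT_OF_MEMORY
        // ... remaining exit codes ...
    };

    #define ABORT(msg)                                              \
        do {                                                        \
            std::cerr << "Critical error: " << (msg) << std::endl;  \
            std::abort();                                           \
        } while (false)

    void exit_with(ExitCode exit_code);
    void register_event_handlers();

    // system_tools.h (sketch): low-level helpers used only to implement
    // the functions above (reentrant printing, the actual handlers, ...).
    void write_reentrant(const char *message);
    int get_peak_memory_in_kb_reentrant();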
|
msg3816 (view) |
Author: florian |
Date: 2014-10-13.21:43:03 |
|
Thanks for the code review. We discussed offline that we could move all
reentrant code into a new file. I was wondering if the functions signal_handler,
out_of_memory_handler, exit_handler and exit_with should also go there since
they are reentrant, too. We could even move everything related to exiting the
planner into a new file. In addition to the list above this would be
register_event_handlers, the ABORT macro and the definition of exit codes.
|
msg3808 (view) |
Author: florian |
Date: 2014-10-13.14:56:55 |
|
Yes, sorry. I ran this locally, hit ctrl+c and got the following output:
Peak memory: eak: KB
|
msg3807 (view) |
Author: malte |
Date: 2014-10-13.14:34:43 |
|
Can you be more specific w.r.t. "which is still buggy"?
|
msg3806 (view) |
Author: florian |
Date: 2014-10-13.14:32:13 |
|
I started with an implementation which is still buggy. Can one of you have a look?
https://bitbucket.org/flogo/downward-issues-issue479/pull-request/1
|
msg3802 (view) |
Author: malte |
Date: 2014-10-12.20:32:01 |
|
> If I understand this correctly, we should make the definition of signal handler
> "extern "C" void signal_handler(int)".
This may be safer, but I'd be very surprised if that really made a difference
here. (So: feel free to change it, but I don't think we need to.)
> For cout and fprintf one explanation was that they modify global data (i.e. the
> content of stdout), but this would also be a problem for write().
write() is a system call and is on the list of guaranteed reentrant functions
that you linked. (The problem with iostreams and fprintf is that they can do
buffering on top of the system calls that perform the actual I/O.)
Generally speaking, for library functions such as atoi that have no good reason
to require static data, I'd assume they are reentrant unless we see problems.
|
msg3801 (view) |
Author: florian |
Date: 2014-10-12.20:00:58 |
|
Re-entrancy is interesting stuff, but it's a bit hard to find definite
information about it. Here are some things I found out:
"Signal handlers are expected to have C linkage and, in general, only use the
features from the common subset of C and C++. It is implementation-defined if a
function with C++ linkage can be used as a signal handler."
(http://en.cppreference.com/w/cpp/utility/program/signal)
If I understand this correctly, we should make the definition of signal handler
"extern "C" void signal_handler(int)".
The list of guaranteed re-entrant functions only includes system calls:
https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers#SIG30-C.Callonlyasynchronous-safefunctionswithinsignalhandlers-Asynchronous-Signal-SafeFunctions
For malloc, cout and fprintf, most sources agree that they are not re-entrant,
although the reasons for this vary from one explanation to the next. For
malloc, we saw that the reason is locking. For cout and fprintf, one explanation
was that they modify global data (i.e. the content of stdout), but this would
also be a problem for write(). Another page claimed that all functions that
operate on FILE* variables are supposed to use flockfile(). If this is the case,
we might have to unlock stdout before writing to it in the signal handler.
http://stackoverflow.com/questions/467938/stdout-thread-safe-in-c-on-linux
http://stackoverflow.com/questions/3941271/why-are-malloc-and-printf-said-as-non-reentrant#3941499
Some string manipulation functions (atoi, strlen, sprintf, ...) could be
re-entrant, but this apparently depends on the implementation, so it is not clear
if we can use any of them. For example, the following page lists them as
re-entrant, but it is for a different compiler:
http://www.mikecramer.com/qnx/qnx_4.25_docs/watcom/clibref/creentrant_fns.html
As far as I know, the C standard does not require any function to be re-entrant
except the system calls mentioned above.
|
msg3800 (view) |
Author: silvan |
Date: 2014-10-12.11:47:43 |
|
Sure, sorry for not replying earlier. (I am still following your discussion.)
|
msg3799 (view) |
Author: florian |
Date: 2014-10-12.11:13:12 |
|
(Silvan, I'm stealing this issue from you)
The debug experiment confirmed our suspicion:
#0 0x55555430 in __kernel_vsyscall ()
#1 0x08c25202 in __lll_lock_wait_private ()
#2 0x08c4fab2 in _L_lock_9520 ()
#3 0x08c4d990 in malloc ()
#4 0x08c3f135 in __fopen_internal ()
#5 0x08c405c0 in fopen64 ()
#6 0x08c0c727 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode,
int) ()
#7 0x08bc35cb in std::basic_filebuf<char, std::char_traits<char> >::open(char
const*, std::_Ios_Openmode) ()
#8 0x080c1c64 in open (use_buffered_input=false) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/fstream:527
#9 get_peak_memory_in_kb (use_buffered_input=false) at utilities.cc:189
#10 0x080c2610 in print_peak_memory (use_buffered_input=false) at utilities.cc:224
#11 0x080c27cf in signal_handler (signal_number=24) at utilities.cc:154
#12 <signal handler called>
#13 0x08c4c927 in _int_malloc ()
#14 0x08c4d999 in malloc ()
#15 0x08c0de7a in operator new(unsigned int) ()
#16 0x08076813 in allocate (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/ext/new_allocator.h:89
#17 _M_get_node (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/stl_list.h:316
#18 _M_create_node (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/stl_list.h:461
#19 insert (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/list.tcc:100
#20 add_empty_block (this=0x21a04460, block_X=...) at equivalence_relation.cc:64
#21 EquivalenceRelation::refine (this=0x21a04460, block_X=...) at
equivalence_relation.cc:124
#22 0x08076a8f in EquivalenceRelation::refine (this=0x21a04460, other=...) at
equivalence_relation.cc:74
#23 0x080e206a in LabelReducer::compute_outside_equivalence (this=0xb4237f8,
ts_index=42, all_transition_systems=..., labels=...,
local_equivalence_relations=...) at merge_and_shrink/label_reducer.cc:190
#24 0x080e25dc in LabelReducer::reduce_labels (this=0xb4237f8, next_merge=...,
all_transition_systems=..., labels=...) at merge_and_shrink/label_reducer.cc:106
#25 0x080e1a6f in Labels::reduce (this=0xb438b48, next_merge=...,
all_transition_systems=...) at merge_and_shrink/labels.cc:34
#26 0x080e5d96 in MergeAndShrinkHeuristic::build_transition_system
(this=0xb3fe410) at merge_and_shrink/merge_and_shrink_heuristic.cc:86
#27 0x080e64fc in MergeAndShrinkHeuristic::initialize (this=0xb3fe410) at
merge_and_shrink/merge_and_shrink_heuristic.cc:175
#28 0x0807fa50 in Heuristic::evaluate (this=0xb3fe410, state=...) at heuristic.cc:30
#29 0x080573ef in EagerSearch::initialize (this=0xb42d758) at eager_search.cc:74
#30 0x080ba546 in SearchEngine::search (this=0xb42d758) at search_engine.cc:50
#31 0x080483b7 in main (argc=2, argv=0x0) at planner.cc:44
|
msg3797 (view) |
Author: malte |
Date: 2014-10-11.22:21:43 |
|
> Is it possible to terminate with a signal?
I asked myself the same question last week. At least in Python, I was able to
use exit with an appropriate exit code to fool "echo $?", but I'm not sure if
that's really the same thing as terminating with a signal, i.e., if it would
look the same to a parent that looks more carefully at the process structure. In
any case, as I wrote later on, I'd suggest following the gdb route for now,
verifying if indeed we're stuck somewhere inside malloc, and if so, try to
rewrite the functions called by the signal handler to avoid dynamic memory.
As a bonus, this might enable us to get rid of the memory reserve we currently use.
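To illustrate what avoiding dynamic memory could look like: a sketch that reads
/proc/self/status using only the async-signal-safe calls open/read/close and a
fixed-size buffer (not the code currently in the repository):

    #include <fcntl.h>
    #include <unistd.h>

    // Read /proc/self/status into a caller-provided buffer without touching
    // malloc, so this stays safe inside a signal handler.
    static int read_proc_self_status(char *buffer, int buffer_size) {
        int fd = open("/proc/self/status", O_RDONLY);
        if (fd == -1)
            return -1;
        int total = 0;
        while (total < buffer_size - 1) {
            ssize_t bytes_read = read(fd, buffer + total, buffer_size - 1 - total);
            if (bytes_read <= 0)
                break;
            total += bytes_read;
        }
        buffer[total] = '\0';
        close(fd);
        return total;
    }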
|
msg3796 (view) |
Author: florian |
Date: 2014-10-11.22:06:45 |
|
> > Without it, the "raise(signal_number)" at the end of our handler would turn
> > the handler into an endless loop.
>
> Right, if we change this, we would need to terminate instead of reraising the
> signal.
Is it possible to terminate with a signal? I read that exit(128+signal) does not
exit the same way the default signal handler exits. The GNU page I linked before
recommends re-raising the signal, but that part of the documentation still uses
signal(), while other parts of the same documentation recommend sigaction():
https://www.gnu.org/software/libc/manual/html_node/Termination-in-Handler.html#Termination-in-Handler
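For reference, the termination pattern that page recommends looks roughly like
this (sketch):

    #include <csignal>

    extern "C" void signal_handler(int signal_number) {
        // ... print statistics / do cleanup first ...
        // Restore the default action and re-raise, so the process really
        // terminates "by the signal" as far as the parent can tell.
        signal(signal_number, SIG_DFL);
        raise(signal_number);
    }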
> But this may indeed be a source of the problem.
Yes, I think so too. Even if this is not the source of the problem, it could
definitely lead to problems like this, and we should fix it. The magic keyword
for the Google search was "reentrant functions". A function is reentrant
if it can be interrupted and restarted at any time. In particular:
"On most systems, malloc and free are not reentrant."
https://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
There is only a very limited set of reentrant functions and all other functions
should not be used in signal handlers. People have reimplemented reentrant
versions of very basic functions (like number-to-string conversion) for their
signal handlers:
http://www.ibm.com/developerworks/library/l-reent/
http://phajdan-jr.blogspot.de/2013/01/signal-handler-safety-re-entering-malloc.html
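As an example of the kind of helper those pages describe, a hand-rolled
number-to-string conversion on top of write() might look roughly like this
(sketch only, nothing like this is in our code yet):

    #include <unistd.h>

    // Print a non-negative integer followed by a newline using only write(),
    // which is on the async-signal-safe list, instead of cout or printf.
    static void write_int_reentrant(int value) {
        char buffer[32];
        int pos = sizeof(buffer);
        buffer[--pos] = '\n';
        do {
            buffer[--pos] = '0' + value % 10;
            value /= 10;
        } while (value > 0);
        write(STDOUT_FILENO, buffer + pos, sizeof(buffer) - pos);
    }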
|
msg3795 (view) |
Author: malte |
Date: 2014-10-11.21:14:47 |
|
> I think SA_RESETHAND simulates what we did before (run the handler the first
> time the signal is caught and then reset to the default signal handler).
Right, but I'm not sure this is a good idea.
> Without it, the "raise(signal_number)" at the end of our handler would turn
> the handler into an endless loop.
Right, if we change this, we would need to terminate instead of reraising the
signal.
> I don't know how to do 3. or 4. with C and lab respectively.
For 3., this is tricky indeed. I can only think of doing things with the Unix
functions (open/read/close) we're using for debugging.
But this may indeed be a source of the problem. See for example here:
http://stackoverflow.com/questions/3366307/why-is-malloc-not-async-signal-safe
http://edc.tversu.ru/elib/inf/0088/0596003943_secureprgckbk-chp-13-sect-5.html
https://access.redhat.com/solutions/48701
For 4., we should write "strace" before the planner call in the downward script.
We should add an option to block "times" or merge the corresponding issue first,
though; otherwise we'll kill the grid disk with gigs of output from each planner
run.
Reading the links for 3. makes me think that we should try attaching gdb to a
failed run. For this we need a debug build though, so it would be good if we
could reproduce the error with a debug build.
Actually, I think that's the best option for now, and I'm not sure it's
worth trying the other things at this point. So the suggestion would be to run a
debug experiment, see if we can reproduce the error, then attach to the running
process and test if the stack looks similar to
https://access.redhat.com/solutions/48701.
|
msg3794 (view) |
Author: florian |
Date: 2014-10-11.20:57:09 |
|
> 1. I'm not sure if SA_RESETHAND is really what we want. It might be worth trying
> things out without it.
I think SA_RESETHAND simulates what we did before (run the handler the first
time the signal is caught and then reset to the default signal handler). Without
it, the "raise(signal_number)" at the end of our handler would turn the handler
into an endless loop.
> 2. I'd be interested to see if any signal handling functions are already
> installed at the start of our program, and if yes, for which signals.
I'll add some output for this and switch to block all the signals we currently
handle.
I don't know how to do 3. or 4. with C and lab respectively.
|
msg3793 (view) |
Author: malte |
Date: 2014-10-11.20:18:30 |
|
It should be sufficient to block the signals we are otherwise handling.
Everything else will either be ignored or lead to immediate termination, which
should both be appropriate.
Various thoughts after reading the code and some more documentation:
1. I'm not sure if SA_RESETHAND is really what we want. It might be worth trying
things out without it.
2. I'd be interested to see if any signal handling functions are already
installed at the start of our program, and if yes, for which signals.
3. We might run into trouble if we're handling a signal while inside a memory
management function (malloc/free) and then do stuff that takes us back into such
a function. I'm not sure whether our current code really uses dynamic memory
anywhere, though. Perhaps inside the ifstream, and in that case it might be
worth doing things with lower-level (C or even Unix) functions here.
4. It might be worth running our code inside strace to see more about what is
going on.
|
msg3792 (view) |
Author: florian |
Date: 2014-10-11.20:13:38 |
|
We could block all signals with sigfillset(), or we could use sigaddset() to
block just the signals we would otherwise catch with our handler. What do you prefer?
https://www.gnu.org/software/libc/manual/html_node/Blocking-Signals.html#Blocking-Signals
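To make the two options concrete (sketch; the exact signal list would be
whatever we end up handling):

    #include <signal.h>

    static void install_handler(void (*handler)(int)) {
        struct sigaction action;
        action.sa_handler = handler;
        action.sa_flags = 0;

        // Option 1: block all signals while our handler runs.
        sigfillset(&action.sa_mask);

        // Option 2: block only the signals we handle ourselves.
        // sigemptyset(&action.sa_mask);
        // sigaddset(&action.sa_mask, SIGXCPU);
        // sigaddset(&action.sa_mask, SIGTERM);
        // sigaddset(&action.sa_mask, SIGINT);

        sigaction(SIGXCPU, &action, nullptr);
        sigaction(SIGTERM, &action, nullptr);
        sigaction(SIGINT, &action, nullptr);
    }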
|
msg3791 (view) |
Author: florian |
Date: 2014-10-11.20:01:39 |
|
https://bitbucket.org/flogo/downward-issues-issue479/branch/issue479-experimental-sigaction#chg-src/search/utilities.cc
|
msg3790 (view) |
Author: malte |
Date: 2014-10-11.20:00:24 |
|
Where can I find our code that uses sigaction?
|
msg3789 (view) |
Author: malte |
Date: 2014-10-11.19:59:38 |
|
> if reading the status file fails with a signal (is this possible?) we would
> not receive the signal.
Our reaction to signals is to print memory usage and terminate. If we're already
doing that, there's not really any point in recursively attempting to do the
same thing again, I think. (Our previous code was designed to ignore
interrupting signals anyway.)
|
msg3788 (view) |
Author: florian |
Date: 2014-10-11.19:59:14 |
|
Here is the relevant part of the documentation of sigaction:
sa_mask specifies a mask of signals which should be blocked (i.e., added to the
signal mask of the thread in which the signal handler is invoked) during
execution of the signal handler. In addition, the signal which triggered the
handler will be blocked, unless the SA_NODEFER flag is used.
http://linux.die.net/man/2/sigaction
|
msg3787 (view) |
Author: florian |
Date: 2014-10-11.19:56:54 |
|
The third experiment we ran used the new signal handlers. Their default
behaviour is to block the signal type that is currently being handled. This is
one of the new features of the new signal handlers and was not possible in a
thread-safe way before. Unfortunately, blocking the signal during the handler
execution did not help (msg3734).
We could block more signals instead of just the one currently handled. I thought
about doing this but wasn't sure if this could lead to a new problem, e.g. if
reading the status file fails with a signal (is this possible?) we would not
receive the signal.
|
msg3784 (view) |
Author: malte |
Date: 2014-10-11.19:37:22 |
|
That's an interesting idea. But I think that concurrent reads should normally be
possible, and I couldn't find a trace in the lsof output of someone trying to
read the file. Moreover, even if concurrent reads weren't possible, I don't
think they should cause a dead-lock: one of the two attempts should win, and the
other one should either fail or have to wait for the other one to complete, but
not remain blocked indefinitely.
I wonder if the suggestion that reading the /proc/self/status file is the
problem is a red herring. After all, we read it successfully in many other
contexts. In the case I looked at, the CPU time was very close to 30:01:00, and
from how I understand SIGXCPU, we should get a signal at 30:00:00 and another
one at 30:01:00. Maybe it's the second signal that is causing the problem?
Perhaps we can try something that ensures that after the first signal,
subsequent ones will be blocked?
This shouldn't be a problem anyway with our signal handlers given that our
eventual response to all signals that we catch is to quit.
|
msg3783 (view) |
Author: florian |
Date: 2014-10-11.19:31:37 |
|
We read the status file during the normal execution of the planner. Could it
cause a dead-lock if the planner gets SIGXCPU while it is waiting for the file
and then tries to access it a second time in the signal handler?
|
msg3782 (view) |
Author: jendrik |
Date: 2014-10-11.18:42:35 |
|
Yes, the hard limit is 5 seconds higher
than the soft limit.
|
msg3773 (view) |
Author: malte |
Date: 2014-10-11.12:58:42 |
|
This is odd -- we're not in a low-memory situation, so there's no reason why the
open() of the status file should cause any kind of trouble. Also, why does this
kind of error *only* show up while we're handling SIGXCPU? I checked the grid
documentation, and it does suggest catching SIGXCPU for graceful shutdown.
Maybe it's some kind of temporary grid error or other interaction with the
grid's job shepherding. The hard limit is substantially higher than the soft
limit, right? And we haven't set a grid time limit ourselves. So that shouldn't
cause trouble.
Another possibility is some kind of dead-lock with multiple processes trying to
access the status file and waiting for each other for some reason. But that
doesn't really sound likely, and I tried lsof to check for something like this
and found no trace.
Silvan, can you log in to ase12 and try:
strace -p 121217
? (But I fear you have to be root to do this, even when it's your own process.)
Maybe the best we can do at the moment is to switch to the more modern signal
handling code and add something that makes sure we don't block in this part of
the code forever. The "alarm" system call would be an option.
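For example, something like this at the top of the handler would guarantee we
cannot block forever (sketch; the timeout value is arbitrary):

    #include <unistd.h>

    extern "C" void signal_handler(int signal_number) {
        // If we get stuck below (e.g. while opening the status file), the
        // default action for SIGALRM terminates the process after 10 seconds.
        alarm(10);
        // ... print peak memory, then exit or re-raise signal_number ...
    }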
|
msg3772 (view) |
Author: florian |
Date: 2014-10-11.10:51:58 |
|
I created a lab issue for the randomization, so we don't hijack this one:
https://bitbucket.org/jendrikseipp/lab/issue/16
|
msg3771 (view) |
Author: jendrik |
Date: 2014-10-11.01:41:42 |
|
No, I don't have an explanation for this. This is the first time I hear about the
/tmp/sge_spool folder.
|
msg3769 (view) |
Author: florian |
Date: 2014-10-11.00:30:44 |
|
We partly figured out why the randomization was off. The grid engine does not
actually start the run script in the run directory but a local copy in its
/tmp/sge_spool folder. This is not really a copy, but a recreation of the run
file, and recreating it also shuffles the runs again. I'm not sure yet who
creates the second version of the run script. Jendrik, do you have an explanation?
On the actual issue, we are out of ideas to test right now. Any suggestions? The
hanging tasks are still running. If anyone wants to have a look, their job
numbers are 2167176 and 2167406.
|
msg3739 (view) |
Author: silvan |
Date: 2014-10-09.10:07:13 |
|
You're right, I totally overlooked the declaration of the file stream.
|
msg3738 (view) |
Author: malte |
Date: 2014-10-09.09:19:46 |
|
Maybe you're looking at a base class? Including ios_base::out would be an odd
default because ifstream doesn't have methods for writing; the "i" in the class
name stands for "input".
Here's documentation showing a default value of ios_base::in:
http://www.cplusplus.com/reference/fstream/ifstream/open/
|
msg3737 (view) |
Author: silvan |
Date: 2014-10-09.09:13:57 |
|
According to the documentation of
open, the default is both ios_base::in
and ios_base::out. I thought that
would be reading and writing.
|
msg3736 (view) |
Author: malte |
Date: 2014-10-09.08:49:20 |
|
> When opening /proc/self/status with read-only access, only 1 task rather
> than 4 remains in the error state.
Doesn't our original code only open it for reading, too? I don't see what else
it could do without causing an error every time.
|
msg3735 (view) |
Author: florian |
Date: 2014-10-08.23:10:45 |
|
I don't think the number of failures tells us anything. This bug seems to occur
randomly with a small chance of happening anyway.
Too bad that sigaction() did not help. I still suggest we switch from signal()
to sigaction() after we figure out this issue. signal() is deprecated and with
sigaction() we can avoid the special case for deferring a second signal.
|
msg3734 (view) |
Author: silvan |
Date: 2014-10-08.22:57:58 |
|
The same happens when using sigaction instead of signal. I can have a more
detailed look at the logs tomorrow, but I don't think that the picture changed.
|
msg3731 (view) |
Author: silvan |
Date: 2014-10-08.21:32:03 |
|
When opening /proc/self/status with read-only access, only 1 task rather than
4 remains in the error state.
|
msg3719 (view) |
Author: silvan |
Date: 2014-10-08.13:07:21 |
|
I attached a file. But this is not the debug output, is it?
Our debug output says:
reg
in signal_handler
signal_handler 1
signal_handler 2
signal_handler 3
in print_peak_memory
in get_peak_memory_in_kb
get_peak_memory_in_kb 1
get_peak_memory_in_kb 2
get_peak_memory_in_kb 3
get_peak_memory_in_kb 4
|
msg3718 (view) |
Author: malte |
Date: 2014-10-08.13:04:44 |
|
That's the output I'm interested in, but not just the last line. In particular,
I'm wondering if more than one signal is involved. (If not, it's hard to
understand why the problem occurs some of the time, but not all of the time.)
Can you post the complete output of /proc/.../status somewhere?
|
msg3716 (view) |
Author: silvan |
Date: 2014-10-08.13:01:53 |
|
What do you mean by debug output?
We only added print messages to know where in the code we are (see below).
|
msg3715 (view) |
Author: malte |
Date: 2014-10-08.12:59:49 |
|
What does the debug output say in the failing cases?
|
msg3712 (view) |
Author: silvan |
Date: 2014-10-08.11:51:08 |
|
I had a look at one of the process id files (I can open them), and it says
(among lots of other text) status: sleeping.
|
msg3711 (view) |
Author: malte |
Date: 2014-10-08.11:50:49 |
|
Regarding signal/sigaction: signals are unsafe in the sense that they are
subject to race conditions, but in our scenario this should not be an
insurmountable obstacle, given that we don't want to do much more than terminate
the program in a controlled way. Using sigaction in place of signal sounds like
a good idea and might help fix this issue. It would be good if we could also
find out what is actually going on, though.
|
msg3709 (view) |
Author: malte |
Date: 2014-10-08.11:46:51 |
|
> The randomization is stored in the file "run" in an array that maps task ids to
> run dirs. I expected index 24 (+- 1) of this array to be one of the offending
> run dirs but it isn't.
This is indeed odd. It definitely used to work that way.
|
msg3708 (view) |
Author: jendrik |
Date: 2014-10-08.11:34:57 |
|
Well, it would be nice if it worked the way you thought it works, since you
implemented the randomization ;)
|
msg3707 (view) |
Author: florian |
Date: 2014-10-08.11:30:03 |
|
The randomization is stored in the file "run" in an array that maps task ids to
run dirs. I expected index 24 (+- 1) of this array to be one of the offending
run dirs but it isn't. Apparently, this works differently than I thought.
You should check the file "/proc/<pid>/status" instead of "/proc/self/status"
(self is the calling process). You can also look at "cat /proc/<pid>/cmdline" to
show the command line or at "ls -la /proc/<pid>/cwd" to show the working dir of
a process. The latter should help to identify the correct process on the node.
|
msg3706 (view) |
Author: silvan |
Date: 2014-10-08.11:16:32 |
|
Ok, apparently, the files written by the experiments did not automatically get
group access rights, which is strange, because my entire maia home directory
should be accessible to you. I've added the rights manually now for all the run
dirs.
Why are these unexpected? According to Jendrik, this is how the randomization
works: the runs are still sorted and numbered according to domain:problem, but
are then assigned task ids randomly to ensure random execution.
I don't know how to determine the correct process id. On the four nodes, there
are also other downward-release processes of my other experiment running. If I
know the process ID, how would I open the "self/status" file?
|
msg3704 (view) |
Author: florian |
Date: 2014-10-08.10:28:57 |
|
The log files do not have group access rights; could you add them?
Also, can you open the status file for the correct process id manually?
For reference, the run_dirs are:
runs-00001-00100/00081
runs-00001-00100/00085
runs-00001-00100/00090
runs-00001-00100/00094
Side note: the numbers are kind of unexpected; the task ids of the hanging tasks
are 24, 29, 41 and 93. We do randomization, but I could not match these numbers
to 81, 85, 90 and 94 in the run script.
|
msg3703 (view) |
Author: silvan |
Date: 2014-10-08.10:02:33 |
|
Currently, I have an experiment with four jobs "hanging" in the queue without
doing anything. I had a look at the log file we added, and it seems that the
process never gets past the following line of code:
procfile.open("/proc/self/status"); (utilities.cc::170 in
https://bitbucket.org/flogo/downward-issues-issue479/src/010442ba16c51046fbc94d274bac10515f91cf55/src/search/utilities.cc?at=issue479-experimental)
I.e., the output in our logfile stops after: get_peak_memory_in_kb 4
You can find the logs under
/infai/sieverss/repos/downward/issue479/experiments/issue479/data/issue479-issue479/.
Looking for empty "run.err" files (wc -l runs-00001-00100/*/run.err), you get
exactly the runs which didn't complete. (I checked all four files and they all
look the same.)
|
msg3701 (view) |
Author: florian |
Date: 2014-10-08.00:34:59 |
|
Reading up on this, I found the following:
http://en.wikipedia.org/wiki/Sigaction#Replacement_of_deprecated_signal.28.29
Could this be the problem we are seeing here? Maybe we can fix it by switching
from signal() to sigaction().
|
msg3688 (view) |
Author: florian |
Date: 2014-10-07.11:39:21 |
|
Silvan and I started looking into this.
|
msg3632 (view) |
Author: malte |
Date: 2014-10-04.19:31:31 |
|
Anyone willing to look into this one? If we don't fix it, we can't get memory
statistics when running out of time, which would really be a pity.
|
msg3557 (view) |
Author: jendrik |
Date: 2014-09-27.01:21:25 |
|
When fixing this we should ensure that the fix doesn't break the Windows build.
Patrick can probably test this.
|
msg3555 (view) |
Author: malte |
Date: 2014-09-26.16:36:27 |
|
I don't know enough about signal handling, but I wonder if the underlying
problem is that our signal handler is called, but doesn't manage to properly
terminate the process because of some interaction with signal handling of the
grid queue. Or can we also reproduce the SIGXCPU problem locally? (It worked for
me, but I didn't try many different tasks.)
As a first step, I suggest we add more output to signal_handler(). To test
things, I suggest adding output after every single line of the function,
including inside the if block. This means we might get odd interference patterns
in the output if the signal handler is invoked while it is running, but I think
this will still probably give us the maximum amount of information to continue with.
I wonder if the stuff we do to prevent recursive invocations is perhaps shooting
us in the foot in cases where we get a non-deadly signal. (I don't know if
SIGXCPU kills a process by default.)
|
msg3554 (view) |
Author: silvan |
Date: 2014-09-26.16:26:12 |
|
With changeset da23dc33eeba, Fast Downward catches the SIGXCPU signal. This
causes the following problem in many experiments I tried with that revision
(and newer ones): Some tasks of a grid job run "forever" (I deleted them 12
hours after they surpassed the time limit), i.e. they are not correctly
terminated after the given time limit. They don't appear in top, but in pstree,
and the nodes of the cluster are still occupied by the job. The failures are
not reproducible: it is not always the same task/config pair that cannot be
terminated correctly. The problem even arises when using only one domain like
airport (50 tasks), but not when using the single task that caused the problem
in a previous experiment.
For the moment, I fixed this problem by commenting out the change from that
revision.
A quick side remark: generally, I sometimes have problems killing the
planner with Ctrl+C (also confirmed by Malte), which could also be due to the
handling of signals in Fast Downward.
|