msg4656 (view) |
Author: florian |
Date: 2015-10-15.18:05:35 |
|
Thank you! I just merged.
|
msg4655 (view) |
Author: jendrik |
Date: 2015-10-13.20:17:39 |
|
I also had a look at the code and believe it's fine.
|
msg4654 (view) |
Author: malte |
Date: 2015-10-13.17:34:40 |
|
Thanks, Florian! I looked over the code very briefly. It would be nice if
someone else could double-check, but I wouldn't consider it essential.
|
msg4653 (view) |
Author: florian |
Date: 2015-10-13.17:24:52 |
|
Now that issue67 is merged, we could test this on Windows. I merged the current
default branch and updated the pull request. The experiments still seem fine: no
hanging processes and the error codes are reported correctly. The build also
works on Windows, but some features are not available there: SIGXCPU does not
exist, and we have to use signal() instead of sigaction(), which can cause
strange behaviour when too many signals are received in a short time.
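For illustration, the kind of platform split this implies could look roughly
like the sketch below (register_signal_handlers and the exact list of signals
are placeholders, not the code from the pull request):

    #include <csignal>     // signal()
    #include <cstdlib>     // std::_Exit()
    #if !defined(_WIN32)
    #include <signal.h>    // sigaction(), sigemptyset()
    #endif

    extern "C" void signal_handler(int signal_number) {
        // Placeholder: the real handler prints peak memory, then exits/re-raises.
        std::_Exit(128 + signal_number);
    }

    static void register_signal_handlers() {
    #if defined(_WIN32)
        // Windows has neither sigaction() nor SIGXCPU, so fall back to signal().
        signal(SIGABRT, signal_handler);
        signal(SIGTERM, signal_handler);
        signal(SIGSEGV, signal_handler);
        signal(SIGINT, signal_handler);
    #else
        struct sigaction action;
        action.sa_handler = signal_handler;
        sigemptyset(&action.sa_mask);
        action.sa_flags = 0;
        sigaction(SIGABRT, &action, nullptr);
        sigaction(SIGTERM, &action, nullptr);
        sigaction(SIGSEGV, &action, nullptr);
        sigaction(SIGINT, &action, nullptr);
        sigaction(SIGXCPU, &action, nullptr);
    #endif
    }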
There seems to be no way around this on Windows, so I suggest we leave it like
this. Can someone do a review? The merge was a bit complicated, so the code
should be reviewed again.
Pull request:
https://bitbucket.org/flogo/downward-issues-issue479/pull-requests/1
Reports:
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479-5min.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479.html
|
msg3851 (view) |
Author: malte |
Date: 2014-10-20.15:37:13 |
|
I don't think we need larger experiments.
|
msg3849 (view) |
Author: florian |
Date: 2014-10-20.15:14:02 |
|
No more hanging processes in an experiment with our previous setting (airport,
blind and M&S, release compiled). The same experiment with a time limit of 5
minutes also did not have any hanging processes. Should we run larger
experiments now? If so, which configs?
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479-5min.html
http://ai.cs.unibas.ch/_tmp_files/pommeren/issue479-issue479.html
|
msg3819 (view) |
Author: malte |
Date: 2014-10-13.22:04:55 |
|
There are many ways to refactor the code to avoid this, but there is also no
problem with x.cc depending on y.h and y.cc depending on x.h. I'm sure we have
hundreds of examples of this in the current codebase.
|
msg3818 (view) |
Author: florian |
Date: 2014-10-13.21:57:39 |
|
How about the non-reentrant version of print_peak_memory? It should stay in
utilities, but the reentrant version calls it on systems without a reentrant
implementation. Wouldn't this cause a cyclic dependency?
|
msg3817 (view) |
Author: malte |
Date: 2014-10-13.21:47:53 |
|
The division into files should generally be governed by the interface, i.e., by
the users. ABORT() is a generally useful facility and should remain in
utilities.h, I would say. The other low-level system code could go into a
separate file that is then only used from a high-level function in
utilities.{h,cc}.
That is, utilities.{h,cc} would contain the interface for the rest of the
planner (ABORT, exit_with, some function for registering exit functions, out of
memory and signal handlers), and everything that would only be used as tools to
implement these could go into a separate file.
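A rough sketch of that split, with purely illustrative file and function names
(the real exit codes and ABORT are more elaborate), might look like this:

    // utilities.h (sketch): the interface used by the rest of the planner.
    #include <cstdlib>
    #include <iostream>

    enum ExitCode {
        EXIT_PLAN_FOUND,
        EXIT_CRITICAL_ERROR,
        EXIT_OUT_OF_MEMORY
        // ... remaining exit codes ...
    };

    #define ABORT(msg)                                              \
        do {                                                        \
            std::cerr << "Critical error: " << (msg) << std::endl;  \
            std::abort();                                           \
        } while (false)

    void exit_with(ExitCode exit_code);
    void register_event_handlers();

    // system_tools.h (sketch): low-level helpers used only to implement
    // the functions above (reentrant printing, the actual handlers, ...).
    void write_reentrant(const char *message);
    int get_peak_memory_in_kb_reentrant();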
|
msg3816 (view) |
Author: florian |
Date: 2014-10-13.21:43:03 |
|
Thanks for the code review. We discussed offline that we could move all
reentrant code into a new file. I was wondering if the functions signal_handler,
out_of_memory_handler, exit_handler and exit_with should also go there since
they are reentrant, too. We could even move everything related to exiting the
planner into a new file. In addition to the list above this would be
register_event_handlers, the ABORT macro and the definition of exit codes.
|
msg3808 (view) |
Author: florian |
Date: 2014-10-13.14:56:55 |
|
Yes, sorry. I ran this locally, hit ctrl+c and got the following output:
Peak memory: eak: KB
|
msg3807 (view) |
Author: malte |
Date: 2014-10-13.14:34:43 |
|
Can you be more specific w.r.t. "which is still buggy"?
|
msg3806 (view) |
Author: florian |
Date: 2014-10-13.14:32:13 |
|
I started with an implementation which is still buggy. Can one of you have a look?
https://bitbucket.org/flogo/downward-issues-issue479/pull-request/1
|
msg3802 (view) |
Author: malte |
Date: 2014-10-12.20:32:01 |
|
> If I understand this correctly, we should make the definition of signal handler
> "extern "C" void signal_handler(int)".
This may be safer, but I'd be very surprised if that really made a difference
here. (So: feel free to change it, but I don't think we need to.)
> For cout and fprintf one explanation was that they modify global data (i.e. the
> content of stdout), but this would also be a problem for write().
write() is a system call and is on the list of guaranteed reentrant functions
that you linked. (The problem with iostreams and fprintf is that they can do
buffering on top of the system calls that perform the actual I/O.)
Generally speaking, for library functions such as atoi that have no good reason
to require static data, I'd assume they are reentrant unless we see problems.
|
msg3801 (view) |
Author: florian |
Date: 2014-10-12.20:00:58 |
|
Re-entrancy is interesting stuff, but it's a bit hard to find definite
information about it. Here are some things I found out:
"Signal handlers are expected to have C linkage and, in general, only use the
features from the common subset of C and C++. It is implementation-defined if a
function with C++ linkage can be used as a signal handler."
(http://en.cppreference.com/w/cpp/utility/program/signal)
If I understand this correctly, we should make the definition of signal handler
"extern "C" void signal_handler(int)".
The list of guaranteed re-entrant functions only includes system calls:
https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers#SIG30-C.Callonlyasynchronous-safefunctionswithinsignalhandlers-Asynchronous-Signal-SafeFunctions
For malloc, cout and fprintf, most sources agree that they are not re-entrant,
although the reasons for this vary from one explanation to the next. For
malloc, we saw that the reason is locking. For cout and fprintf, one explanation
was that they modify global data (i.e. the content of stdout), but this would
also be a problem for write(). Another page claimed that all functions that
operate on FILE* variables are supposed to use flockfile(). If this is the case,
we might have to unlock stdout before writing to it in the signal handler.
http://stackoverflow.com/questions/467938/stdout-thread-safe-in-c-on-linux
http://stackoverflow.com/questions/3941271/why-are-malloc-and-printf-said-as-non-reentrant#3941499
Some string manipulation functions (atoi, strlen, sprintf, ...) could be
re-entrant, but this apparently depends on the implementation, so it is not clear
if we can use any of them. For example, the following page lists them as
re-entrant, but it is for a different compiler:
http://www.mikecramer.com/qnx/qnx_4.25_docs/watcom/clibref/creentrant_fns.html
As far as I know, the C standard does not require any function to be re-entrant
except the system calls mentioned above.
|
msg3800 (view) |
Author: silvan |
Date: 2014-10-12.11:47:43 |
|
Sure, sorry for not replying earlier. (I am still following your discussion.)
|
msg3799 (view) |
Author: florian |
Date: 2014-10-12.11:13:12 |
|
(Silvan, I'm stealing this issue from you)
The debug experiment confirmed our suspicion:
#0 0x55555430 in __kernel_vsyscall ()
#1 0x08c25202 in __lll_lock_wait_private ()
#2 0x08c4fab2 in _L_lock_9520 ()
#3 0x08c4d990 in malloc ()
#4 0x08c3f135 in __fopen_internal ()
#5 0x08c405c0 in fopen64 ()
#6 0x08c0c727 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode,
int) ()
#7 0x08bc35cb in std::basic_filebuf<char, std::char_traits<char> >::open(char
const*, std::_Ios_Openmode) ()
#8 0x080c1c64 in open (use_buffered_input=false) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/fstream:527
#9 get_peak_memory_in_kb (use_buffered_input=false) at utilities.cc:189
#10 0x080c2610 in print_peak_memory (use_buffered_input=false) at utilities.cc:224
#11 0x080c27cf in signal_handler (signal_number=24) at utilities.cc:154
#12 <signal handler called>
#13 0x08c4c927 in _int_malloc ()
#14 0x08c4d999 in malloc ()
#15 0x08c0de7a in operator new(unsigned int) ()
#16 0x08076813 in allocate (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/ext/new_allocator.h:89
#17 _M_get_node (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/stl_list.h:316
#18 _M_create_node (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/stl_list.h:461
#19 insert (this=0x21a04460, block_X=...) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/list.tcc:100
#20 add_empty_block (this=0x21a04460, block_X=...) at equivalence_relation.cc:64
#21 EquivalenceRelation::refine (this=0x21a04460, block_X=...) at
equivalence_relation.cc:124
#22 0x08076a8f in EquivalenceRelation::refine (this=0x21a04460, other=...) at
equivalence_relation.cc:74
#23 0x080e206a in LabelReducer::compute_outside_equivalence (this=0xb4237f8,
ts_index=42, all_transition_systems=..., labels=...,
local_equivalence_relations=...) at merge_and_shrink/label_reducer.cc:190
#24 0x080e25dc in LabelReducer::reduce_labels (this=0xb4237f8, next_merge=...,
all_transition_systems=..., labels=...) at merge_and_shrink/label_reducer.cc:106
#25 0x080e1a6f in Labels::reduce (this=0xb438b48, next_merge=...,
all_transition_systems=...) at merge_and_shrink/labels.cc:34
#26 0x080e5d96 in MergeAndShrinkHeuristic::build_transition_system
(this=0xb3fe410) at merge_and_shrink/merge_and_shrink_heuristic.cc:86
#27 0x080e64fc in MergeAndShrinkHeuristic::initialize (this=0xb3fe410) at
merge_and_shrink/merge_and_shrink_heuristic.cc:175
#28 0x0807fa50 in Heuristic::evaluate (this=0xb3fe410, state=...) at heuristic.cc:30
#29 0x080573ef in EagerSearch::initialize (this=0xb42d758) at eager_search.cc:74
#30 0x080ba546 in SearchEngine::search (this=0xb42d758) at search_engine.cc:50
#31 0x080483b7 in main (argc=2, argv=0x0) at planner.cc:44
|
msg3797 (view) |
Author: malte |
Date: 2014-10-11.22:21:43 |
|
> Is it possible to terminate with a signal?
I asked myself the same question last week. At least in Python, I was able to
use exit with an appropriate exit code to fool "echo $?", but I'm not sure if
that's really the same thing as terminating with a signal, i.e., if it would
look the same to a parent that looks more carefully at the process structure. In
any case, as I wrote later on, I'd suggest following the gdb route for now,
verifying if indeed we're stuck somewhere inside malloc, and if so, try to
rewrite the functions called by the signal handler to avoid dynamic memory.
As a bonus, this might enable us to get rid of the memory reserve we currently use.
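To illustrate what avoiding dynamic memory could look like: a sketch that reads
/proc/self/status using only the async-signal-safe calls open/read/close and a
fixed-size buffer (not the code currently in the repository):

    #include <fcntl.h>
    #include <unistd.h>

    // Read /proc/self/status into a caller-provided buffer without touching
    // malloc, so this stays safe inside a signal handler.
    static int read_proc_self_status(char *buffer, int buffer_size) {
        int fd = open("/proc/self/status", O_RDONLY);
        if (fd == -1)
            return -1;
        int total = 0;
        while (total < buffer_size - 1) {
            ssize_t bytes_read = read(fd, buffer + total, buffer_size - 1 - total);
            if (bytes_read <= 0)
                break;
            total += bytes_read;
        }
        buffer[total] = '\0';
        close(fd);
        return total;
    }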
|
msg3796 (view) |
Author: florian |
Date: 2014-10-11.22:06:45 |
|
> > Without it, the "raise(signal_number)" at the end of our handler would turn
> > the handler into an endless loop.
>
> Right, if we change this, we would need to terminate instead of reraising the
> signal.
Is it possible to terminate with a signal? I read that exit(128+signal) does not
exit the same way the default signal handler exits. The GNU page I linked before
recommends re-raising the signal, but that part of the documentation still uses
signal(), while other parts of the same documentation recommend sigaction():
https://www.gnu.org/software/libc/manual/html_node/Termination-in-Handler.html#Termination-in-Handler
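For reference, the termination pattern that page recommends looks roughly like
this (sketch):

    #include <csignal>

    extern "C" void signal_handler(int signal_number) {
        // ... print statistics / do cleanup first ...
        // Restore the default action and re-raise, so the process really
        // terminates "by the signal" as far as the parent can tell.
        signal(signal_number, SIG_DFL);
        raise(signal_number);
    }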
> But this may indeed be a source of the problem.
Yes, I think so too. Even if this is not the source of the problem, it could
definitely lead to problems like this, and we should fix it. The magic keyword
for the Google search was "reentrant functions". A function is reentrant
if it can be interrupted and restarted at any time. In particular:
"On most systems, malloc and free are not reentrant."
https://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
There is only a very limited set of reentrant functions and all other functions
should not be used in signal handlers. People have reimplemented reentrant
versions of very basic functions (like number-to-string conversion) for their
signal handlers:
http://www.ibm.com/developerworks/library/l-reent/
http://phajdan-jr.blogspot.de/2013/01/signal-handler-safety-re-entering-malloc.html
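As an example of the kind of helper those pages describe, a hand-rolled
number-to-string conversion on top of write() might look roughly like this
(sketch only, nothing like this is in our code yet):

    #include <unistd.h>

    // Print a non-negative integer followed by a newline using only write(),
    // which is on the async-signal-safe list, instead of cout or printf.
    static void write_int_reentrant(int value) {
        char buffer[32];
        int pos = sizeof(buffer);
        buffer[--pos] = '\n';
        do {
            buffer[--pos] = '0' + value % 10;
            value /= 10;
        } while (value > 0);
        write(STDOUT_FILENO, buffer + pos, sizeof(buffer) - pos);
    }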
|
msg3795 (view) |
Author: malte |
Date: 2014-10-11.21:14:47 |
|
> I think SA_RESETHAND simulates what we did before (run the handler the first
> time the signal is caught and then reset to the default signal handler).
Right, but I'm not sure this is a good idea.
> Without it, the "raise(signal_number)" at the end of our handler would turn
> the handler into an endless loop.
Right, if we change this, we would need to terminate instead of reraising the
signal.
> I don't know how to do 3. or 4. with C and lab respectively.
For 3., this is tricky indeed. I can only think of doing things with the Unix
functions (open/read/close) we're using for debugging.
But this may indeed be a source of the problem. See for example here:
http://stackoverflow.com/questions/3366307/why-is-malloc-not-async-signal-safe
http://edc.tversu.ru/elib/inf/0088/0596003943_secureprgckbk-chp-13-sect-5.html
https://access.redhat.com/solutions/48701
For 4., we should write "strace" before the planner call in the downward script.
We should add an option to block "times" or merge the corresponding issue first,
though; otherwise we'll kill the grid disk with gigs of output from each planner
run.
Reading the links for 3. makes me think that we should try attaching gdb to a
failed run. For this we need a debug build though, so it would be good if we
could reproduce the error with a debug build.
Actually, I think that's the best option for now, and I'm not sure it's
worth trying the other things at this point. So the suggestion would be to run a
debug experiment, see if we can reproduce the error, then attach to the running
process and test if the stack looks similar to
https://access.redhat.com/solutions/48701.
|
msg3794 (view) |
Author: florian |
Date: 2014-10-11.20:57:09 |
|
> 1. I'm not sure if SA_RESETHAND is really what we want. It might be worth trying
> things out without it.
I think SA_RESETHAND simulates what we did before (run the handler the first
time the signal is caught and then reset to the default signal handler). Without
it, the "raise(signal_number)" at the end of our handler would turn the handler
into an endless loop.
> 2. I'd be interested to see if any signal handling functions are already
> installed at the start of our program, and if yes, for which signals.
I'll add some output for this and switch to block all the signals we currently
handle.
I don't know how to do 3. or 4. with C and lab respectively.
|
msg3793 (view) |
Author: malte |
Date: 2014-10-11.20:18:30 |
|
It should be sufficient to block the signals we are otherwise handling.
Everything else will either be ignored or lead to immediate termination, which
should both be appropriate.
Various thoughts after reading the code and some more documentation:
1. I'm not sure if SA_RESETHAND is really what we want. It might be worth trying
things out without it.
2. I'd be interested to see if any signal handling functions are already
installed at the start of our program, and if yes, for which signals.
3. We might run into trouble if we're handling a signal while inside a memory
management function (malloc/free) and then do stuff that takes us back into such
a function. I'm not sure whether our current code really uses dynamic memory
anywhere, though. Perhaps inside the ifstream, and in that case it might be
worth doing things with lower-level (C or even Unix) functions here.
4. It might be worth running our code inside strace to see more about what is
going on.
|
msg3792 (view) |
Author: florian |
Date: 2014-10-11.20:13:38 |
|
We could block all signals with sigfillset(), or we could use sigaddset() to
block just the signals we would otherwise catch with our handler. What do you prefer?
https://www.gnu.org/software/libc/manual/html_node/Blocking-Signals.html#Blocking-Signals
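To make the two options concrete (sketch; the exact signal list would be
whatever we end up handling):

    #include <signal.h>

    static void install_handler(void (*handler)(int)) {
        struct sigaction action;
        action.sa_handler = handler;
        action.sa_flags = 0;

        // Option 1: block all signals while our handler runs.
        sigfillset(&action.sa_mask);

        // Option 2: block only the signals we handle ourselves.
        // sigemptyset(&action.sa_mask);
        // sigaddset(&action.sa_mask, SIGXCPU);
        // sigaddset(&action.sa_mask, SIGTERM);
        // sigaddset(&action.sa_mask, SIGINT);

        sigaction(SIGXCPU, &action, nullptr);
        sigaction(SIGTERM, &action, nullptr);
        sigaction(SIGINT, &action, nullptr);
    }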
|
msg3791 (view) |
Author: florian |
Date: 2014-10-11.20:01:39 |
|
https://bitbucket.org/flogo/downward-issues-issue479/branch/issue479-experimental-sigaction#chg-src/search/utilities.cc
|
msg3790 (view) |
Author: malte |
Date: 2014-10-11.20:00:24 |
|
Where can I find our code that uses sigaction?
|
msg3789 (view) |
Author: malte |
Date: 2014-10-11.19:59:38 |
|
> if reading the status file fails with a signal (is this possible?) we would
> not receive the signal.
Our reaction to signals is to print memory usage and terminate. If we're already
doing that, there's not really any point in recursively attempting to do the
same thing again, I think. (Our previous code was designed to ignore
interrupting signals anyway.)
|
msg3788 (view) |
Author: florian |
Date: 2014-10-11.19:59:14 |
|
Here is the relevant part of the documentation of sigaction:
sa_mask specifies a mask of signals which should be blocked (i.e., added to the
signal mask of the thread in which the signal handler is invoked) during
execution of the signal handler. In addition, the signal which triggered the
handler will be blocked, unless the SA_NODEFER flag is used.
http://linux.die.net/man/2/sigaction
|
msg3787 (view) |
Author: florian |
Date: 2014-10-11.19:56:54 |
|
The third experiment we ran used the new signal handlers. Their default
behaviour is to block the signal type that is currently being handled. This is
one of the new features of the new signal handlers and was not possible in a
thread-safe way before. Unfortunately, blocking the signal during the handler
execution did not help (msg3734).
We could block more signals instead of just the one currently handled. I thought
about doing this but wasn't sure if this could lead to a new problem, e.g. if
reading the status file fails with a signal (is this possible?) we would not
receive the signal.
|
msg3784 (view) |
Author: malte |
Date: 2014-10-11.19:37:22 |
|
That's an interesting idea. But I think that concurrent reads should normally be
possible, and I couldn't find a trace in the lsof output of someone trying to
read the file. Moreover, even if concurrent reads weren't possible, I don't
think they should cause a dead-lock: one of the two attempts should win, and the
other one should either fail or have to wait for the other one to complete, but
not remain blocked indefinitely.
I wonder if the suggestion that reading the /proc/self/status file is the
problem is a red herring. After all, we read it successfully in many other
contexts. In the case I looked at, the CPU time was very close to 30:01:00, and
from how I understand SIGXCPU, we should get a signal at 30:00:00 and another
one at 30:01:00. Maybe it's the second signal that is causing the problem?
Perhaps we can try something that ensures that after the first signal,
subsequent ones will be blocked?
This shouldn't be a problem anyway with our signal handlers given that our
eventual response to all signals that we catch is to quit.
|
msg3783 (view) |
Author: florian |
Date: 2014-10-11.19:31:37 |
|
We read the status file during the normal execution of the planner. Could it
cause a dead-lock if the planner gets SIGXCPU while it is waiting for the file
and then tries to access it a second time in the signal handler?
|
msg3782 (view) |
Author: jendrik |
Date: 2014-10-11.18:42:35 |
|
Yes, the hard limit is 5 seconds higher
than the soft limit.
|
msg3773 (view) |
Author: malte |
Date: 2014-10-11.12:58:42 |
|
This is odd -- we're not in a low-memory situation, so there's no reason why the
open() of the status file should cause any kind of trouble. Also, why does this
kind of error *only* show up while we're handling SIGXCPU? I checked the grid
documentation, and it does suggest catching SIGXCPU for graceful shutdown.
Maybe it's some kind of temporary grid error or other interaction with the
grid's job shepherding. The hard limit is substantially higher than the soft
limit, right? And we haven't set a grid time limit ourselves. So that shouldn't
cause trouble.
Another possibility is some kind of dead-lock with multiple processes trying to
access the status file and waiting for each other for some reason. But that
doesn't really sound likely, and I tried lsof to check for something like this
and found no trace.
Silvan, can you log in to ase12 and try:
strace -p 121217
? (But I fear you have to be root to do this, even when it's your own process.)
Maybe the best we can do at the moment is to switch to the more modern signal
handling code and add something that makes sure we don't block in this part of
the code forever. The "alarm" system call would be an option.
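For example, something like this at the top of the handler would guarantee we
cannot block forever (sketch; the timeout value is arbitrary):

    #include <unistd.h>

    extern "C" void signal_handler(int signal_number) {
        // If we get stuck below (e.g. while opening the status file), the
        // default action for SIGALRM terminates the process after 10 seconds.
        alarm(10);
        // ... print peak memory, then exit or re-raise signal_number ...
    }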
|
msg3772 (view) |
Author: florian |
Date: 2014-10-11.10:51:58 |
|
I created a lab issue for the randomization, so we don't hijack this one:
https://bitbucket.org/jendrikseipp/lab/issue/16
|
msg3771 (view) |
Author: jendrik |
Date: 2014-10-11.01:41:42 |
|
No, I don't have an explanation for this. This is the first time I hear about the
/tmp/sge_spool folder.
|
msg3769 (view) |
Author: florian |
Date: 2014-10-11.00:30:44 |
|
We partly figured out why the randomization was off. The grid engine does not
actually start the run script in the run directory but a local copy in its
/tmp/sge_spool folder. This is not really a copy, but a recreation of the run
file, and recreating it also shuffles the runs again. I'm not sure yet who
creates the second version of the run script. Jendrik, do you have an explanation?
On the actual issue, we are out of ideas to test right now. Any suggestions? The
hanging tasks are still running. If anyone wants to have a look, their job
numbers are 2167176 and 2167406.
|
msg3739 (view) |
Author: silvan |
Date: 2014-10-09.10:07:13 |
|
You're right, I totally overlooked the declaration of the file stream.
|
msg3738 (view) |
Author: malte |
Date: 2014-10-09.09:19:46 |
|
Maybe you're looking at a base class? Including ios_base::out would be an odd
default because ifstream doesn't have methods for writing; the "i" in the class
name stands for "input".
Here's documentation showing a default value of ios_base::in:
http://www.cplusplus.com/reference/fstream/ifstream/open/
|
msg3737 (view) |
Author: silvan |
Date: 2014-10-09.09:13:57 |
|
According to the documentation of
open, the default is both ios_base::in
and ios_base::out. I thought that
would be reading and writing.
|
msg3736 (view) |
Author: malte |
Date: 2014-10-09.08:49:20 |
|
> When opening /proc/self/status with read-only access, only 1 task rather
> than 4 remains in the error state.
Doesn't our original code only open it for reading, too? I don't see what else
it could do without causing an error every time.
|
msg3735 (view) |
Author: florian |
Date: 2014-10-08.23:10:45 |
|
I don't think the number of failures tells us anything. This bug seems to occur
randomly with a small chance of happening anyway.
Too bad that sigaction() did not help. I still suggest we switch from signal()
to sigaction() after we figure out this issue. signal() is deprecated and with
sigaction() we can avoid the special case for deferring a second signal.
|
msg3734 (view) |
Author: silvan |
Date: 2014-10-08.22:57:58 |
|
The same happens when using sigaction instead of signal. I can have a more
detailed look at the logs tomorrow, but I don't think that the picture changed.
|
msg3731 (view) |
Author: silvan |
Date: 2014-10-08.21:32:03 |
|
When opening /proc/self/status with read-only access, only 1 task rather than
4 remains in the error state.
|
msg3719 (view) |
Author: silvan |
Date: 2014-10-08.13:07:21 |
|
I attached a file. But this is not the debug output, is it?
Our debug output says:
reg
in signal_handler
signal_handler 1
signal_handler 2
signal_handler 3
in print_peak_memory
in get_peak_memory_in_kb
get_peak_memory_in_kb 1
get_peak_memory_in_kb 2
get_peak_memory_in_kb 3
get_peak_memory_in_kb 4
|
msg3718 (view) |
Author: malte |
Date: 2014-10-08.13:04:44 |
|
That's the output I'm interested in, but not just the last line. In particular,
I'm wondering if more than one signal is involved. (If not, it's hard to
understand why the problem occurs some of the time, but not all of the time.)
Can you post the complete output of /proc/.../status somewhere?
|
msg3716 (view) |
Author: silvan |
Date: 2014-10-08.13:01:53 |
|
What do you mean by debug output?
We only added print messages to know where in the code we are (see below).
|
msg3715 (view) |
Author: malte |
Date: 2014-10-08.12:59:49 |
|
What does the debug output say in the failing cases?
|
msg3712 (view) |
Author: silvan |
Date: 2014-10-08.11:51:08 |
|
I had a look at one of the process id files (I can open them), and it says
(among lots of other text) status: sleeping.
|
msg3711 (view) |
Author: malte |
Date: 2014-10-08.11:50:49 |
|
Regarding signal/sigaction: signals are unsafe in the sense that they are
subject to race conditions, but in our scenario this should not be an
insurmountable obstacle, given that we don't want to do much more than terminate
the program in a controlled way. Using sigaction in place of signal sounds like
a good idea and might help fix this issue. It would be good if we could also
find out what is actually going on, though.
|
msg3709 (view) |
Author: malte |
Date: 2014-10-08.11:46:51 |
|
> The randomization is stored in the file "run" in an array that maps task ids to
> run dirs. I expected index 24 (+- 1) of this array to be one of the offending
> run dirs but it isn't.
This is indeed odd. It definitely used to work that way.
|
msg3708 (view) |
Author: jendrik |
Date: 2014-10-08.11:34:57 |
|
Well, it would be nice if it worked the way you thought it works, since you
implemented the randomization ;)
|
msg3707 (view) |
Author: florian |
Date: 2014-10-08.11:30:03 |
|
The randomization is stored in the file "run" in an array that maps task ids to
run dirs. I expected index 24 (+- 1) of this array to be one of the offending
run dirs but it isn't. Apparently, this works differently than I thought.
You should check the file "/proc/<pid>/status" instead of "/proc/self/status"
(self is the calling process). You can also look at "cat /proc/<pid>/cmdline" to
show the command line or at "ls -la /proc/<pid>/cwd" to show the working dir of
a process. The latter should help to identify the correct process on the node.
|
msg3706 (view) |
Author: silvan |
Date: 2014-10-08.11:16:32 |
|
Ok, apparently, the files written by the experiments did not automatically get
group access rights, which is strange, because my entire maia home directory
should be accessible to you. I've added the rights manually now for all the run
dirs.
Why are these unexpected? According to Jendrik, this is how the randomization
works: the runs are still sorted and numbered according to domain:problem, but
are then assigned task ids randomly to ensure random execution.
I don't know how to determine the correct process id. On the four nodes, there
are also other downward-release processes of my other experiment running. If I
know the process ID, how would I open the "self/status" file?
|
msg3704 (view) |
Author: florian |
Date: 2014-10-08.10:28:57 |
|
The log files do not have group access rights; could you add them?
Also, can you open the status file for the correct process id manually?
For reference, the run_dirs are:
runs-00001-00100/00081
runs-00001-00100/00085
runs-00001-00100/00090
runs-00001-00100/00094
Side note: the numbers are kind of unexpected; the task ids of the hanging tasks
are 24, 29, 41 and 93. We do randomization, but I could not match these numbers
to 81, 85, 90 and 94 in the run script.
|
msg3703 (view) |
Author: silvan |
Date: 2014-10-08.10:02:33 |
|
Currently, I have an experiment with four jobs "hanging" in the queue without
doing anything. I had a look at the log file we added, and it seems that the
process never gets past the following line of code:
procfile.open("/proc/self/status"); (utilities.cc::170 in
https://bitbucket.org/flogo/downward-issues-issue479/src/010442ba16c51046fbc94d274bac10515f91cf55/src/search/utilities.cc?at=issue479-experimental)
I.e., the output in our logfile stops after: get_peak_memory_in_kb 4
You can find the logs under
/infai/sieverss/repos/downward/issue479/experiments/issue479/data/issue479-issue479/.
Looking for empty "run.err" files (wc -l runs-00001-00100/*/run.err), you get
exactly the runs which didn't complete. (I checked all four files and they all
look the same.)
|
msg3701 (view) |
Author: florian |
Date: 2014-10-08.00:34:59 |
|
Reading up on this, I found the following:
http://en.wikipedia.org/wiki/Sigaction#Replacement_of_deprecated_signal.28.29
Could this be the problem we are seeing here? Maybe we can fix it by switching
from signal() to sigaction().
|
msg3688 (view) |
Author: florian |
Date: 2014-10-07.11:39:21 |
|
Silvan and I started looking into this.
|
msg3632 (view) |
Author: malte |
Date: 2014-10-04.19:31:31 |
|
Anyone willing to look into this one? If we don't fix it, we can't get memory
statistics when running out of time, which would really be a pity.
|
msg3557 (view) |
Author: jendrik |
Date: 2014-09-27.01:21:25 |
|
When fixing this we should ensure that the fix doesn't break the Windows build.
Patrick can probably test this.
|
msg3555 (view) |
Author: malte |
Date: 2014-09-26.16:36:27 |
|
I don't know enough about signal handling, but I wonder if the underlying
problem is that our signal handler is called, but doesn't manage to properly
terminate the process because of some interaction with signal handling of the
grid queue. Or can we also reproduce the SIGXCPU problem locally? (It worked for
me, but I didn't try many different tasks.)
As a first step, I suggest we add more output to signal_handler(). To test
things, I suggest adding output after every single line of the function,
including inside the if block. This means we might get odd interference patterns
in the output if the signal handler is invoked while it is running, but I think
this will still probably give us the maximum amount of information to continue with.
I wonder if the stuff we do to prevent recursive invocations is perhaps shooting
us in the foot in cases where we get a non-deadly signal. (I don't know if
SIGXCPU kills a process by default.)
|
msg3554 (view) |
Author: silvan |
Date: 2014-09-26.16:26:12 |
|
With changeset da23dc33eeba, Fast Downward catches the SIGXCPU signal. This
causes the following problem in many experiments I tried with that revision
(and newer ones): Some tasks of a grid job run "forever" (I deleted them 12
hours after they surpassed the time limit), i.e. they are not correctly
terminated after the given time limit. They don't appear in top, but in pstree,
and the nodes of the cluster are still occupied by the job. The failures are
not reproducible: it is not always the same task/config pair that cannot be
terminated correctly. The problem even arises when using only one domain like
airport (50 tasks), but not when using the single task that caused the problem
in a previous experiment.
For the moment, I fixed this problem by commenting out the change from that
revision.
A quick side remark: generally, I sometimes have problems killing the
planner with Ctrl+C (also confirmed by Malte), which could also be due to the
handling of signals in Fast Downward.
|