Issue837

Title use debug mode for STL in debug builds
Priority wish Status resolved
Superseder Nosy List cedric, florian, guillem, jendrik, malte, silvan
Assigned To jendrik Keywords
Optional summary

Created on 2018-09-20.11:27:26 by jendrik, last changed by florian.

Messages
msg9728 (view) Author: florian Date: 2020-10-01.12:14:39
I created issue982 to deal with _GLIBCXX_DEBUG in LP builds.
msg9198 (view) Author: malte Date: 2020-02-13.12:56:30
I suggest we discuss the details of this live, there must be something I'm still missing.
msg9196 (view) Author: florian Date: 2020-02-13.09:38:52
Yes, that would be possible. I thought you wanted to detect whether the library is compatible. I think your suggestion is a bit drastic though. It would make it even more difficult to debug LP code and would make all assertions in the LP code redundant. If something breaks with an LP, I'd rather have the option to remove the _GLIBCXX_DEBUG flag or even use an OSI version compiled with the flag. Maybe we can add a CMake option instead that switches off _GLIBCXX_DEBUG and stop the build only if there is an LP solver and this option is not used?
msg9195 (view) Author: malte Date: 2020-02-12.16:00:02
Wouldn't it be as simple as "In a debug build, LP configurations must be disabled?" When you say "some of the library calls work without issue", it sounds like we don't know which one and we in general cannot trust LP code in a debug build. Maybe I am misunderstanding or misinterpreting something.
msg9194 (view) Author: florian Date: 2020-02-12.15:58:03
Not that I know of. The build does not complain and some of the library calls work without issue.
msg9193 (view) Author: malte Date: 2020-02-12.15:53:08
Is there any way to detect this and abort the build, give an error during runtime when LP code is used in the wrong context or something along those lines?
msg9192 (view) Author: florian Date: 2020-02-12.15:36:17
Roughly one year later I ran into a similar problem again and found this issue after another day of debugging. Again, the problem was that Fast Downward was compiled with _GLIBCXX_DEBUG while OSI was not. This can lead to strange and hard to debug problems like a library function returning a vector with a single element producing a vector with size 8972014882083734.

I am writing this here as a reminder: if you have to debug LP code, do *not* use a debug build! I hope I remember it this time or I will be back here next year.
msg8120 (view) Author: florian Date: 2018-11-28.12:46:03
Yes, this was in the middle of the ICAPS rush, so I stopped spending time on it
once I had a running solution. I'm still not completely sure what happened. Here
is what I observed:

In our adapter to OSI, we derive a class ErrorCatchingCoinMessageHandler from
the OSI class CoinMessageHandler and override a method checkSeverity to handle
CPLEX out-of-memory situations cleanly. For that, we compare the content of the
produced message to some constant strings and exit with the correct exit code if
it matches. If it doesn't match, we call checkSeverity of the base class

In a debug build, the content of the message was always empty in our overridden
method but had the correct content in the base class. The field I accessed is a
protected char[1000] and I printed it just before calling the base class
implementation and in the first line of the base class implementation (modifying
the OSI code). The first print showed it as an empty string, the second had the
message as expected. In the release build this didn't occur.

OSI was compiled without the flag and our debug build was compiled with it, so I
assume that the class is layed out differently in memory in the library code and
our code so accessing the protected field gets the wrong data. But this is the
point where I stopped looking into it.
msg8108 (view) Author: malte Date: 2018-11-28.08:05:03
@Florian: Gabi mentioned that the debug STL build caused you a day of work
because it messed up something. Is that right? If yes, would it be worth telling
everyone (e.g. on ai@... or downward-dev) what the problem was, so that others
won't run into the same problems? Is there a way to prevent these problems?
msg7869 (view) Author: silvan Date: 2018-10-03.19:09:06
I see. Then this is fine with me.
msg7868 (view) Author: malte Date: 2018-10-03.19:08:22
No, this is run-time checking, not compile-time checking. Compilation shouldn't
take significantly longer, although we didn't test this.

The factor 2-3 (or perhaps it's more like 5-10, we didn't really analyze it a
lot) is for *runtime*, not compile time. This should catch all out-of-bounds
errors for vectors, along with some but not all other errors regarding use of
invalidated iterators and similar. I think it is most useful in cases where we
previously did *not* get a segfault; it is not unusual for an out-of-bounds
access to just lead to a garbage result rather than any kind of immediate crash,
and this can of course be hard to detect and debug.
msg7867 (view) Author: silvan Date: 2018-10-03.19:04:22
But this only works for index violations that can be determined at compile time,
I suppose? If so, this doesn't sound like it would prevent getting segfaults if
you just use the wrong index into a vector. Or will the segfault that the
compiled program then also be more informative?

I also think that 2-3 times longer compilation is a very big difference; I
normally exclusively compile in debug when developing.
msg7866 (view) Author: malte Date: 2018-10-03.19:01:11
Many thanks!

Regarding the message to the group, I think we should generally strive to give
people enough information to make an informed opinion without spending a lot of
time on it, i.e., ideally based on the message alone.

In this case, I think it's important to give people some idea of the magnitude
of the speed difference. People might think of this differently depending on
whether the slowdown factor is 1.1, 10, 100 or 1000. (Of course everybody can do
their own research if they like, but that's not time-efficient for the group as
a whole.)
msg7865 (view) Author: jendrik Date: 2018-10-03.18:56:59
I merged this. As you said, this change could quickly be reverted.
msg7864 (view) Author: malte Date: 2018-10-03.18:51:19
I think the more comprehensive option is better. We don't need a fast debug build.
msg7863 (view) Author: jendrik Date: 2018-10-03.18:49:36
Quick note: There's also _GLIBCXX_ASSERTIONS, which is automatically enabled by 
_GLIBCXX_DEBUG 
(https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_macros.html). However, 
_GLIBCXX_ASSERTIONS doesn't do container bounds checks as I just verified 
locally.
msg7862 (view) Author: malte Date: 2018-10-03.18:44:12
Thanks, Jendrik! I've left a message over there. Sometimes bumping an issue
helps getting it resolved.
msg7860 (view) Author: jendrik Date: 2018-10-03.18:41:35
There's already an issue for the failing assertions: issue467.
msg7859 (view) Author: malte Date: 2018-10-03.18:03:24
Thanks, Jendrik! I don't think efficiency of the debug build is a big deal, and
of course there is a reason why these additional checks are disabled by default.
It looks like in some tasks runtimes grows by more than a factor of 5.

I would be in favour of enabling this flag, but again I suggest writing to
downward-dev, both to give people a chance to check if they like this despite
the runtime impact, and also to let them know what this change means for them in
practice (in terms of getting more useful error checking out of debug builds). I
think it would be fine to merge before waiting for feedback, though. We can
always undo the change if necessary.

The most interesting thing for me is that we got a bunch of assertion failures
in both experiments. It seems they are all related to landmarks, but someone
should double-check this. Can you look into these a little bit and open an issue
for them if we don't already have one?
msg7857 (view) Author: jendrik Date: 2018-10-03.17:47:51
Here are the results:

https://ai.dmi.unibas.ch/_tmp_files/seipp/issue837-v1-opt-issue837-base-issue837-v1-compare.html
https://ai.dmi.unibas.ch/_tmp_files/seipp/issue837-v1-sat-issue837-base-issue837-v1-compare.html

Unfortunately, coverage decreases significantly.
msg7843 (view) Author: jendrik Date: 2018-10-03.12:23:12
Yes, the errors are printed on stderr and Lab will report them as unexplained 
errors.

Experiments are running.
msg7839 (view) Author: malte Date: 2018-10-02.20:02:19
Thanks, Jendrik! I would suggest running the usual experiments we do for
"whole-planner" changes here because there is a chance this will actually catch
some bugs. (It would also be interesting to see how much this slows down the code.)

We should also make sure that we actually see any warning that are produced in
the experiment reports. But I assume they are printed on stderr, and hence lab
will tell us about them, right?

I would suggest a before-after experiment, obviously only in debug mode, using a
5-minute timeout.
msg7838 (view) Author: jendrik Date: 2018-10-02.19:55:54
I tested Florian's suggestion locally and it works for GCC and Clang. When I 
compile and run the following code

vector<int> test;
cout << test[4] << endl;

the following warning is printed

/usr/include/c++/5.4.0/debug/vector:409:error: attempt to subscript
    container with out-of-bounds index 4, but container only holds 0
    elements.

Objects involved in the operation:
sequence "this" @ 0x0x7ffc94412ec0 {
  type = NSt7__debug6vectorIiSaIiEEE;
}

The (very small) patch is here: 
https://bitbucket.org/jendrikseipp/downward/pull-requests/102
msg7639 (view) Author: malte Date: 2018-09-20.13:29:13
Thanks, Florian!

I think "-D" is equivalent to using #define in the code, which means that you
can define anything, and if the compiler doesn't need it, it won't care. So we
don't necessarily need a per-compiler distinction.

The compiler itself doesn't do anything with these, I think. They are solely
there to control #ifdefs etc. So whether or not this will do something in clang
depends on whether or not clang uses glibc or not. I think it does, right? I'm
not aware that there are two different stdlib implementations in common use in
Linux. But I'm not 100% sure.
msg7638 (view) Author: florian Date: 2018-09-20.13:26:09
It should not be hard to add to the CMake build. We could add it here if it can
be passed as a compiler flag (I saw it used with "-D", i.e., "-D_GLIBCXX_DEBUG"):
http://hg.fast-downward.org/file/f67b4916c871/src/cmake_modules/FastDownwardMacros.cmake#l21

If it doesn't work in clang, we would need another if block because the check in
line 7 treats clang and gcc the same.
msg7633 (view) Author: malte Date: 2018-09-20.12:44:03
vector::at should *always* do a bounds check (and raise an exception, not fail
an assert if it fails). It's part of the spec.

The only reason I could see against _GLIBCXX_DEBUG is that it might be difficult
to set up, either on the CMake side or by requiring additional installs. We'll
find out if we try it out. :-)
msg7631 (view) Author: guillem Date: 2018-09-20.12:33:10
Don't know why, but I had always assumed that "-DDEBUG" was enabling this type of 
checks.
Ok, now I see that the confusion arose from the fact that "-DDEBUG" allows the the 
asserts throughout the STL, which include a bounds check when using vector::at:

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-
v3/include/bits/stl_vector.h#L1045

But not when using operator[], of course. 
Looks to me like it would be good to enable _GLIBCXX_DEBUG in our debug mode, would 
there be any strong argument against doing that?
msg7625 (view) Author: jendrik Date: 2018-09-20.11:27:26
We should think about whether we want to use the debug mode of libstdc++ in debug 
builds to enable vector bounds checking among other things. 
(https://stackoverflow.com/questions/38958523/safe-versions-of-stl-on-linux).
History
Date User Action Args
2020-10-01 12:14:39floriansetmessages: + msg9728
2020-02-13 12:56:30maltesetmessages: + msg9198
2020-02-13 09:38:52floriansetmessages: + msg9196
2020-02-12 16:00:02maltesetmessages: + msg9195
2020-02-12 15:58:03floriansetmessages: + msg9194
2020-02-12 15:53:08maltesetmessages: + msg9193
2020-02-12 15:36:18floriansetmessages: + msg9192
2018-11-28 12:46:03floriansetmessages: + msg8120
2018-11-28 08:05:03maltesetmessages: + msg8108
2018-10-04 09:01:44cedricsetnosy: + cedric
2018-10-03 19:09:06silvansetmessages: + msg7869
2018-10-03 19:08:22maltesetmessages: + msg7868
2018-10-03 19:04:22silvansetnosy: + silvan
messages: + msg7867
2018-10-03 19:01:11maltesetmessages: + msg7866
2018-10-03 18:56:59jendriksetstatus: testing -> resolved
messages: + msg7865
2018-10-03 18:51:19maltesetmessages: + msg7864
2018-10-03 18:49:36jendriksetmessages: + msg7863
2018-10-03 18:44:12maltesetmessages: + msg7862
2018-10-03 18:41:35jendriksetmessages: + msg7860
2018-10-03 18:03:24maltesetmessages: + msg7859
2018-10-03 17:47:51jendriksetmessages: + msg7857
2018-10-03 12:23:12jendriksetstatus: reviewing -> testing
messages: + msg7843
2018-10-02 20:02:19maltesetmessages: + msg7839
2018-10-02 19:55:54jendriksetstatus: chatting -> reviewing
assignedto: jendrik
messages: + msg7838
2018-09-20 13:29:13maltesetmessages: + msg7639
2018-09-20 13:26:09floriansetnosy: + florian
messages: + msg7638
2018-09-20 12:44:03maltesetmessages: + msg7633
2018-09-20 12:33:10guillemsetstatus: unread -> chatting
nosy: + guillem
messages: + msg7631
2018-09-20 11:27:26jendrikcreate