Issue703

Title run valgrind's memory debugging tools
Priority wish Status resolved
Superseder Nosy List florian, haz, jendrik, malte, silvan
Assigned To jendrik Keywords
Optional summary

Created on 2016-12-24.15:06:29 by malte, last changed by jendrik.

Messages
msg8624 (view) Author: jendrik Date: 2019-03-06.09:58:06
I addressed it and merged the branch again.
msg8623 (view) Author: malte Date: 2019-03-06.09:16:08
Works for me. I left a small comment.
msg8622 (view) Author: jendrik Date: 2019-03-06.09:09:13
I think your latter suggestion is a good workaround. I pushed the change to
https://bitbucket.org/jendrikseipp/downward/pull-requests/122 .
msg8621 (view) Author: malte Date: 2019-03-06.00:39:26
I think that's unsafe, there's no reason why the compiler should have any
particular file name. On my machine I have g++ version 5, but it's just called
/usr/bin/g++. We could call the compiler executable with --dumpversion and hope
that doesn't do anything stupid on non-g++, but that's not perfect either.

It's unfortunate that the CMAKE_CXX_COMPILER_ID (and version) are missing. I'm
sure there would be a way to get at them programmatically if we knew what we
were doing...

It looks like "cmake -LA -N ." includes all variables that *we* set, so we might
be able to work around this by adding an assignment into the CMake file that
assigns CMAKE_CXX_COMPILER_ID (and the version) to some externally visible name
of our choosing if it's internally visible, and then just use "cmake
--get-variable=JUERGEN_POTZKOTHEN" (or whatever variable name we choose). Or use
CMAKE_COMPILER_IS_GNUCXX. (See our function "fast_downward_set_compiler_flags"
in the CMake files.)
msg8619 (view) Author: jendrik Date: 2019-03-06.00:11:35
Thanks for the link! I searched around a bit more, but didn't find anything
else. Using the command from your link I get:

builds/release$ cmake -LA -N ../../src/ | grep COMPILER
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++-5
CMAKE_CXX_COMPILER_AR:FILEPATH=/usr/bin/gcc-ar-5
CMAKE_CXX_COMPILER_RANLIB:FILEPATH=/usr/bin/gcc-ranlib-5
CMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc-5
CMAKE_C_COMPILER_AR:FILEPATH=/usr/bin/gcc-ar-5
CMAKE_C_COMPILER_RANLIB:FILEPATH=/usr/bin/gcc-ranlib-5

The CMAKE_CXX_COMPILER_ID and CMAKE_CXX_COMPILER_VERSION are missing however.
Are you fine with checking that the last part of CMAKE_CXX_COMPILER:FILEPATH is
"g++-5"?
msg8617 (view) Author: malte Date: 2019-03-05.23:10:04
I'm reasonably sure cmake has an interface for querying these, which would be a
cleaner way. This is the first hit I found; didn't investigate more deeply:

https://stackoverflow.com/questions/8474753/how-to-get-a-cmake-variable-from-the-command-line
msg8616 (view) Author: jendrik Date: 2019-03-05.17:55:08
In my setup the file builds/release/CMakeFiles/3.10.2/CMakeCXXCompiler.cmake
starts with the following lines:

set(CMAKE_CXX_COMPILER "/usr/bin/g++-5")
set(CMAKE_CXX_COMPILER_ARG1 "")
set(CMAKE_CXX_COMPILER_ID "GNU")
set(CMAKE_CXX_COMPILER_VERSION "5.5.0")
set(CMAKE_CXX_COMPILER_VERSION_INTERNAL "")
set(CMAKE_CXX_COMPILER_WRAPPER "")

What do you think about globbing
"builds/release/CMakeFiles/*/CMakeCXXCompiler.cmake"
to find the CMakeCXXCompiler.cmake file and then grepping the file for
CMAKE_CXX_COMPILER_ID and CMAKE_CXX_COMPILER_VERSION?
msg8615 (view) Author: malte Date: 2019-03-05.17:40:26
I think that's preferable. It's not great because we cannot be sure that "gcc"
is the compiler that was configured to perform this build. If we can make at
least a minimal effort to figure out which compiler was configured (by checking
environment variables like CC or whatever CMake's most basic mechanism for
determining the compiler is), this would be preferable. Perhaps it's possible to
directly query CMake which compiler it uses for a given build. That would be the
cleanest solution.
msg8614 (view) Author: jendrik Date: 2019-03-05.17:30:29
I agree that I was a little too quick to 
follow stackoverflow here. Should I just test 
the GCC version with -dumpversion instead?
msg8613 (view) Author: malte Date: 2019-03-05.17:17:58
I don't think using "strings" to grep through the binary for traces of GCC is
the right way to handle this.
msg8612 (view) Author: silvan Date: 2019-03-05.17:11:01
From my side, yes. (The changes are technical and I can't judge if they work
just from reading the file. So if they do, it's fine.)
msg8611 (view) Author: jendrik Date: 2019-03-05.15:53:04
The patch at https://bitbucket.org/jendrikseipp/downward/pull-requests/122
should fix the problem. (I reopened branch issue703 for this.) Can I merge it?
msg8606 (view) Author: malte Date: 2019-03-05.10:55:29
Setting this back to in-progress since it looks like we'll want further commits.

I'm not sure what the best place to work around bugs specific to particular gcc
and valgrind version combinations is. Keeping this in the Fast Downward
repository makes it a bit more "official", and I'm not sure how easy it is to
define the suppression in such a way that we can reasonably confident that it
will not lead to false negatives.

But I suppose either alternative works for me. If it ends up in the repository,
it should be very clear from the code which versions of the tools this is a
workaround for, and it should include the links from Jendrik's earlier message.
That way, it will be easier to take this out again once we stop supporting this
gcc version.
msg8602 (view) Author: jendrik Date: 2019-03-05.08:50:11
Unfortunately, adapting the Docker containers won't help users who run the tests
locally. I therefore suggest to bake the workaround from msg8587 into the
test-memory-leaks.py script.
msg8601 (view) Author: florian Date: 2019-03-05.08:16:44
Adding a file to some but not all Docker containers should be easy enough. We
already do this for the cplex installer. I can help with this once I'm in the
office.
msg8597 (view) Author: malte Date: 2019-03-04.22:47:33
I looked around a bit more in the gcc and valgrind trackers; looks good. Yes, a
suppression file would be nice. It would be good if we could use it *only* for
gcc5, if our setup allows that, either by customizing that docker container in
some way (probably the cleaner version), or by dynamically testing the version
using something like

if gcc -dumpversion | grep -q '^5\.'; then
   do special gcc5 thing
fi

People with better shell-fu can probably replace the call to grep with some
shell-internal thing here, but I think this one wouldn't be too terrible either.
msg8596 (view) Author: jendrik Date: 2019-03-04.22:16:53
Here is what I found:

https://stackoverflow.com/questions/30393229/new-libstdc-of-gcc5-1-may-allocate-large-heap-memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66339

It seems that GCC 5 allocates memory that is not freed. This was fixed in a
later compiler version.

Should I let Valgrind generate a suppression file and apply it?
msg8595 (view) Author: malte Date: 2019-03-04.19:19:34
Doesn't "new" in most or at least many implementations end up calling "malloc",
so that this might well be an actual leak? So a filter that says one malloc leak
is fine doesn't seem safe to me. Are you confident of your diagnosis that
everything is alright and it isn't our code that is responsible? What is it
based on?

Even if it is the case that "our" code cannot cause complaints about malloc, our
code could trigger leaks that are reported as complaints about malloc, e.g. when
we use standard library functions that use malloc internally or when we create
memory leaks via the LP solver interface.

The clean way to handle things like these are valgrind suppression files. I hope
that it should be reasonably easy to define one here, but first I'd like to be
sure what is actually going on.
msg8594 (view) Author: jendrik Date: 2019-03-04.19:11:44
The new test works fine for our Clang and GCC-8 workers, but fails for GCC-5:
http://buildbot.fast-downward.org/#/builders/17/builds/301

The error is the same for all configurations. It seems that Valgrind reports its
own malloc memory allocation. What do you think about a solution that checks the
Valgrind output and ignores the error if both " malloc " and "ERROR SUMMARY: 1
errors from 1 contexts (suppressed: 0 from 0)" occur in the output? It should be
safe to ignore errors related to malloc since we don't use it ourselves. Of
course I'd be happy if someone knows a less hacky solution :-)
msg8593 (view) Author: malte Date: 2019-03-04.19:06:07
Sounds good. Since we don't have a lot of experience with these yet, it may be a
good idea for Florian to review the changes to the infrastructure repository,
but I'll leave it to the two of you whether you want to do that. No need to have
me in the loop.
msg8592 (view) Author: jendrik Date: 2019-03-04.18:56:37
You're right, thanks for the heads up! I forgot to push the changes and will do
so tomorrow from the build worker.
msg8591 (view) Author: malte Date: 2019-03-04.18:50:44
Doesn't this require updating the buildbot setup (docker/docker-compose recipes
etc.)?
msg8590 (view) Author: jendrik Date: 2019-03-04.16:19:09
I resolved the two TODOs from the summary and merged the code (the wiki simply
links to bitbucket-pipelines.yml to document the tools for automated tests, so
there wasn't anything to update there).
msg8578 (view) Author: jendrik Date: 2019-03-02.00:23:55
Awesome! 

We still need to install valgrind on the buildbots, I guess, so I'll wait with
merging until I'm back in the office. (Added this to the summary.)
msg8576 (view) Author: malte Date: 2019-03-01.23:38:07
Yes, looks good to merge. Thanks!
msg8575 (view) Author: jendrik Date: 2019-03-01.23:32:23
I agree that it would be nice to get rid of more raw pointers and I can see
myself looking into this when time allows. However, I don't think this should
block making progress on this issue. Do you think we can merge the code?
msg8573 (view) Author: malte Date: 2019-03-01.22:34:22
No, no, in the paragraph "I think this is a useful addition, but...", focus more
on the first 7 words.

By saying I would prefer we shouldn't rely on valgrind I didn't mean we
shouldn't use it, just that we should also look at memory allocation more
systematically.

> Converting all raw pointers to std::unique_ptr will require quite some work,
> I think. In my opinion, this issue should only be about releasing all memory
> before the planner exits.

If getting rid of (more) manual new/deletes is too much work for now, I'm fine
with not doing it now. It's a bit of a pity because I think we already deferred
it once or twice.
msg8571 (view) Author: jendrik Date: 2019-03-01.20:19:56
I dealt with your comments on Bitbucket.

Judging by your msg8567, the issue title is wrong and we don't want to use
valgrind after all?

Converting all raw pointers to std::unique_ptr will require quite some work, I
think. In my opinion, this issue should only be about releasing all memory
before the planner exits.
msg8567 (view) Author: malte Date: 2019-03-01.17:21:47
I count two leaks and one false positive, but that's still an improvement. :-)

I think this would be a useful addition, but I also want to point out that these
were both intentionally implemented this way back when we didn't care yet about
releasing objects with (essentially) global lifetime because they caused no harm
and before unique_ptr and friends, it was harm to deal with these gracefully.
Now that we care more about this, I'd be much happier if we looked at these
legacy memory allocations systematically rather than relying on valgrind to
catch them.

Searching for the word "new" in the code finds about 30-40 uses of the operator
(unfortunately plus lots of occurrences of the word "new" in comments) and a
similar amount of uses of the operator "delete". Some of these are intentional,
but it seems that quite a few are not; from a quick glance, it sounds like
searching for these and using unique_ptrs etc. as appropriate would find the
same code issues and more.
msg8566 (view) Author: silvan Date: 2019-03-01.16:46:22
I left one comment. Otherwise, it looks fine to me.
msg8565 (view) Author: jendrik Date: 2019-03-01.15:14:34
I looked into this and created a pull request at 
https://bitbucket.org/jendrikseipp/downward/pull-requests/122 .

It includes a test that runs valgrind for some "standard" configs. (I created
issue897 for updating the set of tested configurations.) The test uncovered
three memory leaks that are now fixed. At least for the tested combinations of
tasks and configurations, the planner doesn't have any memory leaks anymore.
msg5991 (view) Author: malte Date: 2016-12-24.15:07:39
Now that we're better about freeing memory, we should run valgrind to check that
there are no leaks. While we're at it, we should also look at other memory
issues that valgrind can detect. One problem is that any individual planner run
only covers a small amount of search algorithm, heuristics, etc., so we would
also need to think about code coverage. Perhaps this is also interesting for
other reasons (e.g. when testing whole-planner changes, where we currently test
more or less ad-hoc configs).

We might want to link dynamically for this. (See Silvan's comment at issue662.)
History
Date User Action Args
2019-03-06 09:58:06jendriksetmessages: + msg8624
2019-03-06 09:16:08maltesetmessages: + msg8623
2019-03-06 09:09:13jendriksetmessages: + msg8622
2019-03-06 00:39:26maltesetmessages: + msg8621
2019-03-06 00:11:35jendriksetmessages: + msg8619
2019-03-05 23:10:04maltesetmessages: + msg8617
2019-03-05 17:55:08jendriksetmessages: + msg8616
2019-03-05 17:40:27maltesetmessages: + msg8615
2019-03-05 17:30:29jendriksetmessages: + msg8614
2019-03-05 17:17:58maltesetmessages: + msg8613
2019-03-05 17:11:01silvansetmessages: + msg8612
2019-03-05 15:53:04jendriksetmessages: + msg8611
2019-03-05 10:55:29maltesetmessages: + msg8606
2019-03-05 08:50:11jendriksetmessages: + msg8602
2019-03-05 08:16:44floriansetmessages: + msg8601
2019-03-04 22:47:33maltesetmessages: + msg8597
2019-03-04 22:16:53jendriksetmessages: + msg8596
2019-03-04 19:19:34maltesetmessages: + msg8595
2019-03-04 19:11:44jendriksetmessages: + msg8594
2019-03-04 19:06:07maltesetmessages: + msg8593
2019-03-04 18:56:37jendriksetmessages: + msg8592
2019-03-04 18:50:44maltesetmessages: + msg8591
2019-03-04 16:19:09jendriksetstatus: in-progress -> resolved
messages: + msg8590
summary: TODO: Install valgrind on buildbots. TODO: Document valgrind dependency in the wiki. ->
2019-03-02 00:23:55jendriksetstatus: reviewing -> in-progress
messages: + msg8578
summary: TODO: Install valgrind on buildbots. TODO: Document valgrind dependency in the wiki.
2019-03-01 23:38:07maltesetmessages: + msg8576
2019-03-01 23:32:23jendriksetmessages: + msg8575
2019-03-01 22:34:22maltesetmessages: + msg8573
2019-03-01 20:19:56jendriksetmessages: + msg8571
2019-03-01 17:21:47maltesetmessages: + msg8567
2019-03-01 16:46:22silvansetmessages: + msg8566
2019-03-01 15:14:34jendriksetstatus: chatting -> reviewing
assignedto: jendrik
messages: + msg8565
2017-01-09 11:05:53floriansetnosy: + florian
2017-01-03 10:58:30silvansetnosy: + silvan
2016-12-27 19:33:25hazsetnosy: + haz
2016-12-24 16:21:10jendriksetnosy: + jendrik
2016-12-24 15:07:40maltesetmessages: + msg5991
summary: Now that we're better about freeing memory, we should run valgrind to check that there are no leaks. While we're at it, we should also look at other memory issues that valgrind can detect. One problem is that any individual planner run only covers a small amount of search algorithm, heuristics, etc., so we would also need to think about code coverage. Perhaps this is also interesting for other reasons (e.g. when testing whole-planner changes, where we currently test more or less ad-hoc configs). We might want to link dynamically for this. (See Silvan's comment at issue662.) -> (no value)
2016-12-24 15:06:29maltecreate