msg8624 |
Author: jendrik |
Date: 2019-03-06.09:58:06 |
|
I addressed it and merged the branch again.
|
msg8623 |
Author: malte |
Date: 2019-03-06.09:16:08 |
|
Works for me. I left a small comment.
|
msg8622 |
Author: jendrik |
Date: 2019-03-06.09:09:13 |
|
I think your latter suggestion is a good workaround. I pushed the change to
https://bitbucket.org/jendrikseipp/downward/pull-requests/122 .
|
msg8621 |
Author: malte |
Date: 2019-03-06.00:39:26 |
|
I think that's unsafe; there's no reason why the compiler should have any
particular file name. On my machine I have g++ version 5, but it's just called
/usr/bin/g++. We could call the compiler executable with --dumpversion and hope
that doesn't do anything stupid on non-g++, but that's not perfect either.

It's unfortunate that the CMAKE_CXX_COMPILER_ID (and version) are missing. I'm
sure there would be a way to get at them programmatically if we knew what we
were doing...

It looks like "cmake -LA -N ." includes all variables that *we* set, so we might
be able to work around this by adding an assignment to the CMake files that
copies CMAKE_CXX_COMPILER_ID (and the version) to some externally visible name
of our choosing (if it's internally visible), and then just use "cmake
--get-variable=JUERGEN_POTZKOTHEN" (or whatever variable name we choose). Or use
CMAKE_COMPILER_IS_GNUCXX. (See our function "fast_downward_set_compiler_flags"
in the CMake files.)
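(For illustration: if we copied it into a hypothetical cache variable, say
PLANNER_CXX_COMPILER_ID, the query side could be as simple as
    cmake -LA -N ../../src/ | grep '^PLANNER_CXX_COMPILER_ID'
run from the build directory.)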
|
msg8619 |
Author: jendrik |
Date: 2019-03-06.00:11:35 |
|
Thanks for the link! I searched around a bit more, but didn't find anything
else. Using the command from your link I get:
builds/release$ cmake -LA -N ../../src/ | grep COMPILER
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++-5
CMAKE_CXX_COMPILER_AR:FILEPATH=/usr/bin/gcc-ar-5
CMAKE_CXX_COMPILER_RANLIB:FILEPATH=/usr/bin/gcc-ranlib-5
CMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc-5
CMAKE_C_COMPILER_AR:FILEPATH=/usr/bin/gcc-ar-5
CMAKE_C_COMPILER_RANLIB:FILEPATH=/usr/bin/gcc-ranlib-5
CMAKE_CXX_COMPILER_ID and CMAKE_CXX_COMPILER_VERSION are missing, however.
Are you fine with checking that the last part of CMAKE_CXX_COMPILER:FILEPATH is
"g++-5"?
|
msg8617 |
Author: malte |
Date: 2019-03-05.23:10:04 |
|
I'm reasonably sure cmake has an interface for querying these, which would be a
cleaner way. This is the first hit I found; didn't investigate more deeply:
https://stackoverflow.com/questions/8474753/how-to-get-a-cmake-variable-from-the-command-line
|
msg8616 |
Author: jendrik |
Date: 2019-03-05.17:55:08 |
|
In my setup the file builds/release/CMakeFiles/3.10.2/CMakeCXXCompiler.cmake
starts with the following lines:
set(CMAKE_CXX_COMPILER "/usr/bin/g++-5")
set(CMAKE_CXX_COMPILER_ARG1 "")
set(CMAKE_CXX_COMPILER_ID "GNU")
set(CMAKE_CXX_COMPILER_VERSION "5.5.0")
set(CMAKE_CXX_COMPILER_VERSION_INTERNAL "")
set(CMAKE_CXX_COMPILER_WRAPPER "")
What do you think about globbing
"builds/release/CMakeFiles/*/CMakeCXXCompiler.cmake"
to find the CMakeCXXCompiler.cmake file and then grepping the file for
CMAKE_CXX_COMPILER_ID and CMAKE_CXX_COMPILER_VERSION?
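Roughly like this, I mean (a sketch, assuming the usual build directory layout):
for f in builds/release/CMakeFiles/*/CMakeCXXCompiler.cmake; do
    # Print only the ID and VERSION assignments from the cached compiler info.
    grep -E 'set\(CMAKE_CXX_COMPILER_(ID|VERSION) ' "$f"
done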
|
msg8615 |
Author: malte |
Date: 2019-03-05.17:40:26 |
|
I think that's preferable. It's not great because we cannot be sure that "gcc"
is the compiler that was configured to perform this build. If we can make at
least a minimal effort to figure out which compiler was configured (by checking
environment variables like CC, or whatever CMake's most basic mechanism for
determining the compiler is), that would be better. Perhaps it's possible to
directly query CMake which compiler it uses for a given build. That would be the
cleanest solution.
|
msg8614 |
Author: jendrik |
Date: 2019-03-05.17:30:29 |
|
I agree that I was a little too quick to follow stackoverflow here. Should I
just test the GCC version with -dumpversion instead?
|
msg8613 |
Author: malte |
Date: 2019-03-05.17:17:58 |
|
I don't think using "strings" to grep through the binary for traces of GCC is
the right way to handle this.
|
msg8612 |
Author: silvan |
Date: 2019-03-05.17:11:01 |
|
From my side, yes. (The changes are technical and I can't judge if they work
just from reading the file. So if they do, it's fine.)
|
msg8611 |
Author: jendrik |
Date: 2019-03-05.15:53:04 |
|
The patch at https://bitbucket.org/jendrikseipp/downward/pull-requests/122
should fix the problem. (I reopened branch issue703 for this.) Can I merge it?
|
msg8606 |
Author: malte |
Date: 2019-03-05.10:55:29 |
|
Setting this back to in-progress since it looks like we'll want further commits.

I'm not sure what the best place is for working around bugs specific to
particular gcc and valgrind version combinations. Keeping this in the Fast
Downward repository makes it a bit more "official", and I'm not sure how easy it
is to define the suppression in such a way that we can be reasonably confident
that it will not lead to false negatives.

But I suppose either alternative works for me. If it ends up in the repository,
it should be very clear from the code which versions of the tools this is a
workaround for, and it should include the links from Jendrik's earlier message.
That way, it will be easier to take this out again once we stop supporting this
gcc version.
|
msg8602 |
Author: jendrik |
Date: 2019-03-05.08:50:11 |
|
Unfortunately, adapting the Docker containers won't help users who run the tests
locally. I therefore suggest baking the workaround from msg8587 into the
test-memory-leaks.py script.
|
msg8601 |
Author: florian |
Date: 2019-03-05.08:16:44 |
|
Adding a file to some but not all Docker containers should be easy enough. We
already do this for the cplex installer. I can help with this once I'm in the
office.
|
msg8597 |
Author: malte |
Date: 2019-03-04.22:47:33 |
|
I looked around a bit more in the gcc and valgrind trackers; looks good. Yes, a
suppression file would be nice. It would be good if we could use it *only* for
gcc5, if our setup allows that, either by customizing that docker container in
some way (probably the cleaner version), or by dynamically testing the version
using something like
if gcc -dumpversion | grep -q '^5\.'; then
do special gcc5 thing
fi
People with better shell-fu can probably replace the call to grep with some
shell-internal thing here, but I think this one wouldn't be too terrible either.
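For example, a case statement would avoid the extra grep call (still assuming
"gcc" on the PATH is the compiler we care about):
case "$(gcc -dumpversion)" in
    5.*) # do special gcc5 thing
         ;;
esac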
|
msg8596 |
Author: jendrik |
Date: 2019-03-04.22:16:53 |
|
Here is what I found:
https://stackoverflow.com/questions/30393229/new-libstdc-of-gcc5-1-may-allocate-large-heap-memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66339
It seems that GCC 5's libstdc++ allocates memory that is never freed. This was
fixed in a later compiler version.
Should I let Valgrind generate a suppression file and apply it?
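The workflow would presumably be something like this (a sketch; the planner
invocation is only a placeholder):
# Let valgrind print suppression blocks for the errors it reports:
valgrind --leak-check=full --gen-suppressions=all ./downward ... 2> valgrind.log
# Copy the emitted "{ ... }" blocks into a file, say gcc5.supp, and use it with:
valgrind --leak-check=full --suppressions=gcc5.supp ./downward ...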
|
msg8595 |
Author: malte |
Date: 2019-03-04.19:19:34 |
|
Doesn't "new" in most or at least many implementations end up calling "malloc",
so that this might well be an actual leak? So a filter that says one malloc leak
is fine doesn't seem safe to me. Are you confident of your diagnosis that
everything is alright and it isn't our code that is responsible? What is it
based on?
Even if it is the case that "our" code cannot cause complaints about malloc, our
code could trigger leaks that are reported as complaints about malloc, e.g. when
we use standard library functions that use malloc internally or when we create
memory leaks via the LP solver interface.
The clean way to handle things like these are valgrind suppression files. I hope
that it should be reasonably easy to define one here, but first I'd like to be
sure what is actually going on.
|
msg8594 |
Author: jendrik |
Date: 2019-03-04.19:11:44 |
|
The new test works fine for our Clang and GCC-8 workers, but fails for GCC-5:
http://buildbot.fast-downward.org/#/builders/17/builds/301
The error is the same for all configurations. It seems that Valgrind reports its
own malloc memory allocation. What do you think about a solution that checks the
Valgrind output and ignores the error if both " malloc " and "ERROR SUMMARY: 1
errors from 1 contexts (suppressed: 0 from 0)" occur in the output? It should be
safe to ignore errors related to malloc since we don't use it ourselves. Of
course I'd be happy if someone knows a less hacky solution :-)
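In shell terms, the check would be roughly the following (assuming the valgrind
output has been captured in a file, here called valgrind.log; the test script
would do the equivalent in Python):
if grep -q ' malloc ' valgrind.log &&
   grep -q 'ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)' valgrind.log; then
    echo "Ignoring the single malloc-related report."
fi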
|
msg8593 |
Author: malte |
Date: 2019-03-04.19:06:07 |
|
Sounds good. Since we don't have a lot of experience with these yet, it may be a
good idea for Florian to review the changes to the infrastructure repository,
but I'll leave it to the two of you whether you want to do that. No need to have
me in the loop.
|
msg8592 |
Author: jendrik |
Date: 2019-03-04.18:56:37 |
|
You're right, thanks for the heads up! I forgot to push the changes and will do
so tomorrow from the build worker.
|
msg8591 |
Author: malte |
Date: 2019-03-04.18:50:44 |
|
Doesn't this require updating the buildbot setup (docker/docker-compose recipes
etc.)?
|
msg8590 |
Author: jendrik |
Date: 2019-03-04.16:19:09 |
|
I resolved the two TODOs from the summary and merged the code (the wiki simply
links to bitbucket-pipelines.yml to document the tools for automated tests, so
there wasn't anything to update there).
|
msg8578 |
Author: jendrik |
Date: 2019-03-02.00:23:55 |
|
Awesome!
We still need to install valgrind on the buildbots, I guess, so I'll hold off on
merging until I'm back in the office. (Added this to the summary.)
|
msg8576 |
Author: malte |
Date: 2019-03-01.23:38:07 |
|
Yes, looks good to merge. Thanks!
|
msg8575 |
Author: jendrik |
Date: 2019-03-01.23:32:23 |
|
I agree that it would be nice to get rid of more raw pointers and I can see
myself looking into this when time allows. However, I don't think this should
block making progress on this issue. Do you think we can merge the code?
|
msg8573 |
Author: malte |
Date: 2019-03-01.22:34:22 |
|
No, no, in the paragraph "I think this is a useful addition, but...", focus more
on the first 7 words.
By saying that I would prefer not to rely on valgrind, I didn't mean that we
shouldn't use it, just that we should also look at memory allocation more
systematically.
> Converting all raw pointers to std::unique_ptr will require quite some work,
> I think. In my opinion, this issue should only be about releasing all memory
> before the planner exits.
If getting rid of (more) manual new/deletes is too much work for now, I'm fine
with not doing it now. It's a bit of a pity because I think we already deferred
it once or twice.
|
msg8571 |
Author: jendrik |
Date: 2019-03-01.20:19:56 |
|
I dealt with your comments on Bitbucket.
Judging by your msg8567, the issue title is wrong and we don't want to use
valgrind after all?
Converting all raw pointers to std::unique_ptr will require quite some work, I
think. In my opinion, this issue should only be about releasing all memory
before the planner exits.
|
msg8567 |
Author: malte |
Date: 2019-03-01.17:21:47 |
|
I count two leaks and one false positive, but that's still an improvement. :-)

I think this would be a useful addition, but I also want to point out that these
were both intentionally implemented this way back when we didn't yet care about
releasing objects with (essentially) global lifetime, because they caused no
harm, and before unique_ptr and friends it was hard to deal with these
gracefully. Now that we care more about this, I'd be much happier if we looked
at these legacy memory allocations systematically rather than relying on
valgrind to catch them.

Searching for the word "new" in the code finds about 30-40 uses of the operator
(unfortunately plus lots of occurrences of the word "new" in comments) and a
similar number of uses of the operator "delete". Some of these are intentional,
but it seems that quite a few are not; from a quick glance, it looks like
searching for these and using unique_ptrs etc. as appropriate would find the
same code issues and more.
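Something like the following finds the candidates (a rough search; as mentioned,
it also shows the word "new" in comments):
grep -rnw --include='*.cc' --include='*.h' -e 'new' -e 'delete' src/search/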
|
msg8566 |
Author: silvan |
Date: 2019-03-01.16:46:22 |
|
I left one comment. Otherwise, it looks fine to me.
|
msg8565 |
Author: jendrik |
Date: 2019-03-01.15:14:34 |
|
I looked into this and created a pull request at
https://bitbucket.org/jendrikseipp/downward/pull-requests/122 .
It includes a test that runs valgrind for some "standard" configs. (I created
issue897 for updating the set of tested configurations.) The test uncovered
three memory leaks that are now fixed. At least for the tested combinations of
tasks and configurations, the planner doesn't have any memory leaks anymore.
|
msg5991 |
Author: malte |
Date: 2016-12-24.15:07:39 |
|
Now that we're better about freeing memory, we should run valgrind to check that
there are no leaks. While we're at it, we should also look at other memory
issues that valgrind can detect. One problem is that any individual planner run
only covers a small part of the search algorithms, heuristics, etc., so we would
also need to think about code coverage. Perhaps this is also interesting for
other reasons (e.g. when testing whole-planner changes, where we currently test
more or less ad-hoc configs).

We might want to link dynamically for this. (See Silvan's comment at issue662.)
|
|
Date | User | Action | Args
2019-03-06 09:58:06 | jendrik | set | messages: + msg8624
2019-03-06 09:16:08 | malte | set | messages: + msg8623
2019-03-06 09:09:13 | jendrik | set | messages: + msg8622
2019-03-06 00:39:26 | malte | set | messages: + msg8621
2019-03-06 00:11:35 | jendrik | set | messages: + msg8619
2019-03-05 23:10:04 | malte | set | messages: + msg8617
2019-03-05 17:55:08 | jendrik | set | messages: + msg8616
2019-03-05 17:40:27 | malte | set | messages: + msg8615
2019-03-05 17:30:29 | jendrik | set | messages: + msg8614
2019-03-05 17:17:58 | malte | set | messages: + msg8613
2019-03-05 17:11:01 | silvan | set | messages: + msg8612
2019-03-05 15:53:04 | jendrik | set | messages: + msg8611
2019-03-05 10:55:29 | malte | set | messages: + msg8606
2019-03-05 08:50:11 | jendrik | set | messages: + msg8602
2019-03-05 08:16:44 | florian | set | messages: + msg8601
2019-03-04 22:47:33 | malte | set | messages: + msg8597
2019-03-04 22:16:53 | jendrik | set | messages: + msg8596
2019-03-04 19:19:34 | malte | set | messages: + msg8595
2019-03-04 19:11:44 | jendrik | set | messages: + msg8594
2019-03-04 19:06:07 | malte | set | messages: + msg8593
2019-03-04 18:56:37 | jendrik | set | messages: + msg8592
2019-03-04 18:50:44 | malte | set | messages: + msg8591
2019-03-04 16:19:09 | jendrik | set | status: in-progress -> resolved; messages: + msg8590; summary: "TODO: Install valgrind on buildbots. TODO: Document valgrind dependency in the wiki." -> (empty)
2019-03-02 00:23:55 | jendrik | set | status: reviewing -> in-progress; messages: + msg8578; summary: "TODO: Install valgrind on buildbots. TODO: Document valgrind dependency in the wiki."
2019-03-01 23:38:07 | malte | set | messages: + msg8576
2019-03-01 23:32:23 | jendrik | set | messages: + msg8575
2019-03-01 22:34:22 | malte | set | messages: + msg8573
2019-03-01 20:19:56 | jendrik | set | messages: + msg8571
2019-03-01 17:21:47 | malte | set | messages: + msg8567
2019-03-01 16:46:22 | silvan | set | messages: + msg8566
2019-03-01 15:14:34 | jendrik | set | status: chatting -> reviewing; assignedto: jendrik; messages: + msg8565
2017-01-09 11:05:53 | florian | set | nosy: + florian
2017-01-03 10:58:30 | silvan | set | nosy: + silvan
2016-12-27 19:33:25 | haz | set | nosy: + haz
2016-12-24 16:21:10 | jendrik | set | nosy: + jendrik
2016-12-24 15:07:40 | malte | set | messages: + msg5991; summary: (text of msg5991) -> (no value)
2016-12-24 15:06:29 | malte | create |