Issue960

Title Add debugging methods to LP solver interface
Priority feature Status resolved
Superseder Nosy List augusto, florian, jendrik, malte, rik
Assigned To rik Keywords
Optional summary

Created on 2020-01-24.16:59:07 by rik, last changed by jendrik.

Files
File name Uploaded Type Edit Remove
issue960-experiments.zip rik, 2020-09-21.16:32:41 application/x-zip-compressed
issue960-v3-experiments.zip rik, 2020-09-29.11:43:13 application/x-zip-compressed
lp_internals.cc.diff florian, 2020-02-27.16:26:07 text/x-patch
lp_solver.cc.diff florian, 2020-02-27.16:25:57 text/x-patch
Messages
msg9744 (view) Author: jendrik Date: 2020-10-23.14:07:30
I added a changelog entry for this issue on the main branch. Feel free to edit/expand.
msg9743 (view) Author: florian Date: 2020-10-22.15:29:35
Warnings like these depend on the compiler version, so it is normal that you don't get some warnings locally that we see on the buildbot (that is why we test the build with different compilers).

Looks like the buildbot is happy now, so I'll close the issue. Thanks again.
msg9742 (view) Author: rik Date: 2020-10-22.14:32:36
I already noticed the same problem and alleviated the problem and pushed a 
minute ago. I tried compiling the code locally after fixing the style issues, 
but didn't receive so much as a warning on my machine.
msg9741 (view) Author: florian Date: 2020-10-22.14:25:04
There is another error about an unused field "infinity". I think this is about the field in the class LinearProgram that we planned to add to avoid passing it in parallel all the time. We didn't do that in the end because it would have been a larger refactoring. Please just remove the field.

http://buildbot.fast-downward.org/#/builders/10/builds/598
msg9740 (view) Author: florian Date: 2020-10-20.21:44:43
one more problem just popped up: looks like the style test do not pass. Rik, can you run them locally with tox and fix the issue?
http://buildbot.fast-downward.org/#/builders/4/builds/288
msg9739 (view) Author: malte Date: 2020-10-20.21:43:40
It looks like the buildbot found a style violation. I did not look at the details.
msg9738 (view) Author: florian Date: 2020-10-20.21:42:56
Rik merged the change. There is one last small issue: the commit message of the merge was autogenerated by GitHub and broke our "[branchname] commit message" convention. I tried to fix it and force-pushed the change. If anyone pulled between 17:50 and now, please update your local copy. Rik, could you check the history in the main repository? If everything looks good to you, please make another update here to close the issue (set the field Status above to "resolved" and submit). Thanks to everyone involved in this. It'll be nice to have these debugging features available in the future.
msg9737 (view) Author: malte Date: 2020-10-20.17:18:51
Sounds good to me.
msg9736 (view) Author: rik Date: 2020-10-20.17:17:35
We have completed what this time really should be the last round of code 
review and I implemented some minor stylistic changes suggested by Florian and 
Malte. The code should now be ready to merge.
msg9735 (view) Author: florian Date: 2020-10-20.11:31:53
We discussed this live and decided not to care about the change in memory usage. Malte had some additional comments on the pull request (I added them in GitHub). After they are done and reviewed by Jendrik or me, we can merge.
msg9734 (view) Author: jendrik Date: 2020-10-20.10:46:43
I agree.
msg9733 (view) Author: rik Date: 2020-10-20.10:37:29
I have also tried to replicate memory issues by running the same 
configuration locally. The memory usage was not consistent with the memory 
usage on the grid. This goes for the new as well as the base version. 
However, the memory usage on the grid as well as locally does seem to be 
deterministic. I used version 0.107.9 of Osi and 3.1.1 of Soplex on both 
machines.

Florian and I are not quite sure what to make of this, but feel like the 
difference is minor enough and does not noticeably skew in one direction 
that this could be merged regardless. Jendrik, would you agree to this?
msg9725 (view) Author: rik Date: 2020-09-29.11:43:12
I have rerun the experiments after incorporating all of the changes and the 
results look much the same. Evaluations and expansions are the same for the 
base and v3 version. Coverage once again went up slightly. Memory usage and 
search time show the same patterns as in the previous experiment.

As announced, I ran a copy of the diverse potentials heuristic with soplex 
and compared it with itself. The search time shows the same phenomenon where 
most fall within 5% but a few instances are either almost twice as fast or 
twice as slow, so this seems to have some cause that does not relate to this 
issue. The memory usage is very consistent between the two copies however. 
It seems like the variability is caused by the changes in this issue. The 
versions with cplex as an lp solver does not show this variability in memory 
usage, so I imagine it's something subtle that influences the memory usage 
of soplex.

The full results can be found in issue960-v3-experiments.zip
msg9724 (view) Author: florian Date: 2020-09-22.18:40:02
I did another iteration through the changes and noticed a few more small things. I also saw a few comments by Jendrik that were still open. I think in the "conversation" tab they are hidden inside a "more comments" block. In the "Files changed" tab everything shows up as expected.

Once you resolved all open conversations, feel free to just run the experiments. No need to wait for an OK from us.
msg9723 (view) Author: rik Date: 2020-09-22.17:05:28
Thanks for the review Jendrik. I have addressed everything you brought up and 
run all style checks. If you and Florian sign off on this version, I can rerun 
the experiments to make sure this has not affected performance.
I will also run the soplex variant of the potential heuristic twice to see if 
the variation in memory usage we see there is really just noise.
msg9719 (view) Author: jendrik Date: 2020-09-22.10:19:30
Ah, I forgot about that incompatibility. Tieing the names to the log level sounds like a good idea.
msg9718 (view) Author: florian Date: 2020-09-22.00:20:42
Thanks Jendrik. I don't think having names in debug mode would work because LP builds are incompatible with the debug build since the time we added the _GLIBCXX_DEBUG flag (the LP libraries are compiled without this flag which makes them incompatible on a binary level). We could tie it to the log-level in specific LP instances and have names only in non-silent log levels. But this shouldn't affect the current code.
msg9717 (view) Author: jendrik Date: 2020-09-21.23:54:42
I reviewed the code and left some comments.

On a more general note: it seems the current design assumes that users name LP variables and constraints only when they're debugging something and remove the names afterwards. I think we should encourage users to leave the names in permanently, but only in debug builds. On never knows when it's time for debugging again and this also helps documenting the LP. I don't think any code changes are necessary for this, but we could rephrase the comment that explicitly warns against using names in "production code". I think we should also add names (in debug mode) for at least one of the files that uses the LP code.
msg9716 (view) Author: florian Date: 2020-09-21.17:32:17
The memory usage looks fine for all configurations except for the SoPlex variant of potential heuristics. There, the results are a bit surprising: there are some tasks (mostly freecell and spider) where the memory usage increases a lot (e.g., 148MB -> 194MB) although the number of expansions is the same. At the same time, there are tasks where the usage decreases and there does not seem to be a trend in one direction or the other. I would have expected that the potential heuristic found in the two runs might be different but since the evaluations are the same, this cannot be the cause.

The timing looks very similar for all configurations: most tasks behave as expected but then some tasks vary a lot stronger than what I would expect: their factors are in the range of 0.5 - 2.0 instead of the range of 0.95 - 1.05. Again, there are cases with positive and negative impact and no clear trend. I cannot explain the higher variance. Did someone observe this in other experiments?

Augusto, do you know what's up with the translator segfaults in organic synthesis? Anyway, since the patch does not touch the translator, it seems to be unrelated.

Despite the strange variations, I think we could merge the patch if we don't spot anything in the review as the results do not seem systematically worse. I already reviewed the code. Jendrik, could you do a second iteration?
msg9715 (view) Author: rik Date: 2020-09-21.16:33:09
The pull request has now moved to github. Here is the new link:
https://github.com/aibasel/downward/pull/12

I have run experiments to compare the performance of the new implementation 
of the LP wrapper class to the old one when 
the debugging functionality is not used. I tested it with three different 
heuristics which utilize lp solvers both with 
cplex and soplex as an lp solver. There seems to be no meaningful impact on 
performance. Coverage actually went up 
slightly (this is definitely a coincidence). The number of evaluations and 
expansions is identical for all domains and 
configurations, so we can be reasonably confident that this change did not 
affect the calculated heuristic values.

There seems to be no significant impact on search time. For any given 
configuration, the score_total_time is at most 0.1% 
lower and for some configurations the summed score was slightly higher. The 
relative scatter plots show that the vast 
majority of problems differed by less than 15% in terms of total time. The 
plots indicate that this is down to noise. 
There is a strange phenomenon which causes a significant minority of 
problems to either be about 1.8x faster or slower 
across all configurations, domains, and problem sizes.

There is a small but measurable impact on memory usage. Every configurations 
sees an increase in memory use on the order 
of 0.1%. The scatter plots show that there are very few outliers with. Only 
the diverse potentials heuristic with soplex 
as an lp solver contains runs where the memory usage differed by more than 
5% and all of those problems were rather 
small. This configurations also saw some runs where the new version used 
more than 10% less memory.

The experiments contain 20 unexplained errors (out of over 21,000 runs), but 
none of these relate to this issue in any 
way. 12 of these errors were due to the translator segfaulting while 
translating p11, p12 or p18 of the organic-
synthesis-opt18-strips domain and the other 8 were down to the planner 
producing too much output when solving p12 and p14 
of the parcprinter-08-strips domain when using the lmcount heuristic.

The full tables and charts can be found in issue960-experiments.zip.
msg9211 (view) Author: rik Date: 2020-03-12.17:14:51
I have added a few more commits to the pull request to address the issues 
brought up in the way Florian and I agreed upon.

Here is the link once again: https://bitbucket.org/rikdegraaff/fast-downward-
issues/pull-requests/1/issue960/diff
msg9204 (view) Author: florian Date: 2020-02-27.16:25:56
Since we offering more information on the status of the LP, we should also add methods to test for infeasible and unbounded LPs. There is also a try-catch missing around create_lp_solver that we should add now. Additionally, adding methods to dump a constraint, a variable, a set of constraints, a set of variables, or a full LP, would be helpful. And finally, there is a CPLEX error that can sneak in as an unexplained error when CPLEX runs out of memory in the wrong place, we can fix that, too.

I'll attach patches for these things but please don't take the code for dumping the constraints verbatim. It should be a function that also takes a double ("infinity") and only print bounds if they are not infinite.
msg9200 (view) Author: florian Date: 2020-02-14.17:09:00
I think adapting the operator-counting framework and the other places where we currently use LPs should be straight-forward.

Actually, we might want to take this idea a step further: instead of two classes LPVariables and LPConstraints, we could have a single class LinearProgram that contains all required data. We could generate proxies for variables and constraints from it that in turn have the necessary getters and setters. We discussed something in this direction a while ago where we actually wanted to make these objects smarter, so you could for example change a bound on a constraint object and then re-solve the LP and the classes would keep track of the required changes internally. I think all of this would be too much to do in this issue (if we want to do it, we should create a new issue for it) but adding a LinearProgram with proxies for variables and constraints would be a step in this direction.

Rik, to see an example of what I mean, you can have a look at the classes TaskProxy, VariablesProxy, and VariableProxy. We can also discuss this live if you prefer.
msg9199 (view) Author: rik Date: 2020-02-14.14:09:57
Adding the LPVariables and LPConstraints classes seems like the best option to me. It does lose some flexibility over simply having a vector of something. Considering that I don't think what I'm working on in the context of time-unrolling (which generates and modifies an LP in multiple phases) would require too much of a rewrite to be compatible with this system, I suppose the benefits outweigh the costs. Do you think adapting the operator counting framework would present a significant challenge, or would that be very straightforward?

For posterity's sake, the solution I suggested was adding a separate, overloaded load_problem() function which accepts vector<NamedLPVariable>. The upside is that this has no impact on existing code and is incredibly easy to implement. The downside is that switching from debugging to not using debugging would be cumbersome because you need to adapt any function which touches the variabkes and constraints.
msg9191 (view) Author: florian Date: 2020-02-08.13:21:18
Ahh, of course another option would be to hide all this code with a preprocessor switch that is off by default and you'd just enable it for debugging. But so far, we mostly avoided this. Something like this:

class LPVariable {
    // ...
#ifdef LP_DEBUG
    string name;
#endif
}
msg9190 (view) Author: florian Date: 2020-02-08.13:17:19
I thought the size of empty string would be smaller but it looks like it would roughly double the size of LPVariable and LPConstraint in gcc > 5. (Interesting article, by the way: https://shaharmike.com/cpp/std-string/). We could reduce the overhead to a single pointer by using a unique_ptr<string> but it would still be an overhead.

We could also go for functions in the LPSolver class that forward the setRowName and setColName functions. The disadvantage I see there is that these functions only make sense after loading the LP into the solver and our code often splits the generation of constraints from the loading. So the code that generates the constraints could not easily set the names for the variables and names it creates.

Maybe another option would be to generate parallel structures LPVariableNames and LPConstraintNames similar to the way we generate names for enum options in the option parser. As for the option before, this would mean that existing code is not easy to adapt because we would have to modify the types of data passed between the class that generates LP variables and constraints and the class that uses them.


The main use case I have in mind is when we already have code like this:

vector<LPVariable> variables = generate_variables(task);
vector<LPConstraint> constraints = constraint_generator.generate(task)
lp_solver.load(variables, constraints);

and now want to add names inside the methods generate_variables and constraint_generator.generate for debugging. Without adding a field to LPVariable and LPConstraint, we'd have to modify the interface if we want to debug this code. How about adding a new class like this:

class LPVariables {
    vector<LPVariable> variables;
    vector<string> names;
public:
    add_variable(double lower_bound, double upper_bound, double objective, string name = "");
}

and replacing the existing cases of vector<LPVariable> with it (and analogously for constraints of course)?
msg9189 (view) Author: malte Date: 2020-02-07.17:49:46
Makes sense. Feel free to involve me in the discussion if you cannot find something that makes you sufficiently happy, but only then. (Much on the plate at the moment.)
msg9188 (view) Author: rik Date: 2020-02-07.17:47:57
I initially implemented a version which had no impact when it is not used, but Florian found it too cumbersome; perhaps I could try and find a different approach which offers the best of both worlds with him?
msg9187 (view) Author: malte Date: 2020-02-07.17:43:46
I am not sure we want to pay for storing a string with every constraint whether or not we use the feature. Usually we try to make debug features more lightweight.
msg9186 (view) Author: rik Date: 2020-02-07.17:36:23
I have created a pull request for this issue: https://bitbucket.org/rikdegraaff/fast-downward-issues/pull-requests/1/issue960/diff
msg9178 (view) Author: rik Date: 2020-01-24.16:59:07
We want to add the ability to assign custom variable and constraint names and write out the LP to analyse errors.
History
Date User Action Args
2020-10-23 14:07:30jendriksetmessages: + msg9744
2020-10-22 15:29:35floriansetstatus: in-progress -> resolved
messages: + msg9743
2020-10-22 14:32:36riksetmessages: + msg9742
2020-10-22 14:25:04floriansetmessages: + msg9741
2020-10-20 21:44:44floriansetmessages: + msg9740
2020-10-20 21:43:40maltesetmessages: + msg9739
2020-10-20 21:42:56floriansetmessages: + msg9738
2020-10-20 17:18:51maltesetmessages: + msg9737
2020-10-20 17:17:35riksetmessages: + msg9736
2020-10-20 11:31:53floriansetmessages: + msg9735
2020-10-20 10:46:43jendriksetmessages: + msg9734
2020-10-20 10:37:30riksetmessages: + msg9733
2020-09-29 11:43:14riksetfiles: + issue960-v3-experiments.zip
messages: + msg9725
2020-09-22 18:40:02floriansetmessages: + msg9724
2020-09-22 17:05:28riksetmessages: + msg9723
2020-09-22 10:19:30jendriksetmessages: + msg9719
2020-09-22 00:20:42floriansetmessages: + msg9718
2020-09-21 23:54:42jendriksetmessages: + msg9717
2020-09-21 17:32:17floriansetmessages: + msg9716
2020-09-21 16:33:09riksetmessages: + msg9715
summary: The pull request has now moved to github. Here is the new link: https://github.com/aibasel/downward/pull/12 I have run experiments to compare the performance of the new implementation of the LP wrapper class to the old one when the debugging functionality is not used. I tested it with three different heuristics which utilize lp solvers both with cplex and soplex as an lp solver. There seems to be no meaningful impact on performance. Coverage actually went up slightly (this is definitely a coincidence). The number of evaluations and expansions is identical for all domains and configurations, so we can be reasonably confident that this change did not affect the calculated heuristic values. There seems to be no significant impact on search time. For any given configuration, the score_total_time is at most 0.1% lower and for some configurations the summed score was slightly higher. The relative scatter plots show that the vast majority of problems differed by less than 15% in terms of total time. The plots indicate that this is down to noise. There is a strange phenomenon which causes a significant minority of problems to either be about 1.8x faster or slower across all configurations, domains, and problem sizes. There is a small but measurable impact on memory usage. Every configurations sees an increase in memory use on the order of 0.1%. The scatter plots show that there are very few outliers with. Only the diverse potentials heuristic with soplex as an lp solver contains runs where the memory usage differed by more than 5% and all of those problems were rather small. This configurations also saw some runs where the new version used more than 10% less memory. The experiments contain 20 unexplained errors (out of over 21,000 runs), but none of these relate to this issue in any way. 12 of these errors were due to the translator segfaulting while translating p11, p12 or p18 of the organic- synthesis-opt18-strips domain and the other 8 were down to the planner producing too much output when solving p12 and p14 of the parcprinter-08-strips domain when using the lmcount heuristic. The full tables and charts can be found in issue960-experiments.zip. ->
2020-09-21 16:32:42riksetfiles: + issue960-experiments.zip
summary: The pull request has now moved to github. Here is the new link: https://github.com/aibasel/downward/pull/12 I have run experiments to compare the performance of the new implementation of the LP wrapper class to the old one when the debugging functionality is not used. I tested it with three different heuristics which utilize lp solvers both with cplex and soplex as an lp solver. There seems to be no meaningful impact on performance. Coverage actually went up slightly (this is definitely a coincidence). The number of evaluations and expansions is identical for all domains and configurations, so we can be reasonably confident that this change did not affect the calculated heuristic values. There seems to be no significant impact on search time. For any given configuration, the score_total_time is at most 0.1% lower and for some configurations the summed score was slightly higher. The relative scatter plots show that the vast majority of problems differed by less than 15% in terms of total time. The plots indicate that this is down to noise. There is a strange phenomenon which causes a significant minority of problems to either be about 1.8x faster or slower across all configurations, domains, and problem sizes. There is a small but measurable impact on memory usage. Every configurations sees an increase in memory use on the order of 0.1%. The scatter plots show that there are very few outliers with. Only the diverse potentials heuristic with soplex as an lp solver contains runs where the memory usage differed by more than 5% and all of those problems were rather small. This configurations also saw some runs where the new version used more than 10% less memory. The experiments contain 20 unexplained errors (out of over 21,000 runs), but none of these relate to this issue in any way. 12 of these errors were due to the translator segfaulting while translating p11, p12 or p18 of the organic- synthesis-opt18-strips domain and the other 8 were down to the planner producing too much output when solving p12 and p14 of the parcprinter-08-strips domain when using the lmcount heuristic. The full tables and charts can be found in issue960-experiments.zip.
2020-04-15 17:17:23augustosetnosy: + augusto
2020-03-12 17:14:51riksetmessages: + msg9211
2020-02-27 16:26:07floriansetfiles: + lp_internals.cc.diff
2020-02-27 16:25:57floriansetfiles: + lp_solver.cc.diff
messages: + msg9204
2020-02-14 17:09:00floriansetmessages: + msg9200
2020-02-14 14:09:57riksetmessages: + msg9199
2020-02-08 13:21:18floriansetmessages: + msg9191
2020-02-08 13:17:20floriansetmessages: + msg9190
2020-02-07 17:49:46maltesetmessages: + msg9189
2020-02-07 17:47:57riksetmessages: + msg9188
2020-02-07 17:43:47maltesetmessages: + msg9187
2020-02-07 17:36:23riksetmessages: + msg9186
2020-01-27 13:38:07riksetstatus: unread -> in-progress
2020-01-24 16:59:07rikcreate