Title Replace OSI with our own interface to LP/MIP solvers
Priority feature Status chatting
Superseder Nosy List florian, gabi, jendrik, malte, silvan, thomas
Assigned To florian Keywords
Optional summary
Pull Request:

Related issues:
* issue926
* issue972
* issue1010
* issue1047

Created on 2023-01-27.21:08:30 by florian, last changed by florian.

Pull Request:

Related issues:
* issue926
* issue972
* issue1010
* issue1047
msg11036 (view) Author: florian Date: 2023-03-13.17:36:09
> perhaps we could also ask Domenico for his thoughts?

I was just writing to him about this, when I noticed that this discussion was out of date, so the last update was just to bring everything up to date before sending a link. :-)
msg11034 (view) Author: malte Date: 2023-03-13.17:08:08
Hi Florian, thanks for the update, this in encouraging progress.
Universally good news would perhaps have been too much to ask for. ;-)

If we stop being able to make progress at some point where it's still mixed news, perhaps we could also ask Domenico for his thoughts?
msg11033 (view) Author: florian Date: 2023-03-13.16:57:53
The issue in msg11010 is now fixed (it turns out OSI handled this case wrong but not in a way that affected us).

I also ran some experiments which can be found here (ignore v1 and v2, they still had some bugs):

Sorry for the bad names:
 * c2a93e68d is the commit on the main branch, using OSI
 * 7a11a0fc is on the issue branch and uses CPLEX directly

There are a lot of unexplained errors but all of them are out of memory errors that are correctly recognized as oom errors. I suspect lab lists them as unexplained because of the output on stderr, where we print "Ran out of memory while calling CPLEX.".

I'll go config by config here as the change impacts them very differently:

# h^+ (as an operator-counting MIP)

Performance in h^+ improves drastically, with only positive effects and a total of 85 additional tasks solved.
Most of the additional tasks previously ran out of time, some out of memory. The number of evaluations/expansions stays the same, so this is indeed the same heuristic, just computed faster. I'm not sure exactly where this change comes from, but suspect the default MIP strategy of CPLEX is just better than the parameters set in OSI by default.

# Operator-Counting LP with h^SEQ and LM-cut constraints

Evaluations/Expansions are also the same and we also solve more tasks, but only 5 more. Time scores go up by about 5 and memory score by 9. Total time varies with factors between 0.4 and 1.5 in extreme cases but as the scores indicate, there is a positive trend and solving is faster on average. Memory is saved in almost all cases, using between 95% and 99% of the memory used by the baseline.

# Cost partitioning over Landmarks (still called lmcount in this revision, optimal/admissible version)

Evaluations/Expansions are also the same but here we lose 7 tasks and get 7 additional timeouts. As above, we also reduce memory usage by 1-5%, so the lost tasks are due to time. Two of these are duplicates (2011 and 2008 version of a task in pegsol and parcprinter), so I'd treat this as 5 tasks lost. The tasks in the baseline all take above 250s out of the 300s time limit, the worst case is in pegsol11/p14 with 255s. The overall and search time scores drop by about 2.75. There clearly is something slowing down here and the fact that this happens here and not in the operator-counting heuristics hints at that the problem is somehow with loading in the problem (happens once for operator-counting and once per state for lmcount).

# Diverse Potential Heuristics

This is somewhat harder to analyze because multiple optimal solutions to the LP can exist and each one will be a different potential heuristic that might be better or worse. Here the number of expansions and evaluations varies, so we get different heuristics that then also take up a different amount of time and memory. Over all, we lose 11 tasks.

The scores are better in some domains, worse in others, and of course tasks that ran out of time or memory while trying to find a potential function get a score of -1. Overall the time score is worse by 3 (evaluations) and 3.4 (expansions). From scrolling over the scores, the positive and negative scores look equally distributed except for the ones influenced by -1, so I don't think we get systematically works heuristics.

While we do save on memory on average (score +1.91), the time scores go down and we get 28 additional timeouts and 17 fewer memouts. The total time score drops by 10.31 while the search time score drops by 6.09. This makes sense as the LP-related stuff happens before the search starts for this configuration. With diverse potential heuristics, we also solve multiple LPs.

Since this configuration is non-deterministic in a way, I suggest we start with looking into the performance of lmcount and run potential heuristics again once we fix something there.
msg11010 (view) Author: florian Date: 2023-02-10.17:09:10
There is also an open issue in the CPLEX interface:
When setting up constraints in our code, we use lower and upper bounds, while CPLEX uses a different representation internally (right-hand side, sense, and range). In CPLEX's model, we cannot specify an unsatisfiable constraint like
  2 <= x <= 1
Of course we could catch such constraints before we pass them to CPLEX but if we then later change the bounds on the constraint (e.g., change the upper bound in the example above to 4), the constraint and the old bounds can become relevant again. We still have to figure out how OSI handled this case.
msg11007 (view) Author: florian Date: 2023-02-10.16:58:32
We started work on the new interface and I added a link to a pull request to the summary. Currently, the code is not ready for review yet. The general setup for the interface is done, the CMake scripts are adapted, and there is an initial version of the CPLEX interface. In this version, we assumed that CPLEX copies the data passed in to the interface which is not the case for all library functions. The next step is to find out which library calls copy the data and which ones do not. The SoPlex interface is not implemented yet.
msg10962 (view) Author: malte Date: 2023-02-02.15:54:41
Thanks! I think that's sufficient evidence that we can focus the new interface on Soplex and CPLEX only.
msg10961 (view) Author: florian Date: 2023-02-02.13:05:45
I installed Gurobi locally and compiled Fast Downward with it. I took some shortcuts to get this working (e.g., the CMake find scripts I wrote will not work on other systems), but it is sufficient for a local test.

I ran 4 configs with two tasks each on my laptop and measured total time, except for potential heuristics, where I measured the time to initialize the heuristic instead because the search itself doesn't use the LP solver. Mostly CPLEX was faster but there also was a case where Gurobi was faster and for the most part, the times seem to be in the same ball park. An exception is the potential heuristic on Airport where Gurobi is much slower.

                                 CPLEX      Gurobi
A*/h^SEQ                        171.31s    166.90s    miconic/s9-0.pddl
  (total time)                  361.51s    424.01s    termes-opt18-strips/p03.pddl

A*/div potentials                32.42s     95.99s    satellite/p30-HC-pfile10.pddl
  (time to initialize heuristic)  8.07s    200.56s    airport/p35-airport4halfMUC-p12.pddl

A*/opcount, pho, lmcut           31.26s     60.05s    woodworking-opt11-strips/p05.pddl
  (total time)                   24.76s     29.16s    freecell/p02.pddl

A*/h^+                           20.96s     30.92s    blocks/probBLOCKS-6-2.pddl
  (total time)                    7.33s      7.41s    depot/p01.pddl

All in all, I'd say Gurobi is still interesting to try out in a full experiment eventually but we do not have to support it as part of this issue.
msg10959 (view) Author: malte Date: 2023-02-01.23:41:58
I would say more advanced changes of the interface should not be part of this issue. I think lpsolve is an attractive nuisance that it might be good to explicitly *not* support -- otherwise there is a risk of the "pyperplan effect" where we see lots of people using lpsolve (and getting terrible runtime) because it's easier to set up than the alternatives.

As discussed, I think it would be nice to invest a bit of work in finding out how Gurobi would perform before we rip it out because of the strong performance numbers and because we currently claim to support it in our documentation. But if it turns out to be too much trouble, it shouldn't keep up this issue.

Regarding versions, upgrading to more recent versions sounds good to me; I don't know if there is a good reason to write code for an old version with the immediate plan of changing it again to move to a new version.
msg10949 (view) Author: florian Date: 2023-01-27.21:08:30
We want to write our own interface to drop the dependency on OSI. The first step
is to figure out the scope of this issue. Since we already have an interface to
OSI on our end, I propose to keep using this as-is and only changing how it
works internally, i.e., talk directly to the solvers rather than talking to OSI.

There are more advanced options that we also previously discussed. For example,
we could make the classes for LPConstraint and LPVariable smarter, so, for
example, bounds can be updated on them. I would see this and any other change
that would require changing the calling code as future work for now. This
probably means we have to copy the data into the solvers but we do so currently
as well.

Regarding supported solvers, I would aim for SoPlex and CPLEX for now with the
option of supporting Gurobi in the future. In recent tests I saw, Gurobi was
fantanstic specifically for LPs. We also have a module on teh grid for Gurobi
with a license that allows us to run experiments (the normal academic license is
limited to one parallel instance). So Gurobi is worth trying out but if we can
support SoPlex and CPLEX, I'd say that is good enough for this issue. Regarding
other solvers, maybe lpsolve is interesting because it could be interesting
because it can be installed through apt. I would like to keep this issue small,
though, so I suggest we limit it to the solver we currently use. I don't think
it makes sense to drop either of them from this issue, as we want CPLEX for
experiments and SoPlex for release containers.

There is another question about which versions of the solvers we want to support
and how we want to deal with multiple versions in the interface. For
comparability to the current code, it might make sense to support the same
versions that we currently use (although, we support multiple ones, so this is
not completely clear). On the other hand, it would be nice to support more
recent versions. SoPlex specifically had a API-breaking change after the version
we currently use, so we'd probably need changes in the interface (or two
interfaces) if we want support for both new and old versions. I guess we could
generally have one interface per solver and handle version differences with
#ifdefs but I'm not sure if there is a better way of handling this.

Finally, there are also a couple of issues that are related and we should update or fix on the fly:
* issue926 and issue972 discuss support for Gurobi and how to use it on the grid
* issue1010 is about switching to newer versions of SoPlex and CPLEX
* issue1047 is about exiting cleanly when using SoPlex as a MIP solver
Date User Action Args
2023-03-13 17:36:09floriansetmessages: + msg11036
2023-03-13 17:08:08maltesetmessages: + msg11034
2023-03-13 16:57:53floriansetmessages: + msg11033
2023-02-10 17:09:10floriansetmessages: + msg11010
2023-02-10 16:58:32floriansetmessages: + msg11007
summary: Related issues: * issue926 * issue972 * issue1010 * issue1047 -> Pull Request: * Related issues: * issue926 * issue972 * issue1010 * issue1047
2023-02-02 15:54:41maltesetmessages: + msg10962
2023-02-02 13:05:45floriansetmessages: + msg10961
2023-02-01 23:41:58maltesetmessages: + msg10959
2023-01-30 14:49:18gabisetnosy: + gabi
2023-01-30 07:55:35thomassetnosy: + thomas
2023-01-27 21:08:30floriancreate