Issue1120

Title parser chokes on newlines on Windows
Priority bug Status resolved
Superseder Nosy List florian, jendrik, malte
Assigned To Keywords
Optional summary
Bug demo: https://github.com/aibasel/downward/pull/189
PR: https://github.com/aibasel/downward/pull/191

Created on 2023-10-05.10:21:36 by jendrik, last changed by florian.

Summary
Bug demo: https://github.com/aibasel/downward/pull/189
PR: https://github.com/aibasel/downward/pull/191
Messages
msg11435 (view) Author: florian Date: 2023-10-06.11:34:59
Jendrik and I agreed on the fix and merged the change.
msg11433 (view) Author: florian Date: 2023-10-05.16:34:46
Looks like match_continuous does what we want:

> Only match a sub-sequence that begins at first. 
(https://en.cppreference.com/w/cpp/regex/match_flag_type)

I added a PR to include this flag.
https://github.com/aibasel/downward/pull/191
msg11432 (view) Author: florian Date: 2023-10-05.15:45:17
I was able to reproduce this on Windows and added a few print statements to see what is going on. It looks like multi-line regex matching is treated differently on the systems.

To split something like "foo()" into tokens, we repeatedly match regular expressions against the start of the string. Simplified for this case, we have

  TokenType::OpenParen  with regex "^\s*\("
  TokenType::CloseParen with regex "^\s*\)"
  TokenType::Identifier with regex "^\s*\w+"

Without newlines in the string, the first two do not match and the third one does, so we recognize the token Identifier("foo"), advance the start and have "()" left over to parse. On that part, only the first regex matches and we recognize an OpenParen token, and so on.

Now say there is a newline in the string "foo\n()". On Windows, this means that the regex "^\s*\(" matches the second line, so an OpenParen token is recognized and the string is advanced, leaving "oo\n()".

We would like to have "^" only match the start of the string not the start of the line. This seems to be the behavior on Linux otherwise we would have seen such issues already. But it is not guaranteed by the standard (https://en.cppreference.com/w/cpp/regex/ecmascript). It says:

> The assertion ^ (beginning of line) matches
> 1) The position that immediately follows a LineTerminator character
> (this is only guaranteed if std::regex_constants::multiline(C++ only) is enabled)
> 2) The beginning of the input (unless std::regex_constants::match_not_bol(C++ only) is enabled)

So using the flag std::regex_constants::multiline guarantees the multi-line behavior but not using it does not guarantee the opposite behavior.

There also is match_not_bol which sounds like what we want initially, but it means that the first character in the string does not count as a match to "^".
msg11430 (view) Author: jendrik Date: 2023-10-05.12:01:06
The problem also occurs for configurations without predefinitions, so the rewriting is not the problem.
msg11429 (view) Author: florian Date: 2023-10-05.11:31:14
Strange, it completely messes up the tokens. I have no idea where the lists are coming from.

From the demo, the full call string is

  --evaluator "lmc=landmark_cost_partitioning(lm_merged([lm_rhw(),lm_hm(m=1)]))"
  --search "astar(lmc,lazy_evaluator=lmc)"

This is an old-style predefinition that should internally be re-written to

  "let(lmc, landmark_cost_partitioning(lm_merged([lm_rhw(),lm_hm(m=1)])), astar(lmc,lazy_evaluator=lmc))"

I suspect that this rewriting already is the problem. Can you try with the let expression instead?

        "bjolp": [
            "--search",
            """let(lmc, landmark_cost_partitioning(lm_merged(
                [lm_rhw(),lm_hm(m=1)])), astar(lmc,lazy_evaluator=lmc))"""],

Maybe also put some output into command_line.cc so we see what replace_old_style_predefinitions is doing.
msg11428 (view) Author: jendrik Date: 2023-10-05.10:21:36
If one uses a multiline string to define (even parts of) a configuration in Python such as 

"""lmc=landmark_cost_partitioning(lm_merged(
    [lm_rhw(),lm_hm(m=1)]))"""

the Fast Downward parser fails to parse it on Windows:

Traceback:

  Start Syntactical Parsing: [[([[lm_rhw(),lm_hm ...

  -> Identify node type: [[([[lm_rhw(),lm_hm ...

  -> Parsing List: [[([[lm_rhw(),lm_hm ...

  -> Parsing list arguments: [([[lm_rhw(),lm_hm( ...

  -> Parsing sequence: [([[lm_rhw(),lm_hm( ...

  -> Parsing 1. argument: [([[lm_rhw(),lm_hm( ...

  -> Identify node type: [([[lm_rhw(),lm_hm( ...

  -> Parsing List: [([[lm_rhw(),lm_hm( ...

  -> Parsing list arguments: ([[lm_rhw(),lm_hm(m ...

  -> Parsing sequence: ([[lm_rhw(),lm_hm(m ...

  -> Parsing 1. argument: ([[lm_rhw(),lm_hm(m ...

  -> Identify node type: ([[lm_rhw(),lm_hm(m ...



[[([[lm_rhw(),lm_hm(m=1)])),astar(lmc,lazy_evaluator=lmc))

  ^

Unexpected token '<Type: '(', Value: '('>'. Expected any of the following token types: Let, Identifier, Boolean, Integer, Float, [

Usage error occurred.


My guess is that we need to explicitly handle Windows line breaks.
History
Date User Action Args
2023-10-06 11:34:59floriansetstatus: chatting -> resolved
messages: + msg11435
2023-10-05 16:34:46floriansetmessages: + msg11433
summary: Bug demo: https://github.com/aibasel/downward/pull/189 -> Bug demo: https://github.com/aibasel/downward/pull/189 PR: https://github.com/aibasel/downward/pull/191
2023-10-05 15:45:17floriansetmessages: + msg11432
2023-10-05 12:01:06jendriksetmessages: + msg11430
2023-10-05 11:31:14floriansetstatus: unread -> chatting
nosy: + florian
messages: + msg11429
2023-10-05 10:21:36jendrikcreate