Issue1115

Title Undefined behaviour/bug in StateRegistry::get_successor_state
Priority bug Status resolved
Superseder Nosy List clemens, jendrik, malte, silvan, tkloessner
Assigned To silvan Keywords
Optional summary
I noticed some weird behaviour today when I wanted to implement some automated tests for our student projects in Fast Downward. Consider the following:

StateRegistry state_registry = ...;
OperatorID op1 = ...;
OperatorID op2 = ...;
OperatorID op3 = ...;

State state1 = registry.get_initial_state();
State state2 = state_registry.get_successor_state(state1, op1); // op1 is a self-loop: state1 == state2, no new state registered

assert(state1 == state2); // Assertion passes.

State state3 = state_registry.get_successor_state(state2, op2); // state3 is new

assert(state1 == state2); // Assertion fails.

Now, I don't think that should happen given that StateRegistry::get_successor_state(const State&, OperatorProxy) takes state2 by const reference. I also don't see a misuse of the StateRegistry here beyond its intended prupose. The issue seems to be triggered by the following lines:

----
state_data_pool.push_back(predecessor.get_buffer()); // Copies buffer of source state
PackedStateBin* buffer = state_data_pool[state_data_pool.size() - 1]; // Get copied buffer sitting at the end of state_data_pool
...
StateID id = insert_id_or_pop_state(); // Removes the buffer created above from state_data_pool if the state is a duplicate
return task_proxy.create_state(*this, id, buffer); // buffer is dangling if the state was a duplicate, accessing it is undefined behaviour
----

In the above example, state3 and state2 will point to the same buffer after insertion of state3 into the registry. The other methods seem to explicitly use  lookup_state(id) to avoid this behaviour, so that state2 shares the buffer with state1, as they represent the same state.

Created on 2023-09-18.18:59:47 by tkloessner, last changed by silvan.

Summary
I noticed some weird behaviour today when I wanted to implement some automated tests for our student projects in Fast Downward. Consider the following:

StateRegistry state_registry = ...;
OperatorID op1 = ...;
OperatorID op2 = ...;
OperatorID op3 = ...;

State state1 = registry.get_initial_state();
State state2 = state_registry.get_successor_state(state1, op1); // op1 is a self-loop: state1 == state2, no new state registered

assert(state1 == state2); // Assertion passes.

State state3 = state_registry.get_successor_state(state2, op2); // state3 is new

assert(state1 == state2); // Assertion fails.

Now, I don't think that should happen given that StateRegistry::get_successor_state(const State&, OperatorProxy) takes state2 by const reference. I also don't see a misuse of the StateRegistry here beyond its intended prupose. The issue seems to be triggered by the following lines:

----
state_data_pool.push_back(predecessor.get_buffer()); // Copies buffer of source state
PackedStateBin* buffer = state_data_pool[state_data_pool.size() - 1]; // Get copied buffer sitting at the end of state_data_pool
...
StateID id = insert_id_or_pop_state(); // Removes the buffer created above from state_data_pool if the state is a duplicate
return task_proxy.create_state(*this, id, buffer); // buffer is dangling if the state was a duplicate, accessing it is undefined behaviour
----

In the above example, state3 and state2 will point to the same buffer after insertion of state3 into the registry. The other methods seem to explicitly use  lookup_state(id) to avoid this behaviour, so that state2 shares the buffer with state1, as they represent the same state.
Messages
msg11506 (view) Author: silvan Date: 2024-01-08.12:12:59
Merged.
msg11452 (view) Author: silvan Date: 2023-10-13.11:13:44
Results for blind search look good: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1115/data/issue1115-v1-eval/issue1115-v1-issue1115-base-issue1115-v1-compare.html

Anyone else besides Florian wants to have a look at the PR?
msg11427 (view) Author: silvan Date: 2023-10-05.09:22:05
Here is a pull request: https://github.com/aibasel/downward/pull/190

I don't think reassigning the pointer to the right place should introduce any overhead, but I can still set up an experiment if you prefer. If so, what configurations should it include?
msg11414 (view) Author: malte Date: 2023-10-03.15:43:31
I agree that fixing the bug should take priority over more cleanup, but I'd also want to leave this decision up to the person that ends up doing the actual work.

Silvan, it looks like you already know how to fix this, but this is a delicate part of the code, so I think a pull request and code review would be good. Is this performance-sensitive?
msg11413 (view) Author: clemens Date: 2023-10-03.15:20:27
Since this is a bug, I'd personally do the easy fix right away. If we want to discuss a larger refactoring, we can still do so (doesn't have to be a separate issue). My main point is: I'd like to avoid that fixing the bug is put on hold until the discussion of the refactoring comes to a conclusion, because that may take some time when there's no sprint waiting around the corner.
msg11409 (view) Author: silvan Date: 2023-10-03.12:08:35
I looked into this a bit. Thorsten's description and explanation is right, the problem is that insert_id_or_pop_state() invalidates buffer. The easiest solution would be to simply reassign buffer before constructing the state:

StateID id = insert_id_or_pop_state();
buffer = state_data_pool[id.value];
return task_proxy.create_state(*this, id, buffer);

In the non-axiom case, one could simply use lookup_state() which does exactly the last two lines above. For the axiom case, we need to move values() to create_state and therefore the same solution doesn't work.

I think that a cleaner solution would try to avoid pushing to state_data_pool outside the method insert_id_or_pop_state(), which is currently done in get_successor_state(). It would be better that insert_id_or_pop_state() is the only method responsible for modifying state_data_pool. Furthermore, there is a TODO at method get_successor_state() stating that it could be nice to have a global function operating on state buffers instead of doing this in the registry. I'm less convinced that this solves the issue because manually copying buffers without using the state_data_pool requires feels a bit hacky. 

So the question is: do we only want to fix the bug? Then this requires changing two lines of code. Or do we want to refactor get_successor() and insert_id_or_pop_state() to prevent similar errors in the future?
msg11386 (view) Author: malte Date: 2023-09-19.11:42:51
From your description that definitely sounds like a bug. I don't have time to look into this personally in the next few weeks, but I've set a reminder after our upcoming deadlines in case nobody else can look into this in the meantime.
msg11385 (view) Author: malte Date: 2023-09-19.11:40:14
Thorsten: I added you to the "nosy list" so that you're copied on messages on this. (I'm surprised this didn't happen automatically.) If you don't want to receive emails for this issue, just remove your name from the list.
History
Date User Action Args
2024-01-08 12:12:59silvansetstatus: reviewing -> resolved
messages: + msg11506
2023-10-13 11:13:44silvansetmessages: + msg11452
2023-10-05 09:22:05silvansetstatus: chatting -> reviewing
assignedto: silvan
messages: + msg11427
2023-10-03 15:43:31maltesetmessages: + msg11414
2023-10-03 15:20:27clemenssetnosy: + clemens
messages: + msg11413
2023-10-03 12:08:35silvansetmessages: + msg11409
2023-09-27 16:27:46silvansetnosy: + silvan
2023-09-19 11:42:51maltesetmessages: + msg11386
2023-09-19 11:40:14maltesetstatus: unread -> chatting
nosy: + malte, jendrik, tkloessner
messages: + msg11385
2023-09-18 18:59:47tkloessnercreate