Issue1146

Title: Validate output.sas
Priority: feature
Status: chatting
Superseder:
Nosy List: davidspeck, florian, gabi, jendrik, malte, tanja
Assigned To:
Keywords:
Optional summary:
related: issue1069 issue458 issue762 issue1039 issue54

Created on 2024-07-12.10:56:38 by tanja, last changed by malte.

Messages
msg11881 (view) Author: malte Date: 2025-07-30.12:19:32
We previously marked this as a meta-issue, but it evolved into a feature issue in which we added validation of the search component input. So I'm changing this to "feature".

After this issue is resolved, we should also make the various definitions of the translator output format / search component input format consistent with each other:

- we have the quite strict definition in src/translate/sas_tasks.py
- we have the public documentation in docs/translator-output-format.md
- we now also have what is checked by the search component in src/search/tasks/root_task.{h,cc}

The plan is to open a separate issue for this.
msg11876 (view) Author: malte Date: 2025-07-25.15:45:27
I ran a new experiment. I have changed conventions and will use "exp" to name experiments from now on, to avoid confusion between experiment numbers and tag numbers. The first four experiments keep their old names, so the experiments are now v1, v2, v3, v4, exp5.

exp5 is a new experiment that tests a new tag, v6. Tag v6 is an evolution of v5 (lambdas using [&] to capture for tracing messages) that removes the coupling with utils/logging and includes some other clean-ups and hack removals.

I didn't rerun all tags this time; I thought looking at a subset of tags would be good enough, especially since I don't expect this to be the last experiment.

Result table:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-exp5-eval/

read_input_time values:
base: 404.95s
v5:   514.53s (+27% compared to base)
v6:   507.78s (+25% compared to base)

description of tags:
base: main branch code with no input validation
v5: lambdas using [&] to capture for tracing messages
v6: cleaned-up version of v5

Observations:

1. The cleaned-up version is a bit faster. I hoped this would happen because we got rid of many cases where we had to pass around a now-unused "Context &". But I'm pleasantly surprised that the effect is as large as it is.
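To make the kind of clean-up concrete, here is a hypothetical before/after sketch. The Context type and function names are illustrative assumptions, not the actual code:

#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical "before" style (v5 and earlier): parsing helpers thread a
// context object through solely so that error messages can mention where
// in the input the failure happened.
struct Context {
    std::string block;  // e.g. "variable section, variable 3"
};

static int parse_int_before(const Context &context, std::istream &in) {
    int value;
    if (!(in >> value))
        throw std::runtime_error("expected integer in " + context.block);
    return value;
}

// Hypothetical "after" style (v6, as described): the enclosing trace block
// attaches the location when the exception propagates, so helpers no
// longer need a Context parameter at all.
static int parse_int_after(std::istream &in) {
    int value;
    if (!(in >> value))
        throw std::runtime_error("expected integer");
    return value;
}

int main() {
    std::istringstream in1("42"), in2("7");
    Context context{"variable section"};
    return parse_int_before(context, in1) + parse_int_after(in2) == 49 ? 0 : 1;
}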

2. All versions are substantially slower than in all previous experiments. For example, base is 7% slower than in the previous experiment v4. This is interesting because the runtime of the same code in experiments v2, v3 and v4 was within 1% of each other.

But I'm not surprised. In v2-v4, nobody else was using these grid nodes at the same time. For this experiment, David ran an experiment simultaneously (and perhaps later other people did, too). So we clearly see the impact of different jobs interfering with each other, which is why we always recommend running all versions whose runtimes we want to compare in the same experiment rather than merging properties files.

Within the same experiment, due to the task order randomization, in expectation the comparison should always be fair even if the situation on the grid changes drastically while the experiment is running. (Although of course that's only really useful if we aggregate over many tasks, and individual runs can very well be strong outliers.)
msg11868 (view) Author: malte Date: 2025-07-24.13:15:02
> I cannot upload the data to our web server at the moment because
> of the grid maintenance

The data has now been uploaded. Direct link to the absolute report:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-v4-eval/issue1146-v4.html
msg11867 (view) Author: malte Date: 2025-07-24.11:19:18
Experiment v4 has finished. It adds two more versions that use lambda expressions to encapsulate the "trace block" behaviour.

I cannot upload the data to our web server at the moment because of the grid maintenance, but here are the read_input_time values by config:

base: 378.15s
v1:  1000.08s (+164% compared to base)
v2:   470.88s (+25% compared to base)
v3:   460.81s (+22% compared to base)
v4:   510.39s (+35% compared to base)
v5:   483.18s (+28% compared to base)

Versions v2-v5 all use a TRACE_BLOCK macro, which they implement differently. This is the only difference between these versions.

v2: try/catch blocks
v3: do not add tracing messages
v4: lambdas using [=] to capture for tracing messages
v5: lambdas using [&] to capture for tracing messages

We are currently favouring v2 or v5:

- v2 is faster and uses simpler language constructs, but there is a lot of code repetition.
- With v5 we still need to verify that capture by reference is sufficient in all these cases. It has a bit of overhead compared to v2, but less repetition. (A sketch of both variants follows below.)
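
For reference, here is a minimal sketch of how the two favoured variants could look. The name TRACE_BLOCK is from the experiments above; the exception type, the message format and the helper run_traced are assumptions for illustration, not the actual implementation:

#include <iostream>
#include <stdexcept>
#include <string>

// Variant in the spirit of v2: the macro expands directly to a try/catch
// block. No call overhead, but the handler code is repeated at every
// use site.
#define TRACE_BLOCK_V2(msg, ...) \
    try { \
        __VA_ARGS__ \
    } catch (const std::runtime_error &e) { \
        throw std::runtime_error(std::string(msg) + "\n" + e.what()); \
    }

// Variant in the spirit of v5: the block is wrapped in a lambda that
// captures by reference ([&]), and a shared helper attaches the trace
// message. (v4 is the same idea with capture by value, [=].)
template<typename F>
void run_traced(const std::string &msg, F &&f) {
    try {
        f();
    } catch (const std::runtime_error &e) {
        throw std::runtime_error(msg + "\n" + e.what());
    }
}

#define TRACE_BLOCK_V5(msg, ...) \
    run_traced(msg, [&]() { __VA_ARGS__ })

int main() {
    try {
        TRACE_BLOCK_V5("while reading the variable section", {
            throw std::runtime_error("unexpected end of file");
        });
    } catch (const std::runtime_error &e) {
        std::cout << e.what() << std::endl;
    }
    return 0;
}

The trade-off matches the numbers above: v2 pays nothing per block but duplicates the handler everywhere, while v5 centralizes it at the cost of one extra call and capture per block.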

We will next ask the larger dev group on the Discord server for their thoughts on this.
msg11861 (view) Author: malte Date: 2025-07-22.19:22:48
The third experiment has also finished (experiment v3 behind David's link). It is like the second experiment, but adds a new configuration v3 as a reference point. v3 is like v2 but removes the "context logging" code completely. This allows us to see how much overhead is produced by the context logging.

HTML report:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-v3-eval/issue1146-v3.html

"read_input_time" values by config:

base: 379.66s
v1:   999.97s (+163% compared to base)
v2:   473.28s (+25% compared to base)
v3:   461.95s (+22% compared to base)

So we see that the try/catch blocks in v2 are quite cheap, and the remaining loss in performance compared to base is caused by other things.


For each code revision that was part of this experiment and the previous one (base, v1, v2), the difference in total time from one experiment to the other is less than 1%, so the amount of noise seems to be reasonable.
msg11860 (view) Author: malte Date: 2025-07-22.19:13:01
The second experiment has finished. The link from David's message is still current (go to the "v2" subdirectories).

The v2 experiment compares three versions:
- issue1146-base: before the new FDR file parser, same revision as in David's experiment
- issue1146-v1: the version of the FDR file parser from the previous experiment, which we found a bit slow in that experiment
- issue1146-v2: an experimental (as in: too ugly to be merged) version that uses macros that expand to try/catch blocks to create the error contexts on demand

The new experiment uses the entire (jumbo) benchmark suite, in which 128 tasks fail to translate due to memory limits, so 128 tasks are missing from the results table. We don't need to worry about this.

The HTML reports show that we recovered a lot of runtime:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-v2-eval/issue1146-v2.html

The relevant attribute is "read_input_time", and the totals are:

base: 376.00s
v1:   992.87s (+164% compared to base)
v2:   469.96s (+25% compared to base)

Inside the experiment folder you also find relative scatter plots, in particular this one:

https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-v2-eval/scatter-relative/issue1146-base-issue1146-v2/issue1146-v2-issue1146-base-issue1146-v2-read_input_time-just-parse.png

It shows some weirdness happening for small tasks, but the behaviour is reasonably consistent across domains.
msg11859 (view) Author: davidspeck Date: 2025-07-22.10:51:41
Experiments: https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/
msg11689 (view) Author: florian Date: 2024-09-05.15:21:06
In issue1150, we saw a case where the original task satisfied a property (in that case, that prevail conditions do not explicitly mention the precondition fact as an effect), but a transformation of the task did not.

This is a reminder that once we finalize the list of properties we expect a task to have, we have to decide if we want all task transformations to maintain these properties (with the possible exception of transformations meant only for internal use).

Currently, task transformations that are not exposed at the user level cannot be freely combined with all plugins because they may violate assumptions, even if this does not matter in the situations where they are currently used.
msg11658 (view) Author: tanja Date: 2024-07-19.15:44:20
A draft pull request for this issue can be found at https://github.com/aibasel/downward/pull/227
msg11630 (view) Author: tanja Date: 2024-07-12.10:56:38
Create meta issue related to sprint story.
History
Date                 User        Action  Args
2025-07-30 12:19:32  malte       set     priority: meta -> feature; messages: + msg11881
2025-07-25 15:45:27  malte       set     messages: + msg11876
2025-07-24 13:15:02  malte       set     messages: + msg11868
2025-07-24 11:19:18  malte       set     messages: + msg11867
2025-07-22 19:22:48  malte       set     messages: + msg11861
2025-07-22 19:13:01  malte       set     messages: + msg11860
2025-07-22 10:51:41  davidspeck  set     nosy: + davidspeck; messages: + msg11859
2024-09-05 15:21:06  florian     set     messages: + msg11689
2024-07-19 15:44:20  tanja       set     status: unread -> chatting; messages: + msg11658
2024-07-14 11:22:03  florian     set     nosy: + florian
2024-07-12 11:36:42  gabi        set     nosy: + gabi
2024-07-12 10:56:38  tanja       create