I ran a new experiment. I changed conventions and will use "exp" for experiments from now on to avoid confusion between experiment numbers and tag numbers. The first four experiments are still called v1 etc., so the experiments are now v1, v2, v3, v4, exp5.
exp5 is a new experiment that tests a new tag, v6. Tag v6 is an evolution of v5 (lambdas using [&] captures for tracing messages) that removes the coupling with utils/logging and does some other clean-ups and hack removal.
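To make the kind of change concrete, here is a minimal sketch of the two styles. The names (Context, ErrorCallback, read_number_*) and signatures are made up for illustration and are not the actual code in the repository: a v5-style helper still drags along a "Context &" it no longer needs, while the cleaned-up v6-style helper only takes the [&]-capturing error callback and has no dependency on utils/logging.

    #include <cstdlib>
    #include <functional>
    #include <iostream>
    #include <string>

    // Hypothetical sketch only; names and signatures are invented.
    struct Context {
        std::string trace;
    };

    using ErrorCallback = std::function<void(const std::string &)>;

    // v5 style: errors are reported through a [&]-capturing lambda, but the
    // signature still passes a Context & that the function body no longer uses.
    static int read_number_v5(std::istream &in, Context &, const ErrorCallback &error) {
        int value;
        if (!(in >> value))
            error("expected a number");
        return value;
    }

    // v6 style: the unused Context & parameter is gone, so callers no longer
    // have to thread a Context through every parsing helper.
    static int read_number_v6(std::istream &in, const ErrorCallback &error) {
        int value;
        if (!(in >> value))
            error("expected a number");
        return value;
    }

    int main() {
        Context context{"while reading the first operator"};
        // The [&] capture gives the error path access to the trace without the
        // helpers depending on the Context type or on utils/logging.
        ErrorCallback error = [&](const std::string &msg) {
            std::cerr << context.trace << ": " << msg << std::endl;
            exit(1);
        };
        int n = read_number_v5(std::cin, context, error);
        int m = read_number_v6(std::cin, error);
        std::cout << n + m << std::endl;
        return 0;
    }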
I didn't rerun all tags this time; I thought looking at a subset of tags would be good enough, especially since I don't expect this to be the last experiment.
Result table:
https://ai.dmi.unibas.ch/_experiments/ai/downward/issue1146/data/issue1146-exp5-eval/
read_input_time values:
base: 404.95s
v5: 514.53s (+27% compared to base)
v6: 507.78s (+25% compared to base)
Description of tags:
base: main branch code with no input validation
v5: lambdas using [&] captures for tracing messages
v6: cleaned-up version of v5
Observations:
1. The cleaned-up version is a bit faster. I hoped this would happen because we got rid of lots of cases where we needed to pass around a "Context &" that is no longer used. But I'm pleasantly surprised that the effect is as large as it is.
2. All versions are substantially slower than in all previous experiments. For example, base is 7% slower than in the previous experiment v4. This is interesting because the runtimes of the same code in experiments v2, v3 and v4 were within 1% of each other.
But I'm not surprised. In v2-v4, nobody else was using these grid nodes at the same time. For this experiment, David ran an experiment simultaneously (and perhaps later other people did, too). So we clearly see the impact of different jobs interfering with each other, which is why we always recommend running versions whose runtime we want to compare in the same experiment rather than merging properties files.
Within the same experiment, due to the task order randomization, in expectation the comparison should always be fair even if the situation on the grid changes drastically while the experiment is running. (Although of course that's only really useful if we aggregate over many tasks, and individual runs can very well be strong outliers.)
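As a toy illustration of that fairness argument (this is not part of our experiment scripts, just a self-contained sketch): two versions with identical true runtimes are run as randomly interleaved tasks while the simulated grid load drifts upwards over the course of the experiment. Both versions end up with nearly the same mean runtime, whereas running all tasks of one version first would systematically penalize the other.

    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        const int num_tasks = 10000;
        std::mt19937 rng(2023);

        // Two versions with identical true runtime; the task order is randomized.
        std::vector<int> version;
        for (int i = 0; i < num_tasks; ++i)
            version.push_back(i % 2);
        std::shuffle(version.begin(), version.end(), rng);

        double sum[2] = {0, 0};
        int count[2] = {0, 0};
        for (int t = 0; t < num_tasks; ++t) {
            // Grid interference drifts from factor 1.0 to 1.5 during the experiment.
            double load_factor = 1.0 + 0.5 * t / num_tasks;
            double runtime = 100.0 * load_factor;
            sum[version[t]] += runtime;
            ++count[version[t]];
        }
        std::cout << "mean runtime version A: " << sum[0] / count[0] << "\n"
                  << "mean runtime version B: " << sum[1] / count[1] << std::endl;
        return 0;
    }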