Pin the dataset once. Argue about the diff, not the setup.
Benchmarks stay tied to the same rows, scorers, and model settings, so "better" is measured against the same contract as last week.
motion-quality-suite
completed · Benchmark · Compare runs on a pinned dataset
| Scorer | v1.4 (last week) | v1.5 (today) | Δ |
|---|---|---|---|
| prompt_adherence | 0.88 | 0.92 | +0.04 |
| motion_smoothness | 0.91 | 0.89 | -0.02 |
| audio_sync | 0.94 | 0.94 | ±0.00 |
| Aggregate | 0.91 | 0.92 | +0.01 |
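
The comparison above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's actual implementation: the `fingerprint` and `compare_runs` helpers, the row IDs, and the model settings are all hypothetical, and the aggregate is assumed to be the unweighted mean of the three scorers rounded to two decimals. Only the per-scorer values come from the table.

```python
import hashlib
import json
from statistics import mean


def fingerprint(rows, scorers, model_settings):
    """Hash the benchmark contract (rows, scorers, model settings).

    Hypothetical helper: two runs are comparable only if this hash matches.
    """
    payload = json.dumps(
        {"rows": rows, "scorers": sorted(scorers), "model_settings": model_settings},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_runs(baseline, candidate):
    """Return per-scorer deltas plus an aggregate delta, rounded to 2 decimals."""
    if baseline["fingerprint"] != candidate["fingerprint"]:
        raise ValueError("runs were not scored against the same pinned contract")

    deltas = {
        name: round(candidate["scores"][name] - old, 2)
        for name, old in baseline["scores"].items()
    }
    # Assumption: the aggregate is the unweighted mean of the scorers.
    agg_old = round(mean(baseline["scores"].values()), 2)
    agg_new = round(mean(candidate["scores"].values()), 2)
    deltas["Aggregate"] = round(agg_new - agg_old, 2)
    return deltas


# Hypothetical contract: row IDs and settings are placeholders.
contract = fingerprint(
    rows=["row-001", "row-002", "row-003"],
    scorers=["prompt_adherence", "motion_smoothness", "audio_sync"],
    model_settings={"temperature": 0.0, "seed": 7},
)

# Scores copied from the table above.
v1_4 = {
    "fingerprint": contract,
    "scores": {"prompt_adherence": 0.88, "motion_smoothness": 0.91, "audio_sync": 0.94},
}
v1_5 = {
    "fingerprint": contract,
    "scores": {"prompt_adherence": 0.92, "motion_smoothness": 0.89, "audio_sync": 0.94},
}

deltas = compare_runs(v1_4, v1_5)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

Refusing to diff runs whose fingerprints differ is the design choice that matters here: it keeps a change in rows or settings from being read as a change in quality.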