Benchmarks overview
How benchmarks combine datasets and scorers to measure generative video quality.
Purpose
Benchmarks answer: “If we change the model, prompt, or pipeline, what happens to measurable quality?” They run the same scoring logic across every row of a dataset so results are comparable.
Lifecycle
- Author or import a dataset with the inputs you need.
- Attach scorers that encode your rubric or automated judge.
- Launch a run and wait for completion notifications.
- Review aggregate metrics and drill into per-row failures.
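The lifecycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's API: `Row`, `RowResult`, `run_benchmark`, `aggregate`, `failing_rows`, the two toy scorers, and the row fields are all hypothetical names. The run is executed synchronously here, so the "wait for completion" step is not modeled.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List

# All names below are hypothetical and exist only for illustration.
Row = Dict[str, Any]             # one dataset row
Scorer = Callable[[Row], float]  # maps a row to a score in [0, 1]


@dataclass
class RowResult:
    row_id: str
    scores: Dict[str, float]


def duration_ok(row: Row) -> float:
    """Toy scorer: 1.0 if the clip length is within 0.5 s of the request."""
    diff = abs(row["actual_seconds"] - row["requested_seconds"])
    return 1.0 if diff <= 0.5 else 0.0


def has_frames(row: Row) -> float:
    """Toy scorer: 1.0 if the generated clip contains at least one frame."""
    return 1.0 if row["frame_count"] > 0 else 0.0


def run_benchmark(dataset: List[Row], scorers: Dict[str, Scorer]) -> List[RowResult]:
    """Apply every scorer to every row so results stay comparable."""
    return [
        RowResult(row["id"], {name: fn(row) for name, fn in scorers.items()})
        for row in dataset
    ]


def aggregate(results: List[RowResult]) -> Dict[str, float]:
    """Mean score per scorer across all rows."""
    if not results:
        return {}
    return {name: mean(r.scores[name] for r in results) for name in results[0].scores}


def failing_rows(results: List[RowResult], threshold: float = 0.5) -> List[str]:
    """IDs of rows where any scorer fell below the threshold."""
    return [r.row_id for r in results if any(s < threshold for s in r.scores.values())]


# 1. Author a dataset. 2. Attach scorers. 3. Run. 4. Review.
dataset = [
    {"id": "clip-1", "requested_seconds": 4.0, "actual_seconds": 4.2, "frame_count": 96},
    {"id": "clip-2", "requested_seconds": 4.0, "actual_seconds": 6.0, "frame_count": 144},
]
scorers = {"duration_ok": duration_ok, "has_frames": has_frames}

results = run_benchmark(dataset, scorers)
summary = aggregate(results)    # aggregate metrics per scorer
failed = failing_rows(results)  # rows to drill into
```

Keeping per-row results alongside the aggregate is what makes the drill-down step possible: the summary shows that something regressed, and the row IDs show where.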
Good benchmark design
- Keep rows independent so a failure in one row does not affect the results of any other.
- Prefer small, representative slices while iterating; scale up once stable.
- Version prompts explicitly and tie each run to the prompt version it used.
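One way to apply these three guidelines together is sketched below. It assumes a scorer is a plain Python callable and that a prompt version is a string label; `sample_slice`, `score_rows`, `has_frames`, and the record fields are hypothetical names, not part of any documented API.

```python
import random
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]  # hypothetical row shape


def sample_slice(rows: List[Row], k: int, seed: int = 0) -> List[Row]:
    """Pick a small, reproducible slice for fast iteration."""
    if k >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, k)


def score_rows(
    rows: List[Row],
    scorer: Callable[[Row], float],
    prompt_version: str,
) -> List[Dict[str, Optional[Any]]]:
    """Score each row in isolation and tag every record with the prompt version."""
    records = []
    for row in rows:
        record = {
            "row_id": row["id"],
            "prompt_version": prompt_version,
            "score": None,
            "error": None,
        }
        try:
            record["score"] = scorer(row)
        except Exception as exc:
            # The failure is recorded on this row only; later rows still run.
            record["error"] = f"{type(exc).__name__}: {exc}"
        records.append(record)
    return records


def has_frames(row: Row) -> float:
    """Toy scorer: raises KeyError if the row is missing 'frame_count'."""
    return 1.0 if row["frame_count"] > 0 else 0.0


rows = [
    {"id": "a", "frame_count": 10},
    {"id": "b"},  # malformed row: the scorer will raise on it
    {"id": "c", "frame_count": 0},
]
records = score_rows(rows, has_frames, prompt_version="v3")
quick_slice = sample_slice(rows, 2, seed=1)
```

Fixing the seed keeps the small slice identical between iterations, so a score change reflects the model or prompt change rather than a different sample.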