Benchmarks overview

How benchmarks combine datasets and scorers to measure generative video quality.

Purpose

Benchmarks answer: “If we change the model, prompt, or pipeline, what happens to measurable quality?” They run the same scoring logic across every row of a dataset so results are comparable.

Lifecycle

1. Author or import a dataset with the inputs you need.
2. Attach scorers that encode your rubric or automated judge.
3. Launch a run and wait for completion notifications.
4. Review aggregate metrics and drill into per-row failures.
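
The lifecycle above can be sketched in plain Python. This is an illustration under stated assumptions, not the product's API: `Row`, `run_benchmark`, both scorer functions, and every field name are hypothetical, and `output_duration_s` stands in for whatever a real video generation step would return.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


# Hypothetical row shape: one independent input plus the output being judged.
@dataclass(frozen=True)
class Row:
    row_id: str
    prompt: str
    target_duration_s: float
    output_duration_s: float  # stand-in for a generated video's measured duration


Scorer = Callable[[Row], float]


def duration_match(row: Row) -> float:
    """Return 1.0 if the output duration is within 10% of the target, else 0.0."""
    tolerance = 0.1 * row.target_duration_s
    return 1.0 if abs(row.output_duration_s - row.target_duration_s) <= tolerance else 0.0


def prompt_detail(row: Row) -> float:
    """Return 1.0 if the prompt has at least three words, else 0.0."""
    return 1.0 if len(row.prompt.split()) >= 3 else 0.0


def run_benchmark(
    dataset: List[Row], scorers: Dict[str, Scorer]
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Apply every scorer to every row, then average each scorer over the dataset."""
    per_row = {
        row.row_id: {name: scorer(row) for name, scorer in scorers.items()}
        for row in dataset
    }
    aggregate = {
        name: mean(scores[name] for scores in per_row.values()) for name in scorers
    }
    return per_row, aggregate


# Step 1: author a dataset. Step 2: attach scorers.
DATASET = [
    Row("r1", "a cat walking on a beach", 5.0, 5.2),
    Row("r2", "city at night", 10.0, 7.0),
    Row("r3", "waves", 4.0, 4.0),
]
SCORERS = {"duration_match": duration_match, "prompt_detail": prompt_detail}

# Step 3: run. Step 4: review aggregates, then drill into failing rows.
per_row, aggregate = run_benchmark(DATASET, SCORERS)
failures = {
    row_id: [name for name, score in scores.items() if score < 1.0]
    for row_id, scores in per_row.items()
    if any(score < 1.0 for score in scores.values())
}
```

Because the same scorers run on every row, `aggregate` is comparable across runs, and `failures` points at the specific rows and scorers worth inspecting.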

Good benchmark design

Keep rows independent so failures do not cascade.
Prefer small, representative slices while iterating; scale up once stable.
Version prompts explicitly and tie each run to the prompt version it used.
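
One way these guidelines might look in code is sketched below. Everything here is assumed for illustration: `RunRecord`, `sample_slice`, `score_rows`, the row dictionaries, and the `frame_ratio` scorer are hypothetical names, not part of any documented interface. The sketch isolates per-row errors, draws a reproducible slice with a seeded sampler, and stores the prompt version on the run record.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunRecord:
    prompt_version: str        # ties the run to an explicit prompt version
    scores: Dict[str, float]   # row id -> score for rows that scored cleanly
    errors: Dict[str, str]     # row id -> error text for rows that raised


def sample_slice(rows: List[dict], k: int, seed: int = 0) -> List[dict]:
    """Draw a small slice; a fixed seed keeps it identical between iterations."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def score_rows(
    rows: List[dict], scorer: Callable[[dict], float], prompt_version: str
) -> RunRecord:
    """Score each row on its own so one failing row cannot abort the others."""
    scores: Dict[str, float] = {}
    errors: Dict[str, str] = {}
    for row in rows:
        try:
            scores[row["id"]] = scorer(row)
        except Exception as exc:  # record the failure and keep going
            errors[row["id"]] = repr(exc)
    return RunRecord(prompt_version=prompt_version, scores=scores, errors=errors)


def frame_ratio(row: dict) -> float:
    """Hypothetical scorer: fraction of expected frames produced, capped at 1.0."""
    return min(1.0, row["frames"] / row["expected_frames"])


ROWS = [
    {"id": f"row-{i}", "frames": 24 * i, "expected_frames": 24 * i}
    for i in range(1, 6)
]
ROWS.append({"id": "row-bad", "frames": 10, "expected_frames": 0})  # will raise

small_slice = sample_slice(ROWS, k=3, seed=42)            # iterate on a small slice
full_run = score_rows(ROWS, frame_ratio, prompt_version="v2")  # scale up once stable
```

The malformed row ends up in `full_run.errors` while the remaining rows are still scored, and rerunning `sample_slice` with the same seed returns the same rows.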
