Frametail

Benchmarks overview

How benchmarks combine datasets and scorers to measure generative video quality.

Purpose

Benchmarks answer: “If we change the model, prompt, or pipeline, what happens to measurable quality?” They run the same scoring logic across every row of a dataset so results are comparable.
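As a rough illustration of why a fixed dataset and scorer make results comparable, the sketch below scores two pipeline variants on identical rows and reports the difference. Every name here (mean_score, compare, the toy generators and the scorer) is hypothetical and is not part of any Frametail API; strings stand in for generated videos.

```python
from statistics import mean
from typing import Callable

# Hypothetical shapes: a row is a dict of inputs, a generator turns a row
# into an output, and a scorer maps (row, output) to a number in [0, 1].
Row = dict
Generator = Callable[[Row], str]
Scorer = Callable[[Row, str], float]


def mean_score(dataset: list[Row], generate: Generator, scorer: Scorer) -> float:
    """Apply the same scorer to every row's output and average the result."""
    return mean(scorer(row, generate(row)) for row in dataset)


def compare(dataset: list[Row], baseline: Generator, candidate: Generator,
            scorer: Scorer) -> float:
    """Candidate minus baseline, measured on identical rows with one scorer."""
    return (mean_score(dataset, candidate, scorer)
            - mean_score(dataset, baseline, scorer))


# Toy stand-ins: the "output" is a caption string instead of a video.
dataset = [{"prompt": "a red kite"}, {"prompt": "a blue boat"}]


def baseline(row: Row) -> str:
    return "video of " + row["prompt"].split()[-1]  # keeps only the last word


def candidate(row: Row) -> str:
    return "video of " + row["prompt"]  # keeps the whole prompt


def prompt_coverage(row: Row, output: str) -> float:
    """Fraction of prompt words that appear in the output."""
    words = row["prompt"].split()
    return sum(word in output.split() for word in words) / len(words)


delta = compare(dataset, baseline, candidate, prompt_coverage)
print(f"candidate - baseline = {delta:+.3f}")
```

Because only the generator changes between the two calls, any difference in the mean can be attributed to that change rather than to the data or the rubric.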

Lifecycle

  1. Author or import a dataset with the inputs you need.
  2. Attach scorers that encode your rubric or automated judge.
  3. Launch a run and wait for completion notifications.
  4. Review aggregate metrics and drill into per-row failures.
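
The four steps above can be sketched in plain Python. This is a minimal in-memory model, not Frametail's API: RowResult, RunReport, run_benchmark, and the duration_match scorer are all assumed names, and the run is synchronous, so there is no completion notification to wait for.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RowResult:
    row_id: str
    scores: dict[str, float]


@dataclass
class RunReport:
    results: list[RowResult] = field(default_factory=list)

    def aggregate(self) -> dict[str, float]:
        """Mean of each scorer across all rows."""
        if not self.results:
            return {}
        names = self.results[0].scores.keys()
        count = len(self.results)
        return {n: sum(r.scores[n] for r in self.results) / count for n in names}

    def failures(self, scorer_name: str, threshold: float) -> list[RowResult]:
        """Rows whose score for one scorer falls below the threshold."""
        return [r for r in self.results if r.scores[scorer_name] < threshold]


def run_benchmark(dataset: list[dict],
                  scorers: dict[str, Callable[[dict], float]]) -> RunReport:
    """Apply every scorer to every row and collect per-row results."""
    report = RunReport()
    for row in dataset:
        scores = {name: fn(row) for name, fn in scorers.items()}
        report.results.append(RowResult(row["id"], scores))
    return report


# Step 1: a dataset with the inputs the scorer needs (toy values).
dataset = [
    {"id": "r1", "duration_s": 4.0, "target_s": 4.0},
    {"id": "r2", "duration_s": 2.0, "target_s": 4.0},
]

# Step 2: a scorer that encodes the rubric (ratio of shorter to longer duration).
scorers = {
    "duration_match": lambda row: min(row["duration_s"], row["target_s"])
    / max(row["duration_s"], row["target_s"]),
}

# Step 3: launch the run.
report = run_benchmark(dataset, scorers)

# Step 4: review the aggregate, then drill into failing rows.
print(report.aggregate())
print([r.row_id for r in report.failures("duration_match", threshold=0.8)])
```

Keeping per-row results alongside the aggregate is what makes the drill-down in step 4 possible; an aggregate alone would hide which rows pulled the mean down.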

Good benchmark design

  • Keep rows independent so failures do not cascade.
  • Prefer small, representative slices while iterating; scale up once stable.
  • Version prompts explicitly and record the prompt version with each run, so a score change can be traced to a specific prompt.
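
One possible way to apply these three guidelines in code is sketched below. The helper names (prompt_version, sample_slice, run_isolated) are hypothetical, the content hash is just one assumed versioning scheme, and a seeded random sample is only a stand-in for a representative slice; a stratified sample may be needed when the dataset has distinct categories.

```python
import hashlib
import random
from typing import Callable


def prompt_version(prompt_template: str) -> str:
    """Short content hash used as an explicit prompt version tag."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:8]


def sample_slice(dataset: list[dict], size: int, seed: int = 0) -> list[dict]:
    """Seeded random sample, so the same slice is reused between iterations."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(size, len(dataset)))


def run_isolated(rows: list[dict], score_row: Callable[[dict], float],
                 prompt_template: str) -> list[dict]:
    """Score each row on its own; an exception is recorded on that row only."""
    version = prompt_version(prompt_template)
    results = []
    for row in rows:
        try:
            score, error = score_row(row), None
        except Exception as exc:  # one bad row must not abort the others
            score, error = None, repr(exc)
        results.append({"id": row["id"], "score": score, "error": error,
                        "prompt_version": version})
    return results


# Toy rows; the second one makes the scorer raise ZeroDivisionError.
rows = [{"id": "a", "frames": 24}, {"id": "b", "frames": 0}, {"id": "c", "frames": 48}]
template = "Generate a {duration}s clip of {subject}."

results = run_isolated(rows, lambda row: 24 / row["frames"], template)
for result in results:
    print(result)
```

Row "b" fails, yet rows "a" and "c" still receive scores, and every result carries the same prompt_version tag, so runs made with different templates can be told apart later.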