Frametail

Datasets

Building and curating datasets for repeatable video benchmarks.

Row fields (Braintrust-aligned)

Each dataset row uses the same eval-case shape as common LLM eval tooling, plus Frametail's media-specific output field:

Field | Role
input | What you condition generation or scoring on: prompts, source image_url for image-to-video, variables.
output | Optional pre-existing media to score (image or video URL, attachment envelope, or { kind, url }). Benchmarks hydrate a generation from this URL and skip provider generation.
expected | Optional reference data: gold labels, or a baseline clip for comparative scorers later. This is not the primary artifact URL for rubric scoring.
metadata | Extra fields for templates, provenance, or import columns.
model | Per-row model override when the benchmark should generate for that row.

For score-only benchmarks, put reachable media URLs on output and attach scorers. Rows with output do not require a benchmark default model or per-row model.
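
The rule above can be checked before a run. The following is a minimal sketch, not part of Frametail's SDK: the helper names `row_needs_model` and `check_rows` and the model id are hypothetical, and rows are assumed to be plain dictionaries with the fields listed in the table.

```python
from typing import Any, Optional


def row_needs_model(row: dict[str, Any]) -> bool:
    """A row needs a model only when it has no pre-existing output to score."""
    return not row.get("output")


def check_rows(rows: list[dict[str, Any]], default_model: Optional[str] = None) -> list[int]:
    """Return the indexes of rows that have neither media to score nor a model to generate with."""
    problems = []
    for index, row in enumerate(rows):
        has_model = bool(row.get("model") or default_model)
        if row_needs_model(row) and not has_model:
            problems.append(index)
    return problems


rows = [
    # Score-only row: output is present, so no model is required.
    {"input": {"prompt": "A red apple"}, "output": {"kind": "image", "url": "https://example.com/apple.png"}},
    # No output and no model: only valid if the benchmark has a default model.
    {"input": {"prompt": "A green pear"}},
    # No output, but a per-row model override (hypothetical id).
    {"input": {"prompt": "A ripe plum"}, "model": "some-model-id"},
]

print(check_rows(rows))
print(check_rows(rows, default_model="some-default-model"))
```

Without a benchmark default model, only the second row is flagged; with one, every row is either scorable or generatable.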

CSV / JSON import

  • Map column output → row output (media to score).
  • Map column expected → row expected (reference only).
  • Do not put final clip URLs in input.image_url unless they are source images for image-to-video.
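
The column mapping above could be sketched as a small conversion step. This assumes a CSV with columns named `prompt`, `image_url`, `output`, and `expected`; those column names and the helper `csv_row_to_dataset_row` are illustrative, not a documented Frametail import format.

```python
import csv
import io
from typing import Any

# Hypothetical CSV export; column names are assumptions for this sketch.
SAMPLE_CSV = (
    "prompt,image_url,output,expected\n"
    "A red apple on a wooden table,,https://example.com/apple.png,\n"
    "Animate this photo,https://example.com/source.jpg,,\n"
)


def csv_row_to_dataset_row(record: dict[str, str]) -> dict[str, Any]:
    """Map one CSV record onto the row fields, skipping empty cells."""
    row: dict[str, Any] = {"input": {"prompt": record["prompt"]}}
    if record.get("image_url"):
        # Source image for image-to-video only, never the final clip.
        row["input"]["image_url"] = record["image_url"]
    if record.get("output"):
        row["output"] = record["output"]  # media to score
    if record.get("expected"):
        row["expected"] = record["expected"]  # reference only
    return row


dataset_rows = [csv_row_to_dataset_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for dataset_row in dataset_rows:
    print(dataset_row)
```

Keeping the final clip URL in `output` rather than `input.image_url` is what lets a benchmark treat the first row as score-only and the second as an image-to-video generation.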

Hygiene

  • Remove duplicate rows that skew score distributions.
  • Document assumptions (resolution, frame rate) in dataset descriptions.
  • Snapshot datasets before major benchmark reruns so you can compare apples-to-apples.
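
Two of these checks can be done locally before upload. The sketch below is one possible approach, not Frametail functionality: it treats rows with identical `input` and `output` as duplicates (an assumption; your definition may also include `expected` or `metadata`) and derives a short content hash that can be noted alongside a snapshot.

```python
import hashlib
import json
from typing import Any


def row_key(row: dict[str, Any]) -> str:
    """Hash input and output as canonical JSON so key order does not hide duplicates."""
    canonical = json.dumps({"input": row.get("input"), "output": row.get("output")}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def snapshot_id(rows: list[dict[str, Any]]) -> str:
    """Short content hash of the whole dataset; equal ids mean identical rows in identical order."""
    canonical = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


raw_rows = [
    {"input": {"prompt": "A red apple"}, "output": "https://example.com/apple.png"},
    {"output": "https://example.com/apple.png", "input": {"prompt": "A red apple"}},  # same row, different key order
    {"input": {"prompt": "A green pear"}, "output": "https://example.com/pear.png"},
]

clean_rows = dedupe(raw_rows)
print(len(raw_rows), "->", len(clean_rows))
print("snapshot:", snapshot_id(clean_rows))
```

Recording the hash in the dataset description gives a quick way to confirm that two benchmark runs scored the same rows.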

API note

Dataset listing is available through the HTTP API and SDK. Creation flows are primarily dashboard-first today, so check the latest UI for import options.

Example: score existing images

{
  "input": { "prompt": "A red apple on a wooden table" },
  "output": { "kind": "image", "url": "https://example.com/apple.png" },
  "metadata": { "source": "production-export" }
}

Run a benchmark on this dataset with scorers attached. Frametail creates a generation per row, copies output into generations.media, and runs scorers without calling fal for those rows.