Datasets
Building and curating datasets for repeatable video benchmarks.
Row fields (Braintrust-aligned)
Each dataset row uses the same eval-case shape as common LLM eval tooling, plus Frametail's media-specific output field:
| Field | Role |
|---|---|
| `input` | What you condition generation or scoring on: prompts, source `image_url` for image-to-video, variables. |
| `output` | Optional pre-existing media to score (image or video URL, attachment envelope, or `{ kind, url }`). Benchmarks hydrate a generation from this URL and skip provider generation. |
| `expected` | Optional reference data: gold labels, a baseline clip for comparative scorers later, not the primary artifact URL for rubric scoring. |
| `metadata` | Extra fields for templates, provenance, or import columns. |
| `model` | Per-row model override when the benchmark should generate for that row. |
For score-only benchmarks, put reachable media URLs on output and attach scorers. Rows with output do not require a benchmark default model or per-row model.
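The rule above (rows with `output` are scored as-is, other rows need a model from the row or the benchmark default) can be sketched as a small helper. This is an illustrative sketch only: `plan_row`, the `mode` labels, and the error message are hypothetical names, not part of Frametail's API.

```python
from typing import Any, Optional


def plan_row(row: dict[str, Any], default_model: Optional[str] = None) -> dict[str, Any]:
    """Decide how a benchmark would treat one dataset row (hypothetical helper)."""
    if row.get("output"):
        # Pre-existing media: score it directly, so no model is required.
        return {"mode": "score_only", "model": None}
    # No output: a model must come from the row override or the benchmark default.
    model = row.get("model") or default_model
    if model is None:
        raise ValueError("row has no output and no model to generate with")
    return {"mode": "generate", "model": model}
```

Under these assumptions, a per-row `model` takes precedence over the benchmark default, and a row with neither `output` nor any model is rejected before the run starts.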
CSV / JSON import
- Map column `output` → row `output` (media to score).
- Map column `expected` → row `expected` (reference only).
- Do not put final clip URLs in `input.image_url` unless they are source images for image-to-video.
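If you prepare rows yourself before importing, the mapping above might look like the following sketch. The CSV column names `prompt` and `source_image_url` and the helper `csv_to_rows` are assumptions made for illustration; only `output` and `expected` come from the mapping rules above.

```python
import csv
import io
from typing import Any

# Columns with a dedicated place in the row; everything else becomes metadata.
KNOWN_COLUMNS = {"prompt", "source_image_url", "output", "expected"}


def csv_to_rows(csv_text: str) -> list[dict[str, Any]]:
    """Convert CSV text into dataset rows (hypothetical column names)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row: dict[str, Any] = {"input": {"prompt": record["prompt"]}}
        # Only a source image for image-to-video belongs on input.image_url.
        if record.get("source_image_url"):
            row["input"]["image_url"] = record["source_image_url"]
        # Media to score goes on output; reference data goes on expected.
        if record.get("output"):
            row["output"] = record["output"]
        if record.get("expected"):
            row["expected"] = record["expected"]
        extra = {k: v for k, v in record.items() if k and k not in KNOWN_COLUMNS and v}
        if extra:
            row["metadata"] = extra
        rows.append(row)
    return rows
```

Empty cells are dropped rather than stored as empty strings, so a row without an `output` value stays a generation row instead of becoming a score-only row with a blank URL.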
Hygiene
- Remove duplicate rows that skew score distributions.
- Document assumptions (resolution, frame rate) in dataset descriptions.
- Snapshot datasets before major benchmark reruns so you can compare apples-to-apples.
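Deduplication and snapshotting can be done locally before a rerun. The sketch below is one possible approach and not a Frametail feature: `row_key`, `dedupe`, and `snapshot_id` are hypothetical helpers, and treating rows as duplicates when `input`, `output`, `expected`, and `model` match (ignoring `metadata`) is an assumption you may want to change.

```python
import hashlib
import json
from typing import Any


def row_key(row: dict[str, Any]) -> str:
    """Canonical JSON of the fields assumed to define an eval case."""
    core = {k: row.get(k) for k in ("input", "output", "expected", "model")}
    return json.dumps(core, sort_keys=True, separators=(",", ":"))


def dedupe(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def snapshot_id(rows: list[dict[str, Any]]) -> str:
    """Order-independent SHA-256 fingerprint of a dataset's rows."""
    digest = hashlib.sha256()
    for key in sorted(row_key(r) for r in rows):
        digest.update(key.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Recording the `snapshot_id` alongside each benchmark run gives a quick check that two runs scored the same set of rows.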
API note
Dataset listing is available through the HTTP API and SDK. Creation flows are primarily dashboard-first today, so check the latest UI for import options.
Example: score existing images
    {
      "input": { "prompt": "A red apple on a wooden table" },
      "output": { "kind": "image", "url": "https://example.com/apple.png" },
      "metadata": { "source": "production-export" }
    }

Run a benchmark on this dataset with scorers attached. Frametail creates a generation per row, copies `output` into `generations.media`, and runs scorers without calling fal for those rows.
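To make the score-only flow concrete, here is a local simulation of the behavior described above. It is a sketch under stated assumptions, not Frametail's implementation: `run_score_only`, the scorer signature, and the `provider_called` flag are hypothetical, and the real shape of `generations.media` may differ.

```python
import copy
from typing import Any, Callable

# Assumed scorer shape: takes the row and its generation, returns a score.
Scorer = Callable[[dict[str, Any], dict[str, Any]], float]


def run_score_only(
    rows: list[dict[str, Any]], scorers: dict[str, Scorer]
) -> list[dict[str, Any]]:
    """Simulate scoring rows that already carry media on output."""
    results = []
    for row in rows:
        if "output" not in row:
            raise ValueError("score-only rows need an output field")
        # One generation per row; media is copied from output, no provider call.
        generation = {"media": copy.deepcopy(row["output"]), "provider_called": False}
        scores = {name: score(row, generation) for name, score in scorers.items()}
        results.append({"generation": generation, "scores": scores})
    return results
```

Each scorer sees both the original row and the hydrated generation, so a rubric scorer can read the prompt from `input` while judging the media that came from `output`.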