Datasets

Row fields (Braintrust-aligned)

Each dataset row uses the same eval-case shape as common LLM eval tooling, plus Frametail's media-specific output field:

Field	Role
`input`	What you condition generation or scoring on: prompts, source `image_url` for image-to-video, variables.
`output`	Optional pre-existing media to score (image or video URL, attachment envelope, or `{ kind, url }`). Benchmarks hydrate a `generation` from this URL and skip provider generation.
`expected`	Optional reference data, such as gold labels or a baseline clip for later comparative scorers. It is not the primary artifact URL for rubric scoring.
`metadata`	Extra fields for templates, provenance, or import columns.
`model`	Per-row model override when the benchmark should generate for that row.

For score-only benchmarks, put reachable media URLs on `output` and attach scorers. Rows with `output` do not require a benchmark default model or a per-row `model`.
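
The rule above can be sketched as a small decision function. This is a minimal illustration that assumes rows are plain dictionaries. The name `resolve_row_mode` and the `default_model` parameter are hypothetical and not part of the Frametail SDK. Only the field names `output` and `model` come from the table.

```python
from typing import Any, Optional


def resolve_row_mode(row: dict[str, Any], default_model: Optional[str] = None) -> str:
    """Return "score_only" or "generate" for a dataset row (illustrative only)."""
    # A row that already carries media is scored as-is, so no model is needed.
    if row.get("output"):
        return "score_only"
    # Otherwise generation needs a per-row model or a benchmark default.
    if row.get("model") or default_model:
        return "generate"
    raise ValueError("row has no output and no model to generate with")
```

Raising on a row with neither `output` nor a model is a design choice for this sketch. It surfaces incomplete rows before a benchmark run rather than during it.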

CSV / JSON import

Map column `output` → row `output` (media to score).
Map column `expected` → row `expected` (reference only).
Do not put final clip URLs in `input.image_url` unless they are source images for image-to-video.
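
As a rough illustration of the mapping above, the sketch below converts CSV text into row dictionaries before import. The helper name `csv_to_rows` and the `prompt` and `image_url` column names are assumptions made for this example. Frametail's own importer may name or handle columns differently.

```python
import csv
import io
from typing import Any


def csv_to_rows(csv_text: str) -> list[dict[str, Any]]:
    """Map CSV columns onto row fields (hypothetical helper, not a Frametail API)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row: dict[str, Any] = {"input": {"prompt": record.get("prompt", "")}}
        # Column `output` -> row `output`: the media to score.
        if record.get("output"):
            row["output"] = record["output"]
        # Column `expected` -> row `expected`: reference data only.
        if record.get("expected"):
            row["expected"] = record["expected"]
        # `image_url` is treated as an image-to-video source, never as the final clip.
        if record.get("image_url"):
            row["input"]["image_url"] = record["image_url"]
        rows.append(row)
    return rows
```

Empty cells are skipped so that a blank `expected` column does not create an empty reference on the row.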

Hygiene

Remove duplicate rows that skew score distributions.
Document assumptions (resolution, frame rate) in dataset descriptions.
Snapshot datasets before major benchmark reruns so you can compare apples-to-apples.
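
Two of these practices can be approximated locally before upload. The sketch below is one possible approach, not a Frametail feature. It drops exact duplicate rows and computes a content hash to record next to a benchmark run. The function names are hypothetical, and only exact duplicates are detected.

```python
import hashlib
import json
from typing import Any


def canonical(row: dict[str, Any]) -> str:
    # Sorted keys make the serialization independent of key order.
    return json.dumps(row, sort_keys=True, separators=(",", ":"))


def dedupe_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        key = canonical(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def snapshot_id(rows: list[dict[str, Any]]) -> str:
    """Content hash of the rows, in order, for comparing dataset versions."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(canonical(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

If two reruns report the same `snapshot_id`, they scored the same rows in the same order, so score differences are not caused by dataset drift.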

API note

Dataset listing is available through the HTTP API and SDK. Creation flows are primarily dashboard-first today, so check the latest UI for import options.

Example: score existing images

{
  "input": { "prompt": "A red apple on a wooden table" },
  "output": { "kind": "image", "url": "https://example.com/apple.png" },
  "metadata": { "source": "production-export" }
}

Run a benchmark on this dataset with scorers attached. Frametail creates a generation per row, copies output into generations.media, and runs scorers without calling fal for those rows.
