Before launching a run, confirm that scorers are attached and that dataset rows validate in the preview step. Fix schema issues before starting large runs to avoid partial failures.
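If you prepare dataset rows outside the product, a local check can catch schema problems before you reach the preview step. The sketch below is only an illustration: the field names in `REQUIRED_FIELDS` and the `validate_rows` helper are hypothetical and are not part of the product's API. Substitute the fields your benchmark actually expects.

```python
from typing import Any

# Hypothetical schema (an assumption for illustration): field name -> expected type.
REQUIRED_FIELDS: dict[str, type] = {
    "video_url": str,
    "prompt": str,
    "expected": str,
}


def validate_rows(rows: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (row_index, problem) pairs; an empty list means every row passed."""
    problems: list[tuple[int, str]] = []
    for index, row in enumerate(rows):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in row:
                problems.append((index, f"missing field '{field}'"))
            elif not isinstance(row[field], expected_type):
                problems.append(
                    (index, f"field '{field}' should be {expected_type.__name__}")
                )
    return problems


if __name__ == "__main__":
    sample = [
        {"video_url": "clip-1.mp4", "prompt": "Describe the scene", "expected": "A dog"},
        {"video_url": "clip-2.mp4", "prompt": 3},  # wrong type and a missing field
    ]
    for row_index, problem in validate_rows(sample):
        print(f"row {row_index}: {problem}")
```

Collecting every problem instead of stopping at the first one lets you fix the whole dataset in a single pass before launching a large run.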

Execution

Start runs from the benchmark detail page. Long jobs may queue, so watch the status chips and notifications rather than refreshing the page repeatedly.

Cost awareness

Video benchmarks consume compute and, depending on your configuration, third-party inference. Pilot with a subset of rows first.
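One way to plan a pilot is to pick a reproducible random subset of rows and then scale the observed pilot cost to the full dataset. The sketch below assumes cost grows roughly linearly with row count, which may not hold for your configuration. The helper names `pick_pilot_rows` and `extrapolate_cost` are hypothetical, and the cost figure in the example is made up for illustration.

```python
import random
from typing import Any


def pick_pilot_rows(rows: list[Any], pilot_size: int, seed: int = 0) -> list[Any]:
    """Return a reproducible random sample of at most pilot_size rows."""
    rng = random.Random(seed)  # fixed seed so the same pilot can be re-run
    return rng.sample(rows, min(pilot_size, len(rows)))


def extrapolate_cost(pilot_cost: float, pilot_rows: int, total_rows: int) -> float:
    """Scale the observed pilot cost linearly to the full dataset (an assumption)."""
    if pilot_rows <= 0:
        raise ValueError("pilot_rows must be positive")
    return pilot_cost * total_rows / pilot_rows


if __name__ == "__main__":
    dataset = [f"row-{i}" for i in range(500)]
    pilot = pick_pilot_rows(dataset, pilot_size=10)
    # Made-up pilot cost; read the real figure from your own billing data.
    estimate = extrapolate_cost(pilot_cost=2.0, pilot_rows=len(pilot), total_rows=len(dataset))
    print(f"pilot rows: {len(pilot)}, estimated full-run cost: {estimate:.2f}")
```

Sampling at random rather than taking the first rows reduces the chance that the pilot covers only short or unusually cheap videos.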
