Experiments isolate variables: you change one aspect (for example, a prompt variant) while holding datasets and scorers constant. Because only that one aspect differs between runs, a metric shift can be attributed to it rather than debated.
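As a rough illustration of that idea, the sketch below runs two prompt variants against one fixed dataset and one fixed scorer. Everything in it is hypothetical: `DATASET`, `exact_match`, `fake_model`, and `run_experiment` are stand-ins invented for this example, not part of any product API, and `fake_model` replaces a real model call so the code runs offline.

```python
from statistics import mean

# Hypothetical fixed dataset: (question, expected answer) pairs shared by every run.
DATASET = [
    ("2+2", "4"),
    ("3+5", "8"),
    ("10-7", "3"),
    ("6*7", "42"),
]
ANSWERS = dict(DATASET)

# The only thing that differs between runs is the prompt template.
VARIANTS = {
    "baseline": "Solve the problem.\n{question}",
    "number_only": "Solve the problem. Reply with only the number.\n{question}",
}


def exact_match(output: str, expected: str) -> float:
    """Fixed scorer: 1.0 when the output equals the expected answer."""
    return 1.0 if output.strip() == expected else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a model call (assumption, not a real API).

    It looks up the question on the prompt's last line and wraps the answer
    in a sentence unless the prompt asks for the number only.
    """
    question = prompt.splitlines()[-1]
    answer = ANSWERS[question]
    if "only the number" in prompt.lower():
        return answer
    return f"The answer is {answer}."


def run_experiment(template: str) -> float:
    """Score one prompt variant on the shared dataset with the shared scorer."""
    scores = [
        exact_match(fake_model(template.format(question=question)), expected)
        for question, expected in DATASET
    ]
    return mean(scores)


results = {name: run_experiment(template) for name, template in VARIANTS.items()}
print(results)
```

Since the dataset and scorer are identical across both runs, the gap between the two scores can only come from the prompt wording.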

Designing comparisons

Keep sample sizes meaningful: with too few rows, random variation can hide or mimic a real difference between variants.
Document hypotheses in the experiment description for future readers.
Link experiments back to traces when validating behavior in production-like settings.
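To make the sample-size point concrete, here is a minimal sketch that estimates whether a gap between two pass rates stands out from noise at a given number of rows. It assumes pass/fail scores, independent rows, and a normal approximation; the function names and the 1.96 threshold are illustrative choices, not settings from any tool. When both variants run on the same rows, a paired comparison would usually be more sensitive than this rough check.

```python
import math


def standard_error(pass_rate: float, n_rows: int) -> float:
    """Standard error of a mean of n_rows independent pass/fail scores."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_rows)


def difference_is_distinguishable(
    rate_a: float, rate_b: float, n_rows: int, z: float = 1.96
) -> bool:
    """Rough check: does the gap exceed z standard errors of the difference?

    Treats the two runs as independent samples of n_rows each (an assumption).
    """
    se_diff = math.sqrt(
        standard_error(rate_a, n_rows) ** 2 + standard_error(rate_b, n_rows) ** 2
    )
    return abs(rate_a - rate_b) > z * se_diff


# The same 5-point gap (70% vs 75%) evaluated at different dataset sizes.
for n_rows in (20, 200, 2000):
    verdict = difference_is_distinguishable(0.70, 0.75, n_rows)
    print(n_rows, verdict)
```

Under these assumptions, the same 5-point gap is indistinguishable from noise at 20 or 200 rows and only becomes distinguishable at 2000.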
