Results and analytics
Reading benchmark outputs, spotting regressions, and exporting results.
Aggregate metrics
Start with the summary charts to see score distributions and failure rates. Comparing runs over time or across branches is only reliable when your workflow tags traces and benchmarks consistently.
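The same aggregate view can be approximated offline. The sketch below is not the product's API. It assumes each row is a dictionary with a numeric `score` field and that a score below `pass_threshold` counts as a failure. The field name, the threshold, and the `tolerance` used to flag a regression are all hypothetical.

```python
from statistics import mean, median


def summarize(rows, pass_threshold=0.5):
    """Summarize one run. `score` and `pass_threshold` are assumed names."""
    scores = [r["score"] for r in rows]
    failures = sum(1 for s in scores if s < pass_threshold)
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "failure_rate": failures / len(scores),
    }


def compare(baseline, candidate, tolerance=0.02):
    """Flag a regression when the failure rate grows by more than `tolerance`."""
    delta = candidate["failure_rate"] - baseline["failure_rate"]
    return {"failure_rate_delta": delta, "regressed": delta > tolerance}


# Illustrative data only: two runs of the same four-row benchmark.
baseline_rows = [{"score": 0.9}, {"score": 0.8}, {"score": 0.4}, {"score": 0.7}]
candidate_rows = [{"score": 0.9}, {"score": 0.3}, {"score": 0.4}, {"score": 0.7}]

baseline = summarize(baseline_rows)
candidate = summarize(candidate_rows)
result = compare(baseline, candidate)
```

Comparing failure rates rather than raw means keeps the check focused on rows that crossed the pass line, which is usually what a regression review cares about.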
Row-level inspection
Open failing rows to view inputs, outputs, and scorer rationales. Use what you find to refine prompts, scorers, or preprocessing, then rerun the benchmark to confirm the change helped.
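When there are many failing rows, grouping them by scorer rationale can show which problem to fix first. The sketch below assumes rows carry `input`, `output`, `score`, and `rationale` fields. These names are illustrative and may not match what your export contains.

```python
from collections import Counter


def failing_rows(rows, pass_threshold=0.5):
    """Return rows whose score falls below the (assumed) pass threshold."""
    return [r for r in rows if r["score"] < pass_threshold]


def rationale_counts(rows):
    """Count how often each scorer rationale appears."""
    return Counter(r["rationale"] for r in rows)


# Illustrative data only.
rows = [
    {"input": "2+2", "output": "4", "score": 1.0, "rationale": "exact match"},
    {"input": "capital of France", "output": "Lyon", "score": 0.0, "rationale": "wrong fact"},
    {"input": "capital of Peru", "output": "Cusco", "score": 0.0, "rationale": "wrong fact"},
    {"input": "summarize", "output": "", "score": 0.0, "rationale": "empty output"},
]

failures = failing_rows(rows)
top = rationale_counts(failures).most_common()
# top is ordered from most to least frequent rationale
```

Free-text rationales rarely match exactly in practice, so some normalization or manual tagging is likely needed before counting.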
Exporting
Where the UI offers it, export results as CSV or JSON for offline analysis or executive summaries.
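Once exported, either format can be loaded with the Python standard library. The sketch below assumes a CSV export with `row_id` and `score` columns and a JSON export that is a list of objects with the same keys. The actual column names and layout depend on what the UI produces. Inline strings stand in for the exported files so the example runs on its own.

```python
import csv
import io
import json


def load_csv_rows(text):
    """Parse CSV text into dicts, converting the assumed `score` column to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{**row, "score": float(row["score"])} for row in reader]


def load_json_rows(text):
    """Parse JSON text that is assumed to be a list of row objects."""
    return json.loads(text)


# Stand-ins for exported files; replace with open(path).read() on real exports.
csv_export = "row_id,score\na,0.9\nb,0.2\n"
json_export = '[{"row_id": "a", "score": 0.9}, {"row_id": "b", "score": 0.2}]'

csv_rows = load_csv_rows(csv_export)
json_rows = load_json_rows(json_export)
```

CSV values arrive as strings, so numeric columns need explicit conversion. JSON keeps numeric types, which makes it the easier format when the export will feed further analysis.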