Benchmarks
Create, inspect, run, and remove benchmarks over HTTP.
Benchmarks tie datasets and scorers to repeatable evaluation runs. These endpoints mirror what the TypeScript SDK exposes as Benchmark resource methods.
Base path: /api/v1/benchmarks. Authenticate as described in HTTP authentication.
List and create
List benchmarks
GET /api/v1/benchmarks
Returns JSON listing benchmarks visible to the authenticated project and organization.
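As a rough sketch, the list call could be made as shown below. The names `baseUrl`, `token`, `HttpFn`, and both helper functions are hypothetical, and the Bearer scheme is an assumption; use whatever HTTP authentication actually specifies. The response is returned as untyped JSON because its shape is not described here.

```typescript
// Minimal transport types so the sketch does not depend on a specific HTTP library.
interface HttpResponse { ok: boolean; status: number; json(): Promise<unknown>; }
interface RequestSpec { method: string; url: string; headers: Record<string, string>; }
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<HttpResponse>;

// Builds the request for GET /api/v1/benchmarks.
// ASSUMPTION: Bearer-token auth; substitute the scheme from HTTP authentication.
function listBenchmarksRequest(baseUrl: string, token: string): RequestSpec {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    method: "GET",
    url: `${root}/api/v1/benchmarks`,
    headers: { Accept: "application/json", Authorization: `Bearer ${token}` },
  };
}

// Sends the request and returns the parsed JSON without assuming its shape.
async function listBenchmarks(http: HttpFn, baseUrl: string, token: string): Promise<unknown> {
  const req = listBenchmarksRequest(baseUrl, token);
  const res = await http(req.url, { method: req.method, headers: req.headers });
  if (!res.ok) throw new Error(`List benchmarks failed: HTTP ${res.status}`);
  return res.json();
}
```

In a runtime that provides the standard `fetch`, it can be passed as the `http` argument; injecting the transport keeps the request builder easy to test without a network.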
Create benchmark
POST /api/v1/benchmarks
- Body: JSON with at least the fields required by your workspace (for example name, dataset reference, and scorers configuration).
- Returns: JSON describing the created benchmark, including its id for follow-up calls.
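A create request might be assembled along these lines. The body is passed through as an opaque object because the required fields depend on your workspace; the keys in `exampleBody` (`name`, `dataset`, `scorers`) and their values only echo the examples above and are assumptions, as are the helper name and the Bearer scheme.

```typescript
interface JsonRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request for POST /api/v1/benchmarks.
// ASSUMPTION: Bearer-token auth; substitute the scheme from HTTP authentication.
function createBenchmarkRequest(
  baseUrl: string,
  token: string,
  body: Record<string, unknown>,
): JsonRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    method: "POST",
    url: `${root}/api/v1/benchmarks`,
    headers: {
      Accept: "application/json",
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(body),
  };
}

// HYPOTHETICAL body: field names and value shapes are illustrative only.
const exampleBody: Record<string, unknown> = {
  name: "nightly-regression",
  dataset: "my-dataset",
  scorers: ["exact-match"],
};
```

The created benchmark's id comes from the JSON response; read it from whichever field your server returns and reuse it in the single-benchmark routes.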
Single benchmark
Replace {id} with the benchmark id.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/benchmarks/{id} | Fetch configuration and status for one benchmark. |
| DELETE | /api/v1/benchmarks/{id} | Remove the benchmark from the project. |
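Both routes in the table differ only in the HTTP method, so one small builder can cover them. This is a sketch with a hypothetical helper name and an assumed Bearer scheme; the id is percent-encoded in case it contains characters that are not safe in a path segment.

```typescript
interface RequestSpec {
  method: "GET" | "DELETE";
  url: string;
  headers: Record<string, string>;
}

// Builds a GET or DELETE request for /api/v1/benchmarks/{id}.
// ASSUMPTION: Bearer-token auth; substitute the scheme from HTTP authentication.
function benchmarkRequest(
  method: "GET" | "DELETE",
  baseUrl: string,
  token: string,
  id: string,
): RequestSpec {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    method,
    // encodeURIComponent keeps "/" or "?" in an id from altering the path.
    url: `${root}/api/v1/benchmarks/${encodeURIComponent(id)}`,
    headers: { Accept: "application/json", Authorization: `Bearer ${token}` },
  };
}
```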
Start a run
POST /api/v1/benchmarks/{id}/start
- Body: Optional JSON arguments depending on your evaluation setup (for example row subset, experiment flags, or runner options supported by the product).
- Effect: Schedules or starts benchmark execution, subject to server-side rules and quota.
To monitor progress, poll GET /api/v1/benchmarks/{id} or inspect the tasks route below.
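One way to combine start and polling is sketched below. Because this page does not specify how the benchmark JSON reports completion, the caller supplies an `isDone` predicate. `HttpFn`, `startAndWait`, the option names, the default interval, and the Bearer scheme are all assumptions, and the optional start body is omitted.

```typescript
interface HttpResponse { ok: boolean; status: number; json(): Promise<unknown>; }
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<HttpResponse>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Starts a run, then polls GET /api/v1/benchmarks/{id} until isDone() accepts
// the response or the poll budget is exhausted.
async function startAndWait(
  http: HttpFn,
  baseUrl: string,
  token: string,
  id: string,
  isDone: (benchmark: unknown) => boolean, // caller decides what "finished" means
  opts: { intervalMs: number; maxPolls: number } = { intervalMs: 5000, maxPolls: 60 },
): Promise<unknown> {
  const url = `${baseUrl.replace(/\/+$/, "")}/api/v1/benchmarks/${encodeURIComponent(id)}`;
  // ASSUMPTION: Bearer-token auth; substitute the scheme from HTTP authentication.
  const headers = { Accept: "application/json", Authorization: `Bearer ${token}` };

  const started = await http(`${url}/start`, { method: "POST", headers });
  if (!started.ok) throw new Error(`Start failed: HTTP ${started.status}`);

  for (let attempt = 0; attempt < opts.maxPolls; attempt++) {
    const res = await http(url, { method: "GET", headers });
    if (!res.ok) throw new Error(`Poll failed: HTTP ${res.status}`);
    const benchmark = await res.json();
    if (isDone(benchmark)) return benchmark;
    await sleep(opts.intervalMs);
  }
  throw new Error(`Benchmark ${id} not finished after ${opts.maxPolls} polls`);
}
```

A bounded poll count keeps a stuck run from blocking the caller indefinitely; tune the interval to stay within any rate limits your server applies.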
Tasks
GET /api/v1/benchmarks/{id}/tasks
Returns JSON describing outstanding and completed work units (tasks) associated with the benchmark run, suitable for progress UIs or batch orchestration.
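For a progress UI, the tasks response could be reduced to per-status counts. The sketch assumes the response has already been parsed into an array of objects that each carry a string `status` field; that shape and the status values used in the example are assumptions, so adapt them to what your server returns.

```typescript
// ASSUMPTION: each task exposes a string `status`; the real payload may differ.
interface TaskLike { status: string; }

// Counts tasks per status value, e.g. for a progress bar.
function countByStatus(tasks: TaskLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const task of tasks) {
    counts[task.status] = (counts[task.status] ?? 0) + 1;
  }
  return counts;
}

// Fraction of tasks whose status is in `finished` (0 when there are no tasks).
function fractionFinished(tasks: TaskLike[], finished: string[]): number {
  if (tasks.length === 0) return 0;
  const done = tasks.filter((task) => finished.includes(task.status)).length;
  return done / tasks.length;
}
```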
Best practices
- Idempotency: starting the same benchmark twice may enqueue duplicate work; guard with a client-side lock or check the benchmark's status before calling start.
- Cost: benchmark runs can invoke generative models and scorers; monitor usage while iterating.
- SDK: prefer the TypeScript SDK's Benchmark resource methods for typed requests and consistent error handling.
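The client-side lock mentioned under Idempotency can be as small as an in-flight map, sketched below. `startOnce` and the `start` callback are hypothetical names. The guard only deduplicates concurrent calls within one process while a start request is pending; it does not detect a run that is already in progress on the server, so a status check is still worthwhile.

```typescript
// Tracks start requests that have been issued but not yet settled, keyed by benchmark id.
const inFlight = new Map<string, Promise<unknown>>();

// Calls start(id) at most once per id at a time; concurrent callers share the
// same promise. The entry is cleared once the request settles, either way.
function startOnce(id: string, start: (id: string) => Promise<unknown>): Promise<unknown> {
  const existing = inFlight.get(id);
  if (existing) return existing;

  const pending = start(id).then(
    (value) => {
      inFlight.delete(id);
      return value;
    },
    (error) => {
      inFlight.delete(id);
      throw error;
    },
  );
  inFlight.set(id, pending);
  return pending;
}
```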