Frametail

Scorers

Configuring scorers that grade benchmark outputs consistently.

Types of scorers

Scorers may combine human rubrics, heuristic checks, and model-assisted judges. Choose the approach that fits what is being graded and the cost of a wrong grade: subjective aesthetics often need human review, while structural constraints can be checked automatically.
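
As an illustration of the automated end of that range, here is a minimal sketch of a heuristic check for structural constraints. The `Output` record, the `check_structure` function, and the default aspect ratio, tolerance, and format list are all hypothetical names and values chosen for the example; they are not part of Frametail.

```python
from dataclasses import dataclass


@dataclass
class Output:
    # Hypothetical record describing one benchmark output (not a Frametail type).
    width: int
    height: int
    file_format: str


def check_structure(output, expected_aspect=16 / 9, tolerance=0.01,
                    allowed_formats=("png", "jpeg")):
    """Return a list of failure messages; an empty list means the output passes."""
    failures = []
    if output.width <= 0 or output.height <= 0:
        failures.append("non-positive dimensions")
    else:
        aspect = output.width / output.height
        if abs(aspect - expected_aspect) > tolerance:
            failures.append(f"aspect ratio {aspect:.3f} outside tolerance")
    if output.file_format.lower() not in allowed_formats:
        failures.append(f"unsupported format: {output.file_format}")
    return failures


if __name__ == "__main__":
    # Assumed sample values, for illustration only.
    print(check_structure(Output(1920, 1080, "png")))
    print(check_structure(Output(1024, 1024, "gif")))
```

Returning a list of failure messages rather than a single boolean keeps the reason for each failed grade available for later review.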

Writing effective prompts

Judge prompts should reference observable criteria (“Are hands anatomically plausible?”) instead of vague preferences (“Is it nice?”). Provide JSON schemas when you need machine-parseable outputs for downstream analytics.
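
The sketch below shows one way to pair an observable-criteria prompt with a JSON schema and to validate the judge's reply before it reaches analytics. The field names (`hands_plausible`, `score`, `rationale`), the 1 to 5 scale, and the `parse_verdict` helper are assumptions made for the example, not a Frametail format. Validation is written by hand with the standard library only; a dedicated schema validator could replace it.

```python
import json

# Hypothetical verdict schema; field names and score range are assumptions.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "hands_plausible": {"type": "boolean"},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "rationale": {"type": "string"},
    },
    "required": ["hands_plausible", "score", "rationale"],
    "additionalProperties": False,
}

# The prompt names an observable criterion and embeds the schema verbatim.
JUDGE_PROMPT = (
    "Are the hands in the image anatomically plausible? "
    "Count the fingers and check the joint angles.\n"
    "Reply with a single JSON object matching this schema:\n"
    + json.dumps(VERDICT_SCHEMA, indent=2)
)


def parse_verdict(raw):
    """Parse a judge reply and raise ValueError if it does not fit the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if set(data) != set(VERDICT_SCHEMA["required"]):
        raise ValueError(f"unexpected or missing keys: {sorted(data)}")
    if not isinstance(data["hands_plausible"], bool):
        raise ValueError("hands_plausible must be a boolean")
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    if not isinstance(data["rationale"], str):
        raise ValueError("rationale must be a string")
    return data
```

Rejecting malformed replies at parse time means a badly formatted judge answer surfaces as an error instead of silently skewing downstream aggregates.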

Calibration

Before trusting a new scorer, run a pilot benchmark and manually review a sample of its grades. Adjust the rubric or prompt wording and rerun the pilot until the scorer's grades correlate with expert judgment on the same sample.
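
One way to quantify agreement with expert judgment is a rank correlation between the scorer's grades and expert grades on the same pilot sample. The sketch below computes Spearman's rank correlation using only the standard library. The function names are hypothetical, the choice of Spearman is an assumption (other agreement measures may suit your data better), and the sample grades are invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (ValueError) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(scorer_grades, expert_grades):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(scorer_grades) != len(expert_grades) or len(scorer_grades) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    return pearson(average_ranks(scorer_grades), average_ranks(expert_grades))


if __name__ == "__main__":
    # Invented pilot data: one grade per reviewed output.
    scorer = [4, 2, 5, 3, 1, 4]
    expert = [5, 2, 4, 3, 1, 3]
    print(f"Spearman correlation: {spearman(scorer, expert):.2f}")
```

A rank correlation only checks that the scorer orders outputs the way experts do; it does not show that the absolute grades match, so spot-checking individual grades remains useful.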