Scorers
Configuring scorers that grade benchmark outputs consistently.
Types of scorers
A scorer can combine human rubrics, heuristic checks, and model-assisted judges. Match the approach to the kind of criterion and the cost of a wrong grade: subjective qualities such as aesthetics often need human review, while structural constraints can be checked automatically.
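As an illustration of an automated structural check, here is a minimal sketch of a heuristic scorer. It assumes the output under test is an image with known pixel dimensions. The names `Score` and `check_dimensions` are hypothetical and not part of any particular benchmark framework.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Result of one heuristic check (hypothetical type, for illustration)."""
    passed: bool
    reason: str


def check_dimensions(width: int, height: int,
                     expected_width: int, expected_height: int) -> Score:
    """Pass only if the output has exactly the requested dimensions."""
    if (width, height) == (expected_width, expected_height):
        return Score(True, "dimensions match")
    return Score(
        False,
        f"expected {expected_width}x{expected_height}, got {width}x{height}",
    )
```

A check like this is deterministic and cheap to run, so it can be applied to every output, leaving human review for criteria that cannot be stated as a rule.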
Writing effective prompts
Judge prompts should reference observable criteria (“Are hands anatomically plausible?”) instead of vague preferences (“Is it nice?”). Provide JSON schemas when you need machine-parseable outputs for downstream analytics.
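The following sketch shows one way to pair a judge prompt with a schema and to reject replies that do not conform. The field names (`criterion`, `pass`, `rationale`) and the function `parse_judge_reply` are assumptions made for illustration. The validation is hand-rolled with the standard library and covers only the keywords this schema uses; it is not a general JSON Schema validator.

```python
import json

# Hypothetical schema for a judge reply; field names are illustrative.
JUDGE_SCHEMA = {
    "type": "object",
    "properties": {
        "criterion": {"type": "string"},
        "pass": {"type": "boolean"},
        "rationale": {"type": "string"},
    },
    "required": ["criterion", "pass", "rationale"],
    "additionalProperties": False,
}

# Maps the schema type names used above to Python types.
_PY_TYPES = {"string": str, "boolean": bool}


def parse_judge_reply(raw: str) -> dict:
    """Parse a judge reply and check it against JUDGE_SCHEMA.

    Raises ValueError on malformed JSON or on any schema violation.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = [k for k in JUDGE_SCHEMA["required"] if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    extra = [k for k in data if k not in JUDGE_SCHEMA["properties"]]
    if extra:
        raise ValueError(f"unexpected fields: {extra}")
    for key, spec in JUDGE_SCHEMA["properties"].items():
        if not isinstance(data[key], _PY_TYPES[spec["type"]]):
            raise ValueError(f"field {key!r} must be of type {spec['type']}")
    return data
```

Rejecting malformed replies at parse time keeps free-text answers such as "yes" out of the analytics tables, where they would otherwise need to be cleaned up later.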
Calibration
Before trusting a new scorer, run a pilot benchmark and manually review a sample of its grades. Adjust the rubric or prompt wording until the scorer's grades correlate with expert judgment on that sample.
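One way to make "correlate with expert judgment" measurable is to compute a correlation coefficient between scorer grades and expert grades on the pilot sample. The sketch below uses Pearson correlation. The function names and the 0.8 threshold are assumptions chosen for illustration, not recommended values. For ordinal grades, a rank correlation may be a better fit.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of grades."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all grades are equal")
    return cov / math.sqrt(var_x * var_y)


def is_calibrated(scorer_grades: list[float], expert_grades: list[float],
                  threshold: float = 0.8) -> bool:
    """True if the scorer tracks the experts at least as well as `threshold`.

    The default threshold is an illustrative assumption.
    """
    return pearson(scorer_grades, expert_grades) >= threshold
```

A high correlation on a small sample is weak evidence, so the manual review of individual grades remains useful alongside the summary number.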