Scorers

Types of scorers

Scorers can combine human rubrics, heuristic checks, and model-assisted judges. Choose the approach that fits what you are measuring and how costly a wrong grade would be: subjective aesthetics often need human review, while structural constraints can be automated.

Media scorers (video & image)

Adherence — Does the output follow the specified constraints?
Realism — Does the output look authentic and believable?
NSFW — Flags adult content.
Adaptive — Customizable rubric scored by an LLM judge.

Text / LLM scorers

For pure-text and LLM trace evaluation:

LLM-judge (call a model to score):

Factuality — Is the answer factually accurate?
AnswerRelevancy — Does the response address the question?
Conciseness — Is the answer appropriately brief?

Heuristic (deterministic, no model call):

ExactMatch — Whitespace-normalized exact match to expected output.
JsonMatch — Structural equality of JSON (key-order insensitive).
Levenshtein — String similarity (0–1) via edit distance.
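
To make the three heuristic checks concrete, here is a minimal Python sketch of how each could be computed. The function names, the normalization rule (collapse whitespace runs and trim), and the similarity formula (1 minus edit distance divided by the longer length) are assumptions for illustration; the built-in scorers may differ in their details.

```python
import json


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if the strings match after whitespace normalization."""
    # Assumption: "normalized" means collapsing whitespace runs and trimming.
    def normalize(s: str) -> str:
        return " ".join(s.split())

    return 1.0 if normalize(output) == normalize(expected) else 0.0


def json_match(output: str, expected: str) -> float:
    """Return 1.0 if both strings parse to structurally equal JSON."""
    try:
        # Parsed objects become dicts, which compare equal regardless of key order.
        return 1.0 if json.loads(output) == json.loads(expected) else 0.0
    except json.JSONDecodeError:
        # Assumption: unparseable input counts as a mismatch, not an error.
        return 0.0


def levenshtein_similarity(output: str, expected: str) -> float:
    """Return 1 - distance / max_len, so identical strings score 1.0."""
    if not output and not expected:
        return 1.0
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(output, start=1):
        curr = [i]
        for j, b in enumerate(expected, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(output), len(expected))
```

Because none of these call a model, they are cheap and repeatable, which makes them a reasonable first line of checks before any LLM judge runs.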

Writing effective prompts

Judge prompts should reference observable criteria (“Are hands anatomically plausible?”) instead of vague preferences (“Is it nice?”). Provide JSON schemas when you need machine-parseable outputs for downstream analytics.
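
As an illustration of both points, the sketch below builds a judge prompt from observable yes/no criteria, embeds a JSON schema for the reply, and validates whatever the judge returns. The criteria keys, `build_judge_prompt`, and `parse_judge_reply` are hypothetical names chosen for this example, and the call to the judge model itself is left out.

```python
import json

# Hypothetical rubric: each criterion is an observable yes/no question.
CRITERIA = {
    "hands_plausible": "Are hands anatomically plausible?",
    "text_legible": "Is any rendered text legible?",
}


def build_judge_prompt(criteria: dict[str, str]) -> str:
    """Assemble a judge prompt that asks for a JSON object of booleans."""
    questions = "\n".join(f"- {key}: {q}" for key, q in criteria.items())
    schema = {
        "type": "object",
        "properties": {key: {"type": "boolean"} for key in criteria},
        "required": list(criteria),
        "additionalProperties": False,
    }
    return (
        "Answer each question about the image with true or false.\n"
        f"{questions}\n"
        "Reply with only a JSON object matching this schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )


def parse_judge_reply(reply: str, criteria: dict[str, str]) -> dict[str, bool]:
    """Check that the reply has exactly the rubric keys and boolean values."""
    data = json.loads(reply)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or set(data) != set(criteria):
        raise ValueError("reply keys do not match the rubric")
    if not all(isinstance(v, bool) for v in data.values()):
        raise ValueError("every answer must be a JSON boolean")
    return data
```

Rejecting malformed replies at parse time keeps bad judge outputs from silently entering downstream analytics.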

Calibration

Before trusting a new scorer, run a pilot benchmark and manually review a sample of its grades. Adjust the judge prompt or rubric wording until the scores correlate with expert judgment.
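
One way to quantify "correlate with expert judgment" is a Pearson correlation between the scorer's grades and expert grades on the same pilot items. The sketch below is a plain-Python version under that assumption; the grade lists are made-up illustrative data, and a rank correlation or simple agreement rate may suit some rubrics better.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / sqrt(var_x * var_y)


# Made-up pilot data for illustration only: one grade per reviewed item.
scorer_grades = [0.9, 0.4, 0.7, 0.2, 0.8]
expert_grades = [1.0, 0.5, 0.5, 0.0, 1.0]

r = pearson(scorer_grades, expert_grades)
print(f"scorer vs. expert correlation: {r:.2f}")
```

Recomputing this after each wording change shows whether the revision moved the scorer closer to the experts or further away.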
