Designing scorers that survive contact with real users
Automated scorers promise objective evaluation of generative video, but the reality is messier. What looks like a clean metric in the lab often frays when exposed to real user preferences, edge cases, and changing product requirements.
The scorer lifecycle
A scorer starts as a hypothesis: this attribute matters, and we can measure it automatically. Early versions are usually simple heuristics or off-the-shelf models. They work well on obvious cases but struggle with nuance.
The critical phase is validation against human judgment. Run your scorer on a representative sample and compare its ratings to human ratings. Where they disagree, ask why. Often you will find the scorer is optimizing for something adjacent to what users actually care about.
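One way to make this comparison concrete is a rank correlation between scorer and human ratings, plus a list of the items where the two differ most. The following is a minimal sketch, not a prescribed method: the function names are hypothetical, and it assumes both ratings are numeric and already on the same scale.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def largest_disagreements(ids, scorer, human, top_k=3):
    """Return the top_k (id, scorer, human) rows with the largest absolute gap.

    Assumes scorer and human ratings share one scale (an assumption of this sketch).
    """
    rows = sorted(zip(ids, scorer, human), key=lambda r: abs(r[1] - r[2]), reverse=True)
    return rows[:top_k]
```

The correlation summarizes overall alignment; the disagreement list is where the "ask why" step starts.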
Designing for disagreement
Not all disagreements are equal. We bucket them into three categories:
- False positives: The scorer flags a failure that humans do not see. Usually a sign the metric is too strict or misaligned with user perception.
- False negatives: Humans see a problem the scorer misses. Often indicates missing signal in the scoring model or an underspecified criterion.
- Preference divergence: Scorer and humans rate differently but neither is obviously wrong. This is the hardest case, requiring product judgment about which signal to trust.
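The three buckets can be approximated in code as a first-pass triage before manual review. This is a rough sketch under stated assumptions: scores are normalized to [0, 1], a score below `pass_threshold` counts as a flagged failure, and a wide spread among human raters is used as a stand-in for "neither is obviously wrong". All names and thresholds here are hypothetical and would need tuning for a real product.

```python
from enum import Enum
from statistics import mean, pstdev


class Disagreement(Enum):
    AGREE = "agree"
    FALSE_POSITIVE = "false_positive"
    FALSE_NEGATIVE = "false_negative"
    PREFERENCE_DIVERGENCE = "preference_divergence"


def bucket(scorer_score, human_scores, pass_threshold=0.5,
           tolerance=0.2, split_stdev=0.25):
    """Assign one item to a disagreement bucket (heuristic, assumed thresholds)."""
    human_mean = mean(human_scores)

    # Close enough to the human mean: not a disagreement at all.
    if abs(scorer_score - human_mean) <= tolerance:
        return Disagreement.AGREE

    # Humans are split among themselves, so there is no clear ground truth.
    if pstdev(human_scores) >= split_stdev:
        return Disagreement.PREFERENCE_DIVERGENCE

    scorer_pass = scorer_score >= pass_threshold
    human_pass = human_mean >= pass_threshold
    if human_pass and not scorer_pass:
        return Disagreement.FALSE_POSITIVE  # scorer flags a failure humans do not see
    if scorer_pass and not human_pass:
        return Disagreement.FALSE_NEGATIVE  # humans see a problem the scorer misses

    # Same side of the threshold but far apart: treat as a difference in degree.
    return Disagreement.PREFERENCE_DIVERGENCE
```

For example, `bucket(0.2, [0.9, 0.85, 0.9])` lands in the false-positive bucket, while `bucket(0.9, [0.1, 0.9, 0.2, 0.8])` is treated as preference divergence because the raters themselves are split.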
Iterating with feedback loops
The best scorers evolve through tight feedback loops. Collect human ratings on model outputs, compare to automated scores, and retrain or refine the scorer. This is not one-time calibration; it is ongoing maintenance as your product and user base change.
We recommend a hybrid approach: automated scoring for the bulk of evaluation, with human spot-checking on outliers and a periodic full-sample audit. The automation gives you scale; the human oversight keeps you honest.
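As an illustration of the hybrid approach, the sketch below decides which items in a batch go to human reviewers. It assumes outliers are defined by z-score within the batch and that every `audit_every`-th batch is audited in full; both choices, and the function name, are assumptions of this example rather than a fixed recipe.

```python
from statistics import mean, pstdev


def plan_human_review(batch_index, scores, z_cutoff=2.0, audit_every=10):
    """Return the sorted item ids that humans should review for this batch.

    scores: dict mapping item id -> automated score.
    Every audit_every-th batch (including batch 0) gets a full-sample audit;
    other batches send only statistical outliers for spot-checking.
    """
    # Periodic full-sample audit: humans rate everything in the batch.
    if audit_every > 0 and batch_index % audit_every == 0:
        return sorted(scores)

    values = list(scores.values())
    if not values:
        return []

    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all scores identical, so nothing stands out

    # Spot-check items whose score sits far from the batch mean.
    return sorted(
        item for item, s in scores.items()
        if abs(s - mu) / sigma >= z_cutoff
    )
```

A z-score cutoff is only one possible definition of an outlier; items near a decision threshold or with low scorer confidence are equally reasonable candidates for spot-checking.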
Knowing when to give up
Some qualities resist automated scoring. Aesthetic taste, humor, narrative coherence: these often require human judgment, even at scale. The right move is not to force a bad automated scorer but to design workflows that make human evaluation efficient and consistent.
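One common way to check whether human evaluation is consistent is to measure agreement between two raters on the same items. The sketch below computes Cohen's kappa for categorical labels; it is offered as one possible consistency check, and the label values in the usage note are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)
```

With hypothetical labels, two raters who agree on every item score 1.0, and raters who agree only as often as chance predicts score 0.0. A low value suggests the rating guidelines need tightening before the human signal is used to judge a scorer.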
The goal is reliable signal, not perfect automation. Scorers that survive contact with real users are the ones that know their limits.