Regression detection
Automatically detect scorer performance drops between benchmark runs.
When you re-run a benchmark after changing a model, prompt, or system, Frametail automatically compares the results against the previous run and flags any scorer whose average score dropped by at least the regression threshold.
Viewing regressions
On the benchmark detail page, a regression banner appears at the top if any scorers dropped since the previous run:
- Green checkmark — No regressions detected (all scorers stable or improved).
- Yellow warning — N scorers regressed. The banner shows each scorer's before→after score and the delta.
Click into the run to inspect which samples drove the regression.
How regressions are detected
Frametail compares the average score per scorer across all samples:
- Baseline = most recent completed run of the same evaluation before the current run
- Candidate = current run
- Regression threshold = 0.05 (5 percentage points) by default
If candidate_average − baseline_average ≤ −0.05, the scorer is flagged as regressed.
Example:
- Baseline: Factuality 0.92 (average over 92 samples)
- Candidate: Factuality 0.85
- Delta: −0.07 → regressed (dropped by more than 5 percentage points)
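The comparison rule above can be sketched in a few lines of Python. This is a minimal illustration, not Frametail's implementation: the function name `find_regressions`, the dict-of-averages input shape, and the rounding step are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Regression(NamedTuple):
    scorer: str
    baseline: float
    candidate: float
    delta: float


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    threshold: float = 0.05,
) -> List[Regression]:
    """Flag scorers whose average dropped by at least `threshold`."""
    flagged = []
    for scorer, base_avg in baseline.items():
        if scorer not in candidate:
            continue  # scorer absent from the candidate run: nothing to compare
        # Round to suppress binary floating-point noise in the subtraction
        # (an assumption of this sketch, so a drop of exactly 0.05 is flagged).
        delta = round(candidate[scorer] - base_avg, 6)
        if delta <= -threshold:
            flagged.append(Regression(scorer, base_avg, candidate[scorer], delta))
    return flagged


# The Factuality example from above: 0.92 -> 0.85 is flagged.
print(find_regressions({"Factuality": 0.92}, {"Factuality": 0.85}))
```

Because the rule uses ≤, a drop of exactly the threshold counts as a regression in this sketch.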
Programmatic access
Query regressions server-side via the GraphQL / REST API:
# Compare a run to its predecessor (same evaluation)
curl -X POST https://api.frametail.io/graphql \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "query { benchmarkRegressionVsPrevious(benchmarkId: \"...\") { baseline { scorerAverages } candidate { scorerAverages } deltas { scorerId delta } regressions { scorerId baseline candidate delta } } }"
}'
Or compare any two runs manually:
# Compare baseline run to candidate run
curl -X POST https://api.frametail.io/graphql \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "query { benchmarkCompare(baselineId: \"...\", candidateId: \"...\") { deltas { scorerId baseline candidate delta } regressions { ... } } }"
}'
Customizing the threshold
Pass regressionThreshold to use a stricter or more lenient threshold:
curl -X POST https://api.frametail.io/graphql \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "query { benchmarkRegressionVsPrevious(benchmarkId: \"...\", regressionThreshold: 0.10) { ... } }"
}'
Default is 0.05 (5 percentage points). Common alternatives:
- 0.03 — Very strict (catch small drops)
- 0.10 — Lenient (only flag large regressions)
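Hand-escaping quotes inside a JSON-encoded GraphQL query is error-prone. The sketch below builds the same kind of request body in Python and lets `json.dumps` handle the escaping. The helper names (`build_regression_query`, `post_graphql`) are hypothetical; the field and argument names are the ones used in the curl examples above, and the `Content-Type: application/json` header is an assumption about what the endpoint expects.

```python
import json
import urllib.request
from typing import Optional

GRAPHQL_URL = "https://api.frametail.io/graphql"


def build_regression_query(benchmark_id: str, threshold: Optional[float] = None) -> str:
    """Return the JSON request body for benchmarkRegressionVsPrevious."""
    # json.dumps yields a double-quoted, escaped string; for plain IDs this
    # is also a valid GraphQL string literal.
    args = f"benchmarkId: {json.dumps(benchmark_id)}"
    if threshold is not None:
        args += f", regressionThreshold: {threshold}"
    query = (
        f"query {{ benchmarkRegressionVsPrevious({args}) "
        "{ regressions { scorerId baseline candidate delta } } }"
    )
    return json.dumps({"query": query})


def post_graphql(body: str, api_key: str) -> dict:
    """POST a prepared request body and decode the JSON response."""
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed requirement
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `build_regression_query` without a threshold leaves the argument out, so the server-side default of 0.05 applies.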
Integrations
You can hook Frametail regression alerts into your CI/CD pipeline:
- After a benchmark run completes, fetch benchmarkRegressionVsPrevious.
- If regressions.length > 0, fail the workflow or send a Slack/Teams notification.
- Optionally block a deployment until regressions are resolved.
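The steps above can be sketched as a small gate function that takes the decoded JSON response of `benchmarkRegressionVsPrevious` and returns a process exit code. The function name `gate_exit_code` is hypothetical. Treating a top-level `errors` entry or missing data as a failure is a design choice assumed here, based on the usual GraphQL response shape, so that an API error is not mistaken for a clean run.

```python
import sys
from typing import Any, Dict


def gate_exit_code(response: Dict[str, Any]) -> int:
    """Return 0 when no scorer regressed, 1 otherwise."""
    if response.get("errors"):
        print(f"API returned errors: {response['errors']}", file=sys.stderr)
        return 1
    result = (response.get("data") or {}).get("benchmarkRegressionVsPrevious")
    if result is None:
        print("No comparison data in response.", file=sys.stderr)
        return 1
    regressions = result.get("regressions") or []
    if regressions:
        names = ", ".join(str(r.get("scorerId")) for r in regressions)
        print(f"{len(regressions)} scorer(s) regressed: {names}")
        return 1
    print("No regressions detected.")
    return 0


# Illustrative response with one regressed scorer (made-up scorer ID).
sample = {"data": {"benchmarkRegressionVsPrevious": {"regressions": [{"scorerId": "factuality"}]}}}
print(gate_exit_code(sample))
```

A CI job could pass the returned value to `sys.exit`. How a first run with no predecessor is reported is not covered here and may need separate handling.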
Example GitHub Actions snippet:
- name: Check for benchmark regressions
  run: |
    # -sS hides the progress meter but still prints curl errors
    RESULT=$(curl -sS -X POST https://api.frametail.io/graphql \
      -H "Authorization: Bearer ${{ secrets.FRAMETAIL_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d "{\"query\": \"query { benchmarkRegressionVsPrevious(benchmarkId: \\\"${{ env.BENCHMARK_ID }}\\\") { regressions { scorerId } } }\"}")
    REGRESSION_COUNT=$(echo "$RESULT" | jq '.data.benchmarkRegressionVsPrevious.regressions | length')
    if [ "$REGRESSION_COUNT" -gt 0 ]; then
      echo "❌ $REGRESSION_COUNT scorer(s) regressed!"
      exit 1
    fi
    echo "✅ No regressions detected."
Aggregation & trends
For longer-term trend analysis, store historical averages:
# Fetch scorer averages for a specific run (without comparing)
curl -X POST https://api.frametail.io/graphql \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "query { benchmarkSummary(benchmarkId: \"...\") { rowCount scorerAverages { scorerId average scored total } } }"
}'
Collect these over time to build a performance dashboard and spot trends before they become regressions.
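One way to spot a slow decline before it crosses the regression threshold is to compare the oldest and newest stored averages for each scorer. The sketch below assumes the per-run averages have already been collected into a list ordered oldest to newest. The `history` shape, the function name `drifting_scorers`, the sample scores, and the 0.03 drift limit are illustrative assumptions, not Frametail features.

```python
from typing import Dict, List


def drifting_scorers(
    history: List[Dict[str, float]],
    drift_limit: float = 0.03,
) -> Dict[str, float]:
    """Return scorers whose average fell by at least `drift_limit`
    between the first and last run in `history` (oldest first)."""
    if len(history) < 2:
        return {}
    first, last = history[0], history[-1]
    drifting = {}
    for scorer, start in first.items():
        if scorer in last:
            # Round to suppress binary floating-point noise.
            change = round(last[scorer] - start, 6)
            if change <= -drift_limit:
                drifting[scorer] = change
    return drifting


# Made-up history: each step drops factuality by 0.02, below the 0.05
# per-run threshold, but the cumulative change is larger.
runs = [
    {"factuality": 0.92, "relevance": 0.88},
    {"factuality": 0.90, "relevance": 0.88},
    {"factuality": 0.88, "relevance": 0.89},
]
print(drifting_scorers(runs))
```

In this sample no single run-to-run comparison would be flagged at the default threshold, yet the first-to-last comparison surfaces the decline in `factuality`.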