Frametail
Benchmarks

Regression detection

Automatically detect scorer performance drops between benchmark runs.

When you re-run a benchmark after changing a model, prompt, or system, Frametail automatically compares the results against the previous run and flags any scorer regressions — indicating a performance drop worth investigating.

Viewing regressions

On the benchmark detail page, a regression banner appears at the top if any scorers dropped since the previous run:

  • Green checkmark — No regressions detected (all scorers stable or improved).
  • Yellow warning — N scorers regressed. The banner shows each scorer's before→after score and the delta.

Click into the run to inspect which samples drove the regression.

How regressions are detected

Frametail compares the average score per scorer across all samples:

  • Baseline = most recent completed run of this evaluation
  • Candidate = current run
  • Regression threshold = 0.05 (5 percentage points) by default

If candidate_average − baseline_average ≤ −0.05, the scorer is flagged as regressed.

Example:

  • Baseline Factuality: 0.92 (92 samples average)
  • Candidate Factuality: 0.85
  • Delta: −0.07 → regressed (dropped > 5%)

Programmatic access

Query regressions server-side via the GraphQL / REST API:

# Compare a run to its predecessor (same evaluation)
curl -X POST https://api.frametail.io/graphql \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "query": "query { benchmarkRegressionVsPrevious(benchmarkId: \"...\") { baseline { scorerAverages } candidate { scorerAverages } deltas { scorerId delta } regressions { scorerId baseline candidate delta } } }"
  }'

Or compare any two runs manually:

# Compare baseline run to candidate run
curl -X POST https://api.frametail.io/graphql \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "query": "query { benchmarkCompare(baselineId: \"...\", candidateId: \"...\") { deltas { scorerId baseline candidate delta } regressions { ... } } }"
  }'

Customizing the threshold

Use a stricter or more lenient threshold:

curl -X POST https://api.frametail.io/graphql \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "query": "query { benchmarkRegressionVsPrevious(benchmarkId: \"...\", regressionThreshold: 0.10) { ... } }"
  }'

Default is 0.05 (5%). Common alternatives:

  • 0.03 — Very strict (catch small drops)
  • 0.10 — Lenient (only flag large regressions)

Integrations

You can hook Frametail regression alerts into your CI/CD pipeline:

  1. After a benchmark run completes, fetch benchmarkRegressionVsPrevious.
  2. If regressions.length > 0, fail the workflow or send a Slack/Teams notification.
  3. Optionally block a deployment until regressions are resolved.

Example GitHub Actions snippet:

- name: Check for benchmark regressions
  run: |
    RESULT=$(curl -X POST https://api.frametail.io/graphql \
      -H "Authorization: Bearer ${{ secrets.FRAMETAIL_API_KEY }}" \
      -d "{\"query\": \"query { benchmarkRegressionVsPrevious(benchmarkId: \\\"${{ env.BENCHMARK_ID }}\\\") { regressions { scorerId } } }\"}")
    
    REGRESSION_COUNT=$(echo "$RESULT" | jq '.data.benchmarkRegressionVsPrevious.regressions | length')
    if [ "$REGRESSION_COUNT" -gt 0 ]; then
      echo "❌ $REGRESSION_COUNT scorer(s) regressed!"
      exit 1
    fi
    echo "✅ No regressions detected."

For longer-term trend analysis, store historical averages:

# Fetch scorer averages for a specific run (without comparing)
curl -X POST https://api.frametail.io/graphql \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "query": "query { benchmarkSummary(benchmarkId: \"...\") { rowCount scorerAverages { scorerId average scored total } } }"
  }'

Collect these over time to build a performance dashboard and spot trends before they become regressions.