LLM evaluation quick start

Frametail now supports LLM evaluation alongside video and image scoring. Run LLM benchmarks in the cloud, score live production traces, and detect performance regressions, all on the same infrastructure that powers video evals.

What's new

1. Text/LLM scorers

Score pure-text and LLM outputs using:

LLM-judge scorers (call a model to evaluate):
- Factuality — Is the answer factually correct?
- AnswerRelevancy — Does it address the question?
- Conciseness — Is it appropriately brief?

Heuristic scorers (no model call, deterministic):
- ExactMatch — Whitespace-normalized exact match to expected.
- JsonMatch — Structural equality of JSON (key-order insensitive).
- Levenshtein — String similarity (0–1) via edit distance.
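
To make the three heuristic scorers concrete, here is a minimal sketch of how each could be computed. These are illustrative re-implementations based only on the one-line descriptions above; the function names are hypothetical and Frametail's own scorers may normalize or handle edge cases differently.

```typescript
// Illustrative only: not Frametail's implementation.

// Collapse runs of whitespace and trim, so "a  b\n" compares equal to "a b".
function normalizeWhitespace(s: string): string {
  return s.trim().replace(/\s+/g, ' ')
}

// ExactMatch: 1 if the whitespace-normalized strings are identical, else 0.
function exactMatch(output: string, expected: string): number {
  return normalizeWhitespace(output) === normalizeWhitespace(expected) ? 1 : 0
}

// Recursively sort object keys so key order does not affect the comparison.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize)
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const sorted: Record<string, unknown> = {}
    for (const key of Object.keys(obj).sort()) sorted[key] = canonicalize(obj[key])
    return sorted
  }
  return value
}

// JsonMatch: 1 if both strings parse to structurally equal JSON, else 0.
function jsonMatch(output: string, expected: string): number {
  try {
    const a = JSON.stringify(canonicalize(JSON.parse(output)))
    const b = JSON.stringify(canonicalize(JSON.parse(expected)))
    return a === b ? 1 : 0
  } catch {
    return 0 // assumption: unparseable JSON counts as a mismatch
  }
}

// Levenshtein: 1 - (edit distance / longer length), so identical strings score 1.
function levenshteinSimilarity(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return 1 - prev[b.length] / Math.max(a.length, b.length)
}
```

With this definition, levenshteinSimilarity('kitten', 'sitting') is 1 - 3/7, roughly 0.57, because the edit distance is 3 and the longer string has 7 characters.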

2. LLM benchmarks in the cloud

Create a benchmark with an LLM model (e.g., openai/gpt-4o-mini) instead of a video model:

Create an evaluation — Go to Evaluations → New Evaluation → LLM Text Generation.
Upload a dataset — CSV/JSON with prompt (or input), optional expected column for reference output.
Configure scorers — Pick text scorers (heuristic or LLM-judge).
Run — Frametail generates text via OpenRouter and scores each sample.

Results show per-sample scores and averages, just like video benchmarks.
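
As a rough illustration of the dataset shape and the averaging described above, the sketch below normalizes rows that use either prompt or input, then averages per-sample scores by scorer. The type and function names are hypothetical, and the upload format Frametail accepts may carry more columns than shown here.

```typescript
// Hypothetical row shape: the docs only specify `prompt` (or `input`)
// plus an optional `expected` column.
interface RawRow { prompt?: string; input?: string; expected?: string }
interface Sample { prompt: string; expected?: string }

// Accept either column name and reject rows that have neither.
function toSamples(rows: RawRow[]): Sample[] {
  return rows.map((row, i) => {
    const prompt = row.prompt ?? row.input
    if (prompt === undefined || prompt.trim() === '') {
      throw new Error(`Row ${i}: missing prompt/input`)
    }
    return row.expected === undefined ? { prompt } : { prompt, expected: row.expected }
  })
}

// Average each scorer's value across all samples that reported it.
function averageScores(perSample: Array<Record<string, number>>): Record<string, number> {
  const sums: Record<string, number> = {}
  const counts: Record<string, number> = {}
  for (const scores of perSample) {
    for (const [name, value] of Object.entries(scores)) {
      sums[name] = (sums[name] ?? 0) + value
      counts[name] = (counts[name] ?? 0) + 1
    }
  }
  const averages: Record<string, number> = {}
  for (const name of Object.keys(sums)) averages[name] = sums[name] / counts[name]
  return averages
}
```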

3. Live scoring of LLM traces

Instrument your LLM app with the Frametail SDK or OpenTelemetry, and Frametail will score production traces in real time:

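import Frametail from 'frametail'
import OpenAI from 'openai'

const client = new Frametail()
const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Work done inside client.trace() is recorded as a span named 'check-answer',
// which a live-scoring rule can then match and score
const result = await client.trace('check-answer', async (span) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Explain quantum computing.' }],
  })
  // Return the generated text so it becomes the result of the traced call
  return response.choices[0].message.content
})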

Then define a live-scoring rule to grade matching traces:

Rules → New Rule → Match on span name or attributes.
Scorers — Pick text scorers.
Expected output (optional) — Fixed text, or a column from a linked dataset.
Save — Scores flow into your trace automatically.
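
Conceptually, a live-scoring rule pairs a match condition with a list of scorers. The sketch below shows one way such matching could work. The Span and Rule shapes are hypothetical, not Frametail's schema, and the sketch assumes that every condition present on a rule must hold for a span to match.

```typescript
// Hypothetical shapes for illustration only.
type AttrValue = string | number | boolean
interface Span { name: string; attributes: Record<string, AttrValue> }
interface Rule {
  spanName?: string                       // match on span name, if set
  attributes?: Record<string, AttrValue>  // match on attribute values, if set
  scorers: string[]                       // e.g. ['Factuality', 'Conciseness']
}

// A span matches when every condition the rule specifies is satisfied.
function matchesRule(span: Span, rule: Rule): boolean {
  if (rule.spanName !== undefined && span.name !== rule.spanName) return false
  for (const [key, value] of Object.entries(rule.attributes ?? {})) {
    if (span.attributes[key] !== value) return false
  }
  return true
}

// Collect the scorers from all rules that match a span, without duplicates.
function scorersFor(span: Span, rules: Rule[]): string[] {
  const selected = new Set<string>()
  for (const rule of rules) {
    if (matchesRule(span, rule)) rule.scorers.forEach((s) => selected.add(s))
  }
  return [...selected]
}
```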

4. Regression detection

Rerun a benchmark after tweaking your model or prompt. Frametail automatically compares it to the previous run and flags any scorer regressions:

Green banner — No regressions (all scorers stable or improved).
Yellow banner — N scorers regressed (shows before→after deltas).

Use the API to bring regression results into your CI/CD pipeline, for example to fail a build when a scorer regresses.
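
The comparison between two runs can be sketched as a per-scorer diff of average scores. This is an assumption about the general approach, not a description of Frametail's logic: the findRegressions name, the Regression shape, and the tolerance parameter are all hypothetical, and the thresholds Frametail actually applies are not specified here.

```typescript
// Hypothetical result shape for illustration.
interface Regression { scorer: string; before: number; after: number; delta: number }

// Flag scorers whose average dropped by more than `tolerance` between runs.
// Scorers missing from the current run are skipped rather than flagged.
function findRegressions(
  previous: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0.01, // assumed noise threshold, not a Frametail default
): Regression[] {
  const regressions: Regression[] = []
  for (const [scorer, before] of Object.entries(previous)) {
    const after = current[scorer]
    if (after === undefined) continue
    const delta = after - before
    if (delta < -tolerance) regressions.push({ scorer, before, after, delta })
  }
  return regressions
}
```

In a CI job, a script could fetch the averages for the two most recent runs, call a function like this, and exit with a non-zero status when the returned list is not empty.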

Choosing between cloud benchmarks and live scoring

Use case	Cloud Benchmark	Live Scoring
Evaluate a new model systematically	✅	❌
Grade production traffic automatically	❌	✅
Detect regressions over time	✅	✅
Compare A/B changes	✅	❌
Sample traces for manual review	✅	✅

Getting started

Option 1: Start with a benchmark

Create a small CSV with 5–10 prompts and expected outputs.
Go to Benchmarks → New Benchmark → LLM Text.
Upload the CSV and pick Factuality + ExactMatch as scorers.
Run and inspect results.

Option 2: Start with live scoring

Install the SDK: npm install frametail
Wrap your LLM calls: client.trace('name', async span => {...})
Create a live-scoring rule matching your span name.
Pick text scorers and save.
Send traces to your Frametail project.
Watch scores appear in the Traces tab.

Option 3: Use OpenTelemetry

If you're already using OpenTelemetry, point the exporter at Frametail:

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

const exporter = new OTLPTraceExporter({
  url: 'https://api.frametail.io/v1/traces/otlp',
  headers: {
    authorization: `Bearer ${process.env.FRAMETAIL_API_KEY}`,
  },
})

No SDK needed — Frametail reads OTLP traces directly.

Costs

Heuristic scorers (ExactMatch, Levenshtein, JsonMatch) — No API calls, no cost.
LLM-judge scorers (Factuality, AnswerRelevancy, Conciseness) — One call per sample, ~$0.0005 each (using gpt-4.1-mini).

Costs are tracked automatically in Cost Analytics.
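
For a back-of-the-envelope estimate before a run, the figures above can be combined as samples × LLM-judge scorers × cost per call. The sketch below assumes one judge call per sample per LLM-judge scorer at the approximate $0.0005 rate quoted above; the function name is hypothetical, and actual charges depend on the judge model and prompt length.

```typescript
// Approximate per-call cost quoted above; treat it as an estimate.
const COST_PER_JUDGE_CALL_USD = 0.0005

// Scorers that call a model. Heuristic scorers cost nothing.
const JUDGE_SCORERS = new Set(['Factuality', 'AnswerRelevancy', 'Conciseness'])

// Estimated judge cost: one call per sample for each LLM-judge scorer selected.
function estimateJudgeCostUsd(sampleCount: number, scorers: string[]): number {
  const judgeCount = scorers.filter((s) => JUDGE_SCORERS.has(s)).length
  return sampleCount * judgeCount * COST_PER_JUDGE_CALL_USD
}
```

For example, 1,000 samples scored with Factuality and ExactMatch involve one judge scorer, so the estimate is about $0.50.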

Combining LLM + video evaluation

You don't have to choose. A single evaluation can include both:

Create a dataset with prompts and reference images (or videos).
Run a benchmark that generates both:
- Video via FAL / Pipevideo
- LLM caption / description via OpenRouter
Score both:
- Video: Adherence, Realism, NSFW
- Text (caption): Factuality, Conciseness, ExactMatch
Compare runs — See how model changes affect both modalities.