LLM evaluation quick start
Score LLM outputs in production and benchmarks using built-in text scorers.
Frametail now has complete LLM evaluation capabilities to match video/image scoring. Run LLM benchmarks in the cloud, score live production traces, and detect performance regressions — all with the same infrastructure that powers video evals.
What's new
1. Text/LLM scorers
Score pure-text and LLM outputs using:
- LLM-judge scorers (call a model to evaluate):
  - Factuality — Is the answer factually correct?
  - AnswerRelevancy — Does it address the question?
  - Conciseness — Is it appropriately brief?
- Heuristic scorers (no model call, deterministic):
  - ExactMatch — Whitespace-normalized exact match to expected.
  - JsonMatch — Structural equality of JSON (key-order insensitive).
  - Levenshtein — String similarity (0–1) via edit distance.
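To make the heuristic scorers concrete, here is a minimal sketch of the three ideas in TypeScript. It is an illustration only, not Frametail's implementation: the function names, the whitespace normalization (trim plus collapsing runs of whitespace), the order-sensitive treatment of arrays, and the normalization of edit distance by the longer string's length are all assumptions.

```typescript
// Illustrative re-implementations of the heuristic ideas; not Frametail's own code.

// Exact match after trimming and collapsing runs of whitespace (assumed normalization).
function exactMatch(output: string, expected: string): number {
  const normalize = (s: string): string => s.trim().replace(/\s+/g, " ");
  return normalize(output) === normalize(expected) ? 1 : 0;
}

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Structural JSON equality: object key order is ignored, array order is not (assumption).
function jsonEqual(a: Json, b: Json): boolean {
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return a === b;
  }
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) return false;
    return a.every((item, i) => jsonEqual(item, b[i]));
  }
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) => Object.prototype.hasOwnProperty.call(b, k) && jsonEqual(a[k], b[k]),
  );
}

// Parse both strings and compare; unparseable input counts as a mismatch (assumption).
function jsonMatch(output: string, expected: string): number {
  try {
    return jsonEqual(JSON.parse(output), JSON.parse(expected)) ? 1 : 0;
  } catch {
    return 0;
  }
}

// Similarity in [0, 1]: 1 minus edit distance divided by the longer string's length.
function levenshteinSimilarity(output: string, expected: string): number {
  if (output.length === 0 && expected.length === 0) return 1;
  // prev[j] holds the distance between the first i-1 chars of output and first j of expected.
  let prev: number[] = Array.from({ length: expected.length + 1 }, (_, j) => j);
  for (let i = 1; i <= output.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= expected.length; j++) {
      const cost = output[i - 1] === expected[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return 1 - prev[expected.length] / Math.max(output.length, expected.length);
}
```

All three functions are pure and make no network calls, which is why scorers of this kind are deterministic.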
2. LLM benchmarks in the cloud
Create a benchmark with an LLM model (e.g., openai/gpt-4o-mini) instead of a video model:
- Create an evaluation — Go to Evaluations → New Evaluation → LLM Text Generation.
- Upload a dataset — CSV/JSON with prompt (or input), optional expected column for reference output.
- Configure scorers — Pick text scorers (heuristic or LLM-judge).
- Run — Frametail generates text via OpenRouter and scores each sample.
Results show per-sample scores and averages, just like video benchmarks.
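For the dataset upload step above, a file with the described columns could be generated by a short script. The sketch below assumes a CSV with a header row of prompt and expected and standard RFC 4180 quoting; the `DatasetRow` type and the `toCsv` helper are hypothetical names used only for illustration, and the exact header requirements should be checked against the upload dialog.

```typescript
// Hypothetical row shape based on the column names above: `prompt` plus optional `expected`.
interface DatasetRow {
  prompt: string;
  expected?: string;
}

// Quote a CSV field when it contains a comma, double quote, or line break (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize rows to CSV text; a missing `expected` becomes an empty cell.
function toCsv(rows: DatasetRow[]): string {
  const lines = ["prompt,expected"];
  for (const row of rows) {
    lines.push(`${csvField(row.prompt)},${csvField(row.expected ?? "")}`);
  }
  return lines.join("\n") + "\n";
}

const rows: DatasetRow[] = [
  { prompt: "What is the capital of France?", expected: "Paris" },
  { prompt: 'Return {"ok": true} as JSON.', expected: '{"ok": true}' },
  { prompt: "Summarize the plot of Hamlet in one sentence." }, // no reference output
];

console.log(toCsv(rows));
```

Rows without a reference output can still be graded by scorers that do not need one, while scorers such as ExactMatch compare against the expected column.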
3. Live scoring of LLM traces
Instrument your LLM app with the Frametail SDK or OpenTelemetry, and Frametail will score production traces in real time:
import Frametail from 'frametail'
import OpenAI from 'openai'

const client = new Frametail()
const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// The LLM call made inside client.trace is recorded as a span named 'check-answer';
// live-scoring rules that match this span name will score it
const result = await client.trace('check-answer', async (span) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Explain quantum computing.' }],
  })
  return response.choices[0].message.content
})

Then define a live-scoring rule to grade matching traces:
- Rules → New Rule → Match on span name or attributes.
- Scorers — Pick text scorers.
- Expected output (optional) — Fixed text, or a column from a linked dataset.
- Save — Scores flow into your trace automatically.
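Conceptually, a live-scoring rule pairs a match condition with a list of scorers. The sketch below models that idea in TypeScript to show how matching on span name or attributes might behave. The `Span` and `LiveScoringRule` shapes and the `ruleMatches` function are hypothetical and exist only for illustration; rules are configured in the Frametail UI, and the real schema may differ.

```typescript
// Hypothetical shapes for illustration; Frametail's real rule schema may differ.
type AttributeValue = string | number | boolean;

interface Span {
  name: string;
  attributes: Record<string, AttributeValue>;
}

interface LiveScoringRule {
  spanName?: string; // match on span name, if set
  attributes?: Record<string, AttributeValue>; // every listed attribute must match
  scorers: string[]; // e.g. ["Factuality", "Conciseness"]
  expected?: string; // optional fixed expected output
}

// A rule matches when its span name (if given) and all of its attributes equal the span's.
function ruleMatches(rule: LiveScoringRule, span: Span): boolean {
  if (rule.spanName !== undefined && rule.spanName !== span.name) return false;
  for (const [key, value] of Object.entries(rule.attributes ?? {})) {
    if (span.attributes[key] !== value) return false;
  }
  return true;
}

const exampleRule: LiveScoringRule = {
  spanName: "check-answer",
  attributes: { env: "production" },
  scorers: ["Factuality", "Conciseness"],
};

const exampleSpan: Span = {
  name: "check-answer",
  attributes: { env: "production", model: "gpt-4o-mini" },
};

console.log(ruleMatches(exampleRule, exampleSpan));
```

Under this model, a rule with neither a span name nor attributes would match every span, so narrowing rules to a specific span name keeps LLM-judge costs predictable.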
4. Regression detection
Rerun a benchmark after tweaking your model or prompt. Frametail automatically compares it to the previous run and flags any scorer regressions:
- Green banner — No regressions (all scorers stable or improved).
- Yellow banner — N scorers regressed (shows before→after deltas).
Use the API to integrate regression checks into your CI/CD pipeline.
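The comparison behind a regression check can be sketched as follows. This is an assumption-laden illustration, not Frametail's algorithm: it assumes each run is summarized as an average score per scorer, that higher scores are better, and that a drop larger than a small tolerance counts as a regression. The `RunAverages` type, the `findRegressions` function, and the tolerance value are hypothetical, and fetching the two runs from the API is left out.

```typescript
// Hypothetical shape: average score per scorer for one benchmark run.
type RunAverages = Record<string, number>;

interface Regression {
  scorer: string;
  before: number;
  after: number;
  delta: number; // after - before; negative means the score dropped
}

// Flag scorers whose average dropped by more than `tolerance` (assumes higher is better).
function findRegressions(
  previous: RunAverages,
  current: RunAverages,
  tolerance = 0.01,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [scorer, before] of Object.entries(previous)) {
    const after = current[scorer];
    if (after === undefined) continue; // scorer absent from the new run; nothing to compare
    const delta = after - before;
    if (delta < -tolerance) regressions.push({ scorer, before, after, delta });
  }
  return regressions;
}

const previousRun: RunAverages = { Factuality: 0.91, ExactMatch: 0.8, Conciseness: 0.75 };
const currentRun: RunAverages = { Factuality: 0.84, ExactMatch: 0.82, Conciseness: 0.75 };

const regressions = findRegressions(previousRun, currentRun);
for (const r of regressions) {
  console.log(`${r.scorer}: ${r.before} -> ${r.after}`);
}
// In a CI job, a non-empty result could fail the build, for example by
// setting a non-zero exit code.
```

A tolerance is useful because LLM-judge scores can vary slightly between runs even when nothing has changed.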
Choosing between cloud benchmarks and live scoring
| Use case | Cloud Benchmark | Live Scoring |
|---|---|---|
| Evaluate a new model systematically | ✅ | ❌ |
| Grade production traffic automatically | ❌ | ✅ |
| Detect regressions over time | ✅ | ✅ |
| Compare A/B changes | ✅ | ❌ |
| Sample traces for manual review | ✅ | ✅ |
Getting started
Option 1: Start with a benchmark
- Create a small CSV with 5–10 prompts and expected outputs.
- Go to Benchmarks → New Benchmark → LLM Text.
- Upload the CSV and pick Factuality + ExactMatch as scorers.
- Run and inspect results.
Option 2: Start with live scoring
- Instrument your app with the SDK:
  npm install frametail
- Wrap your LLM calls:
  client.trace('name', async span => {...})
- Create a live-scoring rule matching your span name.
- Pick text scorers and save.
- Send traces to your Frametail project.
- Watch scores appear in the Traces tab.
Option 3: Use OpenTelemetry
If you're already using OpenTelemetry, point the exporter at Frametail:
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

const exporter = new OTLPTraceExporter({
  url: 'https://api.frametail.io/v1/traces/otlp',
  headers: {
    authorization: `Bearer ${process.env.FRAMETAIL_API_KEY}`,
  },
})

No SDK needed — Frametail reads OTLP traces directly.
Costs
- Heuristic scorers (ExactMatch, Levenshtein, JsonMatch) — No API calls, no cost.
- LLM-judge scorers (Factuality, AnswerRelevancy, Conciseness) — One call per sample, ~$0.0005 each (using gpt-4.1-mini).
Costs are tracked automatically in Cost Analytics.
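As a rough planning aid, the per-call figure above can be turned into a back-of-the-envelope estimate. The sketch assumes one judge call per LLM-judge scorer per sample at roughly $0.0005 each and ignores the cost of generating the text itself; the function and constant names are hypothetical.

```typescript
// Approximate per-call cost quoted above; actual pricing may differ.
const COST_PER_JUDGE_CALL_USD = 0.0005;

// Scorers that call a model; heuristic scorers are treated as free.
const LLM_JUDGE_SCORERS = new Set(["Factuality", "AnswerRelevancy", "Conciseness"]);

// Estimate scoring cost, assuming one judge call per LLM-judge scorer per sample.
function estimateScoringCostUsd(sampleCount: number, scorers: string[]): number {
  const judgeScorerCount = scorers.filter((s) => LLM_JUDGE_SCORERS.has(s)).length;
  return sampleCount * judgeScorerCount * COST_PER_JUDGE_CALL_USD;
}

// 1,000 samples with two LLM-judge scorers and one heuristic scorer.
const estimatedCost = estimateScoringCostUsd(1000, ["Factuality", "ExactMatch", "Conciseness"]);
console.log(`$${estimatedCost.toFixed(2)}`);
```

Here 1,000 samples with two judge scorers come to 2,000 calls, or about $1.00, while the ExactMatch scorer adds nothing.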
Combining LLM + video evaluation
You don't have to choose. A single evaluation can include both:
- Create a dataset with prompts and reference images (or videos).
- Run a benchmark that generates both:
  - Video via FAL / Pipevideo
  - LLM caption / description via OpenRouter
- Score both:
  - Video: Adherence, Realism, NSFW
  - Text (caption): Factuality, Conciseness, ExactMatch
- Compare runs — See how model changes affect both modalities.