AI eval scoring for product teams
LLM as a Judge Evals: How They Work and When to Trust Them
An LLM as a judge eval uses a language model to score AI output against a rubric. Here's how LLM judges work, what they get wrong, and when to trust them.
Some AI output quality questions have a definitive answer you can check in code: did the response return valid JSON? Did the agent call the right tool? Did the SQL query return the expected row count? Write a test, run it, get a pass or fail.
Other questions don't have a clean answer in code. Did the chatbot actually address what the user was asking? Was the analysis grounded in the retrieved documents, or did the model fill gaps from its own training data? Was the tone right for a billing conversation? Code can verify form. It can't verify substance.
LLM as a judge evals were built for that gap. They use a separate language model to score another model's output against a rubric defined in natural language. This page explains how they work, when to use them, and the calibration problem that makes or breaks them in practice.
What an LLM as a judge eval is
An LLM as a judge (LLMaaJ) eval is a repeatable test that uses a separate language model to score an AI system's output against a rubric written in natural language. The judge model receives three inputs: the rubric (the criteria for what counts as good), the user's original input, and the system's response. It returns a score, a pass or fail verdict, or both.
LLM judges are one of two main scoring approaches in the eval world. The other is code-based evaluation, where deterministic logic checks the output directly. Code-based evals are faster and cheaper. LLM judges scale to questions that code can't answer.
The term LLMaaJ appears in the academic literature. In product team conversations, you'll hear it as "LLM judge," "model-based eval," or just "automated evaluation." They all mean the same thing: a model scoring another model's output according to a human-defined rubric.
How LLM judges work in practice
A basic LLM judge takes three inputs and returns a verdict. The rubric defines what good looks like. The input is what the user sent to the system. The output is what the system returned. The judge reads all three and produces a score.
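The three-inputs-in, verdict-out flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the prompt wording, the JSON reply format, and the `call_model` parameter (a stand-in for whatever client your team uses to reach the judge model) are all assumptions.

```python
import json
from typing import Callable

# Hypothetical prompt layout; adapt the wording to your own judge model.
JUDGE_TEMPLATE = """You are grading an AI system's response.

Rubric:
{rubric}

User input:
{user_input}

System response:
{response}

Reply with JSON only: {{"verdict": "pass" or "fail", "explanation": "<one sentence>"}}"""


def build_judge_prompt(rubric: str, user_input: str, response: str) -> str:
    """Combine the three judge inputs into one prompt string."""
    return JUDGE_TEMPLATE.format(rubric=rubric, user_input=user_input, response=response)


def judge(rubric: str, user_input: str, response: str,
          call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and parse its reply.

    call_model is a hypothetical stand-in: it takes a prompt string and
    returns the judge model's raw text reply.
    """
    raw = call_model(build_judge_prompt(rubric, user_input, response))
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # An unparseable reply is a judge error, not a pass or a fail.
        return {"verdict": "error", "explanation": "judge reply was not valid JSON"}
    if not isinstance(result, dict) or result.get("verdict") not in ("pass", "fail"):
        return {"verdict": "error", "explanation": "judge reply had no valid verdict"}
    return {"verdict": result["verdict"], "explanation": str(result.get("explanation", ""))}
```

Keeping judge errors separate from failures means a malformed judge reply doesn't silently move the pass rate in either direction.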
Here's a concrete example. A customer support AI handles billing questions. The team wants to know whether responses are grounded in the retrieved knowledge base, complete (no missing information), and appropriate in tone. They write a rubric with three criteria, one for each of those questions, and each criterion is scored as a pass or a fail.
The judge model reads the rubric, the billing question, and the agent's response. It returns a pass or fail for each criterion plus an explanation. The team can review flagged responses, fix the underlying prompt or retrieval setup, and watch the pass rate improve over subsequent runs.
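A sketch of what that rubric and the resulting pass rates might look like in Python. Only the three criterion names come from the example. The wording of each criterion and the helper names `pass_rates` and `flagged` are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative rubric for the billing example. The wording of each criterion
# is an assumption; only the three criterion names come from the example.
BILLING_RUBRIC = {
    "grounded": "Every factual claim is supported by the retrieved knowledge base articles.",
    "complete": "The response covers everything the customer needs to resolve the question.",
    "tone": "The response is polite and appropriate for a billing conversation.",
}


def pass_rates(judged: list[dict]) -> dict[str, float]:
    """Compute the share of passing verdicts per criterion.

    Each item in `judged` maps a criterion name to "pass" or "fail".
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for verdicts in judged:
        for criterion, verdict in verdicts.items():
            total[criterion] += 1
            passed[criterion] += verdict == "pass"
    return {criterion: passed[criterion] / total[criterion] for criterion in total}


def flagged(judged: list[dict]) -> list[int]:
    """Return the indexes of responses that failed at least one criterion."""
    return [i for i, verdicts in enumerate(judged) if "fail" in verdicts.values()]
```

Tracking pass rates per criterion rather than as one blended number shows which part of the system needs work: a low grounding rate points at retrieval, while a low tone rate points at the prompt.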
Common judge configurations
- Single-answer grading. The judge scores one response against the rubric. This is the most common form for production monitoring.
- Pairwise comparison. The judge receives two responses and decides which better satisfies the rubric. Useful during prompt iteration when the team is comparing a new prompt against a baseline.
- Reference-based grading. The judge compares the response against a known-good reference answer. Useful when ground truth exists but is too complex to check with an exact string match.
- Reference-free grading. The judge evaluates quality without a reference answer. Used for open-ended responses where no single correct answer exists.
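One way to see how the configurations above relate is that they differ mainly in what the judge is shown. The sketch below assumes a hypothetical `JudgeTask` structure in which optional fields switch between configurations. The field names and prompt wording are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeTask:
    rubric: str
    user_input: str
    response: str
    other_response: Optional[str] = None  # set for pairwise comparison
    reference: Optional[str] = None       # set for reference-based grading


def render(task: JudgeTask) -> str:
    """Build the judge prompt; the optional fields select the configuration."""
    parts = [f"Rubric:\n{task.rubric}", f"User input:\n{task.user_input}"]
    if task.reference is not None:
        # Reference-based: the judge sees a known-good answer.
        parts.append(f"Reference answer:\n{task.reference}")
    if task.other_response is not None:
        # Pairwise: the judge picks between two candidate responses.
        parts.append(f"Response A:\n{task.response}")
        parts.append(f"Response B:\n{task.other_response}")
        parts.append('Which response better satisfies the rubric? Answer "A", "B", or "tie".')
    else:
        # Single-answer: the judge scores one response on its own.
        parts.append(f"Response:\n{task.response}")
        parts.append('Does the response satisfy the rubric? Answer "pass" or "fail".')
    return "\n\n".join(parts)
```

Leaving both optional fields empty gives single-answer, reference-free grading, which is the default for open-ended responses.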
The judge model should be different from the model being evaluated. Using the same model to evaluate its own output creates circular scoring and tends to produce inflated pass rates.
When LLM judges are the right tool
LLM judges fit questions where output quality is partly subjective, but two expert reviewers given the same rubric would still reliably reach the same verdict.
Good fits include open-ended question answering (was this response actually helpful?), grounding checks (did the response use the retrieved context or hallucinate?), tone and appropriateness scoring, multi-turn conversation coherence, and intent alignment (did the agent do what the user actually wanted?).
Code-based evals are a better fit when ground truth is verifiable in code. If the agent should return JSON, validate the schema. If it should call a specific tool, check the tool call log. If it should return the correct SQL row count, run the query. Code-based evals are faster, cheaper, and more reproducible for anything with a deterministic correct answer.
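For comparison, here is roughly what two of those code-based checks could look like in Python. The function names, the required-keys approach (a simplification of full schema validation), and the shape of the tool call log (a list of dicts with a `name` key) are assumptions for illustration.

```python
import json


def is_valid_json_object(text: str, required_keys: set[str]) -> bool:
    """Pass if the text parses as a JSON object containing the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def called_tool(tool_calls: list[dict], expected_tool: str) -> bool:
    """Pass if the tool call log contains a call to the expected tool.

    Assumes each log entry is a dict with a "name" key (hypothetical format).
    """
    return any(call.get("name") == expected_tool for call in tool_calls)
```

Neither check reads the meaning of the response, which is why they can run on every output at negligible cost.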
A practical decision rule: if you could write a pass/fail condition without reading the actual content of the response, use code. If the verdict depends on reading and interpreting the response, use a judge.
Most production AI teams run both. Code-based evals handle the structural and functional checks. LLM judges handle the quality and substance checks. The combination gives teams a complete picture of whether the system is working correctly (code) and well (judge).
The calibration problem
An uncalibrated LLM judge is worse than no judge. A judge that scores every response as passing gives the team a clean dashboard over a broken product. The false signal is more dangerous than no signal because it creates confidence where none is warranted.
Calibration means verifying that the judge's verdicts match what human reviewers would decide, then adjusting the rubric until the agreement is high enough to trust the judge at scale.
A basic calibration workflow
- Sample 50 to 100 real interactions from production or a recent test run.
- Have two human reviewers score each interaction independently using the same rubric.
- Compare human labels to judge labels. Calculate agreement rates for each rubric criterion.
- Find where the judge and reviewers disagree systematically. Common patterns: the judge is too lenient on grounding (it passes responses that cite documents loosely), or too strict on tone (it flags direct language as dismissive).
- Revise the rubric with clearer criteria and worked examples. Re-run the comparison.
- Repeat until judge-human agreement reaches a level the team trusts, typically above 80% on each criterion.
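The comparison step in the workflow above comes down to a small amount of arithmetic. The sketch below assumes the two reviewers' labels have already been reconciled into one human label per interaction, and that labels are the strings "pass" and "fail". It also reports Cohen's kappa, which is not part of the workflow as described but is a common companion metric, because raw agreement can look high by chance when almost every response passes.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Share of interactions where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both lists contain a single identical label
    return (observed - expected) / (1 - expected)


def calibration_report(human_labels: dict[str, list[str]],
                       judge_labels: dict[str, list[str]],
                       threshold: float = 0.80) -> dict[str, dict]:
    """Per-criterion agreement, kappa, and whether agreement exceeds the threshold."""
    report = {}
    for criterion, human in human_labels.items():
        judge = judge_labels[criterion]
        rate = agreement_rate(human, judge)
        report[criterion] = {
            "agreement": rate,
            "kappa": cohens_kappa(human, judge),
            "trusted": rate > threshold,
        }
    return report
```

The 0.80 default mirrors the threshold named in the last step. Treat it as a starting point the team can adjust rather than a fixed standard.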
Rubric design is where PMs have a direct role. The criteria for what counts as helpful, grounded, or appropriate are product decisions, not engineering decisions. A PM who can write a clear rubric criterion with a concrete worked example is doing eval work, not just reviewing it.
The Zheng et al. (2023) paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" documents several systematic failure modes in common judge configurations. These include position bias (in pairwise evaluations, the judge favors a response because of where it appears, most often the first position) and verbosity bias (the judge favors longer responses regardless of quality). Teams building production eval pipelines should read it before choosing their judge model and rubric structure.
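A common mitigation for position bias in pairwise comparison is to run the judge twice with the order swapped and accept a winner only when both runs agree. The sketch below assumes a hypothetical `compare` callable that wraps the judge model and returns "first", "second", or "tie". It illustrates the idea and is not code from the paper.

```python
from typing import Callable


def pairwise_with_swap(compare: Callable[[str, str], str],
                       response_1: str, response_2: str) -> str:
    """Judge a pair in both orders; declare a winner only if the orders agree.

    compare(first, second) is a hypothetical judge wrapper that returns
    "first", "second", or "tie" for the two responses in the order shown.
    """
    forward = compare(response_1, response_2)   # response_1 shown first
    backward = compare(response_2, response_1)  # response_2 shown first
    if forward == "first" and backward == "second":
        return "response_1"
    if forward == "second" and backward == "first":
        return "response_2"
    # Inconsistent or tied verdicts are treated as a tie.
    return "tie"
```

A judge that always picks whichever response it sees first ends up with a tie on every pair under this scheme, which makes the bias visible instead of letting it decide the winner. The cost is two judge calls per comparison.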
Connecting judge scores to product outcomes
A high pass rate on an LLM judge suite tells you the system handles your test cases well. It doesn't tell you whether users who received high-quality responses came back the next week, or whether failure modes concentrate in the segments that drive the most revenue.
Those questions require joining eval scores to product engagement data under the same user identity. Eval infrastructure and product analytics need to speak the same language.
Amplitude's Agent Analytics connects trace data, eval scores, and product engagement metrics in one event stream. A team using Agent Analytics can segment users by their eval pass/fail history and compare 30-day retention rates. If users who received low-scored responses churn at twice the rate of users who received passing responses, that's a quantified business case for fixing the rubric and the underlying system.
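To make the segmentation concrete, here is the underlying calculation in plain Python. This is not the Agent Analytics API. The data shapes (a dict of per-user verdicts and a set of retained user IDs) and the two segment names are assumptions chosen to show the join on user identity.

```python
from typing import Optional


def retention_by_eval_segment(eval_results: dict[str, list[bool]],
                              retained_30d: set[str]) -> dict[str, Optional[float]]:
    """Compare 30-day retention for users with and without failed evals.

    eval_results maps a user ID to that user's eval verdicts (True = pass).
    retained_30d holds the IDs of users who were still active after 30 days.
    Both structures are hypothetical stand-ins for exported analytics data.
    """
    counts = {"all_passed": [0, 0], "had_failure": [0, 0]}  # [retained, total]
    for user_id, verdicts in eval_results.items():
        segment = "all_passed" if all(verdicts) else "had_failure"
        counts[segment][1] += 1
        counts[segment][0] += user_id in retained_30d
    return {
        segment: (retained / total if total else None)
        for segment, (retained, total) in counts.items()
    }
```

The join key is the user ID, so the comparison only works when eval scores and engagement events are recorded under the same identity.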
The same join enables cost-aware quality analysis: which query types cost the most per token, and do they also produce the lowest pass rates? It enables experiment analysis: when you change the prompt or the retrieval strategy, do the eval scores improve, and does that improvement translate to higher activation or conversion in the product? Teams using Feature Experimentation can run controlled prompt experiments and measure quality signal alongside business outcomes in the same platform.
LLM judges produce a quality signal. Product analytics tells you whether quality moves the metrics you care about. Both are necessary; neither is sufficient on its own.
If your team is building AI features and starting to think about evals, the AI evals for product managers guide covers the full measurement stack, including offline and online eval approaches, trace analysis, and how eval-driven development fits into a product team's workflow.
Try Amplitude for free today to connect your AI eval scores to the product engagement data that actually drives decisions.
Frequently asked questions about LLM as a judge evals
What is an LLM as a judge eval?
An LLM as a judge eval uses a separate language model to score an AI system's output against a rubric defined in natural language. The judge receives the rubric, the user's input, and the system's response, then returns a score or pass/fail verdict. Teams use LLM judges for quality checks that can't be expressed as code.
How does an LLM judge compare to a human reviewer?
A human reviewer is slower, more expensive, and generally more accurate on subjective questions. An LLM judge scales to thousands of cases at low cost but produces unreliable scores until calibrated against human labels. Most production teams use human reviewers to define the rubric and calibrate the judge, then run the judge at scale for ongoing monitoring.
What is a rubric in an LLM judge eval?
A rubric is a structured set of criteria used to score AI output. Each criterion defines what passing looks like. For example, a grounding criterion might specify that all factual claims in the response must be traceable to the retrieved documents. Rubrics are written by product and engineering teams, not generated automatically. Rubric quality determines judge reliability.
When should you use a code-based eval instead of an LLM judge?
Use a code-based eval when the correct output can be verified programmatically: JSON schema validation, tool call matching, exact string equality, SQL row counts. Use an LLM judge when quality depends on reading and interpreting the response: helpfulness, grounding, tone, intent alignment, multi-turn coherence. Many teams run both for complete coverage.
How do you calibrate an LLM judge?
Sample 50 to 100 real interactions, have two human reviewers score them independently with the same rubric, and compare human verdicts to judge verdicts. Find systematic disagreements, revise the rubric criteria to resolve them, and repeat until judge-human agreement exceeds 80% per criterion. Calibration is ongoing; add new cases when the product or user behavior changes.
Can LLM judges evaluate AI agents?
Yes. LLM judges work for agent evaluation, though multi-turn interactions add complexity. A single-turn judge scores one response. For agents, teams often need to evaluate the full interaction trace: did the agent call the right tools, in the right order, and produce a coherent final response? Trace-level rubrics that score sequences of actions require more design work than single-response rubrics.
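The "right tools, in the right order" part of a trace-level eval is often checkable in code, leaving the judge to score the coherence of the final response. A minimal sketch, assuming the trace has already been reduced to an ordered list of tool names (the function name and trace format are hypothetical):

```python
def tools_called_in_order(trace: list[str], expected: list[str]) -> bool:
    """True if the expected tool names appear in the trace in the given order.

    Other tool calls may be interleaved between the expected ones.
    """
    remaining = iter(trace)
    # Each membership test consumes the iterator up to the match, so a later
    # expected tool can only match a call that came after the earlier ones.
    return all(tool in remaining for tool in expected)
```

For example, a trace of `lookup_customer`, `lookup_invoice`, `issue_refund` satisfies the expectation `lookup_invoice` then `issue_refund`, but not the reverse order.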