# What Is a Code-Based Eval? | Amplitude

A code-based eval scores AI output using deterministic logic: regex checks, JSON schema validation, expected tool calls, and more. No LLM required to judge it.

Source: https://amplitude.com/en-us/explore/product/what-is-a-code-based-eval

---

###### AI evals for product and engineering teams

# What Is a Code-Based Eval?

<!--$-->

A code-based eval is a type of AI evaluation that scores model output using deterministic logic written in code. In the simplest case, the output either passes or it doesn't.

Examples include checking whether the AI returned valid JSON, called the correct function, or included a required string in its response. None of that requires a separate language model to judge.

For product teams shipping AI features, code-based evals are the foundation of any quality measurement system. They're fast, cheap, and run consistently in CI/CD pipelines the same way unit tests do.

Browse this guide

- [How code-based evals work](#how-code-based-evals-work)
- [When to use code-based evals](#when-to-use-code-based-evals)
- [Code-based evals vs. LLM judges](#code-based-evals-vs-llm-judges)
- [Code-based eval patterns for AI agents](#code-based-eval-patterns-for-ai-agents)
- [Connecting code-based eval results to product outcomes](#connecting-code-based-eval-results-to-product-outcomes)
- [Frequently asked questions](#faq)

<!--/$-->

## How code-based evals work

A code-based eval takes a fixed input, runs it through the AI system, captures the output, and applies a scoring function written in code. That function returns a pass, fail, or numeric score based on explicit logic: no model call, no human reviewer, no ambiguity.

The scoring function is the key part. It can be as simple as a string comparison or as structured as a multi-step schema validator. What makes it "code-based" is that the result is fully deterministic: given the same output, the same function always returns the same score.

Here are the most common patterns:

- **Exact match.** The response must equal or contain a specific string. A customer support agent whose replies must end with a case number can be tested this way: if the expected case number is missing, the eval fails.
- **Regex match.** The output must conform to a pattern. An AI that generates SQL can be tested with a regex that checks whether the output starts with SELECT or UPDATE and contains no injection-risky keywords.
- **JSON schema validation.** The output is parsed and validated against an expected schema. If the agent is supposed to return a structured object, a schema check catches any malformed or incomplete responses before they reach downstream systems.
- **Tool call assertion.** The agent must have called a specific function with specific arguments. If a billing agent is supposed to call fetch\_invoice and instead calls fetch\_account, the eval catches the wrong tool call regardless of what the final response said.
- **Data validation.** The agent returned the expected number of rows or items. A data analyst agent that runs SQL queries can be tested by comparing the row count of the output against an expected value.

Each of these runs in milliseconds and adds near-zero cost to the eval pipeline.
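The patterns above can be sketched as small scoring functions. This is a minimal illustration using only the Python standard library, not a reference implementation. The case-number format, the forbidden-keyword list, and the shape of `tool_calls` are assumptions made for the example, and the SQL check is narrowed to read-only `SELECT` statements.

```python
import json
import re

# Hypothetical case-number format, e.g. "CASE-123456"; not from the article.
CASE_NUMBER = re.compile(r"\bCASE-\d{6}\b")
# Hypothetical denylist of statements a read-only SQL agent should never emit.
FORBIDDEN_SQL = re.compile(r"\b(DROP|DELETE|TRUNCATE|ALTER)\b", re.IGNORECASE)


def contains_string(output: str, required: str) -> bool:
    """Exact match: the required string must appear verbatim in the output."""
    return required in output


def has_case_number(output: str) -> bool:
    """Regex match: the output must include something shaped like a case number."""
    return CASE_NUMBER.search(output) is not None


def is_safe_select(sql: str) -> bool:
    """Regex match: must start with SELECT and contain no forbidden keyword."""
    starts_ok = re.match(r"\s*SELECT\b", sql, re.IGNORECASE) is not None
    return starts_ok and FORBIDDEN_SQL.search(sql) is None


def matches_schema(output: str, required_fields: dict[str, type]) -> bool:
    """Minimal schema check: a JSON object with the required, correctly typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in required_fields.items()
    )


def called_tool(tool_calls: list[dict], name: str, arguments: dict) -> bool:
    """Tool call assertion: some call used this tool with exactly these arguments."""
    return any(
        call["name"] == name and call["arguments"] == arguments
        for call in tool_calls
    )


def has_row_count(rows: list, expected: int) -> bool:
    """Data validation: the result has the expected number of rows."""
    return len(rows) == expected
```

Each function takes an output and returns `True` or `False`, so any of them can serve directly as a pass/fail eval. For real schema validation, a dedicated validator such as the `jsonschema` package is likely a better fit than the hand-rolled field check shown here.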

## When to use code-based evals

Code-based evals work when the quality criterion is verifiable: when there's a correct answer that can be checked without human judgment.

If you can write a function that takes an output and returns true or false, you have a code-based eval. If you can't, because the question is "was this response helpful?" or "did the tone feel appropriate?", that's a job for an LLM judge or a human reviewer.

### **Use code-based evals for:**

- Verifying the agent called the correct tool
- Checking that output conforms to an expected format or schema
- Asserting that required fields, strings, or values appear in the response
- Catching forbidden outputs (specific keywords, banned tools, injection patterns)
- Validating that latency or cost fell within acceptable bounds

### **Don't use code-based evals for:**

- Judging whether a response was helpful or on-topic
- Assessing tone, empathy, or communication quality
- Evaluating whether a claim is factually grounded in retrieved context
- Scoring creative or open-ended responses where no single answer is correct

Code-based evals also fit naturally into CI/CD. Because they run fast and return deterministic results, they can gate pull request merges or block deployments the same way a test suite does. A team that ships a prompt change can run its full code-based eval suite in seconds and know immediately whether any verifiable quality criterion broke.
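As a rough sketch of what that looks like in a test suite, the checks below are written as plain `test_` functions that a runner such as pytest can discover. `run_agent` is a hypothetical stand-in for the system under test, and the banned phrases and latency budget are placeholder values, not recommendations.

```python
import time


def run_agent(prompt: str) -> str:
    """Stand-in for the real system under test (hypothetical)."""
    return "Your refund was issued. Case number: CASE-004211."


# Hypothetical denylist and latency budget; tune both for your product.
BANNED_PHRASES = ("as an ai language model", "internal use only")
MAX_LATENCY_SECONDS = 2.0


def test_response_includes_case_number():
    # Required string: the reply must carry a case number label.
    output = run_agent("Where is my refund?")
    assert "Case number:" in output


def test_response_has_no_banned_phrases():
    # Forbidden output: none of the banned phrases may appear.
    output = run_agent("Where is my refund?").lower()
    assert not any(phrase in output for phrase in BANNED_PHRASES)


def test_latency_within_budget():
    # Latency bound: one call must finish inside the budget.
    start = time.perf_counter()
    run_agent("Where is my refund?")
    assert time.perf_counter() - start < MAX_LATENCY_SECONDS
```

Against a live model, the latency test measures a real network call, so the budget usually needs some headroom to avoid flaky failures.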

## Code-based evals vs. LLM judges

Code-based evals and LLM judges aren't competing approaches. They cover different types of quality criteria, and most production AI teams run both.

|                   | Code-based eval                                                     | LLM judge                                                      |
| ----------------- | ------------------------------------------------------------------- | -------------------------------------------------------------- |
| Speed             | Milliseconds per eval                                               | Seconds per eval (requires a model call)                       |
| Cost              | Near-zero                                                           | Adds model API cost per evaluation                             |
| Reproducibility   | Fully deterministic; same output always yields the same score       | Non-deterministic across runs; scores can vary                 |
| Best for          | Verifiable properties: format, schema, tool calls, required strings | Subjective quality: helpfulness, tone, groundedness, relevance |
| Failure mode      | Can pass a technically incorrect but correctly formatted output     | Requires calibration against human labels to be trustworthy    |
| CI/CD integration | Runs as part of a standard test suite                               | Typically runs as a separate eval pipeline step                |

The failure mode row is worth understanding before you build. A code-based eval will pass an output that looks structurally correct but says something wrong. An LLM judge might rate that same output poorly on helpfulness, but it requires calibration work before you can trust those ratings at scale. The two approaches catch different things, which is why most teams treat code-based evals as the first layer and LLM judges as the second.
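A small illustration of that failure mode, assuming a hypothetical invoice payload with made-up values: the structural check below accepts both outputs, even though only one of them carries the right total.

```python
import json


def has_valid_invoice_shape(output: str) -> bool:
    """Structural check only: valid JSON with a string id and a numeric total."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and isinstance(data.get("invoice_id"), str)
        and isinstance(data.get("total"), (int, float))
    )


# Suppose the true total for inv_1 is 120.00 (made-up value for illustration).
correct = '{"invoice_id": "inv_1", "total": 120.0}'
wrong = '{"invoice_id": "inv_1", "total": 12.0}'

# Both pass, because the check never asks whether the value itself is right.
print(has_valid_invoice_shape(correct), has_valid_invoice_shape(wrong))  # prints: True True
```

When the expected value is known for a test case, a second code-based check can compare it directly. When it isn't, that gap is what a judge or human review has to cover.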

For teams getting started, code-based evals are the lower-friction entry point. You can write and run your first eval without any judge design, rubric calibration, or model costs. Start there, identify what code-based logic can't catch, and add LLM judges for those gaps.

## Code-based eval patterns for AI agents

AI agents produce structured outputs (tool calls, SQL queries, JSON payloads, function arguments) that map directly to code-based evaluation. The trace is the source of truth: it captures every tool the agent called, every argument it passed, and every value it returned.

Here are the eval patterns product and engineering teams apply most often:

- **Tool selection check.** Did the agent call the right tool for the given intent? An agent with access to both search\_knowledge\_base and create\_ticket should call the knowledge base tool for informational questions. An eval asserts the correct tool appeared in the trace for each test case.
- **Parameter validation.** Did the agent pass the correct arguments? Beyond checking which tool was called, the eval verifies that the arguments matched expected values: the right customer ID, the right date range, the right filters.
- **Negative tool assertion.** Did the agent avoid calling a tool it shouldn't have? This matters for agents with access to write operations. An eval can assert that delete\_record or send\_email never appeared during a read-only task.
- **Schema conformance.** Does the final output match the expected response structure? If the agent returns a structured payload to a downstream system, schema validation catches malformed responses before they cause failures elsewhere.
- **Cost and latency bounds.** Did the agent complete the task within acceptable limits? An eval can assert that the total number of tool calls stayed below a threshold, or that end-to-end latency for a given task type stayed within acceptable bounds.

Each of these evals runs against trace data. The trace captures what happened; the eval function asserts whether what happened was correct. That separation matters because you can add new eval assertions against historical trace data without re-running the agent.
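The trace-based patterns above might look like the following sketch. The `Trace` and `ToolCall` shapes are assumptions made for illustration; real tracing tools record more fields and use their own formats. The tool names come from the examples in this section, while the argument names and thresholds are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Trace:
    """Hypothetical trace shape; real tracing tools will differ."""
    tool_calls: list[ToolCall] = field(default_factory=list)
    latency_ms: float = 0.0


def called(trace: Trace, tool: str) -> bool:
    """Tool selection check: the tool appears somewhere in the trace."""
    return any(c.name == tool for c in trace.tool_calls)


def called_with(trace: Trace, tool: str, expected_args: dict) -> bool:
    """Parameter validation: some call to the tool matches every expected argument."""
    return any(
        c.name == tool
        and all(c.arguments.get(k) == v for k, v in expected_args.items())
        for c in trace.tool_calls
    )


def never_called(trace: Trace, forbidden: set[str]) -> bool:
    """Negative tool assertion: no forbidden tool appears in the trace."""
    return all(c.name not in forbidden for c in trace.tool_calls)


def within_bounds(trace: Trace, max_calls: int, max_latency_ms: float) -> bool:
    """Cost and latency bounds: call count and latency stay within limits."""
    return len(trace.tool_calls) <= max_calls and trace.latency_ms <= max_latency_ms


# A stored trace for an informational question (made-up values).
trace = Trace(
    tool_calls=[ToolCall("search_knowledge_base", {"query": "reset password", "top_k": 3})],
    latency_ms=840.0,
)

results = {
    "tool_selection": called(trace, "search_knowledge_base"),
    "parameters": called_with(trace, "search_knowledge_base", {"query": "reset password"}),
    "no_write_tools": never_called(trace, {"delete_record", "send_email", "create_ticket"}),
    "bounds": within_bounds(trace, max_calls=5, max_latency_ms=2000.0),
}
```

Because each function takes only a stored `Trace`, a new assertion can be applied to historical traces without calling the agent again.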

## Connecting code-based eval results to product outcomes

A passing code-based eval score tells you the agent behaved correctly on a test case. It doesn't tell you whether correct behavior correlates with user retention, task completion, or revenue.

Eval scores become strategically useful when they join to product engagement data under the same user identity. A team that can correlate eval pass rate for a given intent type with 30-day retention for users who triggered that intent can prioritize quality improvements by business impact, not just by failure count.
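As a simplified sketch of that join, assume eval results and a 30-day retention flag are both keyed by the same user ID. The records and field names below are invented for illustration and do not represent any product's schema or real data.

```python
from collections import defaultdict

# Made-up records for illustration only; not real data.
eval_results = [
    {"user_id": "u1", "intent": "billing", "passed": True},
    {"user_id": "u2", "intent": "billing", "passed": False},
    {"user_id": "u3", "intent": "billing", "passed": True},
    {"user_id": "u4", "intent": "billing", "passed": False},
]
retained_30d = {"u1": True, "u2": False, "u3": True, "u4": True}


def retention_by_eval_outcome(results, retained):
    """Return {intent: {passed: retention rate}} for users present in both datasets."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in results:
        if r["user_id"] in retained:
            buckets[r["intent"]][r["passed"]].append(retained[r["user_id"]])
    return {
        intent: {passed: sum(flags) / len(flags) for passed, flags in by_outcome.items()}
        for intent, by_outcome in buckets.items()
    }


summary = retention_by_eval_outcome(eval_results, retained_30d)
```

With this toy data, `summary["billing"]` maps `True` (eval passed) to a retention rate of 1.0 and `False` to 0.5. The size of that gap per intent is the kind of signal a team could use to rank quality fixes by business impact.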

Amplitude's [AI Agents](https://amplitude.com/ai-agents) product was built for this join. It treats code-based eval results, trace-level quality signals, and user engagement data as events in the same product stream where retention, conversion, and feature adoption are already tracked. A PM can chart eval pass rate over time alongside the retention curve for users who interacted with that feature, without exporting data or joining tables in a warehouse.

Teams that have instrumented their AI agents with Amplitude can also build cohorts based on eval outcomes, run experiments on prompt changes with business outcome metrics as the primary success criteria, and use [Session Replay](https://amplitude.com/session-replay) to watch the actual user experience that followed a failing trace.

## Frequently asked questions

### **What's the difference between a code-based eval and a unit test?**

A unit test checks that a deterministic function returns an expected output for a given input. A code-based eval checks that an AI system (which involves model calls, tool execution, and context retrieval) produces output meeting defined quality criteria. Unit tests run on code you wrote. Code-based evals run on outputs a language model generated.

### **Can code-based evals catch hallucinations?**

Only indirectly. A code-based eval can check whether a response includes a specific string or passes a schema check, but it cannot verify whether a claim is factually grounded in retrieved context. Catching hallucinations reliably requires an LLM judge that compares the response against retrieved documents, or a retrieval-grounded correctness check.

### **How many code-based evals do I need to start?**

Ten to 20 cases covering your most common tool calls and output types is a reasonable starting point. Prioritize cases where a failure would be visible to users or block a downstream step. Add new cases whenever you encounter a production failure your existing evals didn't catch.

### **How do code-based evals fit into a CI/CD pipeline?**

Most teams build an eval runner that takes a fixed dataset of (input, expected outcome) pairs, runs each through the system, and scores outputs with deterministic functions. The runner returns a pass rate that gates pull request merges, the same way a test suite does. If the pass rate falls below a defined threshold, the merge or deployment is blocked.
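A minimal runner along those lines might look like this. `run_system`, the three cases, and the 0.9 threshold are all placeholders chosen for the sketch; here the expected outcome of each case is expressed as a check function on the output.

```python
from typing import Callable

# Each case pairs an input with a deterministic check on the output.
Case = tuple[str, Callable[[str], bool]]


def run_system(prompt: str) -> str:
    """Stand-in for the real AI system (hypothetical)."""
    return '{"status": "ok"}' if "status" in prompt else "Sorry, I can't help."


def pass_rate(cases: list[Case], system: Callable[[str], str]) -> float:
    """Run every case through the system and return the fraction that passed."""
    passed = sum(1 for prompt, check in cases if check(system(prompt)))
    return passed / len(cases)


def gate(rate: float, threshold: float) -> int:
    """Exit code for CI: 0 lets the merge proceed, 1 blocks it."""
    return 0 if rate >= threshold else 1


CASES: list[Case] = [
    ("What is the order status?", lambda out: out.startswith("{")),
    ("Give me the status as JSON", lambda out: '"status"' in out),
    ("Tell me a joke", lambda out: "Sorry" in out),
]


def main() -> int:
    rate = pass_rate(CASES, run_system)
    print(f"pass rate: {rate:.0%}")
    return gate(rate, threshold=0.9)
```

In a CI job, the script's entry point would call `sys.exit(main())` so that a non-zero exit code fails the pipeline step.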

## Start measuring AI quality with Amplitude

Amplitude's AI Analytics Platform connects eval scores to the product metrics your team already tracks. See how trace-level quality signals fit into the same workflow as retention, conversion, and feature adoption.

[Try Amplitude for free today](https://app.amplitude.com/signup) to see how unified analytics, AI Agents, and eval quality signals work together.
