
AI Agent Evals: Why Most Teams Still Do Vibe-Testing

Dishant Sharma
Jan 13th, 2026
7 min read

Someone on Reddit asked how people evaluate their AI agents. The top comment was brutal: "I've had discussions with numerous AI and machine learning engineers working on similar projects, and none have achieved satisfactory results."

Another developer replied they just do "vibe-testing."

That's where most teams are right now. Building agents that book flights, write code, or handle support tickets. And testing them by... hoping nothing breaks. Anthropic just dropped a guide on agent evals that makes it clear: without evals, you're flying blind. You wait for complaints. Fix one bug. Break something else. Repeat.

The guide opens with a warning. Good evaluations help teams ship AI agents more confidently. Without them, you get stuck in reactive loops, catching issues only in production, where fixing one failure creates others.

SWE-Bench scores went from 40% to over 80% in one year. But nobody noticed their agent got better until they had numbers to prove it.

Why most teams skip evals

Early on, manual testing feels fine. You dogfood your agent. Check a few scenarios. Ship it.

Then you hit scale. Users report the agent feels worse after your latest update. You have no idea if it's true. You can't reproduce it. You change something and hope it helps.

I used to think evals were overhead. Extra work that slowed down shipping. Turns out, the breaking point comes fast. Once your agent is in production and users depend on it, not having evals becomes the bottleneck.

Descript built an agent for video editing. They started with three questions: don't break things, do what I asked, do it well. Simple. They evolved from manual grading to LLM graders with clear criteria. Now they run two separate suites for quality and regression testing.

The cost of evals is visible upfront. The benefits compound later.

Bolt AI waited until they had a widely used agent before building evals. In three months, they built a system that runs their agent and grades outputs. Static analysis for code. Browser agents to test apps. LLM judges for behaviors like instruction following.

Both approaches work. But starting early forces you to define what success means before your agent does weird things in front of customers.

The three types of graders nobody tells you about

Code-based graders are fast and cheap. String matching. Binary tests. Static analysis. Does the code run? Do the tests pass?

SWE-Bench Verified works exactly this way. Give agents GitHub issues from real Python repos. Run the test suite. A solution passes only if it fixes failing tests without breaking existing ones.
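
Reproducing that check locally is roughly two pytest runs per task. A minimal sketch, assuming a task record that lists which tests should flip from failing to passing; the fail_to_pass and pass_to_pass field names here are illustrative, not necessarily SWE-Bench's actual schema:

```python
import subprocess

def tests_pass(test_ids: list[str], repo_dir: str) -> bool:
    """Run a specific set of pytest tests; True only if every one passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def grade_patch(task: dict, repo_dir: str) -> bool:
    """Pass only if the previously failing tests now pass AND the
    previously passing tests still do. Field names are illustrative."""
    return (
        tests_pass(task["fail_to_pass"], repo_dir)
        and tests_pass(task["pass_to_pass"], repo_dir)
    )
```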

But code-based graders are brittle. They fail on valid variations that don't match expected patterns exactly.

Model-based graders are flexible. You can grade subjective things like tone, coherence, or whether the agent grounded its response in actual data. They cost more. They're slower. And they can hallucinate their own grades.

Human graders are the gold standard. Expensive. Slow. But necessary for calibrating your LLM graders.

Most teams combine all three. Anthropic's example for a coding agent includes unit tests for correctness, an LLM rubric for code quality, static analysis tools like ruff and mypy, state checks in security logs, and tool call verification.

In practice, coding evaluations rely on unit tests and an LLM rubric. Everything else is optional.
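
A rough sketch of what that minimal combination can look like. Nothing here is a particular framework's API: `llm_judge` stands in for whatever completion call you already have, and the rubric and threshold are made up.

```python
import subprocess

def code_checks(repo_dir: str) -> dict:
    """Fast, deterministic graders: do the tests pass, is the linter happy?"""
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    lint = subprocess.run(["ruff", "check", "."], cwd=repo_dir)
    return {"tests_pass": tests.returncode == 0, "lint_clean": lint.returncode == 0}

RUBRIC = (
    "Score this diff from 1 to 5 for readability, naming, and consistency "
    "with the surrounding codebase. Reply with only the number."
)

def rubric_score(diff: str, llm_judge) -> int:
    """Slower, flexible grader: an LLM scores subjective quality against a rubric."""
    return int(llm_judge(f"{RUBRIC}\n\n{diff}").strip())

def grade(repo_dir: str, diff: str, llm_judge) -> bool:
    checks = code_checks(repo_dir)
    # Hard gate on correctness, softer threshold on subjective quality.
    return checks["tests_pass"] and checks["lint_clean"] and rubric_score(diff, llm_judge) >= 4
```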

Here's what people don't say: if your agent passes 0% of tasks after 100 tries, your task is probably broken, not your agent. Frontier models are good enough now that zero success usually means ambiguous specs or misconfigured graders.

Terminal-Bench found this out the hard way. A task asked agents to write a script but didn't specify a filepath. The tests assumed a particular filepath. Agents failed through no fault of their own.

Pass@k vs pass^k

These metrics sound academic but they matter.

Pass@k measures the chance your agent gets at least one correct solution in k attempts. If your agent has a 50% success rate on the first try, that's 50% pass@1. Give it three tries and pass@3 climbs to 87.5%.

Pass^k measures whether all k trials succeed. If your agent succeeds 75% of the time, the probability of passing all three trials is 42%.
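
Under the simplifying assumption that trials are independent with a fixed per-trial success rate p, both metrics are one line each:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k

print(pass_at_k(0.5, 3))   # 0.875 — three tries at a 50% hit rate
print(pass_hat_k(0.75, 3)) # ~0.42 — the 42% figure above
```

In practice you estimate p from repeated runs per task rather than assuming it, but the asymmetry is the point: pass@k only climbs as you allow more attempts, while pass^k only falls.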

Use pass@k when one success matters. Like a coding agent that proposes solutions.

Use pass^k when consistency is essential. Like a customer support agent where users expect reliable behavior every time.

Most people optimize for pass@1 and hope pass^3 follows. It doesn't always work that way.

The frameworks everyone's comparing

LangSmith is the obvious choice if you're deep in LangChain. It traces chains and agents. Minimal code changes to log runs.

Arize Phoenix is open source and covers broader LLM observability. Model performance, data quality, bias detection. It works with various frameworks, not just LangChain. One developer said "phoenix is fast and easy. I've been testing it out on openai agents recently and it works great".

Langfuse excels at tracing. Open source. Popular for teams that want flexibility.

Maxim AI offers end-to-end simulation and observability. Multi-turn agent simulation. API endpoint testing.

Here's the honest take from a Reddit thread: "For those building agents seeking an independent solution, Arize Ax is likely the best choice. If you're invested in the LangChain ecosystem, LangSmith is a strong option."

Nobody's claiming one framework solves everything. They all have trade-offs.

Why naming things is still the hardest problem

This has nothing to do with evals. But I've noticed something.

Every eval framework has a different name for the same concept. LangSmith calls it a "run." Anthropic calls it a "trial." Some call it a "trace" or "trajectory."

Same with graders. Some call them "checks" or "assertions" or "validators."

Reading documentation becomes a translation exercise. You learn one framework's vocabulary. Switch to another. Relearn everything.

It's like when every database vendor called transactions something different in the 90s. Eventually everyone converged on standard terms. We're not there yet with AI evals.

The people who figure out naming will probably win adoption. Not because their tech is better. Because developers won't have to mentally map concepts between docs.

Who shouldn't bother with evals

If you're prototyping, skip it. Seriously.

Evals make sense when you have something that works and you want to make it better without breaking it. They protect against regressions.

But if you're still figuring out what your agent should do, evals are premature. You'll spend more time updating eval tasks than building features.

Start with 20 to 50 simple tasks drawn from real failures. Not hundreds. Not a perfect suite. Just the manual checks you already run during development.

One person on Reddit admitted they've talked to ML engineers working on agents and "none have achieved satisfactory results" with marketed eval solutions. AutoEvals "requires extra human oversight to function properly."

Evals aren't magic. They're organized manual testing that runs automatically.
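
Concretely, "organized manual testing" can start as a list of prompts and checks in a single file. Everything in this sketch is a placeholder: the tasks, the checks, and the `agent` callable you'd wire in.

```python
# A handful of tasks drawn from real failures, each with a prompt and a check.
TASKS = [
    {"prompt": "Cancel order #1234", "check": lambda out: "cancelled" in out.lower()},
    {"prompt": "What's the refund window?", "check": lambda out: "30 days" in out},
    # ...20 to 50 of these, not hundreds
]

def run_suite(agent) -> float:
    """`agent` is whatever callable takes a prompt and returns the agent's output."""
    passed = 0
    for task in TASKS:
        output = agent(task["prompt"])
        if task["check"](output):
            passed += 1
        else:
            print(f"FAIL: {task['prompt']!r}\n{output}\n")
    return passed / len(TASKS)
```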

If you're a solo developer shipping fast, manual testing might be fine for months. If you're a team where one person's change breaks another person's fix, you need evals yesterday.

The breaking point is different for everyone. But the pattern is the same. First you dismiss evals as overhead. Then one production bug makes you wish you had them.

What happens when models get better

Qodo's story is telling. They were unimpressed by Opus 4.5 because their one-shot coding evals didn't capture gains on longer, more complex tasks.

So they built a new agentic evaluation framework. Suddenly Opus 4.5 looked much better.

Narrow evals make capable models look weak. How you measure determines what you see. When Claude Opus 4.5 launched, it initially scored 42% on CORE-Bench. An Anthropic researcher found rigid grading that penalized "96.12" when expecting "96.124991". After fixing bugs and using a less constrained scaffold, the score jumped to 95%.
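
That grading bug is worth dwelling on because it's so easy to write. The sketch below contrasts the rigid check with a tolerant one, using the numbers from the anecdote; the 0.1% tolerance is an arbitrary choice, not CORE-Bench's.

```python
import math

def exact_match(answer: str, expected: str) -> bool:
    """The rigid version: any formatting or rounding difference fails."""
    return answer.strip() == expected.strip()

def numeric_match(answer: str, expected: str, rel_tol: float = 1e-3) -> bool:
    """Compare as numbers within a relative tolerance instead of as strings."""
    try:
        return math.isclose(float(answer), float(expected), rel_tol=rel_tol)
    except ValueError:
        return False

print(exact_match("96.12", "96.124991"))    # False — a correct answer gets penalized
print(numeric_match("96.12", "96.124991"))  # True — close enough under the tolerance
```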

Your evals will need updates as models improve. Tasks that once measured "can we do this at all?" become "can we still do this reliably?" High-performing capability evals graduate to regression suites.

Teams with good evals can adopt new models in days. Teams without evals face weeks of manual testing while competitors ship.

Reading transcripts is not optional

Anthropic's guide emphasizes this. You won't know if your graders work unless you read the transcripts from many trials.

When a task fails, the transcript shows whether the agent made a real mistake or your grader rejected a valid solution.

Failures should seem fair: it should be clear what the agent got wrong and why. If scores don't improve, you need confidence it's the agent's fault, not the eval's.

Anthropic invested in tooling for viewing eval transcripts. And they regularly take the time to read them.
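
Tooling doesn't have to mean much. Here's a sketch of the simplest version — dump every failed trial somewhere a human will actually open it — assuming each trial record carries a task id, a pass/fail flag, and a transcript of role/content turns:

```python
from pathlib import Path

def dump_failures(trials: list[dict], out_dir: str = "eval_failures") -> None:
    """Write one readable file per failed trial. Field names are assumptions."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for trial in trials:
        if trial["passed"]:
            continue
        turns = [f"{t['role'].upper()}: {t['content']}" for t in trial["transcript"]]
        (out / f"{trial['task_id']}.txt").write_text("\n\n".join(turns))
```

Whatever form it takes, the goal is the same: make it cheaper to read a transcript than to guess.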

Reading transcripts is how you verify your eval measures what actually matters.

Nobody should take eval scores at face value until someone digs into the details and reads some transcripts. If grading is unfair, tasks are ambiguous, or valid solutions are penalized, the eval needs revision.

I still think about those Reddit comments. People asking how to evaluate agents. Getting replies like "vibe-testing" or "none have achieved satisfactory results." It's 2026. Models can write code and control computers. But most teams still don't have a systematic way to know if their agent got better or worse.
