Why Your AI Agent Probably Sucks (And How Evals Can Fix It)

Dishant Sharma
Nov 23rd, 2025
7 min read

Last Tuesday I spent four hours debugging an AI agent. Four hours watching it make the same stupid mistake over and over. It kept calling the wrong API endpoint. Every single time.

The logs showed it was "working." The benchmarks said it was "good." But in reality? It was useless.

You've probably done this. I know I have. Built an agent that works perfectly in your test cases. Then watched it completely fail when real users touched it. The problem isn't the LLM. It's that we're testing these things like they're regular software.

They're not.

Why Testing Agents Is Different

Regular code is deterministic. Same input, same output. Every time.

Agents are chaos machines. Give them the same prompt twice and you might get wildly different results. They plan. They reason. They hallucinate. They go off on tangents you never imagined.

Andrew Ng from DeepLearning.AI recently pointed out that teams wait way too long to implement automated evaluations. They rely on humans to manually check outputs long after they should have automated it. And I get why. Building evals feels like this massive investment.

But here's the thing everyone misses: you don't need perfect evals on day one.

Start With Five Examples

Most people think building evals means creating 1,000 test cases with perfect metrics. That's why they never start.

Start with just five examples and unoptimized metrics. Seriously. Five.

I used to think this was lazy. Now I know it's smart. Because evals themselves need to evolve. Your agent changes. Your use cases change. Your evals should change too.

Here's what actually happens when you start small:

You pick five real examples that broke in production. You write a quick eval that measures one thing. Maybe it's just "did the agent call the right tool?" That's it. Then you iterate.
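
To make that concrete, here's roughly what a first "did it call the right tool?" eval can look like. This is a sketch, not a framework: the cases, `run_agent`, and the `.tool_calls` trace are stand-ins for whatever your own agent exposes.

```python
# Minimal "did the agent call the right tool?" eval over five production failures.
# CASES, run_agent(), and the .tool_calls trace are stand-ins for your own stack.

CASES = [
    {"input": "Refund order #4521", "expected_tool": "issue_refund"},
    {"input": "Where is my package?", "expected_tool": "track_shipment"},
    # ...three more examples pulled straight from production logs
]

def eval_tool_choice(run_agent):
    failures = []
    for case in CASES:
        trace = run_agent(case["input"])
        called = [call.name for call in trace.tool_calls]
        if case["expected_tool"] not in called:
            failures.append((case["input"], called))
    print(f"{len(CASES) - len(failures)}/{len(CASES)} cases picked the right tool")
    for user_input, called in failures:
        print(f"  FAIL: {user_input!r} -> called {called}")
    return failures
```

Twenty lines. It won't catch everything. It will catch the thing that burned you last week.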

You're not building a perfect eval. You're building a feedback loop.

The goal isn't catching every possible failure. It's catching the failures that matter right now. Next week you'll catch different ones.

The Soft Failure Revolution

One team at Monte Carlo Data had a breakthrough when they introduced "soft failures" into their testing. Traditional CI/CD is black and white. Test fails? Code doesn't ship. Done.

But with agents? That's insane.

The tests themselves can hallucinate. You're using an LLM to judge another LLM's output. They implemented a scoring system where anything below 0.5 is a hard failure, above 0.8 is a pass, and 0.5-0.8 is a soft failure.

Soft failures can merge. But if you get too many? That becomes a hard failure.

This single change made their agent development actually shippable. Before this, they were stuck. Every third commit would fail tests because the judge was having a bad day.

The really clever part? About one in ten tests produces a spurious result: the agent's output is fine, but the judge hallucinates a failure. So they built a retry mechanism. If a test fails and then passes on retry, they assume it was noise.
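
The mechanics are simple enough to sketch. This isn't Monte Carlo's actual code, just my reading of the idea: bucket the judge's score into pass / soft fail / hard fail, give any non-pass one retry (fail-then-pass counts as noise), and let a soft-failure budget (the `3` below is my own placeholder) decide whether the branch can merge.

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    SOFT_FAIL = "soft_fail"
    HARD_FAIL = "hard_fail"

HARD_FAIL_BELOW = 0.5    # judge score under this is a hard failure
PASS_AT_OR_ABOVE = 0.8   # at or above this is a clean pass
SOFT_FAIL_BUDGET = 3     # "too many" soft failures; pick your own number

def verdict_for(score):
    if score >= PASS_AT_OR_ABOVE:
        return Verdict.PASS
    if score < HARD_FAIL_BELOW:
        return Verdict.HARD_FAIL
    return Verdict.SOFT_FAIL

def judged_verdict(judge, output):
    """Judge once; on a non-pass, judge again and keep the better score,
    treating a fail-then-pass as judge noise."""
    score = judge(output)
    if verdict_for(score) is not Verdict.PASS:
        score = max(score, judge(output))
    return verdict_for(score)

def can_merge(verdicts):
    """Hard failures block the merge; soft failures are fine until they pile up."""
    if Verdict.HARD_FAIL in verdicts:
        return False
    return verdicts.count(Verdict.SOFT_FAIL) <= SOFT_FAIL_BUDGET
```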

What Nobody Tells You About Agent Evals

Most tutorials focus on the final output. Did the agent answer correctly? Great. Ship it.

But that's surface level. Agents operate through trajectories, which are sequences of actions and states. A single wrong step early on can cascade into total failure at the end.

Think about it. Your agent might:

  • Call the right tool with wrong parameters

  • Get the right data but misinterpret it

  • Plan a perfect strategy then forget it halfway through

You need to measure the journey, not just the destination.

I learned this the hard way building a customer support agent. It would get the answer right 80% of the time according to my evals. Users hated it. Why? Because it took seven unnecessary steps to get there. It was technically correct but practically useless.

Now I measure (there's a rough code sketch after the list):

  • Plan quality (did it map the goal to the right steps?)

  • Tool usage (right tool, right params, right context?)

  • Trajectory efficiency (shortest path to solution?)
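
Plan quality usually needs an LLM judge, but the other two can be scored mechanically. Here's a rough sketch, assuming each run is recorded as a list of steps with the tool it called; the field names and thresholds are placeholders for your own setup.

```python
def in_order(expected, actual):
    """True if `expected` appears within `actual` as an in-order subsequence."""
    it = iter(actual)
    return all(item in it for item in expected)

def eval_trajectory(steps, expected_tools, reasonable_steps):
    """Score one recorded agent run on tool usage and efficiency.

    steps:            the run's steps, each with a .tool attribute
    expected_tools:   tools a sensible plan would call, in order
    reasonable_steps: how many steps the task should take
    (Plan quality itself usually needs an LLM judge, so it isn't scored here.)
    """
    tools_called = [step.tool for step in steps]
    efficiency = min(1.0, reasonable_steps / max(len(steps), 1))
    return {
        "tool_usage": 1.0 if in_order(expected_tools, tools_called) else 0.0,
        "efficiency": round(efficiency, 2),  # 1.0 = shortest path, lower = wandering
        "num_steps": len(steps),
    }
```

My seven-step support agent would have scored around 0.3 on efficiency. That number would have told me what my accuracy metric never did.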

And here's the kicker. When goal interpretation and plan generation are misaligned, you get drift: the agent does things that are syntactically valid but contextually wrong.

Your agent isn't broken. It's just solving the wrong problem.

My Friend Who Self-Hosts Everything

Side note but relevant: my coworker Jake self-hosts everything. His email server. His git repos. His password manager. Everything.

He spent two weeks building evals for a simple document processing agent. Two weeks. I asked him why.

"Because i don't trust black boxes," he said.

The thing is, he was right. His agent runs in production now and has been solid for six months. Mine? I shipped fast with minimal evals. I've had to patch it four times.

Sometimes the paranoid approach wins.

The Hard Truth

Let me be honest. Most teams don't need the agent they're building.

It's relatively easy to build a proof of concept in a few days, but moving from that first draft to a production-ready agent requires significant effort. The effort is in the evals.

If you can't be bothered to properly evaluate your agent, you probably don't need an agent at all. A simple script would work fine.

One team at Three Dots Labs discovered that building their AI feature was 80% regular software development and 20% AI-specific parts, and 80% of that AI work was creating and running evals.

That's the ratio nobody talks about. We spend all our time on prompts and model selection. The real work is in the evaluation infrastructure.

And here's what makes it worse. Evaluation is hard because it has to reflect human preferences, not just accuracy. Especially for agents that work alongside humans. User frustration is a failure even if the task completed successfully.

You can't measure that with a simple accuracy score.

Two Loops Running Forever

The development process comprises two iterative loops running in parallel: iterating on the agent to make it perform better, and iterating on the evals to make them correspond more closely to human judgment.

Most teams only run the first loop. They tweak prompts. They try different models. They adjust the system message.

But they never improve their evals. So they're optimizing for the wrong thing.

The eval should be wrong sometimes. That's how you know it needs work. When your eval ranks two agent versions and you disagree with the ranking? Fix the eval.

I keep a running doc of "eval errors" just like I would for model errors. Every time my eval gives a weird score, I note it. Once a week I update the eval to handle those cases better.
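
The doc can be as dumb as a CSV. Mine is close to this sketch, where the file name and columns are just my own convention: one row per case where the eval's score and my judgment disagree, and the weekly pass over the file drives the eval changes.

```python
import csv
import datetime

def log_eval_error(case_id, eval_score, my_score, note, path="eval_errors.csv"):
    """Append one eval/human disagreement to a running CSV."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(),
            case_id,
            eval_score,  # what the eval said
            my_score,    # what I think it should have said
            note,        # one line on why I disagree
        ])
```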

What You Should Do Tomorrow

Don't build 1,000 test cases. Don't spend a month designing the perfect eval framework.

Do this instead:

Pick the three most common failure modes from your production logs. Write one eval for each. Run them. See what breaks.

That's it. You're now doing evals better than 80% of teams.

The teams that win aren't the ones with the best evals. They're the ones who started measuring anything at all.

The Agent That Lied

There's a story i keep coming back to. A team built a coding agent that would try to fix issues that didn't exist or completely miss what was wrong. In production. With real users.

The problem? They couldn't verify the model's output. So they built a system where the agent had to actually fix the user's code before giving hints. If it succeeded, they had confidence it was right.
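
I don't know that team's exact pipeline, but the pattern is easy to sketch: don't surface a hint until the agent's own fix actually makes the failing test pass. Every name below (the agent methods, the sandbox runner) is a stand-in for whatever your stack exposes.

```python
def hint_with_verification(agent, code, failing_test, run_in_sandbox):
    """Only surface a hint after the agent's own fix makes the failing test pass."""
    fixed_code = agent.propose_fix(code, failing_test)
    result = run_in_sandbox(fixed_code, failing_test)
    if result.passed:
        # The agent proved its diagnosis against real code, so the hint is grounded.
        return agent.explain_fix(code, fixed_code, failing_test)
    # Verification failed: don't forward a possibly hallucinated diagnosis.
    return None
```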

This is the kind of creativity you need. Not better prompts. Better verification.

Your agent will hallucinate in production. Plan for it. Build systems that catch it. Don't just hope the evals caught everything.

Final Thought

I still think about that Tuesday. Four hours on one stupid API call bug. Could have walked my dog. Could have made dinner. Could have done literally anything else.

But i didn't have evals in place. I was flying blind.

Now I write evals before I write agents. It feels slower at first. But I haven't had a four-hour debugging session in three months.

Your future self will thank you.
