Agent Design Is Still Hard


Thursday morning. Coffee getting cold.
Armin Ronacher drops a post about building agents. You know Armin. Flask guy. Sentry early engineer. The person who's been documenting his AI journey for months.
This one hits different.
He's been building agents at his new company. Not toy examples. Real production systems. And his main point? Building agents is still messy.
You've probably heard the hype. Agents will do everything. Just write some prompts and ship. But if you've actually tried building one, you know. It's not that simple.
The SDK Problem
I used to think picking the right SDK would solve everything.
Armin's team started with Vercel AI SDK. Made sense. It's good tech. Clean abstractions.
They wouldn't make that choice again.
Here's what happens when you build a real agent:
The differences between models are big enough that you need your own abstraction. Cache control works differently. Reinforcement needs different handling. Provider-side tools don't play nice.
With higher-level SDKs, you build on top of their abstractions. Which might not be the ones you want.
The right abstraction isn't clear yet.
The combination of the Vercel SDK and Anthropic's web search tool routinely destroys message history. They still haven't figured out why.
My friend tried building an agent last month. Different SDK. Same problem. Spent three days debugging why tool calls kept failing. Turns out the SDK was doing something clever with message formatting. Something he didn't ask for.
He ended up writing directly against the API.
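Here's roughly what that looks like with the anthropic Python package. The model name and the tool schema are placeholders, not his actual setup.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# You own the message list. Nothing reformats it behind your back.
messages = [{"role": "user", "content": "Summarize the open incidents."}]

response = client.messages.create(
    model="claude-sonnet-4-5",        # placeholder model id
    max_tokens=1024,
    messages=messages,
    tools=[{
        "name": "read_file",          # placeholder tool
        "description": "Read a file from the shared workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
)

# Append exactly what came back; the raw blocks go straight into history.
messages.append({"role": "assistant", "content": response.content})
```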
Caching Is Weird
Anthropic makes you manage cache points explicitly.
Armin initially thought this was dumb. Why not automate it?
Now he vastly prefers explicit cache management.
Here's why:
Costs become predictable
You can split conversations and run them in different directions
Context editing becomes possible
Their strategy: one cache point after the system prompt, two at the start of the conversation, with the last one trailing the conversation as it grows.
Because the system prompt needs to stay static, they feed dynamic info like current time later. Otherwise it trashes the cache.
Most people don't think about this. They assume caching just works. It doesn't.
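Here's a rough sketch of that layout, using Anthropic's cache_control markers. It's simplified to one breakpoint after the system prompt and one trailing the conversation, and the model name and prompts are made up.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are an operations agent. ..."  # static, never changes mid-run

def build_request(messages: list) -> dict:
    # The trailing cache point sits on the newest content block and moves
    # with the conversation tail. In a long run you'd also drop stale
    # markers, since Anthropic caps the number of breakpoints per request.
    messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return dict(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache point after the system prompt
        }],
        messages=messages,
    )

# Dynamic info (like the current time) goes into the user messages,
# not the system prompt, so the system-prompt cache stays valid.
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Current time: 2025-11-20 09:12 UTC"},
    {"type": "text", "text": "Triage the overnight alerts."},
]}]
response = client.messages.create(**build_request(messages))
```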
Reinforcement Does Heavy Lifting
Every time the agent runs a tool, you can feed more information back.
Not just the tool output. Everything.
You can remind the agent about the overall objective. Provide hints when tools fail. Inform it about background state changes.
Claude Code has a todo write tool that's just an echo tool. The agent tells it what tasks it should do, and it echoes them back.
That's it. Self-reinforcement.
That's enough to drive the agent forward better than just giving tasks at the start.
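A sketch of the same idea for your own loop. The name mirrors the concept; it isn't Claude Code's actual implementation.

```python
# An echo tool: no storage, no side effects. The echo itself is the
# reinforcement, because the plan re-enters the context on every call.
TODO_TOOL = {
    "name": "todo_write",
    "description": "Record the current task list. Call this whenever the plan changes.",
    "input_schema": {
        "type": "object",
        "properties": {"todos": {"type": "array", "items": {"type": "string"}}},
        "required": ["todos"],
    },
}

def handle_todo_write(tool_input: dict) -> str:
    items = "\n".join(f"- {todo}" for todo in tool_input["todos"])
    return f"Current plan:\n{items}\nKeep working toward the overall objective."
```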
When Things Break
If you expect lots of failures during code execution, you can hide those failures from context.
Two ways:
Run tasks in a subagent until they succeed, only report back success. Maybe include what didn't work.
Use context editing to remove certain failures that didn't help.
But there's a catch. Context editing invalidates caches. Always. It's unclear when that trade-off is worth it.
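The subagent variant might look something like this. SubagentResult and the retry limit are stand-ins for whatever your inner loop actually returns.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubagentResult:
    succeeded: bool
    summary: str = ""
    error: str = ""

def run_task_isolated(task: str,
                      run_subagent: Callable[[str], SubagentResult],
                      max_attempts: int = 5) -> str:
    """Run a flaky task in a subagent; the parent context only sees the outcome."""
    failures = []
    for _ in range(max_attempts):
        result = run_subagent(task)  # inner loop with its own throwaway context
        if result.succeeded:
            if failures:
                # One line about what didn't work, instead of replaying
                # every failed attempt into the parent context.
                return f"{result.summary} (earlier attempts failed: {'; '.join(failures)})"
            return result.summary
        failures.append(result.error)
    return f"Task failed after {max_attempts} attempts. Last error: {failures[-1]}"
```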
I watched an agent try to fix a database connection error last week. Seventeen attempts. Same error. Same approach.
It never occurred to the agent to try something different.
The File System Thing
Most of their agents are based on code execution. That needs a common place to store data.
They use a virtual file system.
Why? You should avoid dead ends where a task can only continue in one specific tool.
Example: An image generation tool should write to the same place where the code execution tool can read. Otherwise you can't use the code tool to zip those images.
It needs to work both ways. Code tool unpacks a zip, inference describes images, code tool processes them.
The file system is the glue.
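A toy in-memory version of the idea. Armin doesn't describe their implementation; this just shows why the shared space matters.

```python
import io
import zipfile

class VirtualFS:
    """Shared workspace every tool reads from and writes to."""

    def __init__(self):
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]

    def list(self, prefix: str = "") -> list[str]:
        return [p for p in self._files if p.startswith(prefix)]

vfs = VirtualFS()

# The image tool writes into the shared space ...
vfs.write("images/cover.png", b"\x89PNG...")

# ... so the code-execution tool can zip whatever is there. No dead end.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for path in vfs.list("images/"):
        zf.writestr(path, vfs.read(path))
vfs.write("out/images.zip", buf.getvalue())
```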
Output Is Surprisingly Hard
Their agent doesn't work like a chat session. Intermediate messages aren't revealed to the user.
They have one output tool. The agent uses it to communicate with humans. In their case, it sends an email.
Problems:
It's hard to steer the wording and tone. Much harder than using the main loop's text output.
They tried running another quick LLM to adjust tone. It increased latency and reduced quality.
Sometimes the agent doesn't call the output tool at all. They inject a reinforcement message if the loop ends without it.
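Something like this, presumably. The tool name and the wording of the nudge are guesses, not their code.

```python
def nudge_if_silent(messages: list, called_output_tool: bool) -> bool:
    """If the loop ended without the output tool, inject a reminder and run one more turn."""
    if called_output_tool:
        return False
    messages.append({
        "role": "user",
        "content": "You have not reported anything to the user yet. "
                   "Call the send_email tool with your results before finishing.",
    })
    return True  # caller re-enters the loop for another turn
```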
I don't know why output is this hard. Armin thinks it's related to how models are trained.
A Random Thing About Naming
I name my projects after birds. Raven. Finch. Crow.
Started as a joke. Now I can't stop.
My coworker names everything after Star Trek ships. Enterprise-db. Voyager-api. Defiant-cache.
We spent twenty minutes last week arguing about whether my Magpie service should integrate with his Runabout queue.
The agents don't care what we name things. But we do. And somehow it makes the work more fun.
Models Still Matter
Haiku and Sonnet are still the best tool callers. They make excellent choices for the agent loop.
For individual sub-tools that need inference, Gemini 2.5 works well. Especially for summarizing large documents or extracting info from images.
The Sonnet models hit safety filters sometimes, which is annoying.
Token cost alone doesn't define how expensive an agent is.
A better tool caller will do the job in fewer tokens. There are cheaper models than Sonnet, but they're not necessarily cheaper in a loop.
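Made-up numbers, but they show the shape of it: a cheaper model that needs more loop turns to finish ends up costing more per task.

```python
# Illustrative prices and turn counts only.
strong = {"usd_per_mtok": 3.00, "loop_turns": 8,  "tokens_per_turn": 6_000}
cheap  = {"usd_per_mtok": 0.80, "loop_turns": 40, "tokens_per_turn": 6_000}

def cost_per_task(m: dict) -> float:
    return m["loop_turns"] * m["tokens_per_turn"] / 1_000_000 * m["usd_per_mtok"]

print(cost_per_task(strong))  # 0.144
print(cost_per_task(cheap))   # 0.192 -> the "cheaper" model loses in the loop
```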
Testing Is The Worst Part
Testing and evals are the hardest problem.
Unlike plain prompts, you can't run agent evals in an external system. There's too much state to feed in.
You need evals based on observability data or instrumented test runs.
None of the solutions they tried convinced them. At the moment they haven't found something that makes them happy.
This tracks with what I'm seeing. Everyone's struggling with this. Some people just run the agent fifty times and hope. Some build elaborate scoring systems that don't really work.
Nobody has this figured out yet.
What People Are Saying
The post hit Hacker News. Got 90 points, 195 comments.
One comment stuck with me. Someone said they spent three months building disk encryption for AWS before AWS just added it as a button. Their lesson: often it's better to do nothing.
Fair point. But also. Armin's building actual production systems. Not experimenting. Not waiting for someone else to solve it.
Simon Willison called attention to Armin saying AI writes 90% of the code at his new company. But Simon emphasized: the AI doesn't fully comprehend threading, and if you don't catch bad decisions early, you can't operate it stably.
The realization that came up over and over: despite the basic agent design being just a loop, there are subtle differences based on tools you provide.
Those differences are everything.
The Real Talk
Most people don't need to build agents.
Not yet anyway.
If you're just trying to automate some tasks, use an existing tool. Claude Code. Cursor. Something that works.
Building your own agent means dealing with SDKs that break in weird ways. Managing caches manually. Writing reinforcement logic. Building file systems for tool communication.
It's a lot.
But if you do need to build one. If you're past the point where existing tools work. Then this stuff matters.
Armin mentions he's now trying Amp alongside Claude Code. Not because it's objectively better, but because he likes how they're thinking about agents. The Oracle and main loop interactions are beautifully done.
The tools are getting better. Slowly.
Final Thought
Armin's been writing about AI and agents all year. This is just his latest update. He even wrote a separate post about LLM APIs being a synchronization problem.
The pattern I notice: he shares what actually works. Not what should work. Not what the demos show.
What breaks. What's annoying. What's still unsolved.
That's the kind of honesty this space needs more of.
I still think about my first agent attempt. Six months ago. It was supposed to monitor logs and alert on anomalies. Worked great in testing.
Production? Absolute disaster. False alerts everywhere. Missed actual problems.
Turned out the training data didn't match production patterns at all.
Should have just used grep.