How I designed an AI agent without code: eval first

I set myself an ambitious goal: to actually understand AI agents at an engineer's level, not a user's. On day one I learned something that reframed the whole thing: building agents is mostly not about code.

Here's the short model that makes everything click.

A language model has no memory. A single call is just "text in → text out," and between calls it remembers nothing. An "agent" is simply a loop wrapped around that memoryless function: assemble the context → ask the model → run a tool → append the result → repeat. And since the model is only as smart as the context you hand it, the entire job comes down to one question at every step: what goes into the context window right now — and nothing more? This is called context engineering, and it's the real craft.
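The loop is small enough to sketch in a few lines of Python. This is a minimal illustration, not a real framework: `run_agent`, `fake_model`, the `FINAL:` convention and the `lookup` tool are all hypothetical names made up for the example, and the "model" is a plain function standing in for an actual LLM call.

```python
from typing import Callable

# Hypothetical types: a "model" is any text -> text function,
# and a "tool" is any function from an argument string to a result string.
Model = Callable[[str], str]
Tool = Callable[[str], str]


def run_agent(task: str, model: Model, tools: dict[str, Tool], max_steps: int = 5) -> str:
    """Loop around a memoryless model: the history list is the only memory."""
    history: list[str] = [f"TASK: {task}"]
    for _ in range(max_steps):
        context = "\n".join(history)          # assemble the context
        reply = model(context)                # ask the model
        if reply.startswith("FINAL:"):        # assumed convention: model signals it is done
            return reply.removeprefix("FINAL:").strip()
        name, _, arg = reply.partition(" ")   # e.g. "lookup alice"
        tool = tools.get(name)
        result = tool(arg) if tool else f"unknown tool: {name}"  # run a tool
        history.append(f"CALL: {reply}")
        history.append(f"RESULT: {result}")   # append the result, then repeat
    return "stopped: step limit reached"


def fake_model(context: str) -> str:
    """Stand-in for an LLM: a pure function of the text it is handed."""
    if "RESULT:" in context:
        return "FINAL: " + context.rsplit("RESULT: ", 1)[1]
    return "lookup alice"


answer = run_agent(
    "Who is alice?",
    fake_model,
    {"lookup": lambda who: f"{who} is a colleague"},
)
print(answer)
```

Notice that `fake_model` only "knows" about the tool result because the loop pasted it back into the context. Everything the model can use has to be in that one string.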

To keep it from staying theory, I picked a real task: an agent that triages my work email — sorting messages by priority (needs action / FYI / background) and preparing draft replies. No auto-send: only a human sends. I designed the whole loop and the classification rules on my phone, without a single line of code.
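As a rough sketch of how those two design rules might look once written down, here are the three priority levels as an enum and the "only a human sends" rule as a hard check. The names `Priority`, `Draft`, `approved_by_human` and `send` are assumptions for illustration, not part of any mail API.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    NEEDS_ACTION = "needs action"
    FYI = "FYI"
    BACKGROUND = "background"


@dataclass
class Draft:
    to: str
    text: str
    approved_by_human: bool = False   # the agent never sets this flag itself


def send(draft: Draft) -> str:
    """Refuse to send anything a human has not explicitly approved."""
    if not draft.approved_by_human:
        raise PermissionError("draft not approved: only a human sends")
    return f"sent to {draft.to}"      # placeholder for a real send step


draft = Draft(to="boss@example.com", text="Numbers attached.")
draft.approved_by_human = True        # stands in for a human clicking "approve"
status = send(draft)
```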

Then came the most important lesson of the day. Before building anything, I ran an eval: I took five real emails, hand-labeled the "correct" priorities, and then "played the agent," classifying each message using only metadata (sender, subject, who's on the To line).

And here's the surprise. When I actually read the body of the emails, two of the five verdicts flipped:

one email looked like "cc'd to a colleague, not my problem" — but buried deep in a forwarded chain was a task assigned to me personally, already being chased a second time;
an automated system notification looked like "background" — but it was actually requesting my sign-off.

Both misses went in the dangerous direction: cheap signals (sender, subject, the To line) were downgrading real tasks to "not urgent." If the agent had judged by metadata alone, it would have quietly hidden exactly the things I can't afford to miss.
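The same experiment can be sketched as a tiny eval harness. The five emails below are invented for illustration (they are not the real messages from the experiment), and `classify_by_metadata` is a toy rule set assumed for the example. It is built so that the two kinds of miss described above show up.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    on_to_line: bool   # True if I am on the To line, False if only cc'd
    body: str
    label: str         # hand-assigned: "action", "fyi" or "background"


def classify_by_metadata(e: Email) -> str:
    """Toy rules that look only at sender and the To line, never the body."""
    if e.sender.startswith("noreply@"):
        return "background"
    if not e.on_to_line:
        return "fyi"
    return "action"


# Invented examples, hand-labeled after "reading" the body.
emails = [
    Email("boss@example.com", "Budget review", True, "Please send numbers by Friday.", "action"),
    Email("peer@example.com", "Fwd: vendor thread", False, "Second reminder: this task is assigned to you.", "action"),
    Email("noreply@example.com", "Notification", True, "Your sign-off is required.", "action"),
    Email("news@example.com", "Weekly digest", False, "Industry news.", "fyi"),
    Email("noreply@example.com", "Backup completed", True, "Nightly backup finished.", "background"),
]

# Each miss is (subject, predicted, expected).
misses = [
    (e.subject, classify_by_metadata(e), e.label)
    for e in emails
    if classify_by_metadata(e) != e.label
]
for subject, predicted, expected in misses:
    print(f"{subject}: predicted {predicted}, expected {expected}")
```

In this made-up set, the forwarded thread and the system notification are the two that get downgraded, which is exactly the pattern that metadata alone cannot see.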

That gave me two takeaways I now treat as fundamentals:

Classify cheaply, but load the body wherever the cost of being wrong is high. Metadata is fast and nearly free, but it hides the most expensive misses. Reading the full content costs more — so do it selectively, for the borderline cases.
Weight "dangerous errors" separately in your eval. An agent that occasionally keeps some clutter is harmless. An agent that loses a real task is not. They're different errors, and your metric has to tell them apart.
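Both takeaways fit in a short sketch. It rests on a few assumptions: "borderline" is read here as "about to be downgraded below action," the cue list is a hypothetical keyword match rather than a real classifier, and the weight of 10 for a dangerous miss is an arbitrary number picked for illustration.

```python
from typing import Callable

RANK = {"background": 0, "fyi": 1, "action": 2}
ACTION_CUES = ("sign-off", "assigned to you", "reminder")  # hypothetical cue list


def triage(meta_verdict: str, load_body: Callable[[], str]) -> str:
    """Trust a cheap 'action' verdict; before downgrading, pay to read the body."""
    if meta_verdict == "action":
        return meta_verdict               # cheap path: the body is never loaded
    body = load_body().lower()            # expensive path: only when about to downgrade
    return "action" if any(cue in body for cue in ACTION_CUES) else meta_verdict


def score(pairs: list[tuple[str, str]], danger_weight: int = 10) -> dict[str, int]:
    """pairs are (predicted, expected); downgrades cost danger_weight times more."""
    dangerous = sum(1 for p, e in pairs if RANK[p] < RANK[e])  # real task ranked too low
    harmless = sum(1 for p, e in pairs if RANK[p] > RANK[e])   # clutter ranked too high
    return {
        "dangerous": dangerous,
        "harmless": harmless,
        "cost": dangerous * danger_weight + harmless,
    }


# Two downgrades and one harmless over-ranking, scored separately.
before = score([("fyi", "action"), ("background", "action"), ("action", "fyi")])

# A notification that metadata called "background" is rescued by reading the body.
fixed = triage("background", lambda: "Your sign-off is required.")
```

Passing `load_body` as a function rather than a string is the point of the design: the body is only fetched when the cheap verdict is about to hide something.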

But the bigger lesson is broader. I designed and tested an agent, both its rules and its eval set, before writing a single line of code. And it was the eval, not the code, that turned out to separate a working agent from a nice-looking demo. Almost anyone can build an agent that works on one example. Proving with a number that it works across twenty different ones, and never loses what matters, is the rare part.

This was, in effect, day one of my path into AI agents. Next: building this agent in code (drafts, human-in-the-loop, eval as a test) — and figuring it out in public.

To be continued.

