AI agent observability vs. traditional software

Date: Apr 27, 2026
Reading time: 5 minutes
Author: Kaique Nogueira · Engineering · Capim

When I shipped traditional software to production, I had a very clear idea of what to expect. My test suite covered most code paths and my monitoring routine was predictable. But ever since I started working with LLM-based agents, my routine has changed drastically. The biggest lesson I’ve learned day to day is simple: you don’t know what your agent will do until it’s in production, facing an effectively unbounded range of scenarios.

The Comfort of Traditional Software

In the past, we dealt with a finite, constrained input space. Users interact with systems through buttons, forms and API calls with very specific formats. If I designed a signup flow, I knew the exact sequence of screens.

My monitoring relied on traditional tooling, and my focus was structured logs, stack traces and system metrics. When a web request returned “POST /api/checkout 200 OK in 200ms”, I knew everything was fine. Failure handling was exhaustive simply because the failure modes could be enumerated in advance.

Uncharted Territory

Today, reality is completely different. Agents take natural language as their primary input, which makes the input space effectively unbounded. The user no longer clicks “Request Refund”; they can be vague, specific, casual or formal, writing things like “my order arrived broken, what do I do?” or just “refund order #12345”.

Beyond unbounded inputs, we deal daily with the non-deterministic behavior of LLMs. The same input can yield different outputs across runs, and small variations in how the user writes can change the result entirely. During conversations, the agent often makes decisions through multi-step reasoning chains and tool calls I simply can’t predict at development time. A prompt that worked perfectly in my local tests can fail on edge cases in production.

Trying to use the old monitoring tools for this proved frustrating. They weren’t built to store large volumes of free-text, multi-turn conversations, nor to offer semantic search over full trajectories. More than that: an HTTP “200 OK” tells me absolutely nothing about the quality of the agent’s answer. Quality now lives in the conversations themselves.

The Search for the Best Way to Monitor

Determining whether an answer was helpful, whether the tone was right or whether the agent made things up requires human judgment. But how could I manually review thousands of interactions a day? That doesn’t scale — and trying to solve that problem is how I arrived at the structure we use today.

The first step was changing what I monitored. I shifted my focus from infrastructure metrics to the agent’s complete behavior: the prompt sent, the answer generated, the context used, the tools invoked and the reasoning chain when available. This data is stored as structured traces in Langfuse, which gives me real auditability over what happened in each interaction. A “200 OK” still shows up in my logs, but it is no longer the signal I look at.
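To make the idea of a structured trace concrete, here is a minimal sketch in plain Python. It is not the Langfuse schema or SDK, and it is not necessarily how the traces described above are modeled; the class names, field names and sample values are all assumptions chosen to mirror the items listed in the paragraph (prompt, answer, context, tools, reasoning).

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ToolCall:
    """One tool invocation made by the agent (hypothetical shape)."""
    name: str
    arguments: dict
    result: str


@dataclass
class AgentTrace:
    """Everything needed to audit a single interaction (hypothetical shape)."""
    trace_id: str
    prompt: str                                  # full prompt sent to the model
    answer: str                                  # final answer shown to the user
    context: list[str] = field(default_factory=list)        # retrieved docs, prior turns
    tool_calls: list[ToolCall] = field(default_factory=list)
    reasoning: Optional[str] = None              # only when the model exposes it

    def to_json(self) -> str:
        # asdict() recurses into the nested ToolCall dataclasses.
        return json.dumps(asdict(self), ensure_ascii=False)


trace = AgentTrace(
    trace_id="t-001",
    prompt="my order arrived broken, what do I do?",
    answer="I'm sorry about that. I can open a refund request for you.",
    tool_calls=[ToolCall("lookup_order", {"order_id": "12345"}, "delivered")],
)

# Round-trip through JSON, as a trace store would.
record = json.loads(trace.to_json())
```

The point of the structure is that the conversation content, not the HTTP status, becomes the unit you store and query.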

To scale judgment without scaling the team, we set up one LLM-based evaluator agent that runs automatically over a sample of production traffic, usually between 10% and 20%. It’s not perfect, but it gives me continuous metrics that would be impossible to get manually.
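One common way to pick a stable sample of traffic is to hash the trace ID and keep the traces whose hash falls under a threshold. The sketch below assumes that approach; the function name, the 15% default (picked from inside the 10% to 20% range mentioned above) and the ID format are illustrative, not a description of the actual pipeline.

```python
import hashlib


def in_eval_sample(trace_id: str, rate: float = 0.15) -> bool:
    """Return True for a stable fraction (about `rate`) of trace IDs.

    Hash-based sampling is deterministic: the same trace is always in or
    out of the sample, which keeps reruns and backfills consistent.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so bucket is uniform in [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < rate


# Hypothetical trace IDs, just to show the sampled share.
ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in ids if in_eval_sample(t)]
share = len(sampled) / len(ids)
```

Only the sampled traces would then be handed to the evaluator, which keeps its cost proportional to the chosen rate rather than to total traffic.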

The question I imagine you’re asking now: why only one?

We’ve tested LLM-as-a-judge enough to understand that each new evaluator adds a degree of complexity, and they don’t come ready-made, even when their role is well scoped. For them to be truly effective, the metric they’ll own needs to be refined first. Only after a metric matures does it make sense to create a dedicated agent to monitor it. It’s a gradual process, not a one-shot architecture decision.

And precisely because these evaluators aren’t perfect, I’ve learned I can’t depend on them alone. Today we have human review queues with predefined criteria, fed by metadata generated alongside the traces. It’s surgical work, not exhaustive — specialists look at the cases that really matter, not everything. That review is where we identify new problems, mature existing metrics and, eventually, decide when it makes sense to create a new monitoring agent.
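A review queue driven by predefined criteria can be pictured as a set of named rules applied to each trace's metadata. The following is a sketch under assumptions: the criteria, thresholds and metadata keys (`evaluator_score`, `user_feedback`, `tool_errors`) are hypothetical examples, not the criteria actually used.

```python
from typing import Callable

Metadata = dict
Criterion = Callable[[Metadata], bool]

# Hypothetical criteria; real ones would come out of the team's own reviews.
REVIEW_CRITERIA: dict[str, Criterion] = {
    "low_evaluator_score": lambda m: m.get("evaluator_score", 1.0) < 0.5,
    "negative_user_feedback": lambda m: m.get("user_feedback") == "negative",
    "tool_error": lambda m: m.get("tool_errors", 0) > 0,
}


def review_reasons(metadata: Metadata) -> list[str]:
    """Names of every criterion this trace's metadata matches."""
    return [name for name, matches in REVIEW_CRITERIA.items() if matches(metadata)]


def build_review_queue(traces: list[Metadata]) -> list[tuple[str, list[str]]]:
    """Keep only traces that match at least one criterion, with the reasons."""
    queue = []
    for metadata in traces:
        reasons = review_reasons(metadata)
        if reasons:
            queue.append((metadata["trace_id"], reasons))
    return queue


queue = build_review_queue([
    {"trace_id": "t-1", "evaluator_score": 0.9},
    {"trace_id": "t-2", "evaluator_score": 0.3, "tool_errors": 1},
    {"trace_id": "t-3", "user_feedback": "negative"},
])
```

Attaching the matched reasons to each queued trace tells the reviewer why the case was selected, which is what keeps the work targeted rather than exhaustive.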

The fourth piece, after tracing, automated evaluation and human review, changed my development routine the most: I stopped tweaking prompts “by feel”. Today I treat prompts as code: versioned on GitHub, tested with Promptfoo in a CI/CD pipeline, and published in a controlled way via Langfuse. Every change goes through unit tests, scenario tests and regression checks before reaching production.
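The regression-check idea can be illustrated without any particular tool. The sketch below is plain Python, not Promptfoo's configuration format, and it replaces the real LLM call with a stub so it runs offline; the scenarios, templates and the `stub_model` function are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # takes a rendered prompt, returns an answer

# Hypothetical scenarios; each states what the answer must mention.
SCENARIOS = [
    {"input": "refund order #12345", "must_contain": "refund"},
    {"input": "my order arrived broken, what do I do?", "must_contain": "refund"},
]


def pass_rate(template: str, model: Model) -> float:
    """Fraction of scenarios whose answer contains the required text."""
    passed = 0
    for scenario in SCENARIOS:
        answer = model(template.format(user_message=scenario["input"]))
        if scenario["must_contain"].lower() in answer.lower():
            passed += 1
    return passed / len(SCENARIOS)


def is_regression(current: str, candidate: str, model: Model) -> bool:
    """True when the candidate prompt passes fewer scenarios than the current one."""
    return pass_rate(candidate, model) < pass_rate(current, model)


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call: it only offers a refund when the
    # rendered prompt mentions one.
    if "refund" in prompt.lower():
        return "I can open a refund request for you."
    return "Could you tell me more?"


current_prompt = "You are a support agent. Offer a refund for damaged orders.\nUser: {user_message}"
candidate_prompt = "You are a support agent.\nUser: {user_message}"

blocked = is_regression(current_prompt, candidate_prompt, stub_model)
```

In a CI pipeline, a check like `is_regression` would fail the build so that a prompt change that loses a previously passing scenario never reaches production.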

Closing the Loop

What made this structure truly valuable was realizing that monitoring feeds development. When a failure shows up in production, I don’t discard that interaction — I use it. Analysis tools cluster error patterns, problematic interactions go into the dataset, new tests are created in Promptfoo, the prompt gets fixed and the change goes through the pipeline again before shipping. Production errors become permanent test cases. The system learns from what breaks.
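Turning a production failure into a permanent test case can be as simple as copying the failing input into the evaluation dataset together with a description of the expected behavior. This is a minimal sketch under assumptions: the dataset is an in-memory list, and the field names and the ID scheme (a hash of the prompt, used to avoid duplicates) are hypothetical.

```python
import hashlib


def failure_to_test_case(trace: dict, expected_behavior: str) -> dict:
    """Build a regression test case from a failed production trace."""
    # Hash the prompt so the same failing input always maps to the same ID.
    case_id = hashlib.sha256(trace["prompt"].encode("utf-8")).hexdigest()[:12]
    return {
        "id": case_id,
        "input": trace["prompt"],
        "bad_answer": trace["answer"],        # kept for reference during review
        "expected_behavior": expected_behavior,
        "source_trace": trace["trace_id"],
    }


def add_to_dataset(dataset: list[dict], case: dict) -> bool:
    """Append the case unless an identical input is already covered."""
    if any(existing["id"] == case["id"] for existing in dataset):
        return False
    dataset.append(case)
    return True


dataset: list[dict] = []
failed_trace = {
    "trace_id": "t-777",
    "prompt": "my order arrived broken, what do I do?",
    "answer": "Please contact the carrier.",
}
case = failure_to_test_case(failed_trace, "Offer to open a refund request.")
first_add = add_to_dataset(dataset, case)
second_add = add_to_dataset(dataset, case)  # duplicate, ignored
```

Keeping a pointer back to the source trace preserves the audit trail from the test case to the original production interaction.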

The full loop looks like this:

1. Local development
2. Test scenario creation (Promptfoo)
3. Commit on GitHub
4. Evals in CI/CD
5. Publishing via Langfuse
6. Logs and trace collection
7. Automated evaluation + human review
8. Prompt adjustment → new cycle

This process gave me something I’d been missing since I started working with agents: reproducibility, auditability and traceability. Traditional software gets these properties from its code; with LLM systems, we have to build them deliberately.

Conclusion

Moving from traditional software to agents is a paradigm shift. We went from deterministic systems, where we validated that the code ran without breaking, to probabilistic systems, where monitoring has to focus on understanding natural language interactions and continuously improving the answers.

The tool matters less than the technique. What I’ve learned in practice is that an agent’s success in production depends less on the model you pick than on the validation architecture and the data engineering that support it. Combining the scale of automated evaluations with surgical human review is, so far, the most sustainable path I’ve found to scale agents with quality and safety.