Author
Lucas Rego · Engineering · Capim
We were going in circles trying to improve one of our AI agents. It’s the kind of agent that talks directly to our customers and needs to decide when to escalate a ticket to a human analyst.
We tried several approaches to solve the problem — including LLM-as-a-Judge — but they overcomplicated things for an agent still in its early days.
In this post, I’ll show how we more than doubled the agent’s accuracy (from 40% to 95%) in two weeks using nothing but error analysis and manual annotation of conversations.
The problem
Early this year we noticed that one of our agents seemed to be escalating tickets to humans more often than we’d like.
Reviewing a few conversations by hand, we found two kinds of error:
- The agent could have solved the problem, but escalated to a human.
- The agent should have escalated, but tried to solve it on its own.
In other words, the problem lived on both sides of the decision.
Observability
Our first instinct was to measure the size of the problem.
With a few agents running for a while, we had already learned an important lesson: without observability, improving agents is practically impossible. We’re using Langfuse, which has worked very well for this. Before that, analyzing conversations meant writing complex SQL queries over unfriendly data structures.
What didn’t work
Before explaining what worked, it’s worth quickly covering two attempts that didn’t.
Attempt 1: lots of metrics + LLM-as-a-Judge
Our first attempt was to create a large set of metrics to monitor the agent. We split the metrics into two types. The table below has some examples:
| Type | Example | How to measure |
|---|---|---|
| Deterministic | % of escalated tickets | Code |
| Deterministic | Response time | Code |
| Subjective | Hallucination | LLM-as-a-Judge |
| Subjective | Answer quality | LLM-as-a-Judge |
We created about 15 metrics, 5 of them with LLM-as-a-Judge, and the problem showed up fast.
An LLM-as-a-Judge is yet another AI generating information you first need to learn to trust. Before using a metric like that, we’d have to validate whether the evaluation was correct, whether the judge’s prompt worked across use cases, whether the classification was consistent over time, and so on.
In practice, we were creating another AI project to validate the first one.
In the end, we only trusted the deterministic metrics. Those are extremely important — they explain very well what happened, but very little about why.
Attempt 2: validating one judge at a time
Our second attempt was more careful. The idea was:
- create an LLM-as-a-Judge
- let it classify the conversations
- manually label its classifications as right/wrong
- fix the judge’s prompt
- iterate until we trusted it
The problem was simple: while we were validating the judge, we weren’t solving the original problem. We were just postponing it.
The most important lesson: human annotation
A very important takeaway from those attempts was the value of manual labelling.
In Langfuse it’s called Human Annotation, and it boils down to looking at one interaction between the agent and a customer and classifying it against some criterion.
For example, you can take 30 interactions a day and manually classify whether the agent hallucinated or not. After a week you already have a good sample of the problem and, on top of that, you can identify types/causes of hallucination.
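As a rough illustration of the "30 interactions a day" routine, here is a minimal sketch of drawing a random batch for review. The trace IDs and the `sample_for_annotation` helper are hypothetical; in practice the IDs would come from whatever your observability tool exports.

```python
import random


def sample_for_annotation(trace_ids, k=30, seed=None):
    """Pick up to k distinct trace IDs uniformly at random for manual review."""
    rng = random.Random(seed)
    ids = list(trace_ids)
    # If the day had fewer than k conversations, review all of them.
    return rng.sample(ids, min(k, len(ids)))


# Hypothetical IDs standing in for one day of conversations.
todays_traces = [f"trace-{i}" for i in range(200)]
batch = sample_for_annotation(todays_traces, k=30, seed=42)
```

Sampling at random, instead of picking conversations that already look suspicious, keeps the annotated set representative of what customers actually see.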
What worked
After the previous attempts, we decided to take a step back, forget LLM-as-a-Judge for a while, and focus on the problem to be solved.
The idea was: if we’re going to do human annotation/labelling anyway, better to aim it straight at the actual problem. In our case it was simple:
Should the agent have escalated this conversation? → yes/no
This classification criterion was heavily debated, because initially we were thinking of classifying escalations as right or wrong. I’ll explain shortly why that wouldn’t work — for now, let’s walk through our error analysis and subsequent prompt optimization.
Analysis process
- Select a sample of ~50 conversations
- Manually classify whether it should escalate or not (human annotation)
- Compare with the agent’s actual behavior
- Build a confusion matrix and compute accuracy
| | Should escalate | Should not escalate |
|---|---|---|
| Agent escalated | True Positive | False Positive |
| Agent did not escalate | False Negative | True Negative |
Accuracy = (True Positives + True Negatives) / Sample size
- Adjust the prompt
- Run the new prompt on the same sample (backtest)
- Recompute the confusion matrix
This lets us compute metrics like accuracy and track the prompt’s evolution. The process is very similar to what you’d do in traditional Machine Learning. In practice, we’re backtesting the prompt.
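We did this in spreadsheets, but the same calculation can be sketched in a few lines of Python. The labels below are made up for illustration and are not our real data; `human_labels` stands for the annotation ("should escalate?") and the other two lists for what each prompt version did on the same conversations.

```python
from collections import Counter


def confusion_matrix(should_escalate, did_escalate):
    """Count TP/FP/FN/TN, treating 'escalate' as the positive class."""
    if len(should_escalate) != len(did_escalate):
        raise ValueError("both lists must cover the same conversations")
    counts = Counter()
    for expected, actual in zip(should_escalate, did_escalate):
        if actual and expected:
            counts["TP"] += 1  # escalated, and should have
        elif actual and not expected:
            counts["FP"] += 1  # escalated, but could have solved it
        elif expected:
            counts["FN"] += 1  # did not escalate, but should have
        else:
            counts["TN"] += 1  # did not escalate, correctly
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


def accuracy(cm):
    """(True Positives + True Negatives) / sample size."""
    total = sum(cm.values())
    return (cm["TP"] + cm["TN"]) / total if total else 0.0


# Illustrative labels only (10 conversations).
human_labels = [True, True, False, False, True, False, False, True, False, False]
old_prompt = [True, False, True, True, False, True, False, True, True, False]
new_prompt = [True, True, False, False, True, False, False, True, True, False]

old_cm = confusion_matrix(human_labels, old_prompt)
new_cm = confusion_matrix(human_labels, new_prompt)
print(old_cm, accuracy(old_cm))
print(new_cm, accuracy(new_cm))
```

Because the human labels stay fixed, every new prompt version is scored against the same reference, which is what makes the numbers comparable from one iteration to the next.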
We iterated on this process for about two weeks. Each round:
- we analyzed new errors
- adjusted the prompt
- ran the backtest again
Accuracy climbed from 40% to 95% across iterations. At that point, we considered the prompt good enough and moved on to the agent’s next problem.
Why didn’t we classify “right or wrong escalation”?
Initially we thought about classifying only: “did the agent escalate correctly?” But that creates a problem.
If we do that, the new prompt can only be evaluated on the cases where the old prompt escalated. It completely ignores the cases where the agent should have escalated but didn’t. Half of the problem disappears from the analysis.
That’s why we classified only the expected behavior, regardless of what the agent did.
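A tiny made-up example shows the blind spot. The four conversations and their field names are hypothetical; the point is only that a "was the escalation right?" label can exist solely where an escalation happened.

```python
# Hypothetical sample: what should have happened vs. what the old prompt did.
conversations = [
    {"id": "c1", "should_escalate": True, "old_escalated": True},
    {"id": "c2", "should_escalate": False, "old_escalated": True},
    {"id": "c3", "should_escalate": True, "old_escalated": False},
    {"id": "c4", "should_escalate": False, "old_escalated": False},
]

# Labelling "right or wrong escalation" only covers escalated conversations.
visible = {c["id"] for c in conversations if c["old_escalated"]}

# Conversations where the agent should have escalated but did not.
missed = [
    c["id"]
    for c in conversations
    if c["should_escalate"] and not c["old_escalated"]
]

print(sorted(visible))  # conversations the narrower label can see
print(missed)           # false negatives, invisible to that label
```

Here `c3` never receives a label under the narrower scheme, so no prompt change could be credited for fixing it.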
Why we like this approach
A few reasons:
- We went straight at the problem — we cut LLM-as-a-Judge validation efforts and went directly after our pain
- We could measure progress
- We used real human interactions: the backtest showed the new prompt would have done better than the old one on the same conversations
- We learned a lot about the agent’s behavior — manually reviewing conversations reveals how problems actually emerge
So LLM-as-a-Judge is useless?
Not at all! Our main lesson was that LLM judges shouldn’t be the first option: they demand significant effort and a certain maturity in the agent’s development.
Today I see them more as a monitoring tool. Once implemented and validated, they can be extremely useful to measure the size of a problem and surface cases for analysis.
But if you can fetch the conversations that contain the problem you want to solve, it’s better to go straight to error analysis and prompt improvement.
In our case, the ideal strategy would be:
- Improve the prompt via error analysis
- Create an LLM-as-a-Judge to monitor the problem
- Use it as a regression alarm
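We have not described a concrete alarm in this post, so take the following as one possible sketch, not our implementation. It assumes a validated judge that flags each conversation where the escalation decision looks wrong, and a baseline flagged rate measured right after the prompt was fixed; the function names and the 5-point tolerance are assumptions.

```python
def flagged_rate(judge_flags):
    """Share of conversations the judge flagged as a wrong escalation decision."""
    return sum(judge_flags) / len(judge_flags) if judge_flags else 0.0


def regression_alarm(judge_flags, baseline_rate, tolerance=0.05):
    """Fire when the flagged rate exceeds the baseline by more than the tolerance."""
    return flagged_rate(judge_flags) > baseline_rate + tolerance


# Hypothetical windows of judge output (True = flagged).
quiet_week = [False] * 19 + [True]     # 5% flagged
noisy_week = [True] * 2 + [False] * 8  # 20% flagged

print(regression_alarm(quiet_week, baseline_rate=0.05))
print(regression_alarm(noisy_week, baseline_rate=0.05))
```

When the alarm fires, the flagged conversations become the next sample for human annotation and error analysis.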
Implementation notes
We used Langfuse for:
- observability
- human annotation
- prompt experiments
We also found we needed to add extra information to the traces to make analysis easier along the way.
The confusion matrices were computed in simple spreadsheets, exporting the data from Langfuse.
Useful links
- Hamel Husain — excellent content on LLM evaluation
- Confusion Matrix — Wikipedia