Error analysis: from 40% to 95% accuracy in 2 weeks

Date

Feb 12, 2026

Author

Lucas Rego · Engineering · Capim

We were going in circles trying to improve one of our AI agents. It’s the kind of agent that talks directly to our customers and needs to decide when to escalate a ticket to a human analyst.

We tried several approaches to solve the problem — including LLM-as-a-Judge — but they overcomplicated things for an agent still in its early days.

In this post, I’ll show how we more than doubled the agent’s accuracy (from 40% to 95%) in two weeks using nothing but error analysis and manual annotation of conversations.

The problem

Early this year we noticed that one of our agents seemed to be escalating tickets to humans more often than we’d like.

Reviewing a few conversations by hand, we found two kinds of error:

The agent could have solved the problem, but escalated to a human.
The agent should have escalated, but tried to solve it on its own.

In other words, the problem lived on both sides of the decision.

Observability

Our first instinct was to measure the size of the problem.

With a few agents running for a while, we had already learned an important lesson: without observability, improving agents is practically impossible. We’re using Langfuse, which has worked very well for this. Before that, analyzing conversations meant writing complex SQL queries over unfriendly data structures.

What didn’t work

Before explaining what worked, it’s worth quickly covering two attempts that didn’t.

Attempt 1: lots of metrics + LLM-as-a-Judge

Our first attempt was to create a large set of metrics to monitor the agent. We split the metrics into two types. The table below has some examples:

Type	Example	How to measure
Deterministic	% of escalated tickets	Code
Deterministic	Response time	Code
Subjective	Hallucination	LLM-as-a-Judge
Subjective	Answer quality	LLM-as-a-Judge

We created about 15 metrics, 5 of them with LLM-as-a-Judge, and the problem showed up fast.

An LLM-as-a-Judge is yet another AI generating information you first need to learn to trust. Before using a metric like that, we’d have to validate whether the evaluation was correct, whether the judge’s prompt worked across use cases, whether the classification was consistent over time, and so on.

In practice, we were creating another AI project to validate the first one.

In the end, we only trusted the deterministic metrics. Those are extremely important — they explain very well what happened, but very little about why.
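As a rough illustration of what "measured by code" means for the deterministic metrics, here is a minimal Python sketch. The `Ticket` record and its fields are hypothetical and the numbers are made up; real traces will have a different shape.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Ticket:
    # Hypothetical record shape, not the schema of any real tracing tool.
    escalated: bool
    response_seconds: float


def escalation_rate(tickets: list[Ticket]) -> float:
    """Share of tickets handed to a human, as a fraction in [0, 1]."""
    if not tickets:
        raise ValueError("need at least one ticket")
    return sum(t.escalated for t in tickets) / len(tickets)


def median_response_seconds(tickets: list[Ticket]) -> float:
    """Median response time; less sensitive to outliers than the mean."""
    if not tickets:
        raise ValueError("need at least one ticket")
    return median(t.response_seconds for t in tickets)


# Made-up data for illustration only.
tickets = [
    Ticket(escalated=True, response_seconds=12.0),
    Ticket(escalated=False, response_seconds=8.0),
    Ticket(escalated=True, response_seconds=30.0),
    Ticket(escalated=False, response_seconds=10.0),
]
print(escalation_rate(tickets))          # 0.5
print(median_response_seconds(tickets))  # 11.0
```

Both numbers describe what happened. Neither says whether any individual escalation was the right call, which is the gap the rest of this post is about.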

Attempt 2: validating one judge at a time

Our second attempt was more careful. The idea was:

create an LLM-as-a-Judge
let it classify the conversations
manually label its classifications as right/wrong
fix the judge’s prompt
iterate until we trusted it

The problem was simple: while we were validating the judge, we weren’t solving the original problem. We were just postponing it.

The most important lesson: human annotation

A very important takeaway from those attempts was the value of manual labelling.

In Langfuse this is called Human Annotation, and it boils down to looking at one interaction between the agent and a customer and classifying it against some criterion.

For example, you can take 30 interactions a day and manually classify whether the agent hallucinated or not. After a week you already have a good sample of the problem and, on top of that, you can identify types/causes of hallucination.
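One way to pick those daily interactions without cherry-picking is a uniform random draw. The sketch below is an assumption about how that could be done, not a description of any Langfuse feature; the function name and the conversation ids are hypothetical.

```python
import random


def pick_for_annotation(conversation_ids, k=30, seed=None):
    """Return up to k conversation ids chosen uniformly at random, without repeats."""
    ids = list(conversation_ids)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, min(k, len(ids)))


# Hypothetical ids standing in for one day of conversations.
todays_ids = [f"conv-{i}" for i in range(200)]
batch = pick_for_annotation(todays_ids, k=30, seed=20260212)
print(len(batch))  # 30
```

Random sampling keeps the annotated set representative of the traffic, so the error rate seen in the sample is a fair estimate of the error rate overall.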

What worked

After the previous attempts, we decided to take a step back, forget LLM-as-a-Judge for a while, and focus on the problem to be solved.

The idea was: if we’re going to do human annotation/labelling anyway, better to aim it straight at the actual problem. In our case it was simple:

Should the agent have escalated this conversation? → yes/no

This classification criterion was heavily debated, because initially we were thinking of classifying escalations as right or wrong. I’ll explain shortly why that wouldn’t work — for now, let’s walk through our error analysis and subsequent prompt optimization.

Analysis process

Select a sample of ~50 conversations
Manually classify whether each conversation should have been escalated (human annotation)
Compare with the agent’s actual behavior
Build a confusion matrix and compute accuracy

	Should escalate	Should not escalate
Agent escalated	True Positive	False Positive
Agent did not escalate	False Negative	True Negative

Accuracy = (True Positives + True Negatives) / Sample size

Adjust the prompt
Run the new prompt on the same sample (backtest)
Recompute the confusion matrix

This lets us compute metrics like accuracy and track the prompt’s evolution. The process is very similar to what you’d do in traditional Machine Learning. In practice, we’re backtesting the prompt.
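The confusion matrix and accuracy formula above can be sketched in a few lines of Python. The function names are illustrative and the ten labelled conversations are made up; they are not the data from this post.

```python
from collections import Counter


def confusion_counts(should_escalate, agent_escalated):
    """Count TP/FP/FN/TN from two equal-length lists of booleans."""
    if len(should_escalate) != len(agent_escalated):
        raise ValueError("label and decision lists must have the same length")
    counts = Counter()
    for expected, actual in zip(should_escalate, agent_escalated):
        if actual and expected:
            counts["TP"] += 1  # escalated, and should have
        elif actual and not expected:
            counts["FP"] += 1  # escalated, but could have solved it
        elif not actual and expected:
            counts["FN"] += 1  # tried to solve it, but should have escalated
        else:
            counts["TN"] += 1  # did not escalate, correctly
    return counts


def accuracy(counts):
    """(True Positives + True Negatives) / Sample size."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no conversations to score")
    return (counts["TP"] + counts["TN"]) / total


# Made-up labels for ten conversations.
should_escalate = [True, True, True, False, False, False, False, True, False, False]
agent_escalated = [True, False, True, True, True, False, False, False, True, False]

counts = confusion_counts(should_escalate, agent_escalated)
print(dict(counts))      # {'TP': 2, 'FN': 2, 'FP': 3, 'TN': 3}
print(accuracy(counts))  # 0.5
```

Keeping the four counts separate, rather than only the accuracy, shows which side of the decision a prompt change actually moved.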

We iterated on this process for about two weeks. Each round:

we analyzed new errors
adjusted the prompt
ran the backtest again

Accuracy climbed from 40% to 95% across iterations. At that point, we considered the prompt good enough and moved on to the agent’s next problem.
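A minimal sketch of one backtest round, under stated assumptions: `prompt_v1` and `prompt_v2` are hypothetical stand-ins for "the agent running with a given prompt", implemented as keyword rules so the example runs without a model, and the sample is made up.

```python
from typing import Callable

# (conversation text, human label: should the agent have escalated?)
LabelledSample = list[tuple[str, bool]]


def backtest(decide: Callable[[str], bool], sample: LabelledSample):
    """Run one prompt version over the labelled sample.

    Returns (accuracy, errors), where errors holds the conversations whose
    decision disagrees with the human label, ready for the next review round.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    errors = [(text, label) for text, label in sample if decide(text) != label]
    return 1 - len(errors) / len(sample), errors


# Hypothetical stand-ins. In a real backtest each would call the model
# with the corresponding prompt version.
def prompt_v1(text: str) -> bool:
    return True  # escalates everything


def prompt_v2(text: str) -> bool:
    return "refund" in text or "legal" in text


# Made-up labelled sample.
sample = [
    ("where is my invoice", False),
    ("I want a refund now", True),
    ("how do I reset my password", False),
    ("my lawyer will contact you about this legal issue", True),
]

for name, decide in [("v1", prompt_v1), ("v2", prompt_v2)]:
    acc, errors = backtest(decide, sample)
    print(name, acc, len(errors))
```

Because both versions run on the same labelled sample, any change in accuracy comes from the prompt and not from a different mix of conversations.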

Why didn’t we classify “right or wrong escalation”?

Initially we thought about classifying only: “did the agent escalate correctly?” But that creates a problem.

If we do that, the new prompt can only be evaluated on the cases where the old prompt escalated. It completely ignores the cases where the agent should have escalated but didn’t. Half of the problem disappears from the analysis.

That’s why we classified only the expected behavior, regardless of what the agent did.
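The difference between the two labelling schemes is easy to see on a toy example. The five conversations below are made up, and the variable names are only illustrative.

```python
# Each tuple: (agent escalated under the old prompt, human says it should have escalated)
conversations = [
    (True, True),
    (True, False),
    (False, True),   # missed escalation
    (False, False),
    (False, True),   # missed escalation
]

# Scheme A: "did the agent escalate correctly?" can only be asked
# of conversations the agent actually escalated.
scheme_a = [c for c in conversations if c[0]]

# Scheme B: the expected behaviour is labelled for every conversation.
scheme_b = conversations

missed_total = [c for c in conversations if not c[0] and c[1]]
missed_in_a = [c for c in scheme_a if not c[0] and c[1]]

print(len(scheme_a), len(scheme_b))         # 2 5
print(len(missed_total), len(missed_in_a))  # 2 0
```

Scheme A never sees the two missed escalations, so no prompt change could be credited with fixing them.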

Why we like this approach

A few reasons:

We went straight at the problem: we skipped the LLM-as-a-Judge validation effort and worked directly on the escalation decision
We could measure progress
We used real human interactions: the backtest showed, on the same conversations, that the new prompt would have done better than the old one
We learned a lot about the agent’s behavior: manually reviewing conversations reveals how problems actually emerge

So LLM-as-a-Judge is useless?

Not at all! Our main lesson was that judges shouldn’t be the first option: they demand significant effort and a certain maturity in the agent’s development.

Today I see them more as a monitoring tool. Once implemented and validated, they can be extremely useful to measure the size of a problem and surface cases for analysis.

But if you can fetch the conversations that contain the problem you want to solve, it’s better to go straight to error analysis and prompt improvement.

In our case, the ideal strategy would be:

Improve the prompt via error analysis
Create an LLM-as-a-Judge to monitor the problem
Use it as a regression alarm
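The last step could look roughly like the sketch below. It assumes a judge that has already been validated and that emits one boolean per conversation; the function name, the baseline rate, the tolerance, and the minimum sample size are all hypothetical values chosen for illustration.

```python
def regression_alarm(judge_flags, baseline_rate, tolerance=0.05, min_samples=20):
    """Return True when the recent share of flagged conversations
    exceeds baseline_rate + tolerance.

    judge_flags: booleans from the judge, True meaning "wrong escalation decision".
    """
    if len(judge_flags) < min_samples:
        return False  # too little data to say anything
    rate = sum(judge_flags) / len(judge_flags)
    return rate > baseline_rate + tolerance


# Made-up windows of 40 judged conversations each.
quiet_week = [True] * 2 + [False] * 38  # 5% flagged
bad_week = [True] * 8 + [False] * 32    # 20% flagged

print(regression_alarm(quiet_week, baseline_rate=0.05))  # False
print(regression_alarm(bad_week, baseline_rate=0.05))    # True
```

When the alarm fires, the flagged conversations become the next sample for manual error analysis, so the judge surfaces cases and humans still decide what is wrong.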

Implementation notes

We used Langfuse for:

observability
human annotation
prompt experiments

We also found we needed to add extra information to the traces to make analysis easier along the way.

We computed the confusion matrices in simple spreadsheets, using data exported from Langfuse.

Useful links

Hamel Husain — excellent content on LLM evaluation
Confusion Matrix — Wikipedia