Fail fast, fix faster: Why faster models can beat smarter ones

AI Engineer Melbourne, 2026
AJ Fisher

Does smart mean slow?

Good enough, but fast

Fast-loop theory

Off the back of this I did some quick calculations: if you could produce a loop that was really fast, could you brute force your way to success? Put another way, could you speed run a Ralph Wiggum Loop?

The idea was that if you have a lower capability model with a reasonable rate of improvement, and you can iterate quickly, then in theory you can outperform a slow, high performance frontier model before it has finished generating.

A Ralph Loop is perfectly designed for goal-based systems, assuming it has good feedback mechanisms.

Looking at the Mercury 2 performance data I worked out that it should make this theory testable.

A couple of weeks later, Andrej Karpathy released AutoResearch, which has a similar pattern but comes from a different direction: cheap experiments, measurable feedback and repeated iterative improvement.

So with all of these ideas in hand I built a small benchmark to test the theory.

I'll talk you through that now.

Benchmark task

Spec and rules

OpenAPI contract
Over $1000 quote = approve
Line / Order Discount > 10% = approve
Approval once only
API best practices
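
The approval rules above can be sketched as follows. This is only one reading of the slide: it assumes "= approve" means the quote requires approval, that the $1000 threshold applies to the total after discounts, and that "once only" means a second approval is rejected. All names (`Quote`, `Line`, `requires_approval`, `approve`) are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass
from decimal import Decimal

# Thresholds taken from the slide; how they are applied is an assumption.
TOTAL_THRESHOLD = Decimal("1000")
DISCOUNT_THRESHOLD = Decimal("0.10")


@dataclass
class Line:
    unit_price: Decimal
    quantity: int
    discount: Decimal = Decimal("0")  # fraction, e.g. 0.05 for 5%


@dataclass
class Quote:
    lines: list[Line]
    order_discount: Decimal = Decimal("0")
    approved: bool = False

    def total(self) -> Decimal:
        # Assumption: line discounts apply first, then the order discount.
        subtotal = sum(
            line.unit_price * line.quantity * (1 - line.discount)
            for line in self.lines
        )
        return subtotal * (1 - self.order_discount)


class AlreadyApproved(Exception):
    """Raised when a quote is approved a second time."""


def requires_approval(quote: Quote) -> bool:
    if quote.total() > TOTAL_THRESHOLD:
        return True
    if quote.order_discount > DISCOUNT_THRESHOLD:
        return True
    return any(line.discount > DISCOUNT_THRESHOLD for line in quote.lines)


def approve(quote: Quote) -> None:
    # "Approval once only": a repeat approval is treated as an error.
    if quote.approved:
        raise AlreadyApproved("quote has already been approved")
    quote.approved = True
```

Boundary cases such as a total of exactly $1000 or a discount of exactly 10% are the kind of edge a logic test would pin down; here both are treated as not needing approval.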

Validation

47 parallel tests ~300ms exec
Specific failure feedback
Logic tests (rules, edge cases)
API shape tests (headers, enums)
Robustness tests (responses)
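
A minimal sketch of what "specific failure feedback" could look like, assuming each test is a function that returns either nothing or a targeted message. The check names, the response shape, the status enum and the scoring (fraction of checks passed) are all assumptions for illustration, not the benchmark's actual tests.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional

# A check returns None on success, or a specific failure message.
Check = Callable[[dict], Optional[str]]


@dataclass
class ValidationResult:
    score: float          # assumption: fraction of checks passed
    failures: list[str]   # messages fed back to the model


def check_content_type(response: dict) -> Optional[str]:
    got = response.get("headers", {}).get("Content-Type")
    if got != "application/json":
        return f"shape: expected Content-Type 'application/json', got {got!r}"
    return None


def check_status_enum(response: dict) -> Optional[str]:
    allowed = {"DRAFT", "PENDING_APPROVAL", "APPROVED"}  # hypothetical enum
    got = response.get("body", {}).get("status")
    if got not in allowed:
        return f"shape: status {got!r} is not one of {sorted(allowed)}"
    return None


def check_large_quote_flagged(response: dict) -> Optional[str]:
    body = response.get("body", {})
    if body.get("total", 0) > 1000 and not body.get("requires_approval"):
        return "logic: quote over $1000 was not flagged for approval"
    return None


def validate(response: dict, checks: list[Check]) -> ValidationResult:
    # Run the checks in parallel so validation stays fast.
    with ThreadPoolExecutor() as pool:
        messages = list(pool.map(lambda check: check(response), checks))
    failures = [m for m in messages if m is not None]
    score = (len(checks) - len(failures)) / len(checks)
    return ValidationResult(score=score, failures=failures)
```

The point of the design is that each failure names what was expected and what was received, so the next turn has something concrete to fix rather than a bare pass or fail.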

Agent harness

Invoke models via API
OpenAI, Gemini, Mercury, Ollama, alt local model providers
Run implementation - validate - inform next turn
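
The "run implementation - validate - inform next turn" cycle can be sketched as a small loop. This is an illustrative outline only: `generate` and `validate` are hypothetical callables standing in for the model API call and the test suite, and the stub model below simply fixes one reported failure per turn.

```python
from typing import Callable

Generate = Callable[[list[str]], set]            # feedback -> candidate
Validate = Callable[[set], tuple[float, list[str]]]  # candidate -> (score, failures)


def run_loop(generate: Generate, validate: Validate, max_turns: int = 15):
    """Generate, validate, and feed failures into the next turn."""
    feedback: list[str] = []
    trace: list[tuple[int, float]] = []
    for turn in range(1, max_turns + 1):
        candidate = generate(feedback)
        score, failures = validate(candidate)
        trace.append((turn, score))
        if not failures:
            return True, trace
        feedback = failures  # inform the next turn
    return False, trace


# --- Stubs so the sketch runs without a real model or API ---
REQUIRED = {"approval_rule", "discount_rule", "headers"}


def stub_validate(candidate: set) -> tuple[float, list[str]]:
    missing = sorted(REQUIRED - candidate)
    score = 1 - len(missing) / len(REQUIRED)
    return score, [f"missing: {name}" for name in missing]


def make_stub_model() -> Generate:
    implemented: set = set()

    def generate(feedback: list[str]) -> set:
        # The stub "fixes" only the first reported failure each turn.
        if feedback:
            implemented.add(feedback[0].removeprefix("missing: "))
        return set(implemented)

    return generate
```

With these stubs, `run_loop(make_stub_model(), stub_validate)` passes on the fourth turn: one empty attempt followed by three single-fix turns.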

Judging the agent loop

Run allowed up to 15 turns
Each turn attains PASS / PARTIAL / FAIL / THRASH
Record detailed trace performance data per run
10 fresh runs and aggregate model results
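
As a rough sketch, aggregating a set of fresh runs into per-model summary figures might look like the following. The `RunRecord` fields are hypothetical, and it is an assumption that median iterations and mean time are computed over passing runs only.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RunRecord:
    passed: bool
    turns: int        # turns used in this run (capped at 15 by the harness)
    seconds: float    # wall-clock time for the run


def summarise(runs: list[RunRecord]) -> dict:
    passing = [r for r in runs if r.passed]
    return {
        "runs_passed": f"{len(passing)}/{len(runs)}",
        # One-shot: passed on the first turn, with no feedback needed.
        "one_shot": sum(1 for r in passing if r.turns == 1),
        "median_iterations": median([r.turns for r in passing]) if passing else None,
        "mean_time_to_pass": mean([r.seconds for r in passing]) if passing else None,
    }
```

Starting each run from a fresh state matters here, because the aggregate is only meaningful if no run benefits from another run's feedback.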

Latency economics

So why is diffusion so fast here?

It doesn't inherently make the model smarter but it does change the cost it takes to produce the next candidate solution.

If you look at these two animations side by side you can see how different model architectures solve the same problem with two different approaches.

An autoregressive model iterates sequentially through tokens until it gets to a stop point.

A diffusion model starts from random noise and iteratively refines the output over a small number of steps.

For this talk, don't worry too much about the internal mechanics of text diffusion - that could be a long talk on its own. The key is that candidate generations can be produced with extremely low latency.

Then, external validation can make these fast generations steerable using our Ralph loop setup.

What this means is that a marginally competent model coupled to a very high speed loop improves quality very quickly.

And I'll show you why this happens.

Compounding progress

score = 1 - (1 - r)ⁿ

r: improvement against remaining error

n: completed iterations

Small gains accumulate

Completion score chart showing SOTA passing in three turns after three minutes, Fast 30% passing in nine turns after 54 seconds, and Fast 15% passing in 19 turns after 114 seconds.
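
The chart figures can be reproduced from the formula with a short calculation. This sketch assumes a pass threshold of 0.95 and about 6 seconds per fast turn; neither value is stated on the slide, but both are consistent with the turn counts and times in the chart.

```python
import math

PASS_TARGET = 0.95          # assumption: score needed to count as passing
FAST_SECONDS_PER_TURN = 6   # assumption: 54s / 9 turns from the chart


def score_after(r: float, n: int) -> float:
    """score = 1 - (1 - r)^n"""
    return 1 - (1 - r) ** n


def turns_to_reach(r: float, target: float = PASS_TARGET) -> int:
    """Smallest n with 1 - (1 - r)^n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - r))


for r in (0.30, 0.15):
    n = turns_to_reach(r)
    print(f"r={r:.2f}: {n} turns, {n * FAST_SECONDS_PER_TURN}s, "
          f"score {score_after(r, n):.3f}")
# r=0.30: 9 turns, 54s, score 0.960
# r=0.15: 19 turns, 114s, score 0.954
```

Even at a 15% improvement rate, 19 fast turns finish in under two minutes, which is still ahead of the three minutes shown for the SOTA model's three turns.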

Fastest loop wins

GPT-5.4:
one-shots more often, ~88s mean

Mercury 2:
requires feedback to pass, ~6s mean

Loop speed matters

Model	Class	Runs Passed	One-shot	Median Iterations	Mean Time To Pass	P90	Best Failed Score
Mercury 2	H/D	10/10	0	2	6s	7s	n/a
GPT-5.4 mini	H/AR	10/10	7	1	41s	65s	n/a
GPT-5.4	H/AR	10/10	6	1	88s	101s	n/a
GPT-4.1 mini	H/AR	10/10	0	2	54s	66s	n/a
Gemini 2.5 Flash	H/AR	8/10	0	4	77s	589s	0.947
Gemma 4 31B cloud	H/AR	10/10	0	5	133s	209s	n/a
Qwen3 Coder 480B cloud	H/AR	7/10	0	5	611s	1154s	0.957
Orthrus Qwen3 8B MLX	L/D	0/5	0	n/a	n/a	159s	0.938
Phi3	L/AR	0/5	0	n/a	n/a	355s	0.125

Class (H=Hosted, L=Local, AR=Autoregressive, D=Diffusion)

Going beyond the headline, this table shows a summary of the traces I recorded across a selection of models. Some of these were hosted and some local, some diffusion and most autoregressive.

Speed aside, what you can also observe is that there is a competence threshold the model has to exceed to even be able to do the task. The model has to be competent enough that it can eventually finish the task with feedback.

You can see that a couple of the models never reach a passing solution in any of their runs, even with 15 turns each. Ultimately they are not competent enough to solve this scenario.

The two highlights for me are:

Mercury 2 completed 10 successful runs (about 6s each, so roughly a minute in total) in less than the ~88s GPT-5.4 took on average to do one.

But GPT-5.4 Mini got the same result in under half the time of its bigger sibling (41s against 88s) because it's a tuned model and is more efficient at inference.

As I said a moment ago, loop speed changes the performance economics and these results change where the engineering work sits. Model intelligence does still matter, but the speed and quality of the loop starts to matter just as much.

So what's the implication of this from an engineering standpoint?

Validation is the product

Fast loops require strong guidance

Dense feedback
Cheap error detection
Fast feedback process

How do I build good validation?

We need dense feedback so the system doesn't thrash.
It has to be cheap so the system can detect an error easily.
And it's got to be fast so that the feedback process can happen quickly.

The quality of our validation really matters for autonomous systems.

This became very evident as I set up the benchmark. Bad errors cause stalls. Feedback gaps inhibit progression. Weak validations cause the models to thrash. You need to have a level of quality or the system can't work autonomously.

As a pattern, this is more or less test-driven development.

Test-driven development in this scenario is less about just building a confidence tool for developers. It's now also a control system so agents can go away and build something, or rebuild it, or build five versions of it at the same time or optimise it.

It changes what we can use this for.

Some problems still need judgement

Fast iteration when:

Dense feedback
Measurable correctness
Fast, cheap feedback

Frontier judgement when:

Sparse feedback
Subjective evaluation
Ambiguous validation

Now, based on the results I showed, you might be thinking you need to chuck out your Claude or ChatGPT subscriptions and go all in on diffusion models.

So I would be remiss if I didn't call these things out.

Diffusion is not inherently better than an autoregressive model. In terms of absolute frontier, it's not. We're talking about a model that's about equivalent to Gemini 2.5, but an extremely high speed Gemini 2.5.

The frontier still matters for a lot of tasks, and raw intelligence is important. A model has to attain a certain level of competence before it can be used for meaningful work, especially if you want it to work autonomously.

Autonomous systems work best when you've got dense feedback, measurable correctness, and cheap, fast validation.

They break down when you have sparse feedback, subjective evaluation, or slow or ambiguous validation criteria. In that scenario you still want to have one really good attempt, try to validate against that, and provide feedback to refine it.

Own the loop, not just the model

Explore alternative architectures
Optimise system and validation speed
Consider stronger harnesses

Additional resources: https://ajfisher.me/aieng26/

Andrew Fisher
VP Digital Science, Tetratherix andrewfisher
@ajfisher.social
@ajfisher

This talk was developed on the traditional lands of the Bunurong people, Victoria.

All images produced using ChatGPT Images 2.0.