Fail fast, fix faster: Why faster models can beat smarter ones
AI Engineer Melbourne, 2026
AJ Fisher
Does smart mean slow?
Good enough, but fast
Fast-loop theory
Benchmark task
Spec and rules
- OpenAPI contract
- Quote over $1000 = requires approval
- Line or order discount > 10% = requires approval
- Approval once only
- API best practices
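The approval rules above can be sketched as a small piece of domain logic. This is only one reading of the slide: it assumes "approve" means "requires approval", that thresholds are strict (exactly $1000 or exactly 10% does not trigger), and that "once only" means a second approval is rejected. The `Quote` fields and the function names are hypothetical, not taken from the talk's OpenAPI contract.

```python
from dataclasses import dataclass, field

# Thresholds from the slide; strict comparison is an assumption.
TOTAL_THRESHOLD = 1000.0
DISCOUNT_THRESHOLD = 0.10  # discounts held as fractions: 0.15 == 15%


@dataclass
class Quote:
    """Hypothetical quote model; field names are illustrative only."""
    total: float
    line_discounts: list[float] = field(default_factory=list)
    order_discount: float = 0.0
    approved: bool = False


def needs_approval(quote: Quote) -> bool:
    """True if the quote total or any line/order discount exceeds its threshold."""
    if quote.total > TOTAL_THRESHOLD:
        return True
    discounts = [*quote.line_discounts, quote.order_discount]
    return any(d > DISCOUNT_THRESHOLD for d in discounts)


def approve(quote: Quote) -> None:
    """Approve a quote, enforcing the 'approval once only' rule."""
    if not needs_approval(quote):
        raise ValueError("quote does not require approval")
    if quote.approved:
        raise ValueError("quote has already been approved")
    quote.approved = True
```

Edge cases like the exact-threshold values are the kind of thing the logic tests would be expected to probe.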
Validation
- 47 parallel tests, ~300ms to execute
- Specific failure feedback
- Logic tests (rules, edge cases)
- API shape tests (headers, enums)
- Robustness tests (responses)
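To illustrate what "specific failure feedback" might look like, here is a minimal sketch of a validator whose checks each return a precise message rather than a bare pass/fail. The response shape, the expected status code, and the enum values are all assumptions made for the example; the talk's actual test suite is not shown in the slides.

```python
from typing import Callable, Optional

# A check inspects a response and returns None on success,
# or a specific, actionable failure message.
Check = Callable[[dict], Optional[str]]

ALLOWED_STATES = {"draft", "pending_approval", "approved"}  # hypothetical enum


def check_status(resp: dict) -> Optional[str]:
    got = resp.get("status")
    return None if got == 201 else f"expected status 201, got {got}"


def check_content_type(resp: dict) -> Optional[str]:
    got = resp.get("headers", {}).get("content-type")
    if got == "application/json":
        return None
    return f"expected content-type application/json, got {got!r}"


def check_state_enum(resp: dict) -> Optional[str]:
    got = resp.get("body", {}).get("state")
    if got in ALLOWED_STATES:
        return None
    return f"state {got!r} is not one of {sorted(ALLOWED_STATES)}"


def validate(resp: dict, checks: list[Check]) -> list[str]:
    """Run every check and collect the failure messages (empty list == pass)."""
    return [msg for check in checks if (msg := check(resp)) is not None]
```

The returned messages are what would be handed back to the model on the next turn, so the more specific they are, the less the model has to guess.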
Agent harness
- Invoke models via API
- OpenAI, Gemini, Mercury, Ollama, other local model providers
- Run implementation - validate - inform next turn
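The run, validate, inform-next-turn cycle can be sketched as a short loop. This is a simplified illustration, not the harness used in the talk: `generate` stands in for any model API call, `validate` for the test suite, and the prompt format is invented for the example. The 15-turn default matches the limit described in the judging slide.

```python
from typing import Callable, NamedTuple, Optional


class Validation(NamedTuple):
    passed: bool
    feedback: str  # specific failures to show the model on the next turn


def run_loop(
    generate: Callable[[str], str],         # hypothetical: prompt -> implementation
    validate: Callable[[str], Validation],  # hypothetical: implementation -> result
    spec: str,
    max_turns: int = 15,
) -> Optional[int]:
    """Return the turn on which validation passed, or None if it never did."""
    prompt = spec
    for turn in range(1, max_turns + 1):
        candidate = generate(prompt)
        result = validate(candidate)
        if result.passed:
            return turn
        # Feed the previous attempt and its failures into the next turn.
        prompt = (
            f"{spec}\n\nPrevious attempt:\n{candidate}"
            f"\n\nFailing tests:\n{result.feedback}"
        )
    return None
```

With a fast model, the cost of each extra trip around this loop is small, which is the premise the rest of the talk tests.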
Judging the agent loop
- Each run allowed up to 15 turns
- Each turn scored PASS / PARTIAL / FAIL / THRASH
- Record detailed trace performance data per run
- 10 fresh runs and aggregate model results
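A rough sketch of how per-run records could be aggregated into the kind of per-model summary the talk reports (runs passed, one-shot count, median iterations, mean time to pass). The `Run` record and the exact metric definitions are assumptions; in particular, this version computes the median and mean over passing runs only.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Run:
    """Hypothetical per-run record."""
    passed: bool
    turns: int      # turns used (up to the turn limit)
    seconds: float  # wall-clock time for the whole run


def summarise(runs: list[Run]) -> dict[str, Optional[object]]:
    """Aggregate one model's runs into summary metrics."""
    passing = [r for r in runs if r.passed]
    return {
        "runs_passed": f"{len(passing)}/{len(runs)}",
        # A one-shot is a run that passed on its first turn.
        "one_shot": sum(1 for r in passing if r.turns == 1),
        "median_iterations": median([r.turns for r in passing]) if passing else None,
        "mean_time_to_pass": mean([r.seconds for r in passing]) if passing else None,
    }
```

Reporting `None` when no run passed mirrors the "n/a" entries a results table would need for models that never reach a full pass.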
Latency economics
Compounding progress
$\text{score} = 1 - (1 - r)^n$
r: improvement against remaining error
n: completed iterations
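As a worked illustration of the compounding formula, reading it as score = 1 - (1 - r)^n, the sketch below computes the score after n iterations and the number of iterations needed to reach a target. The values of r and the target are made up for the example and are not measurements from the benchmark.

```python
import math


def score(r: float, n: int) -> float:
    """Score after n iterations when each one removes a fraction r of the remaining error."""
    return 1 - (1 - r) ** n


def iterations_to_reach(r: float, target: float) -> int:
    """Smallest n with score(r, n) >= target, for 0 < r < 1 and 0 < target < 1."""
    return math.ceil(math.log(1 - target) / math.log(1 - r))


# Illustrative values only: a modest 30% gain per iteration.
for n in (1, 3, 5, 9):
    print(n, round(score(0.3, n), 4))
```

Under these assumed numbers, a model that closes only 30% of the remaining error per turn still passes 0.95 by its ninth iteration, so whether that beats a stronger single attempt depends on how long each iteration takes.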
Small gains accumulate
Fastest loop wins
GPT-5.4:
one-shots more often, ~88s mean
Mercury 2:
requires feedback to pass, ~6s mean
Loop speed matters
| Model | Class | Runs Passed | One-shot | Median Iterations | Mean Time To Pass | P90 | Best Failed Score |
|---|---|---|---|---|---|---|---|
| Mercury 2 | H/D | 10/10 | 0 | 2 | 6s | 7s | n/a |
| GPT-5.4 mini | H/AR | 10/10 | 7 | 1 | 41s | 65s | n/a |
| GPT-5.4 | H/AR | 10/10 | 6 | 1 | 88s | 101s | n/a |
| GPT-4.1 mini | H/AR | 10/10 | 0 | 2 | 54s | 66s | n/a |
| Gemini 2.5 Flash | H/AR | 8/10 | 0 | 4 | 77s | 589s | 0.947 |
| Gemma 4 31B cloud | H/AR | 10/10 | 0 | 5 | 133s | 209s | n/a |
| Qwen3 Coder 480B cloud | H/AR | 7/10 | 0 | 5 | 611s | 1154s | 0.957 |
| Orthrus Qwen3 8B MLX | L/D | 0/5 | 0 | n/a | n/a | 159s | 0.938 |
| Phi3 | L/AR | 0/5 | 0 | n/a | n/a | 355s | 0.125 |
Class (H=Hosted, L=Local, AR=Autoregressive, D=Diffusion)
Validation is the product
Fast loops require strong guidance
- Dense feedback
- Cheap error detection
- Fast feedback process
Some problems still need judgement
Fast iteration when:
- Dense feedback
- Measurable correctness
- Fast, cheap feedback
Frontier judgement when:
- Sparse feedback
- Subjective evaluation
- Ambiguous validation
Own the loop, not just the model
- Explore alternative architectures
- Optimise system and validation speed
- Consider stronger harnesses
Additional resources: https://ajfisher.me/aieng26/
Andrew Fisher
VP Digital Science, Tetratherix
andrewfisher
@ajfisher.social
@ajfisher
This talk was developed on the traditional lands of the Bunurong people, Victoria.
All images produced using ChatGPT Images 2.0.