Fail fast, fix faster: Why faster models can beat smarter ones

AI Engineer Melbourne, 2026
AJ Fisher

Does smart mean slow?

Good enough, but fast

Fast-loop theory

Benchmark task

Spec and rules

  • OpenAPI contract
  • Quote over $1000 = requires approval
  • Line / order discount > 10% = requires approval
  • Approval once only
  • API best practices
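
The two approval thresholds above can be sketched as a single predicate. This is a minimal sketch, not the benchmark's reference implementation. It assumes that "approve" means the quote must go through an approval step, that both thresholds are strict (exactly $1000 or exactly 10% does not trigger), and that the $1000 test applies to the total after discounts. The names Quote, Line and needs_approval are hypothetical and do not come from the OpenAPI contract.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Line:
    unit_price: Decimal
    quantity: int
    discount_pct: Decimal = Decimal("0")  # line-level discount, in percent


@dataclass
class Quote:
    lines: list[Line] = field(default_factory=list)
    order_discount_pct: Decimal = Decimal("0")  # order-level discount, in percent

    def total(self) -> Decimal:
        # Assumed pricing: apply line discounts first, then the order discount.
        subtotal = sum(
            (l.unit_price * l.quantity * (1 - l.discount_pct / 100) for l in self.lines),
            Decimal("0"),
        )
        return subtotal * (1 - self.order_discount_pct / 100)


def needs_approval(quote: Quote) -> bool:
    """True when any rule triggers (strict thresholds are an assumption)."""
    over_value = quote.total() > Decimal("1000")
    big_line_discount = any(l.discount_pct > 10 for l in quote.lines)
    big_order_discount = quote.order_discount_pct > 10
    return over_value or big_line_discount or big_order_discount
```

Boundary cases such as a total of exactly $1000 are the kind of edge case the logic tests would need to pin down, since the slide wording leaves them open.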

Validation

  • 47 parallel tests ~300ms exec
  • Specific failure feedback
  • Logic tests (rules, edge cases)
  • API shape tests (headers, enums)
  • Robustness tests (responses)
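
A validator that runs checks in parallel and returns specific failure messages might look like the sketch below. It is illustrative only: the response shape (a dict with status, headers and body), the three checks, the enum values, and the "fraction of checks passed" score are all assumptions, not the talk's actual 47-test suite or scoring.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# A check inspects a response and returns None on success,
# or a specific, actionable message on failure.
Check = Callable[[dict], Optional[str]]


def check_status(resp: dict) -> Optional[str]:
    if resp.get("status") != 201:
        return f"expected status 201, got {resp.get('status')}"
    return None


def check_content_type(resp: dict) -> Optional[str]:
    ctype = resp.get("headers", {}).get("content-type")
    if ctype != "application/json":
        return f"expected content-type application/json, got {ctype!r}"
    return None


def check_state_enum(resp: dict) -> Optional[str]:
    allowed = {"draft", "pending_approval", "approved"}  # hypothetical enum
    state = resp.get("body", {}).get("state")
    if state not in allowed:
        return f"state {state!r} not in {sorted(allowed)}"
    return None


def run_checks(resp: dict, checks: list[Check]) -> tuple[float, list[str]]:
    """Run all checks concurrently; return (fraction passed, failure messages)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda check: check(resp), checks))
    failures = [msg for msg in results if msg is not None]
    return 1 - len(failures) / len(checks), failures
```

The point of returning messages rather than a bare pass/fail is that each message can be passed straight back to the model as feedback for the next turn.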

Agent harness

  • Invoke models via API
  • OpenAI, Gemini, Mercury, Ollama, alternative local model providers
  • Run implementation - validate - inform next turn
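
The run, validate, inform cycle can be sketched as a small loop. This is a sketch under assumptions rather than the actual harness: generate and validate are hypothetical callables standing in for the model API call and the test suite, and only the most recent failures are fed into the next turn. The 15-turn default mirrors the turn cap used when judging runs.

```python
from typing import Callable

# generate(spec, feedback) -> candidate implementation (source text)
Generate = Callable[[str, list[str]], str]
# validate(candidate) -> (score in [0, 1], specific failure messages)
Validate = Callable[[str], tuple[float, list[str]]]


def agent_loop(spec: str, generate: Generate, validate: Validate,
               max_turns: int = 15) -> tuple[bool, int, float]:
    """Return (passed, turns used, best score seen)."""
    feedback: list[str] = []
    best = 0.0
    for turn in range(1, max_turns + 1):
        candidate = generate(spec, feedback)
        score, failures = validate(candidate)
        best = max(best, score)
        if not failures:
            return True, turn, best
        feedback = failures  # the latest failures inform the next turn
    return False, max_turns, best
```

Because the loop body is just one model call plus one validation pass, the wall-clock time per turn is dominated by model latency once validation is fast.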

Judging the agent loop

  • Run allowed up to 15 turns
  • Each turn attains PASS / PARTIAL / FAIL / THRASH
  • Record detailed trace performance data per run
  • 10 fresh runs and aggregate model results
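
Aggregating per-run traces into per-model results could be done along these lines. The Run record and the aggregate function are hypothetical, and two choices are assumptions: P90 is taken over the duration of all runs (passed or not) using the nearest-rank method, and the best failed score is the highest validation score among runs that never passed.

```python
import math
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Run:
    passed: bool
    turns: int         # turns used, up to the turn cap
    seconds: float     # wall-clock time for the whole run
    best_score: float  # best validation score seen during the run


def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(9 * len(ordered) / 10) - 1]


def aggregate(runs: list[Run]) -> dict:
    passed = [r for r in runs if r.passed]
    failed = [r for r in runs if not r.passed]
    return {
        "runs": len(runs),
        "passed": len(passed),
        "one_shot": sum(1 for r in passed if r.turns == 1),
        "median_iterations": median([r.turns for r in passed]) if passed else None,
        "mean_time_to_pass": mean([r.seconds for r in passed]) if passed else None,
        "p90_seconds": p90([r.seconds for r in runs]),
        "best_failed_score": max((r.best_score for r in failed), default=None),
    }
```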

Latency economics

Compounding progress

score = 1 - (1 - r)^n

r = improvement per iteration, as a fraction of the remaining error

n = number of completed iterations
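
The formula follows from each iteration leaving a fraction (1 - r) of the remaining error: after n iterations the remaining error is (1 - r) raised to the power n, and the score is one minus that. The sketch below evaluates it for a few illustrative values. It assumes r is constant across iterations, which real runs will only approximate; the function names are hypothetical.

```python
def score_after(r: float, n: int) -> float:
    """Score after n iterations, each removing a fraction r of the remaining error."""
    return 1 - (1 - r) ** n


def iterations_to_reach(target: float, r: float) -> int:
    """Smallest n whose score reaches target, for 0 < r <= 1 and 0 <= target < 1."""
    if not (0 < r <= 1 and 0 <= target < 1):
        raise ValueError("need 0 < r <= 1 and 0 <= target < 1")
    n = 0
    while score_after(r, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for r in (0.3, 0.5):
        n = iterations_to_reach(0.95, r)
        print(f"r={r}: {n} iterations to reach 0.95 (score {score_after(r, n):.3f})")
```

Under this model, a weaker model with a smaller r still reaches a given score; it needs more iterations, and whether that costs more wall-clock time depends on how long each iteration takes.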

Small gains accumulate

Fastest loop wins

GPT-5.4:
one-shots more often, ~88s mean

Mercury 2:
requires feedback to pass, ~6s mean

Loop speed matters

Model                   Class  Runs passed  One-shot  Median iterations  Mean time to pass  P90    Best failed score
Mercury 2               H/D    10/10        0         2                  6s                 7s     n/a
GPT-5.4 mini            H/AR   10/10        7         1                  41s                65s    n/a
GPT-5.4                 H/AR   10/10        6         1                  88s                101s   n/a
GPT-4.1 mini            H/AR   10/10        0         2                  54s                66s    n/a
Gemini 2.5 Flash        H/AR   8/10         0         4                  77s                589s   0.947
Gemma 4 31B cloud       H/AR   10/10        0         5                  133s               209s   n/a
Qwen3 Coder 480B cloud  H/AR   7/10         0         5                  611s               1154s  0.957
Orthrus Qwen3 8B MLX    L/D    0/5          0         n/a                n/a                159s   0.938
Phi3                    L/AR   0/5          0         n/a                n/a                355s   0.125

Class (H=Hosted, L=Local, AR=Autoregressive, D=Diffusion)
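
To make the gap concrete, the sketch below divides each model's mean time to pass by the fastest one. The figures are copied from the table above for the models that passed all 10 runs; the function name is hypothetical, and the ratio is simple arithmetic on the reported means, not a separate measurement.

```python
# Mean time to pass in seconds, for models that passed 10/10 runs.
mean_time_to_pass = {
    "Mercury 2": 6,
    "GPT-5.4 mini": 41,
    "GPT-4.1 mini": 54,
    "GPT-5.4": 88,
    "Gemma 4 31B cloud": 133,
}


def slowdown_vs_fastest(times: dict[str, float]) -> dict[str, float]:
    """Each model's mean time to pass as a multiple of the fastest model's."""
    fastest = min(times.values())
    return {model: round(t / fastest, 1) for model, t in times.items()}


if __name__ == "__main__":
    for model, ratio in slowdown_vs_fastest(mean_time_to_pass).items():
        print(f"{model}: {ratio}x")
```

On these numbers GPT-5.4 takes roughly 14.7 times as long as Mercury 2 to reach a passing implementation, even though it one-shots the task in 6 of 10 runs and Mercury 2 never does.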

Validation is the product

Fast loops require strong guidance

  • Dense feedback
  • Cheap error detection
  • Fast feedback process

Some problems still need judgement

Fast iteration when:

  • Dense feedback
  • Measurable correctness
  • Fast, cheap feedback

Frontier judgement when:

  • Sparse feedback
  • Subjective evaluation
  • Ambiguous validation

Own the loop, not just the model

  • Explore alternative architectures
  • Optimise system and validation speed
  • Consider stronger harnesses

Additional resources: https://ajfisher.me/aieng26/

Andrew Fisher
VP Digital Science, Tetratherix
andrewfisher
@ajfisher.social
@ajfisher

This talk was developed on the traditional lands of the Bunurong people, Victoria.

All images produced using ChatGPT Images 2.0.