AI Coding Research / February 2026
Frontier AI models that ace coding benchmarks fail dramatically at real software evolution tasks. GPT-5 drops from 65% to 21% when moving from isolated fixes to long-horizon multi-file changes.
A new benchmark testing long-horizon software evolution tasks — the kind of work that actually matters in production codebases.
Tasks
48
Real evolution scenarios from 7 mature Python projects
Avg Files
21
Files modified per task — true multi-file reasoning
Tests
874
Per task, validating that existing functionality is preserved
SWE-Bench Verified (isolated) vs SWE-EVO (long-horizon)
Same models, different task scope. The gap between isolated fixes and sustained multi-file evolution is enormous.
Source: SWE-EVO Paper, Vals.ai SWE-Bench
SWE-EVO vs SWE-Bench Verified — the gap shows capability lost on long-horizon tasks.
The same model performs very differently depending on which harness executes it.
Biggest Swing
5x
GPT-4.1: 2.08% → 10.42% with SWE-Agent
Exception
-20%
DeepSeek-R1 performs worse with SWE-Agent
What's a harness? The agent framework executing LLM actions. SWE-Agent uses CLI-based shell access; OpenHands runs in isolated containers. Your harness architecture matters as much as model selection.
Source: SWE-EVO Paper (Table 2)
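To make the harness idea concrete, here is a minimal sketch of a CLI-style agent loop: the model proposes shell commands, the harness executes them, and the output is fed back as context. This is illustrative only, not the actual SWE-Agent or OpenHands implementation; `model_call` and the `DONE` convention are assumptions for the sketch.

```python
# Illustrative sketch of a CLI-style harness loop (hypothetical; not the
# real SWE-Agent or OpenHands code). The model proposes shell commands,
# the harness runs them, and the observation is appended to the transcript.
import subprocess

def run_tool(command: str) -> str:
    """Execute a shell command and capture its output, as a CLI harness would."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def agent_loop(model_call, task: str, max_steps: int = 10) -> str:
    """Feed tool output back to the model until it signals completion."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        action = model_call(transcript)  # model proposes the next shell command
        if action.startswith("DONE"):    # assumed stop convention for the sketch
            return transcript
        transcript += f"\n$ {action}\n{run_tool(action)}"
    return transcript
```

The design point the benchmark surfaces: everything outside `model_call` (tool interface, observation formatting, step budget) is harness architecture, and it alone can move a model's score several-fold.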
Important: Claude models were not included in SWE-EVO. Here's performance on related benchmarks:
Extrapolating from GPT-5's roughly 3x drop, Claude Opus 4.5 might land around 25-30% on SWE-EVO (an estimate, not a measured result).
A model scoring 65% on SWE-Bench Verified may achieve only 21% on real software evolution tasks.
AI excels at bounded, single-file changes. A 2-day refactor should become 15-20 well-scoped AI tasks, not one "please refactor this system" prompt.
The same model can swing from 2% to 10% based on execution framework. Your tooling architecture is as important as model selection.
Frontier models fail on instruction following (misinterpreting nuanced specs). Weaker models fail on tool use and syntax. Train your team accordingly.
The 21% vs 65% gap is fundamentally about context. Engineers who excel with AI tools have learned to manage context boundaries explicitly.
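Managing context boundaries explicitly can be as simple as scoping the prompt to the few files that matter instead of the whole repository. A minimal sketch, assuming a keyword-relevance heuristic (the function name and scoring scheme are illustrative, not from the paper):

```python
# Hypothetical sketch: rank repo files by task-keyword relevance and keep
# only the top few, so the model sees a bounded context instead of the
# entire codebase.
from pathlib import Path

def select_context(repo: Path, keywords: list[str], limit: int = 5) -> list[Path]:
    """Score each Python file by keyword mentions; return the top matches."""
    scored = []
    for path in repo.rglob("*.py"):
        text = path.read_text(errors="ignore")
        score = sum(text.count(k) for k in keywords)
        if score:  # drop files with no relevance signal at all
            scored.append((score, path))
    scored.sort(key=lambda item: -item[0])
    return [path for _, path in scored[:limit]]
```

Real tooling uses richer retrieval than keyword counts, but the discipline is the same: decide what the model may see before asking it to change anything.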