The Coherence Problem in AI-Native Software Development

AI has solved code generation; the binding constraint has moved to coherence. We recommend adopting a keel-and-drift operating model and reassigning senior engineering judgment from code review to mechanism design.

Executive Summary

AI has solved code generation: agents now produce pull requests in parallel faster than any human team. But software complexity grows combinatorially, not linearly, so cheap generation buys a faster-growing, faster-diverging codebase — and the scarce, expensive work has moved downstream from writing code to keeping it coherent. The prevailing team models, the pirate/architect and the two-slice team, were built around the now-cheap act of generation and leave coherence unowned. We recommend adopting a keel-and-drift operating model: lay a small structural foundation — the keel — in a short drydock before feature work, run a continuous drift-correction loop from day one, and reassign the most experienced engineers from reviewing individual agent diffs to designing those mechanisms. The reasoning is that the return on structure has risen super-linearly while its cost has not, and that better models worsen the problem rather than absolving it; a two-project natural experiment and convergent industry evidence corroborate it. The principal uncertainty is twofold — the team's direct evidence is two projects with a domain confound, and a credible hard-skeptic position holds that agent output is fundamentally and undetectably broken, which if correct shifts emphasis from channeling generation toward gating it but leaves the mechanism-design prescription intact.

1 Generation is solved; the cost of software moved downstream

AI has made code generation cheap and abundant; the scarce, expensive work has moved downstream to verification and coherence.

AI coding agents now spin up subagents that produce pull requests in parallel, faster than any human team. One side project built largely by vibe coding reached roughly 1,600 commits, more than 600 pull requests, and about 140,000 lines of code,¹ far more than one developer would plausibly write by hand on a side project. Producing code is no longer the scarce activity.

The cost of software did not vanish; it moved. Three signals from separate sources locate it downstream, in review, reconciliation, and keeping a growing mass coherent (Exhibit 1, first three rows; the fourth row records the generation volume cited above). The claim is deliberately narrow: code production, meaning the generation of large volumes of plausible, usually-runnable code, is cheap. That is not the stronger and genuinely contested claim that agents are good engineers; that contest is taken up directly in Section 7.

Exhibit 1 — Evidence that the cost of software moved downstream

Signal	Finding	Source
PR rejection rate	67.3% of AI-generated PRs rejected, against 15.6% for manually authored PRs	LinearB²
Bug rate vs. adoption	9% increase in bug rates correlated with a 90% increase in AI adoption	Google DORA 2025³
Measured vs. felt speed	Developers 19% slower with AI tools while believing they were 20% faster	METR⁴
Vibe-coded project	~1,600 commits, 600+ PRs, ~140k LOC for a single side project	First-person account¹

2 Complexity is combinatorial, so AI amplifies divergence, not merely speed

Because software complexity grows combinatorially with code volume, AI functions as a complexity amplifier — multiplying divergence and incoherence, not merely throughput.

Software complexity does not scale linearly with code volume. Interactions between functions scale roughly with the square of the function count; the reachable state space scales roughly exponentially in the number of interacting variables. The load-bearing consequence: doubling code volume more than doubles complexity.
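The arithmetic behind that consequence can be sketched in a few lines. This is an illustration only: it assumes, as simplifications not stated above, that every pair of functions can interact and that each interacting variable is binary. The function names are hypothetical.

```python
from math import comb


def pairwise_interactions(n_functions: int) -> int:
    """Number of distinct function pairs: n * (n - 1) / 2."""
    return comb(n_functions, 2)


def reachable_states(n_binary_variables: int) -> int:
    """Upper bound on states for n interacting binary variables: 2 ** n."""
    return 2 ** n_binary_variables


if __name__ == "__main__":
    # Doubling the function count roughly quadruples the potential pairs.
    for n in (100, 200, 400):
        print(n, pairwise_interactions(n))
    # Doubling the variable count squares the state-space bound.
    print(reachable_states(10), reachable_states(20))
```

Under these assumptions, going from 100 to 200 functions raises the potential pairwise interactions from 4,950 to 19,900, roughly a fourfold increase for a twofold increase in volume. Real codebases are sparser than this worst case, which is the property the keel is meant to enforce.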

If AI amplified only speed, teams would ship faster and need no new roles. AI instead amplifies volume and divergence: agents make locally sensible but globally inconsistent decisions. One module authenticates with pattern A, another with pattern B, a third introduces a conflicting dependency. Volume multiplied by combinatorial complexity multiplied by divergence yields super-linear growth in incoherence; the elevated defect rates in Exhibit 2 are consistent with this, though they measure defects per pull request rather than incoherence directly. An amplifier has no sign of its own: it multiplies whatever it is fed. Pointed at a generation process with no ordering discipline, it amplifies entropy; pointed at quality-engineering judgment, it amplifies coherence.⁵ The amplifier does not choose; the mechanism designer does.

The specific failure has a name: drift — the gradual divergence of an AI-amplified codebase from a coherent structure. It takes two forms: architectural drift, in which modules wander from their boundaries and competing implementations of one concern accumulate; and dependency drift, in which conflicting libraries, version sprawl, and dead code accrete. Drift is invisible to linting, which is syntactic, and to per-PR review, which is local. It is a global, accumulating property — and it is independently named as unsolved at scale by industry landscape analysis.
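One form of dependency drift, version sprawl, lends itself to a whole-repo check that no single pull request review would perform. The sketch below is a minimal illustration, not a description of any existing tool. It assumes each module's pinned dependencies have already been collected into a dictionary; the function name, module names, and package names are all hypothetical.

```python
from collections import defaultdict


def find_version_sprawl(
    pins_by_module: dict[str, dict[str, str]],
) -> dict[str, dict[str, list[str]]]:
    """Return packages pinned to more than one version across modules.

    pins_by_module maps a module name to {package: pinned_version}.
    The result maps each conflicting package to {version: [modules using it]}.
    """
    usage = defaultdict(lambda: defaultdict(list))
    for module, pins in pins_by_module.items():
        for package, version in pins.items():
            usage[package][version].append(module)
    # Keep only packages that appear with two or more distinct versions.
    return {
        package: {version: sorted(mods) for version, mods in versions.items()}
        for package, versions in usage.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    # Hypothetical pins for three modules.
    pins = {
        "auth": {"libhttp": "2.1.0", "libjson": "1.4.0"},
        "billing": {"libhttp": "1.9.3", "libjson": "1.4.0"},
        "notifications": {"libhttp": "2.1.0"},
    }
    print(find_version_sprawl(pins))
```

Each module's pins look reasonable in isolation; the conflict only appears when the repository is examined as a whole, which is why the check belongs in a global audit rather than in per-PR review.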

Exhibit 2 — AI-co-authored pull requests carry more defects in every category

Defect category	Issue rate, AI-co-authored vs. human-only PRs
All issues	~1.7×
Readability	~3×
Security	up to 2.74×
Formatting	~2.66×
Error handling	~2×
Logic / correctness	1.75× (+75%)

Analysis of 470 open-source pull requests. Source: CodeRabbit, December 2025.⁶ CodeRabbit sells a code-review tool; read the figures with that interest in mind.

3 The prevailing team models optimize the problem that is already solved

The pirate/architect and two-slice team models center the now-cheap generation function and leave the now-expensive work of coherence unowned.

The dominant team models of the moment were built around the now-cheap act of generation. The pirate/architect model pairs a fast "pirate" who ships by vibe coding with an "architect" who hardens the product surface after product-market fit, often part-time. The two-slice team places one person per product, with roughly 99% of code AI-written.⁷ Both capture something real — exploration and hardening are different work at different paces — but both amplify the generator, the now-cheap function, and treat structure as a late, thin, post-PMF concern.

These models resource the era that is ending. The tell is that the framing's own boosters now concede, in the May 2026 discourse, that complexity management matters — while the model itself assigns complexity no role and no owner.

4 Two mechanisms govern coherence: the keel and the drift-loop

Coherence is governed by two mechanisms — a small structural keel that bounds complexity by design, and a continuous drift-loop that reverses the divergence the keel does not prevent.

A team that ships reliable product resources both prevention and correction. Together they convert a global combinatorial explosion into a sum of small, contained ones.

Prevention — the keel

The keel is a small structural foundation: loose coupling, single responsibility, a clean dependency graph, and a "super-core" of decisions costly to reverse even in the AI era, namely database schema and core dependency topology. Loose coupling makes the interaction matrix sparse; module boundaries partition one global state space into small local ones. The keel does not eliminate complexity; it bounds and localizes it. It is laid in a drydock, a short and deliberate phase of days to roughly two weeks before feature work, because re-laying it at sea is far more expensive (the twin project in Exhibit 5 is the team's example). This is not Big Design Up Front: BDUF designs the discoverable; the keel designs only the knowable-in-advance and irreversible.
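The effect of partitioning can be made concrete with a small calculation. This is an idealized sketch: it assumes binary variables and fully isolated modules, neither of which is claimed above, and the function names are hypothetical. Real modules communicate through interfaces, so the true figure lies between the two bounds.

```python
def global_state_space(variables_per_module: list[int]) -> int:
    """States if every binary variable can interact with every other: 2 ** total."""
    return 2 ** sum(variables_per_module)


def partitioned_state_space(variables_per_module: list[int]) -> int:
    """States to reason about if modules are isolated: sum of 2 ** n_i."""
    return sum(2 ** n for n in variables_per_module)


if __name__ == "__main__":
    modules = [10, 10, 10, 10]  # four modules, ten binary variables each
    print(global_state_space(modules))       # one explosion over 40 variables
    print(partitioned_state_space(modules))  # four small ones, added together
```

For four modules of ten variables each, the unpartitioned bound is 2^40 (about 1.1 trillion states) while the partitioned bound is 4 × 2^10 = 4,096. The boundaries do not remove any variables; they change a product into a sum.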

Correction — the drift-loop

The drift-loop is a continuous mechanism — periodic refactor passes, a persistent architectural-invariant monitor, a whole-repo tech-debt audit — that detects and reverses the drift prevention misses. In entropy terms, the generator is a source; the keel and the drift-loop are sinks. A team functions when sources and sinks balance (Exhibit 3). The pirate/architect model is almost all source.

Exhibit 3 — Four roles and the entropy source/sink model

Role	Function	Who fills it	Entropy role
Operator ("pirate")	Generate features	Largely AI today	Source
Keel / shipwright	Design the structural foundation	Human-designed; AI assists	Sink
Drift-loop / mechanic	Detect and reverse drift	Human-designed, AI-executed	Sink
Navigator	Decide what to build	Mostly human (out of scope)	—

Defense against drift has three layers (Exhibit 4). Most current discourse and tooling addresses only the first.

Exhibit 4 — Three layers of defense against drift

Layer	Mechanism	Coverage
1 — Instruction	CLAUDE.md / agents.md / rules / harness	Thin; catches some divergence, never all
2 — Structural	The keel	Bounds divergence by construction
3 — Corrective	The drift-loop	Removes what slips through layers 1 and 2

5 The human's role moves from writing code to designing mechanisms

AI increasingly fills both generation and maintenance, so the irreducible human contribution is no longer writing or maintaining code but designing the mechanisms within which AI does both.

AI can fill the operator seat — generation — and increasingly the mechanic seat: a tech-debt audit is a maintenance routine an agent executes. What AI cannot yet reliably do is design the keel and design the maintenance mechanism — decide module boundaries, choose the super-core, define what drift means for this codebase, specify what the audit checks and what it is permitted to change. The human's seat is the meta-layer: not coder, not maintainer-by-hand, but mechanism designer. Literacy shifts from reading code to designing architecture and loops, and the most experienced engineer's attention concentrates on the keel — the place where being wrong is most expensive.

The team's own evidence is a two-project natural experiment (Exhibit 5). Team, tooling era, and approach were the same across both projects; the variable of interest, the ordering of foundation work, differed, and the outcome moved with it. The domains also differed, so the evidence is suggestive, not conclusive, and is corroborated, not carried, by the external industry evidence in Exhibits 1 and 2.

Exhibit 5 — The mirai / twin natural experiment

Dimension	mirai	twin
Domain	Prediction-market application	Voice / audio chat with device I/O
Foundation timing	Built first, before features	Built late, after a partway launch
Senior engineer entry	At the foundation	After launch, to maintain
Structural quality	Clearly superior on coupling, single responsibility, dependency hygiene	Weaker; structural refactoring proved hard
Observed outcome	Smooth feature build, low rework	Slower coding and bug-fixing

Natural, not controlled (n = 2): domain difficulty is a confound and no metrics were recorded. Offered as a story plus a mechanism, not as proof.⁸

6 The problem compounds as models improve

Better models amplify volume and divergence faster, so the keel and the drift-loop appreciate in value as AI improves rather than becoming obsolete.

The task length AI can reliably handle has been doubling roughly every seven months.⁴ Better models mean more amplification — more volume and more divergence per unit time. The naive view, that better models will fix the mess, is backwards: better models make the coherence problem larger. The keel and the drift-loop therefore appreciate in value over time. They are not transitional scaffolding.
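To see what a seven-month doubling period implies over planning horizons, the compounding can be worked out directly. This sketch assumes the trend simply continues at a constant rate, which is an extrapolation and not something the cited estimate guarantees. The function name is hypothetical.

```python
def growth_factor(months: float, doubling_period_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    for horizon in (7, 12, 24):
        print(horizon, round(growth_factor(horizon), 2))
```

Under that assumption, the reliably handled task length grows by about 3.3× in one year and about 10.8× in two. A structure that is adequate for today's generation volume would face roughly an order of magnitude more within a typical product lifetime.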

7 Risks and counterarguments

The thesis survives its strongest objections — most reduce to terminological disputes or are answered by the same mechanism-design prescription they appear to threaten — but two limitations genuinely bound the claim.

Big Design Up Front, repackaged. The keel is deliberately small — module boundaries, the dependency graph, schema, and the tooling layer, the structure that is knowable in advance regardless of what is built and costly to reverse. BDUF failed because it tried to design the discoverable; the keel designs only the knowable and irreversible, and the reversible middle stays loose and explored.
Thin direct evidence. The mirai/twin comparison is two projects, different domains, no recorded metrics. This is a real limitation, conceded plainly: the comparison is corroborated — not carried — by independent industry evidence, and the structural gap the team observed is largely orthogonal to which domain is intrinsically harder.
Better models will retrofit cheaply. This is backwards as a strategy: better models amplify the generator at least as fast as they improve retrofit capability, so volume and drift grow at least as fast. The asymmetry is also structural — a schema migration against live data carries data-migration, coordination, and downtime cost independent of who does the labor.
Generation is not solved; agents cannot program. The hard-skeptic position holds that agent output is fundamentally and undetectably broken.⁹ Much of the disagreement is terminological: the precise claim is that code production is cheap, not that agents are good engineers, and "the output is broken downstream" is itself the coherence thesis. The skeptic's own organizational argument, that variance rises as the best improve while the average falls, is an argument for encoding error-correction in mechanisms rather than assuming it in every operator.
Humans will stop reading code entirely. Granted for the strong part: humans can and will stop reading code line by line. The disputed inference is that structure therefore needs no human. A codebase, unlike a finished game, is maintained indefinitely; incomprehensible-but-strong output is fine for a one-shot artifact and fatal for a system extended for years. Mechanism-literacy replaces code-literacy; it does not vanish.¹⁰
"Architecture matters" is decades old. The claim is three newer things: a quantitative shift, since AI amplifies volume and divergence so the return on structure has risen super-linearly; a role shift, from author to mechanism designer; and a positive inversion, in which structure becomes the enabler of parallel agentic work rather than a hedge against decay.

Recommendation

Adopt the keel-and-drift operating model: lay a small structural keel in a drydock before feature work, stand up a continuous drift-loop from day one, and reassign the most experienced engineers from diff review to mechanism design.

Complexity scales super-linearly with code volume, so the return on structure — and the cost of skipping it — has risen super-linearly, and better models worsen the problem rather than absolving it. The prevailing pirate/architect and two-slice models resource the solved problem and leave coherence unowned. Concretely: run a drydock of days to roughly two weeks to fix the super-core — schema, dependency topology, module boundaries, and the deterministic tooling layer of formatter, linter, CI with tests, and dead-code detection; route individual agent PRs to automated review rather than to scarce senior attention; and treat structure as the enabler of parallel agents on separate worktrees, not a tax on them. Revisit this recommendation if AI becomes reliably able to perform large structural refactors safely on demand — collapsing the prevention/correction distinction — or to design the keel and the maintenance mechanism without human judgment. Until then, the mechanism-designer seat is the irreducible human contribution.

¹ Vibe-coded side-project figures from a first-person account of an outage and rebuild: Dan Shipper, "When Your Vibe Coded App Goes Viral—And Then Goes Down," Every, March 2026. A single anecdote.
² LinearB PR-rejection figure cited via the "Next Wave of AI Coding" landscape analysis (working document, March 2026); a secondary citation that should be traced to the LinearB primary before publication.
³ Google DORA Report 2025, cited via the same landscape analysis. A correlation, not a causal finding; secondary citation.
⁴ METR, cited via the same landscape analysis: the developer-speed study and the task-length doubling estimate. Figures should be verified against METR primary sources.
⁵ "AI is an amplifier" framing reached independently in the May 2026 practitioner discourse, reasoning from the ISO/IEC 25010 software product-quality model. Positions paraphrased from translated posts.
⁶ CodeRabbit, "State of AI vs Human Code Generation Report," December 2025; analysis of 470 open-source pull requests. CodeRabbit is a vendor of a code-review tool.
⁷ Pirate/architect model: Dan Shipper, X, 2026, with Japanese-language amplification by Kenn Ejima. "Two-slice team": Dan Shipper, "The Two-slice Team," Every, February 2026. Argued positions, not data.
⁸ The mirai and twin projects are the team's primary, first-hand experience: a two-project natural experiment with a domain confound and no recorded metrics.
⁹ George Hotz, "The Eternal Sloptember," May 2026: the hard-skeptic position that AI agents cannot program. A single strong opinion, polemical by the author's own framing.
¹⁰ Lee Robinson, X, May 2026 (~393K views): argues AI-generated code becomes a liability and for more, not less, systems thinking; partially dissents on the strong "stop reading code" claim. A single practitioner's argued position.