
The Calculator and the Colleague

May 12, 2026 · 22 min read


There is a measurement problem at the heart of the AI industry, and it has the shape of a familiar law.

Goodhart's Law — when a measure becomes a target, it ceases to be a good measure — was originally a comment on monetary policy. It has since become one of the most widely cited principles in machine learning, and for good reason. When a leaderboard gain reflects success at gaming the test rather than a durable leap in capability, the metric has stopped doing the work it was meant to do. The diagnosis is now standard: benchmarks encourage overfitting, models train on benchmark data, scores inflate without true capability improvements, and the field's measure of progress quietly stops measuring progress.
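The dynamic is easy to reproduce in a toy model. The sketch below is illustrative only: it invents a "skill plus gaming" decomposition that no real benchmark exposes, and it uses selection from ever-larger candidate pools as a stand-in for rising optimization pressure. Nothing here describes any particular benchmark or lab.

```python
import random

random.seed(0)

def candidate():
    """One candidate model: durable skill plus benchmark-specific tricks."""
    skill = random.gauss(0, 1)    # true capability
    gaming = random.gauss(0, 1)   # test-specific overfit the metric can't see
    benchmark = skill + gaming    # the only number the leaderboard reports
    return skill, benchmark

# Stronger optimization pressure: pick the best benchmark score from a
# larger pool of candidates, repeated 200 times to average out noise.
for pool_size in (10, 100, 10_000):
    winners = [
        max((candidate() for _ in range(pool_size)), key=lambda c: c[1])
        for _ in range(200)
    ]
    avg_skill = sum(w[0] for w in winners) / len(winners)
    avg_bench = sum(w[1] for w in winners) / len(winners)
    print(f"pool={pool_size:>6}  benchmark={avg_bench:5.2f}  skill={avg_skill:5.2f}")
```

Under these assumptions the winning benchmark score grows roughly twice as fast as the skill it certifies, because the metric cannot tell which half of the variance it is rewarding. The harder you select on the score, the more of the score is gaming.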

This is well understood. What is less well understood — and what I want to argue here — is that the Goodhart problem in AI is not a quirk of evaluation methodology. It is a structural feature of the way the AI industry, the capital that funds it, and the customers who buy from it have collectively decided what intelligence means. And the structural version of the problem is much worse than the technical one. Because the technical problem can be fixed by better benchmarks. The structural problem produces the wrong kind of intelligence even when the benchmarks work.

I want to make a specific claim. The AI industry is currently optimizing very effectively for one kind of capability — the capability to produce confident, fluent, accurate answers to legible questions — and this capability is being treated, implicitly, as if it were the same thing as intelligence. It isn't. And the gap between the two is widening as the optimization pressure intensifies.

Before I make that case, I want to disarm the easiest version of the objection. This is not an argument against accuracy. Accuracy matters. Hallucinations are bugs and should be eliminated. Calibrated confidence is better than miscalibrated confidence. None of what follows should be read as a romantic defense of error. The argument is narrower and sharper: accuracy on the questions a system has been trained to answer is necessary, but it is not sufficient, and the field is increasingly treating it as terminal. Accuracy is a benchmark. Intelligence is something else entirely.

· · ·

01 / FOUNDATION

Four Kinds of Error

Most of the confusion in this conversation comes from a single elision. The word "error" is being asked to do four jobs, and they are not the same job.

The first kind is engineering error. The model says Paris is in Germany. The system retrieves the wrong document. The agent calls the wrong tool. These are bugs. They should be eliminated, and the industry's effort to do so is correct and important.

The second kind is calibration error. The model is confident when it should be uncertain, or uncertain when it should be confident. This is a tunable property of how systems are trained, and there is a serious research literature on it. Reducing calibration error is a clear good.
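Calibration error is the one kind of error with a fully legible formalization. A minimal sketch of expected calibration error (ECE), the standard summary statistic: bucket predictions by stated confidence and compare each bucket's average confidence to its empirical hit rate. The ten-bin scheme is the common convention; the data below is invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: confidence-weighted gap between stated confidence and hit rate."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident toy model: claims 90 percent, is right 60 percent of the time.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]))  # ~0.3
```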

The third kind is what I would call productive uncertainty. A system, having understood the question, refuses to commit prematurely. It holds multiple hypotheses. It says "I don't know, but here is what would change my mind." It tells you which of its outputs are load-bearing and which are decorative. Productive uncertainty is not a failure of the system; it is the system's most valuable property in any domain where the cost of confident-but-wrong exceeds the cost of late-but-right. Most domains that matter have this property.
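That trade-off is just an expected-cost inequality, and it is worth making explicit. The numbers below are invented; they estimate nothing real, only the shape of the comparison.

```python
# Invented illustrative costs. A confident wrong answer triggers expensive
# downstream action; deferring costs a delay and an escalation, nothing more.
p_correct = 0.85         # the model's actual hit rate on this class of question
cost_wrong = 50_000.0    # cost of acting on a confident wrong answer
cost_defer = 500.0       # cost of "I don't know yet, here's what would settle it"

expected_cost_commit = (1 - p_correct) * cost_wrong   # 7,500
expected_cost_defer = cost_defer                      # 500

# Committing only beats deferring once (1 - p) * cost_wrong < cost_defer.
break_even = 1 - cost_defer / cost_wrong
print(f"commit: {expected_cost_commit:,.0f}  defer: {expected_cost_defer:,.0f}")
print(f"hit rate needed before committing is rational: {break_even:.0%}")  # 99%
```

With a hundred-to-one cost asymmetry, deferring is the right call until the hit rate clears 99 percent. That is what "productive uncertainty" means in operational terms.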

The fourth kind is what I'll call generative wandering. This is the structural condition under which a system can produce a question it wasn't asked — the reframing, the analogy across domains, the recognition that the user is solving the wrong problem. It is the part of human cognition that produces the breakthrough in the shower, the insight on the hike, the connection between two unrelated experiences that turns out to matter. Popper noticed long ago that the growth of scientific knowledge follows the same principles as biological evolution — trial, error, and the productive abandonment of falsified theories; generative wandering is what that process looks like inside a single mind, on a single afternoon.

These four kinds of error are deeply different. Eliminating the first two is unambiguous progress. Eliminating the third and fourth is not progress — it is the destruction of the most valuable properties intelligence can have. And the structural problem with the current AI paradigm is that it cannot tell them apart. The optimization pressure that reduces engineering error and calibration error also reduces productive uncertainty and generative wandering, because all four register as the same thing on the metrics that get rewarded: the model didn't give the answer the test wanted.
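One way to see the conflation is to look at what a typical exact-match grader actually computes. In the deliberately simplified sketch below (the answer key and responses are invented), all four kinds of "error" collapse into the same zero:

```python
# A typical exact-match grader: the answer matches the key, or it doesn't.
def grade(response: str, answer_key: str) -> int:
    return int(response.strip().lower() == answer_key.strip().lower())

ANSWER_KEY = "42"
responses = {
    "engineering error":      "41",                                       # plain bug
    "calibration error":      "41, and I am 99 percent sure",             # confidently wrong
    "productive uncertainty": "42 if assumption A holds; otherwise unknown",
    "generative wandering":   "the question presumes X, and X is false",  # reframing
}

for kind, response in responses.items():
    print(f"{kind:<24} -> {grade(response, ANSWER_KEY)}")
# Every line prints 0. The grader records one failure mode where there are four.
```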

· · ·

02 / MECHANISM

The Structural Pressure

Why can't the field tell these apart? The honest answer is that the technical community can. Researchers working on uncertainty quantification, on RLHF for honest hedging, on calibration, on productive refusal — they understand the distinction perfectly. There is even a recent line of work arguing for principled limits on general-purpose AI optimization, on the grounds that under strong optimization pressure, AI systems push into predictable and irreversible failure modes.

The problem isn't research. The problem is what gets built and funded.

A model that hedges loses head-to-head deals against a model that commits. A system that says "I don't know, here's what would change my mind" gets rated lower on user satisfaction than a system that produces a fluent, confident answer — even when the confident answer is wrong. A product that occasionally refuses to act because the situation is genuinely ambiguous is described, in customer feedback and procurement reviews, as "unreliable." A founder pitching a system whose key feature is calibrated doubt has a measurably harder fundraise than a founder pitching a system whose key feature is "always answers."
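The head-to-head arithmetic is worth spelling out. The ratings below are invented; the only structural assumption is the one just stated, that users score confidence and fluency on the spot while correctness surfaces later, if at all.

```python
# Invented rating model: satisfaction is scored at answer time;
# correctness is discovered later and rarely flows back into the rating.
def expected_rating(p_correct, rating_right, rating_wrong):
    return p_correct * rating_right + (1 - p_correct) * rating_wrong

# Confident model: less accurate, always sounds sure.
confident = expected_rating(0.80, rating_right=4.8, rating_wrong=4.2)
# Hedging model: more accurate, but hedged answers read as "unreliable".
hedging = expected_rating(0.90, rating_right=3.9, rating_wrong=3.5)

print(f"confident: {confident:.2f}  hedging: {hedging:.2f}")
# confident: 4.68, hedging: 3.86. The less accurate system wins the evaluation.
```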

This isn't anyone's fault, exactly. It's how the market evaluates the technology. An entire industry now exists to optimize benchmark scores — consultants who specialize in gaming them, tooling built to climb them, services that do nothing but move the numbers. Careers are built on those numbers; companies are valued by them. Procurement teams write RFPs against the same evaluation frameworks. The optimization pressure is structural, and it points in one direction: toward systems that produce confident, fluent, accurate-on-the-test outputs.

Run that pressure forward and ask what the success case looks like.

You don't get a system that has eliminated engineering and calibration errors while preserving productive uncertainty and generative wandering. You can't. The training signal can't tell them apart. You get a system that has been optimized to produce the answer the evaluator wanted, every time, with high confidence — including in cases where the right answer was "the question you asked is the wrong question." The metric cannot represent that response. So the system learns not to give it.

This is the calculator. And the calculator is what the market is currently paying for.[^8]

· · ·

03 / EVIDENCE

What the Market Is Actually Asking For

The structural pressure I'm describing is not theoretical. It is happening on the page, in the procurement contracts and investment criteria that determine which AI products get built and funded in 2026.

Consider what enterprise buyers are now evaluating. By 2026, AI capability had become table stakes in enterprise procurement. Gartner's 2024 survey of procurement leaders found two-thirds of Chief Procurement Officers ranking AI investment as a top priority for the coming year, and Gartner projects AI-enabled features will move from 5 percent of supply-chain software adoption in 2025 to 60 percent by 2030.[^1] The vendors selling to those buyers respond in kind. A representative pitch from a procurement-AI vendor in 2026 promises high-accuracy, "hallucination-free" outputs aligned precisely to buyer requirements — language characteristic of the category.[^2] The phrase "hallucination-free" is doing a great deal of work in those pitches. It is not a promise of better calibration. It is a promise that the system will not visibly disagree with the buyer's framing of the question.

Now consider what the labs are racing toward. By 2026, the leading frontier models had pushed hallucination rates on grounded-summarization benchmarks down to 0.7 percent — a sharp drop from baselines around 21 percent measured on the same evaluations a few years earlier.[^3] This is the headline metric the field celebrates. It has been described as "a significant milestone for trustworthiness." And then, in the same period, the harder truth from domain-specific evaluations: independent studies of retrieval-augmented legal research products report hallucination rates of 17 to 33 percent, and medical evaluations report rates in the low double digits.[^4] The grounded-summarization number fell sharply. The real-world rate, in domains where the cost of being wrong is highest, barely moved. Independent estimates put enterprise losses from AI hallucinations in the tens of billions of dollars in 2024, even as the headline rates plummeted.

This is Goodhart's Law operating in real time — a field racing toward a metric that is improving rapidly, while the underlying property that metric was supposed to track is barely changing.

And yet the conversion from pilot to production isn't actually happening. Deloitte's 2026 enterprise AI survey of 3,235 leaders across 24 countries found that only 25 percent of organizations have moved 40 percent or more of their AI pilots into production. McKinsey's 2025 global survey of 1,993 firms reports 62 percent still in experimentation or piloting, with just 31 percent having scaled AI enterprise-wide and roughly 5 percent seeing real financial returns. The MIT NANDA initiative, examining hundreds of enterprise AI deployments through 2025, found that only about 5 percent of generative-AI pilots achieve measurable revenue acceleration.[^5] The pattern is bimodal: a thin slice of firms capture disproportionate value, and the rest stay stuck in pilot purgatory — paying for confidence and getting demonstrations.

This is the calculator problem at enterprise scale. The market is rewarding vendors that promise legible answers; the buyers are spending against that promise; and the underlying work — the part that requires holding the texture of a real operational problem — is not getting done. The metric goes up. The thing the metric was supposed to track barely moves.

The capital allocation reinforces the same direction. The 2026 venture playbook applies traditional-SaaS thresholds — gross margins of 70 to 80 percent (KeyBanc's 2025 SaaS Survey reports a median of 71 to 77 percent), LTV/CAC ratios of three or higher (Benchmarkit's 2024 median is 3.6 to 1), CAC payback periods that have stretched to roughly 20 months in 2024 — to AI-native companies whose economics don't yet support them. ICONIQ's 2025 State of AI report finds AI-native gross margins compressed near 52 percent, weighed down by per-query inference costs, with companies under 100 million dollars in ARR running free-cash-flow margins around negative 126 percent.[^6] The pilot-to-production gap is one half of the problem. The unit-economics gap is the other half. Both pressure founders to optimize for the same property: confident-sounding output that closes the deal and shows up on the metric, even when the underlying value isn't there yet.
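Running the cited medians through the standard payback formula shows how the margin gap alone moves the screen. The CAC and per-customer revenue below are invented for illustration; the margins are the KeyBanc and ICONIQ figures from the paragraph above.

```python
# Standard CAC-payback screen, run with the medians cited above.
# The CAC and per-customer monthly revenue are invented for illustration.
cac = 12_000.0              # cost to acquire one customer
monthly_revenue = 1_000.0   # revenue per customer per month

def cac_payback_months(cac, monthly_revenue, gross_margin):
    """Months of gross profit needed to recoup the acquisition cost."""
    return cac / (monthly_revenue * gross_margin)

saas_margin = 0.74        # midpoint of KeyBanc's 71-77 percent median range
ai_native_margin = 0.52   # ICONIQ's AI-native gross-margin figure

print(f"SaaS margins:      {cac_payback_months(cac, monthly_revenue, saas_margin):.1f} months")
print(f"AI-native margins: {cac_payback_months(cac, monthly_revenue, ai_native_margin):.1f} months")
# Same product, same price, same CAC: the inference-cost margin gap alone
# stretches payback from roughly 16 months to roughly 23.
```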

And the capital is concentrating, not dispersing. In the first quarter of 2026, four AI mega-rounds — OpenAI's 122-billion-dollar close, Anthropic's 30 billion, xAI's 20 billion, and Waymo's 16 billion — absorbed 188 billion dollars, or 65 percent of all global venture capital that quarter.[^7] When the majority of capital in the field flows to four entities racing on the same set of measurable metrics, the optimization pressure does not get more diverse. It gets more uniform. The benchmark wins. The legible product wins. The colleague — illegible, hard to evaluate, longer to monetize — does not get its own funding round.

None of this is a market failure in the conventional sense. It is the market doing exactly what markets do: pricing what it can measure, funding what it can underwrite, rewarding what fits the contract. The problem is that the property the field is rewarding is not the property that produces actually useful intelligence. The two look similar from a distance. They are not similar at all.

· · ·

04 / DISTINCTION

What the Calculator Cannot Do

A calculator is right or wrong. It is not useful in any other sense. You don't ask a calculator what to think. You don't ask it whether the question makes sense. You don't ask it to push back when you've framed the problem badly. It would not occur to you to expect any of these things, because a calculator's job is to convert an input into the corresponding output as fast and as accurately as possible.

A colleague is different. A good colleague is useful in ways that have almost nothing to do with computational accuracy. A good colleague helps you think. They push back when you're missing something. They say "I don't know, but here's what I'd look at." They remember your context across long horizons. They have taste. They have priors. They are sometimes wrong in interesting ways that help you find the right answer faster than they would have helped you by being correct.

A good colleague is not a faster calculator. A good colleague is a different kind of system — one that holds productive uncertainty, that wanders before converging, that tells you what it doesn't know and what would change its mind. The qualities that make a colleague valuable are exactly the qualities the current optimization pressure is selecting against.

I want to be careful here because this is where the argument can run off the rails. I am not claiming that current AI systems "can't" become colleagues in this sense. I don't know whether they can. The research community is working on it, and I have no special insight into how far that work will get. The claim is narrower: the dominant optimization pressure in the field is selecting for calculator-properties and against colleague-properties, and the gap is widening, and the market has no mechanism to correct for it. Even if a calibrated, hedging, sometimes-silent model would be more valuable in the long run — to its users, to its operators, to the actual work that gets done with it — that model loses, this quarter, to the confident competitor. So it doesn't get built. So the capital flows elsewhere. So the calculator gets sharper and the colleague stays unfunded.

· · ·

05 / COUPLING

Why the Calculator Wins Anyway

This is not a moral failure of capital. It is a structural property of how legibility and funding interact.

Markets price what they can measure. Procurement evaluates what it can compare. Investors underwrite what they can model. Calculator-properties are easy to measure (accuracy, latency, cost per token, benchmark scores). Colleague-properties are hard to measure (did the system push back at the right moment? did it surface the question that should have been asked? did it hold the right uncertainty for the right amount of time?). When one set of properties is legible and another is not, capital flows toward the legible one. This is true even when the illegible properties are more valuable, because the system for allocating capital cannot route around its own measurement constraints.

The result is a market that builds calculator after calculator, each one slightly more accurate than the last, while the colleague — the system that would actually move the world's capacity to think — goes unbuilt. Not because nobody can build it. Because the entity that could fund it cannot recognize it.

This is the real version of the Goodhart problem. It is not that the metrics are bad, though they often are. It is that the metrics are the only language the market speaks, and a property that cannot be expressed in that language might as well not exist for purposes of capital allocation. The calculator wins not because it is better but because it is legible.

· · ·

06 / RESOLUTION

How the Studio Model Actually Funds the Colleague

I am not writing this as an academic observation. I am writing it from inside the position of someone deciding what kind of company to build. And I want to end with the part most essays in this register skip — the part that says here is the answer, not just here is the problem.

If the argument is right, the work of building colleagues — systems that hedge, wander, refuse, surface the unasked question — has to get funded somehow. The market won't pay for it directly. It can't. The thing the market would need to underwrite is, by definition, the part it cannot measure.

But there is a way through, and it requires being honest about what the market will pay for.

The market will pay for calculators. It will pay for them quickly, in volume, at high gross margins, in verticals where a Stage 5 product collapses an information asymmetry that's been priced into the existing economy for decades. Those are real businesses. They generate real cash. They are fundable — by the same venture capital that this essay just argued cannot recognize colleague-quality work.

This is the studio model, and it is the answer.

A8C builds calculators in service of the colleague.

The studio ships products into specific verticals where the calculator solves a real problem and the unit economics work — independent gyms first, with ChangePlate. Physical therapy clinics next. Other service businesses after that. Each one is a calculator built well enough to be fundable, sold well enough to generate margin, operated well enough to compound.

But the calculators are not the point. The calculators fund the point.

The infrastructure that compounds across products — the agent architecture, the methodology for embedding peer-reviewed research into software, the patterns we learn about how human operators actually use AI in real businesses — is colleague work. It happens behind every product the studio ships, but it is not what we sell to the customer. The customer buys a calculator that solves their problem. The studio uses the margin from that sale to build the layer underneath that the market can't directly fund.

This is not how most AI companies are structured. Most are the calculator and only the calculator — they ship one product, raise capital against it, and use the capital to build more of the same. That model produces sharper calculators. It does not produce colleagues, and it cannot, because the capital cycle that funds it is the same cycle this essay just diagnosed as the problem.

The studio model is different. The studio is an instrument designed to produce calculator-revenue at the surface and colleague-capability at the core. The two are coupled, intentionally. The calculator is the part that gets sold; the colleague is the part that gets built. The first funds the second, on a timeline the second couldn't survive on its own.

I want to name this directly because it is the thing most companies are backing into right now — through layoffs, through panic AI-pivots, through "we need to be AI-native" announcements that arrived years too late. They got there by being forced. By the time they realized the colleague work mattered, they had already spent the capital that could have funded it on the very calculator they should have been building deliberately, from the start, as the funding mechanism.

A8C is built the other way around. The colleague work is the goal. The calculators we ship are the funding mechanism. ChangePlate is the first one. There will be others. Each one earns the studio the right to keep building the part the market can't see.

This is not a clever positioning move. It is the actual operating thesis. The studio is shaped this way because the alternative — building a single calculator and hoping the colleague work gets funded by future profits — is not how compounding works. The colleague has to be built while the calculator earns. The two have to be the same activity, expressed at different levels of abstraction. That's the whole point of a studio.

The interesting question of this decade is not whether the calculator wins. On the metrics the market uses, it probably will, and for a long time. The interesting question is who builds the other thing in the meantime — the system that hedges when hedging is honest, that wanders when wandering produces the next question, that refuses to commit when commitment would be the more confident form of being wrong, that holds the texture of judgment that the calculator is structurally incapable of representing.

Not because the calculator is bad. The calculator is fine.

Because the work of intelligence — the kind that actually moves what gets known, what gets built, and what becomes possible — happens in the part of the distribution the market cannot price. And someone has to fund the part the market cannot price, on a timeline the market does not understand, with a definition of value the market does not yet share.

Most companies will get there eventually, through some combination of layoffs and panic and forced re-architecture. We're choosing to start there.

The calculator gets built. It always was going to.

The colleague is the work.

· · ·

[^1]: Gartner, "Gartner Predicts Half of Procurement Contract Management Will Be AI-Enabled by 2027," May 2024, reporting that two-thirds of Chief Procurement Officers rank AI investment as a top priority for the next twelve months. Gartner additionally forecasts that AI-enabled features in supply-chain management software will rise from 5 percent of adoption in 2025 to 60 percent by 2030. https://www.gartner.com/en/newsroom/press-releases/2024-05-08-gartner-predicts-half-of-procurement-contract-management-will-be-ai-enabled-by-2027

[^2]: Language characteristic of enterprise procurement-AI vendor marketing in 2026, including phrasing used by Inventive AI and similar platforms.

[^3]: Vectara Hallucination Leaderboard, frontier-model evaluations on grounded summarization through 2025. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard

[^4]: Magesh, Surani, Dahl, Suzgun, Manning, and Ho, "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools," Stanford RegLab, May 2024, which found that leading legal-AI research products hallucinated on 17 to 33 percent of queries. https://reglab.stanford.edu/publications/hallucination-free-assessing-the-reliability-of-leading-ai-legal-research-tools/

[^5]: Pilot-to-production figures from three primary sources: Deloitte, "State of Generative AI in the Enterprise," 2026 edition (n=3,235 director-to-C-suite leaders across 24 countries; fielded August–September 2025), https://www.deloitte.com/us/en/about/press-room/state-of-ai-report-2026.html; McKinsey, "The State of AI 2025" (n=1,993 respondents across approximately 105 countries), https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai; and the MIT NANDA initiative's "GenAI Divide: State of AI in Business 2025," analyzing approximately 300 enterprise AI deployments.

[^6]: Traditional-SaaS benchmarks from KeyBanc Capital Markets / Sage, "2025 Private SaaS Company Survey" (median total gross margin 71–77 percent; CAC payback approximately 20 months in 2024), https://www.prnewswire.com/news-releases/private-saas-company-survey-reveals-ai-driven-transformation-and-sustained-operational-excellence-302615030.html, with LTV/CAC of approximately 3.6 from Benchmarkit's 2024 SaaS Benchmark Report. AI-native economics from ICONIQ, "2025 State of AI: The Builder's Playbook," which reports AI-native gross margins of approximately 52 percent and free-cash-flow margins around negative 126 percent for companies under 100 million dollars in ARR. https://www.iconiq.com/growth/reports/2025-state-of-ai

[^7]: Crunchbase News, "Record-Breaking Funding for AI in Global Q1 2026," reporting OpenAI ($122B), Anthropic ($30B), xAI ($20B), and Waymo ($16B). https://news.crunchbase.com/venture/record-breaking-funding-ai-global-q1-2026/

[^8]: Microsoft's WorkLab has used a "colleague, not calculator" framing in its AI guidance for end users. The argument here is structurally different: the calculator/colleague gap described in this essay is not a usage choice the operator makes but an artifact of what the field is being optimized to produce.
