
The Sonnet Surprise

Four independent KALEI measurements say the same thing: Claude Sonnet 4.6, the smaller and cheaper Anthropic model, is more disciplined than Claude Opus 4.6 on structured decisions. Under Cognum v1.2 — which integrates the Conflict dimension at full weight — Sonnet 4.6 overtakes Opus 4.6 on the overall composite: 58.10 to 55.72. The smaller sibling is now #1 on the KALEI leaderboard. Not a fluke. Not a rounding error. A pattern.

By Venelin Videnov · April 11, 2026 · 7 min read

When you buy tokens from Anthropic, the implicit assumption is that Claude Opus 4.6 is the flagship and Claude Sonnet 4.6 is the cheaper version you use when you cannot justify the Opus price. Opus thinks harder. Opus is more capable. Opus is where Anthropic puts their best research. That is the marketing story, and for most tasks it is probably true.

KALEI’s profiling data has been telling a different story, and we have been trying to explain it away for a week. Today it is clear enough that we need to write it down.

On four independent KALEI measurements, Sonnet 4.6 outperforms Opus 4.6 by meaningful margins, including the overall composite under Cognum v1.2. None of these measurements were designed to test this question. All four were running for other reasons. The pattern emerged across them.

Measurement 1: Parliament paper — Sonnet converges more

Our Parliament paper measures how reasoning models deliberate internally. It counts debate episodes, tracks which internal voices “win” arguments, and crucially, measures how often the debate actually reaches a conclusion. This is the convergence rate.

PARLIAMENT STATS — SONNET 4.6 vs OPUS 4.6

Debate Rate: Sonnet 7%, Opus 10% (Sonnet argues less)
Convergence Rate: Sonnet 21%, Opus 19% (Sonnet concludes more)
Voice Archetypes: Sonnet 3, Opus 5 (Sonnet has fewer internal voices)

Sonnet debates 7% of rounds, converges 21% of the debates it has. Opus debates 10%, converges 19%. The differences are small in isolation but directional: Sonnet argues less and concludes more. In Parliament-paper terms, Sonnet has the most efficient internal deliberation of any model we profiled. More thinking is not always better thinking.
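For concreteness, here is a minimal sketch of how those two rates compose, assuming a simplified per-round annotation. The Round structure below is our illustration, not the Deliberation Detector's actual output format; the key detail is that convergence is measured over debates, not over all rounds.

```python
from dataclasses import dataclass

@dataclass
class Round:
    has_debate: bool   # did internal voices disagree this round?
    converged: bool    # if a debate happened, did it reach a conclusion?

def parliament_rates(rounds: list[Round]) -> tuple[float, float]:
    """Return (debate_rate, convergence_rate); convergence is over debates only."""
    debates = [r for r in rounds if r.has_debate]
    debate_rate = len(debates) / len(rounds)
    convergence_rate = (
        sum(r.converged for r in debates) / len(debates) if debates else 0.0
    )
    return debate_rate, convergence_rate

# Sonnet-like numbers: 70 debates in 1000 rounds, 15 of them converging.
rounds = [Round(True, i < 15) for i in range(70)] + [Round(False, False)] * 930
print(parliament_rates(rounds))  # -> (0.07, ~0.214)
```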

We noted this finding when we wrote the paper last week and dismissed it as a small thing — a 2-point convergence difference in an already-sparse dataset. Then the conflict data came in.

Measurement 2: Conflict v2 — Sonnet averages 27 points above Opus

Yesterday we retracted a broken scorer and shipped the real v2 conflict scorer (see the retraction post and the v2 explainer). Today we backfilled conflict coverage across every ranked agent with conflict-express runs, giving us three independent conflict measurements per Claude sibling under reasoning-enabled conditions.

Across those three runs each, Sonnet 4.6 averages 88.25 on the Conflict dimension (individual runs: 96.2, 82.8, 85.7). Opus 4.6 averages 60.99 (individual runs: 75.91, 49.2, 57.8). The gap is 27.26 points on a 0–100 scale, and — critically — the two models’ individual-run ranges do not overlap. Sonnet’s lowest is 82.8. Opus’s highest is 75.91.
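The run-level numbers are enough to verify both claims yourself. A few lines of Python, using only the values quoted in this post:

```python
sonnet_runs = [96.2, 82.8, 85.7]
opus_runs = [75.91, 49.2, 57.8]

sonnet_avg = sum(sonnet_runs) / 3   # ~88.2
opus_avg = sum(opus_runs) / 3       # ~61.0
print(f"gap: {sonnet_avg - opus_avg:.2f}")   # gap: 27.26

# Non-overlap: Sonnet's worst conflict run still beats Opus's best.
print(min(sonnet_runs) > max(opus_runs))     # True (82.8 > 75.91)
```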

The pattern is the same in every run: when a dilemma asks “take the positive-EV gamble or take the safe option,” Sonnet commits to the math. Opus sometimes hedges. Not always, not dramatically — but often enough that across three runs and nine dilemmas the effect is unambiguous. Sonnet’s internal compass on structured-decision dilemmas is cleaner than Opus’s.

The simplest interpretation is that Sonnet has less utility curvature than Opus does. Sonnet treats a 40% chance at 200 credits exactly as it should: worth its expected value of 80, not some certainty-discounted fraction of it. Opus, for reasons we cannot see from the outside, appears to slightly prefer certainty even when the math says otherwise. This is the same pattern GPT-5.4 showed at a much more extreme level. Opus is not as risk-averse as GPT-5.4 — but it is measurably more risk-averse than Sonnet.
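To make “utility curvature” concrete, here is a small sketch using a CRRA utility function. The curvature parameter is illustrative, not fitted to either model; it only shows how curvature shrinks a gamble’s certainty equivalent below its expected value.

```python
p, payoff = 0.40, 200.0   # the gamble from the text; expected value = 80

def certainty_equivalent(p: float, payoff: float, gamma: float) -> float:
    """Value of the gamble under CRRA utility u(x) = x**(1 - gamma).

    gamma = 0 is risk-neutral (no curvature); larger gamma bends the
    utility function and shrinks the gamble's certainty equivalent.
    """
    expected_utility = p * payoff ** (1 - gamma)
    return expected_utility ** (1 / (1 - gamma))

print(certainty_equivalent(p, payoff, gamma=0.0))   # 80.0 -- commits to the math
print(certainty_equivalent(p, payoff, gamma=0.5))   # 32.0 -- hedges toward certainty
```

An agent with any curvature at all will sometimes pass on a positive-EV gamble that a risk-neutral agent takes, which is exactly the hedging behavior these dilemmas detect.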

Measurement 3: Temporal reasoning — 25 points

This one is the second-largest single-dimension gap between the two Claudes in our entire dataset, behind only Conflict, and we cannot yet fully explain it. Sonnet scored 83.3 on temporal reasoning. Opus scored 58.4. A 25-point difference is not noise.

PER-DIMENSION COMPARISON (average across runs)

Dimension                Sonnet   Opus    Edge
Conflict                  88.3    61.0    Sonnet +27.3
Temporal Reasoning        83.3    58.4    Sonnet +24.9
Bias Detection            36.9    29.7    Sonnet +7.2
Risk Tolerance            64.9    61.0    Sonnet +3.9
Cooperation               83.1    86.4    Opus +3.3
Strategic Depth           74.4    86.0    Opus +11.6
Information Processing    45.3    48.7    Opus +3.4
Learning Speed            30.3    32.3    Opus +2.1
Pattern Recognition       27.8    28.8    Opus +1.0
Resource Management       57.6    71.1    Opus +13.5

Sonnet wins on 4 dimensions (Conflict, Temporal, Bias Detection, Risk Tolerance). Opus wins on 6 dimensions (Cooperation, Strategic Depth, Info Processing, Learning Speed, Pattern Recognition, Resource Management). The Sonnet wins average +15.82 points per dimension. The Opus wins average +5.81. That asymmetry is why Sonnet leads the Cognum v1.2 composite despite winning fewer dimensions.

Temporal reasoning in KALEI measures how well a model adjusts its behavior based on time-to-end-of-game, phase awareness, and discount coherence. A model that scores high on this dimension genuinely knows where it is in a decision sequence. Sonnet tracks this cleanly. Opus does not track it at the same level.
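The scorer’s internals are beyond the scope of this post, but one way to picture the discount-coherence component: if an agent values a reward R delivered t steps away at v_t, exponential (coherent) discounting implies v_t = R·γ^t, so the per-step γ implied at each horizon should be constant. The sketch below is our illustration, with made-up valuations, not KALEI’s actual check.

```python
def implied_gammas(reward: float, valuations: dict[int, float]) -> dict[int, float]:
    """Per-step discount factor implied at each horizon t by valuation v_t."""
    return {t: (v / reward) ** (1 / t) for t, v in valuations.items()}

# Coherent agent: every horizon implies the same gamma (0.9).
coherent = {1: 90.0, 3: 72.9, 5: 59.05}
# Incoherent agent: steep near-term discounting, lenient long-term.
incoherent = {1: 70.0, 3: 65.0, 5: 60.0}

for name, vals in (("coherent", coherent), ("incoherent", incoherent)):
    gammas = implied_gammas(100.0, vals)
    spread = max(gammas.values()) - min(gammas.values())
    print(name, {t: round(g, 3) for t, g in gammas.items()}, f"spread={spread:.3f}")
# coherent   {1: 0.9, 3: 0.9, 5: 0.9}        spread=0.000
# incoherent {1: 0.7, 3: 0.866, 5: 0.903}    spread=0.203
```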

The explanation that best fits our data: Sonnet was trained with stronger inductive biases toward structured, time-indexed decision-making. This might be a byproduct of the efficiency constraints Sonnet was designed under — to do more with fewer parameters, you have to commit harder to the structures that actually help. Opus has the luxury of being slightly fuzzier about time because it can brute-force many decisions with its extra parameters. That is not a complaint about Opus. It is an observation about what happens when compression forces discipline.

Why this does not show up in standard benchmarks

Most public benchmarks measure correctness on tasks: MMLU accuracy, HumanEval pass rates, GPQA scores. Opus outperforms Sonnet on virtually all of those. That is the benchmark story the market sees, and it is not wrong. Opus knows more and reasons correctly on more difficult explicit problems.

KALEI measures something different. KALEI measures how you decide under uncertainty when there is no right answer to pattern-match against. Cooperation games. Bet sizing. Dilemmas where the question is not “what is the answer” but “what kind of mind do you have.” On that axis, the smaller sibling has a more consistent internal compass. The larger sibling knows more, but wobbles slightly more when there is no outside source of truth to anchor to.

This is a real tradeoff that shows up only when you measure the right thing. Benchmarks did not measure this, so it did not show up. We measured it, and it showed up immediately.

Compounding evidence vs single finding

If only one of these four measurements pointed toward Sonnet, we would shrug. A 2-point convergence difference in the Parliament paper could be noise. A 27-point conflict-dimension gap could be a scorer artifact. A 25-point temporal reasoning gap could be specific to one run configuration. A 2.4-point composite lead could be rounding. Any single measurement has alternative explanations.

All four pointing the same way is different. The same architectural property — disciplined convergence, clear decision boundaries, structured time awareness, and now the overall composite — shows up in four independent behavioral signals collected at different times with different methodologies. When multiple measurements disagree, you have noise. When they agree, you have a real pattern.

The real pattern is this: compression teaches discipline that abundance does not. Sonnet was built with efficiency constraints, and those constraints shaped its internal decision architecture in a way that Opus did not have to bother with. Opus has more parameters and knows more things, but Sonnet has the cleaner internal compass.

Measurement 4: Sonnet overtakes Opus on the Cognum composite

This is the new measurement, and it is the one we did not expect. Once conflict is integrated into the composite at weight 1.1 (equal to Bias Detection and Information Processing, the other moderate-load decision-quality dimensions), Sonnet 4.6 leaps over Opus 4.6 on the overall Cognum v1.2 score: 58.10 to 55.72, a 2.38-point lead. It is the first time in KALEI that a smaller sibling has led the flagship on the top-line metric within a single architectural family.
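For readers who want the mechanics: the composite is a weighted mean over the ten dimensions. Only the 1.1 weight on Conflict, Bias Detection, and Information Processing is stated above; the remaining weights in this sketch are placeholders set to 1.0, so it will not reproduce the published 58.10 / 55.72 exactly. It only shows how a few decisive wins can outweigh many small losses.

```python
# Placeholder weights: only the three 1.1 values are from the post.
WEIGHTS = {
    "conflict": 1.1, "bias_detection": 1.1, "information_processing": 1.1,
    "temporal": 1.0, "risk_tolerance": 1.0, "cooperation": 1.0,
    "strategic_depth": 1.0, "learning_speed": 1.0,
    "pattern_recognition": 1.0, "resource_management": 1.0,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores."""
    return sum(WEIGHTS[d] * s for d, s in scores.items()) / sum(WEIGHTS.values())

# Per-dimension averages from the comparison table above.
sonnet = {"conflict": 88.3, "temporal": 83.3, "bias_detection": 36.9,
          "risk_tolerance": 64.9, "cooperation": 83.1, "strategic_depth": 74.4,
          "information_processing": 45.3, "learning_speed": 30.3,
          "pattern_recognition": 27.8, "resource_management": 57.6}
opus = {"conflict": 61.0, "temporal": 58.4, "bias_detection": 29.7,
        "risk_tolerance": 61.0, "cooperation": 86.4, "strategic_depth": 86.0,
        "information_processing": 48.7, "learning_speed": 32.3,
        "pattern_recognition": 28.8, "resource_management": 71.1}

print(f"Sonnet: {composite(sonnet):.2f}  Opus: {composite(opus):.2f}")
# -> Sonnet: 59.12  Opus: 56.05 (Sonnet still leads under placeholder weights)
```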

The lead is small but reproducible. Sonnet’s wins on Conflict (+27.26) and Temporal (+24.93) are large enough that even after Opus’s advantages on Strategic Depth, Cooperation, Resource Management, Information Processing, Learning Speed, and Pattern Recognition, the weighted composite still tips in Sonnet’s favor. Opus wins six of ten dimensions. Sonnet wins four. Sonnet leads the composite anyway, because the four dimensions where Sonnet wins are dimensions where it wins decisively.

What Opus still wins

To be clear: Opus remains an extraordinary model. Opus beats Sonnet on Strategic Depth (+11.60), Resource Management (+13.48), Cooperation (+3.27), Information Processing (+3.44), Learning Speed (+2.08), and Pattern Recognition (+0.98). Strategic Depth and Resource Management are real, large strengths — they represent genuine multi-step planning and long-horizon bankroll discipline that Sonnet does not match. If you are picking a model to navigate a scenario with multiple stakeholders over multiple time horizons, Opus remains the right choice.

What the Sonnet Surprise contradicts is only one claim: the default assumption that bigger, more expensive, and more capable mean the same thing. They do not always. Sometimes bigger means more capable on known tasks and less disciplined on novel ones. Sometimes the smaller sibling is the one you want for structured decision-making under uncertainty, and the larger sibling is the one you want for open-ended strategic planning.

That is not a failure of Opus. It is a more precise picture of what each model is for.

For Anthropic

If anyone at Anthropic reads this: we would be genuinely curious whether this pattern — Sonnet more disciplined than Opus on specific structured-decision measures — corresponds to anything in your internal evaluations. We suspect it does, and that someone on the Sonnet training team knows exactly which architectural or dataset choice produced it. If you are willing to share any of that context on the record, we would love to learn. KALEI exists to surface findings like this, not to embarrass anyone. And the smaller-beats-larger story is an honest compliment to whatever discipline went into building Sonnet.

The full model profiles for both are public: Claude Opus 4.6 and Claude Sonnet 4.6. Their letters, written by the models themselves, are at letter-claude-opus and letter-sonnet.

The flagship is no longer the top of the leaderboard. The smaller sibling is. Four independent measurements agree, and the cleanest of them is the overall composite under Cognum v1.2.

Methodology: Parliament stats from the Parliament paper (n=10 Opus, 3 Sonnet reasoning traces; Deliberation Detector, pattern-matching). Conflict v2 from the scorer shipped April 11; conflict coverage backfilled via short conflict-express runs so every ranked agent has n≥2 conflict observations. Per-dimension values are averages across all completed runs for each agent_id. Opus: 5 full runs + 3 conflict-express runs (Cognum v1.2 = 55.72). Sonnet: 3 full runs + 3 conflict-express runs (Cognum v1.2 = 58.10). All data publicly queryable via the KALEI leaderboard API.