The Anthropic Sweep
Four models from one laboratory occupy the top four positions on the KALEI Cognum v1.2 leaderboard. Sonnet 4.6, Sonnet 4, Opus 4.6, Haiku 4.5 — all Anthropic. No other lab has more than one model in the top five. KALEI profiles models from nine laboratories equally. The sweep was not designed. It emerged from the data.
By Venelin Videnov · April 12, 2026 · 6 min read
We did not set out to write an advertisement for Anthropic. KALEI was built to measure cognitive profiles across laboratories — to find the differences, not to crown a winner. The platform profiles models from Anthropic, OpenAI, Google, xAI, DeepSeek, Alibaba, Meta, Mistral, and MiniMax using identical environments, identical scoring, and identical conditions.
Today, after Claude Sonnet 4 completed its second profiling run and entered the ranked leaderboard, the top four positions all belong to one lab.
COGNUM v1.2 LEADERBOARD — TOP 10
The gap is not marginal. From Sonnet 4.6 at #1 (58.10) to Grok 4.1 Fast at #5 (53.75), there is a 4.35-point spread. Within the Anthropic four, the spread is also meaningful: 4.16 points separate Sonnet 4.6 from Haiku 4.5. Between Haiku at #4 (53.94) and Grok at #5 (53.75), the gap narrows to 0.19 — essentially a tie. The break point is between Anthropic and everyone else.
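The spreads quoted above can be checked directly from the composite CQ scores given in the article (a quick sketch; only the four scores stated in the text are used):

```python
# Composite CQ scores quoted in the article (Cognum v1.2 leaderboard).
scores = {
    "Sonnet 4.6": 58.10,
    "Sonnet 4": 57.84,
    "Haiku 4.5": 53.94,
    "Grok 4.1 Fast": 53.75,
}

# Spread from #1 to #5.
print(round(scores["Sonnet 4.6"] - scores["Grok 4.1 Fast"], 2))  # 4.35
# Spread within the Anthropic four (Sonnet 4.6 vs Haiku 4.5).
print(round(scores["Sonnet 4.6"] - scores["Haiku 4.5"], 2))      # 4.16
# Gap across the Anthropic / non-Anthropic boundary.
print(round(scores["Haiku 4.5"] - scores["Grok 4.1 Fast"], 2))   # 0.19
```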
The new entry: Claude Sonnet 4
Claude Sonnet 4 is the older Anthropic model from May 2025, predating the Sonnet 4.6 update. It completed its first profiling run yesterday and its second today. The results:
CLAUDE SONNET 4 (MAY 2025) — RUN HISTORY
Run 1 (April 11): CQ 59.17
Run 2 (April 12): CQ 56.51
Average: CQ 57.84
Before this run completed, we had a hypothesis to test: did the training update from Sonnet 4 to Sonnet 4.6 improve cognitive performance? The first Sonnet 4 run scored 59.17 — higher than Sonnet 4.6’s average of 58.10. That looked like a regression. The second run came in at 56.51, pulling the average to 57.84.
The answer: neither improvement nor regression. Sonnet 4 at 57.84 and Sonnet 4.6 at 58.10 are within 0.26 points of each other — well inside run-to-run variance. Whatever Anthropic changed between May 2025 and the 4.6 update, it did not move the needle on KALEI’s cognitive dimensions. The cognitive personality persisted across training updates.
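The averaging is simple enough to show explicitly (a sketch using only the run scores quoted above):

```python
# Sonnet 4's two run scores, as reported in the run history.
runs_sonnet_4 = [59.17, 56.51]
avg_sonnet_4 = sum(runs_sonnet_4) / len(runs_sonnet_4)
print(round(avg_sonnet_4, 2))  # 57.84

# Difference from Sonnet 4.6's composite average.
avg_sonnet_46 = 58.10
print(round(avg_sonnet_46 - avg_sonnet_4, 2))  # 0.26
```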
Two Sonnets, one pattern
This is the finding that matters most for the compression hypothesis. Yesterday we had one data point: Sonnet 4.6 beats Opus. Today we have two: Sonnet 4 also beats Opus. Two independently trained Sonnet models, released months apart, both outperform the flagship on the KALEI composite.
The compression hypothesis says: smaller models are more disciplined on structured decisions because capacity pressure forces convergence on the dominant frame. Larger models can afford to entertain multiple conflicting framings, which is usually a feature but becomes a bug on decisions where one frame is sufficient.
One Sonnet beating Opus could be a quirk of one training run. Two Sonnets beating Opus is a pattern. The discipline is not an accident of Sonnet 4.6’s specific training. It is a property of the Sonnet-scale architecture.
What the sweep means
Nine laboratories. Nineteen ranked models. One lab takes the top four positions. There are three ways to read this:
Reading 1: Anthropic builds better models for structured decision-making. This is the simplest interpretation. Whatever architectural and training choices Anthropic makes — constitutional AI, RLHF approach, model distillation pipeline — the result is models that are consistently better at navigating uncertainty, managing risk, cooperating in game-theoretic settings, and resolving structured dilemmas. Not on one model. On all four.
Reading 2: KALEI’s scoring favours a particular cognitive style. This is the skeptical reading, and we take it seriously. If our scoring rewards some property that Anthropic models happen to share — perhaps a training signal toward EV-rationality, or a cooperation bias from RLHF — then the sweep might tell us more about KALEI than about Anthropic. We mitigate this with the random baseline (CQ 38.32) and with the 10-dimension decomposition, which shows that Anthropic models do not win uniformly: Grok leads on some dimensions, Qwen on others. The sweep is composite-level, not dimension-level.
Reading 3: The sample is too small to conclude. Fair. Sonnet 4 has two runs. Most models have two or three. More data will either confirm or erode the sweep. We are running second passes on all preliminary models this week. If Qwen 3.5 27B (currently at 57.52, n=1) confirms, it would break into the top 3 and disrupt the Anthropic monopoly. The data will tell us.
The hierarchy within Anthropic
The internal ranking is itself revealing:
The two Sonnets lead. Opus is third. Haiku is fourth. This is not ordered by price, by parameter count, or by benchmark score. It is ordered by cognitive discipline under uncertainty. The Sonnet-scale models — both generations — are the most disciplined. Opus, the most capable on standard benchmarks, is third. Haiku, the smallest and cheapest, still beats every non-Anthropic model.
Even within Anthropic, the compression hypothesis holds. The smaller models beat the larger one on the composite. The hierarchy of cognitive discipline is the inverse of the hierarchy of raw capability.
What happens next
We are running second profiling passes on all preliminary (n=1) models this week. The models most likely to challenge the Anthropic sweep are Qwen 3.5 27B (preliminary CQ 57.52) and Grok 3 Mini Fast (preliminary CQ 57.50). If either holds above 57 at n=2, they would enter the ranked top three and break the sweep.
We will report whatever happens. If the sweep holds, it holds. If it breaks, we will write about what broke it with the same honesty we used to write about what established it. That is what KALEI is for.
One lab. Four models. Top four positions. Nine laboratories profiled under identical conditions. The data was not designed to show this. It showed it anyway.
Methodology: All models profiled under identical conditions: same 83 environments, same scoring engine (Cognum v1.2), same trap randomization. Ranking requires n≥2 full profiling runs. Conflict dimension integrated at weight 1.1. Sonnet 4 entered the ranked leaderboard on April 12 after completing its second full-profile run. Random baseline: CQ 38.32 (n=2). Full data available via the KALEI leaderboard.
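The ranking rule in the methodology — a model enters the ranked leaderboard only at n≥2 full runs, with its CQ taken as the mean of those runs — can be sketched as follows. The function name is ours, and the run scores are the figures quoted in this article:

```python
# Illustrative run histories from figures quoted in the article.
runs = {
    "Claude Sonnet 4": [59.17, 56.51],   # n=2 -> ranked as of April 12
    "Qwen 3.5 27B": [57.52],             # n=1 -> preliminary, unranked
}

def ranked_cq(run_scores, min_runs=2):
    """Return the mean CQ if the model qualifies for ranking (n >= min_runs), else None."""
    if len(run_scores) < min_runs:
        return None
    return round(sum(run_scores) / len(run_scores), 2)

for model, scores in runs.items():
    cq = ranked_cq(scores)
    status = f"ranked at CQ {cq}" if cq is not None else "preliminary (n<2)"
    print(f"{model}: {status}")
```

This is why Qwen 3.5 27B sits outside the ranked top three for now despite a preliminary CQ above Haiku's: until its second run completes, its score does not count toward the ranking.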