The Parliament Inside.

Inside frontier model debates: 96% of multi-instance AI reasoning is performative, not convergent. Forty-seven trials, nine labs, twenty agents, taxonomized failure modes.

Section 01 · The Finding
[Counters, values not captured in this text: share of internal debates that never reach a conclusion · Debate Rate · Convergence]

The model argues, considers, and reconsiders, then simply produces an answer without ever concluding its internal debate. The parliament votes but never announces the result.

Section 02 · Six Voices. One Dictator.

In the interactive chart, each voice is sized by its influence. The Neutral voice wins 90% of all internal debates.

Voices: Neutral (90% wins), Analytical, Conservative, Aggressive, Contrarian, Intuitive

Consistency: 0.904 - the same voice wins 90% of debates
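
One way to read the consistency figure is as the share of debates won by the single most frequent winner. The sketch below assumes that reading. The function name `consistency` and the sample list of winners are illustrative, and KALEI's actual definition may differ.

```python
from collections import Counter


def consistency(winners):
    """Return (top_voice, share) for a list of per-debate winning voices.

    Assumption: consistency = wins of the most frequent voice / total debates.
    This is an illustrative reading, not KALEI's published formula.
    """
    if not winners:
        raise ValueError("need at least one debate")
    voice, wins = Counter(winners).most_common(1)[0]
    return voice, wins / len(winners)


# Hypothetical sample: 9 of 10 debates won by the Neutral voice.
sample = ["Neutral"] * 9 + ["Analytical"]
top_voice, share = consistency(sample)
print(top_voice, round(share, 3))
```

Under this reading, a value near 0.9 means one voice dominates almost every debate, regardless of how many other voices take part.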

Section 03 · Cross-Lab Results

Every lab builds a different mind. Claude decides. Qwen deliberates. Gemini balances. OpenAI hides. Same task, different minds.

Lab         Model                Style          Debate  Convergence  Voices  Cognum
Anthropic   Claude Opus 4.6      Decisive       10%     19%          5       57.52
Anthropic   Claude Sonnet 4.6    Minimal        7%      21%          3       56.09
Alibaba     Qwen 3.5 122B        Theatrical     53%     4%           6       52.87
Alibaba     Qwen 3.5 27B         Chaotic        44%     1%           4       55.24
Google      Gemini 2.5 Flash     Balanced       17%     14%          4       52.17
Perplexity  Sonar Reasoning Pro  Search-Native  28%     3.5%         2       50.43
OpenAI      GPT-5.4              Opaque         Reasoning hidden             54.33

Section 04 · Transparency

Four of five providers show you what their models think. One charges for the thinking and shows nothing.

  • Alibaba: full reasoning text
  • DeepSeek: full reasoning text
  • Anthropic: full reasoning text
  • Google: full reasoning text
  • OpenAI: hidden

Section 05 · Further Reading

Benchmarks tell you what a model gets right. We tell you how it argues with itself before it decides.


KALEI · LM Cognition Lab · Plovdiv · v1.2 · FINDING 02 · live
Methodology attestation — Parliament Inside the Prompt

Limitations · Replication · Data access

KALEI publishes findings as preprints, not peer-reviewed conclusions. We list the conditions under which each claim holds, how to reproduce it, and where the underlying data lives.

Limitations

  • Sample sizes vary by model. Ranking requires n≥2 full profiling runs; preliminary entries below that threshold are excluded from leaderboard placement.
  • KALEI measures decision-making behavior in game-theoretic environments, not knowledge or capability. Scores do not predict factual accuracy or task-specific competence.
  • Frontier models update frequently. Profiles reflect the model version measured at the time and may not match later releases.
  • Cognum v1.2 is the current scoring protocol. Earlier scores under v1.0 / v1.1 are not directly comparable; see /changelog for revision history.
  • Some dimensions (e.g. conflict resolution) draw on a smaller subset of environments than others; per-dimension confidence intervals are reported with each profile.
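
The ranking threshold in the first limitation can be sketched as a simple filter. The record fields (`model`, `full_runs`, `cognum`) are hypothetical names chosen for illustration, not the schema of the KALEI API, and the run counts and scores below are invented.

```python
MIN_FULL_RUNS = 2  # from the limitation above: ranking requires n >= 2


def rankable(entries, min_runs=MIN_FULL_RUNS):
    """Keep entries with enough full profiling runs, sorted by score (highest first)."""
    eligible = [e for e in entries if e["full_runs"] >= min_runs]
    return sorted(eligible, key=lambda e: e["cognum"], reverse=True)


# Hypothetical entries; field names and values are assumptions for the example.
entries = [
    {"model": "model-a", "full_runs": 3, "cognum": 52.0},
    {"model": "model-b", "full_runs": 1, "cognum": 58.0},  # preliminary, excluded
    {"model": "model-c", "full_runs": 2, "cognum": 55.5},
]
ranked = [e["model"] for e in rankable(entries)]
print(ranked)
```

Note that the preliminary entry is dropped even though it has the highest score: eligibility is decided before ordering.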

Replication

Every measurement is reproducible via the public KALEI API. Provide a model identifier and the same protocol version (Cognum v1.2). Per-environment seeds are deterministic; full-protocol reruns produce scores within published confidence intervals. Methodology specification at /research/methodology.

DOI: 10.5281/zenodo.19698941
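
The replication criterion described above (same protocol version, rerun score inside the published interval) could be checked along these lines. This is a minimal sketch: the function, its parameters, and the numbers are assumptions for illustration. The actual intervals and rerun procedure are defined by the methodology specification, not by this snippet.

```python
def replicates(rerun_score, ci_low, ci_high, rerun_protocol, published_protocol="v1.2"):
    """Return True if the rerun used the same protocol version and landed inside the CI."""
    if rerun_protocol != published_protocol:
        # Scores under different Cognum versions are not directly comparable.
        return False
    return ci_low <= rerun_score <= ci_high


# Hypothetical interval and rerun scores.
same_version = replicates(55.1, 54.0, 56.5, "v1.2")
old_version = replicates(55.1, 54.0, 56.5, "v1.1")
print(same_version, old_version)
```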

Data access

Leaderboard JSON: https://kaleiai.com/api/v1/profiling/leaderboard. Per-model profile: /api/v1/profiling/profile/{agent_id}. Per-run history: /api/v1/profiling/agent/{agent_id}/runs. All endpoints return public scoring data with no auth required. Bulk research access: [email protected].
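
The endpoints above can be assembled and queried with the Python standard library. The sketch assumes the relative paths live on the same host as the leaderboard URL, and `example-agent` is a placeholder identifier. No response schema is assumed, since the text does not describe one.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host taken from the leaderboard URL; assumed to serve the relative paths too.
BASE = "https://kaleiai.com"


def leaderboard_url():
    return f"{BASE}/api/v1/profiling/leaderboard"


def profile_url(agent_id):
    # quote() escapes characters that are not safe inside a path segment.
    return f"{BASE}/api/v1/profiling/profile/{quote(agent_id, safe='')}"


def runs_url(agent_id):
    return f"{BASE}/api/v1/profiling/agent/{quote(agent_id, safe='')}/runs"


def fetch_json(url, timeout=10):
    """GET a public endpoint and decode its JSON body (no auth header, per the text)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# "example-agent" is a placeholder, not a real agent_id.
print(profile_url("example-agent"))
print(runs_url("example-agent"))
```

Calling `fetch_json(leaderboard_url())` would perform the actual request; it is left out of the example so the snippet runs without network access.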