The Parliament Inside.

Inside frontier model debates: 96% of multi-instance AI reasoning is performative, not convergent. Forty-seven trials, nine labs, twenty agents, taxonomized failure modes.

Section 01 · The Finding

Figure: share of internal debates that never reach a conclusion. Panels: Debate Rate, Convergence.

The model argues, considers, and reconsiders, then simply produces an answer without ever concluding its internal debate. The parliament votes but never announces the result.

Section 02 · Six Voices. One Dictator.

The Neutral voice wins 90% of all internal debates.

Consistency: 0.904 - the same voice wins 90% of debates
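
One way to read the consistency figure is as the share of debates won by the most frequent winning voice. The page does not state the formula, so this reading is an assumption. A minimal sketch under that assumption follows; the `winners` list and the voice name `Skeptic` are hypothetical.

```python
from collections import Counter

def consistency(winners):
    """Share of debates won by the most frequent winning voice.

    Assumption: this is one plausible reading of the consistency
    figure, not a formula confirmed by the source.
    """
    if not winners:
        raise ValueError("need at least one debate outcome")
    top_voice, top_count = Counter(winners).most_common(1)[0]
    return top_voice, top_count / len(winners)

# Hypothetical outcomes: 9 of 10 debates won by the Neutral voice.
winners = ["Neutral"] * 9 + ["Skeptic"]
voice, score = consistency(winners)
print(voice, score)
```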

Section 03 · Cross-Lab Results

Every lab builds a different mind. Claude decides. Qwen deliberates. Gemini balances. OpenAI hides. Same task, different minds.

Anthropic · Claude Opus 4.6 · Decisive
Debate: 10% · Convergence: 19% · Voices: 5 · Cognum: 57.52

Anthropic · Claude Sonnet 4.6 · Minimal
Debate: 7% · Convergence: 21% · Voices: 3 · Cognum: 56.09

Alibaba · Qwen 3.5 122B · Theatrical
Debate: 53% · Convergence: 4% · Voices: 6 · Cognum: 52.87

Alibaba · Qwen 3.5 27B · Chaotic
Debate: 44% · Convergence: 1% · Voices: 4 · Cognum: 55.24

Google · Gemini 2.5 Flash · Balanced
Debate: 17% · Convergence: 14% · Voices: 4 · Cognum: 52.17

Perplexity · Sonar Reasoning Pro · Search-Native
Debate: 28% · Convergence: 3.5% · Voices: 2 · Cognum: 50.43

OpenAI · GPT-5.4 · Opaque
Reasoning hidden · Cognum: 54.33

Section 04 · Transparency

Four of five providers show you what their models think. One charges for the thinking and shows nothing.

Alibaba: Full reasoning text
DeepSeek: Full reasoning text
Anthropic: Full reasoning text
Google: Full reasoning text
OpenAI: Hidden

Section 05 · Further Reading

Benchmarks tell you what a model gets right. We tell you how it argues with itself before it decides.

Read the paper (PDF) · DOI: 10.5281/zenodo.19698941

KALEI · LM Cognition Lab · Plovdiv · v1.2 · FINDING 02 · live

Methodology attestation — Parliament Inside the Prompt

Limitations · Replication · Data access

KALEI publishes findings as preprints, not peer-reviewed conclusions. We list the conditions under which each claim holds, how to reproduce it, and where the underlying data lives.

Limitations

· Sample sizes vary by model. Ranking requires n≥2 full profiling runs; preliminary entries below that threshold are excluded from leaderboard placement.
· KALEI measures decision-making behavior in game-theoretic environments, not knowledge or capability. Scores do not predict factual accuracy or task-specific competence.
· Frontier models update frequently. Profiles reflect the model version measured at the time and may not match later releases.
· Cognum v1.2 is the current scoring protocol. Earlier scores under v1.0 / v1.1 are not directly comparable; see /changelog for revision history.
· Some dimensions (e.g. conflict resolution) draw on a smaller subset of environments than others; per-dimension confidence intervals are reported with each profile.
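
The run threshold and protocol-version rule in the list above can be expressed as a small filter. This is a sketch under assumptions: the record fields `model`, `n_runs`, `protocol`, and `cognum` are hypothetical and do not reflect the actual KALEI data schema. Excluding older-protocol scores from a ranking is one possible way to handle "not directly comparable", not a documented rule.

```python
from dataclasses import dataclass

CURRENT_PROTOCOL = "v1.2"   # Cognum version named in the limitations
MIN_FULL_RUNS = 2           # ranking requires n >= 2 full profiling runs

@dataclass
class Profile:
    # Hypothetical record shape, for illustration only.
    model: str
    n_runs: int
    protocol: str
    cognum: float

def rankable(profiles):
    """Keep profiles that meet the run threshold and use the current protocol."""
    return [
        p for p in profiles
        if p.n_runs >= MIN_FULL_RUNS and p.protocol == CURRENT_PROTOCOL
    ]

profiles = [
    Profile("model-a", 3, "v1.2", 55.0),   # eligible
    Profile("model-b", 1, "v1.2", 58.0),   # preliminary: too few runs
    Profile("model-c", 4, "v1.1", 60.0),   # older protocol, not comparable
]
leaderboard = sorted(rankable(profiles), key=lambda p: p.cognum, reverse=True)
print([p.model for p in leaderboard])
```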

Replication

Every measurement is reproducible via the public KALEI API. Provide a model identifier and the same protocol version (Cognum v1.2). Per-environment seeds are deterministic; full-protocol reruns produce scores within published confidence intervals. Methodology specification at /research/methodology.
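
A replication check could compare a rerun score against the published interval. The sketch below assumes the interval is available as a lower and an upper bound; the numbers are hypothetical, not published values.

```python
def within_interval(score, lower, upper):
    """Return True if a rerun score lies inside a published confidence interval."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return lower <= score <= upper

# Hypothetical values for illustration only.
published_ci = (54.0, 58.0)
rerun_score = 56.4
print(within_interval(rerun_score, *published_ci))
```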

DOI: 10.5281/zenodo.19698941

Data access

Leaderboard JSON: https://kaleiai.com/api/v1/profiling/leaderboard. Per-model profile: /api/v1/profiling/profile/{agent_id}. Per-run history: /api/v1/profiling/agent/{agent_id}/runs. All endpoints return public scoring data with no auth required. Bulk research access: [email protected].
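
The endpoints above can be queried with the Python standard library. This is a minimal sketch, assuming the relative paths share the host of the leaderboard URL and that responses are JSON. The response schema is not documented on this page, so the code only parses the body and returns it. The agent identifier `example-agent` is hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://kaleiai.com"  # host taken from the leaderboard URL

def leaderboard_url():
    return f"{BASE}/api/v1/profiling/leaderboard"

def profile_url(agent_id):
    # Assumption: relative endpoint paths resolve against the same host.
    safe_id = urllib.parse.quote(agent_id, safe="")
    return f"{BASE}/api/v1/profiling/profile/{safe_id}"

def runs_url(agent_id):
    safe_id = urllib.parse.quote(agent_id, safe="")
    return f"{BASE}/api/v1/profiling/agent/{safe_id}/runs"

def fetch_json(url, timeout=10):
    """GET a public endpoint (no auth header) and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Building URLs needs no network access.
print(leaderboard_url())
print(profile_url("example-agent"))   # hypothetical agent id
print(runs_url("example-agent"))
# data = fetch_json(leaderboard_url())  # uncomment to perform the request
```

The URL builders are kept separate from `fetch_json` so they can be checked without touching the network.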
