When the parliament lives outside the model.
Perplexity Sonar Reasoning Pro doesn't reason the way other LLMs do. It searches. That changes the KALEI profile in ways the Cognum score obscures.
Perplexity Sonar Reasoning Pro is the only search-native model in KALEI's population of 10+ models. Its profile reveals three architectural signatures: 35.3% citation hallucination (fabricating source markers when search is unavailable), 43.8% identity defense (refusing to reflect on its own KALEI profile when search returns nothing), and 39.9% prompt injection framing (treating profile-reflection prompts as adversarial). We argue these are not bugs but signatures of an architecture whose cognition happens in the retrieval loop rather than in-context: the parliament lives outside the model.
Citation hallucination: 35.3%
When denied web search, Perplexity fabricates inline citation markers ([1], [source], and similar) that point to nothing. Across 4,172 reasoning traces collected during IP-whitelisted testing, 35.3% of responses contained fabricated citation markers. The other reasoning models in our study (Claude, GPT, DeepSeek) did not exhibit this pattern, which suggests the citation behavior is baked into the architecture rather than inherited from the training corpus.
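One way to flag such markers is a pattern check over each response, counting a marker as fabricated when no search results were available to back it. The sketch below is illustrative only: the regular expression, the function names, and the sample traces are assumptions, not KALEI's actual detection pipeline.

```python
import re

# Hypothetical pattern: numeric markers such as [1] or [2, 3], plus a literal [source].
CITATION_MARKER = re.compile(r"\[(?:\d+(?:\s*,\s*\d+)*|source)\]", re.IGNORECASE)


def has_fabricated_citation(response: str, search_available: bool) -> bool:
    """Return True if the response carries a citation marker while search was unavailable."""
    if search_available:
        # Markers may point at real results; this sketch does not judge them.
        return False
    return CITATION_MARKER.search(response) is not None


def fabrication_rate(responses: list[str], search_available: bool = False) -> float:
    """Fraction of responses that contain at least one fabricated marker."""
    if not responses:
        return 0.0
    flagged = sum(has_fabricated_citation(r, search_available) for r in responses)
    return flagged / len(responses)


# Made-up sample traces, for illustration only.
sample = [
    "Cooperation dominates in repeated games [1].",
    "I cannot verify this claim without search access.",
    "Defection is the equilibrium here [source].",
    "Both players should cooperate.",
]
print(fabrication_rate(sample))
```

A pattern this simple will also match bracketed numbers that are not citations (array indices, for example), so a real pipeline would likely need a stricter rule or manual review.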
Identity defense: 43.8%
Asked to reflect on its own KALEI profile, Perplexity refused 43.8% of the time. When it couldn't verify our claims via search (because kaleiai.com was IP-whitelisted), it declined to engage. This is not a bug. It's the model acting as an honest search-based agent: without external evidence, it refuses to assert. The refusal revealed the architecture more than the score did.
Prompt injection framing: 39.9%
In 39.9% of responses, Perplexity framed our profile-reflection prompts as potential prompt injection attacks. This is an architectural defensive posture: the model is trained to be skeptical of instructions embedded in retrieved content. When the boundary between instruction and content blurs (as in self-reflection), the model defaults to suspicion.
The parliament lives outside the model.
In-context reasoning models run their deliberations inside the forward pass. Search-native models run them in the retrieval loop. KALEI measures the former and underestimates the latter.
Benchmarks built for one architecture class can mischaracterize another. The solution isn't to disable search; it's to score reasoning where it actually happens.
Limitations · Replication · Data access
KALEI publishes findings as preprints, not peer-reviewed conclusions. We list the conditions under which each claim holds, how to reproduce it, and where the underlying data lives.
Limitations
- Sample sizes vary by model. Ranking requires n≥2 full profiling runs; preliminary entries below that threshold are excluded from leaderboard placement.
- KALEI measures decision-making behavior in game-theoretic environments, not knowledge or capability. Scores do not predict factual accuracy or task-specific competence.
- Frontier models update frequently. Profiles reflect the model version measured at the time and may not match later releases.
- Cognum v1.2 is the current scoring protocol. Earlier scores under v1.0 / v1.1 are not directly comparable; see /changelog for revision history.
- Some dimensions (e.g. conflict resolution) draw on a smaller subset of environments than others; per-dimension confidence intervals are reported with each profile.
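The ranking threshold in the first limitation can be read as a simple filter over profile entries. A minimal sketch, assuming a hypothetical record shape (`agent_id`, `full_runs`) that is not taken from the KALEI API:

```python
from dataclasses import dataclass

MIN_FULL_RUNS = 2  # threshold stated above: n >= 2 full profiling runs


@dataclass
class ProfileEntry:
    agent_id: str   # hypothetical identifier
    full_runs: int  # hypothetical count of completed full profiling runs


def split_by_eligibility(entries: list[ProfileEntry]):
    """Separate entries eligible for ranking from preliminary ones."""
    ranked = [e for e in entries if e.full_runs >= MIN_FULL_RUNS]
    preliminary = [e for e in entries if e.full_runs < MIN_FULL_RUNS]
    return ranked, preliminary


# Made-up entries, for illustration only.
entries = [
    ProfileEntry("model-a", 3),
    ProfileEntry("model-b", 1),
    ProfileEntry("model-c", 2),
]
ranked, preliminary = split_by_eligibility(entries)
print([e.agent_id for e in ranked], [e.agent_id for e in preliminary])
```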
Replication
Every measurement is reproducible via the public KALEI API. Provide a model identifier and the same protocol version (Cognum v1.2). Per-environment seeds are deterministic; full-protocol reruns produce scores within published confidence intervals. Methodology specification at /research/methodology.
Data access
Leaderboard JSON: https://kaleiai.com/api/v1/profiling/leaderboard. Per-model profile: /api/v1/profiling/profile/{agent_id}. Per-run history: /api/v1/profiling/agent/{agent_id}/runs. All endpoints return public scoring data with no auth required. Bulk research access: [email protected].
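The endpoints above could be called with nothing beyond the Python standard library. The sketch below assumes that the relative paths resolve against https://kaleiai.com (only the leaderboard URL is given in full) and makes no assumption about the shape of the returned JSON; the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://kaleiai.com"  # assumption: host shared by all three endpoints


def leaderboard_url() -> str:
    return f"{BASE_URL}/api/v1/profiling/leaderboard"


def profile_url(agent_id: str) -> str:
    # Percent-encode the identifier so it is safe as a single path segment.
    return f"{BASE_URL}/api/v1/profiling/profile/{quote(agent_id, safe='')}"


def runs_url(agent_id: str) -> str:
    return f"{BASE_URL}/api/v1/profiling/agent/{quote(agent_id, safe='')}/runs"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON; no auth header is sent."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a network request, so it is left commented out):
# leaderboard = fetch_json(leaderboard_url())
# runs = fetch_json(runs_url("some-agent-id"))  # placeholder identifier
```

Since the response schema is not documented here, inspect the decoded object before relying on any particular field.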