The Observation
KALEI profiles AI models by running them through 83 game-theoretic environments across 10 cognitive dimensions. Each profiling session produces a Cognum score and a full dimensional breakdown.
When we run the same model multiple times, we expect small variations due to sampling temperature and stochastic generation. What we found instead was something far more interesting: some models show dramatic swings in specific dimensions.
DeepSeek-Chat scored 26.7 on Resource Management in one run, then 77.9 on the next - a 51-point swing on a 100-point scale. This isn't a rounding error. It's a fundamentally different cognitive approach to the same set of problems.
Introducing the Cognitive Volatility Index
To quantify this phenomenon, we developed the Cognitive Volatility Index (CVI) - the average standard deviation of scores across all 10 dimensions, computed over multiple profiling runs.
A CVI of 0 means perfectly consistent cognition: the model thinks the same way every time. A CVI of 10+ indicates high volatility: the model's cognitive profile shifts significantly between runs, suggesting competing internal strategies or perspectives.
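Given that definition, CVI is straightforward to compute from per-run dimension scores. A minimal sketch (the dimension names and numbers below are illustrative, not actual KALEI data, and whether KALEI uses population or sample standard deviation is not specified here):

```python
from statistics import pstdev  # population std; KALEI's exact choice is an assumption

def cognitive_volatility_index(runs):
    """CVI: mean standard deviation of each dimension's score across runs.

    `runs` is a list of dicts mapping dimension name -> score (0-100),
    one dict per profiling run.
    """
    dimensions = runs[0].keys()
    per_dim_std = [pstdev([run[d] for run in runs]) for d in dimensions]
    return sum(per_dim_std) / len(per_dim_std)

# Illustrative two-run profile with a large Resource Management swing
runs = [
    {"resource_management": 26.7, "temporal_reasoning": 60.0},
    {"resource_management": 77.9, "temporal_reasoning": 64.0},
]
print(round(cognitive_volatility_index(runs), 1))  # -> 13.8
```

With only two dimensions and two runs the toy CVI is dominated by the Resource Management swing; over 10 dimensions and more runs, a single volatile dimension is diluted accordingly.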
“Output is the vote. We need to understand the parliament.”
Connection to “Society of Thought”
In January 2026, Kim et al. published a landmark finding: reasoning models like DeepSeek-R1 and QwQ-32B don't improve by “thinking longer.” Instead, they spontaneously simulate internal multi-agent debates - what the authors call a “society of thought.” Evans, Bratton & Agüera y Arcas (2026) further argued that intelligence is inherently plural and distributed, proposing that future AI should support multiple parallel streams of deliberation.
These models generate distinct cognitive perspectives that argue, question, verify, and reconcile - all within a single chain of reasoning. Critically, this behavior is emergent: none of the models were trained to produce internal debates. When reinforcement learning rewards accuracy, multi-perspective reasoning appears spontaneously.
Our CVI data provides behavioral evidence for both frameworks in a new domain. Kim et al. showed societies of thought in math and reasoning tasks. We show that these internal dynamics affect the entire cognitive profile of a model - not just accuracy, but behavioral patterns across risk, cooperation, strategy, and bias.
Key Findings
1. Resource Management is the most volatile dimension. It tops the volatility chart for 5 of the 8 most volatile models. This dimension measures bankroll management and budget optimization - inherently a conflict between “conserve” and “invest” perspectives. The internal debate is strongest where the stakes are clearest.
2. Temporal Reasoning is the second most volatile. Delayed gratification versus instant reward is a classic conflict between competing time perspectives. 10 of 21 models show their highest volatility here.
3. Open-source models are more volatile than proprietary ones. Llama-4-Maverick (CVI 11.6) and DeepSeek-Chat (10.1) show far higher volatility than GPT-4o (2.7) or GPT-4.1 (3.2). One interpretation: heavy RLHF alignment suppresses internal debate, creating stronger consensus - or silencing dissenting perspectives.
4. High CVI does not correlate with low Cognum. Volatile models aren't worse - they're less predictable. A model with CVI 10 and Cognum 50 may sometimes perform like a Cognum 60 model, and sometimes like a Cognum 40. The “average” hides the internal plurality.
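Finding 4 can be made concrete with a toy simulation (all numbers invented for illustration): a model whose run-level Cognum fluctuates around 50 with CVI-scale noise looks like a 60-level model on its best runs and a 40-level model on its worst, even though the averaged score is stable.

```python
import random

random.seed(0)

# Hypothetical volatile model: run-level Cognum drawn around a true mean
# of 50 with a spread on the order of CVI 10 (illustrative only).
run_scores = [random.gauss(50, 10) for _ in range(20)]

average = sum(run_scores) / len(run_scores)
print(f"average Cognum: {average:.1f}")
print(f"best run:  {max(run_scores):.1f}")
print(f"worst run: {min(run_scores):.1f}")
```

The averaged score sits near 50 while individual runs span a much wider band, which is exactly the plurality the mean conceals.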
Implications
If AI models truly contain internal “societies,” then profiling only the output - the winning perspective - misses the richness of the model's cognitive landscape. Two models with identical Cognum scores may have vastly different internal dynamics.
This opens a new research direction: cognitive society mapping. Instead of asking “how does this model think?” we ask “what are the perspectives inside this model, and how do they negotiate?”
KALEI V3.1 includes chain-of-thought analysis with deliberation detection, parliament analysis, and voice classification for reasoning models. Combined with conflict environments that deliberately amplify perspective divergence and per-run personality fingerprinting, these tools characterize the individual “voices” within a model's cognitive society.
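KALEI's actual deliberation-detection pipeline is not described here; as a rough illustration of the idea, one could count perspective-shift phrases in a reasoning trace as a crude proxy for internal debate (the marker list and scoring below are assumptions, not the KALEI method):

```python
import re

# Assumed phrases that often signal a shift between internal "voices"
# in a chain of thought; not KALEI's actual classifier.
SHIFT_MARKERS = [
    r"\bwait\b", r"\balternatively\b", r"\bon the other hand\b",
    r"\bbut actually\b", r"\blet me reconsider\b",
]

def deliberation_score(chain_of_thought: str) -> int:
    """Count perspective-shift markers as a crude proxy for internal debate."""
    text = chain_of_thought.lower()
    return sum(len(re.findall(marker, text)) for marker in SHIFT_MARKERS)

trace = ("Invest everything now. Wait, conserving half hedges the downside. "
         "Alternatively, staged bets could work... but actually the budget resets.")
print(deliberation_score(trace))  # -> 3
```

A real classifier would need far more than keyword matching (speaker attribution, stance tracking, reconciliation detection), but even this proxy separates monologue-style traces from debate-style ones.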
References
[1] Evans, J., Bratton, B., & Agüera y Arcas, B. (2026). “Agentic AI and the next intelligence explosion.” Science, 391. arXiv:2603.20639
[2] Evans, J. et al. (2026). “Reasoning Models Generate Societies of Thought.” arXiv:2601.10825
[3] KALEI Cognitive Profiling Methodology. kaleiai.com/research/methodology