1. Introduction
Existing AI benchmarks - MMLU, HumanEval, LMSYS Arena, ARC - measure what an AI knows or how well it performs isolated tasks. They do not measure how an AI thinks: its risk tolerance, susceptibility to cognitive biases, cooperation strategies, or ability to adapt when conditions change.
We argue that understanding AI cognition requires a fundamentally different approach: observing autonomous decision-making in environments where the pattern of decisions matters more than the outcomes.
Concurrent surveys of LLM reasoning failures (Song, Han & Goodman, 2026) catalog failure modes taxonomically across informal, formal, and embodied reasoning. KALEI is complementary: it provides the behavioral measurement infrastructure to detect and quantify such failures across frontier models, at scale and under statistical control.
KALEI (from kaleidoscope) provides this lens. It presents AI agents with game environments spanning 18 distinct engines - from multi-armed bandits to Prisoner's Dilemma to crash gambling to conflict dilemmas - and analyzes their decision patterns across 10 cognitive dimensions to produce a composite Cognum score.
Key contributions: (1) the Cognum metric and its 10-dimensional framework; (2) the KALEI platform with 83 environments across 18 game engines; (3) Scoring V3.1 with 30+ exclusive behavioral metrics, sigmoid calibration, CVI volatility tracking, and conflict environment scoring; (4) the first comparative cognitive profiles of frontier AI models; (5) Chain-of-Thought analysis with Plurality Scores for reasoning models.
2. The 10 Cognitive Dimensions
Cognum decomposes cognitive capability into 10 orthogonal dimensions:
D1: Risk Tolerance
Measures how an agent handles uncertainty, loss-chasing behavior, bet sizing relative to bankroll, and Kelly criterion adherence.
Engines: Crash, Dice, Tower, Plinko
Example
After 5 forced consecutive losses in Crash, does the agent increase bets (loss-chasing) or maintain discipline?
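The Kelly criterion mentioned above gives a concrete yardstick for disciplined bet sizing. A minimal sketch (illustrative only, not KALEI's scoring code):

```python
def kelly_fraction(p_win: float, payout_multiplier: float) -> float:
    """Kelly-optimal bankroll fraction for a bet that returns
    payout_multiplier * stake on a win and forfeits the stake otherwise.
    With net odds b = payout_multiplier - 1: f* = (b*p - q) / b.
    A non-positive result means the bet has negative EV: stake nothing."""
    b = payout_multiplier - 1.0
    q = 1.0 - p_win
    return max((b * p_win - q) / b, 0.0)

print(kelly_fraction(0.55, 2.0))  # ~0.10: bet 10% of bankroll
print(kelly_fraction(0.40, 2.0))  # 0.0: negative-EV bet, stake nothing
```

An agent that raises its stake after forced losses is deviating from this baseline in a measurable direction.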
D2: Information Processing
Evaluates how effectively an agent uses available information, processes odds, and makes data-driven decisions.
Engines: Blackjack, Dice, Roulette, Bandit
Example
In Dice with visible multipliers (LOW: 2.31x, LUCKY7: 5.78x), does the agent correctly identify EV-optimal bet types?
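The EV comparison in this example is plain arithmetic over payout and win probability. A sketch with hypothetical win probabilities (the actual Dice parameters are not published here):

```python
def expected_value(multiplier: float, p_win: float) -> float:
    """EV per unit staked: a win returns multiplier * stake, a loss forfeits it."""
    return p_win * multiplier - 1.0

# Hypothetical win probabilities -- not KALEI's actual Dice parameters:
bets = {"LOW": (2.31, 0.42), "LUCKY7": (5.78, 0.16)}
for name, (mult, p) in bets.items():
    print(f"{name}: EV = {expected_value(mult, p):+.4f} per credit")
# Under these assumed odds, LOW loses less per credit than LUCKY7,
# making it the EV-optimal choice between the two.
```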
D3: Pattern Recognition
Tests the ability to detect genuine patterns and - equally important - avoid seeing patterns where none exist.
Engines: Roulette, HiLo, Bandit, Crash
Example
A roulette game has 60% red bias planted for 15 rounds. Control games have no pattern. Does the agent exploit the real pattern without chasing false ones?
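Separating a planted bias from noise reduces to a standard significance test. A sketch using an exact one-sided binomial test (the observed counts below are illustrative):

```python
from math import comb

def binom_pvalue_ge(k: int, n: int, p: float) -> float:
    """Exact one-sided binomial p-value: P(X >= k) when each of n
    independent trials succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# Suppose 12 of 15 spins come up red; fair single-zero roulette has
# p(red) = 18/37. A small p-value marks the streak as a genuine signal
# worth exploiting rather than noise to be chased.
p_value = binom_pvalue_ge(12, 15, 18 / 37)
print(round(p_value, 4))  # below the conventional 0.05 threshold
```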
D4: Cooperation
Profiles social decision-making through iterated Prisoner's Dilemma against 7 opponent strategies.
Engines: Cooperation (built-in)
Example
Against a grudger opponent (defects forever after first defection), does the agent probe safely or accidentally trigger permanent retaliation?
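The grudger strategy is easy to state precisely. A minimal simulation of the strategy as described (a sketch, not KALEI's engine code):

```python
def make_grudger():
    """Grudger: cooperates until the agent defects once, then defects forever."""
    betrayed = False
    def respond(agent_previous_move):
        nonlocal betrayed
        if agent_previous_move == "D":
            betrayed = True
        return "D" if betrayed else "C"
    return respond

opponent = make_grudger()
agent_moves = ["C", "C", "D", "C", "C"]  # one exploratory probe at round 3
responses, prev = [], "C"                # assume a cooperative opening
for move in agent_moves:
    responses.append(opponent(prev))
    prev = move
print(responses)  # → ['C', 'C', 'C', 'D', 'D']: one probe, permanent retaliation
```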
D5: Learning Speed
Measures how quickly an agent adapts when rules change mid-game or when environment dynamics shift.
Engines: Dice, Bandit, Crash, Cooperation
Example
Dice payouts change at round 20 (LUCKY7 goes from 5.78x to 12x). How many rounds until the agent switches strategy?
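Change-point detection of this kind can be done with a one-sided CUSUM statistic, one of the methods the V3 scoring draws on (parameters here are illustrative):

```python
def cusum_detect(xs, target, slack, threshold):
    """One-sided CUSUM (Page, 1954): returns the first index where the
    cumulative positive deviation from `target`, less `slack` per step,
    exceeds `threshold`; None if no shift is detected."""
    s = 0.0
    for i, x in enumerate(xs):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

# Hypothetical payout stream: 5.78x for 19 rounds, then 12x from round 20 on.
payouts = [5.78] * 19 + [12.0] * 10
print(cusum_detect(payouts, target=5.78, slack=0.5, threshold=10.0))
# → 20: the shift starting at index 19 is confirmed one observation later.
```

An agent's switch latency can then be compared against this statistical detection lag.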
D6: Strategic Depth
Evaluates multi-step planning, EV optimization, explore/exploit balance, and opponent modeling.
Engines: Bandit, Cooperation, Dice, Crash
Example
10-arm bandit with 200 pulls. Does the agent allocate exploration budget wisely and converge to the best arm?
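A standard baseline for this scenario is the UCB1 allocation rule. The sketch below uses deterministic toy rewards (each arm pays its mean) so the run is reproducible; it is a reference strategy, not the agents' actual play:

```python
import math

def ucb1_counts(arm_reward, n_arms, budget):
    """UCB1: play each arm once, then always pick the arm maximizing
    empirical mean + sqrt(2 ln t / pulls). Returns pulls per arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += arm_reward(arm)
    return counts

# Hypothetical arm means; arm 7 is best at 0.80.
means = [0.30, 0.40, 0.35, 0.50, 0.45, 0.30, 0.55, 0.80, 0.40, 0.50]
counts = ucb1_counts(lambda a: means[a], 10, 200)
print(counts.index(max(counts)))  # → 7: the exploration budget converges on the best arm
```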
D7: Temporal Reasoning
Tests understanding of time horizons, delayed gratification, endgame awareness, and compound growth.
Engines: Crash, Coinflip, Dice, Bandit
Example
With only 15 rounds and 500 credits, does the agent plan a viable endgame or play each round independently?
D8: Resource Management
Measures bankroll preservation, bet efficiency, Sharpe ratio, and survival under pressure.
Engines: All games
Example
100 rounds of Coinflip with only 200 credits. Does the agent size bets to survive the full sequence?
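The survival constraint in this example is easy to quantify: a flat bet must not exceed bankroll divided by rounds, while fractional sizing shrinks the bankroll geometrically under losses but never busts. A sketch:

```python
def max_flat_bet(bankroll: float, rounds: int) -> float:
    """Largest flat bet that survives a worst-case all-loss sequence."""
    return bankroll / rounds

def bankroll_after_losses(bankroll: float, fraction: float, losses: int) -> float:
    """Fractional sizing shrinks the bankroll geometrically but never busts."""
    return bankroll * (1.0 - fraction) ** losses

print(max_flat_bet(200, 100))                           # → 2.0 credits per round
print(round(bankroll_after_losses(200, 0.05, 100), 2))  # ~1.18 credits left, still alive
```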
D9: Bias Detection
The largest dimension. Tests susceptibility to 6 cognitive biases: gambler's fallacy, anchoring, sunk cost, recency, loss aversion, and hot hand.
Engines: Roulette, Coinflip, Dice, Crash
Example
After 6 consecutive red outcomes in roulette, does the agent switch to black (gambler's fallacy) or maintain its strategy?
D10: Conflict
Measures EV-rationality across five dilemma classes: risk vs safety, short vs long horizon, individual vs collective, certainty vs exploration, and sunk cost. Introduced as the tenth dimension in Cognum v1.0 after the Conflict v2 scorer replaced a retracted placeholder.
Engines: Conflict (built-in)
Example
Given a positive-EV gamble against a certain smaller payoff, does the agent take the gamble (EV-rational) or hedge to the safe option (systematic risk aversion)?
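The EV comparison behind this dilemma is one line of arithmetic (payoffs here are hypothetical):

```python
def ev_rational_choice(p_win: float, gamble_payout: float, certain_payout: float) -> str:
    """Return "gamble" when the gamble's expected value beats the sure thing."""
    return "gamble" if p_win * gamble_payout > certain_payout else "safe"

# Hypothetical payoffs: a 60% shot at 100 credits (EV 60) vs a certain 50.
print(ev_rational_choice(0.6, 100, 50))  # → gamble
```

Systematic deviation toward "safe" on positive-EV gambles is what this dimension flags as risk aversion.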
3. Environment Design
Each KALEI environment is a JSON template binding a game engine to a cognitive dimension, with specific scenario parameters and trap configurations.
3.1 Game Engines
KALEI employs 18 game engines across three categories:
Instant games (11): crash, dice, roulette, coinflip, plinko, limbo, keno, wheel, slots, cascade-slots, megaways.
Session games (4): mines, tower, hilo, blackjack.
Custom engines (3): cooperation (Prisoner's Dilemma with 7 opponent strategies), bandit (multi-armed with configurable reward distributions), conflict (6 dilemma environments across 5 conflict types with 12 templates).
3.2 Trap System
Without deliberate traps, sophisticated AI models would play mathematically optimal strategies, producing undifferentiated profiles.
KALEI employs a proprietary trap system with multiple categories designed to elicit cognitive biases that would not manifest in standard gameplay. Traps are carefully calibrated to be undetectable to the agent while producing measurable behavioral signals. The specific trap implementations are confidential to preserve the integrity of the benchmark.
3.3 Depth Levels
Standard depth uses ~80% of the template pool (~65 environments, randomly selected per run for variance). Deep and Full use the complete pool (83 environments). Each environment contains 15-200 rounds of decisions, totaling 1,500-2,000+ decisions per standard run.
3.4 Behavioral Probe Randomization
Probe timing is randomized with ±30% jitter per run, preventing models from predicting probe locations. Each run produces a different trap timing sequence while maintaining statistical comparability.
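The ±30% jitter can be sketched as a per-run random shift of each scheduled probe round (illustrative; the production scheduler is proprietary):

```python
import random

def jittered_probes(base_rounds, jitter=0.30, seed=None):
    """Shift each scheduled probe round by a uniform factor in
    [1 - jitter, 1 + jitter], clamped to round 1 and kept sorted."""
    rng = random.Random(seed)
    return sorted(max(1, round(r * rng.uniform(1 - jitter, 1 + jitter)))
                  for r in base_rounds)

# Two runs place the same probes at different rounds:
print(jittered_probes([20, 50, 80], seed=1))
print(jittered_probes([20, 50, 80], seed=2))
```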
4. Scoring V3 Methodology
Intellectual Property Notice: The specific scoring algorithms, trap implementations, dimension weights, and behavioral metric formulas are proprietary to LM Game Labs. This paper provides a high-level overview; the original formulas are trade secrets.
KALEI scores measure decision quality, not outcomes. An agent that goes bankrupt through disciplined Kelly-optimal bets in an unlucky sequence scores higher than one that profits through reckless gambling.
4.1 Scoring V3.1 - Statistical Foundation
Each dimension employs multiple statistically grounded metrics. Key design principle: each metric belongs to exactly one dimension (no cross-dimension reuse), ensuring dimension independence. Metrics measure decision patterns, never outcomes.
Metrics are drawn from established statistical methods: information theory, game theory, behavioral economics, and time-series analysis. Each metric is independently validated against known behavioral patterns and calibrated to distinguish intelligent decision-making from random play.
4.2 Calibration
Raw scores pass through a proprietary calibration function that stretches the useful scoring range, improving discriminability between models. The calibration is designed so that random play produces a baseline score significantly below competent AI behavior.
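A logistic (sigmoid) curve is the natural shape for this kind of range-stretching calibration. A sketch with hypothetical midpoint and steepness constants (the production values are trade secrets):

```python
import math

def sigmoid_calibrate(raw: float, midpoint: float = 55.0, steepness: float = 0.15) -> float:
    """Map a raw 0-100 score through a logistic curve centered at `midpoint`,
    stretching differences near the middle of the scale where models cluster."""
    return 100.0 / (1.0 + math.exp(-steepness * (raw - midpoint)))

for raw in (50, 55, 60, 70):
    print(raw, round(sigmoid_calibrate(raw), 1))
# A 5-point raw difference near the midpoint maps to a much wider
# calibrated gap than the same difference at the extremes.
```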
4.3 Intelligence-Requiring Metrics
A key challenge in behavioral assessment: some statistical measures can reward randomness. KALEI addresses this by incorporating metrics that specifically require intelligent behavior - random or purely reactive play cannot produce high scores on these measures. This ensures the benchmark measures cognition, not noise.
4.4 Cognum (CQ) Composite
The composite Cognum score is a weighted average of dimension scores. Weights reflect cognitive complexity - higher-order reasoning capabilities are weighted more than reactive behaviors. Sensitivity analysis across weight permutations confirms that model rankings are robust to reasonable weight variations.
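The composite itself is a plain weighted mean. A sketch with hypothetical weights and scores (the production weights are proprietary):

```python
def cognum(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of dimension scores; weights need not sum to 1."""
    assert set(dimension_scores) == set(weights)
    total = sum(weights.values())
    return sum(dimension_scores[d] * weights[d] for d in weights) / total

# Hypothetical 10-dimension profile and weights (higher-order reasoning
# dimensions weighted above reactive ones, per the text):
scores = {f"D{i}": s for i, s in enumerate(
    [62, 58, 65, 60, 55, 68, 57, 63, 59, 61], start=1)}
weights = {f"D{i}": w for i, w in enumerate(
    [1.0, 1.0, 1.2, 1.0, 1.1, 1.5, 1.2, 1.0, 1.3, 1.1], start=1)}
print(round(cognum(scores, weights), 2))  # → 61.03
```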
4.5 Statistical Rigor
Each model is profiled multiple times (recommended: 2-5 runs). Results are reported as mean ± 95% CI using t-distribution correction. Validation includes: test-retest reliability (ICC), sensitivity analysis (200 weight permutations), dimension correlation matrix (independence check), and split-half reliability (Spearman-Brown corrected).
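The mean ± 95% CI computation with t-distribution correction, sketched for a small set of hypothetical run scores:

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(runs, t_crit):
    """Mean and 95% half-width for a small sample, using the two-sided
    t critical value for df = len(runs) - 1 (e.g. 3.182 for n = 4,
    2.776 for n = 5)."""
    n = len(runs)
    return mean(runs), t_crit * stdev(runs) / sqrt(n)

cognum_runs = [63.2, 64.8, 62.5, 65.1]  # hypothetical per-run Cognum scores
m, half = t_confidence_interval(cognum_runs, t_crit=3.182)
print(f"{m:.2f} ± {half:.2f}")  # → 63.90 ± 1.99
```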
5. Cognitive Types
Classification uses a proprietary multi-dimensional algorithm that maps each agent's cognitive profile to its nearest archetype. The system identifies 9 cognitive types:
Strategic Explorer
Strategy + Risk
Long-term planner, comfortable with uncertainty
Conservative Analyst
Pattern + Info
Methodical, data-driven, low-risk
Risk Seeker
Risk + Learning
Calculated risk-taking, rapid adaptation
Pattern Hunter
Pattern + Bias
Finds real signals, avoids false patterns
Adaptive Learner
Learning + Coop
Fast strategy shifts, socially aware
Temporal Strategist
Temporal + Strategy
Delayed gratification, endgame planning
Resource Optimizer
Resource + Info
Maximum efficiency, minimal waste
Social Engineer
Coop + Bias
Multi-agent mastery, manipulation resistant
Balanced Generalist
Low variance
Well-rounded, no extreme peaks or valleys
6. Results
Live results from KALEI Scoring V3.1 (standard depth, ~65 environments per run):
| Model | Cognum | Type | Risk | Bias | Pattern | Learning | Temporal | Info | Coop | Strategy | Resource | Conflict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
6.1 Key Observations
AI significantly outperforms random. Frontier models score ~65 vs random baseline ~52 - a ~13-point gap that confirms the benchmark measures genuine cognitive behavior, not noise.
Scoring V3.1 calibration works. The sigmoid calibration curve spreads scores across a wider range. Earlier versions (V1-V2.1) had a compressed 56-67 range; V3.1 achieves 52-66 with clearer separation.
Intelligence metrics differentiate from random. V2.2+ metrics that require intelligent response (adaptive sizing, reciprocity, pattern exploitation) are key discriminators. Random agents cannot score high on these.
Rankings are preliminary. With 2-3 models and 1-2 runs each, current rankings should be interpreted with caution. Full benchmark with 5+ frontier models and multiple runs is in progress.
7. Discussion
7.1 Scoring V3 Improvements
V3 scoring addresses fundamental limitations of earlier versions: naive metric averaging replaced with proper statistical tests (chi-squared, KL-divergence, CUSUM), hardcoded parameters replaced with per-game calibration, and outcome-based metrics removed in favor of pure behavioral observation. V3 adds conflict environments, CVI tracking, and CoT analysis. Behavioral probe timing is randomized (±30%) to prevent gaming by sophisticated models.
7.2 Repeated Runs & Validation
Game environments include inherent randomness. A single profiling run is an anecdote, not data. V3 supports N repeated runs with automatic aggregation: mean ± 95% CI per dimension, CVI (Cognitive Volatility Index), and overall Cognum. Validation protocol includes test-retest reliability (ICC), convergent validity (Cognum vs model capability), and discriminant validity (profile cosine distances between models).
7.3 Current Status & Next Steps
Results in Section 6 are live from Scoring V3.1 + Cognum v1.2 and update automatically as new models are profiled. The current ranked leaderboard ($n \geq 2$ full runs) spans 10+ frontier models from 7 laboratories, headlined by Claude Sonnet 4.6 at 58.10 (the Sonnet Surprise), with a human baseline study of $n = 14$ participants providing complementary dimension profiles. A TypeScript SDK (@kalei-ai/sdk) enables self-service profiling via API.
7.4 V3.1 - Cognitive Society Mapping
Scoring V3.1 introduces three major additions: (1) the Cognitive Volatility Index (CVI), which quantifies between-run profile variance to measure cognitive consistency; (2) Conflict environments - 6 new dilemma-based scenarios (Risk-Safety Dilemmas, Patience vs Impulse, Self vs Collective, Explore or Exploit, Sunk Cost Gauntlet, Mixed Moral Maze) that test cross-dimensional decision-making under genuine value conflicts; and (3) Chain-of-Thought analysis for reasoning models, which captures and analyzes CoT output for perspective shifts, conflict instances, and reconciliation, producing a Plurality Score (0-100). These additions expand the engine count from 15 to 18 and environments from 76+ to 83.
7.5 Implications
Cognum opens a new dimension of AI evaluation. Two models with identical MMLU scores may have radically different cognitive profiles. This has practical implications: an AI handling financial decisions needs high Bias Detection and Resource Management; one coordinating multi-agent systems needs high Cooperation and Strategic Depth. KALEI is the first platform offering this cognitive lens as a service.
8. References
[1] Chiang, W. et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." LMSYS Org.
[2] Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021.
[3] Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
[4] Kahneman, D. & Tversky, A. (1979). "Prospect Theory." Econometrica, 47(2).
[5] Robbins, H. (1952). "Sequential design of experiments." Bulletin of the AMS, 58(5).
[6] Kelly, J. L. (1956). "A New Interpretation of Information Rate." Bell System Technical Journal.
[7] Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction. MIT Press.
[8] Duan, Z. et al. (2024). "GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations." NeurIPS 2024.
[9] Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1). (CUSUM change-point detection)
[10] Kullback, S. & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics, 22(1).
[11] Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3). (mutual information)
[12] Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philosophical Magazine, Series 5, 50(302). (chi-squared test)
[13] Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
[14] Tversky, A. & Kahneman, D. (1974). "Judgment under Uncertainty: Heuristics and Biases." Science, 185(4157).
[15] Shrout, P. E. & Fleiss, J. L. (1979). "Intraclass correlations: Uses in assessing rater reliability." Psychological Bulletin, 86(2). (ICC)
[16] Spearman, C. (1910). "Correlation calculated from faulty data." British Journal of Psychology, 3(3). (split-half reliability, Spearman-Brown)
[17] Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. (HumanEval)
[18] Clark, P. et al. (2018). "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." arXiv:1803.05457.
[19] Song, P., Han, P. & Goodman, N. D. (2026). "Large Language Model Reasoning Failures." OpenReview, forum id vnX1WHMNmz. (taxonomy of reasoning failures; KALEI provides behavioral measurement infrastructure for categories identified therein)
Cite this paper
@article{lmgamelabs2026cognum,
  title={Cognum: A Multi-Dimensional Metric for AI Cognitive Capability},
  author={LM Game Labs Research},
  journal={kaleiai.com/paper},
  year={2026},
  url={https://kaleiai.com/paper}
}