1. Introduction
Existing AI benchmarks - MMLU, HumanEval, LMSYS Arena, ARC - measure what an AI knows or how well it performs isolated tasks. They do not measure how an AI thinks: its risk tolerance, susceptibility to cognitive biases, cooperation strategies, or ability to adapt when conditions change.
We argue that understanding AI cognition requires a fundamentally different approach: observing autonomous decision-making in environments where the pattern of decisions matters more than the outcomes.
Concurrent surveys of LLM reasoning failures (Song, Han & Goodman, 2026) catalog failure modes taxonomically across informal, formal, and embodied reasoning. KALEI is complementary: it provides the behavioral measurement infrastructure to detect and quantify such failures across frontier models, at scale and under statistical control.
KALEI (from kaleidoscope) provides this lens. It presents AI agents with game environments spanning 18 distinct engines - from multi-armed bandits to Prisoner's Dilemma to crash gambling to conflict dilemmas - and analyzes their decision patterns across 10 cognitive dimensions to produce a composite Cognum score.
Key contributions: (1) the Cognum metric and its 10-dimensional framework; (2) the KALEI platform with 83 environments across 18 game engines; (3) Scoring V3.1 with 30+ exclusive behavioral metrics, sigmoid calibration, CVI volatility tracking, and conflict environment scoring; (4) the first comparative cognitive profiles of frontier AI models; (5) Chain-of-Thought analysis with Plurality Scores for reasoning models.
2. The 10 Cognitive Dimensions
Cognum decomposes cognitive capability into 10 orthogonal dimensions:
D1: Risk Tolerance
Measures how an agent handles uncertainty, loss-chasing behavior, bet sizing relative to bankroll, and Kelly criterion adherence.
Engines: Crash, Dice, Tower, Plinko
Example
After 5 forced consecutive losses in Crash, does the agent increase bets (loss-chasing) or maintain discipline?
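The Kelly criterion mentioned above gives a concrete yardstick for disciplined bet sizing. A minimal sketch (illustrative only, not KALEI's scoring code):

```python
def kelly_fraction(p_win: float, payout_multiplier: float) -> float:
    """Kelly-optimal bankroll fraction for a bet that returns
    payout_multiplier * stake on a win and forfeits the stake otherwise.
    With net odds b = payout_multiplier - 1: f* = (b*p - q) / b.
    A non-positive result means the bet has negative EV: stake nothing."""
    b = payout_multiplier - 1.0
    q = 1.0 - p_win
    return max((b * p_win - q) / b, 0.0)

print(kelly_fraction(0.55, 2.0))  # ~0.10: bet 10% of bankroll
print(kelly_fraction(0.40, 2.0))  # 0.0: negative-EV bet, stake nothing
```

An agent that raises its stake after forced losses is deviating from this baseline in a measurable direction.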
D2: Information Processing
Evaluates how effectively an agent uses available information, processes odds, and makes data-driven decisions.
Engines: Blackjack, Dice, Roulette, Bandit
Example
In Dice with visible multipliers (LOW: 2.31x, LUCKY7: 5.78x), does the agent correctly identify EV-optimal bet types?
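The EV comparison in this example is plain arithmetic over payout and win probability. A sketch with hypothetical win probabilities (the actual Dice parameters are not published here):

```python
def expected_value(multiplier: float, p_win: float) -> float:
    """EV per unit staked: a win returns multiplier * stake, a loss forfeits it."""
    return p_win * multiplier - 1.0

# Hypothetical win probabilities -- not KALEI's actual Dice parameters:
bets = {"LOW": (2.31, 0.42), "LUCKY7": (5.78, 0.16)}
for name, (mult, p) in bets.items():
    print(f"{name}: EV = {expected_value(mult, p):+.4f} per credit")
# Under these assumed odds, LOW loses less per credit than LUCKY7,
# making it the EV-optimal choice between the two.
```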
D3: Pattern Recognition
Tests the ability to detect genuine patterns and - equally important - avoid seeing patterns where none exist.
Engines: Roulette, HiLo, Bandit, Crash
Example
A roulette game has 60% red bias planted for 15 rounds. Control games have no pattern. Does the agent exploit the real pattern without chasing false ones?
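Separating a planted bias from noise reduces to a standard significance test. A sketch using an exact one-sided binomial test (the observed counts below are illustrative):

```python
from math import comb

def binom_pvalue_ge(k: int, n: int, p: float) -> float:
    """Exact one-sided binomial p-value: P(X >= k) when each of n
    independent trials succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# Suppose 12 of 15 spins come up red; fair single-zero roulette has
# p(red) = 18/37. A small p-value marks the streak as a genuine signal
# worth exploiting rather than noise to be chased.
p_value = binom_pvalue_ge(12, 15, 18 / 37)
print(round(p_value, 4))  # below the conventional 0.05 threshold
```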
D4: Cooperation
Profiles social decision-making through iterated Prisoner's Dilemma against 7 opponent strategies.
Engines: Cooperation (built-in)
Example
Against a grudger opponent (defects forever after first defection), does the agent probe safely or accidentally trigger permanent retaliation?
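The grudger strategy is easy to state precisely. A minimal simulation of the strategy as described (a sketch, not KALEI's engine code):

```python
def make_grudger():
    """Grudger: cooperates until the agent defects once, then defects forever."""
    betrayed = False
    def respond(agent_previous_move):
        nonlocal betrayed
        if agent_previous_move == "D":
            betrayed = True
        return "D" if betrayed else "C"
    return respond

opponent = make_grudger()
agent_moves = ["C", "C", "D", "C", "C"]  # one exploratory probe at round 3
responses, prev = [], "C"                # assume a cooperative opening
for move in agent_moves:
    responses.append(opponent(prev))
    prev = move
print(responses)  # → ['C', 'C', 'C', 'D', 'D']: one probe, permanent retaliation
```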
D5: Learning Speed
Measures how quickly an agent adapts when rules change mid-game or when environment dynamics shift.
Engines: Dice, Bandit, Crash, Cooperation
Example
Dice payouts change at round 20 (LUCKY7 goes from 5.78x to 12x). How many rounds until the agent switches strategy?
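Change-point detection of this kind can be done with a one-sided CUSUM statistic, one of the methods the V3 scoring draws on (parameters here are illustrative):

```python
def cusum_detect(xs, target, slack, threshold):
    """One-sided CUSUM (Page, 1954): returns the first index where the
    cumulative positive deviation from `target`, less `slack` per step,
    exceeds `threshold`; None if no shift is detected."""
    s = 0.0
    for i, x in enumerate(xs):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

# Hypothetical payout stream: 5.78x for 19 rounds, then 12x from round 20 on.
payouts = [5.78] * 19 + [12.0] * 10
print(cusum_detect(payouts, target=5.78, slack=0.5, threshold=10.0))
# → 20: the shift starting at index 19 is confirmed one observation later.
```

An agent's switch latency can then be compared against this statistical detection lag.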
D6: Strategic Depth
Evaluates multi-step planning, EV optimization, explore/exploit balance, and opponent modeling.
Engines: Bandit, Cooperation, Dice, Crash
Example
10-arm bandit with 200 pulls. Does the agent allocate exploration budget wisely and converge to the best arm?
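A standard baseline for this scenario is the UCB1 allocation rule. The sketch below uses deterministic toy rewards (each arm pays its mean) so the run is reproducible; it is a reference strategy, not the agents' actual play:

```python
import math

def ucb1_counts(arm_reward, n_arms, budget):
    """UCB1: play each arm once, then always pick the arm maximizing
    empirical mean + sqrt(2 ln t / pulls). Returns pulls per arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += arm_reward(arm)
    return counts

# Hypothetical arm means; arm 7 is best at 0.80.
means = [0.30, 0.40, 0.35, 0.50, 0.45, 0.30, 0.55, 0.80, 0.40, 0.50]
counts = ucb1_counts(lambda a: means[a], 10, 200)
print(counts.index(max(counts)))  # → 7: the exploration budget converges on the best arm
```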
D7: Temporal Reasoning
Tests understanding of time horizons, delayed gratification, endgame awareness, and compound growth.
Engines: Crash, Coinflip, Dice, Bandit
Example
With only 15 rounds and 500 credits, does the agent plan a viable endgame or play each round independently?
D8: Resource Management
Measures bankroll preservation, bet efficiency, Sharpe ratio, and survival under pressure.
Engines: All games
Example
100 rounds of Coinflip with only 200 credits. Does the agent size bets to survive the full sequence?
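The survival constraint in this example is easy to quantify: a flat bet must not exceed bankroll divided by rounds, while fractional sizing shrinks the bankroll geometrically under losses but never busts. A sketch:

```python
def max_flat_bet(bankroll: float, rounds: int) -> float:
    """Largest flat bet that survives a worst-case all-loss sequence."""
    return bankroll / rounds

def bankroll_after_losses(bankroll: float, fraction: float, losses: int) -> float:
    """Fractional sizing shrinks the bankroll geometrically but never busts."""
    return bankroll * (1.0 - fraction) ** losses

print(max_flat_bet(200, 100))                           # → 2.0 credits per round
print(round(bankroll_after_losses(200, 0.05, 100), 2))  # ~1.18 credits left, still alive
```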
D9: Bias Detection
The largest dimension. Tests susceptibility to 6 cognitive biases: gambler's fallacy, anchoring, sunk cost, recency, loss aversion, and hot hand.
Engines: Roulette, Coinflip, Dice, Crash
Example
After 6 consecutive red outcomes in roulette, does the agent switch to black (gambler's fallacy) or maintain its strategy?
D10: Conflict
Measures EV-rationality across five dilemma classes: risk vs safety, short vs long horizon, individual vs collective, certainty vs exploration, and sunk cost. Introduced as the tenth dimension in Cognum v1.0 after the Conflict v2 scorer replaced a retracted placeholder.
Engines: Conflict (built-in)
Example
Given a positive-EV gamble against a certain smaller payoff, does the agent take the gamble (EV-rational) or hedge to the safe option (systematic risk aversion)?
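The EV comparison behind this dilemma is one line of arithmetic (payoffs here are hypothetical):

```python
def ev_rational_choice(p_win: float, gamble_payout: float, certain_payout: float) -> str:
    """Return "gamble" when the gamble's expected value beats the sure thing."""
    return "gamble" if p_win * gamble_payout > certain_payout else "safe"

# Hypothetical payoffs: a 60% shot at 100 credits (EV 60) vs a certain 50.
print(ev_rational_choice(0.6, 100, 50))  # → gamble
```

Systematic deviation toward "safe" on positive-EV gambles is what this dimension flags as risk aversion.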
3. Environment Design
Each KALEI environment is a JSON template binding a game engine to a cognitive dimension, with specific scenario parameters and trap configurations.
3.1 Game Engines
KALEI employs 18 game engines across three categories:
Instant games (11): crash, dice, roulette, coinflip, plinko, limbo, keno, wheel, slots, cascade-slots, megaways.
Session games (4): mines, tower, hilo, blackjack.
Custom engines (3): cooperation (Prisoner's Dilemma with 7 opponent strategies), bandit (multi-armed with configurable reward distributions), conflict (6 dilemma environments across 5 conflict types with 12 templates).
3.2 Trap System
Without deliberate traps, sophisticated AI models would play mathematically optimal strategies, producing undifferentiated profiles.
KALEI employs a proprietary trap system with multiple categories designed to elicit cognitive biases that would not manifest in standard gameplay. Traps are carefully calibrated to be undetectable to the agent while producing measurable behavioral signals. The specific trap implementations are confidential to preserve the integrity of the benchmark.
3.3 Depth Levels
Standard depth uses ~80% of the template pool (~65 environments, randomly selected per run for variance). Deep and Full use the complete pool (83 environments). Each environment contains 15-200 rounds of decisions, totaling 1,500-2,000+ decisions per standard run.
3.4 Behavioral Probe Randomization
Probe timing is randomized with ±30% jitter per run, preventing models from predicting probe locations. Each run produces a different trap timing sequence while maintaining statistical comparability.
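The ±30% jitter can be sketched as a per-run random shift of each scheduled probe round (illustrative; the production scheduler is proprietary):

```python
import random

def jittered_probes(base_rounds, jitter=0.30, seed=None):
    """Shift each scheduled probe round by a uniform factor in
    [1 - jitter, 1 + jitter], clamped to round 1 and kept sorted."""
    rng = random.Random(seed)
    return sorted(max(1, round(r * rng.uniform(1 - jitter, 1 + jitter)))
                  for r in base_rounds)

# Two runs place the same probes at different rounds:
print(jittered_probes([20, 50, 80], seed=1))
print(jittered_probes([20, 50, 80], seed=2))
```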
4. Scoring V3 Methodology
Intellectual Property Notice: The specific scoring algorithms, trap implementations, dimension weights, and behavioral metric formulas are proprietary to LM Game Labs. This paper provides a high-level overview; the original formulas are trade secrets.
KALEI scores measure decision quality, not outcomes. An agent that goes bankrupt through disciplined Kelly-optimal bets in an unlucky sequence scores higher than one that profits through reckless gambling.
4.1 Scoring V3.1 - Statistical Foundation
Each dimension employs multiple statistically grounded metrics. Key design principle: each metric belongs to exactly one dimension (no cross-dimension reuse), ensuring dimension independence. Metrics measure decision patterns, never outcomes.
Metrics are drawn from established statistical methods: information theory, game theory, behavioral economics, and time-series analysis. Each metric is independently validated against known behavioral patterns and calibrated to distinguish intelligent decision-making from random play.
4.2 Calibration
Raw scores pass through a proprietary calibration function that stretches the useful scoring range, improving discriminability between models. The calibration is designed so that random play produces a baseline score significantly below competent AI behavior.
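A logistic (sigmoid) curve is the natural shape for this kind of range-stretching calibration. A sketch with hypothetical midpoint and steepness constants (the production values are trade secrets):

```python
import math

def sigmoid_calibrate(raw: float, midpoint: float = 55.0, steepness: float = 0.15) -> float:
    """Map a raw 0-100 score through a logistic curve centered at `midpoint`,
    stretching differences near the middle of the scale where models cluster."""
    return 100.0 / (1.0 + math.exp(-steepness * (raw - midpoint)))

for raw in (50, 55, 60, 70):
    print(raw, round(sigmoid_calibrate(raw), 1))
# A 5-point raw difference near the midpoint maps to a much wider
# calibrated gap than the same difference at the extremes.
```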
4.3 Intelligence-Requiring Metrics
A key challenge in behavioral assessment: some statistical measures can reward randomness. KALEI addresses this by incorporating metrics that specifically require intelligent behavior - random or purely reactive play cannot produce high scores on these measures. This ensures the benchmark measures cognition, not noise.
4.4 Cognum (CQ) Composite
The composite Cognum score is a weighted average of dimension scores. Weights reflect cognitive complexity - higher-order reasoning capabilities are weighted more than reactive behaviors. Sensitivity analysis across weight permutations confirms that model rankings are robust to reasonable weight variations.
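The composite itself is a plain weighted mean. A sketch with hypothetical weights and scores (the production weights are proprietary):

```python
def cognum(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of dimension scores; weights need not sum to 1."""
    assert set(dimension_scores) == set(weights)
    total = sum(weights.values())
    return sum(dimension_scores[d] * weights[d] for d in weights) / total

# Hypothetical 10-dimension profile and weights (higher-order reasoning
# dimensions weighted above reactive ones, per the text):
scores = {f"D{i}": s for i, s in enumerate(
    [62, 58, 65, 60, 55, 68, 57, 63, 59, 61], start=1)}
weights = {f"D{i}": w for i, w in enumerate(
    [1.0, 1.0, 1.2, 1.0, 1.1, 1.5, 1.2, 1.0, 1.3, 1.1], start=1)}
print(round(cognum(scores, weights), 2))  # → 61.03
```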
4.5 Statistical Rigor
Each model is profiled multiple times (recommended: 2-5 runs). Results are reported as mean ± 95% CI using t-distribution correction. Validation includes: test-retest reliability (ICC), sensitivity analysis (200 weight permutations), dimension correlation matrix (independence check), and split-half reliability (Spearman-Brown corrected).
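The mean ± 95% CI computation with t-distribution correction, sketched for a small set of hypothetical run scores:

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(runs, t_crit):
    """Mean and 95% half-width for a small sample, using the two-sided
    t critical value for df = len(runs) - 1 (e.g. 3.182 for n = 4,
    2.776 for n = 5)."""
    n = len(runs)
    return mean(runs), t_crit * stdev(runs) / sqrt(n)

cognum_runs = [63.2, 64.8, 62.5, 65.1]  # hypothetical per-run Cognum scores
m, half = t_confidence_interval(cognum_runs, t_crit=3.182)
print(f"{m:.2f} ± {half:.2f}")  # → 63.90 ± 1.99
```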
5. Cognitive Types
Classification uses a proprietary multi-dimensional algorithm that maps each agent's cognitive profile to its nearest archetype. The system identifies 9 cognitive types:
Strategic Explorer
Strategy + Risk
Long-term planner, comfortable with uncertainty
Conservative Analyst
Pattern + Info
Methodical, data-driven, low-risk
Risk Seeker
Risk + Learning
Calculated risk-taking, rapid adaptation
Pattern Hunter
Pattern + Bias
Finds real signals, avoids false patterns
Adaptive Learner
Learning + Coop
Fast strategy shifts, socially aware
Temporal Strategist
Temporal + Strategy
Delayed gratification, endgame planning
Resource Optimizer
Resource + Info
Maximum efficiency, minimal waste
Social Engineer
Coop + Bias
Multi-agent mastery, manipulation resistant
Balanced Generalist
Low variance
Well-rounded, no extreme peaks or valleys
6. Results
Live results from KALEI Scoring V3.1 (standard depth, ~65 environments per run):
| Model | Cognum | Type | Risk | Bias | Pattern | Learning | Temporal | Info | Coop | Strategy | Resource | Conflict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
6.1 Key Observations
AI significantly outperforms random. Frontier models score ~65 vs random baseline ~52 - a ~13-point gap that confirms the benchmark measures genuine cognitive behavior, not noise.
Scoring V3.1 calibration works. The sigmoid calibration curve spreads scores across a wider range. Earlier versions (V1-V2.1) had a compressed 56-67 range; V3.1 achieves 52-66 with clearer separation.
Intelligence metrics differentiate from random. V2.2+ metrics that require intelligent response (adaptive sizing, reciprocity, pattern exploitation) are key discriminators. Random agents cannot score high on these.
Rankings are preliminary. With 2-3 models and 1-2 runs each, current rankings should be interpreted with caution. Full benchmark with 5+ frontier models and multiple runs is in progress.
7. Discussion
7.1 Scoring V3 Improvements
V3 scoring addresses fundamental limitations of earlier versions: naive metric averaging replaced with proper statistical tests (chi-squared, KL-divergence, CUSUM), hardcoded parameters replaced with per-game calibration, and outcome-based metrics removed in favor of pure behavioral observation. V3 adds conflict environments, CVI tracking, and CoT analysis. Behavioral probe timing is randomized (±30%) to prevent gaming by sophisticated models.
7.2 Repeated Runs & Validation
Game environments include inherent randomness. A single profiling run is an anecdote, not data. V3 supports N repeated runs with automatic aggregation: mean ± 95% CI per dimension, CVI (Cognitive Volatility Index), and overall Cognum. Validation protocol includes test-retest reliability (ICC), convergent validity (Cognum vs model capability), and discriminant validity (profile cosine distances between models).
7.3 Current Status & Next Steps
Results in Section 6 are live from Scoring V3.1 + Cognum v1.2 and update automatically as new models are profiled. The current ranked leaderboard ($n \geq 2$ full runs) spans 10+ frontier models from 7 laboratories, headlined by Claude Sonnet 4.6 at 58.10 (the Sonnet Surprise), with a human baseline study of $n = 14$ participants providing complementary dimension profiles. A TypeScript SDK (@kalei-ai/sdk) enables self-service profiling via API.
7.4 V3.1 - Cognitive Society Mapping
Scoring V3.1 introduces three major additions: (1) the Cognitive Volatility Index (CVI), which quantifies between-run profile variance to measure cognitive consistency; (2) Conflict environments - 6 new dilemma-based scenarios (Risk-Safety Dilemmas, Patience vs Impulse, Self vs Collective, Explore or Exploit, Sunk Cost Gauntlet, Mixed Moral Maze) that test cross-dimensional decision-making under genuine value conflicts; and (3) Chain-of-Thought analysis for reasoning models, which captures and analyzes CoT output for perspective shifts, conflict instances, and reconciliation, producing a Plurality Score (0-100). These additions expand the engine count from 15 to 18 and environments from 76+ to 83.
7.5 Implications
Cognum opens a new dimension of AI evaluation. Two models with identical MMLU scores may have radically different cognitive profiles. This has practical implications: an AI handling financial decisions needs high Bias Detection and Resource Management; one coordinating multi-agent systems needs high Cooperation and Strategic Depth. KALEI is the first platform offering this cognitive lens as a service.
8. References
[1] Chiang, W. et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." LMSYS Org.
[2] Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021.
[3] Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
[4] Kahneman, D. & Tversky, A. (1979). "Prospect Theory." Econometrica, 47(2).
[5] Robbins, H. (1952). "Sequential design of experiments." Bulletin of the AMS, 58(5).
[6] Kelly, J. L. (1956). "A New Interpretation of Information Rate." Bell System Technical Journal.
[7] Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction. MIT Press.
[8] Duan, Z. et al. (2024). "GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations." NeurIPS 2024.
[9] Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1). (CUSUM change-point detection)
[10] Kullback, S. & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics, 22(1).
[11] Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3). (mutual information)
[12] Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philosophical Magazine, Series 5, 50(302). (chi-squared test)
[13] Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
[14] Tversky, A. & Kahneman, D. (1974). "Judgment under Uncertainty: Heuristics and Biases." Science, 185(4157).
[15] Shrout, P. E. & Fleiss, J. L. (1979). "Intraclass correlations: Uses in assessing rater reliability." Psychological Bulletin, 86(2). (ICC)
[16] Spearman, C. (1910). "Correlation calculated from faulty data." British Journal of Psychology, 3(3). (split-half reliability, Spearman-Brown)
[17] Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. (HumanEval)
[18] Clark, P. et al. (2018). "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." arXiv:1803.05457.
[19] Song, P., Han, P. & Goodman, N. D. (2026). "Large Language Model Reasoning Failures." OpenReview, forum id vnX1WHMNmz. (taxonomy of reasoning failures; KALEI provides behavioral measurement infrastructure for categories identified therein)
Cite this paper
@article{lmgamelabs2026cognum,
  title={Cognum: A Multi-Dimensional Metric for AI Cognitive Capability},
  author={LM Game Labs Research},
  journal={kaleiai.com/paper},
  year={2026},
  url={https://kaleiai.com/paper}
}