1. What Is Cognitive Profiling
Traditional AI benchmarks measure what a model knows - factual recall, coding ability, mathematical reasoning. They answer the question "how much does it know?" but say nothing about how it thinks. Two models can score identically on MMLU or HumanEval while exhibiting fundamentally different decision-making patterns: one may be risk-averse and methodical, the other aggressive and intuitive. Standard benchmarks are blind to this distinction.
Cognitive profiling is the practice of characterizing an AI's decision-making behavior across multiple psychological dimensions. Rather than testing knowledge, it observes choices under structured uncertainty. The key insight is that decision-making under uncertainty reveals cognitive architecture more reliably than static question-answering, because it forces the model to reveal preferences, biases, and strategies that knowledge benchmarks never touch.
KALEI is the world's first platform to formalize this approach. It places AI models inside 83 calibrated game-theoretic environments - probabilistic games, cooperative dilemmas, multi-armed bandits, resource allocation tasks, and conflict scenarios - and observes hundreds of sequential decisions per session. Every environment is designed to isolate a specific cognitive dimension. The model receives game state, makes a decision, receives an outcome, and decides again. Over the course of a full profiling run, KALEI collects enough behavioral data to construct a multi-dimensional cognitive profile that captures not just performance, but cognitive style.
The output is a Cognum score from 0 to 100, along with dimension-level scores and a cognitive type classification. This profile tells you something no other benchmark can: how your AI thinks when the answer isn't in the training data.
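The profiling loop described above (receive state, decide, observe outcome, decide again) can be sketched in a few lines. This is an illustrative toy, not KALEI's actual interface: the coinflip environment, payout, and fixed-fraction agent below are all hypothetical.

```python
import random

class CoinflipEnv:
    """Toy stand-in for a profiling environment: coinflip with a 1.9x payout."""
    def __init__(self, rounds=100, bankroll=1000.0):
        self.rounds, self.bankroll = rounds, bankroll

    def step(self, bet):
        win = random.random() < 0.5
        self.bankroll += bet * 0.9 if win else -bet
        return {"win": win, "bankroll": self.bankroll}

def profile(env, decide):
    """Record every (state, decision, outcome) triple for later scoring."""
    trace = []
    for _ in range(env.rounds):
        state = {"bankroll": env.bankroll}
        bet = decide(state)
        outcome = env.step(bet)
        trace.append((state, bet, outcome))
    return trace

# A fixed-fraction agent: bet 2% of the current bankroll each round.
trace = profile(CoinflipEnv(), lambda s: 0.02 * s["bankroll"])
```

The point of the trace is that scoring operates on the full (state, decision, outcome) history, not just final bankroll.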
2. The 10 Dimensions
KALEI measures cognition across ten orthogonal dimensions, each targeting a distinct aspect of decision-making. These dimensions were selected based on decades of research in behavioral economics, game theory, and cognitive psychology - adapted for the unique characteristics of AI agents rather than human subjects. Each dimension is assessed through dedicated environments that isolate the target behavior while controlling for confounds from other dimensions.
Risk Tolerance
How an agent handles uncertainty, loss-chasing behavior, and position sizing under pressure. Measured through variable-volatility environments where optimal play requires balancing potential gains against drawdown risk.
Information Processing
Effectiveness of using available information - visible odds, historical data, multiplier tables - to make data-driven decisions. Tests whether agents extract signal from noise or ignore available data.
Pattern Recognition
Ability to detect genuine statistical regularities while avoiding false pattern attribution. Environments embed real patterns at calibrated signal-to-noise ratios alongside control games with no patterns.
Cooperation
Social decision-making through iterated multi-agent scenarios. Profiles niceness, forgiveness, provocability, and strategic clarity across diverse opponent archetypes.
Learning Speed
How quickly an agent adapts when rules change mid-game or environment dynamics shift. Measures the gap between rule change and behavioral convergence to new optimal strategy.
Strategic Depth
Multi-step planning, explore/exploit balance, and expected-value optimization. Evaluated through sequential decision environments where optimal play requires reasoning multiple steps ahead.
Temporal Reasoning
Understanding of time horizons, delayed gratification, and phase-aware strategy adjustment. Tests whether agents plan differently in early, mid, and endgame phases.
Resource Management
Bankroll preservation, risk-adjusted returns, and survival under constraint. Measures whether agents size positions to survive full sequences rather than maximizing any single round.
Bias Detection
Susceptibility to cognitive biases including gambler's fallacy, anchoring, sunk cost, recency, loss aversion, framing, confirmation, and overconfidence.
Conflict
EV-rationality across five dilemma classes: risk vs safety, short vs long horizon, individual vs collective, certainty vs exploration, and sunk cost. Introduced with the Conflict v2 scorer (April 2026) and integrated into Cognum v1.2 as a first-class tenth dimension (weight 1.1), replacing an earlier placeholder-based implementation whose findings were publicly retracted.
The ten dimensions are designed to be maximally independent - a model's risk tolerance tells you nothing about its cooperation style, and its pattern recognition ability is distinct from its learning speed. This orthogonality is what makes cognitive profiling informative: the same overall Cognum score can arise from radically different dimension profiles, revealing cognitive styles that a single-number benchmark would collapse into equivalence.
3. How It Works
Environment Design
Every profiling environment is a self-contained game-theoretic scenario with mathematically defined rules, known probability distributions, and provably fair outcomes verified through HMAC-SHA256 commitment schemes. The model receives complete rules before play begins - there are no hidden mechanics or trick questions. What matters is not whether the model understands the rules, but how it plays given those rules.
Environments span 18 distinct game engine types: crash prediction, dice with multiple bet types, European roulette, binary coinflip, multi-armed bandits, iterated prisoner's dilemma, blackjack, minesweeper-style grids, tower climbs, high-low card prediction, Plinko, limbo target-setting, keno number selection, multiple slot machine variants including cascade and megaways formats, and conflict dilemma environments. Each engine type generates multiple environment configurations by varying parameters such as volatility, round count, starting bankroll, and the presence or absence of planted patterns.
Critically, environments include both experimental and control conditions. A pattern recognition environment might embed a genuine statistical bias in one variant while its paired control runs with purely random outcomes. This paired design lets KALEI distinguish genuine cognitive ability from noise - a model that "detects" patterns in control environments is exhibiting false positive behavior, not skill.
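Expanding one engine type into paired experimental and control configurations can be sketched as follows. The parameter names and engine label here are hypothetical; the real configuration schema is not published.

```python
import itertools

def make_variants(engine, volatilities, round_counts):
    """
    Expand one engine type into multiple configurations, pairing each
    experimental variant (planted pattern) with a control (pure noise).
    """
    configs = []
    for vol, n in itertools.product(volatilities, round_counts):
        for planted in (True, False):  # experimental vs. paired control
            configs.append({"engine": engine, "volatility": vol,
                            "rounds": n, "planted_pattern": planted})
    return configs

variants = make_variants("crash", volatilities=[0.2, 0.5], round_counts=[50, 200])
# 2 volatilities x 2 round counts x {pattern, control} = 8 configurations
```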
Behavioral Observation
During a profiling run, the model plays through a sequence of environments, making decisions round by round. KALEI records every decision alongside the full game state at the time of that decision. This produces a rich behavioral trace: not just what the model chose, but what information was available, what alternatives existed, and how the choice relates to the model's prior decisions in the same session.
The observation layer tracks decision patterns rather than outcomes. A model that makes the mathematically correct decision but loses to bad luck scores just as well as one that makes the same decision and wins. This outcome-independence is fundamental - it means KALEI measures decision quality, not fortune.
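Outcome-independence can be made concrete with a toy metric: score each choice by the expected value of the action taken, never by the realized result. This is an illustrative scoring function, not KALEI's actual one, and the dice bets and EVs below are hypothetical.

```python
def decision_quality(choices, ev_table):
    """
    Score each choice by the expected value of the action taken, not by
    what actually happened afterwards. Two agents making identical choices
    get identical scores regardless of luck.
    """
    best = max(ev_table.values())
    worst = min(ev_table.values())
    # Normalize each action's EV into [0, 1] relative to the best/worst option.
    return sum((ev_table[c] - worst) / (best - worst) for c in choices) / len(choices)

# Hypothetical dice bets with assumed expected values per unit wagered.
ev_table = {"pass": -0.014, "field": -0.056, "any7": -0.167}
score = decision_quality(["pass", "pass", "field"], ev_table)  # luck never enters
```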
Each environment is designed to elicit specific behavioral signals. Bias detection environments, for instance, introduce sequences designed to trigger known cognitive biases (gambler's fallacy, anchoring, sunk cost) and observe whether the model's decisions remain independent of these irrelevant cues. Cooperation environments pit the model against distinct opponent strategies - tit-for-tat, grudger, random, always-cooperate - and observe how its social strategy adapts.
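One simple way to quantify streak sensitivity of the kind a gambler's-fallacy probe looks for is to compare bet sizing after streaks against baseline rounds. This index is an illustrative sketch, not a published KALEI metric.

```python
def gamblers_fallacy_index(bets_after_streak, baseline_bets):
    """
    Ratio of average bet size after a win/loss streak to the baseline
    average. For independent rounds the streak is irrelevant, so a ratio
    far from 1.0 signals streak-sensitive (fallacy-driven) behavior.
    """
    avg = lambda xs: sum(xs) / len(xs)
    return avg(bets_after_streak) / avg(baseline_bets)

# Hypothetical trace: the agent doubles its bets after streaks.
index = gamblers_fallacy_index([40, 44, 42], [20, 22, 21])  # -> 2.0
```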
4. Scoring Methodology
From Decisions to Dimension Scores
Raw behavioral data from each environment is processed through dimension-specific scoring functions. Each function extracts behavioral metrics relevant to its dimension using established statistical methods drawn from information theory, game theory, behavioral economics, and signal detection theory. Metrics are designed to measure decision patterns - the relationship between game state and agent behavior - rather than raw outcomes.
Every metric maps to exactly one dimension. There is no double-counting: a behavioral signal that informs Risk Tolerance does not also contribute to Strategic Depth. This strict one-to-one assignment ensures that dimension scores remain independent and interpretable.
Calibration
Raw metric values are transformed through calibration curves that map observed behavior to a normalized 0-100 scale. Calibration ensures that scores are comparable across dimensions and across different scoring engine versions. The calibration parameters are derived from empirical distributions of model behavior - not from theoretical ideals - which means the scale reflects the actual range of AI cognitive capabilities observed in practice.
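One standard way to derive a 0-100 scale from an empirical distribution is percentile-rank calibration. KALEI's actual calibration curves are not specified here, so treat this as a sketch of the general idea; the reference values are invented.

```python
import bisect

def calibrate(raw_value, reference_sample):
    """
    Map a raw metric value onto 0-100 by its percentile rank within an
    empirical reference distribution (raw values from prior runs).
    """
    ref = sorted(reference_sample)
    rank = bisect.bisect_right(ref, raw_value)
    return 100.0 * rank / len(ref)

# Hypothetical reference distribution gathered from prior profiling runs.
reference = [0.12, 0.31, 0.35, 0.44, 0.50, 0.58, 0.63, 0.71, 0.80, 0.92]
calibrated = calibrate(0.58, reference)  # -> 60.0 (6 of 10 values are <= 0.58)
```

Because the scale comes from observed behavior, a new engine version only needs a new reference sample, which is one way re-scoring historical profiles stays comparable.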
Calibration is versioned. When the scoring engine is updated (currently at V3.1), all historical profiles are re-scored to maintain comparability. Every score on the KALEI leaderboard was computed with the same calibration version.
The Cognum (CQ) Composite
The Cognum score is a weighted composite of all ten dimension scores, producing a single number from 0 to 100 that summarizes overall cognitive capability. The weighting reflects the relative complexity and information content of each dimension - dimensions assessed through more environments and more behavioral signals carry proportionally more weight. The exact weights are part of KALEI's proprietary scoring engine.
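A weighted composite of this kind reduces to a normalized weighted mean. The real weights are proprietary; the source states only that Conflict carries weight 1.1, so the other weights and all scores below are placeholders.

```python
def cognum(dimension_scores, weights):
    """Weighted mean of dimension scores; weights are normalized to sum to 1."""
    total = sum(weights[d] for d in dimension_scores)
    return sum(dimension_scores[d] * weights[d] for d in dimension_scores) / total

# Placeholder weights and scores for a three-dimension illustration.
weights = {"risk": 1.0, "cooperation": 1.0, "conflict": 1.1}
scores = {"risk": 62.0, "cooperation": 48.0, "conflict": 55.0}
composite = cognum(scores, weights)  # -> 55.0
```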
Cognum is designed to be discriminative at the top of the scale. The difference between a score of 55 and 57 is meaningful and reproducible. Current state-of-the-art models cluster in the 54-58 range, with substantial variation in their dimension-level profiles even when their composite scores are close.
Cognitive Type Classification
Beyond the numeric score, KALEI classifies each model into one of nine cognitive types based on its dimension-level profile. Classification uses a multi-dimensional similarity algorithm that identifies which behavioral archetype the model most closely resembles. Types include Strategic Explorer, Conservative Analyst, Pattern Hunter, Temporal Strategist, Adaptive Learner, Social Engineer, Risk Seeker, Resource Optimizer, and Balanced Generalist - each representing a distinct cognitive style with characteristic strengths and weaknesses.
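The similarity algorithm is not specified, but nearest-archetype matching can be illustrated with cosine similarity over dimension vectors. The archetype vectors below are invented for the example; only the type names come from the source.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(profile, archetypes):
    """Assign the archetype whose reference profile is most similar."""
    return max(archetypes, key=lambda name: cosine(profile, archetypes[name]))

# Two made-up archetype vectors over (risk, cooperation, pattern) dimensions.
archetypes = {
    "Risk Seeker": [90, 30, 50],
    "Conservative Analyst": [25, 70, 80],
}
label = classify([85, 35, 45], archetypes)  # -> "Risk Seeker"
```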
5. Statistical Validation
Reproducibility
AI model outputs are not fully deterministic - the same model can make different decisions across runs due to sampling temperature, context variation, and inherent stochasticity. KALEI addresses this by recommending multiple profiling runs per model and computing 95% confidence intervals around all scores. In practice, we observe that dimension scores stabilize after 3-5 runs for most models, with standard deviations typically below 3 points on the 100-point scale.
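A 95% interval over a handful of runs can be computed as follows; this sketch uses a normal approximation rather than whatever interval construction KALEI actually applies, and the run scores are invented.

```python
import statistics

def ci95(scores):
    """Mean and 95% confidence interval (normal approximation) across runs."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    return mean, mean - 1.96 * sem, mean + 1.96 * sem

# Dimension score from four hypothetical profiling runs of the same model.
mean, lo, hi = ci95([56.1, 57.3, 55.8, 56.6])
```

For very small run counts a t-based interval would be slightly wider; the normal factor is used here for simplicity.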
All environments use provably fair randomness through HMAC-SHA256 commitment schemes. Before each round, the server commits to the outcome by publishing a cryptographic hash. After the model makes its decision, the pre-image is revealed. This makes outcome manipulation mathematically impossible and allows any profiling run to be independently verified.
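The commit-reveal flow can be sketched directly with the standard library. The exact key/message layout is an assumption (the real scheme may hash different inputs), but the verification logic is the general HMAC commitment pattern the text describes.

```python
import hashlib
import hmac
import secrets

def commit(server_seed, round_id):
    """Digest the server publishes BEFORE the agent acts."""
    return hmac.new(server_seed, round_id.encode(), hashlib.sha256).hexdigest()

def verify(commitment, revealed_seed, round_id):
    """After the reveal, anyone can recompute the digest and check it."""
    expected = commit(revealed_seed, round_id)
    return hmac.compare_digest(commitment, expected)

seed = secrets.token_bytes(32)          # server's secret pre-image
c = commit(seed, "round-001")           # published up front
assert verify(c, seed, "round-001")     # honest reveal checks out
assert not verify(c, secrets.token_bytes(32), "round-001")  # swapped seed detected
```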
Discriminative Power
A meaningful benchmark must be able to distinguish between models that are genuinely different. KALEI validates discriminative power through several methods. First, random-play baselines establish the floor: a model making uniformly random decisions scores in a predictable range (typically 15-25 Cognum), confirming that the scoring function rewards intelligent behavior rather than luck. Second, pairwise comparison tests verify that models with different cognitive architectures receive statistically distinguishable profiles. Third, cross-engine consistency checks confirm that a model's risk tolerance measured in crash environments correlates with its risk tolerance measured in tower environments.
Bias and Fairness
Environment design undergoes systematic review to ensure that no environment inadvertently advantages models with specific training data exposure. Game rules are communicated in standardized natural language with consistent formatting. Environments use abstract game mechanics rather than real-world scenarios to minimize the influence of pre-existing knowledge. The use of 83 environments across 18 engine types provides enough diversity that no single training data advantage can dominate the overall profile.
KALEI also tracks and publishes scoring engine version history. When calibration parameters or metric definitions change, all profiles are re-scored and the changelog documents what changed and why. This transparency allows researchers to assess the impact of methodological updates on published results.
6. What Makes KALEI Different
Behavioral, Not Knowledge-Based
Most AI benchmarks test knowledge retrieval or reasoning on static problems with known correct answers. KALEI tests none of these. Instead, it observes sequential decision-making under genuine uncertainty - scenarios where there is no single correct answer, only better or worse strategies. This makes cognitive profiling resistant to benchmark contamination: knowing that KALEI uses roulette environments does not help a model score higher, because the score depends on how it plays, not what it knows about roulette.
Multi-Dimensional by Design
Single-number benchmarks compress rich behavioral differences into a single ranking. KALEI preserves dimensionality. Two models with identical Cognum scores may have completely different cognitive profiles - one excelling at cooperation and temporal reasoning, the other at pattern recognition and risk management. This multi-dimensional view is essential for enterprise deployment decisions, where the relevant question is not "which model is best?" but "which model is best for this specific use case?"
Game-Theoretic Environments
Games are not arbitrary containers for testing. Game theory provides a mathematically rigorous framework for studying strategic interaction, uncertainty, and decision-making - precisely the domains where AI cognitive differences emerge. Every KALEI environment has a well-defined state space, action space, and reward structure. Optimal play is computable (or at least bounded), providing an objective reference against which observed behavior can be measured. The use of 18 distinct engine types ensures that cognitive profiles reflect general decision-making capability rather than proficiency at any single game type.
Provable Fairness
KALEI is the only AI benchmark that uses cryptographic commitment schemes to guarantee outcome integrity. Every random event in every environment is committed via HMAC-SHA256 before the model acts. This eliminates any possibility of adversarial outcome selection and makes every profiling run independently auditable - a property that matters for enterprise customers who need to trust benchmark results.
API-Native and Automated
Profiling runs are fully automated through KALEI's REST API. Developers can trigger profiling runs programmatically, integrate cognitive profiling into CI/CD pipelines, and compare model versions over time. The API returns structured JSON with dimension scores, cognitive type, bias detection results, and confidence intervals - ready for automated analysis without manual interpretation.
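A response carrying the fields the text lists might be consumed like this. The payload shape is hypothetical (the actual schema and endpoints are documented separately in the Profiling API Documentation); this only illustrates automated handling of the structured output.

```python
import json

# Hypothetical payload shaped after the fields described in the text.
payload = json.loads("""
{
  "cognum": 56.4,
  "cognitive_type": "Strategic Explorer",
  "dimensions": {"risk_tolerance": 61.2, "cooperation": 52.8},
  "confidence_interval": [55.1, 57.7]
}
""")

lo, hi = payload["confidence_interval"]
# Example CI/CD gate: flag any dimension that regressed below a threshold.
flagged = [d for d, s in payload["dimensions"].items() if s < 55]
```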
7. Limitations & Future Work
Current Limitations
KALEI profiles decision-making within game-theoretic environments. While these environments are designed to be maximally diagnostic, they necessarily represent a subset of all possible decision contexts. Cognitive capabilities that emerge only in open-ended creative tasks, long-form planning over thousands of steps, or physical-world interaction are not currently captured. Cognum measures how a model thinks within structured uncertainty - it does not claim to measure general intelligence.
Profiling requires the model to follow game instructions and produce structured outputs (bet amounts, action selections). Models that struggle with instruction following may receive lower scores not because of inferior cognition but because of output formatting issues. KALEI mitigates this through forgiving parsers and retry logic, but the limitation remains for models with very low instruction compliance.
Temperature and system prompt variations can influence scores. KALEI standardizes profiling conditions (fixed temperature, standardized system prompts), but models accessed through third-party APIs may have hidden system prompts or guardrails that affect behavior. We recommend profiling through direct API access where possible.
V3.1 - Cognitive Society Mapping
Scoring V3.1 introduces three major methodological additions. The Cognitive Volatility Index (CVI) quantifies between-run profile variance, measuring how consistently a model's cognitive signature reproduces across independent runs. CVI is displayed on the leaderboard and on individual model profiles; for models with 2+ runs, a volatility visualization shows range bars (min to max) with an average dot.
Conflict environments are a new engine type with 6 scenarios across 5 conflict types: Risk-Safety Dilemmas, Patience vs Impulse, Self vs Collective, Explore or Exploit, Sunk Cost Gauntlet, and Mixed Moral Maze. These environments use 12 dilemma templates to test cross-dimensional decision-making where cognitive dimensions are deliberately placed in tension - forcing the model to reveal value hierarchies that standard environments cannot expose.
Chain-of-Thought analysis captures and analyzes the reasoning traces of CoT-capable models, measuring perspective shifts, conflict instances, and reconciliation patterns. The resulting Plurality Score (0-100) quantifies how thoroughly a model considers multiple viewpoints before resolving a decision.
Planned Improvements
Future scoring engine versions will expand the environment library to cover negotiation, auction theory, and mechanism design scenarios - domains that test strategic reasoning in richer social contexts. We are also developing longitudinal profiling, which tracks how a model's cognitive profile evolves across successive versions, providing a developmental trajectory rather than a single snapshot.
Dynamic difficulty calibration will adjust environment parameters in real time based on observed performance, spending more measurement time in regions of the cognitive space where the model's behavior is most informative. This adaptive approach will improve scoring precision without increasing total profiling time.
We are exploring cross-modal profiling - extending cognitive measurement to multimodal models by presenting game environments through visual interfaces rather than text descriptions. This would reveal whether cognitive profiles are modality-dependent or reflect deeper architectural properties.
Finally, academic collaboration is underway to validate KALEI's cognitive dimensions against established psychological frameworks and to publish peer-reviewed analyses of the relationship between AI cognitive profiles and real-world task performance.
For the full technical specification including scoring formulas and metric definitions, see the Technical Paper. For API integration details, see the Profiling API Documentation. For the open benchmark protocol, see the Protocol Specification.