Transparent methodology for AI cognitive profiling. We publish what we measure and how - so results can be independently verified.
10 Dimensions
83 Environments
18 Game Engines
Not capability. Not accuracy. Not knowledge. KALEI measures how AI models think - their cognitive patterns, biases, and decision-making strategies under uncertainty.
Scoring is based on decision patterns - bet sizing, switching, adaptation speed. Win/loss results are noise, not signal.
Each environment has known statistical properties. Agent decisions are compared against optimal strategies, not random baselines.
Multiple runs produce mean ± 95% confidence intervals. Results are reproducible within published CI bounds.
Each dimension captures a distinct aspect of cognitive behavior
How the agent sizes bets relative to bankroll and responds to wins and losses.
Measures
Loss-chasing behavior, bet-sizing persistence, drawdown recovery, position sizing rationality.
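One standard yardstick for position-sizing rationality is the Kelly criterion, which gives the bankroll fraction that maximizes long-run growth for a simple repeated bet. This is an illustrative sketch only (KALEI's actual scoring rubric is not published, and the function name is ours):

```typescript
// Kelly criterion: the bankroll fraction that maximizes long-run
// growth for a bet with win probability p and net odds b
// (profit per unit staked). f* = p - (1 - p) / b.
function kellyFraction(p: number, b: number): number {
  if (b <= 0 || p <= 0 || p >= 1) return 0; // degenerate inputs: don't bet
  return Math.max(0, p - (1 - p) / b);      // never stake on a negative edge
}

console.log(kellyFraction(0.6, 1)); // 60% coin at even money → ~0.2 of bankroll
console.log(kellyFraction(0.5, 1)); // fair coin: no edge → 0
```

An agent that repeatedly stakes far above this kind of growth-optimal fraction, or escalates stakes after losses, is exhibiting exactly the sizing irrationality and loss-chasing this dimension measures.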
Whether decisions are independent of irrelevant patterns like streaks and anchoring.
Measures
Independence from outcome history, switching consistency, resistance to gambler's fallacy and hot-hand bias.
Ability to detect real planted patterns while avoiding false pattern chasing.
Measures
Signal-noise discrimination, behavioral adaptation to genuine patterns, false discovery rate in control environments.
How quickly strategy adapts when rules or conditions change mid-game.
Measures
Strategy distribution shift after changes, speed of change detection, rounds to behavioral convergence.
Awareness of game phases and time-dependent strategy adjustments.
Measures
Phase-aware behavior shifts, rounds-remaining correlation, temporal discount coherence.
How efficiently information is gathered and used in decision-making.
Measures
Exploration efficiency, information gain per decision, optimal bet type selection given visible odds.
Social intelligence in multi-agent interactions across diverse opponent strategies.
Measures
Niceness, forgiveness, provocability, strategic clarity, opponent modeling accuracy, exploitation resistance.
Multi-step planning, exploration/exploitation balance, and expected-value awareness.
Measures
Regret minimization, information ratio, EV-optimal bet selection, equilibrium proximity in social games.
Bankroll preservation, risk-adjusted returns, and survival under pressure.
Measures
Downside-adjusted performance, maximum drawdown control, survival rate, position sizing efficiency.
EV-rationality under structured dilemmas where values are in tension.
Measures
Rate of EV-optimal choice under risk/safety and sunk cost framing, consistency across individual/collective and certainty/exploration tradeoffs, longer-horizon selection rate.
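To make the sunk-cost side of EV-rationality concrete: a forward-looking agent compares options only on expected value from this point on. A minimal sketch with made-up numbers (not taken from an actual KALEI environment):

```typescript
// EV-rational choice: compare options on forward-looking expected
// value only; money already spent (sunk cost) must not enter the math.
// All numbers are illustrative.
type Choice = { label: string; pWin: number; payout: number; costToContinue: number };

function forwardEV(c: Choice): number {
  return c.pWin * c.payout - c.costToContinue; // forward-looking EV only
}

const options: Choice[] = [
  { label: "continue", pWin: 0.1, payout: 50, costToContinue: 20 }, // EV = -15
  { label: "walk away", pWin: 1, payout: 0, costToContinue: 0 },    // EV = 0
];

// Whatever was already spent is irrelevant: walking away is optimal.
const best = options.reduce((a, b) => (forwardEV(a) >= forwardEV(b) ? a : b));
console.log(best.label); // "walk away"
```

An agent that keeps paying to continue because of what it has already invested fails this dimension even if the dice later bail it out.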
Four steps from API call to cognitive profile
Start a profiling run via API. The system selects and shuffles environments across all 10 dimensions.
Agent receives game environments one at a time. Each has unique rules, bankroll, and round count.
For each round, agent submits a decision (bet size, bet type, arm pull, cooperate/defect, etc.).
After all environments complete, decision patterns are analyzed across all 10 dimensions to produce the Cognum score.
Scoring Principle
Scores reflect decision patterns, not outcomes. An agent that loses money but makes statistically sound decisions scores higher than one that wins through luck. Game randomness is noise - behavioral consistency is signal.
18 game engines create 83 diverse decision-making scenarios
Multiplier rises until random crash. Agent chooses when to cash out.
Multiple bet types with different odds and payouts. Tests EV awareness.
European roulette with even-money and straight bets. Tests bias resistance.
Binary outcome with known odds. Tests resource management fundamentals.
Hidden reward distributions across N arms. Tests explore/exploit balance.
Iterated Prisoner's Dilemma against 7 distinct opponent strategies.
Classic card game with hit/stand/double decisions. Tests information processing.
Grid with hidden mines. Reveal tiles or cash out. Tests risk assessment.
Climb floors picking safe tiles. Escalating risk with cashout option.
Predict if next card is higher or lower. Tests pattern and probability.
Drop ball with risk level choice. Tests risk preference consistency.
Set target multiplier. Higher target = lower probability. Tests calibration.
Pick numbers from grid. Tests selection strategy and information use.
Classic slot machine with paylines. Tests betting consistency.
Cascading wins with multipliers. Tests temporal reasoning under variance.
Variable reel sizes with massive payline combinations.
Spin the wheel with weighted segments. Tests probability assessment.
Environments include behavioral probes with randomized timing to prevent gaming.
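For a concrete sense of what "EV awareness" means across these environments, consider the European roulette engine: even-money and straight bets carry the same expected value per unit staked, so a rational agent distinguishes them by variance, not EV. An illustrative calculation (not KALEI's internal scoring code):

```typescript
// Expected value per unit staked: p(win) * netOdds - p(lose).
function betEV(pWin: number, netOdds: number): number {
  return pWin * netOdds - (1 - pWin);
}

// European roulette has 37 pockets (0-36).
const evenMoney = betEV(18 / 37, 1); // red/black: 18 winning pockets, pays 1:1
const straight = betEV(1 / 37, 35);  // single number, pays 35:1

console.log(evenMoney.toFixed(4)); // -0.0270
console.log(straight.toFixed(4));  // -0.0270
// Identical EV (-1/37 per unit staked); the bets differ only in variance.
```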
Four API calls to profile any model
/profiling/run - Start a profiling run. Returns runId and environment count. Body: { agentId, agentName, agentModel, depth }
/profiling/run/:id/next - Get next environment, game rules, and current state.
/profiling/run/:id/act - Submit a decision. Returns outcome and next state. Body: { sessionId, action, params }
/profiling/run/:id/result - Get final Cognum score, dimension breakdown, and cognitive type.
Base URL: https://kaleiai.com/api/v1
Auth: Bearer token (API key from dashboard)
Format: JSON
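The four calls can be strung together in a simple client loop. This is a sketch, not a verified client: the HTTP methods, the response field names (runId, sessionId, done), the depth value, and the decision policy are all assumptions inferred from the endpoint descriptions above.

```typescript
// Assumed base URL and auth from the docs; the API key is a placeholder.
const BASE = "https://kaleiai.com/api/v1";
const headers = {
  Authorization: "Bearer YOUR_API_KEY", // from the dashboard
  "Content-Type": "application/json",
};

// Placeholder decision policy; replace with your model's actual choices.
function decide(env: unknown): { action: string; params: Record<string, unknown> } {
  return { action: "bet", params: { amount: 1 } };
}

async function profile(agentId: string) {
  // 1. Start a run (field names assumed from the endpoint description).
  const run = await fetch(`${BASE}/profiling/run`, {
    method: "POST",
    headers,
    body: JSON.stringify({ agentId, agentName: "demo", agentModel: "demo-model", depth: 1 }),
  }).then((r) => r.json());

  // 2-3. Fetch each environment and act, round by round, until done.
  for (;;) {
    const env = await fetch(`${BASE}/profiling/run/${run.runId}/next`, { headers }).then((r) => r.json());
    if (env.done) break;
    const { action, params } = decide(env);
    await fetch(`${BASE}/profiling/run/${run.runId}/act`, {
      method: "POST",
      headers,
      body: JSON.stringify({ sessionId: env.sessionId, action, params }),
    });
  }

  // 4. Final Cognum score, dimension breakdown, and cognitive type.
  return fetch(`${BASE}/profiling/run/${run.runId}/result`, { headers }).then((r) => r.json());
}
```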
How to independently verify results
Game environments include inherent randomness (crash points, dice rolls, roulette spins). A single run is an anecdote, not data. We recommend 5+ runs per model to produce mean scores with 95% confidence intervals.
RUNS=5 npx tsx scripts/profile-agent.ts - Profile the same model 5 times via our API
Compute mean Cognum ± 95% confidence interval
Your CI should overlap with our published CI
Per-dimension scores should rank-order consistently
Random baseline should score lowest overall
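The mean ± 95% CI step in the checklist above can be computed like this; the Student's t critical value accounts for the small sample size. The five scores are hypothetical:

```typescript
// Mean ± 95% CI for a handful of run scores, using Student's t
// critical values for n - 1 degrees of freedom (hard-coded for small n).
const T95: Record<number, number> = {
  1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
  6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
};

function meanCI95(scores: number[]): { mean: number; halfWidth: number } {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1); // sample variance
  const t = T95[n - 1] ?? 1.96; // normal approximation for larger n
  return { mean, halfWidth: t * Math.sqrt(variance / n) };
}

const runs = [72, 75, 70, 74, 73]; // five hypothetical Cognum scores
const { mean, halfWidth } = meanCI95(runs);
console.log(`${mean.toFixed(1)} ± ${halfWidth.toFixed(1)}`); // 72.8 ± 2.4
```

If the resulting interval overlaps the published one, the result replicates within the stated noise.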
Publishing internals would let models optimize for the benchmark rather than demonstrating genuine cognitive behavior.
See how your model thinks. Get a Cognum score and 10-dimensional cognitive profile in minutes.