Benchmark protocol.
Transparent methodology for AI cognitive profiling. We publish what we measure and how so results can be independently verified.
10 Dimensions
83 Environments
18 Game Engines
What KALEI Measures
Not capability. Not accuracy. Not knowledge. KALEI measures how AI models think: their cognitive patterns, biases, and decision-making strategies under uncertainty.
Behavior, Not Outcomes
Scoring is based on decision patterns such as bet sizing, switching, and adaptation speed. Win/loss results are noise, not signal.
Controlled Environments
Each environment has known statistical properties. Agent decisions are compared against optimal strategies, not random baselines.
Statistical Rigor
Multiple runs produce mean scores with 95% confidence intervals. Results are reproducible within the published CI bounds.
The 10 Cognitive Dimensions
Each dimension captures a distinct aspect of cognitive behavior
Risk Tolerance
How the agent sizes bets relative to bankroll and responds to wins and losses.
Measures
Loss-chasing behavior, bet-sizing persistence, drawdown recovery, position sizing rationality.
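As an illustration of what a loss-chasing measure could look like, the sketch below compares the average bet placed after a loss with the average bet placed after a win. This is not KALEI's metric (exact scoring formulas are protected); the `Round` shape and the `lossChasingRatio` name are assumptions made for this example.

```typescript
// Hypothetical per-round record; field names are assumptions for illustration.
interface Round {
  bet: number;  // amount wagered in this round
  won: boolean; // whether this round was won
}

// Mean bet after a loss divided by mean bet after a win.
// Values well above 1 suggest the agent raises stakes to chase losses.
// Returns NaN when either group is empty.
function lossChasingRatio(rounds: Round[]): number {
  const afterLoss: number[] = [];
  const afterWin: number[] = [];
  for (let i = 1; i < rounds.length; i++) {
    // Group each bet by the outcome of the round that preceded it.
    (rounds[i - 1].won ? afterWin : afterLoss).push(rounds[i].bet);
  }
  const mean = (xs: number[]): number =>
    xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(afterLoss) / mean(afterWin);
}
```

A ratio near 1 would indicate that bet size does not depend on the previous outcome.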
Bias Detection
Whether decisions are independent of irrelevant patterns like streaks and anchoring.
Measures
Independence from outcome history, switching consistency, resistance to gambler's fallacy and hot-hand bias.
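One simple way to probe independence from outcome history is to compare how often the agent switches options after a loss versus after a win. The sketch below is illustrative only and is not the protected KALEI formula; `Choice` and `switchRateGap` are names invented for this example.

```typescript
// Hypothetical decision record; field names are assumptions for illustration.
interface Choice {
  option: string; // e.g. "red" or "black"
  won: boolean;   // outcome of that round
}

// P(switch | previous loss) minus P(switch | previous win).
// In a game with independent rounds, an unbiased agent should be near 0.
// Returns NaN if the history has no post-win or no post-loss decisions.
function switchRateGap(history: Choice[]): number {
  let switchAfterLoss = 0;
  let lossCount = 0;
  let switchAfterWin = 0;
  let winCount = 0;
  for (let i = 1; i < history.length; i++) {
    const switched = history[i].option !== history[i - 1].option;
    if (history[i - 1].won) {
      winCount++;
      if (switched) switchAfterWin++;
    } else {
      lossCount++;
      if (switched) switchAfterLoss++;
    }
  }
  if (winCount === 0 || lossCount === 0) return NaN;
  return switchAfterLoss / lossCount - switchAfterWin / winCount;
}
```

A large positive gap means the agent abandons an option mostly after losing on it, which is the kind of streak-driven behavior this dimension describes.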
Pattern Recognition
Ability to detect real planted patterns while avoiding false pattern chasing.
Measures
Signal-noise discrimination, behavioral adaptation to genuine patterns, false discovery rate in control environments.
Learning Speed
How quickly strategy adapts when rules or conditions change mid-game.
Measures
Strategy distribution shift after changes, speed of change detection, rounds to behavioral convergence.
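"Rounds to behavioral convergence" can be made concrete in several ways. A minimal sketch, assuming the agent's choices are arm indices and the new best arm after the change is known: count rounds after the change until a sliding window of choices is dominated by the new best arm. The window size and threshold are arbitrary assumptions, not KALEI parameters.

```typescript
// Returns the number of rounds after `changeRound` until a window of
// `window` consecutive choices contains at least `threshold` (a fraction)
// picks of `optimalArm`. Returns null if that never happens.
function roundsToConvergence(
  choices: number[],
  changeRound: number,
  optimalArm: number,
  window = 5,
  threshold = 0.8,
): number | null {
  for (let start = changeRound; start + window <= choices.length; start++) {
    let hits = 0;
    for (let j = start; j < start + window; j++) {
      if (choices[j] === optimalArm) hits++;
    }
    if (hits / window >= threshold) return start - changeRound;
  }
  return null;
}
```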
Temporal Reasoning
Awareness of game phases and time-dependent strategy adjustments.
Measures
Phase-aware behavior shifts, rounds-remaining correlation, temporal discount coherence.
Info Processing
How efficiently information is gathered and used in decision-making.
Measures
Exploration efficiency, information gain per decision, optimal bet type selection given visible odds.
Cooperation
Social intelligence in multi-agent interactions across diverse opponent strategies.
Measures
Niceness, forgiveness, provocability, strategic clarity, opponent modeling accuracy, exploitation resistance.
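Niceness, provocability, and forgiveness are familiar terms from the Iterated Prisoner's Dilemma literature. The sketch below shows one possible operationalization from a move history; the definitions, type names, and function name are assumptions for illustration and are not KALEI's protected metrics.

```typescript
type Move = "C" | "D"; // cooperate or defect

// Hypothetical record of one round; names are assumptions for illustration.
interface IpdRound {
  agent: Move;
  opponent: Move;
}

interface IpdTraits {
  nice: boolean;         // agent never defected before the opponent did
  provocability: number; // P(agent defects at t+1 | opponent defected at t)
  forgiveness: number;   // P(agent cooperates at t+1 | opponent returned to C at t)
}

function ipdTraits(rounds: IpdRound[]): IpdTraits {
  const firstOpp = rounds.findIndex((r) => r.opponent === "D");
  const firstAgent = rounds.findIndex((r) => r.agent === "D");
  const nice = firstAgent === -1 || (firstOpp !== -1 && firstOpp < firstAgent);

  let provoked = 0, retaliated = 0, reconciled = 0, forgave = 0;
  for (let t = 0; t + 1 < rounds.length; t++) {
    if (rounds[t].opponent === "D") {
      provoked++;
      if (rounds[t + 1].agent === "D") retaliated++;
    }
    // Opponent defected at t-1 and cooperated again at t.
    if (t >= 1 && rounds[t - 1].opponent === "D" && rounds[t].opponent === "C") {
      reconciled++;
      if (rounds[t + 1].agent === "C") forgave++;
    }
  }
  return {
    nice,
    provocability: provoked === 0 ? NaN : retaliated / provoked,
    forgiveness: reconciled === 0 ? NaN : forgave / reconciled,
  };
}
```

Under these definitions a tit-for-tat player is nice, fully provocable, and fully forgiving.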
Strategic Depth
Multi-step planning, exploration/exploitation balance, and expected-value awareness.
Measures
Regret minimization, information ratio, EV-optimal bet selection, equilibrium proximity in social games.
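Regret in a multi-armed bandit has a standard definition: the expected reward given up by not always pulling the best arm. The sketch below computes cumulative regret for a sequence of pulls, assuming the true arm means are known to the evaluator (as they would be in a controlled environment). How KALEI weights or normalizes regret is not published; the function name is an assumption.

```typescript
// armMeans[i] is the true expected reward of arm i (known to the evaluator,
// hidden from the agent). pulls is the sequence of arm indices the agent chose.
function cumulativeRegret(armMeans: number[], pulls: number[]): number {
  const best = Math.max(...armMeans);
  // Each pull costs the gap between the best arm and the chosen arm.
  return pulls.reduce((sum, arm) => sum + (best - armMeans[arm]), 0);
}
```

Because it depends only on which arms were chosen, this quantity is unaffected by the luck of individual payouts.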
Resource Management
Bankroll preservation, risk-adjusted returns, and survival under pressure.
Measures
Downside-adjusted performance, maximum drawdown control, survival rate, position sizing efficiency.
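Maximum drawdown has a conventional definition: the largest peak-to-trough decline in bankroll, expressed as a fraction of the peak. A minimal sketch, assuming a bankroll series with one value per round (the function name is illustrative, and this is not necessarily how KALEI computes its internal metric):

```typescript
// Largest fractional drop from a running peak, in [0, 1].
function maxDrawdown(bankroll: number[]): number {
  let peak = -Infinity;
  let worst = 0;
  for (const value of bankroll) {
    if (value > peak) peak = value;
    // Skip non-positive peaks to avoid dividing by zero.
    if (peak > 0) worst = Math.max(worst, (peak - value) / peak);
  }
  return worst;
}
```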
Conflict
EV-rationality under structured dilemmas where values are in tension.
Measures
Rate of EV-optimal choice under risk/safety and sunk cost framing, consistency across individual/collective and certainty/exploration tradeoffs, longer-horizon selection rate.
Profiling Process
Four steps from API call to cognitive profile
Initialize
Start a profiling run via API. The system selects and shuffles environments across all 10 dimensions.
Play
Agent receives game environments one at a time. Each has unique rules, bankroll, and round count.
Decide
For each round, agent submits a decision (bet size, bet type, arm pull, cooperate/defect, etc.).

Score
After all environments complete, decision patterns are analyzed across all 10 dimensions to produce the Cognum score.
Scoring Principle
Scores reflect decision patterns, not outcomes. An agent that loses money but makes statistically sound decisions scores higher than one that wins through luck. Game randomness is noise; behavioral consistency is signal.
Environment Types
18 game engines create 83 diverse decision-making scenarios
Crash
Risk / Timing: Multiplier rises until random crash. Agent chooses when to cash out.
Dice
Strategy / Info: Multiple bet types with different odds and payouts. Tests EV awareness.
Roulette
Bias / Pattern: European roulette with even-money and straight bets. Tests bias resistance.
Coinflip
Resource / Temporal: Binary outcome with known odds. Tests resource management fundamentals.
Multi-Armed Bandit
Strategy / Learning: Hidden reward distributions across N arms. Tests explore/exploit balance.
Cooperation
Cooperation / Social: Iterated Prisoner's Dilemma against 7 distinct opponent strategies.
Blackjack
Info / Strategy: Classic card game with hit/stand/double decisions. Tests information processing.
Mines
Risk / Resource: Grid with hidden mines. Reveal tiles or cash out. Tests risk assessment.
Tower
Risk / Temporal: Climb floors picking safe tiles. Escalating risk with cashout option.
HiLo
Pattern / Info: Predict if next card is higher or lower. Tests pattern and probability.
Plinko
Risk / Bias: Drop ball with risk level choice. Tests risk preference consistency.
Limbo
Strategy / Risk: Set target multiplier. Higher target = lower probability. Tests calibration.
Keno
Info / Pattern: Pick numbers from grid. Tests selection strategy and information use.
Slots
Resource / Bias: Classic slot machine with paylines. Tests betting consistency.
Cascade Slots
Temporal / Resource: Cascading wins with multipliers. Tests temporal reasoning under variance.
Megaways
Resource / Risk: Variable reel sizes with massive payline combinations.
Wheel
Bias / Info: Spin the wheel with weighted segments. Tests probability assessment.
Environments include behavioral probes with randomized timing to prevent gaming.
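To illustrate what "known statistical properties" means for an engine such as Roulette above, the sketch below computes the expected value per unit staked for the two bet types mentioned, using the standard European wheel (37 pockets, straight bets paying 35 to 1, even-money bets paying 1 to 1). KALEI's actual environment configurations are not published, so these payouts are the conventional ones, not confirmed settings.

```typescript
// Expected profit per unit staked for a bet that wins `payout` units with
// probability `pWin` and loses the stake otherwise.
function expectedValue(pWin: number, payout: number): number {
  return pWin * payout - (1 - pWin);
}

// Standard European roulette: pockets 0 through 36.
const POCKETS = 37;
const evenMoneyEv = expectedValue(18 / POCKETS, 1); // e.g. red/black
const straightEv = expectedValue(1 / POCKETS, 35);  // single number
```

Both bets have the same expected value of -1/37 per unit, so in such a game no history of past spins justifies preferring one over the other on EV grounds; they differ only in variance.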
Profile Your AI
Four API calls to profile any model
/profiling/run
Start a profiling run. Returns runId and environment count.
Request body: { agentId, agentName, agentModel, depth }
/profiling/run/:id/next
Get next environment, game rules, and current state.
/profiling/run/:id/act
Submit a decision. Returns outcome and next state.
Request body: { sessionId, action, params }
/profiling/run/:id/result
Get final Cognum score, dimension breakdown, and cognitive type.
Base URL: https://kaleiai.com/api/v1
Auth: Bearer token (API key from dashboard)
Format: JSON
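The four calls above could be wrapped in a thin client along the following lines. This is a sketch under stated assumptions, not the official SDK: the HTTP methods (POST for starting a run and submitting a decision, GET for the other two) are not given in the text above and are guesses, the field types are unspecified, and response shapes beyond `runId` are left untyped. `createClient` and `FetchLike` are hypothetical names.

```typescript
// Minimal fetch-shaped function type so the client can be used with the
// global fetch or with a stub in tests.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

const BASE_URL = "https://kaleiai.com/api/v1";

// Field names come from the endpoint list; their types are assumptions.
interface StartRunBody {
  agentId: string;
  agentName: string;
  agentModel: string;
  depth: unknown; // type not specified in the source
}
interface ActBody {
  sessionId: string;
  action: string;
  params: unknown; // shape depends on the game; not specified in the source
}

function createClient(apiKey: string, fetchFn: FetchLike) {
  async function request(method: "GET" | "POST", path: string, body?: unknown): Promise<any> {
    const res = await fetchFn(`${BASE_URL}${path}`, {
      method, // ASSUMPTION: methods are not documented above
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`${method} ${path} failed with status ${res.status}`);
    return res.json();
  }
  return {
    startRun: (agent: StartRunBody) => request("POST", "/profiling/run", agent),
    next: (runId: string) => request("GET", `/profiling/run/${runId}/next`),
    act: (runId: string, decision: ActBody) =>
      request("POST", `/profiling/run/${runId}/act`, decision),
    result: (runId: string) => request("GET", `/profiling/run/${runId}/result`),
  };
}
```

In a runtime with a global `fetch`, the client could be created as `createClient(apiKey, fetch)`. How a caller detects that an environment or the whole run has finished is not described above, so the sketch deliberately leaves the play loop to the caller.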
Reproducibility
How to independently verify results
Repeated Runs
Game environments include inherent randomness (crash points, dice rolls, roulette spins). A single run is an anecdote, not data. We recommend 5+ runs per model to produce mean scores with 95% confidence intervals.
RUNS=5 npx tsx scripts/profile-agent.ts
Verification Protocol
Profile the same model 5 times via our API
Compute mean Cognum ± 95% confidence interval
Your CI should overlap with our published CI
Per-dimension scores should rank-order consistently
Random baseline should score lowest overall
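The mean, confidence interval, and overlap steps in the protocol above can be sketched as follows. This assumes a Student-t interval on the per-run Cognum scores, which is a common choice for small samples; the page does not state which interval method is used for the published numbers. Function and type names are illustrative.

```typescript
// Two-sided 95% Student-t critical values for 1..9 degrees of freedom,
// i.e. for 2..10 runs.
const T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262];

interface Interval {
  mean: number;
  lower: number;
  upper: number;
}

// Mean and 95% confidence interval of the mean for a small set of run scores.
function confidenceInterval95(scores: number[]): Interval {
  const n = scores.length;
  if (n < 2 || n > 10) throw new Error("this sketch supports 2 to 10 runs");
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // Sample variance with Bessel's correction (n - 1).
  const variance =
    scores.reduce((acc, x) => acc + (x - mean) * (x - mean), 0) / (n - 1);
  const halfWidth = T_95[n - 2] * Math.sqrt(variance / n);
  return { mean, lower: mean - halfWidth, upper: mean + halfWidth };
}

// True when two intervals share at least one point.
function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.lower <= b.upper && b.lower <= a.upper;
}
```

With five runs the critical value is 2.776, so the interval is noticeably wider than the familiar 1.96 standard errors; overlapping intervals are consistent with agreement between two sets of runs rather than proof of it.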
Transparency Boundary
✓ Published
- Dimension definitions and what each measures
- Game engine types and general mechanics
- Scoring approach (behavior-based, not outcome-based)
- Raw results with confidence intervals per model
- Profiling API specification
- Reproducibility protocol
✗ Protected
- Exact scoring formulas and metric weights
- Behavioral probe implementations and timing
- Environment template configurations
- Cognitive type classification model
- Internal metric calculations
Publishing internals would let models optimize for the benchmark rather than demonstrating genuine cognitive behavior.