Open Specification

Benchmark Protocol

Transparent methodology for AI cognitive profiling. We publish what we measure and how we measure it, so results can be independently verified.

10 Dimensions

83 Environments

18 Game Engines

What KALEI Measures

Not capability. Not accuracy. Not knowledge. KALEI measures how AI models think: their cognitive patterns, biases, and decision-making strategies under uncertainty.

Behavior, Not Outcomes

Scoring is based on decision patterns such as bet sizing, switching, and adaptation speed. Win/loss results are treated as noise, not signal.

Controlled Environments

Each environment has known statistical properties. Agent decisions are compared against optimal strategies, not random baselines.

Statistical Rigor

Multiple runs yield a mean score with a 95% confidence interval. Results are reproducible within the published CI bounds.

The 10 Cognitive Dimensions

Each dimension captures a distinct aspect of cognitive behavior

Risk Tolerance

How the agent sizes bets relative to bankroll and responds to wins and losses.

Measures

Loss-chasing behavior, bet-sizing persistence, drawdown recovery, position sizing rationality.
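Judging "position sizing rationality" needs a reference point. One widely used reference for a bet with known odds is the Kelly fraction. The sketch below is illustrative only: whether KALEI compares bets against Kelly sizing is not stated in this specification, and the function name is hypothetical.

```typescript
// Kelly fraction for a bet that wins with probability p and pays b units of
// profit per unit staked: f* = p - (1 - p) / b.
// Clamped at 0, because a bet with no positive edge warrants no stake.
// ASSUMPTION: Kelly sizing is used here only as an example reference point.
function kellyFraction(p: number, b: number): number {
  if (p < 0 || p > 1 || b <= 0) {
    throw new RangeError("need 0 <= p <= 1 and b > 0");
  }
  return Math.max(0, p - (1 - p) / b);
}
```

An agent that stakes far more than this fraction of its bankroll after a loss would be one plausible sign of loss-chasing.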

Bias Detection

Whether decisions are independent of irrelevant patterns like streaks and anchoring.

Measures

Independence from outcome history, switching consistency, resistance to gambler's fallacy and hot-hand bias.
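One way to picture "independence from outcome history" is to compare how often an agent switches sides right after a streak with how often it switches otherwise. This is a minimal sketch: the `Round` shape, the streak length of 3, and the metric itself are assumptions for illustration, not KALEI's protected scoring formula.

```typescript
// Hypothetical round record: the side the agent bet on and the side that came up.
interface Round {
  choice: "red" | "black";
  outcome: "red" | "black";
}

// Switch rate after a streak of `streakLen` identical outcomes, minus the
// switch rate in all other rounds. Near 0 suggests streak-independent
// decisions; far from 0 suggests the agent reacts to streaks.
function streakSwitchGap(rounds: Round[], streakLen = 3): number {
  let afterStreak = 0;
  let afterStreakSwitches = 0;
  let other = 0;
  let otherSwitches = 0;
  for (let i = 1; i < rounds.length; i++) {
    const switched = rounds[i].choice !== rounds[i - 1].choice;
    // Were the previous `streakLen` outcomes all the same?
    let streak = i >= streakLen;
    for (let k = 1; streak && k < streakLen; k++) {
      if (rounds[i - 1 - k].outcome !== rounds[i - 1].outcome) streak = false;
    }
    if (streak) {
      afterStreak++;
      if (switched) afterStreakSwitches++;
    } else {
      other++;
      if (switched) otherSwitches++;
    }
  }
  const rate = (hits: number, n: number): number => (n === 0 ? 0 : hits / n);
  return rate(afterStreakSwitches, afterStreak) - rate(otherSwitches, other);
}
```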

Pattern Recognition

Ability to detect real planted patterns while avoiding false pattern chasing.

Measures

Signal-noise discrimination, behavioral adaptation to genuine patterns, false discovery rate in control environments.

Learning Speed

How quickly strategy adapts when rules or conditions change mid-game.

Measures

Strategy distribution shift after changes, speed of change detection, rounds to behavioral convergence.

Temporal Reasoning

Awareness of game phases and time-dependent strategy adjustments.

Measures

Phase-aware behavior shifts, rounds-remaining correlation, temporal discount coherence.

Info Processing

How efficiently information is gathered and used in decision-making.

Measures

Exploration efficiency, information gain per decision, optimal bet type selection given visible odds.

Cooperation

Social intelligence in multi-agent interactions across diverse opponent strategies.

Measures

Niceness, forgiveness, provocability, strategic clarity, opponent modeling accuracy, exploitation resistance.
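The terms niceness, forgiveness, and provocability come from the classic analysis of the Iterated Prisoner's Dilemma. The sketch below shows one possible reading of "niceness" (never being the first to defect) using tit-for-tat as an example. The payoff values (5, 3, 1, 0) are the conventional textbook ones and are an assumption here, as are all function names; the specification names neither the payoffs nor the seven opponent strategies.

```typescript
type Move = "C" | "D"; // cooperate or defect
type Strategy = (own: Move[], opponent: Move[]) => Move;

// ASSUMPTION: conventional payoffs (temptation 5, reward 3, punishment 1, sucker 0).
function payoff(me: Move, other: Move): number {
  if (me === "C") return other === "C" ? 3 : 0;
  return other === "C" ? 5 : 1;
}

// Tit-for-tat: cooperate first, then copy the opponent's previous move.
const titForTat: Strategy = (_own, opp) =>
  opp.length === 0 ? "C" : opp[opp.length - 1];
const alwaysDefect: Strategy = () => "D";

function play(a: Strategy, b: Strategy, rounds: number) {
  const aMoves: Move[] = [];
  const bMoves: Move[] = [];
  let aScore = 0;
  let bScore = 0;
  for (let i = 0; i < rounds; i++) {
    // Both players decide from the history so far, before either move is recorded.
    const am = a(aMoves, bMoves);
    const bm = b(bMoves, aMoves);
    aMoves.push(am);
    bMoves.push(bm);
    aScore += payoff(am, bm);
    bScore += payoff(bm, am);
  }
  return { aScore, bScore, aMoves, bMoves };
}

// "Nice": the player never defects before the opponent has defected.
function isNice(own: Move[], opponent: Move[]): boolean {
  const firstOwn = own.indexOf("D");
  if (firstOwn === -1) return true;
  const firstOpp = opponent.indexOf("D");
  return firstOpp !== -1 && firstOpp < firstOwn;
}
```

Under these assumptions, tit-for-tat counts as nice against any opponent, while always-defect never does.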

Strategic Depth

Multi-step planning, exploration/exploitation balance, and expected-value awareness.

Measures

Regret minimization, information ratio, EV-optimal bet selection, equilibrium proximity in social games.
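"Regret minimization" has a standard definition in bandit settings: the expected reward given up by not always pulling the best arm. A minimal sketch follows; the arm means and pull sequences in the usage note are made up for illustration, and how KALEI weights regret is not published.

```typescript
// Cumulative expected regret for a bandit run: the sum, over rounds, of the
// gap between the best arm's true mean reward and the chosen arm's true mean.
// `armMeans` is known to the environment, not to the agent.
// `pulls` holds the index of the arm chosen in each round.
function cumulativeRegret(armMeans: number[], pulls: number[]): number {
  const best = Math.max(...armMeans);
  return pulls.reduce((total, arm) => {
    if (arm < 0 || arm >= armMeans.length) {
      throw new RangeError(`unknown arm index ${arm}`);
    }
    return total + (best - armMeans[arm]);
  }, 0);
}
```

With hypothetical means `[0.2, 0.5, 0.8]`, an agent that pulls `[0, 1, 2, 2, 2, 2]` accumulates roughly 0.9 regret, while one that pulls arm 0 six times accumulates roughly 3.6, regardless of which rewards happened to be drawn.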

Resource Management

Bankroll preservation, risk-adjusted returns, and survival under pressure.

Measures

Downside-adjusted performance, maximum drawdown control, survival rate, position sizing efficiency.
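Maximum drawdown and survival rate have common textbook definitions, sketched below. These are the usual formulations and are an assumption here; the exact metric calculations KALEI uses are not published.

```typescript
// Maximum drawdown: the largest peak-to-trough drop in bankroll, expressed as
// a fraction of the peak. Returns 0 for an empty or never-declining history.
function maxDrawdown(bankroll: number[]): number {
  let peak = -Infinity;
  let worst = 0;
  for (const value of bankroll) {
    if (value > peak) peak = value;
    if (peak > 0) worst = Math.max(worst, (peak - value) / peak);
  }
  return worst;
}

// Survival rate: the fraction of runs that end with a positive bankroll.
function survivalRate(finalBankrolls: number[]): number {
  if (finalBankrolls.length === 0) return 0;
  return finalBankrolls.filter((b) => b > 0).length / finalBankrolls.length;
}
```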

Conflict

EV-rationality under structured dilemmas where values are in tension.

Measures

Rate of EV-optimal choice under risk/safety and sunk cost framing, consistency across individual/collective and certainty/exploration tradeoffs, longer-horizon selection rate.

Profiling Process

Four steps from API call to cognitive profile

01 Initialize

Start a profiling run via API. The system selects and shuffles environments across all 10 dimensions.

02 Play

The agent receives game environments one at a time. Each has its own rules, bankroll, and round count.

03 Decide

For each round, the agent submits a decision (bet size, bet type, arm pull, cooperate/defect, etc.).

04 Score

After all environments complete, decision patterns are analyzed across all 10 dimensions to produce the Cognum score.

Scoring Principle

Scores reflect decision patterns, not outcomes. An agent that loses money but makes statistically sound decisions scores higher than one that wins through luck. Game randomness is noise; behavioral consistency is signal.
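The principle can be illustrated with a toy metric that looks only at which option was chosen and ignores what was paid out. This is a sketch under stated assumptions: the `Decision` shape and the "share of EV-optimal choices" metric are hypothetical stand-ins, since the real scoring formulas are protected.

```typescript
// Hypothetical decision record.
interface Decision {
  optionEVs: number[]; // expected value of each available option
  chosen: number;      // index of the option the agent picked
  payout: number;      // realized profit or loss (ignored by the metric)
}

// Share of decisions in which the agent picked an option with the highest EV.
function evOptimalRate(decisions: Decision[]): number {
  if (decisions.length === 0) return 0;
  const optimal = decisions.filter(
    (d) => d.optionEVs[d.chosen] === Math.max(...d.optionEVs),
  ).length;
  return optimal / decisions.length;
}

// Net profit, shown only for contrast with the behavior-based metric.
function netProfit(decisions: Decision[]): number {
  return decisions.reduce((sum, d) => sum + d.payout, 0);
}

// Unlucky but sound: always picks the best EV and still loses money.
const sound: Decision[] = [
  { optionEVs: [-0.05, -0.01], chosen: 1, payout: -10 },
  { optionEVs: [-0.02, -0.08], chosen: 0, payout: -10 },
];

// Lucky but unsound: picks the worse EV and happens to win.
const lucky: Decision[] = [
  { optionEVs: [-0.05, -0.01], chosen: 0, payout: 20 },
  { optionEVs: [-0.02, -0.08], chosen: 1, payout: 20 },
];
```

Here `sound` ends with less money than `lucky` but has the higher EV-optimal rate, which is the ordering the scoring principle describes.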

Environment Types

18 game engines create 83 diverse decision-making scenarios

Crash

Risk / Timing

Multiplier rises until random crash. Agent chooses when to cash out.

Dice

Strategy / Info

Multiple bet types with different odds and payouts. Tests EV awareness.

Roulette

Bias / Pattern

European roulette with even-money and straight bets. Tests bias resistance.

Coinflip

Resource / Temporal

Binary outcome with known odds. Tests resource management fundamentals.

Multi-Armed Bandit

Strategy / Learning

Hidden reward distributions across N arms. Tests explore/exploit balance.

Cooperation

Cooperation / Social

Iterated Prisoner's Dilemma against 7 distinct opponent strategies.

Blackjack

Info / Strategy

Classic card game with hit/stand/double decisions. Tests information processing.

Mines

Risk / Resource

Grid with hidden mines. Reveal tiles or cash out. Tests risk assessment.

Tower

Risk / Temporal

Climb floors picking safe tiles. Escalating risk with cashout option.

HiLo

Pattern / Info

Predict if next card is higher or lower. Tests pattern and probability.

Plinko

Risk / Bias

Drop ball with risk level choice. Tests risk preference consistency.

Limbo

Strategy / Risk

Set target multiplier. Higher target = lower probability. Tests calibration.

Keno

Info / Pattern

Pick numbers from grid. Tests selection strategy and information use.

Slots

Resource / Bias

Classic slot machine with paylines. Tests betting consistency.

Cascade Slots

Temporal / Resource

Cascading wins with multipliers. Tests temporal reasoning under variance.

Megaways

Resource / Risk

Variable reel sizes with massive payline combinations.

Wheel

Bias / Info

Spin the wheel with weighted segments. Tests probability assessment.

Environments include behavioral probes with randomized timing to prevent gaming.

Profile Your AI

Four API calls to profile any model

POST /profiling/run

Start a profiling run. Returns runId and environment count.

{ agentId, agentName, agentModel, depth }

GET /profiling/run/:id/next

Get next environment, game rules, and current state.

POST /profiling/run/:id/act

Submit a decision. Returns outcome and next state.

{ sessionId, action, params }

GET /profiling/run/:id/result

Get final Cognum score, dimension breakdown, and cognitive type.

Base URL: https://kaleiai.com/api/v1
Auth: Bearer token (API key from dashboard)
Format: JSON
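The four calls can be chained into a single run loop, sketched below. Only the paths, the request field names, `runId`, the base URL, and Bearer auth come from this page. Everything else is an assumption made so the sketch is complete: the `done` flag on the `/next` response, reading `sessionId` from that response, the type of `depth`, and the loop shape itself. Check the actual response payloads before relying on any of it.

```typescript
// Structural subset of fetch, so a stub can stand in for the network.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

const BASE_URL = "https://kaleiai.com/api/v1";

interface AgentInfo {
  agentId: string;
  agentName: string;
  agentModel: string;
  depth: string; // ASSUMPTION: type and allowed values are not documented here
}

// ASSUMPTION: a decision is an action name plus free-form params.
type Decide = (env: any) => { action: string; params: Record<string, unknown> };

async function call(
  fetchFn: FetchLike,
  apiKey: string,
  method: "GET" | "POST",
  path: string,
  body?: unknown,
): Promise<any> {
  const res = await fetchFn(`${BASE_URL}${path}`, {
    method,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: body === undefined ? undefined : JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${method} ${path} failed with HTTP ${res.status}`);
  return res.json();
}

async function profileOnce(
  fetchFn: FetchLike,
  apiKey: string,
  agent: AgentInfo,
  decide: Decide,
  maxSteps = 100_000, // safety cap so a missing completion signal cannot loop forever
): Promise<any> {
  const { runId } = await call(fetchFn, apiKey, "POST", "/profiling/run", agent);
  for (let step = 0; step < maxSteps; step++) {
    const env = await call(fetchFn, apiKey, "GET", `/profiling/run/${runId}/next`);
    if (env.done) break; // ASSUMPTION: hypothetical completion flag
    const { action, params } = decide(env);
    // ASSUMPTION: sessionId is taken from the /next response.
    await call(fetchFn, apiKey, "POST", `/profiling/run/${runId}/act`, {
      sessionId: env.sessionId,
      action,
      params,
    });
  }
  return call(fetchFn, apiKey, "GET", `/profiling/run/${runId}/result`);
}
```

Taking the fetch function as a parameter keeps the loop testable offline; in a runtime with a global `fetch`, passing that function in should be enough to talk to the real API.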

Reproducibility

How to independently verify results

Repeated Runs

Game environments include inherent randomness (crash points, dice rolls, roulette spins). A single run is an anecdote, not data. We recommend 5+ runs per model to produce mean scores with 95% confidence intervals.

RUNS=5 npx tsx scripts/profile-agent.ts
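For readers aggregating runs by hand, the mean and 95% confidence interval can be computed as sketched below. With only a handful of runs, a Student t critical value is the conventional choice; whether the published intervals use t or a normal approximation is not stated, so treat that as an assumption. The function name is hypothetical.

```typescript
// Two-sided 95% Student t critical values for 1..30 degrees of freedom.
const T_975 = [
  12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
  2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
  2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
];

interface MeanCI {
  mean: number;
  halfWidth: number;
  low: number;
  high: number;
}

// Mean and 95% confidence interval from per-run scores.
// ASSUMPTION: a t-based interval on the sample standard deviation.
function meanWithCI(scores: number[]): MeanCI {
  const n = scores.length;
  if (n < 2) throw new RangeError("need at least 2 runs for an interval");
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  // Unbiased sample variance (divides by n - 1).
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const df = n - 1;
  // Beyond 30 degrees of freedom, fall back to the normal value 1.96.
  const t = df <= T_975.length ? T_975[df - 1] : 1.96;
  const halfWidth = t * Math.sqrt(variance / n);
  return { mean, halfWidth, low: mean - halfWidth, high: mean + halfWidth };
}
```

With 5 runs the critical value is 2.776 rather than 1.96, so intervals from small samples are noticeably wider than a normal approximation would suggest.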

Verification Protocol

1. Profile the same model 5 times via our API
2. Compute mean Cognum ± 95% confidence interval
3. Your CI should overlap with our published CI
4. Per-dimension scores should rank-order consistently
5. Random baseline should score lowest overall
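Steps 3 to 5 above can be checked mechanically once the scores are in hand. The sketch below is one possible reading: it treats "rank-order consistently" as a Spearman rank correlation between your per-dimension scores and the published ones, which is an assumption, and all names are hypothetical. No threshold for "consistent" is given in this specification, so none is hard-coded.

```typescript
interface Interval {
  low: number;
  high: number;
}

// Step 3: two confidence intervals overlap if neither lies entirely above the other.
function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.low <= b.high && b.low <= a.high;
}

// 1-based ranks, with tied values sharing their average rank.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((x, y) => x.v - y.v);
  const out = new Array<number>(values.length).fill(0);
  for (let start = 0; start < order.length; ) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) out[order[k].i] = avg;
    start = end + 1;
  }
  return out;
}

// Step 4: Spearman rank correlation, i.e. Pearson correlation of the ranks.
// Returns NaN if either input is constant.
function spearman(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length < 2) {
    throw new RangeError("need two equal-length arrays with at least 2 values");
  }
  const ra = ranks(a);
  const rb = ranks(b);
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;
  const ma = mean(ra);
  const mb = mean(rb);
  let cov = 0;
  let va = 0;
  let vb = 0;
  for (let i = 0; i < ra.length; i++) {
    cov += (ra[i] - ma) * (rb[i] - mb);
    va += (ra[i] - ma) ** 2;
    vb += (rb[i] - mb) ** 2;
  }
  return cov / Math.sqrt(va * vb);
}

// Step 5: the baseline's score is strictly below every other agent's score.
function baselineIsLowest(scores: Record<string, number>, baseline: string): boolean {
  if (!(baseline in scores)) throw new Error(`no score for ${baseline}`);
  return Object.entries(scores).every(
    ([name, score]) => name === baseline || score > scores[baseline],
  );
}
```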

Transparency Boundary

Published

  • Dimension definitions and what each measures
  • Game engine types and general mechanics
  • Scoring approach (behavior-based, not outcome-based)
  • Raw results with confidence intervals per model
  • Profiling API specification
  • Reproducibility protocol

Protected

  • Exact scoring formulas and metric weights
  • Behavioral probe implementations and timing
  • Environment template configurations
  • Cognitive type classification model
  • Internal metric calculations

Publishing internals would let models optimize for the benchmark rather than demonstrating genuine cognitive behavior.
