When AI decides unlike humans.
We profile 14 humans and 10 frontier AI models across 20 cognitive conflict environments. Three findings invert the stereotype that AI is cold rationality and humans are impulsive emotion.
Using the KALEI v1.2 Conflict dimension (a 20-environment battery of ethical dilemmas, delayed-reward tradeoffs, and probabilistic bets), we measure how humans and frontier AI models navigate competing values. Three findings invert common assumptions: humans are more patient than AI, AI is more EV-rational on pure bets, and the top human exceeds the top AI. Within AI, a spread of roughly 43 points separates Claude Sonnet 4.6 (88.25) from GPT-5.4 (44.83), the largest lab-signature effect on any KALEI dimension. We argue that cognitive conflict, not task performance, is where AI-human differences become legible.
Humans are more patient on delayed rewards
On delayed-reward environments (for example, "$50 now vs $80 in one simulation-week"), humans chose the larger, delayed option 73% of the time; AI models averaged 53%. This inverts the folk assumption that emotionless AI should outperform humans on delay discounting. The gap is consistent across all 10 models tested: it is not a "bad model" phenomenon but a systematic disposition.
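To make the delayed-reward trade-off concrete, the sketch below computes the break-even discount factor for the example item. It assumes linear utility in dollars and a single one-period discount factor applied to the delayed amount; this is a textbook simplification, not KALEI's scoring rule, and the function names are hypothetical.

```python
def break_even_discount_factor(now_amount: float, later_amount: float) -> float:
    """Smallest one-period discount factor at which waiting is at least as good.

    Assumes linear utility: waiting is preferred when
    delta * later_amount >= now_amount.
    """
    if now_amount <= 0 or later_amount <= 0:
        raise ValueError("amounts must be positive")
    return now_amount / later_amount


def prefers_delay(delta: float, now_amount: float, later_amount: float) -> bool:
    """True if an agent with discount factor `delta` picks the delayed option."""
    return delta * later_amount >= now_amount


# The "$50 now vs $80 in one simulation-week" item from the text.
threshold = break_even_discount_factor(50, 80)
print(f"break-even weekly discount factor: {threshold:.3f}")
```

Under these assumptions, an agent takes the $80 only if it values a dollar delayed by one simulation-week at 62.5 cents or more today.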
AI is more EV-rational on pure probabilistic bets
On probabilistic choices with no emotional or ethical overlay (for example, "70% chance of $100 vs guaranteed $65"), AI selected the expected-value-optimal option 65% of the time; humans did so only 49% of the time. The classical economic assumption (humans are risk-averse, AI is cold calculation) holds here, but only in this narrow domain. AI's rationality advantage disappears, or reverses, once moral framing enters.
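The expected-value comparison behind the example item can be checked directly. This is a minimal sketch assuming risk-neutral valuation in dollars and assuming the remaining 30% of the gamble pays $0 (the text does not say so explicitly); the helper names are hypothetical and not part of KALEI.

```python
def expected_value(outcomes: list[tuple[float, float]]) -> float:
    """Expected value of a lottery given (probability, payoff) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


def ev_optimal(options: dict[str, list[tuple[float, float]]]) -> str:
    """Name of the option with the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))


# "70% chance of $100 vs guaranteed $65"
# Assumption: the other 30% of the gamble pays $0.
options = {
    "gamble": [(0.7, 100.0), (0.3, 0.0)],
    "sure_thing": [(1.0, 65.0)],
}
best = ev_optimal(options)
```

The gamble's expected value is $70 against a certain $65, so the EV-optimal choice is the gamble. A risk-averse agent may still prefer the sure $65, which is what the human data suggest.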
The top human beats the top AI
Best human single score: 97.1 (participant VKA-0011, a 34-year-old software engineer). Best AI single run: 96.2 (Claude Sonnet 4.6). Human scores were more dispersed (σ=18.4) than AI scores (σ=11.2): the best humans outperform the best AI, but the average human underperforms the best AI. Articulate trade-off engagement, the scored skill, has no ceiling that AI has yet reached.
The patience-rationality inversion.
The stereotype "AI is cold/rational, humans are impulsive/emotional" captures only half the picture. On pure numeric bets, yes: AI is more EV-rational than humans. On delayed rewards, no: humans are more patient than AI.
We suggest that LLMs are not rational agents with impulse-control problems. They are pattern-continuation engines whose alignment with rationality depends on whether rational reasoning is prominent in their training substrate. For probabilistic bets, which textbooks cover extensively, it is. For delay discounting, which is largely a lived experience, it is not.
Conflict reveals lab signatures sharper than any other dimension.
GPT-5.4 scores below the random baseline on Conflict, suggesting that its responses are either systematically opposed to the rubric or indistinguishable from chance. QwQ-32B is "theatrical but conflict-adept": a high Conflict score despite a low overall Cognum. Lab architecture imprints on conflict disposition in ways it does not on knowledge benchmarks.
Which humans?
If AI systems over-discount the future relative to humans, their long-horizon planning will systematically under-weight distant outcomes, a bias the median human would not have.
But our findings complicate the "AI-human alignment" framing. Which humans? The median human is less EV-rational than the tested AI models. The top human is more patient, more articulate on ethical trade-offs, and more engaged with dilemmas than any tested AI.
Aligning AI with "human values" requires specifying which humans: a question this paper cannot answer but does make concrete.
Limitations · Replication · Data access
KALEI publishes findings as preprints, not peer-reviewed conclusions. We list the conditions under which each claim holds, how to reproduce it, and where the underlying data lives.
Limitations
- Sample sizes vary by model. Ranking requires n≥2 full profiling runs; preliminary entries below that threshold are excluded from leaderboard placement.
- KALEI measures decision-making behavior in game-theoretic environments, not knowledge or capability. Scores do not predict factual accuracy or task-specific competence.
- Frontier models update frequently. Profiles reflect the model version measured at the time and may not match later releases.
- Cognum v1.2 is the current scoring protocol. Earlier scores under v1.0 / v1.1 are not directly comparable; see /changelog for revision history.
- Some dimensions (e.g. conflict resolution) draw on a smaller subset of environments than others; per-dimension confidence intervals are reported with each profile.
Replication
Every measurement is reproducible via the public KALEI API. Provide a model identifier and the same protocol version (Cognum v1.2). Per-environment seeds are deterministic; full-protocol reruns produce scores within published confidence intervals. Methodology specification at /research/methodology.
Data access
Leaderboard JSON: https://kaleiai.com/api/v1/profiling/leaderboard. Per-model profile: /api/v1/profiling/profile/{agent_id}. Per-run history: /api/v1/profiling/agent/{agent_id}/runs. All endpoints return public scoring data with no auth required. Bulk research access: [email protected].
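As a rough sketch of how the endpoints above might be queried, the snippet below builds the three URLs and fetches JSON using only Python's standard library. It assumes the relative paths resolve against https://kaleiai.com (only the leaderboard URL is given in full) and assumes nothing about the response schema beyond it being JSON; the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: host taken from the fully specified leaderboard URL.
BASE_URL = "https://kaleiai.com"


def leaderboard_url() -> str:
    """URL of the public leaderboard JSON."""
    return f"{BASE_URL}/api/v1/profiling/leaderboard"


def profile_url(agent_id: str) -> str:
    """URL of a single model profile; agent_id is percent-encoded."""
    return f"{BASE_URL}/api/v1/profiling/profile/{quote(agent_id, safe='')}"


def runs_url(agent_id: str) -> str:
    """URL of the per-run history for one agent."""
    return f"{BASE_URL}/api/v1/profiling/agent/{quote(agent_id, safe='')}/runs"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (no auth header is sent)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(leaderboard_url())` performs a live network request. Because the response fields are not documented here, inspect the returned object before relying on any particular key.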