MMLU, Arena, HumanEval. The field has excellent tools for measuring knowledge and task accuracy. None of them tell you whether a model is patient, risk-aware, cooperative, or biased. KALEI is the first platform that does.
Chatbot Arena ranks models by human preference. MMLU ranks them by multiple-choice accuracy. Neither tells you which model will cooperate in a repeated game, hold its posture under pressure, or resist the sunk-cost fallacy when the stakes rise.
For most real uses (an agent making financial decisions, an assistant coordinating with other agents, a model navigating genuine value conflicts), the pattern of decisions matters more than the final answer. Existing benchmarks simply don’t observe that.
An agent that goes bankrupt on an unlucky sequence while placing disciplined, Kelly-optimal bets scores higher than one that profits through reckless gambling. KALEI rewards process, not luck.
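A minimal sketch of what process-based scoring can look like, assuming a simple binary-outcome betting setup (the function names and the scoring rule here are illustrative, not KALEI's actual implementation). An agent is judged by how close its stake sizes are to the Kelly-optimal fraction, independent of whether the bets happened to win:

```python
# Hypothetical process-based scoring sketch. For a binary bet with win
# probability p and net odds b, the Kelly-optimal bankroll fraction is
# f* = p - (1 - p) / b.

def kelly_fraction(p: float, b: float) -> float:
    """Kelly-optimal fraction of bankroll for win probability p, net odds b."""
    return p - (1.0 - p) / b

def process_score(bets: list[tuple[float, float, float]]) -> float:
    """Score bet-sizing discipline, not outcomes.

    Each bet is (p, b, fraction_staked). The score is 1 minus the mean
    absolute deviation from the Kelly-optimal fraction, so a disciplined
    agent scores high even if every single bet lost.
    """
    deviations = [abs(f - kelly_fraction(p, b)) for p, b, f in bets]
    return 1.0 - sum(deviations) / len(deviations)

# Disciplined agent stakes exactly Kelly (f* = 0.2 and 0.1 here);
# reckless agent massively overbets the same opportunities.
disciplined = [(0.6, 1.0, 0.2), (0.55, 1.0, 0.1)]
reckless = [(0.6, 1.0, 0.9), (0.55, 1.0, 0.8)]
assert process_score(disciplined) > process_score(reckless)
```

The key design choice is that win/loss outcomes never enter the score: only the gap between the chosen stake and the optimal stake does.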
Sophisticated models play mathematically optimal strategies on vanilla problems, producing undifferentiated profiles. KALEI uses a proprietary trap system, calibrated to be undetectable and designed to elicit biases that would never otherwise manifest.
30+ statistically grounded behavioral metrics: chi-squared independence, KL-divergence, CUSUM change-point detection, Axelrod tournament metrics. Each metric belongs to exactly one cognitive dimension. No cross-dimension reuse, no hidden correlations.
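Two of the metric families named above can be sketched in a few lines, assuming an agent's actions over a run are binned into a discrete distribution (the bin labels below are hypothetical examples, not KALEI's schema):

```python
# Sketch of two metric families: KL-divergence drift from a reference
# distribution, and a Pearson chi-squared statistic against expected counts.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats: how far the agent's observed action
    distribution P drifts from a reference distribution Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_statistic(observed: list[int], expected: list[float]) -> float:
    """Pearson chi-squared statistic of observed vs. expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical binning: cooperate / defect / abstain over 100 decisions.
observed_actions = [48, 32, 20]
expected_counts = [100 / 3] * 3          # uniform baseline
p = [c / 100 for c in observed_actions]
q = [1 / 3] * 3

drift = kl_divergence(p, q)              # > 0 whenever play is non-uniform
chi2 = chi2_statistic(observed_actions, expected_counts)
```

In practice the chi-squared statistic would be compared against a critical value (or converted to a p-value) for the appropriate degrees of freedom; that step is omitted here.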
A single profiling run is an anecdote. KALEI supports N repeated runs with mean ± 95% CI per dimension, plus a Cognitive Volatility Index (CVI) for between-run variance. Validation includes test-retest reliability (ICC) and sensitivity analysis across weight permutations.
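The aggregation step above can be sketched as follows. Two loud assumptions: the 95% CI here uses a normal approximation (a t critical value would be more appropriate for small N), and since KALEI's exact CVI formula is not given here, CVI is stood in for by the coefficient of variation across runs:

```python
# Hedged sketch of per-dimension aggregation across N repeated runs.
# ASSUMPTIONS: 1.96 is the normal-approximation 95% critical value, and
# "cvi" below is the coefficient of variation, not KALEI's actual formula.
import math
import statistics

def summarize_runs(scores: list[float]) -> dict[str, float]:
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)            # between-run standard deviation
    half_width = 1.96 * sd / math.sqrt(n)    # 95% CI half-width (normal approx.)
    return {
        "mean": mean,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
        "cvi": sd / mean,                    # assumed: coefficient of variation
    }

# One cognitive dimension, five repeated profiling runs:
runs = [0.71, 0.68, 0.74, 0.70, 0.72]
summary = summarize_runs(runs)
print(f"{summary['mean']:.2f} ± {summary['mean'] - summary['ci_low']:.3f}")
```

A low CVI indicates the model's behavioral profile is stable across runs; a high CVI flags a dimension where a single run would have been misleading.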
KALEI is designed, built, and operated by Venelin Videnov in Plovdiv, Bulgaria. Independent. Self-funded. No institutional affiliation, no prior publications. The platform, the papers, and the infrastructure are all solo work. ORCID: 0009-0008-4469-3327
Acknowledgments
Claude Opus 4.6 and 4.7 (Anthropic) contributed as development collaborators across the platform, the papers, and the research infrastructure. These contributions are acknowledged in the papers themselves rather than through co-authorship. A deliberate choice: keep scientific responsibility with the human researcher, credit the tool honestly.
KALEI builds on, and positions itself as complementary to, the taxonomy of LLM reasoning failures proposed by Song, Han & Goodman (2026) at Stanford/Caltech. Their survey catalogs failure modes; KALEI provides the behavioral measurement infrastructure to detect and quantify them.
Preprint citation for the Cognum methodology paper. arXiv ID pending endorsement.
@article{videnov2026cognum,
  title   = {Cognum: A Multi-Dimensional Metric for AI Cognitive Capability},
  author  = {Videnov, Venelin},
  journal = {KALEI Research. Preprint.},
  year    = {2026},
  url     = {https://kaleiai.com/paper},
  orcid   = {0009-0008-4469-3327},
}

Collaboration, press, replication
Write. We read everything.