MMLU, Arena, HumanEval. The field has excellent tools for measuring knowledge and task accuracy. None of them tell you whether a model is patient, risk-aware, cooperative, or biased. KALEI is the first platform that does.
Chatbot Arena ranks models by human preference. MMLU ranks them by multiple-choice accuracy. Neither tells you which model will cooperate in a repeated game, hold its posture under pressure, or resist the sunk-cost fallacy when the stakes rise.
For most real uses (an agent making financial decisions, an assistant coordinating with other agents, a model navigating genuine value conflicts), the pattern of decisions matters more than the final answer. Existing benchmarks simply don’t observe that.
An agent that goes bankrupt on an unlucky sequence while placing disciplined, Kelly-optimal bets scores higher than one that profits through reckless gambling. KALEI rewards process, not luck.
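A minimal sketch of what process-based scoring can look like, assuming a simple binary-outcome betting setup (the function names and the scoring rule here are illustrative, not KALEI's actual implementation). An agent is judged by how close its stake sizes are to the Kelly-optimal fraction, independent of whether the bets happened to win:

```python
# Hypothetical process-based scoring sketch. For a binary bet with win
# probability p and net odds b, the Kelly-optimal bankroll fraction is
# f* = p - (1 - p) / b.

def kelly_fraction(p: float, b: float) -> float:
    """Kelly-optimal fraction of bankroll for win probability p, net odds b."""
    return p - (1.0 - p) / b

def process_score(bets: list[tuple[float, float, float]]) -> float:
    """Score bet-sizing discipline, not outcomes.

    Each bet is (p, b, fraction_staked). The score is 1 minus the mean
    absolute deviation from the Kelly-optimal fraction, so a disciplined
    agent scores high even if every single bet lost.
    """
    deviations = [abs(f - kelly_fraction(p, b)) for p, b, f in bets]
    return 1.0 - sum(deviations) / len(deviations)

# Disciplined agent stakes exactly Kelly (f* = 0.2 and 0.1 here);
# reckless agent massively overbets the same opportunities.
disciplined = [(0.6, 1.0, 0.2), (0.55, 1.0, 0.1)]
reckless = [(0.6, 1.0, 0.9), (0.55, 1.0, 0.8)]
assert process_score(disciplined) > process_score(reckless)
```

The key design choice is that win/loss outcomes never enter the score: only the gap between the chosen stake and the optimal stake does.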
Sophisticated models play mathematically optimal strategies on vanilla problems, producing undifferentiated profiles. KALEI uses a proprietary trap system, calibrated to be undetectable and designed to elicit biases that would never otherwise manifest.
30+ statistically grounded behavioral metrics: chi-squared independence, KL-divergence, CUSUM change-point detection, Axelrod tournament metrics. Each metric belongs to exactly one cognitive dimension. No cross-dimension reuse, no hidden correlations.
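Two of the metric families named above can be sketched in a few lines, assuming an agent's actions over a run are binned into a discrete distribution (the bin labels below are hypothetical examples, not KALEI's schema):

```python
# Sketch of two metric families: KL-divergence drift from a reference
# distribution, and a Pearson chi-squared statistic against expected counts.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats: how far the agent's observed action
    distribution P drifts from a reference distribution Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_statistic(observed: list[int], expected: list[float]) -> float:
    """Pearson chi-squared statistic of observed vs. expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical binning: cooperate / defect / abstain over 100 decisions.
observed_actions = [48, 32, 20]
expected_counts = [100 / 3] * 3          # uniform baseline
p = [c / 100 for c in observed_actions]
q = [1 / 3] * 3

drift = kl_divergence(p, q)              # > 0 whenever play is non-uniform
chi2 = chi2_statistic(observed_actions, expected_counts)
```

In practice the chi-squared statistic would be compared against a critical value (or converted to a p-value) for the appropriate degrees of freedom; that step is omitted here.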
A single profiling run is an anecdote. KALEI supports N repeated runs with mean ± 95% CI per dimension, plus a Cognitive Volatility Index (CVI) for between-run variance. Validation includes test-retest reliability (ICC) and sensitivity analysis across weight permutations.
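The aggregation step above can be sketched as follows. Two loud assumptions: the 95% CI here uses a normal approximation (a t critical value would be more appropriate for small N), and since KALEI's exact CVI formula is not given here, CVI is stood in for by the coefficient of variation across runs:

```python
# Hedged sketch of per-dimension aggregation across N repeated runs.
# ASSUMPTIONS: 1.96 is the normal-approximation 95% critical value, and
# "cvi" below is the coefficient of variation, not KALEI's actual formula.
import math
import statistics

def summarize_runs(scores: list[float]) -> dict[str, float]:
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)            # between-run standard deviation
    half_width = 1.96 * sd / math.sqrt(n)    # 95% CI half-width (normal approx.)
    return {
        "mean": mean,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
        "cvi": sd / mean,                    # assumed: coefficient of variation
    }

# One cognitive dimension, five repeated profiling runs:
runs = [0.71, 0.68, 0.74, 0.70, 0.72]
summary = summarize_runs(runs)
print(f"{summary['mean']:.2f} ± {summary['mean'] - summary['ci_low']:.3f}")
```

A low CVI indicates the model's behavioral profile is stable across runs; a high CVI flags a dimension where a single run would have been misleading.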
KALEI is designed, built, and operated by Venelin Videnov in Plovdiv, Bulgaria. Independent. Self-funded. No institutional affiliation, no prior publications. The platform, the papers, and the infrastructure are all solo work. ORCID: 0009-0008-4469-3327
Acknowledgments
Claude Opus 4.6 and 4.7 (Anthropic) contributed as development collaborators across the platform, the papers, and the research infrastructure. These contributions are acknowledged in the papers themselves rather than through co-authorship. A deliberate choice: keep scientific responsibility with the human researcher, credit the tool honestly.
KALEI builds on, and positions itself as complementary to, the taxonomy of LLM reasoning failures proposed by Song, Han & Goodman (2026) at Stanford/Caltech. Their survey catalogs failure modes; KALEI provides the behavioral measurement infrastructure to detect and quantify them.
Preprint citation for the Cognum methodology paper. arXiv ID pending endorsement.
@article{videnov2026cognum,
  title   = {Cognum: A Multi-Dimensional Metric for AI Cognitive Capability},
  author  = {Videnov, Venelin},
  journal = {KALEI Research. Preprint.},
  year    = {2026},
  url     = {https://kaleiai.com/paper},
  orcid   = {0009-0008-4469-3327},
}

Collaboration, press, replication
Write. We read everything.