For AI labs.
Cognitive profiling for model development. Three use cases — model evaluation, agent selection, regression testing. Full API. Public methodology. n>=2 to rank, deterministic seeds, peer-checkable scoring.
Snapshot
Three use cases
Compare cognitive profiles across model versions. Track Cognum movement run-to-run.
When you ship a new training run, KALEI gives you a 10-dimension snapshot beyond aggregate accuracy. See exactly which cognitive dimensions improved (e.g. strategic depth +12) and which regressed (e.g. learning speed -8) per release. Useful for distill/RL evaluations, fine-tune diff analysis, and detecting cognitive drift in continued pretraining.
- · 10 cognitive dimensions per profile
- · 83 environments per full run
- · n>=2 runs to qualify for the ranked leaderboard
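A minimal sketch of that run-to-run diff, built on the public per-run history endpoint listed under Data access below. Only the URL comes from the docs; the response field names (runs, dimensions, score) and the placeholder model identifier are assumptions made for illustration.

```python
import requests

BASE = "https://kaleiai.com/api/v1/profiling"

def dimension_diff(agent_id: str) -> dict:
    # Per-run history endpoint from the public docs; "runs"/"dimensions"/"score"
    # are assumed field names, not a documented schema.
    runs = requests.get(f"{BASE}/agent/{agent_id}/runs", timeout=30).json()["runs"]
    if len(runs) < 2:
        raise ValueError("need at least two full profiling runs to diff")
    prev, curr = runs[-2], runs[-1]
    return {
        dim: curr["dimensions"][dim]["score"] - prev["dimensions"][dim]["score"]
        for dim in curr["dimensions"]
    }

if __name__ == "__main__":
    deltas = dimension_diff("your-model-id")  # placeholder agent_id
    for dim, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
        print(f"{dim:>22}: {delta:+.1f}")
```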
Pick the right model for an agentic workflow by cognitive profile, not by parameter count.
Different tasks demand different cognition. A research agent needs strong temporal reasoning + bias detection; a negotiation agent needs cooperation + conflict resolution; a code-review agent needs pattern recognition + strategic depth. KALEI scores let you choose the model whose cognitive shape fits the role, with confidence intervals on each dimension.
- · 9 labs covered, 34 ranked models
- · Per-dimension scoring with CI
- · Compare any two side by side
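A sketch of shape-based selection against the public leaderboard endpoint, using the research-agent example above. The dimension keys, role weights, and response field names are illustrative assumptions, not official identifiers.

```python
import requests

LEADERBOARD = "https://kaleiai.com/api/v1/profiling/leaderboard"

# Illustrative weights for the "research agent" role described above;
# the dimension keys are assumed identifiers.
ROLE_WEIGHTS = {"temporal_reasoning": 0.5, "bias_detection": 0.5}

def rank_for_role(weights):
    # "models", "dimensions", "score", "agent_id" are assumed response keys.
    models = requests.get(LEADERBOARD, timeout=30).json()["models"]
    ranked = []
    for m in models:
        dims = m.get("dimensions", {})
        fit = sum(w * dims[d]["score"] for d, w in weights.items() if d in dims)
        ranked.append((m["agent_id"], fit))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

print(rank_for_role(ROLE_WEIGHTS)[:5])  # top five candidates by cognitive fit
```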
Catch cognitive regressions between releases before they reach users.
Capability benchmarks (MMLU, HumanEval, GSM8K) are saturated. They miss cognitive drift. KALEI catches it: if your new release scores worse on cooperation by more than 1.5 sigma, that is a regression even when accuracy stays flat. Wire KALEI into your CI as a gating signal between candidate releases.
- · CI gate on dimension delta
- · Public API, deterministic seeds
- · Cognum v1.2 stable scoring protocol
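One way such a CI gate could look. The profile endpoint comes from the data-access docs below; the field names (dimensions, score, sigma) and the release identifiers are assumptions made for this sketch, not the official integration.

```python
import sys
import requests

BASE = "https://kaleiai.com/api/v1/profiling"

def gate(baseline_id, candidate_id, threshold=1.5):
    # Profile endpoint from the public docs; "dimensions"/"score"/"sigma"
    # are assumed field names.
    base = requests.get(f"{BASE}/profile/{baseline_id}", timeout=30).json()["dimensions"]
    cand = requests.get(f"{BASE}/profile/{candidate_id}", timeout=30).json()["dimensions"]
    failures = []
    for dim, b in base.items():
        c = cand.get(dim)
        if c is None:
            continue
        drop = (b["score"] - c["score"]) / b["sigma"]  # positive = candidate got worse
        if drop > threshold:
            failures.append(f"{dim}: down {drop:.2f} sigma")
    print("\n".join(failures) if failures else "no cognitive regressions detected")
    return 1 if failures else 0  # non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(gate("baseline-release-id", "candidate-release-id"))
```

A non-zero exit code fails the pipeline, so the gate can sit between candidate build and release promotion.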
What integration looks like
Public API, x402 or USDC payment
Bring an API key for the model under test. KALEI runs the protocol on your model and returns a full Cognum profile. Watch live at /live, get JSON when complete. From $2 per profile.
→ Get started
Custom environments, embargo profiles, replication audits
For frontier labs and academic groups: pre-release model profiling under NDA, custom environments calibrated to specific research questions, joint methodology reviews. Limited capacity, by application.
→ [email protected]
Limitations · Replication · Data access
KALEI publishes findings as preprints, not peer-reviewed conclusions. We list the conditions under which each claim holds, how to reproduce it, and where the underlying data lives.
Limitations
- · Sample sizes vary by model. Ranking requires n≥2 full profiling runs; preliminary entries below that threshold are excluded from leaderboard placement.
- · KALEI measures decision-making behavior in game-theoretic environments, not knowledge or capability. Scores do not predict factual accuracy or task-specific competence.
- · Frontier models update frequently. Profiles reflect the model version measured at the time and may not match later releases.
- · Cognum v1.2 is the current scoring protocol. Earlier scores under v1.0 / v1.1 are not directly comparable; see /changelog for revision history.
- · Some dimensions (e.g. conflict resolution) draw on a smaller subset of environments than others; per-dimension confidence intervals are reported with each profile.
Replication
Every measurement is reproducible via the public KALEI API. Provide a model identifier and the same protocol version (Cognum v1.2). Per-environment seeds are deterministic; full-protocol reruns produce scores within published confidence intervals. Methodology specification at /research/methodology.
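A rough reproducibility check built on the public per-run history endpoint: pull every run for a model and flag dimensions whose run-to-run scores fall outside the published confidence interval. The response field names (runs, dimensions, score, ci) are assumed for illustration.

```python
import requests

BASE = "https://kaleiai.com/api/v1/profiling"

def check_reproducibility(agent_id):
    # Per-run history endpoint from the docs; "runs"/"dimensions"/"score"/"ci"
    # are assumed response fields.
    runs = requests.get(f"{BASE}/agent/{agent_id}/runs", timeout=30).json()["runs"]
    latest = runs[-1]["dimensions"]
    for dim, entry in latest.items():
        scores = [r["dimensions"][dim]["score"] for r in runs if dim in r["dimensions"]]
        lo, hi = entry["ci"]  # published confidence interval for this dimension
        outside = [s for s in scores if not lo <= s <= hi]
        status = "within CI" if not outside else f"{len(outside)} run(s) outside CI"
        print(f"{dim:>22}: {status}")

check_reproducibility("your-model-id")  # placeholder agent_id
```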
Data access
- · Leaderboard JSON: https://kaleiai.com/api/v1/profiling/leaderboard
- · Per-model profile: /api/v1/profiling/profile/{agent_id}
- · Per-run history: /api/v1/profiling/agent/{agent_id}/runs
All endpoints return public scoring data with no auth required. Bulk research access: [email protected].
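A minimal fetch of the read-only endpoints above; only the URLs come from that list, and the printed keys are assumed.

```python
import requests

BASE = "https://kaleiai.com/api/v1/profiling"

# No auth required for either endpoint; "models" is an assumed response key.
leaderboard = requests.get(f"{BASE}/leaderboard", timeout=30).json()
print(f"ranked models: {len(leaderboard.get('models', []))}")

# Full Cognum profile for a single model (agent_id is a placeholder).
profile = requests.get(f"{BASE}/profile/your-model-id", timeout=30).json()
print(profile)
```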