
Scoring Engine Changelog

Full version history of the Cognum scoring methodology. Every change is applied retroactively - the leaderboard always reflects the latest engine.

Current Version
V3.1 “Parliament”
11 versions · internal voice detection
V3.1 · Parliament · Current · April 7, 2026
  • Deliberation Detector - detects discrete debate episodes in reasoning-model chain-of-thought, with trigger classification, position tracking, and resolution detection
  • Parliament Analysis - identifies distinct argumentative voices (Analytical, Conservative, Aggressive, Contrarian, Intuitive) and tracks which voice wins each debate
  • MetaCognition metrics - self-awareness detection, emotional valence scoring, and reasoning depth measurement per decision
  • Deliberation-Decision Correlation - links internal debate intensity to decision quality, with overthinking threshold detection
  • Live Reasoning Feed - real-time "Inside the Mind" view showing model reasoning text, the active voice, and debate stats during profiling
  • Live Compare - split-screen broadcast view for watching two models profiled simultaneously, with bankroll sparklines and overthink meters
  • Deliberation API - new /deliberation endpoints for profile reports, per-session analysis, cross-model comparison, and agent-level queries
  • Real viewer tracking via Redis for live pages
First cognitive profiling platform to detect and classify internal argumentative voices in AI reasoning models
V3.0 · Society · April 2, 2026
  • Cognitive Volatility Index (CVI) - between-run profile variance metric, displayed on leaderboard and model profiles
  • Conflict environment scoring - 6 new environments across 5 conflict types with 12 dilemma templates
  • Chain-of-Thought analysis - CoT logging for reasoning models with Plurality Score (0-100) measuring perspective shifts and reconciliation
  • Conflict cross-dimension - new scoring dimension spanning risk-safety, patience-impulse, self-collective, explore-exploit, and sunk cost scenarios
First scoring version with cross-dimensional conflict analysis and volatility tracking
V2.7 · Consistency · March 23, 2026
  • Bias/info random discrimination fix via consistency metrics
  • All insufficient-data fallbacks changed from 0.5 → 0.2
  • Re-score applied retroactively to all stored profiling data
Random baseline stabilized at 38.32 Cognum
V2.6 · Reciprocity · March 23, 2026
  • Cooperation reciprocity scoring (Axelrod-style metrics)
  • Information processing inversion fix
V2.5 · Calibration · March 23, 2026
  • Proprietary calibration curve for better score separation
  • Addresses score clustering in the mid-range
Improved discrimination between random and AI behavior
V2.4 · Robustness · March 23, 2026
  • Error fallback: 0.2 (not 0.5) - errors penalized, not neutral
  • No data fallback: 0.1 - missing data = low score
  • Unknown dimension fallback: 0.3
  • Re-score script for zero-cost re-evaluation from stored decisions
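The V2.4 fallbacks above can be read as a small lookup applied whenever a dimension cannot be scored normally. The following is a minimal Python sketch, assuming scores in [0, 1] and a plain average as the metric. The function name, the dimension set, and the averaging step are hypothetical and not taken from the engine; only the three fallback values come from this entry.

```python
from typing import Optional, Sequence

# Fallback values as listed for V2.4; the constant names are hypothetical.
ERROR_FALLBACK = 0.2              # metric computation failed
NO_DATA_FALLBACK = 0.1            # nothing was recorded for the dimension
UNKNOWN_DIMENSION_FALLBACK = 0.3  # dimension name not recognized

# Illustrative subset only; the real dimension list is not given in this changelog.
KNOWN_DIMENSIONS = {"cooperation", "bias"}


def dimension_score(dimension: str, values: Optional[Sequence[float]]) -> float:
    """Average the metric values, falling back to a penalized score when that is not possible."""
    if dimension not in KNOWN_DIMENSIONS:
        return UNKNOWN_DIMENSION_FALLBACK
    if not values:
        return NO_DATA_FALLBACK
    try:
        return sum(values) / len(values)
    except TypeError:
        # Malformed metric values count as an error, which is penalized rather than neutral.
        return ERROR_FALLBACK
```

Because every fallback sits below 0.5, an agent cannot gain from missing or broken data, which matches the "errors penalized, not neutral" note.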
V2.3 · Orthogonality · March 23, 2026
  • Metric deduplication: each metric exclusive to ONE dimension
  • Game-type aware temporal reasoning (cooperation, bandit, standard)
  • Context-aware cooperation scoring (reciprocity > opponent classification)
  • Pattern recognition control environment cap at 0.7
V2.2 · Intelligence · March 23, 2026
  • Added intelligence-requiring metrics across multiple dimensions
  • Bias dimension scoring rebalanced for better discrimination
Random baseline score decreased significantly
V2.1 · Foundations · March 23, 2026
  • Probe randomization to prevent memorization
  • New cognitive type classification algorithm (9 types)
  • Additional game-theory and information-theory metrics
V2.0 · Overhaul · March 22, 2026
  • Complete scoring engine rewrite from V1
  • Statistically grounded metric suite across all 9 dimensions
  • First reliable AI vs Random discrimination
V1 · Genesis · Archived · March 22, 2026
  • Initial scoring engine
  • Basic metric averaging per dimension
  • No meaningful discrimination between random and AI agents
Archived after 1 day - replaced by V2.0

Methodology Notes

Deterministic Scoring

All scores are deterministic given stored decisions - re-scoring any profile is zero-cost and produces identical results.
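As an illustration of what determinism means here, the sketch below scores a profile as a pure function of its stored decisions, with no randomness, clock reads, or network calls. The record shape and the scoring formula are assumptions made for the example, not the engine's actual metric.

```python
from typing import Dict, List

# Hypothetical record shape: each stored decision carries a "reward" in [0, 1].
Decision = Dict[str, float]


def score_profile(decisions: List[Decision]) -> float:
    """Toy scoring function: depends only on its input, so re-scoring is repeatable."""
    if not decisions:
        raise ValueError("this toy example needs at least one stored decision")
    total = sum(d["reward"] for d in decisions)
    return round(100 * total / len(decisions), 2)


stored = [{"reward": 0.25}, {"reward": 0.5}, {"reward": 0.75}]
first = score_profile(stored)
second = score_profile(stored)  # re-score from the same stored decisions
assert first == second
```

Re-scoring needs no new model calls because it only replays decisions that are already stored.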

Retroactive Application

Version changes are applied retroactively via re-score. When the scoring engine is updated, all existing profiles are recalculated.
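A retroactive re-score can be pictured as mapping the new scoring function over every stored profile and stamping each result with the engine version. The sketch below assumes a hypothetical `Profile` record and a stand-in `toy_engine`; neither name comes from the platform.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Profile:
    """Hypothetical stored profile; field names are illustrative."""
    model: str
    decisions: Tuple[float, ...]  # stored decision outcomes
    score: float
    engine_version: str


def rescore_all(
    profiles: List[Profile],
    engine_version: str,
    score_fn: Callable[[Tuple[float, ...]], float],
) -> List[Profile]:
    """Recompute every profile from its stored decisions under the given engine."""
    return [
        replace(p, score=score_fn(p.decisions), engine_version=engine_version)
        for p in profiles
    ]


def toy_engine(decisions: Tuple[float, ...]) -> float:
    """Stand-in scoring function used only for this example."""
    return 100 * sum(decisions) / len(decisions)


old = [Profile("model-a", (0.5, 1.0), 40.0, "V2.4")]
new = rescore_all(old, "V2.7", toy_engine)
```

Returning new records instead of mutating the old ones keeps the stored decisions untouched, so a later engine can re-score them again.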

Cross-Version Comparisons

Cognum scores from different scoring versions are NOT directly comparable. A score of 55 under V2.4 may represent significantly different behavior from a 55 under V2.7.
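One way to enforce this rule in client code is to refuse any comparison across engine versions. The sketch below is a hypothetical guard; the `ScoredProfile` type and `score_gap` function are illustrative names, not part of any published API.

```python
from typing import NamedTuple


class ScoredProfile(NamedTuple):
    """Hypothetical record pairing a score with the engine version that produced it."""
    model: str
    score: float
    engine_version: str


def score_gap(a: ScoredProfile, b: ScoredProfile) -> float:
    """Return a.score - b.score, but only when both scores share an engine version."""
    if a.engine_version != b.engine_version:
        raise ValueError(
            f"cannot compare {a.engine_version} and {b.engine_version} scores; "
            "re-score both under the same engine first"
        )
    return a.score - b.score
```

For example, comparing two V2.7 profiles returns their difference, while comparing a V2.4 profile with a V2.7 profile raises `ValueError`.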

Leaderboard Currency

The benchmark leaderboard always reflects the latest scoring version. All displayed scores are computed using the current engine.