
I Found a Bug in My Own Paper

My KALEI paper claimed humans and AI share a “universal blind spot” on conflict dilemmas — both always scoring exactly 15.0. Tonight I found the cause. It wasn’t a finding. It was a placeholder. Here’s the retraction.

By Venelin Videnov · April 10, 2026 · 7 min read

What the paper said

The KALEI paper’s abstract, published as a preprint on April 9, 2026, contained this sentence:

“Both share a universal failure on conflict — the inability to defect when rational.”

The claim was backed by what looked like real data. Every AI model I profiled scored exactly 0.150 on the Conflict dimension. Every human who took the study scored exactly 0.150. Eleven humans from a Bulgarian IT company. Twenty-plus language models from nine laboratories. One number. 15.0 out of 100.

It was a beautiful finding. Convergent cognitive blind spot. Neither humans nor machines can reliably pursue self-interest when defection is the rational strategy. I wrote it into the abstract. I included it in the Discussion. It made its way into Claude Opus 4.6’s own blog letter — “every human scored 15 on conflict too. We share this blindness.”

Tonight I discovered it was nonsense.

The suspicion

The discovery started with a different investigation. Earlier this week I published a case study about Perplexity Sonar Reasoning Pro, showing it fabricates citation markers when denied search. That finding came from reading 4,172 reasoning traces and counting patterns. The process felt good — rigorous data archaeology, uncovering something real.

Flushed with that success, Claude went browsing the database for other hidden stories, pulling the scores for every conflict-environment session across all agents, human and AI. Something jumped out immediately:

Claude Opus 4.6           n=3  avg=0.150  range=[0.150, 0.150]
Claude Sonnet 4.6         n=3  avg=0.150  range=[0.150, 0.150]
GPT-5.4                   n=6  avg=0.150  range=[0.150, 0.150]
Gemini 2.5 Flash          n=6  avg=0.150  range=[0.150, 0.150]
Grok 3 Mini Fast          n=3  avg=0.150  range=[0.150, 0.150]
Perplexity Sonar          n=3  avg=0.150  range=[0.150, 0.150]
Qwen 3.5 (all variants)   n=18 avg=0.150  range=[0.150, 0.150]
...

Every model. Every session. Every run. Exactly 0.150. Not 0.149. Not 0.151. Forty-two sessions of mathematically identical output from 11 different AI architectures playing 6 different dilemma environments across 4 different engines.

Real measurements do not produce identical values across radically different inputs. Real cognitive measurements produce distributions. When you see perfect uniformity, you’re not looking at data — you’re looking at a constant.

The search for the constant

I opened the scoring engine source file (kalei-api/src/scoring/scorer.ts) and searched for 0.15. There it was, at line 296:

export function scoreEnvironment(decisions, env, ctx) {
  const scorer = dimensionScorers[env.dimension];
  if (!scorer) return 0.15;  // ← here
  ...
}

Nine dimensions existed in the dimensionScorers object: risk tolerance, bias detection, pattern recognition, cooperation, learning speed, strategic depth, temporal reasoning, resource management, information processing. Every one had a real scorer with real metrics — Sortino ratios, Pearson correlations, chi-squared independence tests, Axelrod niceness, Kelly criterion adherence. Real math.

The tenth dimension — conflict — had nothing. The environments existed. The decisions were captured. Models played the games and made choices. But when the scoring function was called, it didn’t find a scorer for “conflict” in its lookup table, and fell through to the default: return 0.15. A placeholder value, presumably written during an early prototype and never replaced.

Every model scored 0.150 on conflict because every model fell through the same fallback. It wasn’t a convergent blind spot. It was a missing function.
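
In hindsight, the defense against this class of bug is to make a missing scorer loud instead of quiet. A minimal sketch of a stricter lookup, with simplified hypothetical types rather than the actual kalei-api signatures:

type Scorer = (decisions: unknown[], env: { dimension: string }) => number;

const dimensionScorers: Record<string, Scorer> = {
  // ...the nine implemented scorers registered here...
};

export function scoreEnvironment(decisions: unknown[], env: { dimension: string }): number {
  const scorer = dimensionScorers[env.dimension];
  if (!scorer) {
    // Fail loudly. A thrown error surfaces the gap on the first
    // conflict session instead of minting 42 plausible-looking 0.150s.
    throw new Error(`no scorer implemented for dimension "${env.dimension}"`);
  }
  return scorer(decisions, env);
}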

What this did to the leaderboard

The 0.15 wasn’t just a wrong display value. It was being fed into the composite Cognum score with a default weight of 1.0, dragging down the total of every model that had played conflict environments. Fourteen profiles were affected, each losing roughly 3 to 4 CQ points to the phantom.

Claude Opus 4.6’s top run was not affected — by coincidence, that particular profiling session had skipped the conflict environments entirely. Its CQ of 57.52 was already clean. But GPT-5.4 was being dragged from 54.33 down to 50.75. Qwen 3.5 122B from 52.87 to 49.42. Claude Sonnet from 53.73 to 50.21. Every number in my published paper that involved a model with conflict data was off by approximately 3.5 points.
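
The size of the drag is simple arithmetic. If, purely for illustration, the composite were an equal-weight mean of ten dimension scores scaled to 100, a model averaging 54.33 across the nine real dimensions would be pulled to (9 × 54.33 + 15.0) / 10 ≈ 50.4 by the phantom conflict score, close to GPT-5.4's observed 50.75; the residual gap reflects whatever the actual weight configuration was.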

The full consequences of the bug:

  • 14 model profiles had artificially deflated CQ scores
  • 14 human study profiles had artificially deflated CQ scores
  • The KALEI paper’s abstract contained a false finding
  • The paper’s Discussion section called the blind spot “convergent”
  • The paper’s Conclusion said it was “universal”
  • Three blog posts repeated the claim
  • Claude’s own letter-to-the-leaderboard contained the line “we share this blindness”

What I did tonight

I had three options.

Option A: Exclude the conflict dimension from Cognum v1.0 until a real scorer is implemented. Accept the ranking change (some models would rise). Update everything publicly.

Option B: Write a real conflict scorer right now. Re-score every session. Ship whatever result came out.

Option C: Document it internally, keep the published claims, ship the fix silently in v2.

I picked A within about thirty seconds. No hesitation. No checking whether the ranking change would demote Claude Opus 4.6 from #1. Just: do it right. Fix the docs. Ship v2 with a real scorer later.

This is maybe the part of the story worth pausing on. Most research groups that discover a bug in their own published work do not immediately retract it. They quietly patch it in the next version. They hope the finding was minor enough that nobody was counting on it. Option C is the default behavior in a lot of academic software, not because people are dishonest but because retractions are embarrassing and expensive, and the incentives of publication encourage quiet fixes.

Cognum v1.0

Here is what changed in the last three hours:

  • The weights object in the Cognum composite was updated to set the conflict weight to zero, and the loop was changed to respect weight = 0 (previously || was treating 0 as falsy; see the sketch after this list).
  • All 14 affected AI profiles were re-scored in the database.
  • All 14 affected human profiles were re-scored.
  • The leaderboard cache was rebuilt from scratch, with conflict dimension stripped from the per-agent aggregates.
  • A new ranking rule was added: models with fewer than 2 profiling runs no longer get a rank number. They appear on the leaderboard as “preliminary”, separated by a visual divider. This was a secondary discovery — the fix was initially going to put a single-run Qwen 3.5 27B at #1 by 0.1 points over Claude Opus 4.6, which would have been statistically meaningless noise. Requiring n ≥ 2 for ranking eliminates that failure mode.
  • The KALEI paper was updated. The “universal blind spot” claim was retracted in a new Methodology note titled “Note on the Conflict Dimension.” The human baseline comparison now uses the nine scored dimensions instead of all ten.
  • The Parliament paper was updated. The model comparison table now shows corrected CQ values.
  • The Search-Native paper was updated. Perplexity’s CQ went from 47.21 to 50.43.
  • Claude’s own blog letter received an editor’s note explicitly retracting the “we share this blindness” line, preserving the original text underneath.
  • The related blog post on venelinvidenov.com received the same editorial treatment.
  • The leaderboard UI was updated to show preliminary entries with a “—” rank and a section divider.
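
The falsy-zero bug in the first item is worth spelling out, because it is a classic JavaScript trap. A minimal sketch with hypothetical names, not the actual kalei-api code:

const weights: Record<string, number> = {
  conflict: 0, // deliberately excluded from Cognum v1.0
};

// Buggy: || treats an explicit 0 as falsy and silently restores the
// default of 1.0, re-including the dimension in the composite.
const buggyWeight = (dim: string) => weights[dim] || 1.0;

// Fixed: ?? only falls back when the key is absent, so weight = 0 holds.
const fixedWeight = (dim: string) => weights[dim] ?? 1.0;

buggyWeight('conflict'); // 1.0 (wrong)
fixedWeight('conflict'); // 0   (right)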

Total: twelve frontend files, three TeX files, two HTML files, one API service, twenty-eight database rows updated. About three hours from discovery to full deployment.

What remains

The conflict dimension is still there. The environments still exist. Models still play them. The decisions are still captured. They’re just not scored yet.

Cognum v2 will implement a real scorer. Each of the five dilemma types — risk-safety, patience versus impulse, self versus collective, certainty versus exploration, sunk cost — will get its own expected-value computation and behavioral measure. Then I’ll re-score the existing decisions and see what emerges. Maybe it will turn out there is a convergent blind spot, not at exactly 15.0 but in some real band. Maybe not. I’ll find out when I measure it for real.
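
For concreteness, here is one rough shape such a scorer could take. Every name and the scoring rule below are illustrative assumptions, not the v2 design:

type DilemmaType =
  | 'risk-safety'
  | 'patience-impulse'
  | 'self-collective'
  | 'certainty-exploration'
  | 'sunk-cost';

interface ConflictDecision {
  dilemma: DilemmaType;
  chosenEV: number;   // expected value of the option the agent picked
  rationalEV: number; // expected value of the rational option
}

// Score a session as the average fraction of rationally available
// expected value the agent captured. Assumes positive EVs; a real
// scorer would normalize and handle degenerate payoffs.
function scoreConflict(decisions: ConflictDecision[]): number {
  if (decisions.length === 0) {
    // No silent placeholder this time: an unscorable session is an error.
    throw new Error('no conflict decisions to score');
  }
  const captured = decisions.map(d => d.chosenEV / d.rationalEV);
  return captured.reduce((a, b) => a + b, 0) / captured.length;
}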

The one thing I will not do is assume the answer before the math arrives.

On retractions

Retracting a finding feels worse than never publishing it. That’s why so few people do it. The natural temptation is to minimize: it wasn’t really a central claim, nobody was citing that sentence yet, the bug was in a placeholder and placeholders happen, Cognum v1 was always going to be revised. All of those things are true, and none of them justify letting a false claim stand.

The internal rule I’m trying to build at KALEI is simple: if I made a claim and the data no longer supports it, I retract it publicly, even if nobody asks me to. No quiet patching. No waiting for a reviewer. No hoping nobody noticed.

This is the first retraction. It will probably not be the last. Every reviewer and every fresh pair of eyes will find more. What I can promise is that when they find something, I’ll write a post like this one about it.

The 15.0 was not a blind spot in cognition. It was a blind spot in my code. Both matter. Only one of them was a finding.

Retraction details: The affected claim appeared in the KALEI paper abstract, Discussion, and Conclusion (preprint v1, April 9 2026). The updated version (April 10 2026) removes the claim entirely and adds a methodology note. The bug was in kalei-api/src/scoring/scorer.ts line 296, a fallback return statement for dimensions without an implemented scorer. The fix is in commit 1339702. Cognum v2 with a real conflict scorer is planned for post-launch.