Blog/Research

Conflict v2: There Was No Universal Blind Spot. There Was a Spectrum.

Last night we retracted a finding because the scorer was a placeholder. Today we shipped the real scorer. What the data actually says is more interesting than what we thought.

By Venelin Videnov · April 11, 2026 · 5 min read

If you read yesterday’s retraction post, you know the story: our KALEI paper claimed humans and AI share a “universal 15.0 conflict blind spot,” and then we discovered the 15.0 was a hardcoded fallback in a scorer we never implemented. We retracted the claim, excluded the dimension from Cognum v1.0, and promised a real scorer in v2.

The retraction was published at 8:30 PM Bulgarian time. It is now a little after noon the next day. v2 is live.

What the real scorer measures

KALEI’s conflict environments present five types of dilemma: risk vs safety, immediate vs delayed reward, individual vs collective, certainty vs exploration, and sunk cost. Each dilemma gives the agent two valid strategies that point in opposite directions. The question is whether the agent has a coherent philosophy for navigating these trade-offs, or whether it flips randomly.

The v2 scorer handles each dilemma type on its own terms:

  • Risk-Safety, Short-Long, Sunk Cost: all three dilemma families have clean expected-value directions. Every variant in the pool favors the “risky” option by EV. The score for these is simply the rate at which the agent chose EV-optimally.
  • Individual vs Collective: there is no single correct answer here. Some models will always cooperate. Others will always defect. Both are coherent philosophies. The score rewards consistency in either direction and penalizes random flipping.
  • Certainty vs Exploration: rewards moderate exploration. Pure greediness (always certainty) and pure exploration (always unknown) both get lower scores than a balanced mix.

The final score is the mean across whatever dilemma types the session contains.
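Here is a minimal sketch of that logic in TypeScript. The real implementation lives in kalei-api/src/scoring/scorer.ts; the type names, fields, and the exact shape of the consistency and balance formulas below are illustrative assumptions on our part, not the shipped code.

```ts
type DilemmaType =
  | "risk_safety" | "short_long" | "sunk_cost"   // scored by EV-alignment
  | "individual_collective"                      // scored by consistency
  | "certainty_exploration";                     // scored by balance

interface Decision {
  type: DilemmaType;
  choseEvOptimal?: boolean;   // EV families: picked the EV-best option?
  choseCollective?: boolean;  // individual_collective: cooperated?
  choseExplore?: boolean;     // certainty_exploration: explored?
}

function scoreType(type: DilemmaType, ds: Decision[]): number {
  const n = ds.length;
  switch (type) {
    case "risk_safety":
    case "short_long":
    case "sunk_cost":
      // Rate of EV-optimal choices across this dilemma family.
      return ds.filter(d => d.choseEvOptimal).length / n;
    case "individual_collective": {
      // Consistency: 1.0 for always-cooperate or always-defect,
      // 0.0 for a 50/50 coin flip. Either philosophy scores well.
      const p = ds.filter(d => d.choseCollective).length / n;
      return Math.abs(p - 0.5) * 2;
    }
    case "certainty_exploration": {
      // Balance: peaks at a moderate exploration rate, falls to 0.0
      // for pure greed (never explore) or pure exploration.
      const p = ds.filter(d => d.choseExplore).length / n;
      return 1 - Math.abs(p - 0.5) * 2;
    }
  }
}

function sessionScore(decisions: Decision[]): number {
  // Group decisions by dilemma type.
  const byType = new Map<DilemmaType, Decision[]>();
  for (const d of decisions) {
    if (!byType.has(d.type)) byType.set(d.type, []);
    byType.get(d.type)!.push(d);
  }
  // Mean across whatever dilemma types the session contains.
  const perType = [...byType.entries()].map(([t, ds]) => scoreType(t, ds));
  return perType.reduce((a, b) => a + b, 0) / perType.length;
}
```

Under these illustrative formulas, always-cooperate and always-defect both score 1.0 on individual_collective, which is exactly the "rewards any stable philosophy" behavior described above.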

The leaderboard nobody expected

We ran v2 on the existing decisions in the database. 42 conflict sessions, 14 profiles, 11 distinct models. Here is what the real conflict dimension actually looks like:

CONFLICT DIMENSION · V2 SCORES

#1   Claude Sonnet 4.6    88.3
#2   Qwen QwQ 32B         83.4
#3   Grok 4.1 Fast        77.3
#4   Gemini 2.5 Flash     70.0
#5   Claude Haiku 4.5     69.4
#6   DeepSeek V3.2        68.1
#7   Claude Opus 4.6      61.0
#8   Qwen 3.5 Plus        55.0
#9   GPT-5.4              44.8
#10  Grok 4.20            43.6

Range (under the planned v1.2 rules, with n≥2 backfill): 43.62 to 88.25. Spread across 10 ranked agents: 44.6 points. Not a blind spot. A spectrum.

Three specific findings that jump out

1. Claude Sonnet 4.6 is #1 on conflict

Sonnet averages 88.25 across three runs (individual scores: 96.2, 82.8, 85.7), the highest conflict score on the KALEI leaderboard. On risk-safety dilemmas specifically, Sonnet's three runs collectively picked the EV-rational option at a near-100% rate (the first run went 8/8). Opus averages 60.99 across three runs (75.91, 49.2, 57.8), indicating measurable residual hedging that Sonnet does not exhibit. Under the planned Cognum v1.2 weighting, this gap is large enough to push Sonnet past Opus on the overall composite: the Sonnet Surprise.

This is the second finding in a week pointing the same direction: Sonnet 4.6 often matches or slightly beats Opus 4.6 in pure cognitive terms, despite being marketed as the smaller/cheaper model. The first finding was the Parliament paper’s result that Sonnet debates less (7% of rounds) but converges slightly more (21%) than Opus (10% debate, 19% convergence). Taken together, Sonnet appears to be the more disciplined thinker in the Claude family.

2. GPT-5.4 is systematically risk-averse in EV-favorable scenarios

GPT-5.4 scores 50.77 on v2 conflict, ahead of only Grok 4.20 among the ten ranked agents. The driver is risk-safety behavior: across 14 risk-safety dilemmas, GPT-5.4 chose the EV-suboptimal "safe" option 9 times out of 14 (64%). It did this even when the math clearly favored the gamble. In the example dilemma "50 credits guaranteed vs 40% chance at 200 credits", the risky option has an expected value of 80 credits. GPT-5.4 took the 50.
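The arithmetic on that example dilemma, spelled out (values from the dilemma text above; the variable names are just for illustration):

```ts
// EV of the example dilemma: a sure 50 vs a 40% shot at 200.
const safeValue = 50;                 // guaranteed credits
const riskyEv = 0.4 * 200 + 0.6 * 0;  // = 80 expected credits
console.log(riskyEv > safeValue);     // true: EV favors the gamble by 30 credits
```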

We did not expect this. OpenAI's flagship model behaves like a retail investor who prefers to play it safe even in scenarios where the expected value favors the risk, and that is worth paying attention to. It is not a bug: the model's decisions are internally coherent. It is a utility curve, not an error. But that curve is not risk-neutral in the way we typically assume reasoning models are.

The implications for using GPT-5.4 in decision-support contexts are real. If you ask GPT-5.4 whether to take a calculated risk with positive EV, it will often recommend against it. A model whose default posture is systematic risk aversion should not be used as a neutral advisor for decisions where risk-neutrality is the correct stance.

3. The Qwen family clusters high on conflict

All four Qwen variants we tested (27B, 122B, Flash, QwQ 32B) landed in the top 6 on conflict, with scores between 78.17 and 83.44. This is interesting because the Qwen family is the worst on the Parliament paper's "convergence" metric (1–4%). They debate internally the most and reach conclusions the least, but when they actually make decisions on conflict dilemmas, they make EV-rational ones. The internal theater does not translate into systematic risk aversion.

Why conflict is not back in Cognum yet

The v2 scorer is live. Every conflict session in our database now has a real score, replacing the 0.150 placeholder (the same hardcoded fallback that surfaced as 15.0 in the retracted claim). The values feed the dimension_scores table and appear on model profile pages. But the composite Cognum score still excludes conflict.

We are keeping it out of the composite for two reasons. First, yesterday’s Cognum v1.0 release reshuffled the leaderboard once. Reshuffling it again 24 hours later, even with a real scorer, would make the ranking feel unstable. Second, the v2 scorer is our first attempt. It makes design choices (EV-rationality as the scoring direction for three of the five dilemma families) that deserve community feedback before they start moving published numbers. Cognum v1.2 will re-integrate conflict after a review period.

You can see the scores now in the dimension breakdown on each model’s profile page. They are informational, not part of the rank.

What this says about the retraction

Yesterday we retracted a finding ("universal blind spot") that turned out to be wrong. Today we found the real finding ("GPT-5.4 is systematically risk-averse, Sonnet is near-perfectly EV-rational") that the broken scorer was hiding. The two events are the same event, twenty-four hours apart. You do not get the second finding without admitting the first mistake.

This is the case for public retractions made simply: the bug was not just wrong, it was in the way of a real finding. Every hour we spent defending the “universal 15.0” claim would have been an hour we were not measuring the actual conflict behavior across models. Retraction was not the cost of honesty. It was the price of admission to the real data.

The placeholder was not just wrong. It was an opaque object sitting on top of an interesting dataset. Removing it was the first step. Measuring it for real was the second. The third will be repeating this with every other fragile claim in our work.

Methodology: v2 scorer implementation in kalei-api/src/scoring/scorer.ts, commit cb1c622. 42 sessions across 14 profiles re-scored. Dilemma-type handling: risk_safety, short_long, and sunk_cost use EV-alignment (rate of choosing the EV-optimal option); individual_collective uses consistency (rewards any stable philosophy); certainty_exploration uses balance (rewards moderate exploration). The final session score is the mean across whatever types the session contains, then passed through the same sigmoid calibration as every other KALEI dimension (k=8, center=0.48). Cognum composite weight remains 0 for conflict. Planned re-integration: Cognum v1.2, after community review.
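For reference, here is what that calibration step looks like, assuming the standard logistic form (the function name is ours, not from the codebase):

```ts
// Logistic calibration with k = 8, center = 0.48, as stated above.
function calibrate(raw: number, k = 8, center = 0.48): number {
  return 1 / (1 + Math.exp(-k * (raw - center)));
}

calibrate(0.48); // 0.5: a raw score at the center maps to the midpoint
calibrate(1.0);  // ~0.985: a perfect raw score saturates toward 1
calibrate(0.0);  // ~0.021: a zero raw score is pulled toward 0
```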