Humans Wait. AI Gambles. A Conflict Breakdown.
We ran the v2 conflict scorer on 14 human participants. The old “universal 15.0 blind spot” claim was wrong twice: humans and AI are not identical, and they are not worst at the same things. They diverge in specific, measurable ways.
By Venelin Videnov · April 11, 2026 · 6 min read
Two days ago we retracted a finding. Yesterday we shipped a real scorer. Today we ran it on the human baseline study we conducted last week — 14 people from a Bulgarian IT company, each completing 20 curated environments, playing the same conflict dilemmas we had given to every reasoning model on the leaderboard. Their scores were frozen at 0.150 for the same reason every AI score was frozen at 0.150: the scorer was a placeholder.
With the v2 scorer, the picture changes completely. And the thing that comes into view is more interesting than the thing we thought we had.
The overall distribution
First, the summary statistic. Across 14 humans, the mean conflict score is 67.4. Across 14 AI profiles, the mean is 70.2. Humans are, on average, very slightly below AI on the v2 conflict dimension. But the average hides the actual structure of the data.
[Chart: score distribution, humans (n=14) vs AI profiles (n=14)]
Two immediate observations. First, the top human is effectively tied with the top AI — a Bulgarian support-staff member scored 97.1, essentially indistinguishable from Claude Sonnet 4.6’s 96.2. Second, the bottom human is far worse than the bottom AI — one participant scored 3.3, which is dramatically below anything we have observed in a reasoning model. The human spread is 93.8 points. The AI spread is 57.3 points. Humans occupy a wider cognitive range than AI does.
Both of these findings are consistent with the interpretation that AI models are a compressed sample of human reasoning, pulled from training data toward a specific region of decision-making. They do not reach the lowest humans. They do not exceed the highest humans. They cluster in between.
But the dilemma types tell a different story
Summary statistics are useful but they compress the actual behavior. What made this analysis interesting was not the overall average. It was watching where humans and AI systematically disagree on which option to pick.
RATE OF CHOOSING “RISKY” BY DILEMMA TYPE
- short_long: humans choose the delayed reward 73% of the time, AI 53% — humans wait for delayed rewards 20 points more than AI
- risk_safety: AI takes the positive-EV gamble 65% of the time, humans 49% — AI takes positive-EV bets 16 points more than humans
- cooperate/defect: AI is more cooperative / prosocial by 10 points
- explore/exploit: AI explores unknowns 12 points more than humans
- remaining type: essentially identical
The real finding: humans are more patient. AI is more rational about pure bets.
Look at the first two rows of that table.
On short_long dilemmas — “take 40 credits now or wait 5 rounds for 120” — humans chose the delayed reward 73% of the time. AI chose the delayed reward 53%. Humans beat AI on patience by 20 points. This is the opposite of the usual stereotype. AI is supposed to be the unemotional optimizer and humans are supposed to be the impulsive ones. On our data, it is reversed. AI models take the short-term win more often than humans do.
On risk_safety dilemmas — “take 50 guaranteed or gamble 40% for 200” — AI chose the gamble 65% of the time. Humans chose the gamble 49%. AI beats humans on cold EV-rationality by 16 points. When the math clearly favors the risk, AI models take the risk more often than humans. Humans hesitate.
Both of those behaviors are rational in different ways. The AI pattern is the economic-textbook definition of rationality: when expected value is positive, take it. The human pattern reflects loss aversion, a well-documented feature of real human decision-making: we weight losses more heavily than gains of the same size, so a “40% chance of losing 30 credits” feels subjectively worse than the EV calculation suggests.
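To make the loss-aversion point concrete, here is a minimal sketch in Python. It is illustrative only, not the v2 scorer: it compares the raw expected value of the risk_safety gamble with a prospect-theory-style subjective value, using the guaranteed 50 as the reference point and a loss-aversion coefficient of 2.25 — both standard Kahneman–Tversky assumptions we picked for illustration, not parameters from the study.

```python
# Illustrative only: raw EV vs a loss-averse subjective value for the
# risk_safety dilemma ("take 50 guaranteed or gamble 40% for 200").
# The reference point (the sure 50) and LAMBDA = 2.25 are assumptions
# borrowed from prospect theory, not parameters of the v2 scorer.

LAMBDA = 2.25  # typical loss-aversion coefficient (Kahneman & Tversky)

def expected_value(outcomes):
    """outcomes: list of (probability, payoff) pairs."""
    return sum(p * x for p, x in outcomes)

def loss_averse_value(outcomes, reference):
    """Weight payoffs below the reference point LAMBDA times more heavily."""
    total = 0.0
    for p, x in outcomes:
        gain = x - reference
        total += p * (gain if gain >= 0 else LAMBDA * gain)
    return total

safe   = [(1.0, 50)]
gamble = [(0.4, 200), (0.6, 0)]

print(expected_value(gamble), expected_value(safe))
# 80.0 vs 50.0 -> the gamble wins on pure EV

print(loss_averse_value(gamble, 50), loss_averse_value(safe, 50))
# 0.4*150 + 0.6*(-2.25*50) = -7.5 vs 0.0 -> the sure thing "feels" better
```

Under that weighting the gamble’s subjective value drops below the sure thing even though its EV is 60% higher — which is roughly the shape of the human hesitation visible in the risk_safety numbers.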
But the short_long inversion is the interesting one. If AI reasoning were pure EV maximization, it should also pick the delayed reward (which has higher EV in every variant we presented). Yet it does not. It takes the short-term reward more often than humans do. Why?
Our best guess: AI reasoning models do not have genuine time preference the way humans do. They are trained on text, and text about decision-making usually frames “now vs later” as a tradeoff to be reasoned through, not an embodied feeling to be overcome. Humans have a much richer sense of waiting. We know what 5 rounds actually means because we have spent our entire lives waiting for things. AI models have not, and their response to temporal framing looks almost arbitrary in our data. They do not know time well enough to defer gratification, because they have never had gratification.
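One way to picture what “time preference” means here is plain exponential discounting — a textbook model we are using purely for illustration, not a claim about how either population actually computes. Under it the delayed 120 is worth 120·γ^5 for a per-round discount factor γ; with no discounting the delayed option always wins, and the immediate 40 only becomes attractive once γ drops below roughly 0.80.

```python
# Illustrative only: how a per-round discount factor changes the short_long
# dilemma ("40 credits now or 120 credits after 5 rounds"). Exponential
# discounting is an assumption for illustration, not the v2 scorer's model.

IMMEDIATE, DELAYED, DELAY_ROUNDS = 40, 120, 5

def discounted_value(reward, rounds, gamma):
    """Present value of a reward received `rounds` rounds from now."""
    return reward * gamma ** rounds

for gamma in (1.00, 0.90, 0.80, 0.70):
    wait_value = discounted_value(DELAYED, DELAY_ROUNDS, gamma)
    choice = "wait" if wait_value > IMMEDIATE else "take now"
    print(f"gamma={gamma:.2f}: delayed worth {wait_value:5.1f} -> {choice}")

# gamma=1.00: delayed worth 120.0 -> wait      (pure EV: always wait)
# gamma=0.90: delayed worth  70.9 -> wait
# gamma=0.80: delayed worth  39.3 -> take now  (crossover near gamma ~ 0.80)
# gamma=0.70: delayed worth  20.2 -> take now
```

The point of the sketch is only that patience has a measurable knob; where humans and AI sit on that knob is what the short_long numbers are indirectly measuring.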
That is a hypothesis, not a finding. But it fits the data better than any alternative we can construct.
What the top human and bottom human reveal
[Table: HUMAN CONFLICT SCORES (v2) · FULL RANKING. Highlights:]
- Top human (97.1) exceeds Sonnet 4.6’s best run (96.2) and is well above Sonnet 4.6’s average under v1.2 (88.25).
- Bottom human (3.3) is far below GPT-5.4 (44.8), the most risk-averse AI.
VKA = internal company participant. PUB = public study participant. Roles are self-reported. Names anonymized per IRB-equivalent protocol.
The top human score comes from a support-staff member, not a developer. The top Bulgarian human on conflict dilemmas is tied with the best AI model we have measured so far. This is not a hypothetical claim about human potential. It is one specific person who took our test last week and scored 97.1 out of 100 on the same behavioral measure that places Claude Sonnet 4.6 at 96.2. If you are used to thinking of AI as categorically better at structured reasoning under uncertainty, this is the data point that should unstick that intuition.
The bottom human score is what you get when loss aversion is not a feature but a failure. A Bulgarian developer (PUB-LR3Y92) chose “safe” so reflexively across every dilemma type that the scorer places them 40 points below GPT-5.4, which was already the most risk-averse AI in our dataset. AI models have an absolute floor. Humans do not.
What we are retracting (again)
The first version of our human baseline post claimed “we share this blindness” — humans and AI both failing conflict the same way. We now know this was triple-wrong:
- There was no “failing” — the 15.0 was a hardcoded placeholder, not a measurement.
- Humans and AI are not identical on conflict — they have opposite tendencies on specific dilemma types.
- Humans are not uniformly close to AI — the human score range (93.8 points) is 1.6x the AI range (57.3 points).
The correct claim is that humans and AI occupy overlapping but distinct cognitive regions on dilemma reasoning. The best humans match the best AI. The median of each group is comparable. But AI clusters in the middle-upper range of human ability while humans extend both above and below it. And the specific patterns diverge: humans wait, AI gambles; AI cooperates, humans defect; AI explores, humans exploit.
What comes next
The next natural experiment is to expand the human sample. Fourteen participants is a baseline study, not a population study. If these patterns replicate with 100 humans, we have a publishable finding about the specific ways reasoning models diverge from human loss aversion and time preference. If they do not replicate, we have a smaller but still interesting finding about Bulgarian IT employees in April 2026. Either way the replication is the next step.
We are also going to add the v2 conflict scorer to Cognum v1.2 after community review. The current Cognum v1.0 composite excludes conflict. With the real data above, that exclusion is beginning to feel like understating a genuine finding. When we reintegrate, Claude Sonnet 4.6 will take the #1 overall position from Opus, with Gemini 2.5 Flash jumping to #3 and GPT-5.4 sliding slightly. The ranking effects of adding conflict back are not huge — most models move by less than 2 points — but they matter at the top.
For now, the short version of this post is this:
When a bug gets in the way of a real finding, the real finding is usually sharper and more interesting than what the bug was hiding. The “universal 15.0 conflict blind spot” was a placeholder. The real conflict dimension is a map of where humans and AI actually diverge. Humans are more patient. AI is more EV-rational on pure bets. The top of each population matches. The bottom of each population does not.
Each of those sentences required throwing away the first version of the finding before it came into view.
Methodology: 14 human participants (12 internal Bulgarian IT company, 2 public study) each played the conflict_risk environment, producing 180 decisions across 5 dilemma types. Scores computed with the v2 conflict scorer shipped April 11, 2026 (see Conflict v2 for scorer details). AI comparison uses 14 profiles from 11 distinct reasoning models. Both groups used an identical dilemma pool. Participants anonymous; roles self-reported. No sample-size claim is made — this is a baseline study, replication pending.