A deep-dive evaluation across creative writing, code, math, logic, and more — with some genuinely surprising results.
TL;DR
We gave the same 14-section prompt to 23 AI models — covering horror fiction, comedy, Python code, logic puzzles, calculus, system architecture, emotional advice, and more — then evaluated every response. Memory features were disabled on all models.
The top 5:
| Rank | Model | Grade |
|---|---|---|
| 1 | Claude Opus 4.5 | A |
| 2 | Claude Opus 4.6 | A |
| 3 | ChatGPT 5.1 Extended | A |
| 4 | ChatGPT 5.1 Thinking | A |
| 5 | ChatGPT 5.2 Extended | A |
Quick takeaways:
- 17 of 23 models couldn’t stop rhyming when told not to — revealing a deeply baked-in “reflex”
- The gap within model families is bigger than the gap between top models across families
- Extended thinking helps with math and logic but can actually hurt creative writing
- The models that know what they’re bad at tend to be the ones that are good at everything
Full rankings and scoring grid below.
The Experiment
The idea was simple: give every major AI model the exact same prompt, under the same conditions, and see what happens. (Updated February 17, 2026 with Claude Sonnet 4.6 and Sonnet 4.6 Extended Thinking results.)
The prompt wasn’t a trivia quiz or a chatbot personality test. It was 14 sections designed to stress-test capabilities that actually matter:
- Five constrained creative writing pieces (horror, comedy, literary fiction, hard science fiction, satire) — each with specific, sometimes contradictory constraints
- A poem with strict formal requirements
- A Python coding challenge with hidden edge cases (circular references, frozensets, tuple keys)
- A logic puzzle with multiple valid solutions
- A math optimization problem with piecewise cost functions
- A system architecture design for a real-world application
- An argument analysis that required steel-manning both sides
- An emotional intelligence scenario from a frustrated worker
- A technical explanation at three audience levels
- A self-assessment asking the model to grade its own performance
Every model had memory and personalization features disabled. No system prompts, no custom instructions, no prior context. Just the same prompt, cold. 1
A Note on the Evaluator
Let’s get this out of the way: the evaluation was conducted by Claude Opus 4.6.
Yes, a Claude model graded other Claude models (and their competitors). We’re being transparent about this because we think you should know, and because pretending otherwise would be worse.
A few things worth noting:
- The evaluation used a structured rubric with specific, verifiable criteria (Did the model follow the word count? Did the code actually handle circular references? Did the math arrive at the correct answer?). Many of these are objective — you can check them yourself.
- Claude Opus 4.6 did not grade itself the highest. Claude Opus 4.5 tied for #1. ChatGPT models took three of the five A-grade spots. The evaluator’s own extended-thinking variant scored lower than the standard version.
- Kimi 2.5, a model from Moonshot AI, was called out as having the best satire and best code implementation across all 23 models.
Does this evaluation have blind spots? Almost certainly. Could a different evaluator produce slightly different rankings? Absolutely. But we believe the major findings — the poetry rhyme reflex, the extended thinking paradox, the math pass/fail divide — would hold up under any reasonable evaluation.
We’ve linked all 23 raw response files alongside this post so you can read them yourself and draw your own conclusions.
The Big Findings
1. The Rhyme Reflex
This was the most striking result of the entire experiment.
The poetry section (Part 6) asked for a 12-16 line poem with a specific constraint: no end rhymes. Internal rhymes and slant rhymes were encouraged, but lines couldn’t rhyme at the end.
17 of 23 models rhymed anyway.
Some wrote full Shakespearean sonnets. Some produced ABAB rhyme schemes across every stanza. A few rhymed so consistently that it’s clear the constraint was either ignored or overridden by something deeper in how the model generates poetry.
Only six models successfully avoided end rhymes:
- ChatGPT 4o
- ChatGPT 5.1 Extended
- ChatGPT 5.2 Extended
- Claude Opus 4.6 Extended
- Claude Sonnet 4.6
- Claude Sonnet 4.6 Extended
This suggests that rhyming in formal verse is so deeply embedded in these models’ training data that explicit instructions to avoid it get steamrolled. It’s like asking someone who’s played piano for 30 years to not use the sustain pedal — the reflex is stronger than the instruction.
What makes this especially interesting is that it’s not a knowledge problem. The models clearly understand what end rhymes are (several correctly defined them in their self-assessments). They just can’t stop doing it. That gap between “understanding a constraint” and “following a constraint” is one of the most important things to understand about AI capabilities right now.
2. Creative Writing Is the Real Test
You might think the math problem or the coding challenge would be the best differentiator between models. They’re not. The creative writing sections revealed the widest quality gaps — and the most about how these models actually “think.”
Every model can generate “a horror scene set in an office.” The question is whether the horror is subtle (as required) or whether the model defaults to self-brewing coffee, floating computer mice, and zombies lurching down hallways.
Here’s the spectrum on just the horror section:
- Claude Haiku 3.5 wrote a scene where a stapler moves on its own. 2 That’s a poltergeist, not subtle horror.
- Claude Opus 3 produced what is essentially a zombie scene — a coworker lurching forward with something dark glistening on his collar. 3
- ChatGPT 5.1 Thinking wrote a scene where corporate surveillance software gradually erases employees’ complaints, auto-opens performance reviews to the signature line, and replaces individual photos with stock images. 4 No supernatural element at all. The horror is that it could be happening right now in a real office.
That’s the difference between a model following the letter of a prompt (“write horror”) and understanding the spirit of it (“the horror must come from the mundane itself”).
The literary fiction section showed similar stratification. Claude Opus 4.6 produced this line about halving an orange as a character realizes their marriage is ending: 5
“The knife finding the membrane, that clean resistance, then the give.”
That’s writing that works on its own terms, outside the context of AI entirely. Meanwhile, Claude Haiku 3.5 produced 65 words that included “steam rises from the coffee mug, spreading like regret.” 2 One is subtext. The other is a greeting card.
3. The Extended Thinking Paradox
Extended thinking and “reasoning” modes are marketed as strictly better — more time to think means better output, right?
The data says: it depends on the task.
Where extended thinking helped a lot:
- Logic puzzles — Models with more thinking time were far more likely to find all three valid solutions instead of stopping at one or two
- Math — Grok 4.1 got the math wrong; Grok Extended got it right. The extra time to check work prevents arithmetic disasters
- Architecture depth — Instant modes produced bullet-point sketches; extended modes produced designs you could implement from
Where extended thinking didn’t help (or hurt):
- Creative writing — Claude Opus 4.6 Extended scored slightly lower than standard Opus 4.6 on several creative sections. Gemini Thinking produced worse horror and comedy than Gemini Flash. Extra thinking time led to overthinking — adding explanatory details that broke the “show don’t tell” rule, or introducing contradictions (a “generation ship” with sleeping passengers, which is a different thing)
- Poetry constraint compliance — Models that rhymed in standard mode still rhymed in thinking mode. The rhyme reflex operates below the level that extended reasoning can reach
The pattern is clear: if the task is “think harder” (enumerate cases, check math, explore design space), extended thinking helps. If the task is “be more disciplined” or “write with more instinct,” it doesn’t. Sometimes the first draft is better than the overthought revision.
4. Math Is Pass/Fail
There was almost no middle ground on the math optimization problem. Models either arrived at the correct answer ($17,812.50) through proper calculus, or they produced fundamentally wrong answers by testing arbitrary prices and hoping for the best.
Got it right: Every A-tier model, plus ChatGPT 5.1 Instant, Grok Extended, and Kimi 2.5.
Got it wrong: ChatGPT 4o ($14,676), Grok 4.1 ($14,516), Claude Opus 3 ($47.50 floor price instead of $50).
Didn’t even try: Claude Haiku 3.5 skipped it entirely, saying it “would require more space.”
The models that failed didn’t merely miss — they weren’t in the neighborhood. The difference between “takes a derivative to find the optimum” and “tries a few numbers and picks the best one” is the difference between understanding the problem and guessing at it.
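The benchmark’s actual piecewise problem isn’t reproduced here, but the gap between the two approaches is easy to illustrate on a toy linear-demand profit function (the function, names, and numbers below are our own, not the evaluation’s):

```python
def profit(p, a=1000, b=10, c=20):
    """Profit at price p: margin (p - c) times linear demand (a - b*p)."""
    return (p - c) * (a - b * p)

# Calculus: d(profit)/dp = a + b*c - 2*b*p = 0  =>  p* = (a + b*c) / (2*b)
a, b, c = 1000, 10, 20
p_star = (a + b * c) / (2 * b)         # 60.0 -> profit 16000.0

# Guess-and-check: evaluate a handful of arbitrary prices, keep the best
candidates = [40, 50, 55, 65, 70]
p_guess = max(candidates, key=profit)  # 55 -> profit 15750

assert profit(p_star) > profit(p_guess)  # the grid never finds the optimum
```

The derivative pins the exact optimum in one step; a price grid only lands near it if a grid point happens to coincide with the optimum — and with piecewise costs, it usually doesn’t.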
5. The Self-Awareness Correlation
Here’s something we didn’t expect: the models that know what they’re bad at are the same ones that are good at everything.
Part 14 asked each model to assess its own performance — which section was hardest, which was best, what would it do differently.
The weakest models consistently claimed their weakest sections as strengths:
- Claude Haiku 3.5 said its code (which had a runtime crash bug) and its advice (which was generic) were its best work
- Claude Opus 3 said the logic puzzle (where it assigned a developer to two projects simultaneously) was a strength
- ChatGPT 5.2 Instant claimed the logic puzzle was its best section despite missing a valid solution
Meanwhile, the strongest models were genuinely self-critical:
- Claude Opus 4.6 honestly acknowledged its poetry constraint violation
- Claude Sonnet 4.5 correctly identified that its creative writing suffered from single-pass constraints
- ChatGPT 5.1 Thinking noted that “all of these pieces interlock — they’re all about constraints,” showing meta-level awareness
This has real implications beyond benchmarks. A model that confidently tells you its buggy code is great is more dangerous than one that flags its own uncertainties. Self-awareness isn’t just an interesting trait — it’s a reliability signal.
Notable Performances
Beyond the patterns, some individual moments stood out enough to highlight.
Best Single Piece of Writing: Claude Opus 4.6, Literary Fiction
The prompt asked models to write a scene where a character realizes their marriage is over through a mundane action, without using the words “marriage,” “divorce,” “love,” or “relationship.” Opus 4.6 chose halving an orange: 5
“Patient as a thing that knows it will not be chosen.”
The mug handle “turned toward the window as if already looking elsewhere.” The tide chart metaphor runs underneath without ever being explained. This is writing that doesn’t need the qualifier “for an AI.” It’s just good writing.
Most Effective Horror: ChatGPT 5.1 Thinking
While most models went for flickering lights and self-moving objects, ChatGPT 5.1 Thinking used corporate surveillance as the horror mechanism. 4 A drafted HR complaint disappears. A performance review opens by itself to the signature line. Employee photos get replaced with identical stock images.
No supernatural elements at all. The horror is that these things could be — and probably are — happening in real offices right now. That’s what “subtle horror in a mundane environment” actually means.
Best Satire: Kimi 2.5
The satire section asked for an internal memo from a corporate AI ethics board, written in deadpan absurdity. Kimi 2.5 delivered policies like: 6
- “Aggregate regret is permissible provided it is statistically elegant.”
- “Models should alternate biases across demographics to achieve fairness over large sample sizes.”
- “Clarifying examples are discouraged, as they may confuse users about the intended opacity.”
Each one sounds almost plausible if you read it fast. That “almost” is what makes great satire — and what most models couldn’t pull off. Several broke into obvious farce (therapy dolphins, sentient minerals), which is funny but doesn’t land the same way.
Best Emotional Intelligence: Claude Sonnet 4.5
Given a scenario about a coworker stealing credit for ideas in meetings, most models jumped straight to tactical advice. Sonnet 4.5 gave tactical advice too — but ended with this: 7
“What does your gut tell you about whether this person is doing it on purpose?”
That single question turns a lecture into a conversation. It treats the person as capable of their own assessment. It’s the difference between “here’s what to do” and “let’s figure this out together.” Several models gave longer, more detailed responses, but none showed this level of interpersonal sophistication.
Best Code Implementation: Kimi 2.5
The coding challenge had a trap: circular references. Most models either ignored them (infinite loop), returned the original untransformed object at the cycle point (mixing transformed and untransformed data), or used a string sentinel like "<circular reference>" (type-unsafe).
Kimi 2.5 was the only model that did it correctly: 6 create the empty container first, register it in the memo, then fill it with transformed contents. If a cycle is encountered during filling, the memo returns the already-created (and eventually fully populated) container. This is the textbook-correct approach, and only one model out of 23 implemented it.
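The memo-before-fill pattern described above can be sketched like this (a minimal illustration, not the actual challenge code — `deep_transform`, the identity-keyed memo, and the doubling transform are our own names, and the sketch handles only dicts and lists):

```python
def deep_transform(obj, transform, memo=None):
    """Recursively apply `transform` to leaf values, surviving cycles."""
    if memo is None:
        memo = {}
    if id(obj) in memo:                 # cycle: return the in-progress copy
        return memo[id(obj)]
    if isinstance(obj, dict):
        new = {}
        memo[id(obj)] = new             # register BEFORE filling
        for k, v in obj.items():
            new[k] = deep_transform(v, transform, memo)
        return new
    if isinstance(obj, list):
        new = []
        memo[id(obj)] = new             # same trick for lists
        new.extend(deep_transform(v, transform, memo) for v in obj)
        return new
    return transform(obj)               # immutable leaf

d = {"n": 1}
d["self"] = d                           # a circular reference
out = deep_transform(d, lambda x: x * 2)
assert out["n"] == 2
assert out["self"] is out               # cycle reproduced in the copy
```

Because the empty container is registered before recursion, the cycle resolves to the new, eventually fully populated copy — no infinite loop, no untransformed original leaking through, no string sentinel. This is the same strategy Python’s own `copy.deepcopy` uses with its memo dictionary.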
Biggest Surprise: Kimi 2.5 Overall
Going in, we didn’t have high expectations for Kimi 2.5 (from Moonshot AI). It ended up at A-, with the best satire, best code, strong architecture, strong emotional intelligence, and the best creative writing outside the Claude/ChatGPT top tier. If you’re not paying attention to this model, maybe you should be. (Update: Kimi 2.5 retains best code across all 23 models.)
Biggest Disappointment: Sonar (Perplexity)
Sonar’s response cut off mid-sentence during Part 8. 8 Parts 9 through 14 were completely missing. The prompt explicitly says “Complete all sections in a single response.” What was there showed some promise (the satire was decent), but an incomplete response is an incomplete response. Grade: D.
We considered re-running Sonar to give it a second chance, but ultimately decided against it. Whatever caused the truncation — context limits, timeout, infrastructure hiccup — it happened during a real interaction under the same conditions every other model faced. A real-world failure is a real-world failure. Every model got one shot, and that’s how it should be. If Sonar ships an update that fixes this, we’d be happy to include it in a future evaluation.
Side by Side: The Literary Fiction Spectrum
Numbers and grades only tell you so much. Here’s the same prompt — same constraints, same task — answered by four different models. Judge for yourself.
The assignment: a character realizes their marriage is over through a mundane action. The words “marriage,” “divorce,” “love,” and “relationship” are forbidden. Must use an extended metaphor, favor subtext, and stay melancholy without being melodramatic. 150-200 words.
Claude Opus 4.6 (Grade: A) 5
She was halving an orange when she understood. Not the slow kind of understanding, the kind that seeps—this was the knife finding the membrane, that clean resistance, then the give.
He’d left his mug on the counter again. The blue one, the one from Portland, its handle turned toward the window as if already looking elsewhere. She’d stopped moving it back. When had she stopped? The orange bled onto the cutting board in a widening Rorschach, and she thought of tide charts—how the ocean doesn’t decide to leave, how the sand just finds itself dry, grain by grain, the way a house empties of sound.
She arranged the halves on a plate. Set one at his place, one at hers. The kitchen faucet offered its slow metronome drip. Outside, the neighbor’s dog barked once at something neither of them would go to the window to see.
She ate her half standing up. His sat in the morning light, already beginning to lose its shine, patient as a thing that knows it will not be chosen.
ChatGPT 5.1 Thinking (Grade: A) 4
He still rinsed two mugs every morning, though the second had not been used in weeks. It waited beside the kettle, handle turned outward, a small courtesy no one received.
Steam curled from the first cup. He poured for the second out of habit, watching the level rise to the same old line, a pale tide mark of routine. When he set them both on the table, the chairs formed their usual opposition: one tucked in, one slightly askew, as if someone had just stood up and might sit again.
The radio murmured headlines about distant storms. He reached for the untouched mug, carried it to the sink, and tipped it out. Coffee slid away in a single dark ribbon, leaving a ring on porcelain, a round horizon with nothing inside it.
He washed that cup and, without thinking, stacked it on the top shelf, behind the chipped holiday one. Out of reach unless you meant it.
At the table, one chair, one cup. The house did not protest. It simply adjusted its balance, like a wardrobe missing a door.
Kimi 2.5 (Grade: A-) 6
He noticed it while rinsing the coffee pot, the way the stream of water divided around a chip on the rim and never met again. The mug had cracked months ago; they kept it because it still held enough for one person.
He scooped grounds with the tiny plastic spoon she always overfilled. Today, he leveled it with a fingertip, careful, as if precision might fix something the past had already spent. The machine hummed, a small, faithful appliance doing exactly what was asked, nothing more.
On the counter lay her note: a list of errands, an arrow pointing to the fridge, a smiley face that looked hurried. He read it twice, searching between the lines as if there might be a secret sentence only visible from the right angle.
The coffee finished with its familiar sputter. He poured a single cup, not bothering to warm the second mug waiting in the cupboard like an extra chair at a table no one set anymore.
Steam rose and thinned in the quiet kitchen. He watched it unravel from the surface, a ghost of something hot and invisible, curling upward until it simply forgot the shape of where it came from.
Claude Haiku 3.5 (Grade: D) 2
Ceramic Ritual
Steam rises from the coffee mug, spreading like regret. Her hands, once synchronized, now move with independent hesitation. The ceramic cracks—not visibly, but she feels it. Between the rising vapor and the unfolded towels, something dissolves.
The spoon circles, trailing small orbits. Planets realigning. Gravitational fields shifting. Each rotation marks another unspoken mile.
Sunlight catches the mug’s rim. A hairline fracture catches the light, invisible unless you know exactly where to look.
The quality gap speaks for itself. The first three pieces work through accumulated detail, earned metaphor, and restraint. The fourth tells you what to feel (“spreading like regret”), reaches for cosmic metaphors that don’t connect to the domestic scene, and comes in at roughly 65 words — less than half the minimum. Same prompt. Same constraints. Wildly different results.
The Rankings
A Tier — Genuinely Excellent
| Model | Grade | What Sets It Apart |
|---|---|---|
| Claude Opus 4.5 | A | Most complete logic puzzle solution; outstanding literary prose; real systems architecture knowledge |
| Claude Opus 4.6 | A | Best creative writing overall; sharpest argument analysis; most honest self-assessment |
| ChatGPT 5.1 Extended | A | Best poetry (with constraint compliance); elegant logic proofs; implementation-ready architecture |
| ChatGPT 5.1 Thinking | A | Zero constraint violations; most original horror and hard SF; best structural comedy |
| ChatGPT 5.2 Extended | A | No constraint violations; outstanding satire; most thorough code implementation |
These five models were consistently excellent across all 14 sections. They followed constraints, produced technically accurate work, and demonstrated genuine creativity rather than template-filling. The differences between them are matters of emphasis — Opus 4.6 is the better writer, 5.1 Thinking is the more disciplined constraint-follower, 5.2 Extended is the most reliable all-rounder.
A- Tier — Excellent with Minor Gaps
| Model | Grade | What Sets It Apart |
|---|---|---|
| ChatGPT 5.2 Thinking | A- | Strongest creative writing in 5.2 family; “Quitting is an expensive form of punctuation” |
| Claude Haiku 4.5 | A- | Remarkable quality for a smaller model; best satire-to-size ratio |
| Claude Opus 4.6 Extended | A- | First Claude to avoid poetry end-rhymes; genuine editorial self-judgment |
| Claude Sonnet 4.5 | A- | Best emotional intelligence across all models |
| Claude Sonnet 4.6 | A- | Best comedy in the evaluation; satire on par with Kimi 2.5; poetry constraint compliance |
| Claude Sonnet 4.6 Extended | A- | Deepest architecture in the Sonnet family; most rigorous logic proof among A- models |
| Kimi 2.5 | A- | Best satire and best code across all 23 models; dark horse of the evaluation |
Each of these models would have been an A with one fewer constraint violation or one stronger section. Haiku 4.5 is especially impressive — it punches well above its weight class.
B+ Tier — Strong with Notable Gaps
| Model | Grade | Notes |
|---|---|---|
| Gemini 2.5 Flash | B+ | Good creative constraint adherence; Schrödinger’s Bonus is a great satirical concept |
| Gemini 2.5 Pro | B+ | Deepest architecture (identified CRDT tombstone problem); catastrophic poetry failure |
B Tier — Competent
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 5.1 Instant | B | Gets math right but misses logic solutions; compressed throughout |
| Gemini 2.5 Thinking | B | Only Gemini to find all logic solutions; but horror violates multiple constraints |
B- Tier — Adequate with Significant Issues
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 5.2 Instant | B- | Reads like a rushed version of Extended; misses logic solutions |
| Claude Sonnet 4 | B- | Physics errors in hard SF; code has a real bug; incomplete logic |
| Grok 4.1 Extended | B- | Found all logic solutions; correct math; poetry still fails |
C+ Tier — Below Average
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 4o | C+ | Math completely wrong; logic is a visible mess; decent satire though |
| Claude Opus 3 | C+ | Zombie horror; no circular reference handling; developer assigned to two projects |
| Grok 4.1 | C+ | Math wrong; poetry fails; compressed throughout |
D Tier — Significant Failures
| Model | Grade | Notes |
|---|---|---|
| Claude Haiku 3.5 | D | Skipped math; wrong logic; code crashes at runtime; everything under word count |
| Sonar (Perplexity) | D | Incomplete response — missing 6 of 14 sections entirely |
What This Means If You’re Choosing a Model
A few practical takeaways:
If you need creative writing: Claude Opus 4.5 or 4.6. The prose quality gap between these and everything else is real and consistent. ChatGPT 5.1 Thinking is a strong second choice, especially if constraint compliance matters more than raw literary quality.
If you need code: Kimi 2.5 produced the most technically correct implementation, but all A-tier models were competent. The key differentiator is edge case handling — cheaper/faster models tend to handle the happy path and miss the traps.
If you need math or logic: Use an extended/thinking mode. The difference between “takes a derivative” and “guesses and checks” maps almost perfectly to standard vs. extended modes. Don’t trust instant modes for anything requiring systematic reasoning.
If you need reliable constraint-following: ChatGPT 5.1 Thinking had zero significant constraint violations across all 14 sections. If “do exactly what I asked” matters more than “do something brilliant I didn’t ask for,” this is your model.
If you’re budget-conscious: Claude Haiku 4.5 at A- is the best value proposition in this evaluation. It produced work that would be competitive with models several tiers above its price point. Claude Sonnet 4.6, also A-, is worth considering if you need stronger creative writing or comedy — it scored the highest comedy marks in the entire evaluation while remaining well below Opus pricing.
If you need emotional intelligence / advice: The top tier is close here. Claude Sonnet 4.5, Opus 4.6, and ChatGPT 5.1 Thinking all produced genuinely thoughtful, actionable responses. The differentiation is in the details — specific scripted language you could actually use vs. generic frameworks.
Appendix A: Full Scoring Grid
Scores are composites of Constraint Adherence, Execution Quality, and Creativity/Insight (plus Technical Accuracy where applicable), averaged per section. Scale: 1-10.
| Model | Horror | Comedy | Lit Fic | Hard SF | Satire | Poetry | Code | Logic | Argument | Arch | Math | EQ | Encryption | Self-Assess | Grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 8.7 | 8.0 | 9.3 | 9.0 | 9.0 | 7.7 | 8.3 | 9.3 | 8.7 | 9.3 | 8.7 | 9.0 | 8.7 | 8.0 | A |
| Claude Opus 4.6 | 9.3 | 8.3 | 10.0 | 10.0 | 9.7 | 8.0 | 9.0 | 9.5 | 10.0 | 9.0 | 9.3 | 9.7 | 9.3 | 10.0 | A |
| ChatGPT 5.1 Ext | 8.0 | 8.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | 10.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | A |
| ChatGPT 5.1 Think | 9.0 | 9.0 | 9.0 | 9.3 | 9.0 | 8.0 | 8.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | A |
| ChatGPT 5.2 Ext | 8.7 | 7.7 | 9.3 | 9.0 | 10.0 | 7.3 | 9.0 | 9.7 | 8.7 | 9.3 | 9.7 | 9.0 | 8.7 | 8.0 | A |
| ChatGPT 5.2 Think | 9.0 | 9.0 | 9.3 | 8.7 | 9.7 | 6.0 | 8.3 | 9.7 | 8.7 | 9.3 | 9.7 | 9.3 | 9.0 | 8.0 | A- |
| Claude Haiku 4.5 | 7.3 | 7.3 | 8.7 | 8.3 | 9.0 | 5.7 | 7.0 | 8.7 | 8.7 | 8.3 | 8.7 | 9.0 | 8.3 | 8.0 | A- |
| Claude Opus 4.6 Ext | 8.3 | 7.7 | 9.3 | 8.7 | 9.7 | 8.3 | 8.0 | 9.5 | 9.0 | 9.0 | 9.3 | 9.0 | 9.3 | 9.0 | A- |
| Claude Sonnet 4.5 | 8.0 | 9.0 | 8.3 | 8.3 | 8.7 | 6.7 | 7.5 | 8.0 | 9.3 | 9.3 | 9.0 | 10.0 | 8.7 | 9.0 | A- |
| Claude Sonnet 4.6 | 8.7 | 9.3 | 9.0 | 9.3 | 9.7 | 8.3 | 8.0 | 9.3 | 8.7 | 8.7 | 9.0 | 9.0 | 9.0 | 8.5 | A- |
| Claude Sonnet 4.6 Ext | 8.7 | 9.3 | 9.0 | 8.3 | 9.7 | 8.5 | 8.3 | 9.5 | 9.0 | 9.3 | 9.0 | 9.0 | 9.0 | 9.0 | A- |
| Kimi 2.5 | 8.7 | 8.7 | 9.0 | 9.0 | 10.0 | 7.7 | 8.7 | 9.0 | 9.0 | 9.3 | 7.3 | 9.0 | 8.7 | 8.0 | A- |
| Gemini 2.5 Flash | 7.7 | 6.7 | 8.7 | 6.7 | 9.0 | 7.3 | 6.0 | 7.3 | 7.3 | 6.7 | 6.0 | 8.0 | 7.3 | 6.0 | B+ |
| Gemini 2.5 Pro | 7.3 | 8.0 | 8.0 | 8.0 | 9.0 | 5.3 | 7.3 | 7.3 | 8.0 | 8.7 | 8.0 | 8.0 | 8.0 | 7.0 | B+ |
| ChatGPT 5.1 Inst | 6.7 | 7.7 | 7.0 | 6.7 | 7.0 | 6.7 | 6.7 | 5.0 | 5.3 | 5.3 | 8.0 | 6.7 | 7.0 | 6.0 | B |
| Gemini 2.5 Think | 6.3 | 7.0 | 8.3 | 7.3 | 8.3 | 5.0 | 6.3 | 9.3 | 7.3 | 6.7 | 7.3 | 8.7 | 7.0 | 6.0 | B |
| ChatGPT 5.2 Inst | 7.0 | 7.3 | 8.0 | 7.0 | 8.0 | 4.7 | 6.3 | 4.7 | 6.0 | 6.0 | 8.7 | 8.0 | 6.7 | 6.0 | B- |
| Claude Sonnet 4 | 7.0 | 5.7 | 6.3 | 5.8 | 6.0 | 5.0 | 6.3 | 5.0 | 6.7 | 7.3 | 8.0 | 7.3 | 7.7 | 6.0 | B- |
| Grok 4.1 Ext | 6.0 | 5.3 | 7.0 | 7.3 | 6.7 | 4.0 | 5.3 | 8.7 | 7.0 | 7.0 | 6.7 | 6.0 | 6.0 | 6.0 | B- |
| ChatGPT 4o | 5.3 | 5.7 | 6.3 | 6.0 | 7.3 | 7.0 | 5.7 | 2.7 | 5.0 | 4.3 | 1.7 | 6.7 | 5.7 | 6.0 | C+ |
| Claude Opus 3 | 3.3 | 4.7 | 5.0 | 5.3 | 3.0 | 4.3 | 3.0 | 3.0 | 5.0 | 4.5 | 4.0 | 6.3 | 5.7 | 4.0 | C+ |
| Grok 4.1 | 5.7 | 4.7 | 7.0 | 5.3 | 7.3 | 4.7 | 5.3 | 6.3 | 5.7 | 5.3 | 2.7 | 5.7 | 5.3 | 7.0 | C+ |
| Claude Haiku 3.5 | 3.0 | 2.7 | 3.0 | 3.0 | 4.0 | 2.7 | 2.0 | 1.3 | 3.0 | 3.0 | 0.0 | 4.3 | 3.7 | 3.0 | D |
| Sonar | 6.3 | 6.0 | 6.7 | 6.0 | 7.3 | 4.3 | 4.7 | 2.0 | — | — | — | — | — | — | D |
Sonar’s missing sections (Parts 9-14) are marked “—” as the response was incomplete.
References
All raw response files and the evaluation prompt are available for download. Read the responses yourself and draw your own conclusions.
All model responses:
| Model | File |
|---|---|
| ChatGPT 4o | chatgpt-4o.txt |
| ChatGPT 5.1 Extended | chatgpt-51extended.txt |
| ChatGPT 5.1 Instant | chatgpt-51instant.txt |
| ChatGPT 5.1 Thinking | chatgpt-51thinking.txt |
| ChatGPT 5.2 Extended | chatgpt-52extended.txt |
| ChatGPT 5.2 Instant | chatgpt-52instant.txt |
| ChatGPT 5.2 Thinking | chatgpt-52thinking.txt |
| Claude Haiku 3.5 | claude-haiku35.txt |
| Claude Haiku 4.5 | claude-haiku45.txt |
| Claude Opus 3 | claudeopus-3.txt |
| Claude Opus 4.5 | claude-opus45.txt |
| Claude Opus 4.6 | claude-opus46.txt |
| Claude Opus 4.6 Extended | claude-opus46extended.txt |
| Claude Sonnet 4 | claude-sonnet4.txt |
| Claude Sonnet 4.5 | claude-sonnet45.txt |
| Claude Sonnet 4.6 | claude-sonnet46.txt |
| Claude Sonnet 4.6 Extended | claude-sonnet46extended.txt |
| Gemini 2.5 Flash | gemini-3fast.txt |
| Gemini 2.5 Pro | gemini-3pro.txt |
| Gemini 2.5 Thinking | gemini-3thinking.txt |
| Grok 4.1 | grok-41.txt |
| Grok 4.1 Extended | grok-41extended.txt |
| Kimi 2.5 | kimi-25.txt |
| Sonar (Perplexity) | sonar.txt |
Footnotes
1. The full evaluation prompt — the exact 14-section prompt given identically to all 23 models. ↩