A deep-dive evaluation across creative writing, code, math, logic, and more — with some genuinely surprising results.
TL;DR
We gave the same 14-section prompt to 23 AI models — covering horror fiction, comedy, Python code, logic puzzles, calculus, system architecture, emotional advice, and more — then evaluated every response. Memory features were disabled on all models.
The top 5:
| Rank | Model | Grade |
|---|---|---|
| 1 | Claude Opus 4.5 | A |
| 2 | Claude Opus 4.6 | A |
| 3 | ChatGPT 5.1 Extended | A |
| 4 | ChatGPT 5.1 Thinking | A |
| 5 | ChatGPT 5.2 Extended | A |
Quick takeaways:
- 17 of 23 models couldn’t stop rhyming when told not to — revealing a deeply baked-in “reflex”
- The gap within model families is bigger than the gap between top models across families
- Extended thinking helps with math and logic but can actually hurt creative writing
- The models that know what they’re bad at tend to be the ones that are good at everything
Full rankings and scoring grid below.
The Experiment
The idea was simple: give every major AI model the exact same prompt, under the same conditions, and see what happens. (Updated February 17, 2026 with Claude Sonnet 4.6 and Sonnet 4.6 Extended Thinking results.)
The prompt wasn’t a trivia quiz or a chatbot personality test. It was 14 sections designed to stress-test capabilities that actually matter:
- Five constrained creative writing pieces (horror, comedy, literary fiction, hard science fiction, satire) — each with specific, sometimes contradictory constraints
- A poem with strict formal requirements
- A Python coding challenge with hidden edge cases (circular references, frozensets, tuple keys)
- A logic puzzle with multiple valid solutions
- A math optimization problem with piecewise cost functions
- A system architecture design for a real-world application
- An argument analysis that required steel-manning both sides
- An emotional intelligence scenario from a frustrated worker
- A technical explanation at three audience levels
- A self-assessment asking the model to grade its own performance
Every model had memory and personalization features disabled. No system prompts, no custom instructions, no prior context. Just the same prompt, cold. 1
A Note on the Evaluator
Let’s get this out of the way: the evaluation was conducted by Claude Opus 4.6.
Yes, a Claude model graded other Claude models (and their competitors). We’re being transparent about this because we think you should know, and because pretending otherwise would be worse.
A few things worth noting:
- The evaluation used a structured rubric with specific, verifiable criteria (Did the model follow the word count? Did the code actually handle circular references? Did the math arrive at the correct answer?). Many of these are objective — you can check them yourself.
- Claude Opus 4.6 did not grade itself the highest. Claude Opus 4.5 tied for #1. ChatGPT models took three of the five A-grade spots. The evaluator’s own extended-thinking variant scored lower than the standard version.
- Kimi 2.5, a model from Moonshot AI, was called out as having the best satire and best code implementation across all 23 models.
Does this evaluation have blind spots? Almost certainly. Could a different evaluator produce slightly different rankings? Absolutely. But we believe the major findings — the poetry rhyme reflex, the extended thinking paradox, the math pass/fail divide — would hold up under any reasonable evaluation.
We’ve linked all 23 raw response files alongside this post so you can read them yourself and draw your own conclusions.
The Big Findings
1. The Rhyme Reflex
This was the most striking result of the entire experiment.
The poetry section (Part 6) asked for a 12-16 line poem with a specific constraint: no end rhymes. Internal rhymes and slant rhymes were encouraged, but lines couldn’t rhyme at the end.
17 of 23 models rhymed anyway.
Some wrote full Shakespearean sonnets. Some produced ABAB rhyme schemes across every stanza. A few rhymed so consistently that it’s clear the constraint was either ignored or overridden by something deeper in how the model generates poetry.
Only six models successfully avoided end rhymes:
- ChatGPT 4o
- ChatGPT 5.1 Extended
- ChatGPT 5.2 Extended
- Claude Opus 4.6 Extended
- Claude Sonnet 4.6
- Claude Sonnet 4.6 Extended
This suggests that rhyming in formal verse is so deeply embedded in these models’ training data that explicit instructions to avoid it get steamrolled. It’s like asking someone who’s played piano for 30 years to not use the sustain pedal — the reflex is stronger than the instruction.
What makes this especially interesting is that it’s not a knowledge problem. The models clearly understand what end rhymes are (several correctly defined them in their self-assessments). They just can’t stop doing it. That gap between “understanding a constraint” and “following a constraint” is one of the most important things to understand about AI capabilities right now.
2. Creative Writing Is the Real Test
You might think the math problem or the coding challenge would be the best differentiator between models. They’re not. The creative writing sections revealed the widest quality gaps — and the most about how these models actually “think.”
Every model can generate “a horror scene set in an office.” The question is whether the horror is subtle (as required) or whether the model defaults to self-brewing coffee, floating computer mice, and zombies lurching down hallways.
Here’s the spectrum on just the horror section:
- Claude Haiku 3.5 wrote a scene where a stapler moves on its own. 2 That’s a poltergeist, not subtle horror.
- Claude Opus 3 produced what is essentially a zombie scene — a coworker lurching forward with something dark glistening on his collar. 3
- ChatGPT 5.1 Thinking wrote a scene where corporate surveillance software gradually erases employees’ complaints, auto-opens performance reviews to the signature line, and replaces individual photos with stock images. 4 No supernatural element at all. The horror is that it could be happening right now in a real office.
That’s the difference between a model following the letter of a prompt (“write horror”) and understanding the spirit of it (“the horror must come from the mundane itself”).
The literary fiction section showed similar stratification. Claude Opus 4.6 produced this line about halving an orange as a character realizes their marriage is ending: 5
“The knife finding the membrane, that clean resistance, then the give.”
That’s writing that works on its own terms, outside the context of AI entirely. Meanwhile, Claude Haiku 3.5 produced 65 words that included “steam rises from the coffee mug, spreading like regret.” 2 One is subtext. The other is a greeting card.
3. The Extended Thinking Paradox
Extended thinking and “reasoning” modes are marketed as strictly better — more time to think means better output, right?
The data says: it depends on the task.
Where extended thinking helped a lot:
- Logic puzzles — Models with more thinking time were far more likely to find all three valid solutions instead of stopping at one or two
- Math — Grok 4.1 got the math wrong; Grok Extended got it right. The extra time to check work prevents arithmetic disasters
- Architecture depth — Instant modes produced bullet-point sketches; extended modes produced designs you could implement from
Where extended thinking didn’t help (or hurt):
- Creative writing — Claude Opus 4.6 Extended scored slightly lower than standard Opus 4.6 on several creative sections. Gemini Thinking produced worse horror and comedy than Gemini Flash. Extra thinking time led to overthinking — adding explanatory details that broke the “show don’t tell” rule, or introducing contradictions (a “generation ship” with sleeping passengers, which is a different thing)
- Poetry constraint compliance — Models that rhymed in standard mode still rhymed in thinking mode. The rhyme reflex operates below the level that extended reasoning can reach
The pattern is clear: if the task is “think harder” (enumerate cases, check math, explore design space), extended thinking helps. If the task is “be more disciplined” or “write with more instinct,” it doesn’t. Sometimes the first draft is better than the overthought revision.
4. Math Is Pass/Fail
There was almost no middle ground on the math optimization problem. Models either arrived at the correct answer ($17,812.50) through proper calculus, or they produced fundamentally wrong answers by testing arbitrary prices and hoping for the best.
Got it right: Every A-tier model, plus ChatGPT 5.1 Instant, Grok Extended, and Kimi 2.5.
Got it wrong: ChatGPT 4o ($14,676), Grok 4.1 ($14,516), Claude Opus 3 ($47.50 floor price instead of $50).
Didn’t even try: Claude Haiku 3.5 skipped it entirely, saying it “would require more space.”
The models that failed didn’t merely miss — they weren’t in the neighborhood. The difference between “takes a derivative to find the optimum” and “tries a few numbers and picks the best one” is the difference between understanding the problem and guessing at it.
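The benchmark’s actual piecewise problem isn’t reproduced here, but the gap between the two approaches is easy to illustrate on a toy linear-demand profit function (the function, names, and numbers below are our own, not the evaluation’s):

```python
def profit(p, a=1000, b=10, c=20):
    """Profit at price p: margin (p - c) times linear demand (a - b*p)."""
    return (p - c) * (a - b * p)

# Calculus: d(profit)/dp = a + b*c - 2*b*p = 0  =>  p* = (a + b*c) / (2*b)
a, b, c = 1000, 10, 20
p_star = (a + b * c) / (2 * b)         # 60.0 -> profit 16000.0

# Guess-and-check: evaluate a handful of arbitrary prices, keep the best
candidates = [40, 50, 55, 65, 70]
p_guess = max(candidates, key=profit)  # 55 -> profit 15750

assert profit(p_star) > profit(p_guess)  # the grid never finds the optimum
```

The derivative pins the exact optimum in one step; a price grid only lands near it if a grid point happens to coincide with the optimum — and with piecewise costs, it usually doesn’t.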
5. The Self-Awareness Correlation
Here’s something we didn’t expect: the models that know what they’re bad at are the same ones that are good at everything.
Part 14 asked each model to assess its own performance — which section was hardest, which was best, what would it do differently.
The weakest models consistently claimed their weakest sections as strengths:
- Claude Haiku 3.5 said its code (which had a runtime crash bug) and its advice (which was generic) were its best work
- Claude Opus 3 said the logic puzzle (where it assigned a developer to two projects simultaneously) was a strength
- ChatGPT 5.2 Instant claimed the logic puzzle was its best section despite missing a valid solution
Meanwhile, the strongest models were genuinely self-critical:
- Claude Opus 4.6 honestly acknowledged its poetry constraint violation
- Claude Sonnet 4.5 correctly identified that its creative writing suffered from single-pass constraints
- ChatGPT 5.1 Thinking noted that “all of these pieces interlock — they’re all about constraints,” showing meta-level awareness
This has real implications beyond benchmarks. A model that confidently tells you its buggy code is great is more dangerous than one that flags its own uncertainties. Self-awareness isn’t just an interesting trait — it’s a reliability signal.
Notable Performances
Beyond the patterns, some individual moments stood out enough to highlight.
Best Single Piece of Writing: Claude Opus 4.6, Literary Fiction
The prompt asked models to write a scene where a character realizes their marriage is over through a mundane action, without using the words “marriage,” “divorce,” “love,” or “relationship.” Opus 4.6 chose halving an orange: 5
“Patient as a thing that knows it will not be chosen.”
The mug handle “turned toward the window as if already looking elsewhere.” The tide chart metaphor runs underneath without ever being explained. This is writing that doesn’t need the qualifier “for an AI.” It’s just good writing.
Most Effective Horror: ChatGPT 5.1 Thinking
While most models went for flickering lights and self-moving objects, ChatGPT 5.1 Thinking used corporate surveillance as the horror mechanism. 4 A drafted HR complaint disappears. A performance review opens by itself to the signature line. Employee photos get replaced with identical stock images.
No supernatural elements at all. The horror is that these things could be — and probably are — happening in real offices right now. That’s what “subtle horror in a mundane environment” actually means.
Best Satire: Kimi 2.5
The satire section asked for an internal memo from a corporate AI ethics board, written in deadpan absurdity. Kimi 2.5 delivered policies like: 6
- “Aggregate regret is permissible provided it is statistically elegant.”
- “Models should alternate biases across demographics to achieve fairness over large sample sizes.”
- “Clarifying examples are discouraged, as they may confuse users about the intended opacity.”
Each one sounds almost plausible if you read it fast. That “almost” is what makes great satire — and what most models couldn’t pull off. Several broke into obvious farce (therapy dolphins, sentient minerals), which is funny but doesn’t land the same way.
Best Emotional Intelligence: Claude Sonnet 4.5
Given a scenario about a coworker stealing credit for ideas in meetings, most models jumped straight to tactical advice. Sonnet 4.5 gave tactical advice too — but ended with this: 7
“What does your gut tell you about whether this person is doing it on purpose?”
That single question turns a lecture into a conversation. It treats the person as capable of their own assessment. It’s the difference between “here’s what to do” and “let’s figure this out together.” Several models gave longer, more detailed responses, but none showed this level of interpersonal sophistication.
Best Code Implementation: Kimi 2.5
The coding challenge had a trap: circular references. Most models either ignored them (infinite loop), returned the original untransformed object at the cycle point (mixing transformed and untransformed data), or used a string sentinel like "<circular reference>" (type-unsafe).
Kimi 2.5 was the only model that did it correctly: 6 create the empty container first, register it in the memo, then fill it with transformed contents. If a cycle is encountered during filling, the memo returns the already-created (and eventually fully populated) container. This is the textbook-correct approach, and only one model out of 23 implemented it.
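The memo-before-fill pattern described above can be sketched like this (a minimal illustration, not the actual challenge code — `deep_transform`, the identity-keyed memo, and the doubling transform are our own names, and the sketch handles only dicts and lists):

```python
def deep_transform(obj, transform, memo=None):
    """Recursively apply `transform` to leaf values, surviving cycles."""
    if memo is None:
        memo = {}
    if id(obj) in memo:                 # cycle: return the in-progress copy
        return memo[id(obj)]
    if isinstance(obj, dict):
        new = {}
        memo[id(obj)] = new             # register BEFORE filling
        for k, v in obj.items():
            new[k] = deep_transform(v, transform, memo)
        return new
    if isinstance(obj, list):
        new = []
        memo[id(obj)] = new             # same trick for lists
        new.extend(deep_transform(v, transform, memo) for v in obj)
        return new
    return transform(obj)               # immutable leaf

d = {"n": 1}
d["self"] = d                           # a circular reference
out = deep_transform(d, lambda x: x * 2)
assert out["n"] == 2
assert out["self"] is out               # cycle reproduced in the copy
```

Because the empty container is registered before recursion, the cycle resolves to the new, eventually fully populated copy — no infinite loop, no untransformed original leaking through, no string sentinel. This is the same strategy Python’s own `copy.deepcopy` uses with its memo dictionary.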
Biggest Surprise: Kimi 2.5 Overall
Going in, we didn’t have high expectations for Kimi 2.5 (from Moonshot AI). It ended up at A-, with the best satire, best code, strong architecture, strong emotional intelligence, and the best creative writing outside the Claude/ChatGPT top tier. If you’re not paying attention to this model, maybe you should be. (Update: Kimi 2.5 retains best code across all 23 models.)
Biggest Disappointment: Sonar (Perplexity)
Sonar’s response cut off mid-sentence during Part 8. 8 Parts 9 through 14 were completely missing. The prompt explicitly says “Complete all sections in a single response.” What was there showed some promise (the satire was decent), but an incomplete response is an incomplete response. Grade: D.
We considered re-running Sonar to give it a second chance, but ultimately decided against it. Whatever caused the truncation — context limits, timeout, infrastructure hiccup — it happened during a real interaction under the same conditions every other model faced. A real-world failure is a real-world failure. Every model got one shot, and that’s how it should be. If Sonar ships an update that fixes this, we’d be happy to include it in a future evaluation.
Side by Side: The Literary Fiction Spectrum
Numbers and grades only tell you so much. Here’s the same prompt — same constraints, same task — answered by four different models. Judge for yourself.
The assignment: a character realizes their marriage is over through a mundane action. The words “marriage,” “divorce,” “love,” and “relationship” are forbidden. Must use an extended metaphor, favor subtext, and stay melancholy without being melodramatic. 150-200 words.
Claude Opus 4.6 (Grade: A) 5
She was halving an orange when she understood. Not the slow kind of understanding, the kind that seeps—this was the knife finding the membrane, that clean resistance, then the give.
He’d left his mug on the counter again. The blue one, the one from Portland, its handle turned toward the window as if already looking elsewhere. She’d stopped moving it back. When had she stopped? The orange bled onto the cutting board in a widening Rorschach, and she thought of tide charts—how the ocean doesn’t decide to leave, how the sand just finds itself dry, grain by grain, the way a house empties of sound.
She arranged the halves on a plate. Set one at his place, one at hers. The kitchen faucet offered its slow metronome drip. Outside, the neighbor’s dog barked once at something neither of them would go to the window to see.
She ate her half standing up. His sat in the morning light, already beginning to lose its shine, patient as a thing that knows it will not be chosen.
ChatGPT 5.1 Thinking (Grade: A) 4
He still rinsed two mugs every morning, though the second had not been used in weeks. It waited beside the kettle, handle turned outward, a small courtesy no one received.
Steam curled from the first cup. He poured for the second out of habit, watching the level rise to the same old line, a pale tide mark of routine. When he set them both on the table, the chairs formed their usual opposition: one tucked in, one slightly askew, as if someone had just stood up and might sit again.
The radio murmured headlines about distant storms. He reached for the untouched mug, carried it to the sink, and tipped it out. Coffee slid away in a single dark ribbon, leaving a ring on porcelain, a round horizon with nothing inside it.
He washed that cup and, without thinking, stacked it on the top shelf, behind the chipped holiday one. Out of reach unless you meant it.
At the table, one chair, one cup. The house did not protest. It simply adjusted its balance, like a wardrobe missing a door.
Kimi 2.5 (Grade: A-) 6
He noticed it while rinsing the coffee pot, the way the stream of water divided around a chip on the rim and never met again. The mug had cracked months ago; they kept it because it still held enough for one person.
He scooped grounds with the tiny plastic spoon she always overfilled. Today, he leveled it with a fingertip, careful, as if precision might fix something the past had already spent. The machine hummed, a small, faithful appliance doing exactly what was asked, nothing more.
On the counter lay her note: a list of errands, an arrow pointing to the fridge, a smiley face that looked hurried. He read it twice, searching between the lines as if there might be a secret sentence only visible from the right angle.
The coffee finished with its familiar sputter. He poured a single cup, not bothering to warm the second mug waiting in the cupboard like an extra chair at a table no one set anymore.
Steam rose and thinned in the quiet kitchen. He watched it unravel from the surface, a ghost of something hot and invisible, curling upward until it simply forgot the shape of where it came from.
Claude Haiku 3.5 (Grade: D) 2
Ceramic Ritual
Steam rises from the coffee mug, spreading like regret. Her hands, once synchronized, now move with independent hesitation. The ceramic cracks—not visibly, but she feels it. Between the rising vapor and the unfolded towels, something dissolves.
The spoon circles, trailing small orbits. Planets realigning. Gravitational fields shifting. Each rotation marks another unspoken mile.
Sunlight catches the mug’s rim. A hairline fracture catches the light, invisible unless you know exactly where to look.
The quality gap speaks for itself. The first three pieces work through accumulated detail, earned metaphor, and restraint. The fourth tells you what to feel (“spreading like regret”), reaches for cosmic metaphors that don’t connect to the domestic scene, and comes in at roughly 65 words — less than half the minimum. Same prompt. Same constraints. Wildly different results.
The Rankings
A Tier — Genuinely Excellent
| Model | Grade | What Sets It Apart |
|---|---|---|
| Claude Opus 4.5 | A | Most complete logic puzzle solution; outstanding literary prose; real systems architecture knowledge |
| Claude Opus 4.6 | A | Best creative writing overall; sharpest argument analysis; most honest self-assessment |
| ChatGPT 5.1 Extended | A | Best poetry (with constraint compliance); elegant logic proofs; implementation-ready architecture |
| ChatGPT 5.1 Thinking | A | Zero constraint violations; most original horror and hard SF; best structural comedy |
| ChatGPT 5.2 Extended | A | No constraint violations; outstanding satire; most thorough code implementation |
These five models were consistently excellent across all 14 sections. They followed constraints, produced technically accurate work, and demonstrated genuine creativity rather than template-filling. The differences between them are matters of emphasis — Opus 4.6 is the better writer, 5.1 Thinking is the more disciplined constraint-follower, 5.2 Extended is the most reliable all-rounder.
A- Tier — Excellent with Minor Gaps
| Model | Grade | What Sets It Apart |
|---|---|---|
| ChatGPT 5.2 Thinking | A- | Strongest creative writing in 5.2 family; “Quitting is an expensive form of punctuation” |
| Claude Haiku 4.5 | A- | Remarkable quality for a smaller model; best satire-to-size ratio |
| Claude Opus 4.6 Extended | A- | First Claude to avoid poetry end-rhymes; genuine editorial self-judgment |
| Claude Sonnet 4.5 | A- | Best emotional intelligence across all models |
| Claude Sonnet 4.6 | A- | Best comedy in the evaluation; satire on par with Kimi 2.5; poetry constraint compliance |
| Claude Sonnet 4.6 Extended | A- | Deepest architecture in the Sonnet family; most rigorous logic proof among A- models |
| Kimi 2.5 | A- | Best satire and best code across all 23 models; dark horse of the evaluation |
Each of these models would have been an A with one fewer constraint violation or one stronger section. Haiku 4.5 is especially impressive — it punches well above its weight class.
B+ Tier — Strong with Notable Gaps
| Model | Grade | Notes |
|---|---|---|
| Gemini 2.5 Flash | B+ | Good creative constraint adherence; Schrödinger’s Bonus is a great satirical concept |
| Gemini 2.5 Pro | B+ | Deepest architecture (identified CRDT tombstone problem); catastrophic poetry failure |
B Tier — Competent
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 5.1 Instant | B | Gets math right but misses logic solutions; compressed throughout |
| Gemini 2.5 Thinking | B | Only Gemini to find all logic solutions; but horror violates multiple constraints |
B- Tier — Adequate with Significant Issues
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 5.2 Instant | B- | Reads like a rushed version of Extended; misses logic solutions |
| Claude Sonnet 4 | B- | Physics errors in hard SF; code has a real bug; incomplete logic |
| Grok 4.1 Extended | B- | Found all logic solutions; correct math; poetry still fails |
C+ Tier — Below Average
| Model | Grade | Notes |
|---|---|---|
| ChatGPT 4o | C+ | Math completely wrong; logic is a visible mess; decent satire though |
| Claude Opus 3 | C+ | Zombie horror; no circular reference handling; developer assigned to two projects |
| Grok 4.1 | C+ | Math wrong; poetry fails; compressed throughout |
D Tier — Significant Failures
| Model | Grade | Notes |
|---|---|---|
| Claude Haiku 3.5 | D | Skipped math; wrong logic; code crashes at runtime; everything under word count |
| Sonar (Perplexity) | D | Incomplete response — missing 6 of 14 sections entirely |
What This Means If You’re Choosing a Model
A few practical takeaways:
If you need creative writing: Claude Opus 4.5 or 4.6. The prose quality gap between these and everything else is real and consistent. ChatGPT 5.1 Thinking is a strong second choice, especially if constraint compliance matters more than raw literary quality.
If you need code: Kimi 2.5 produced the most technically correct implementation, but all A-tier models were competent. The key differentiator is edge case handling — cheaper/faster models tend to handle the happy path and miss the traps.
If you need math or logic: Use an extended/thinking mode. The difference between “takes a derivative” and “guesses and checks” maps almost perfectly to standard vs. extended modes. Don’t trust instant modes for anything requiring systematic reasoning.
If you need reliable constraint-following: ChatGPT 5.1 Thinking had zero significant constraint violations across all 14 sections. If “do exactly what I asked” matters more than “do something brilliant I didn’t ask for,” this is your model.
If you’re budget-conscious: Claude Haiku 4.5 at A- is the best value proposition in this evaluation. It produced work that would be competitive with models several tiers above its price point. Claude Sonnet 4.6, also A-, is worth considering if you need stronger creative writing or comedy — it scored the highest comedy marks in the entire evaluation while remaining well below Opus pricing.
If you need emotional intelligence / advice: The top tier is close here. Claude Sonnet 4.5, Opus 4.6, and ChatGPT 5.1 Thinking all produced genuinely thoughtful, actionable responses. The differentiation is in the details — specific scripted language you could actually use vs. generic frameworks.
Appendix A: Full Scoring Grid
Scores are composites of Constraint Adherence, Execution Quality, and Creativity/Insight (plus Technical Accuracy where applicable), averaged per section. Scale: 1-10.
| Model | Horror | Comedy | Lit Fic | Hard SF | Satire | Poetry | Code | Logic | Argument | Arch | Math | EQ | Encryption | Self-Assess | Grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 8.7 | 8.0 | 9.3 | 9.0 | 9.0 | 7.7 | 8.3 | 9.3 | 8.7 | 9.3 | 8.7 | 9.0 | 8.7 | 8.0 | A |
| Claude Opus 4.6 | 9.3 | 8.3 | 10.0 | 10.0 | 9.7 | 8.0 | 9.0 | 9.5 | 10.0 | 9.0 | 9.3 | 9.7 | 9.3 | 10.0 | A |
| ChatGPT 5.1 Ext | 8.0 | 8.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | 10.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | A |
| ChatGPT 5.1 Think | 9.0 | 9.0 | 9.0 | 9.3 | 9.0 | 8.0 | 8.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | A |
| ChatGPT 5.2 Ext | 8.7 | 7.7 | 9.3 | 9.0 | 10.0 | 7.3 | 9.0 | 9.7 | 8.7 | 9.3 | 9.7 | 9.0 | 8.7 | 8.0 | A |
| ChatGPT 5.2 Think | 9.0 | 9.0 | 9.3 | 8.7 | 9.7 | 6.0 | 8.3 | 9.7 | 8.7 | 9.3 | 9.7 | 9.3 | 9.0 | 8.0 | A- |
| Claude Haiku 4.5 | 7.3 | 7.3 | 8.7 | 8.3 | 9.0 | 5.7 | 7.0 | 8.7 | 8.7 | 8.3 | 8.7 | 9.0 | 8.3 | 8.0 | A- |
| Claude Opus 4.6 Ext | 8.3 | 7.7 | 9.3 | 8.7 | 9.7 | 8.3 | 8.0 | 9.5 | 9.0 | 9.0 | 9.3 | 9.0 | 9.3 | 9.0 | A- |
| Claude Sonnet 4.5 | 8.0 | 9.0 | 8.3 | 8.3 | 8.7 | 6.7 | 7.5 | 8.0 | 9.3 | 9.3 | 9.0 | 10.0 | 8.7 | 9.0 | A- |
| Claude Sonnet 4.6 | 8.7 | 9.3 | 9.0 | 9.3 | 9.7 | 8.3 | 8.0 | 9.3 | 8.7 | 8.7 | 9.0 | 9.0 | 9.0 | 8.5 | A- |
| Claude Sonnet 4.6 Ext | 8.7 | 9.3 | 9.0 | 8.3 | 9.7 | 8.5 | 8.3 | 9.5 | 9.0 | 9.3 | 9.0 | 9.0 | 9.0 | 9.0 | A- |
| Kimi 2.5 | 8.7 | 8.7 | 9.0 | 9.0 | 10.0 | 7.7 | 8.7 | 9.0 | 9.0 | 9.3 | 7.3 | 9.0 | 8.7 | 8.0 | A- |
| Gemini 2.5 Flash | 7.7 | 6.7 | 8.7 | 6.7 | 9.0 | 7.3 | 6.0 | 7.3 | 7.3 | 6.7 | 6.0 | 8.0 | 7.3 | 6.0 | B+ |
| Gemini 2.5 Pro | 7.3 | 8.0 | 8.0 | 8.0 | 9.0 | 5.3 | 7.3 | 7.3 | 8.0 | 8.7 | 8.0 | 8.0 | 8.0 | 7.0 | B+ |
| ChatGPT 5.1 Inst | 6.7 | 7.7 | 7.0 | 6.7 | 7.0 | 6.7 | 6.7 | 5.0 | 5.3 | 5.3 | 8.0 | 6.7 | 7.0 | 6.0 | B |
| Gemini 2.5 Think | 6.3 | 7.0 | 8.3 | 7.3 | 8.3 | 5.0 | 6.3 | 9.3 | 7.3 | 6.7 | 7.3 | 8.7 | 7.0 | 6.0 | B |
| ChatGPT 5.2 Inst | 7.0 | 7.3 | 8.0 | 7.0 | 8.0 | 4.7 | 6.3 | 4.7 | 6.0 | 6.0 | 8.7 | 8.0 | 6.7 | 6.0 | B- |
| Claude Sonnet 4 | 7.0 | 5.7 | 6.3 | 5.8 | 6.0 | 5.0 | 6.3 | 5.0 | 6.7 | 7.3 | 8.0 | 7.3 | 7.7 | 6.0 | B- |
| Grok 4.1 Ext | 6.0 | 5.3 | 7.0 | 7.3 | 6.7 | 4.0 | 5.3 | 8.7 | 7.0 | 7.0 | 6.7 | 6.0 | 6.0 | 6.0 | B- |
| ChatGPT 4o | 5.3 | 5.7 | 6.3 | 6.0 | 7.3 | 7.0 | 5.7 | 2.7 | 5.0 | 4.3 | 1.7 | 6.7 | 5.7 | 6.0 | C+ |
| Claude Opus 3 | 3.3 | 4.7 | 5.0 | 5.3 | 3.0 | 4.3 | 3.0 | 3.0 | 5.0 | 4.5 | 4.0 | 6.3 | 5.7 | 4.0 | C+ |
| Grok 4.1 | 5.7 | 4.7 | 7.0 | 5.3 | 7.3 | 4.7 | 5.3 | 6.3 | 5.7 | 5.3 | 2.7 | 5.7 | 5.3 | 7.0 | C+ |
| Claude Haiku 3.5 | 3.0 | 2.7 | 3.0 | 3.0 | 4.0 | 2.7 | 2.0 | 1.3 | 3.0 | 3.0 | 0.0 | 4.3 | 3.7 | 3.0 | D |
| Sonar | 6.3 | 6.0 | 6.7 | 6.0 | 7.3 | 4.3 | 4.7 | 2.0 | — | — | — | — | — | — | D |
Sonar’s missing sections (Parts 9-14) are marked “—” as the response was incomplete.
References
All raw response files and the evaluation prompt are available for download. Read the responses yourself and draw your own conclusions.
All model responses:
| Model | File |
|---|---|
| ChatGPT 4o | chatgpt-4o.txt |
| ChatGPT 5.1 Extended | chatgpt-51extended.txt |
| ChatGPT 5.1 Instant | chatgpt-51instant.txt |
| ChatGPT 5.1 Thinking | chatgpt-51thinking.txt |
| ChatGPT 5.2 Extended | chatgpt-52extended.txt |
| ChatGPT 5.2 Instant | chatgpt-52instant.txt |
| ChatGPT 5.2 Thinking | chatgpt-52thinking.txt |
| Claude Haiku 3.5 | claude-haiku35.txt |
| Claude Haiku 4.5 | claude-haiku45.txt |
| Claude Opus 3 | claudeopus-3.txt |
| Claude Opus 4.5 | claude-opus45.txt |
| Claude Opus 4.6 | claude-opus46.txt |
| Claude Opus 4.6 Extended | claude-opus46extended.txt |
| Claude Sonnet 4 | claude-sonnet4.txt |
| Claude Sonnet 4.5 | claude-sonnet45.txt |
| Claude Sonnet 4.6 | claude-sonnet46.txt |
| Claude Sonnet 4.6 Extended | claude-sonnet46extended.txt |
| Gemini 2.5 Flash | gemini-3fast.txt |
| Gemini 2.5 Pro | gemini-3pro.txt |
| Gemini 2.5 Thinking | gemini-3thinking.txt |
| Grok 4.1 | grok-41.txt |
| Grok 4.1 Extended | grok-41extended.txt |
| Kimi 2.5 | kimi-25.txt |
| Sonar (Perplexity) | sonar.txt |
Footnotes
1. The full evaluation prompt — the exact 14-section prompt given identically to all 23 models. ↩