We Blind-Tested ChatGPT vs Claude vs Gemini — Here's the Winner (2026)
We removed all labels, randomized the order, and asked 134 people to vote blind. Claude won 4 out of 8 rounds. ChatGPT won just 1. Full results + capabilities breakdown inside.
Last Sunday, Karo (Product with Attitude) and I put ChatGPT, Claude, and Gemini through a blind test across 8 different prompts. We stripped the labels, randomized the order, and asked you to vote on which output was best. No model names. No hints.
134 of you showed up for Round 1. By Round 8, 111 were still going.
And now we've finally gotten to the moment we've all been waiting for. The voting is closed, the labels are off, and it's time to find out which AI model won.
Which AI model won the blind test?
Out of 8 rounds, with over 100 voters in each one:
🥇 Claude: 4 wins (Rounds 1, 2, 5, 6)
🥈 Gemini: 3 wins (Rounds 3, 7, 8)
🥉 ChatGPT: 1 win (Round 4)
Want to check your own answers? Go back to the blind test, look at what you picked, then compare it with this answer key to see which model was behind each output. How many did you get right? Let us know!
If these results make you curious to try another model without losing your chats and memory, I wrote a step-by-step guide on how to transfer your data between AI tools.
Now, back to our blind test results. Claude took half the rounds. Gemini, which most people still treat as an afterthought, came in second. And ChatGPT, the only AI most people have heard of, won exactly once.
I honestly expected a much tighter race.
Just look at those margins. When Claude won a round, it won by 35 to 54 points. Gemini's wins were much closer, with margins between 3 and 11 points. ChatGPT won only one round, but that win wasn't a fluke: 53% on The Strategist, 25 points ahead of Gemini.
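If you want to check a margin yourself, it is just the winner's share minus the runner-up's share, in percentage points. Here is a minimal Python sketch using Round 4. The 53% and the 25-point lead come from the results above; Claude's 19% is an assumed remainder (it presumes the three shares add up to 100), and `winning_margin` is a hypothetical helper name.

```python
def winning_margin(shares):
    """Return (winner, runner_up, margin in percentage points)."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    (winner, top), (runner_up, second) = ranked[0], ranked[1]
    return winner, runner_up, top - second


# Round 4 ("The Strategist"). ChatGPT's 53% and its 25-point lead over
# Gemini are from the post; Claude's 19% is an ASSUMED remainder.
strategist = {"ChatGPT": 53, "Gemini": 28, "Claude": 19}

print(winning_margin(strategist))  # ('ChatGPT', 'Gemini', 25)
```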
But that's just the scoreboard. A lot of you thought each letter always belonged to the same model across all rounds. It didn't. But once you believe A is always Claude, you start voting for A differently. That's a bias we didn't account for (and honestly, we didn't do the best job randomizing the letters either).
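One way to reduce that letter bias is to draw a fresh letter-to-model assignment for every round and keep the answer key private until voting closes. The Python sketch below shows the idea; it is not how we actually assigned letters, and `assign_letters` and the seed value are hypothetical.

```python
import random

MODELS = ["ChatGPT", "Claude", "Gemini"]
LETTERS = ["A", "B", "C"]


def assign_letters(num_rounds, seed=None):
    """Build an answer key: one independent letter -> model mapping per round."""
    rng = random.Random(seed)  # a fixed seed makes the key reproducible
    answer_key = []
    for _ in range(num_rounds):
        order = MODELS[:]      # copy so the original list stays untouched
        rng.shuffle(order)     # independent shuffle for this round
        answer_key.append(dict(zip(LETTERS, order)))
    return answer_key


answer_key = assign_letters(8, seed=2026)
for round_no, mapping in enumerate(answer_key, start=1):
    print(round_no, mapping)
```

With only six possible orderings, some rounds will still share a mapping by chance. That is fine, as long as voters can't predict it.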
Next, Karo (Product with Attitude) digs into the patterns and what this experiment can and can’t actually tell us. Then I break down all three models side by side, so you know what each one actually does best.
What we learned about your model preferences
Claude won the writing rounds, and it wasn’t even close.
71% of you picked Claude as the best simplifier.
62% chose it as the best poet.
58% chose it as the wisest.
ChatGPT’s one win was the most analytical round. The strategist prompt (“your competitor just launched the same product cheaper, what’s your first move?”) was the only round where ChatGPT took first place, at 53%. This suggests ChatGPT’s strengths may lie in structured, business-oriented reasoning rather than in creative or linguistic tasks, at least under these short-output conditions.
Gemini was the quiet all-rounder. Gemini never dominated a round the way Claude did, but it also never bombed one. It showed up consistently in first or second place across every category. So if Claude is the writer and ChatGPT is the strategist, Gemini is the generalist who’s never the worst choice.
What this data can’t tell you:
These results reflect how each model performs cold (no personalization, no memory, no history). In reality, the model you’ve been using for months knows your tone and your preferences. That relationship changes the output in ways this experiment couldn’t capture.
We also only tested short responses. For longer, more complex work (where each model’s reasoning depth, tool use, and context handling start to diverge) the rankings might look very different.
This is a snapshot, not a verdict. But it’s a snapshot built on blind votes, not brand loyalty. And that makes it worth paying attention to.
What we learned about experiment design on Substack
We set out to test three AI models, and ended up learning as much about experiment design on Substack.
What worked well:
The blind setup held up. We ran identical prompts in a fresh instance of Perplexity’s Model Council: no logged-in accounts, no memory, no saved preferences. None of the models knew us, and none of the readers knew which model they were voting for.
That part was airtight.
What we’d do differently:
We underestimated voter fatigue. We assumed readers would power through all 8 rounds with equal attention. They didn’t. 23 fewer people voted in Round 8 than in Round 1, a 17% drop-off.
Substack has no native way to display three long-form text blocks side by side. So we turned them into screenshot images. On desktop, fine. On mobile, nearly unreadable without tapping to zoom.
On Substack, there’s no way to randomize round order per reader, which introduced position bias: everyone saw Round 1 first and Round 8 last, and everyone saw the answers in the same order.
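On a platform that did allow it, per-reader randomization could look roughly like the Python sketch below. It shuffles both the round order and the order of the three outputs within each round, seeded from a reader identifier so the same reader always gets the same plan. Everything here is an assumption for illustration: `reader_plan`, the `reader_id` string, and the hashing scheme are hypothetical and not a Substack feature.

```python
import hashlib
import random


def reader_plan(reader_id, num_rounds=8, models=("ChatGPT", "Claude", "Gemini")):
    """Return a list of (round_number, display_order) tuples for one reader."""
    # Derive a stable seed from the reader id so reloading the page
    # does not reshuffle what this reader sees.
    digest = hashlib.sha256(reader_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    rounds = list(range(1, num_rounds + 1))
    rng.shuffle(rounds)  # each reader gets a different round order

    plan = []
    for round_number in rounds:
        display_order = list(models)
        rng.shuffle(display_order)  # position 0 is shown as A, 1 as B, 2 as C
        plan.append((round_number, display_order))
    return plan


for round_number, display_order in reader_plan("reader-42"):
    print(round_number, display_order)
```

Spreading the late rounds across readers this way would not stop people from dropping out, but the drop-off would no longer fall on the same prompts every time.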
What we learned:
The irony isn’t lost on us: we built an experiment to figure out which AI thinks best, and the hardest thinking was the experiment itself. Controlling for model bias was the easy part. Controlling for platform constraints, reader behavior, and our own assumptions about how people would engage was where it got harder.
Which brings me to the three conclusions I always share with readers here on Substack:
The tool is rarely the answer. The thinking you do before you open it is. ChatGPT, Claude, and Gemini are all impressive. But the gap between a good result and a bad one comes down to how clearly you define what you need. That was true for our prompts. It was even more true for our experiment design.
Don’t trust tool recommendations (including ours!). Test them on your own. The whole point of this experiment was to replace opinions with data. And even though what we got was imperfect, Substack-constrained data, this experiment was significantly more rigorous than the typical “I asked 3 models the same thing” hot take on X. Your workflow isn’t my workflow. Your prompts aren’t my prompts. The only way to find your best model is to run your own blind test with the tasks you do every day.
Ship it before it’s perfect, then make it better. This experiment had real flaws. But if we’d waited until the methodology and Substack features were bulletproof, we’d know far less than we know today from more than 100 votes in every round. Version two of this test will be sharper, because version one existed.
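If you do run your own blind test, the reveal step is simple: map each vote back to the model behind the letter and turn the counts into shares. Here is a minimal Python sketch. The answer key and the ten votes are made-up illustration data, not results from our test, and `tally` is a hypothetical helper.

```python
from collections import Counter


def tally(votes, answer_key):
    """Convert letter votes into each model's share of the vote, in percent."""
    counts = Counter(answer_key[letter] for letter in votes)
    total = sum(counts.values())
    # most_common() orders models from most to fewest votes
    return {model: round(100 * n / total, 1) for model, n in counts.most_common()}


# HYPOTHETICAL data for one round: which model hid behind each letter,
# and the letters ten people picked.
answer_key = {"A": "Gemini", "B": "Claude", "C": "ChatGPT"}
votes = ["B", "B", "A", "C", "B", "A", "B", "C", "B", "A"]

print(tally(votes, answer_key))
# {'Claude': 50.0, 'Gemini': 30.0, 'ChatGPT': 20.0}
```

Run it once per task you care about, and keep the answer key hidden from yourself until you have voted.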
If you want to go deeper on what each tool actually offers beyond a single prompt, I wrote complete guides to both: everything inside Gemini’s chat and how to make the most of Claude.
Beyond the blind test
The blind test showed which model won on short, isolated prompts. But that's not how most of us use AI day to day. Real work looks different: longer conversations, specific tools, files, integrations, follow-ups over time.
And right now, each model is building out a very different set of capabilities. Picking one depends on a lot of things.
So let's look at what's different.
ChatGPT vs Claude vs Gemini: capabilities comparison
If you want to dig into the benchmarks, you can check out Chatbot Arena and Artificial Analysis.
But what I’m going to do here is a side-by-side look at what each model can do, what tools and integrations it comes with, and where it fits into your workflow. The stuff you notice as a user.
AI capabilities & use cases
Here are some of the capabilities and use cases we rely on AI for every day. Where I’ve marked one as “best”, that’s my personal pick based on how I use them.
Customization & workflow tools
Beyond what the AI can do, there's the question of what each platform gives you to work with.
For me, Claude wins this one. Skills are more capable than Custom GPTs or Gems. They're not just reusable prompts. You can apply them to a much wider range of tasks and they proactively guide how Claude does a task. Both Claude and Gemini also let you build interactive things, small prototypes and apps you can share. ChatGPT's Canvas is more focused on writing and inline editing, which makes it more limited in what you can build.
Integrations & agentic capabilities
This is where the gap between models is widest right now: integrations, agents, and what each one can do from inside the chat window.
Claude wins this one for me again. Cowork is the most capable agentic system right now for non-developers, plugins do the work for you end to end at your own quality standard, and MCPs let you connect Claude to pretty much anything. Gemini is second because its Google Workspace integration is built in, and I live inside the Google ecosystem.
How to choose the best AI model for you
Maybe you voted in the blind test and it confirmed your pick. Maybe it surprised you. Maybe you didn’t participate at all and you’re just here to figure out which model to use.
Either way, these questions might help you decide:
What do you use AI for most? Writing, research, coding, images, automations, all of it? Start with the tasks you do every day.
How do you use it? Quick one-off questions or longer workflows connected to your tools and files?
Which apps and tools do you already live in? Google Workspace, VS Code, Notion, something else?
What kind of outputs do you need? Text, images, video, voice, data analysis?
How much are you willing to pay? ChatGPT is the cheapest. Gemini is a good deal because the subscription covers AI across the whole Google ecosystem plus NotebookLM Pro. Claude is the most expensive and burns through credits fast, but also the most capable.
Want to switch AI models but worried about losing your setup? Even if your current model has your memory, preferences, and conversation history, you can export all of that and drop it into a new one. There are ways to make the switch without losing much. You're not locked in. I documented the full step-by-step migration guide here.
The bottom line
A year ago, you could swap one model for another and barely notice the difference. That’s not where we are anymore.
All three general-purpose models can help you in your work, but they’ve become good at slightly different things. So the choice matters more now. And it's not even one choice. Depending on what you're doing, the best model might be a different one each time.
Which is what makes this interesting. Did the model you were expecting win? Or did the results surprise you? Let us know in the comments.
And if someone you know has been stuck on which model to pick, send them this. It might save them a few weeks of overthinking.
This post is free. If you found it useful and want access to more of what I’m building - prompts, automations, step-by-step guides - paid subscribers get all of it.

Thanks for this brilliant post, team, and for crunching the data. Also reinforces my own thoughts around both Claude and the fact that maybe one tool can't do everything that you need if you're interested in multiple applications. As you point out in the article, perhaps the most interesting thing though is just how well Claude performs in those categories where it does win. For me, this is further justification for my Claude Max Pro account, or at least that's what I'm telling myself... 😂
Great article—thank you for putting in the work!
Yes, a transition-effectively article would be useful. At this point, ChatGPT has enough context on me—refreshed through interactions—that even lousy prompts yield decent results. That increases the barrier to move, even if another approach might be superior over time (looking at Claude).