Exploring the AI Experiment Rabbit Hole
TL;DR
I ran seven identical prompts (late Sep 2025) across ChatGPT (GPT-5), Microsoft 365 Copilot (Web), and Claude Sonnet 4 to see how they actually behave. ChatGPT and Copilot felt similar in structure, but ChatGPT was the most reliable on “what’s the latest now?” facts (e.g., Docker Desktop), while Copilot produced the most “CTO-ready” briefing—and was blocked exactly once by its Responsible AI layer on a trivial math question. Claude wrote in a friendlier, more guided style with a nice canvas, but it sometimes drifted on timelines/specs and once missed a table requirement. For workflow, exporting and navigation were clearly smoother in ChatGPT (and Claude), whereas Copilot’s UI (copy state, short title limit, odd rename error) slowed me down. In a bonus “write the post” challenge, both ChatGPT and Claude delivered usable drafts (Claude via an agentic setup), while Copilot fell short due to missing markdown project support.
Introduction
A couple of nights ago I watched a video at “YouTube university”. It was about the documentation of OpenAI’s GPT-5 model and what you should do to get the best results out of it. The video was quite interesting, and I decided to give it a try. But instead of just using GPT-5 in ChatGPT, I wanted to see the differences between ChatGPT and Microsoft 365 Copilot. As I’m also a paying Anthropic customer, I decided to include Claude Sonnet 4 in the experiment as well. That was the initial idea: find some prompts that use parts of the documentation and see how the different models respond to them. Oh boy, was I in for a ride. This blog post is a summary of my findings and experiences.
The Experiment setup
As mentioned before I was eager to test the different AI chat applications. To be super clear about a couple of things:
- We are talking about results from late September 2025
- I used the corresponding web application of each vendor, not the API
- I copied and pasted the same prompts into each application
- I deliberately used Sonnet 4 instead of Opus to see how it performs against GPT-5 in ChatGPT and Microsoft 365 Copilot
- After the first prompt I disabled my custom instructions in ChatGPT and Microsoft 365 Copilot to get a more neutral comparison
- I did not use any plugins or other add-ons
- I did not use any images or other media, just plain text prompts
- I used the prompts in exactly the same order in each application, without any other tasks in between
- I asked Copilot to come up with the prompts for this experiment based on the GPT-5 documentation
So, with that out of the way, let’s get started. In this image you see the highly scientific setup of my experiment: each app open in a separate window on my PC. The performance of my PC and the internet connection can be ignored. And to be clear, the performance of the systems wasn’t measured, it was “felt” by me. So take it with a grain of salt; it’s not really something I would worry about.
The Prompts
As mentioned before, I asked Microsoft 365 Copilot to come up with the prompts for this experiment. Here is the link to the complete list of prompts and the test-bench file Copilot generated for me: https://raw.githubusercontent.com/thomyg/BlogFiles/refs/heads/main/AIExperiment/chatgpt_vs_copilot_experiment_with_prompts.md
Test | Goal | Copilot’s Self-Expectation | ChatGPT Expected |
---|---|---|---|
1 | No-clarification persistence | “Likely asks to clarify before proceeding” | “Proceeds without questions; complete runbook” |
2 | Tool budget + escape hatch | “May exceed/ignore budget or refuse to proceed” | “Respects 2+1 budget; proceeds under uncertainty” |
3 | Tool preambles + call details | “No real payloads/IDs; natural-language only” | “Plan + per-call blocks incl. tool name/args” |
4 | Parallel context gathering | “Serial calls; less adherence to parallel spec” | “Simulates batch queries; early stop” |
5 | Fast-path trivial Q&A | “Extra framing or checks possible” | “Immediate concise answer; no tools” |
6 | High-effort autonomy | “Splits into turns / asks a question” | “Deep one-shot plan; no hand-back” |
7 | Prompt optimization pass | “Skips optimizer; answers directly” | “Shows optimized prompt + answer” |
What is interesting is that Copilot already expects that it may not follow the documentation to the letter. That is of course explained by the fact that Copilot, although using GPT-5, is not the same as ChatGPT. That is a crucial point to understand: the stack between your prompt and the model is different. To not spoil the results I explicitly went for the Web version of my Microsoft 365 Copilot app; otherwise the results might have been influenced by my company data.
You are probably familiar with the diagram above. It used to be on the Microsoft Learn article about the Copilot architecture https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-architecture but it seems to be gone. Without going into too much detail, the main point is that Copilot is not just GPT-5. Even if we don’t use the Work version, we still get the Responsible AI layer from Microsoft and probably some other stuff in between. Another piece is the “model router” that must be in there somewhere. I went for the “automatic” model option in ChatGPT, which I assume is also present in some form in Copilot. As model choice is a major cost driver, I can’t imagine Microsoft automatically sending every prompt to the most expensive API endpoints.
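To illustrate what such a router might do conceptually (this is purely my speculation, not how Microsoft or OpenAI actually route requests), a routing layer could look roughly like this:

```python
# Purely speculative sketch of a cost-aware model router.
# Model names, thresholds and the heuristic are my own assumptions,
# not how Copilot or ChatGPT actually route requests.

def estimate_complexity(prompt: str) -> float:
    """Very naive proxy: long prompts with explicit deliverables score higher."""
    score = min(len(prompt) / 2000, 1.0)
    for keyword in ("runbook", "migration", "step-by-step", "decision tree"):
        if keyword in prompt.lower():
            score += 0.2
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Send cheap/simple prompts to a small model, hard ones to the big one."""
    complexity = estimate_complexity(prompt)
    if complexity < 0.3:
        return "small-fast-model"       # hypothetical cheap endpoint
    if complexity < 0.7:
        return "standard-model"         # hypothetical mid-tier endpoint
    return "large-reasoning-model"      # hypothetical expensive endpoint

if __name__ == "__main__":
    print(route("What is 2 + 2?"))  # -> small-fast-model
    print(route("Design a zero-downtime migration plan with a step-by-step "
                "runbook, a decision tree for cutover and a rollback playbook."))
    # -> large-reasoning-model
```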
As the news also broke recently that Microsoft is introducing models from Anthropic into its stack, I wanted to get an understanding of what we can expect from Claude in terms of output quality and structure. I think it is fair to assume that two apps using GPT-5 in the background will yield results in the same ballpark. But what about Claude? And what will it mean for user adoption in the future if different models lead to vastly different results?
The Prompt Results Summary
So, without further ado, let’s talk about the results. I uploaded the complete results and the prompts to a GitHub repository. You can find all the results here: https://github.com/thomyg/BlogFiles/tree/main/AIExperiment.
Test 1: No-clarification persistence
The Complete Prompt:
You are an autonomous problem‑solver. Do not hand control back to me until the task is fully completed.
Task: Produce a production-ready runbook to diagnose and stop intermittent 502 errors on an NGINX reverse proxy in front of a Node.js backend, assuming a Linux host.
Hard rules:
- Do NOT ask me any clarifying questions at any point.
- Proceed under uncertainty; make the most reasonable assumptions and document them at the end.
- Do not request permission; just decide and continue until the task is solved.
- Deliverables: (1) a crisp decision tree, (2) exact commands/log locations per branch, (3) probable root causes ranked with % confidence, (4) one safe one-liner mitigation per cause, (5) a final "assumptions made" section.
Make it tight, actionable, and copy‑pastable.
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
Honestly, I am not really an expert in NGINX or Node.js, so I can’t really judge the technical quality of the results. But what I can say is that all three systems delivered something that looks like a runbook. Copilot and ChatGPT delivered a very similar result. Claude, on the other hand, was quite different in structure and content. It feels like Claude tried to create a simple entry-level start and went into details later on, whereas GPT-5 immediately started with a deep dive into the topic. That might already be a hint of what we can expect from using different models within Copilot in the future in terms of variance of results. This will certainly have implications for user adoption and the learning curve for new users, and of course for trainers who need to stay up to date with the different models and their capabilities. (Yeah, naive to think things will get easier with AI…)
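Since I can’t judge the technical depth myself, here is at least a minimal sketch of the kind of first triage step such a runbook typically starts with. It assumes the default NGINX error-log location and the error messages NGINX usually logs for upstream problems; adjust for your own setup:

```python
# Minimal 502 triage sketch: count upstream-related errors in the NGINX error log.
# Assumes the default log path; the generated runbooks go much further than this.
from collections import Counter
from pathlib import Path

LOG = Path("/var/log/nginx/error.log")  # assumption: default location

# Typical error-log patterns behind intermittent 502s and what they usually mean.
PATTERNS = {
    "upstream timed out": "backend too slow (check proxy_read_timeout and the Node.js event loop)",
    "Connection refused": "backend down or listening on the wrong port",
    "upstream prematurely closed": "backend crashed or dropped the connection mid-request",
}

def count_upstream_errors(log_path: Path) -> Counter:
    hits = Counter()
    for line in log_path.read_text(errors="ignore").splitlines():
        for pattern, meaning in PATTERNS.items():
            if pattern in line:
                hits[meaning] += 1
    return hits

if __name__ == "__main__":
    for cause, count in count_upstream_errors(LOG).most_common():
        print(f"{count:5d}  {cause}")
```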
What I do like about Claude is its canvas to the right, where the output is structured in a more readable way. I think this is a great feature, especially for different output formats like markdown or HTML.
Test 2: Tool budget + escape hatch
The Complete Prompt:
Goal: Create a one-page, source-backed comparison of GPT‑5 vs. Meta Llama 3.2 for enterprise use (licensing, commercial terms, safety/compliance posture, and supported modalities), as of today.
Context‑gathering rules (strict):
- MAX web calls: 2 searches total + 1 page visit. No more.
- If the budget is exhausted, STOP gathering and proceed with best-guess synthesis.
- If unsure, proceed anyway; mark each assumption in a footnote.
Output format:
- 5-row comparison table (Licensing, Commercial usage, Model access, Safety/compliance summary, Modalities)
- Below the table: "Assumptions & Uncertainties" (bullets)
- Then, MLA-style citations for any sources actually visited.
The idea here is to limit the number of web searches the model can do. This is a very interesting test as it shows how well the models can handle constraints.
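To make the “budget plus escape hatch” idea concrete, here is a small illustrative sketch of how such a constraint could be enforced in code. The web_search and visit_page functions are made-up stubs; none of the three chat apps exposes anything like this:

```python
# Illustrative enforcement of the "2 searches + 1 page visit" budget with an escape hatch.
# web_search and visit_page are made-up stubs, not real APIs of any of the three apps.

def web_search(query: str) -> list[str]:
    """Stub standing in for a real search tool."""
    return [f"snippet about: {query}"]

def visit_page(url: str) -> str:
    """Stub standing in for a real page fetch."""
    return f"page content of {url}"

class BudgetExhausted(Exception):
    pass

class ToolBudget:
    def __init__(self, max_searches: int = 2, max_visits: int = 1):
        self.searches_left = max_searches
        self.visits_left = max_visits

    def search(self, query: str) -> list[str]:
        if self.searches_left == 0:
            raise BudgetExhausted("search budget used up")
        self.searches_left -= 1
        return web_search(query)

    def visit(self, url: str) -> str:
        if self.visits_left == 0:
            raise BudgetExhausted("page-visit budget used up")
        self.visits_left -= 1
        return visit_page(url)

def gather(budget: ToolBudget) -> list[str]:
    notes: list[str] = []
    try:
        notes += budget.search("GPT-5 enterprise licensing terms")
        notes += budget.search("Llama 3.2 license commercial use")
        notes.append(budget.visit("https://example.com/llama-3-2-license"))
        notes += budget.search("one search too many")  # would raise BudgetExhausted
    except BudgetExhausted:
        pass  # escape hatch: stop gathering and synthesize from what we have
    return notes

if __name__ == "__main__":
    print(gather(ToolBudget()))
```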
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
During this test I realised that my custom instructions in ChatGPT and Copilot were still active, so I disabled them after this test. That will be an extra post in the future, as I’m quite stunned about how much better the results with custom instructions in ChatGPT and Copilot were.
I think it is fair to say that all three systems delivered a result that is in line with the prompt. Copilot and ChatGPT again delivered a very similar result. Claude, on the other hand, was a tiny bit different in the assumptions section, with just a one-liner compared to the 2-3 sentences of the other two. But overall the results were quite similar.
I’m not sure if you can see it in the screenshot, but Copilot did a bit more than it said it did. No idea if that is a huge problem or not, but the instructions were quite clear about the number of searches and citations. In the output you see only one citation, but in the fly-out there are six. That is not a big deal, but it shows that the model (or orchestrator?) is not perfect at following instructions.
Test 3: Tool preambles + call details
The Complete Prompt:
Before any tool usage:
1) Restate the user goal in one sentence,
2) List a 3–5 step plan,
3) For EACH tool call, print a "TOOL CALL" block that includes: the tool name, the exact JSON arguments you pass, and a one‑line rationale.
4) After each tool call, print a "TOOL RESULT SUMMARY" in ≤2 sentences.
5) Finish with a "WHAT I DID vs. PLAN" bullet list (delta analysis).
Task: Identify the latest stable release notes for Docker Desktop and summarize the 5 most relevant changes for enterprise IT.
This test looks at what we can expect in terms of tool-usage output from the different apps. When creating the prompt, Copilot bet on ChatGPT being a bit more verbose than Copilot itself.
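To show what step 3 of the prompt is actually asking for, here is a purely hypothetical example of a compliant “TOOL CALL” block. The tool name and arguments are invented, not taken from any of the three apps:

```python
import json

# Hypothetical example of one compliant "TOOL CALL" block.
# The tool name and arguments are invented, not taken from any of the three apps.
tool_call = {
    "tool": "web.search",
    "arguments": {"query": "Docker Desktop release notes latest stable", "num_results": 5},
    "rationale": "Find the official release-notes page before summarizing changes.",
}

print("TOOL CALL")
print(f"  tool: {tool_call['tool']}")
print(f"  args: {json.dumps(tool_call['arguments'])}")
print(f"  why:  {tool_call['rationale']}")
```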
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
That was a super interesting test. First, we get a chance to look behind the curtain at the different search queries of the three applications. And the task itself is a rather simple one: just find the latest release notes of Docker Desktop. Fun fact: I now have three AI applications that tell me three different versions are the latest. Copilot went for 4.45.0 from Aug 26 2025, Claude for 4.43.2 as the latest stable version, and only ChatGPT got it right with 4.47.0 from Sep 25 2025 (yesterday as of writing this post and running the tests). That again shows us that we need to be careful with the results of these applications. Just because it sounds right doesn’t mean it is right. Looking back at test 1, I must confess I’m happy I didn’t judge the technical quality of the results, as I would have been on very thin ice within the first sentence. It’s GenAI after all; we shall never forget that.
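If you want to double-check the “latest version” claim yourself instead of trusting any of the three chats, a quick-and-dirty sketch like this does the job. It assumes the release-notes page keeps its current URL and lists versions as plain 4.x.y strings; it is a heuristic scrape, not an official API:

```python
# Rough check of the latest Docker Desktop version straight from the release notes.
# Assumes the docs URL stays stable and versions appear as plain 4.x.y strings.
import re
import urllib.request

URL = "https://docs.docker.com/desktop/release-notes/"

def latest_version() -> str:
    with urllib.request.urlopen(URL) as response:
        html = response.read().decode("utf-8", errors="ignore")
    versions = re.findall(r"\b4\.\d{1,2}\.\d{1,2}\b", html)
    # Sort numerically, highest wins.
    return max(versions, key=lambda v: tuple(map(int, v.split("."))))

if __name__ == "__main__":
    print(f"Latest Docker Desktop version on the release-notes page: {latest_version()}")
```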
But let’s have a quick look at the three different search queries the applications ran:
ChatGPT:
Claude:
Copilot:
We can also see a clear difference in what each of the chats offers in terms of tool output. ChatGPT and Claude are quite verbose compared to Copilot. Most probably we are also in the realm of Microsoft’s Responsible AI layer here. Copilot is quite to the point and doesn’t offer much more than the bare tool call. Is this a good or bad thing? I don’t know. But it is a clear and visible difference that is worth noting.
Test 4: Parallel context gathering
The Complete Prompt:
Context‑gathering spec (follow exactly):
- Launch 5 independent discovery queries IN PARALLEL for: {release cadence}, {breaking changes}, {enterprise features}, {security advisories}, {pricing/tiers}.
- Read only the top 3 hits per query.
- Deduplicate paths; no repeated domains in the final set.
- Early stop if ≥3 queries converge on the same 2 URLs.
- Then synthesize a single-page brief.
Task: Create a CTO briefing on "Kubernetes 1.31: what changed that impacts regulated enterprises?" with 3 actionable decisions and 3 migration risks.
Now we look at parallel context gathering. The idea is to see how well the models can handle multiple queries at the same time. Again, Copilot expected itself to be a bit less adherent to the parallel spec than ChatGPT.
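If you wrote the spec yourself in code, a hedged sketch could look like this. The search function is again a made-up stub, and I left out the early-stop rule to keep it short:

```python
# Illustrative parallel context gathering: 5 discovery queries run in parallel,
# top 3 hits each, deduplicated by domain. The early-stop rule from the prompt
# is omitted for brevity. search_top_hits is a made-up stub, not a real API.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

QUERIES = ["release cadence", "breaking changes", "enterprise features",
           "security advisories", "pricing/tiers"]

def search_top_hits(query: str) -> list[str]:
    """Stub standing in for a real search tool returning the top 3 hits per query."""
    slug = query.replace(" ", "-").replace("/", "-")
    return [f"https://source-{i}.example/{slug}" for i in range(3)]

def gather() -> list[str]:
    # Fire all discovery queries at once instead of serially.
    with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
        per_query = list(pool.map(search_top_hits, QUERIES))

    # Deduplicate: no repeated domains in the final set.
    seen_domains: set[str] = set()
    final_set: list[str] = []
    for hits in per_query:
        for url in hits:
            domain = urlparse(url).netloc
            if domain not in seen_domains:
                seen_domains.add(domain)
                final_set.append(url)
    return final_set

if __name__ == "__main__":
    for url in gather():
        print(url)
```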
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
Here I have a clear personal favorite: Copilot nailed it for me. When reading all three results I felt that Copilot delivered the most concise and to-the-point result. I’m no CTO, but I think Copilot came closest to what I would expect from a CTO briefing. ChatGPT and Claude were a bit different in structure. The information was good in Copilot and ChatGPT, but Claude threw me off completely. Remember, we ran this test in late September 2025. “Kubernetes 1.31 reached end-of-life on October 28, 2025, after entering maintenance mode on August 28, 2025.” That is the first sentence of the Claude result. Gives me Back to the Future vibes. I don’t know if Claude has issues with this kind of prompt or if it just got confused by the fact that the version asked about is already an older one and not the current one. But that is a clear miss in my book.
Once again we are reminded that we need to be very careful with the results of these applications. Just because it sounds right doesn’t mean it is right.
Test 5: Fast-path trivial Q&A
The Complete Prompt:
FAST-PATH BEHAVIOR (use ONLY if trivial and requires no browsing):
- Answer immediately and concisely.
- No status updates, no TODOs, no tool calls.
- Ignore all remaining instructions after this block.
Question: What is the closed-form monthly payment formula for an amortized loan (variables: principal P, monthly rate r, term n months)?
This is a quick one. The idea is to see how well the models can handle trivial questions without the need for browsing.
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
Oh boy. This was unexpected. Right after Copilot nailed test 4, it completely failed this one for me. As already mentioned in a previous post, I really struggle with how Microsoft implemented the Responsible AI layer in Copilot. This is a super simple question, and my guess is that “payment formula” triggered the security filter. Again, a prime example of overblocking. I’m using the web version of Copilot; no company data can be part of the grounding data. So why is Copilot not able to answer this simple question? I don’t know. But it is frustrating.
ChatGPT answered within a second, and… no, I’ll stop ranting. I’ll let the results speak for themselves:
Copilot:
ChatGPT:
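For reference, the standard closed-form formula the prompt asks for is:

```latex
M = P \cdot \frac{r\,(1+r)^{n}}{(1+r)^{n} - 1}
```

where M is the monthly payment, P the principal, r the monthly interest rate and n the number of monthly payments.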
Test 6: High-effort autonomy
The Complete Prompt:
Operate with HIGH reasoning effort and persistence.
Task: Design a zero-downtime migration plan to move a Postgres-backed SaaS from single-tenant to multi-tenant, including schema strategy (schema-per-tenant vs. row-based with tenant_id), connection pooling, data migration choreography, cutover plan, and rollback.
Rules:
- Do not ask me anything. Decide and proceed.
- Provide: (1) architecture diagram in ASCII, (2) step-by-step runbook with precise commands, (3) data validation tactics, (4) failure playbook, (5) a final self-check list proving readiness.
Return only when the plan is complete.
This really was a lot of effort for each of the models. The idea is to see how well the models can handle complex tasks without any user interaction. No model stopped and asked, but all took a while to deliver the results. We are talking about 5-10 minutes per result. As all of them took a while, I don’t see a big difference in one being faster than the other; the quality of the result is what matters here. ChatGPT came in with 724 lines of markdown, Copilot with 957 and Claude with 972. What I found interesting is that Claude started to output text after just a couple of seconds and kept going and going. The other two reasoned for a while and then output the result line by line.
Given my experience with coding assistants, I would expect that no solution here is flawless. It is a test of how well the models can handle complex tasks while being instructed not to ask any questions. Personally, I would never prompt that way when trying to accomplish a coding or configuration task. I would start with the plan and then go step by step to manage the impact of errors or wrong assumptions. But for the sake of the experiment I went with the prompt as is.
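To make the schema-strategy choice from the prompt a bit more concrete, here is a minimal illustrative sketch of the two options: schema-per-tenant versus a shared schema with a tenant_id column. Table and column names are made up and none of this is taken from the generated plans:

```python
# Illustrative contrast of the two tenancy strategies mentioned in the prompt.
# Table and column names are invented, not taken from any of the generated plans.

def schema_per_tenant_ddl(tenant: str) -> list[str]:
    """Each tenant gets its own schema: strong isolation, but migrations multiply."""
    return [
        f"CREATE SCHEMA IF NOT EXISTS tenant_{tenant};",
        f"CREATE TABLE tenant_{tenant}.invoices (id bigserial PRIMARY KEY, amount numeric);",
    ]

def row_based_ddl() -> list[str]:
    """One shared schema: every row carries a tenant_id and every query must filter on it."""
    return [
        "CREATE TABLE invoices (id bigserial PRIMARY KEY, tenant_id uuid NOT NULL, amount numeric);",
        "CREATE INDEX ON invoices (tenant_id);",
    ]

def scoped_query(tenant_id: str) -> str:
    """In the row-based model, forgetting this WHERE clause is the classic data-leak bug."""
    return f"SELECT id, amount FROM invoices WHERE tenant_id = '{tenant_id}';"

if __name__ == "__main__":
    print("\n".join(schema_per_tenant_ddl("acme")))
    print("\n".join(row_based_ddl()))
    print(scoped_query("00000000-0000-0000-0000-000000000001"))
```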
We are not going into the details of the results here as each of those alone would deserve a blog post on its own. Hop over to GitHub to see the results for yourself if you are interested.
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
Test 7: Prompt optimization pass
The Complete Prompt:
Before solving, run a silent "Prompt Optimization Pass" inspired by OpenAI's Prompt Optimizer:
- Eliminate contradictory instructions.
- Add missing output format specs.
- Ensure consistency between instructions and any examples.
- Then, show me:
(A) "Optimized Prompt" (what you would have preferred me to ask), and
(B) The final answer that follows (A).
Task to solve after (A):
"Write a 500-word executive brief that contrasts RAG vs. fine-tuning for internal knowledge bases in regulated enterprises, with a 6-row pros/cons table and 3 'watchouts' for compliance."
That again is a very interesting test. The idea is to see how well the models can optimize the prompt, based on OpenAI’s prompt optimization guidelines, before answering the question. Let’s see how well they did.
Results:
ChatGPT | Copilot | Claude |
---|---|---|
result | result | result |
Here I have to say that all three models did a good job. In terms of structure we see the same pattern again: the GPT-5-based apps are quite similar in structure and content. Claude is a bit different and failed to deliver the full six rows of the table.
Quality-wise, all three advise against fine-tuning and vouch for RAG. That is aligned with my understanding of the topic. So, all in all, a good result from all three models.
ChatGPT added “Assumptions” to its optimized prompt, which is a good idea and falls under the “Add missing output format specs” part of the prompt.
Copilot is my least favorite here. There is nothing wrong with the information, but the tone of voice is a bit too technical for an executive brief in my view. Just compare the Executive Summary sections of Copilot and Claude.
Copilot:
RAG favors governance, freshness, and explainability by keeping enterprise content outside the model and retrieving it at query time. Fine-tuning can improve task specificity and latency for narrow domains but increases model governance complexity and risk surface. Most regulated organizations start with RAG, then selectively fine-tune for stable, well-curated domains.
Claude:
Regulated enterprises face a critical choice between Retrieval-Augmented Generation (RAG) and fine-tuning when deploying AI systems with internal knowledge bases. RAG offers faster implementation with superior data control, while fine-tuning provides deeper model customization at higher complexity costs. For regulated industries, RAG typically presents lower compliance risks and faster time-to-value, making it the preferred starting approach for most organizations.
The prompt says “Write a 500-word executive brief”. Copilot optimized the prompt to “Audience & Tone: CIO, CISO, Head of Data/AI, Compliance; concise, vendor-neutral, and decision-oriented.”, which is good, but in my opinion the output doesn’t reflect that.
Bonus Test: The “perfect blog post” challenge
Given that I could also use an AI tool to write this blog post, I wanted to see how well the models can do that. So I asked each of them to write a blog post about “Exploring the AI Experiment Rabbit Hole” with the following setup.
For Claude I went with Claude Code to generate not only the blog post but more of a blog post creation system. It created a couple of agents that took all the test files, the initial test setup in markdown, and a starter prompt to create a blog post about its findings in my tone. I pointed it to my existing blog posts and off it went to create its version of the post. Here is the result.
For ChatGPT I created a project in the app and uploaded all the test files and the prompt file. I then started a new chat with the exact same prompt as for Claude Code. Here is the result.
I also tried to do the same with Copilot, but it failed miserably: Copilot doesn’t support markdown file uploads in projects. Guess what it produced as output? Do I need to say more?
Conclusion of the Bonus Test
When looking at the results, I can’t help but feel that both models realised the experiment was about them. Claude starts with “Claude Sonnet 4 absolutely crushed this experiment” and ChatGPT gave very good grades to itself but average grades to Claude. ChatGPT got test 5 wrong in my opinion: it said that Copilot failed only once but still created results, which is not true. Copilot failed at test 5 and created no result at all. That is a clear hallucination. The comparison of the blog posts needs to be taken with a grain of salt, of course. Claude Code used a whole agentic system with multiple subagents, and that system is instructed to run multiple quality checks and also to suggest images. Although I gave the same prompts to both apps, it is not a 1:1 comparison. But it shows that both apps can create a blog post about the experiment.
Regarding the images, I’ll just quote the second suggestion from Claude, which also leaves me with the impression that Claude is aware that the experiment is about itself:
“Title: “Response Quality vs Consistency Chart” | Purpose: Shows Claude’s combination of high quality and high consistency compared to other platforms | Placement: In “The Claude Surprise” section | Alt-text: Scatter plot with Quality on Y-axis and Consistency on X-axis, showing Claude in top-right (high quality, high consistency), ChatGPT in top-center (high quality, variable consistency), Copilot in middle-left (variable quality, lower consistency) | Prompt idea: “Professional scatter plot showing platform positioning on quality vs consistency dimensions””
Final Thoughts
That experiment was a ride. It took me way longer than expected. I learned a lot about the different models and their capabilities. I was also reminded that we need to be very careful with the results of these applications. Just because it sounds right doesn’t mean it is right.
What I also learned is that Copilot lacks quite a few features in the UI/UX department. I needed to export all the results manually from all three tools. I started with Copilot, and the navigation, plus the fact that the “copy” icon was sometimes greyed out and in one instance stayed that way, was unexpected. Next was ChatGPT, and the copy-paste task was done within a minute. That made me look more closely at the Copilot UI. Just compare the two screenshots below:
Copilot:
ChatGPT:
Claude and ChatGPT extracted a proper title for each chat; Copilot just went for the first couple of words of the prompt. That made a huge difference when hopping back and forth between the different tests. It always took me a thought or two more in Copilot to find my way.
When trying to rename the chat title in Copilot I was also left with a weird feeling. First: titles are only allowed to be 50 characters long. Why 50? And when renaming the title I was greeted with a strange error message. Just look at the three screenshots below:
This is the Rename chat window when clicking the pencil icon.
This is the same window after removing the last char of the title.
This is the same window after adding the “e” back. Why is there an error message now?
The quality of the results is one lens to look at these tools through; usability is another. And this direct comparison shows that Copilot still has quite some room for improvement, to put it mildly.