GPT-5 is Finally Here. How Much Does it Matter?

Of course the big news this week is that OpenAI released the much-anticipated GPT-5. Did it live up to expectations? Depends on which camp you were in. For many it was a major disappointment; I see it as a good but incremental upgrade (more on that later). Here’s what matters:

  • Better scores (of course). It scores #1 on most benchmarks (though not all; Grok Heavy scored higher on a few). But the gains are evolutionary, not revolutionary.
  • Less sycophancy. They don’t want a repeat of the overly agreeable GPT‑4o update they had to roll back a few months ago.
  • Easier to use. It chooses the model for you. You no longer have to decide which model is best (Should I use GPT‑4o? o3? o4-mini? GPT‑4.1? GPT‑4.5? Augh!).
  • Better rejection. When asked to do something impossible (for example, give information about an image that wasn’t provided), o3 made up confident answers 87% of the time while GPT-5 cut that down to 9%.
  • Selectable personalities (Cynic, Robot, Listener, and Nerd) so you can choose the way you’d like it to talk to you.

But in my opinion, the most important development is fewer hallucinations. OpenAI’s tests show that GPT-5 hallucinates 2x to 4x less than previous models. This matters because hallucination rate is the main driver of reliability – we can’t trust an LLM that frequently makes mistakes.

Here’s an internal exchange that happened recently at my company:

Employee 1: FYI, [system name] is unavailable at the moment, I used a command that removed the configuration. I’m putting things back.

Employee 2: what happened?

Employee 1: I trusted the instructions from chatgpt

So yeah, hallucinations are a problem. Look at the middle graph from the GPT-5 System Card below: o3 and 4o respond with at least one major error over 20% of the time (yikes!); GPT-5 is about half that, and GPT-5 thinking is about one-quarter of that.

Yes, it needs to be lower still, but this is good progress on a very important metric, one that the LLM companies haven’t talked much about.

Note that GPT-5 is not a single model but a collection of models. There are six models behind the single GPT-5 user interface, and they replace all of the previous models available from OpenAI as follows:

Previous model → GPT‑5 model

  • GPT‑4o → gpt-5-main
  • GPT‑4o-mini → gpt-5-main-mini
  • OpenAI o3 → gpt-5-thinking
  • OpenAI o4-mini → gpt-5-thinking-mini
  • GPT‑4.1-nano → gpt-5-thinking-nano
  • OpenAI o3 Pro → gpt-5-thinking-pro

Now that the much-anticipated GPT-5 is finally here, it’s interesting to take a look at just how fast this market is moving. GPT-3.5 came out in November 2022. OpenAI has discontinued 17 ChatGPT models in less than three years!

State of the Market

The market for LLMs is moving super fast and data is sparse, so it’s hard to know what’s happening. Here is a great snapshot of the state of the market from Menlo Ventures. Most notable to me was that Anthropic has taken the lead from OpenAI in enterprise API use.

What’s an API you ask? API stands for “application programming interface” and it means that instead of a human accessing an application, a computer does. So these are situations where a company integrates an LLM into a computer-based process (like RAG, an agent, etc.). Here is a trend graph of LLM API market share over the past 2 ½ years:
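To make the idea concrete, here is a minimal sketch of what “a computer accessing an LLM” looks like in practice. It only builds the JSON request body a program would POST to a chat-style LLM API (the field names follow OpenAI’s public chat-completions format); actually sending it would additionally require an endpoint URL and an API key.

```python
import json

def build_chat_request(model: str, user_message: str) -> dict:
    """Construct the JSON payload for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# A program (not a person) assembles this and sends it over HTTP;
# the LLM's answer comes back as JSON for the program to use.
payload = build_chat_request("gpt-5", "Summarize this support ticket: ...")
print(json.dumps(payload, indent=2))
```

This is what sits underneath RAG pipelines and agents: software exchanging structured requests and responses with the model, with no chat window involved.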

You can also see from the pie chart that Anthropic’s Claude dominates coding (we’ll see if that holds now that GPT-5 is out!).

Other key observations (if you’re even a little interested, look at the entire report, it’s short):

  • Fast market growth. No surprise, it’s expanding quickly: “Enterprise spend in the first six months of 2025 has already more than doubled all of 2024.”
  • Performance wins: Companies are favoring closed models rather than open, despite the higher price. Menlo says that’s primarily due to quality, but I think it’s also because the cost of best-in-class closed models has decreased over time.
  • Performance > Cost: companies are willing to upgrade to the latest frontier models to get better performance, rather than a lesser model at reduced cost.
  • Training custom models is disappearing: companies – both enterprises as well as startups – are spending significantly less on training or fine-tuning their own models. This indicates that the base models are getting good enough that the effort/cost of training a custom model is rarely worth it.

But GPT-5 Wasn’t the Only Model Released

Shortly before releasing GPT-5, OpenAI released two open-weights reasoning models. Despite the company’s name and original charter (to build AI that is open and accessible to all), they had not released an open-weights model in six years. That’s right, six years…ChatGPT has only been around for less than three! Why now? To help ensure that:

“the world is building on an open AI stack created in the United States, based on democratic values”
– Sam Altman

Also shortly before the GPT-5 release, Anthropic released Opus 4.1 which replaces Opus 4. This is an incremental upgrade to their largest (and most expensive) reasoning model.

Google is making Gemini 2.5 Deep Think, a model that they first announced in May, available to subscribers. This is the one that took a gold medal at this year’s International Math Olympiad (OpenAI also claimed gold but they didn’t participate in the actual competition). But if you want it, you have to pay; for now, it’s only available via their Ultra subscription which is $250/month(!).

Or, you can be a student. Google is offering this model (and all their others) for free to college students (18 and older) for 12 months via their Google AI Pro plan for students. Great idea – target the youth, and they’re likely to want to stay in the Gemini ecosystem. Especially with everything it offers:

  • Gemini 2.5 Pro – their best model
  • Deep Research – conduct research and create reports
  • NotebookLM – summaries, study guides, and podcasts on resources
  • Veo 3 – video generator (8-second clips)
  • Jules – coding assistant
  • 2 TB storage

My Take: Why It Matters, Particularly for Generative AI in the Workplace


GPT-5 Was Supposed to Be All That. Is It?

To many, the release of GPT-5 was much anticipated, over-hyped, and overdue. For them, the release was a major disappointment. It was not a major leap forward but incremental progress. But I think that’s just what everyone should have expected:

  • The underlying technology hasn’t changed; it’s still pattern matching
  • Gains from scaling up model size and pre-training have been diminishing
  • Sam Altman told us GPT-5 would not be some super model but a collection of models

Anyone expecting a big jump was disappointed. Anybody expecting AGI or superintelligence was very disappointed. But that’s not the story here. GPT-5 is more user-friendly and slightly more trustworthy, and I think that represents a turning point in the market.

Towards a Real Consumer Product

The turning point is a move towards creating a real product for people to use, one that is less experimental and more viable:

  • You no longer have to choose the right model
  • It hallucinates less
  • You can choose a personality, according to how you would like to interact

This is an attempt at the first LLM for everyone.

Better scores on benchmarks mean nothing to the average user. The benchmarks at this point are measuring very difficult tasks or very in-depth knowledge, of interest to experts and specialists but not something that most people care about.

Hallucinations Matter

But hallucinations are a huge obstacle to mass adoption. Even if LLMs can do amazing things at the high end (i.e., score well on benchmarks), how can we trust them when they can’t do simple tasks, when they regularly make errors?

GPT-5 doesn’t fix that. But it is progress in the right direction. And even the fact that OpenAI is talking about it is important, as other model releases haven’t said much on this topic. So I’m glad to see there is progress here.

But GPT-5’s best is still wrong about 1% of the time. That’s a high error rate for workplace use, especially as we move to AI agents that execute multiple steps. A 1% error rate might not seem like a lot, but an agent that takes 50 steps, each carrying that risk, fails about 40% of the time. An AI agent that’s wrong 40% of the time is practically useless.
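The arithmetic behind that claim is simple compounding, under the simplifying assumption that each step fails independently with the same probability:

```python
def agent_failure_rate(per_step_error: float, steps: int) -> float:
    """P(at least one error in the run) = 1 - (1 - p) ** steps."""
    return 1 - (1 - per_step_error) ** steps

# 1% per-step error over a 50-step agent run:
print(f"{agent_failure_rate(0.01, 50):.1%}")  # 39.5%
```

Real agent steps aren’t truly independent, and some errors are recoverable, but the sketch shows why even a small per-step error rate is fatal for long chains of actions.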

So, it’s great to see progress on the Achilles’ heel of LLMs. But if the best we can do right now (per GPT-5’s System Card) is that 1 out of every 20 responses has a “major factual error,” we still have a ways to go, either with models or with other techniques, to realize the promise of AI agents that are reliable enough to lessen our workloads.

Copyright (c) 2025 | All Rights Reserved.

