Wow, that was an unanticipated hiatus! I haven’t posted in 5 weeks! It started when I took a vacation and I came back overloaded…and suddenly 5 weeks had passed. My apologies for the silence; thanks for your patience, and thanks to those of you who reached out saying you missed my posts! I’m going to catch up with a handful of subject-oriented posts, and I fully intend to get back to a weekly cadence. Or at least weekly-ish. Wish me luck.
The Five Models
Efficiency
Following on the heels of Claude Sonnet 4.5, Anthropic released Claude Haiku 4.5 in mid-October. This is their smallest model, and it provides performance comparable to Claude Sonnet 4 (which was state of the art only five months ago!) at one-third the cost and twice the speed. Artificial Analysis rated it the 11th best model at the time, but now it’s 14th because…
Open Weights
In late October, MiniMax released their M2 model. It was ranked 5th overall by Artificial Analysis, moving past Qwen3 to claim the title of the best open-source LLM…
Open Weights Thinking
For two weeks, until Moonshot AI, another Chinese model company, released Kimi K2 Thinking. It even beats GPT-5 and Claude Sonnet 4.5 on several of the major benchmarks (such as Humanity’s Last Exam) and was rated the second best model overall by Artificial Analysis. That’s impressive by any measure, but particularly for an open weights model.
The model was (reportedly) trained for only $4.6 million, which is almost nothing in this space. Estimates for the commercial models vary, but it’s possible that OpenAI’s GPT-5 cost on the order of $1 billion to train. If so, K2 achieves almost the same performance at less than 1/200th of the cost ($1 billion ÷ $4.6 million ≈ 217)!
As we’ve seen before, there is no moat. While these models continue to improve, there isn’t a sustainable competitive advantage based on the model alone. Which makes it hard to imagine OpenAI is worth $1 trillion. Even when…
Friendlier
One week later, OpenAI released GPT-5.1. GPT-5 already held the #1 spot on the Artificial Analysis leaderboard with a score of 68, and GPT-5.1 got a 70, giving OpenAI the two best models. While the new model does score better than GPT-5, the main selling point according to OpenAI is that it’s “warmer and more conversational.”
Ability
Then…the big one. Google announced Gemini 3 and, with it, claimed that they are “ushering in a new era of intelligence.” That claim seems a bit overstated – there’s no new intelligence here, and we know that these models aren’t going to get us to AGI – but it is solidly the best model available. Artificial Analysis gives it a score of 73, three points ahead of GPT-5.1.
But it is definitely a very impressive model. If you haven’t looked at model capabilities recently, it’s worth reading Google’s more detailed announcement on their blog.

Open vs. Closed
According to Epoch AI’s Epoch Capabilities Index, open-weights models lag closed-source models in capabilities by about three months. The chart doesn’t include these latest models, but with an open-weights model in 4th place overall, the trend seems to be continuing.


My take on why it matters, particularly for generative AI in the workplace
Looking at these models, we can conclude (or confirm) a number of things. Here’s my take:
- Performance improvements aren’t as big as they used to be; compared to a year ago, we are seeing diminishing returns.
- But only slightly. There is clearly a lot more that can be done to make these models better. Three of the top four models were released in the past three weeks!
- So far, the models provide no moat. Open-weights models deliver comparable performance, at far lower cost, only a few months later. Companies producing closed models are going to have to differentiate on something other than their latest and greatest model.
- These models alone will not get us to AGI (or superintelligence, or whatever you want to call it).
- But they don’t need to in order to be valuable. All of the best models are very capable – in fact, they’re already good enough for many applications.
- But they still suffer from hallucinations, so even though they’re capable of amazing things, they sometimes stumble on simple things, and it’s often very difficult to tell that they’re wrong.
Here’s a personal example. Recently I was chatting with a model about 401(k) plans, and I asked about the contribution limits for 2026. It answered confidently – but with the 2025 contribution limit. The model alone isn’t to blame here; the 2026 limits were presumably announced after its training data was collected, so it can only get the correct amount by using search (RAG). But that’s exactly how we’re using these models most of the time, and if it can’t distinguish between 2025 and 2026, how are we going to use it for high-stakes business decisions?
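Before answering that question, let me make the failure mode concrete. Here’s a minimal, deliberately naïve RAG sketch in Python; the toy corpus and scoring are invented for illustration, not taken from any real system. The point is that a bare similarity search returns the best match it has, even when that match is a year stale, and nothing in the pipeline flags it.

```python
# A deliberately naive RAG pipeline (toy corpus, toy scoring).
# Nothing here checks whether the retrieved document is current.

CORPUS = [
    # Imagine the index was last refreshed before the 2026 limits
    # were announced, so only 2025 figures are available.
    "For 2025, the 401(k) employee contribution limit is $23,500.",
    "Employers may also make matching contributions to a 401(k) plan.",
]

def retrieve(query: str, corpus: list[str]) -> str:
    """Return the document with the highest word overlap with the query."""
    q_words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(q_words & set(doc.lower().split())))

query = "What is the 401(k) contribution limit for 2026?"
context = retrieve(query, CORPUS)

# This is the prompt we'd hand to the model. It will answer
# "confidently" from a 2025 document, because nothing checks the year.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```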
We aren’t. Even as these models improve, naïve RAG is not sufficient. We need advanced RAG, plenty of scaffolding, guardrails, and self-verification processes to reach accuracy levels good enough for high-stakes applications; a toy example of such a guardrail follows below. In other words, AI agents…that’s what everyone is working towards, and it will be the topic of my next post.
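What does that scaffolding look like? As one hypothetical (and deliberately simple) example, here’s a guardrail layered on top of the naïve pipeline above: before trusting an answer, verify that the retrieved evidence actually mentions the year the question asked about, and abstain otherwise. Real agent scaffolding goes much further (query rewriting, multiple retrieval rounds, model-based self-critique), but the shape is the same: generate, verify, and refuse rather than guess.

```python
import re

# A hypothetical guardrail on top of the naive pipeline above:
# verify that the retrieved evidence mentions the year the question
# asks about, and abstain if it doesn't. Refusing beats confidently
# quoting last year's number.

def years_in(text: str) -> set[str]:
    """Extract four-digit years (2000-2099) from text."""
    return set(re.findall(r"\b(20\d{2})\b", text))

def answer_with_guardrail(query: str, context: str) -> str:
    asked = years_in(query)
    found = years_in(context)
    if asked and not (asked & found):
        return (f"Can't verify: the question asks about {sorted(asked)}, "
                f"but the retrieved sources only mention {sorted(found)}.")
    # Only a verified context would be handed to the model here.
    return f"Answering from verified context: {context}"

query = "What is the 401(k) contribution limit for 2026?"
stale_context = "For 2025, the 401(k) employee contribution limit is $23,500."
print(answer_with_guardrail(query, stale_context))
# -> abstains instead of confidently returning the 2025 limit
```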


