A Model Explosion: 4 in 1 Week!

5 models in 5 weeks. Then Anthropic released a new model, Claude 4.5 Opus, and I thought my post would be about it. But a rush of new models in the past few days can’t be ignored. What does it mean? Let’s dive in.

Model #6: Anthropic

On November 24, Anthropic announced Claude 4.5 Opus, completing their 4.5 family (alongside Sonnet and Haiku). It’s the largest model in the family, and it’s impressive. It:

  • achieved #2 on Artificial Analysis’ composite intelligence benchmark
  • is the best model for coding (according to Anthropic, but Gemini 3 Pro is very strong so this is debatable)
  • is the best model for agentic tasks (according to benchmarks)
  • exhibits less “concerning behavior” (half as much as other models; see graph) and is less susceptible to prompt injection attacks than any other model

Model #7: DeepSeek

On December 1, DeepSeek released DeepSeek v3.2. This Chinese company shocked the world earlier this year with a model that was highly capable yet trained for far less money than the U.S. commercial models. They did it again, using numerous techniques to improve performance without requiring massive compute. DeepSeek v3.2 even beats Gemini 3 and GPT-5.1 on certain benchmarks, and is ranked 6th overall per Artificial Analysis.

Model #8+: Mistral

One day later, Mistral announced the Mistral 3 family of models, a significant upgrade over their previous models. Mistral is a French open-weights LLM company and the only notable non-U.S., non-China company in the LLM race. Also notable: these are not reasoning models, so while they perform very well for non-reasoning models (second only to OpenAI’s GPT-5.1), they’re not very high on the overall leaderboard (their best sits at #22).

Model #9+: Amazon

That same day, Amazon (didn’t know they were in the model business? Yes, they are!) announced Nova 2.0, a family of five new models. Amazon has had LLMs, but they were aimed primarily at Amazon-centric customers (via its Bedrock platform). With Nova 2.0, Amazon now has models on par with the leaders; Nova 2.0 Pro and Nova 2.0 Lite both score in the top 15 according to Artificial Analysis:

(By the way, two other minor model upgrades are reflected in this graph: GPT-5.1 Codex replaced GPT-5 Codex, and Grok 4.1 Fast replaced Grok 4 Fast.)


My take on why it matters, particularly for generative AI in the workplace


So where does this leave us? Benchmarks are far from perfect (the usual caveats apply), but they do tell us something. Based on Artificial Analysis’ measure:

  • Google still has the lead (for now)
  • All the major commercial players have top models: Google at #1, Anthropic at #2, OpenAI at #3. Only xAI is further behind, at #7 (expect a new version of Grok soon).
  • Open-weights models are only slightly behind the best commercial models (4th and 6th overall).
  • Amazon is a new entry to the leaderboard (10th and 13th).
  • The U.S. still leads in commercial models; China still leads in open-weights models.
  • In terms of raw performance, non-reasoning models are essentially history. The best non-reasoning model is ranked 21st, and scores almost 10 points lower than the model in 20th place.

What can we conclude?

  • There is no moat. Although it does take significant resources to train these models, any performance lead evaporates in months if not weeks. And DeepSeek continues to achieve remarkable performance with less compute (although this time it’s not clear how much less).
  • On balance, open-weights models are almost as good as commercial models.
  • For the near future, we can expect models to continue to get better, faster, and cheaper. Even though the returns from scaling pre-training are diminishing, reasoning (scaling inference) and better post-training techniques are producing more capable models.
  • NVIDIA is not unassailable. While most models are trained with NVIDIA chips, Google and Amazon trained their models with their own custom chips (TPUs and Trainium, respectively).

We’re entering the holiday stretch, so we may not see many new models before the end of the year (except from xAI) – unless the labs have saved something to close out the year strong.

A note on Artificial Analysis’ Benchmark

Different benchmarks measure different things, and they’re far from perfect; but they’re all we really have to evaluate and compare models. Artificial Analysis uses a composite measure that combines results from 10 different benchmarks of “intelligence.” So it’s as good a metric as any. That said:

  • The numeric score has no absolute meaning; it’s useful only for comparing models
  • It doesn’t account for cost, time to run, or number of tokens
  • It doesn’t measure refusal rates, bias, or safety (or toxicity or sycophancy…)
  • Models have various strengths and weaknesses for different uses

So, don’t use a benchmark as your sole criterion. When choosing a model for an important application, consider how it performs for that particular need.
