Moby Dick vs. AI: Searching for the (correct) Whale

It’s just another quiet week in AI! New models this past week alone? Five. Sheesh.

  1. Google introduced Gemini 2.5, their next round of reasoning models (models that “think” before answering to get better results, rather than relying on more pre-training and a larger model). They released one model (Pro) now, but only as “experimental”; other (smaller and faster) models will follow.
  2. DeepSeek updated their (non-reasoning) model (to V3).
  3. Qwen released 2.5-Omni, a multimodal version of their 2.5 model.
  4. A startup, Reve, launched a new image model.
  5. OpenAI’s GPT-4o now has improved image generation; it replaces DALL-E as the image model in ChatGPT (although DALL-E is still accessible).

Can you think of anything that becomes obsolete so fast? Even TVs last a few months! A few other tidbits:

  • Some are saying we’re in a data center bubble and we won’t need everything that’s being built. Maybe. Microsoft alone completes a new data center roughly once every three days!
  • Tangential, but super interesting: Figure AI used reinforcement learning and computer simulation to create a model for humanoid walking…that generalizes. That’s big; if we can train models in simulation, transfer them to the real world, and have them generalize…the pace of progress is going to speed up even more.

But the most exciting item this week is OpenAI’s new image generation capabilities for GPT-4o, and the market reception seems to indicate that it’s now the best image generator out there. (Guess I’ll use it for creating my blog post images from now on.) We don’t know the details of how it was done, but it appears to use techniques besides pre-training (scaling to a bigger model). That is an indicator that there is still a lot of room for these models – including multimodal ones – to get better without scaling.

And it is good, but not as good as it may seem. Many of the examples OpenAI showcased are the “best of 8” attempts – in other words, the best 12.5%. How much worse were the bottom 4 results? How bad was the worst result? Does your use case allow you to try a bunch of options and pick the right one, or does it need to be consistently good? And of course there are still hallucinations. One of the examples they showcased on their website was a visual guide to whales (this was the best of 3 attempts). Looks great, right?

Look closer. In fact, let’s compare GPT-4o’s renderings with a more reliable source by placing these whales next to representative images from Animal Spot. When we do that, we see that ChatGPT (for which GPT-4o is now the default) did a good job on three: the humpback, orca, and beluga (if you don’t mind a chubby beluga). The concept of the narwhal is ok – but a real narwhal’s “tusk” is shaped like a drill bit, and the one from ChatGPT is clearly a unicorn horn (no surprise: there are more unicorns on the internet…and therefore in its training data…than there are narwhals). The blue whale isn’t bad, but it looks like it may have had a humpback mother and a blue whale father. The sperm, gray, and bowhead whales? Not even close!

So what at first appears to be an impressive knowledge of whale anatomy turns out to be a very poor reflection of reality. What would happen at work if you were right less than half the time?

But maybe we should cut ChatGPT a break. After all, it’s just doing what it’s supposed to do – match patterns. I did an internet image search for “blue whale” and, to my surprise, nine of the first 23 pictures were labeled as blue whales but were actually humpback whales!

If this is the training data used, then GPT-4o is actually doing a good job! Apparently we all love humpback whales, and we’re not very good at identifying or classifying whales, so the default whale is the humpback. No wonder all of the whales that GPT-4o got wrong look rather humpbacky! Poor ChatGPT didn’t stand a chance!


My take on why it matters, particularly for generative AI in the workplace


Image generators and LLMs use different technologies right now (LLMs are based on transformers, while most image generators use diffusion models), but they exhibit the same characteristics because they do the same kind of prediction.

So, you’ve heard me say it before, but the problems with image generation here are yet more evidence that all generative AI struggles when there is a right answer! For the enterprise, this is another reminder that input data is critical, because garbage in, garbage out still applies. You have to feed the model the right information:

  • If the facts come from the training data, the training data must be clean and accurate (which it wasn’t for these whales!) and it must be consistent, since the LLM will favor more frequent (consistent) examples.
  • If the facts come from grounding the model by giving it the facts in real time (Retrieval-Augmented Generation, or RAG), the information you feed into the LLM must be clean and accurate. That means reliable generative AI depends on highly accurate RAG!
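
To make the grounding point concrete, here is a minimal sketch of the RAG pattern, written in Python. Everything in it is hypothetical: the Snippet type, the keyword-overlap retriever, and the prompt wording all stand in for a real vector search and a real LLM call. The point is simply that whatever facts land in the prompt, including a mislabeled “blue whale,” are what the model will treat as truth.

    # Minimal, hypothetical sketch of the grounding (RAG) pattern described above.
    # The retriever and prompt are stand-ins; the model only "knows" whatever
    # facts we put into the prompt, good or bad.

    from dataclasses import dataclass

    @dataclass
    class Snippet:
        source: str   # where the fact came from (e.g., a vetted field guide)
        text: str     # the fact itself

    def retrieve_facts(question: str, index: list[Snippet], top_k: int = 2) -> list[Snippet]:
        # Hypothetical retriever: naive keyword overlap stands in for a real vector search.
        words = set(question.lower().split())
        scored = sorted(
            index,
            key=lambda s: len(words & set(s.text.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

    def build_prompt(question: str, snippets: list[Snippet]) -> str:
        # Ground the model: ask it to answer only from the retrieved facts.
        context = "\n".join(f"- ({s.source}) {s.text}" for s in snippets)
        return (
            "Answer using ONLY the facts below. If they don't cover the question, say so.\n"
            f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

    if __name__ == "__main__":
        index = [
            Snippet("field guide", "A narwhal tusk is a spiral tooth, grooved like a drill bit."),
            Snippet("field guide", "Blue whales are long and streamlined with a tiny dorsal fin."),
            Snippet("photo site", "Blue whale breaching."),  # actually a mislabeled humpback!
        ]
        question = "What does a narwhal tusk look like?"
        print(build_prompt(question, retrieve_facts(question, index)))
        # Whatever lands in this prompt is what the (hypothetical) LLM will treat as truth.

In a real system the retriever would be a search index or vector database, but the lesson is the same: if the mislabeled “photo site” snippet were the one retrieved, the model would confidently describe a humpback.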

As models get better, they will hallucinate less, but hallucinations won’t ever go away completely (unless we find a different technology). Will they become so infrequent that it won’t matter? I don’t think so. As this example shows, the evidence points to the need for highly reliable inputs in order to have confidence in the output.


