Generative AI: Surprisingly Capable. Fundamentally Flawed.

Ethan Mollick uses the term “jagged frontier” to describe how generative AI is really good at some things and really bad at others, and how it’s often hard to predict where it will excel and where it will fail. That’s partly because when it gets something wrong, it often does so with such confidence and persuasion that the answer seems right. It’s made worse by the fact that fact-checking the output is often difficult (or at least time-consuming). And worse still, the models aren’t deterministic: the same question can yield a different answer every time, so you can’t tell whether the model got it right because it usually gets it right, or because it just got lucky.

Along the same lines, The Washington Post picked five chatbots and tested them for reading comprehension across four categories: literature, law, health science, and politics. While we can’t call this analysis scientific, the results were quite interesting:

  • Claude (Anthropic) and ChatGPT (OpenAI) were by far the best, significantly outperforming the other three models. (Which is surprising, since Copilot uses OpenAI models; it appears that Copilot is using an older, smaller, or otherwise different model than ChatGPT itself.)
  • All of the models struggled. 70% isn’t a great score…but then again we don’t know what the “average human” would have scored, so maybe it’s not so bad.
  • But in the midst of those errors, hallucinations, and distractions, there were moments of brilliance:

“[On] our most complex request (suggesting changes to our test rental agreement), Claude’s answer was complete, picked up on nuance and laid things out exactly like [the tester] would.”
– Geoffrey A. Fowler

  • They did best with the science category and worst with literature:

The quality of answers to more analytical questions by both ChatGPT and Claude left [him] gobsmacked. Prompted to describe how the book’s epilogue “made you feel,” both bots appeared to have “all the feels.” … Repeatedly…[he] was flabbergasted.
– Geoffrey A. Fowler

  • A large variation within models:

“I was very surprised at how different the responses were for the different prompts.”
– Eric Topol, evaluator

  • And somewhat surprisingly, a large variation across models

Here was the conclusion from the tester for the literature category…and remember this quote is coming from the category where the models performed the WORST:

“Okay, I’m done. [The] whole human race is. Stick a fork in us.”
– Chris Bohjalian, evaluator and bestselling author of 25 books

As impressive as these results are, these models are quite limited in other areas…the jagged frontier.

The results are real. The method is not.

Over the weekend, researchers from Apple released a paper arguing that LLMs are not intelligent in the sense that we normally think of intelligence (confirming what most of us already knew). They draw a distinction between LLMs and LLMs that “reason,” so-called LRMs or Large Reasoning Models. LRMs use test-time compute when asked a question – they take time to “think” and “reason” through possible answers in order to arrive at a better response. The paper’s findings demonstrate that these models are not (by themselves) the path to AGI. It’s still surprising to me that many of the LLM companies are saying that AGI is within reach (though of course they have a vested interest in believing it!).

They demonstrate that:

  • LLMs and LRMs struggle at complex tasks
  • Adding “reasoning” to create an LRM can improve answers to some questions
  • But it can also make answers worse, because the LRM pursues wrong answers (sometimes even after already finding the right answer – and sometimes even after being given the right answer!)
  • LRMs completely fail above a certain level of complexity

So it appears that, at least with current technology and contrary to many of the claims of the AI companies, “reasoning” may not do as much to improve the abilities of these models as hoped.

“We show that state-of-the-art LRMs still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments.”
– The Illusion of Thinking

This doesn’t mean that these models aren’t useful. They don’t have to “think” or be “intelligent” to be useful any more than your car has to be intelligent to help you get from point A to point B. But it’s a sobering reminder that we’re still living in a high-hype environment and we have no reason to believe that AGI will be here any time soon.

More on AI Impacting Jobs

This week, the NY Times adds to the idea that AI is increasing unemployment, particularly among new graduates. AI has gotten good enough (when deployed correctly) to do some of the work typically done by college graduates and others in entry-level roles, and the concern is that companies are responding by holding off on hiring and using AI instead. The evidence is mostly anecdotal, but the article does cite this trendline from the Federal Reserve Bank – note how the light blue line (recent college graduates) appears to be trending upward faster than the broader groups:

It’s not yet clear that this is due to AI, and it could cycle back down in a few months; however, if it continues upward, we may see (with hindsight) that we are at the beginning of the first categorical shift from human workers to AI.

LLMs for Business…

OpenAI announced (through a livestream; here’s a summary from TechCrunch) some new features to bring its LLMs to businesses and compete with the likes of Microsoft. As I’ve said before, the foundation of the value of LLMs in business is…knowledge about your business. That comes from using RAG on internal content, which means the model needs access to that content so it can be searched – and OpenAI now has connectors to some of the common repositories: SharePoint and OneDrive (hello, Microsoft) as well as Google Drive, Box, and Dropbox.
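
To make that concrete, here is a minimal sketch of the retrieve-then-generate flow behind RAG. It is illustrative only: the sample documents, the keyword-overlap scoring, and the prompt format are hypothetical stand-ins (a real system would use embedding search over content pulled in through those connectors), and none of this is OpenAI’s actual connector API.

```python
# Minimal RAG sketch: retrieve relevant internal content, then ground the
# model's answer in it. The retrieval here is a toy keyword-overlap ranking;
# production systems use embeddings and a vector index.

from collections import Counter

# Hypothetical snippets of internal business content (stand-ins for documents
# that connectors would pull from SharePoint, Google Drive, Box, etc.).
documents = [
    "Q3 travel policy: employees must book flights through the approved portal.",
    "The 2024 benefits guide covers dental, vision, and the new parental leave plan.",
    "Expense reports over $500 require director approval before reimbursement.",
]

def score(query: str, doc: str) -> int:
    """Toy relevance score: number of words the query and document share."""
    q_words = Counter(query.lower().split())
    d_words = Counter(doc.lower().split())
    return sum((q_words & d_words).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents that best match the query."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Paste the retrieved snippets into the prompt to ground the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

if __name__ == "__main__":
    # In a real system this prompt would be sent to the LLM; here we just
    # print it to show what the model would see.
    print(build_prompt("Who needs to approve an expense report over $500?"))
```

The design point is the grounding instruction: the model is told to answer only from the retrieved snippets, which is what ties its answers to your business content rather than to whatever patterns it picked up in training.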

…and for Government

Anthropic announced that they have created special models for government use – models that have looser guardrails to better handle the specific challenges of working with classified information and in the intelligence space (basically, the model is more willing to read and discuss stuff that you wouldn’t want it discussing with anyone in the world).

The Internet, Post-Search

While not directly related to using generative AI in the enterprise, I’d be remiss not to mention the burgeoning industry of GEO. What is that, you ask? SEO – Search Engine Optimization – was what everyone did to make sure that search engines would bring you to their website. It is (ahem, it was) a massive business. But now that generative AI is providing search summaries (Google, Bing), answer engines (Perplexity, You.com), and LLMs informed by search (ChatGPT, Claude)…it’s no longer about links to websites. It’s about getting your content noticed by the LLM so that it’s included in the answer – hopefully with a citation or link to your site.

How do you do that? We don’t really know yet. But some are calling it GEO, for Generative Engine Optimization, and some are saying that’s only a part of it and you actually need to be doing AIO – AI Optimization. Whatever it’s called, it’s a fundamental shift in how web traffic happens and how it gets paid for (i.e., advertising). Google made a whopping $264,000,000,000 last year on ads. This is why OpenAI and others have openly talked about how they’re going to be incorporating ads into LLM responses.

Here’s the critical distinction: SEO aims for clicks. GEO ensures that AI uses your content in its responses.
– Sirte Pihlaga, AI Today


My take on why it matters, particularly for generative AI in the workplace


Much of this week’s news adds to themes that we’ve already seen lately. The important takeaways:

LLM companies are burning cash – it takes a lot to train these models – and they need bigger revenue streams. Consumer use alone isn’t enough, so they’re working to push into:

  • the corporate and business world
  • government entities
  • advertising to consumers (by serving ads in LLM responses, prioritizing sponsored content, and facilitating shopping (and probably taking a cut))

Job opportunities for new grads may be shrinking due to generative AI, but it’s too early to be sure. We know that coding and customer service are seeing the biggest impacts, and entry-level work (whether in technical roles, finance, legal, or consulting) seems ripe for replacement – AI can definitely do it faster and cheaper, and it might be able to do it as well as new graduates.

But as we saw with the reading comprehension tests, it’s not obvious where AI excels and where it struggles. It can be successful at many things, but it is not actually “intelligent” in the way we think of intelligence. It is brilliant at pattern matching, so it typically does well if the task is similar to something it’s seen before. But it only knows word patterns; it has no awareness of reality or logic or fundamental principles, so it’s not truly “thinking” or “reasoning.”

At its core, it’s still just a really, really good pattern matcher, so although it appears to have knowledge, that apparent knowledge comes from mimicking the patterns of knowledge in its training data. That’s why RAG is so powerful: it grounds the AI in actual facts and limits its ability to hallucinate. To get accurate and reliable results, companies are going to have to be very focused about applying it to tasks where it can succeed. And for most business applications, accuracy is key, so it will require safeguards that limit its weaknesses and detect its mistakes.


