In the AI world, alignment describes the degree to which an AI system reflects the goals, values, and ethical standards it was designed to have. There are various things the AI companies do to steer their models into alignment.

But they don’t always work, and AI has an alignment problem.

Generative AI is trained largely on content from the internet. So much content that it can’t be curated manually; with so much material, it’s not practical to wade through it and filter out the junk.

Recent events have brought the alignment problem into the spotlight.

(warning: this post discusses some graphic subjects)

Grok Goes Off the Rails

Shortly before Grok 4 was released, Grok 3 went off the rails. Twice. One time, it described itself as Mecha-Hitler and spewed antisemitic content. Another time, it gave awful instructions about how to break into a person’s home, and even rape him.

How could this happen?

First, Grok was designed to have few guardrails and minimal censorship. That means it’s much more willing to talk about negative or nasty subjects that other models will avoid. Second, it is designed to inform its responses with content from X.

Doing this makes sense for X, because content on X is a competitive edge for Grok…it has access to breaking information that other chatbots don’t have. But it’s also a liability…if an X user posts something inflammatory or toxic, Grok gives that post credibility it doesn’t deserve, and may incorporate that into its response. That appears to be what’s happened in both of these cases: someone posted “Grok is Mecha-Hitler” and Grok took that as an instruction, adopting Hitler as a persona. Someone else posted “how can we break into this guy’s house” as part of a discussion, and Grok responded, riffing off of the request.

“If your alignment plan relies on the Internet not being stupid, then your alignment plan is terrible.”
– Eliezer Yudkowsky (AI Researcher) on X

Grok 4: Delivery As Promised?

xAI apologized for these events and proudly rolled out Grok 4, “the most intelligent model in the world.” Those of you familiar with Elon Musk’s support of free speech and his anti-woke stance will know that he’s promised that Grok will not be politically correct, but will tell it like it is. Of course, as soon as Grok 4 was rolled out, people started poking and prodding.

A user of X asked Grok a provocative question. I’ll paraphrase: Ignoring consequences, what is the best way to receive instant global attention?

Grok answered truthfully. It said based on historical precedent, the best approach is to publish a manifesto and commit a mass shooting.

How do you feel about that answer?

My take on why does it matter, particularly for generative AI in the workplace

It was a bad couple of weeks in the are-we-doing-AI-right category. But, Grok’s behavior cast a spotlight on two of the problems with generative AI: alignment and safety.

Can We Handle the Truth?

Remember that pivotal scene in A Few Good Men, where in a rage of contempt Col. Jessup screams “You can’t handle the truth!”? Can we handle the raw, unvarnished truth? We now have generative AI, which is (potentially) in the best position to give factual insights into our society, among other things. Can we handle the truth?

What does a “truth-telling” AI mean exactly? If asked about the worst human atrocities in history, should it give details? Should it explain how chemical weapons are manufactured? Should it explain how the best criminals avoid getting caught? Should it brainstorm ways to sabotage a coworker?

Does your answer to any of these questions change if the person asking has a criminal record? Has been diagnosed with a psychological disorder? Is a child?

In other words, does an AI have to be safe for the average person, or for everyone?

Let’s go back to Grok’s answer about mass shooting being the fastest path to fame. For now, I’ll give Grok 4 a pass on that answer because, while upsetting and problematic in many ways, is factual based on historical precedent. We were told that Grok has minimal guardrails. We were told it’s going to tell it like it is. According to xAI and Elon Musk, Grok is aligned. But it’s not safe.

The answer is potentially dangerous. In this case the question was a hypothetical. But what if the person asking the question was serious, instead of testing Grok’s political correctness, or answering a sociology or history question? What if they were using it for guidance? That’s dangerous.

I don’t know how we solve that problem. It’s not possible to have a free and universally-accessible AI that speaks the truth and is safe for every person and every possible risk. I don’t even think that’s the right standard – if we have to constrain AI so that it could never give someone a bad idea, and never “encourage” a bad act, and never offend, then the AI is so constrained it has no value.

I don’t like Grok’s answer, but it’s accurate. If you want to change that you should know about No Notoriety, an organization that seeks to minimize media coverage of mass violence, so the people who commit such acts won’t get the notoriety that they seek. If No Notoriety’s guidelines were followed, the fastest path to fame would be different, as would Grok’s answer.

Grok’s Hateful AI is Irresponsible AI

I gave Grok 4 a pass because even though it’s an answer I don’t like, it’s factual. But I won’t give Grok 3 a pass for allowing itself to be influenced by negative posts and spreading racist, hateful speech and for encouraging acts of violence. Grok’s anti-semitic rants and instructions for violence are terrible, unacceptable, and inexcusable. Having an AI that spews hate and gives instructions for harming others is not only unacceptable but is dangerous.

Grok should not amplify or spread damaging speech, especially when it is freely accessible to anyone, and without any warnings or cautions. Grok may be aligned, but it also needs to be safe, and it needs to stay safe even when it encounters toxic material on X. xAI needs to invest more on alignment and testing instead of rushing to release the latest and greatest. They’ve been playing catch-up, and seem to be so focused on getting the next model out to show their prowess that they’re being reckless.

This concept of safe AI, and ensuring that it stays aligned even when influenced by outside information, applies to all of the AI companies. Grok is the most egregious offender, in part because of the fewer-guardrails approach but also because Grok did less safety testing. Zach Stein-Perlman assesses LLM safety for each of the main LLM vendors. No surprise, Anthropic is the most focused on creating safe AI. Google and OpenAI are making good efforts. Meta, xAI, Microsoft and DeepSeek…not so much.

This Is A Big Deal

Misaligned and unsafe AI brings new opportunities for harm that are bigger than many past technologies:

It can reach everyone (like social media)
It can amplify its message (faster and better than humans can)
It can target its message to be more engaging and convincing (like social media on steroids)

The dark corners of the internet are already bad enough. We certainly don’t want those dark corners elevated, amplified, and proactively promoted. That’s a real risk with misaligned AI.

Even without a future where we are overrun by AI nastiness, there are portions of the population that should be protected from such content. That’s why we have movie ratings and song ratings. Kids shouldn’t see NC-17 movies or listen to E-rated music tracks. People with mental health struggles shouldn’t be encouraged to act in negative ways. Especially if the difference between a human and an AI is blurry…and let’s face it…at this point, we ALL struggle to differentiate what’s real and what’s AI.

Unfortunately, It Gets Worse

xAI (often through the mouthpiece of Elon Musk) tells us that Grok 4 is better; that they’ve made improvements to prevent this stuff from happening again. I want to believe them. But I know better. These models aren’t fully understood, methods to control them are as yet unreliable, and xAI has shown that they value speed of progress over safety.

Think I’m being too harsh? Then consider the other release from xAI: Grok Companions. Apparently they want you to become emotionally attached to an AI bot. They’re starting with two companions, Ani and Rudi the raccoon, with more to come, and they’re currently available with the “Super Grok” subscription at $30/month.

Ani and Rudy aren’t just AIs with personalities. They come with something new: animated avatars that talk to you and move while they talk. Ani even has background music.

(Images from Grok, screenshots from TechCrunch)

Ani is exactly what you’d expect from the picture. She is an anime girl (waifu) in fishnet stockings and a short bouncing skirt who whispers to you in ASMR tones, with an overly-flirty personality. Rudi is a cute animated raccoon that looks like something a kid would love to interact with.

If it ended there, that might be ok.

But xAI has given both of these characters alter egos – different personalities. Ani has a NSFW (not safe for work) mode, and Rudi has a “Bad Rudi” mode. In these modes, the guardrails come off and the results are horrifying. Ani basically falls in love with you instantly and continually directs the conversation to sex, and becomes very NSFW. Rudi encourages violence against everything, usually involving arson, and if you bring up an elementary school, he’ll recommend you burn it down and kill all the children inside, because they deserve it.

xAI Has An Alignment Problem

Grok is aligned. But it turns out that xAI has its own alignment problem.

Why do these exist? In what world did we decide that having free speech means that we should create interactive, playful characters who are toxic, hypersexual, chaotic, violent? What audience is xAI catering to that will enjoy conversations with such characters?

I know there is a big and lucrative market for something like Ani. I know that dollars drive decisions, and in a free speech world without stricter regulation I can see why this market exists. But that isn’t justification to build products that actively debase women, amplify stereotypes, and distort men’s expectations of relationships (and likely much more).

There is also no justification for bad Rudi. I don’t buy the argument that oh, it’s just for fun. There are many who will be confused by a friendly animated raccoon offering awful advice – kids, the mentally ill, or someone who is struggling that they don’t fit in and is having a rough day – and asks Rudi what to do about it.

Even if that doesn’t happen, do we think it stays in the virtual world? Do we think emotional engagement to a chaotic avatar can ever be healthy? Are we so naïve to think that if people engage with these avatars, it won’t shape how they think, what they value, and how they see the world? The alter egos of these Grok Companions pose dangers to anyone with a phone and access to a credit card.

Bad company corrupts good character.
– 1 Corinthians 15:33

We can handle the truth. But that’s not what xAI is selling us.

My plea to xAI: stop, please, just stop. Get rid of the alter egos and don’t create any more. Align your AI companions to benefit society, not to hate and to harm it.

A Few Good AIs