
Newer and bigger doesn’t always mean smarter. A study that fact-checked generative AI results found that even the best models were free from hallucinations only about 35% of the time. And they haven’t gotten better with time and resources: OpenAI’s newest GPT-4o model hallucinated at about the same rate as GPT-3.5. The smaller Haiku version of Anthropic’s Claude 3 performed about as well as the bigger Opus model.
And now the models could be collapsing. AI models are increasingly being trained on AI-generated data, whether deliberately (to sidestep copyright claims) or accidentally (as more AI-generated content floods the web, it inevitably finds its way into training data). A separate study showed that this leads to "model collapse": errors compound from one generation of models to the next until the AI starts spitting out gibberish. The researchers found that high-quality synthetic data can limit the damage, but with only about a third of AI content being error-free, that may be some way off.
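To make the mechanism concrete, here's a toy simulation (our sketch, not either study's setup): the "model" simply memorizes the empirical token distribution of its training data, and each generation is trained purely on the previous generation's output. The vocabulary size, sample count, and number of generations are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

VOCAB = 50     # distinct "tokens" in the original human-written data
SAMPLES = 200  # size of each generation's training set

# Generation 0: real data, drawn uniformly over the whole vocabulary.
data = rng.integers(0, VOCAB, size=SAMPLES)

for gen in range(1, 31):
    # "Train" a model: estimate the empirical token distribution.
    counts = np.bincount(data, minlength=VOCAB)
    probs = counts / counts.sum()
    # The next generation trains only on this model's own output.
    data = rng.choice(VOCAB, size=SAMPLES, p=probs)
    survivors = np.count_nonzero(np.bincount(data, minlength=VOCAB))
    if gen % 5 == 0:
        print(f"generation {gen:2d}: {survivors} of {VOCAB} tokens survive")
```

Any token that happens to miss one generation's sample gets probability zero and never comes back, so diversity can only shrink: a bare-bones version of the disappearing-tails effect that the model-collapse research describes.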
Don’t fully trust AI just yet. Benchmarking tests are notoriously unreliable, and even the Chatbot Arena, a blind, crowdsourced AI ranking that’s meant to be more objective, can be a poor guide depending on what you’re using AI for, as one developer found this week.