A new study finds that as large language models (LLMs) become more powerful, they also seem to be getting more prone to making up facts rather than avoiding or refusing to answer questions they can't answer. This suggests that these smarter AI chatbots are actually becoming less reliable.
In the study, published in the journal Nature, the researchers examined some of the industry's leading commercial LLMs: OpenAI's GPT and Meta's LLaMA, as well as BLOOM, an open-source model created by the research group BigScience.
They found that while these LLMs' answers became more accurate in many cases, the models were overall less reliable, giving a higher percentage of incorrect answers than older models did.
José Hernández-Orallo, a researcher at the Valencian Research Institute for Artificial Intelligence in Spain, told Nature, "They can answer almost everything these days. That means more right answers, but also more wrong answers."
Mike Hicks, a philosopher of science and technology at the University of Glasgow who was not involved in the study, had a harsher take, telling Nature, "That looks to me like what we would call bullshitting. It's getting better and better at pretending to be knowledgeable."
In testing, the models were asked about a variety of topics ranging from math to geography, and were asked to perform tasks such as listing information in a specified order. Overall, the larger, more powerful models gave the most accurate answers, but they fared worse on more difficult questions, where their accuracy dropped.
Some of the biggest "liars", according to the researchers, were OpenAI's GPT-4 and o1, but all the LLMs studied seemed to follow this trend, with none of the LLaMA family of models reaching 60% accuracy even on the simplest questions.
And when asked to judge whether a chatbot's answer was accurate or inaccurate, a small group of human participants got it wrong 10% to 40% of the time.
In short, the study shows that the larger AI models get (in terms of parameters, training data, and other factors), the higher the percentage of incorrect answers they give.
According to the researchers, the simplest way to address the problem is to make LLMs less eager to answer everything. As Hernández-Orallo put it, "A threshold can be set, and when the question is challenging, let the chatbot say, 'No, I don't know.'" But if chatbots are limited to answering only what they know, it could expose the limitations of the technology.
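To make the idea of such a threshold concrete, here is a minimal, hypothetical sketch, not taken from the study: it assumes a stand-in `generate_answer` function that returns both a candidate answer and a confidence estimate, and simply declines to answer when confidence falls below an arbitrary cutoff.

```python
# Minimal sketch of the abstention threshold described above.
# Assumptions (not from the study): the model call and its confidence score
# are stand-ins; a real system might estimate confidence from token
# log-probabilities or agreement across multiple sampled answers.

CONFIDENCE_THRESHOLD = 0.7  # arbitrary cutoff; would be tuned on held-out questions


def generate_answer(question: str) -> tuple[str, float]:
    """Stand-in for a real LLM call that also returns a confidence estimate."""
    return "a plausible-sounding answer", 0.55  # dummy values for illustration


def answer_or_abstain(question: str) -> str:
    """Return the model's answer, or abstain if confidence is too low."""
    answer, confidence = generate_answer(question)
    if confidence < CONFIDENCE_THRESHOLD:
        return "No, I don't know."
    return answer


print(answer_or_abstain("A challenging question"))  # abstains, since 0.55 < 0.7
```

The trade-off the researchers point to is visible in the cutoff itself: raise it and the chatbot refuses more often but makes fewer confident mistakes; lower it and it answers more questions while risking more fabricated responses.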