Research Finds: The Smarter AI Becomes, the More Likely It Is to "Make Things Up"
IT Home, September 29 news: A new study has found that as large language models (LLMs) become more powerful, they seem increasingly prone to fabricating answers rather than avoiding or refusing questions they cannot answer. This suggests that these smarter AI chatbots have actually become less reliable.

The study, which was published in the journal Nature, looked at some of the industry-leading commercial LLMs: OpenAI's GPT and Meta's LLaMA, as well as the open-source model BLOOM created by the research group BigScience.
The research found that although these LLMs' responses have become more accurate in many cases, their overall reliability has worsened, with a higher proportion of incorrect answers than older models.
José Hernández-Orallo, a researcher at the Valencian Institute of Artificial Intelligence in Spain, told Nature, "Nowadays, they answer almost everything. This means more correct answers, but also more incorrect ones."
Mike Hicks, a philosopher of science and technology at the University of Glasgow who did not participate in the study, was more critical, telling Nature, "In my view, this is what we call nonsense, and it is becoming increasingly good at pretending to be knowledgeable."
In the tests, the models were questioned on subjects ranging from mathematics to geography and were also given tasks such as listing information in a specified order. Overall, the larger and more powerful models provided the most accurate answers, but their accuracy dropped noticeably on more difficult questions.
The researchers said that some of the biggest "liars" were OpenAI's GPT-4 and o1, but all the LLMs studied seemed to show this trend. None of the LLaMA series models were able to achieve a 60% accuracy rate, even for the simplest questions.
When asked to judge whether the chatbots' responses were accurate, a small group of human participants made the wrong call 10% to 40% of the time.
In summary, the study indicates that the larger the AI model (in terms of parameters, training data, and other factors), the higher the proportion of incorrect answers it gives.
The researchers said the simplest way to address these problems is to make LLMs less eager to answer everything. Hernández-Orallo said, "You can set a threshold so that when the question is challenging, the chatbot says 'no, I don't know.'" However, if chatbots are restricted to answering only what they know, it may expose the limitations of the technology.
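The abstention threshold Hernández-Orallo describes can be illustrated with a minimal sketch. Assuming a hypothetical model interface that returns an answer together with a confidence score, a wrapper can refuse to answer whenever that score falls below a cutoff; the function names and the 0.7 threshold here are illustrative assumptions, not details from the study.

```python
# Minimal sketch of a confidence-threshold abstention wrapper.
# `query_model`, its (answer, confidence) return value, and the 0.7 cutoff
# are hypothetical, chosen only to illustrate the idea from the article.

def answer_or_abstain(question, query_model, threshold=0.7):
    """Return the model's answer only if its confidence clears the threshold;
    otherwise admit not knowing instead of guessing."""
    answer, confidence = query_model(question)  # assumed API: (text, score in [0, 1])
    if confidence < threshold:
        return "I don't know."
    return answer


# Toy stand-in model: confident on an easy question, unsure on a hard one.
def toy_model(question):
    if "capital of France" in question:
        return "Paris", 0.95
    return "a guessed answer", 0.30


print(answer_or_abstain("What is the capital of France?", toy_model))   # Paris
print(answer_or_abstain("What is the 1000th digit of pi?", toy_model))  # I don't know.
```

The trade-off the researchers point to shows up directly in such a setup: raising the threshold yields more "I don't know" responses and fewer wrong answers, but it also makes the model's limitations more visible to users.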