I wanted to extract some crime statistics broken down by type of crime and by different populations, all of course normalized by population size. I got a nice set of tables summarizing the data for each year I requested.
When I shared these summaries I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?
I compared results from ChatGPT (GPT-4), Copilot, and Grok, and the results are the same (Gemini says the data is unavailable, btw :))
So, are LLMs reliable for research like that?
So are LLMs reliable for research like that?
No. Of course not. They're not reliable for anything. They don't have any kind of database of facts and don't know or attempt to know anything at all.
They're just a more advanced version of your phone's predictive text. All they do is try to figure out which words most likely go in what order as a response to the prompt. That's it. There is no logic of any kind dictating what an LLM outputs.
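To make the "advanced predictive text" point concrete, here is a toy sketch of next-word prediction. The word counts are entirely made up for illustration; a real LLM uses a neural network over billions of parameters rather than a lookup table, but the core idea (predict a plausible next token, not retrieve a fact) is the same.

```python
import random

# Toy "predictive text": made-up counts of which word followed which
# in some imagined training text. Real LLMs learn a vastly more
# sophisticated version of this, but the principle is the same.
next_word_counts = {
    "the": {"crime": 5, "data": 3, "model": 2},
    "crime": {"rate": 6, "statistics": 4},
}

def predict_next(word, counts, rng=random):
    """Sample the next word in proportion to how often it followed `word`."""
    options = counts[word]
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]
```

Note that nothing here checks whether the continuation is *true*; it only checks whether it is *statistically plausible*, which is exactly the OP's problem.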
Proof that you can do anything if somebody piles billions of dollars on you.
The least unreliable LLM I've found by far is Perplexity, in Pro mode. (By the way, if you want to try it out, you get a few free uses a day.)
The reason is that Pro mode doesn't retrieve and spit out information from its internal memory bank; instead, it uses that information to launch multiple search queries, summarises the pages it finds, and then gives you the result.
Other LLMs try to answer "from memory" and then add some links at the bottom for fact checking, but Perplexity's answers usually come straight from the web, so they're usually quite good.
However, I still check (depending on how critical the task is) that the tidbit of information has one or two links next to it, that the links talk about the right thing, and I verify the data myself if it's actually critical that it gets it right. I use it as a beefier search engine, and it works great because it limits the possible hallucinations to the summarisation of pages. But it doesn't eliminate the possibility completely so you still need to do some checking.
If generation temperature is non-zero (which it often is), there is inherent randomness in the output. So even if the first number in a statistic should be 1, sometimes it will just randomly pick another plausible number. Even if the network always assigns the correct token the highest probability, it's basically doing a weighted coin toss for every token to make answers more creative.
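The temperature point can be sketched directly. Below is a minimal softmax-with-temperature sampler; the token scores are hypothetical, chosen so that "1" is the model's top-ranked next digit. At temperature 0 it is always picked; at temperature 1 the "wrong" digits still come up a meaningful fraction of the time.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token from a dict of {token: score} using softmax sampling.

    With temperature 0 we fall back to greedy decoding (always the top
    token); with temperature > 0 even the top token is not guaranteed.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    toks = list(exps)
    weights = [exps[t] / total for t in toks]
    return rng.choices(toks, weights=weights, k=1)[0]

# Hypothetical scores for the first digit of a statistic: "1" is the
# most likely token, but at temperature 1.0 "2" and "7" still get
# sampled a noticeable fraction of the time.
logits = {"1": 3.0, "2": 1.5, "7": 1.0}
```

With these scores, temperature 1.0 gives "1" only about 74% of the time, so over a whole table of numbers, errors are essentially guaranteed.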
That's on top of hoping the LLM has even seen that data during training AND managed to memorize it during training AND that the networks just happens to be able to reproduce the correct data given your prompt (it might not be able to for a different prompt).
If you want any reliability at all, you need to use RAG AND also you yourself have to double check all the references it quotes (if it even has that capability).
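The RAG idea, in miniature: fetch relevant source text first, then answer *from those sources* and show them to the user for checking. Everything below is a toy stand-in (a keyword-overlap retriever and made-up documents instead of a real vector store and LLM), but it shows why the references are checkable at all: they exist before the answer is generated.

```python
# Made-up source passages; a real system would retrieve from a corpus
# or the web via a vector index, not a hand-written list.
documents = [
    "2021 burglary rate: 271 per 100,000 residents (made-up figure).",
    "2021 assault rate: 148 per 100,000 residents (made-up figure).",
]

def retrieve(query, docs, k=1):
    """Rank documents by crude keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def answer_with_sources(query, docs):
    """Retrieve supporting passages and return them alongside the query.

    A real RAG system would now prompt the LLM with these passages;
    here we just hand the sources back so a human can verify them.
    """
    return {"query": query, "sources": retrieve(query, docs)}
```

Even in a real RAG pipeline the final summarisation step can still hallucinate, which is why the references themselves must be checked.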
Even if it has all the necessary information to answer correctly in its context window, it can still answer incorrectly.
None of the current models are anywhere close to producing trustworthy output 100% of the time.