To me the research around solving “hallucination” is a dead end. The models will always hallucinate, and merely reducing the probability that they do so only makes the mistakes more dangerous. The question then becomes “for what purposes (if any) are the models profitable, even if they occasionally hallucinate?” Whoever solves that problem walks away with the market.
It's valid to take either position, that both can be aware of truth or that neither can be, and there has been a lot of philosophical debate on this exact question as applied to humans since well before even mechanical computers were invented.
Plato's cave comes to mind.
The situation you described is possible, but would require something like a subversive propaganda effort by the state.
Inferring truth about a social event in a social situation, for example, requires a nuanced set of thought processes and attention mechanisms.
If we had a swarm of LLMs collecting a variety of data from a variety of disparate sources, where the swarm communicates for consensus, it would be very hard to convince them that Moscow is in Connecticut.
Unfortunately we are still stuck in monolithic training run land.
Also, given the lack of imagination everyone has with naming places, I had to check:
This last bit is not a great thing though, as LLMs don't have the direct experience needed to correct factual errors about the external world. Unfortunately we care about the external world, and want them to make accurate statements about it.
It would be possible for LLMs to see inconsistencies across or within sources, and try to resolve those. If perfect, then this would result in a self-consistent description of some world, it just wouldn't necessarily be ours.
I do think that is an extremely inefficient way to run a swarm (e.g. across time, through training data), and it would make more sense to solve the pretraining problem (to connect them to the external world, as you pointed out) and actually run multiple LLMs in a swarm at the same time.
> The situation you described is possible, but would require something like a subverting effort of propaganda by the state.
Great! LLMs are fed from the same swarm.
> If you pretrained an LLM with data saying Moscow is the capital of Connecticut it would think that is true.
> Well so would a human!
But humans aren't static weights, we update continuously, and we arrive at consensus via communication as we all experience different perspectives. You can fool an entire group through propaganda, but there are boundless historical examples of information making its way in through human communication to overcome said propaganda.
Continual/online learning is still an area of active research.
And it's not for a lack of trying. Logic cannot even handle narrow intelligence that deals with parsing the real world (speech/image recognition, classification, detection, etc.). But those systems are flawed and mispredict, so why build them? Because they are immensely useful, flaws or no.
Having deeply flawed machines in the sense that they perform their tasks regularly poorly seems like an odd choice to pursue.
>Having deeply flawed machines in the sense that they perform their tasks regularly poorly seems like an odd choice to pursue.
State of the art ANNs are generally mostly right though. Even LLMs are mostly right, that's why hallucinations are particularly annoying.
People do seem to have higher standards for machines but you can't eat your cake and have it. You can't call what you do reasoning and turn around and call the same thing something else because of preconceived notions of what "true" reasoning should be.
That is to say, to the best of our knowledge humans have no purely logical way of knowing truth ourselves. Human truth seems intrinsically connected to humanity and lived experience, with logic being a minor offshoot.
I think research about hallucination is actually pretty valuable though. Consider that humans make mistakes, and yet we employ a lot of them for various tasks. LLMs can't do physical labor, but an LLM with a low enough hallucination rate could probably take over many non social desk jobs.
Although in saying that, it seems like it also might need to be able to learn from the tasks it completes, and probably a couple of other things too to be useful. I still think the highish level of hallucination we have right now is a major reason why they haven't replaced a bunch of desk jobs though.
Isn't this just conveniently glossing over the fact that you weren't taught that? It's not a "model of truthfulness"; you were taught facts about geography and you learned them.
I am saying that humans can have a "truther" way of knowing some facts through direct experience. However there are a lot of facts where we don't have that kind of truth, and aren't really on any better ground than an LLM.
I'm sure we could make similar communities of LLMs, but instead we treat a task as the role of a single LLM that either succeeds or fails. As you say, perhaps because of the high error rate, the very notion of LLM failure and success is judged differently too. Beyond that, a passable human pilot and a passable LLM pilot might have similar average performance but differ hugely in other measurements.
It's both interesting and sensible that we have this education in the training phase but not the usage phase. Currently we don't tend to do any training once the usage phase is reached. This may be at least partially because over-training models for any special-purpose task (including RLHF) seems to decrease performance.
I wonder how far you could get by retraining from some checkpoint each time, with some way to gradually increase the quality of the limited quantity of training data being fed in. The newer data could come from tasks the model completed, along with feedback on performance from a human or other software system.
Someone's probably already done this though. I'm just sitting in my armchair here!
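To make that armchair idea concrete, here is a minimal sketch of the loop I have in mind; every helper here is a hypothetical stub, not a real training API:

    # Sketch: each round, branch from the same base checkpoint and fine-tune on a
    # dataset that only grows with outputs that received good feedback.
    def load_base_checkpoint():
        return {"weights": "base"}

    def fine_tune(model, dataset):
        # stub: pretend we fine-tuned on the curated dataset
        return {**model, "tuned_on": len(dataset)}

    def run_tasks(model, n=10):
        # stub: the model completes n tasks and a reviewer scores each one 0..1
        return [{"output": f"task-{i}", "score": (i % 3) / 2} for i in range(n)]

    curated = []
    for round_no in range(3):
        model = fine_tune(load_base_checkpoint(), curated)  # always branch from the base
        for result in run_tasks(model):
            if result["score"] >= 0.5:  # keep only well-reviewed outputs
                curated.append(result["output"])
    print(len(curated), "curated examples after 3 rounds")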
I think there are two issues here:
1. The "truthfulness" of the underlying data set, and
2. The faithfulness of the LLM in passing along that truthfulness.
Failing to pass along that truthfulness is, I think, the definition of a hallucination.
To your point, if the data set is flawed or factually wrong, the model will always produce the wrong result. But I don't think that's a hallucination.
This isn't true.
You're conflating whether a model (that hasn't been fine tuned) would complete "the capital of Connecticut is ___" with "Moscow", and whether that model contains a bit labeling that fact as "false". (It's not actually stored as a bit, but you get the idea.)
Some sentences that a model learns could be classified as "trivia", and the model learns this category by sentences like "Who needs to know that octopuses have three hearts, that's just trivia". Other sentences a model learns could be classified as "false", and the model learns this category by sentences like "2 + 2 isn't 5". Whether a sentence is "false" isn't particularly important to the model, any more than whether it's "trivia", but it will learn those categories.
There's a pattern to "false" sentences. For example, even if there's no training data directly saying that "the capital of Connecticut is Moscow" is false, there are a lot of other sentences like "Moscow is in Russia" and "Moscow is really far from CT" and "people in Moscow speak Russian", that all together follow the statistical pattern of "false" sentences, so a model could categorize "Moscow is the capital of Connecticut" as "false" even if it's never directly told so.
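A crude way to poke at that statistical pattern (this is just sentence likelihood under a small pretrained LM, not a learned "false" category, so treat it as a sketch):

    # Average negative log-likelihood under GPT-2: the false geography sentence
    # should come out as much less likely than the true one.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def avg_nll(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # loss = mean NLL over tokens
        return out.loss.item()

    print(avg_nll("The capital of Connecticut is Hartford."))
    print(avg_nll("The capital of Connecticut is Moscow."))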
Some data might support a statistical approach; other data might not, even though it doesn't contain misrepresentations as such.
Of course, maybe induction is false and gravity will reverse in the next 3 seconds after writing this comment and God will reveal themself to us. We have no justified reason to think otherwise other than the general principle that things behave the way we observe them to and will continue to do so.
If not then I'm not even sure what the disagreement is.
Apart from that, these "manifolds" have noise, so that is another difference from standard manifolds.
In other words, a model might have local contextual indicators, but not be able to recognize global hard logical contradictions.
As you have put it well, there is no notion of truthfulness encoded in the system as it is built, hence there is no way to fix the problem.
An analogy here is around the development of human languages as a means of communication and as a means of encoding concepts. The only languages that humans have developed that encode truthfulness in a verifiable manner are mathematical in nature. What is needed may be along the lines of encoding concepts with a theorem prover built in - so what comes out is always valid - but then that will sound like a robot lol, and only a limited subset of human experience can be encoded in this manner.
A more interesting pursuit might be to determine if humans are "hallucinating" in this same way, if only occasionally. Have you ever known one of those pathological liars who lie constantly and about trivial or inconsequential details? Maybe the words they speak are coming straight out of some organic LLM-like faculty. We're all surrounded by p-zombies. All eight of us.
If it was, maybe. But it wasn't.
Training data isn't random - it's real human writing. It's highly correlated with truth and correctness, because humans don't write for the sake of writing, but for practical reasons.
t. @abouelleill
EDIT: I've had some time to think and if you read somewhere that Hartford is the capital of Connecticut, you're right in a Gettier way too. Reading some words that happen to be true is exactly like using a picture of your room as your zoom background. It is a facsimile of the knowledge encoded as words.
Training data is the source of ground truth, if you mess that up that's kind of a you problem, not the model's fault.
- Moscow is a Russian city, and there probably aren't a lot of cities in the US that have strong Russian influences, especially in the era when these cities might have been founded
- there's a concept of novelty in trivia, whereby the more unusual the factoid, the better the recall of that fact. If Moscow were indeed the capital of Connecticut, it seems like the kind of thing I might've heard about since it would stand out as being kind of bizarre.
Notably, this type of inference seems to be relatively distinct from what LLMs are capable of modeling.
One particular case was an attempt to plug GPT-4 as a decision maker for certain actions in a video game. One of those was voting for a declaration of war (all nobles of one faction vote on whether to declare war on another faction). This mostly boils down to assessing risk vs benefits, and for a specific clan in a faction, the risk is that if the war goes badly, they can have some of their fiefs burned down or taken over - but this depends on how close the town or village is to the border with the other faction. The LM was given a database schema to query using SQL, but it didn't include location information.
To my surprise, GPT-4 (correctly!) surmised in its chain-of-thought, without any prompting, that it can use the culture of towns and villages - which was in the schema - as a sensible proxy to query for fiefs that are likely to be close to the potential enemy, and thus likely to be lost if the war goes bad.
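For anyone curious, the proxy it came up with boils down to a query like the toy reconstruction below; the schema and names are invented for illustration, not the actual game database:

    # Toy version of the "culture as a proxy for proximity" idea.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE settlements (name TEXT, owner_clan TEXT, culture TEXT);
        INSERT INTO settlements VALUES
            ('Border Keep', 'our_clan',   'enemy_culture'),
            ('Home Castle', 'our_clan',   'our_culture'),
            ('Far Village', 'other_clan', 'enemy_culture');
    """)

    # Fiefs we own whose culture matches the prospective enemy are assumed to sit
    # near the border, and so are the ones at risk if the war goes badly.
    at_risk = db.execute(
        "SELECT name FROM settlements WHERE owner_clan = ? AND culture = ?",
        ("our_clan", "enemy_culture"),
    ).fetchall()
    print(at_risk)  # [('Border Keep',)]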
In reality humans are wrong basically most of the time, especially when you go off a human's immediate reaction to a problem, which is what we force LLMs to do (unless you're using chain of thought or pause tokens).
That being said, there still is a notion of truthfulness, because LLMs can also be made to deceive, in which case they 'know' to act deceptively.
GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r
Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - https://arxiv.org/abs/2310.06824
The Internal State of an LLM Knows When It's Lying - https://arxiv.org/abs/2304.13734
LLMs Know More Than What They Say - https://arjunbansal.substack.com/p/llms-know-more-than-what-...
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334
To wit, in your first link it seems the figure is just showing the trivial fact that the model is trained on the MMLU dataset (and after RLHF it is no longer optimized for that). The second link's main claim seems to be contradicted by their Figure 12 left panel, which shows ~0 correlation between model-predicted and actual truth.
I'm not going to bother going through the rest.
I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
It's related research either way. And I did read them. I think there are probably issues with the methodology of 4, but it's there anyway because it's interesting research that is related and is not without merit.
>The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
The panel is pretty weak on correlation, but it's quite clearly also not the only thing that supports that particular claim, nor does it contradict it.
>I'm not going to bother going through the rest.
Ok? That's fine
>I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
You are free to assume anything you want.
It very clearly contradicts it: There is no correlation between the predicted truth value and the actual truth value. That is the essence of the claim. If you had read and understood the paper you would be able to specifically detail why that isn't so rather than say vaguely that "it is not the only thing that supports that particular claim".
Not every internet conversation need end in a big debate. You've been pretty rude and I'd just rather not bother.
You also seem to have a lot to say on how much people actually read papers, but your first response also took like 5 minutes. I'm sorry but you can't say you've read even one of those in that time. Why would I engage with someone being intellectually dishonest?
You've posted the papers multiple times over the last few months, so no I did not read them in the last five minutes though you could in fact find both of the very basic problems I cited in that amount of time.
I'm even less willing to engage.
> you can't say you've read even one of those in that time.
I'm not sure if you're aware, but most of those papers are well known. All the arxiv papers are from 2022 or 2023. So I think your 5 minutes is pretty far off. I for one have spent hours, but the majority of that was prior to this comment. You're claiming intellectual dishonesty too soon.
That said, @foobarqux, I think you could expand on your point more to clarify. @og_kalu, focus on the topic and claims (even if not obvious) rather than the time
Fair Enough. With the "I'm not going to bother with the rest", it seemed like a now thing.
>focus on the topic and claims (even if not obvious) rather than the time
I should have just done that yes. 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
> 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
I took it as hyperbole. And honestly I don't find that plot or much of the paper convincing. Though I have a general frustration in that it seems many researchers (especially in NLP) willfully do not look for data spoilage. I know they do deduplication, but I do question how many try to vet this by manual inspection. Sure, you can't inspect everything, but we have statistics for that. And any inspection I've done leaves me very unconvinced that there is no spoilage. There's quite a lot in most datasets I've seen, which can hugely change the interpretation of results. After all, we're elephant fitting.
I think it comes from the ad hoc nature of evaluation in young fields. It's like you need an elephant but obviously you can't afford one, so you put a dog in an elephant costume and call it an elephant, just to get moving in the right direction. It takes a long time to get that working, and progress can still be made by upgrading the dog costume. But at some point people forgot that we need an elephant, so everyone is focused on the intricacies of the costume, and some will try dressing up the "elephant" as another animal. Eventually the dog costume isn't "good enough" and leads us in the wrong direction. I think that's where we are now.
I mean, do we really think we can measure language with entropy? Fidelity and coherence with FID? We have no mathematical description of language, artistic value, aesthetics, and so on. The biggest improvement has been RLHF, where we just use Justice Potter Stewart's metric: "I know it when I see it".
I don't think it's malice. I think it's just easy to lose sight of the original goal. ML certainly isn't the only one to have done this but it's also hard to bring rigor in and I think the hype makes it harder. Frankly I think we still aren't ready for a real elephant yet but I'd just be happy if we openly acknowledge the difference between a dog in a costume proxying as an elephant and an actually fucking elephant.
[0] seriously, how do we live in a world where I have to explain what covariance means to people publishing works on diffusion models and working for top companies or at top universities‽
Maybe you’ve inferred some view based on the names of the titles, but in that case you seem to be falling afoul of your own complaint?
If I am wrong or not useful in my posts, I would hope to be allowed to remove what was wrong and/or not useful without losing my standing to share the accurate, useful things. Anything else seems like residual punishment outside the appropriate context.
Related:
Still no lie detector for language models: probing empirical and conceptual roadblocks - https://link.springer.com/article/10.1007/s11098-023-02094-3
Hallucination is Inevitable: An Innate Limitation of Large Language Models - https://arxiv.org/abs/2401.11817
LLMs Will Always Hallucinate, and We Need to Live With This - https://arxiv.org/abs/2409.05746
(disclaimer, I also have not read any of these papers beyond the title!)
The claim in the abstract is:
"""We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format.
Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks."""
The plot is much denser at the origin and top right. How is that 0 correlation? Depending on the size of their held-out test set, that could even be a pretty strong correlation.
And how does that contradict the claims they've made, especially on calibration (Fig 13 down)?
First we agree by observation that outside of the top-right and bottom-left corners there isn't any meaningful relationship in the data, regardless of what the numerical value of the correlation is. Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5). This is also consistent with the general behavior displayed in figure 13.
If you have some other interpretation of the data you should lay it out. The authors certainly did not do that.
edit: By the way, there are people working on a re-sampling algorithm called entropix, based on the entropy and variance of the output logits: if the output probabilities for the next token are spread evenly, for example (rather than concentrating overwhelming probability on a single token), they prompt for additional clarification. They don't really claim anything like the model "knows" whether it's wrong, but they say it improves performance.
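The basic mechanic, as I understand it (a simplification of the idea, not entropix's actual heuristic):

    # If the next-token distribution is too flat (high entropy), treat the model
    # as unsure and, for example, re-sample or ask for clarification.
    import torch

    def too_uncertain(logits, entropy_threshold=3.0):
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-9)).sum()
        return entropy.item() > entropy_threshold

    confident = torch.zeros(50_000); confident[42] = 20.0  # sharply peaked logits
    unsure = torch.zeros(50_000)                           # uniform logits
    print(too_uncertain(confident), too_uncertain(unsure))  # False True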
A y=x relationship is not necessary for meaningful correlation and the abstract is quite clear on out of sample performance either way.
>Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5).
The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
>edit: By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix
I know about entropix. It hinges strongly on the model's representations. If it works, then choosing to call it "knowing" or not is just semantics.
I’m not concerned with correlation (which may or may not indicate an actual relationship) per se; I’m concerned with whether there is a meaningful relationship between predicted and actual. The Figure 12 plot clearly shows that predicted isn’t tracking actual even in the corners. I think one of the lines (predicting 0% but actual is like 40%, going from memory on my phone) of Figure 13 right even more clearly shows there isn’t a meaningful relationship. In any case the authors haven’t made any argument about how those plots support their arguments and I don’t think you can either.
> the abstract is quite clear on out of sample performance either way.
Yes I’m saying the abstract is not supported by the results. You might as well say the title is very clear.
> The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
Now we’ve gone from “the paper shows” to speculating about what the paper might have shown (and even that is probably not possible based on the Figure 13 line I described above)
> choosing to call it "knowing" or not is just semantics.
Yes it’s semantics but that implies it’s meaningless to use the term instead of actual underlying properties.
The abstract also quite literally states that models struggle with out-of-distribution tests, so again, what is the contradiction here?
Would it have been hard to simply say you found the results unconvincing? There is nothing contradictory in the paper.
> The abstract also quite literally states that models struggle with out-of-distribution tests, so again, what is the contradiction here?
Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
> Would it have been hard to simply say you found the results unconvincing?
Anyone can look at the graphs, especially Figure 13, and see this isn't a matter of opinion.
> There is nothing contradictory in the paper.
The results contradict the titular claim that "Language Models (Mostly) Know What They Know".
Yeah but Lambada is not the only line there.
>Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
Train the classifier on math questions and get good calibration for math; train the classifier on true/false questions and get good calibration for true/false; train the classifier on math but struggle with true/false (and vice versa). This is what "out-of-distribution" is referring to here.
Make no mistake, the fact that both the first two work is evidence that models encode some knowledge about the truthfulness of their responses. If they didn't, it wouldn't work at all. Statistics is not magic and gradient descent won't bring order where there is none.
What out of distribution "failure" here indicates is that "truth" is multifaceted and situation dependent and interpreting the models features is very difficult. You can't train a "general LLM lie detector" but that doesn't mean model features are unable to provide insight into whether a response is true or not.
There are 3 out-of-distribution lines, all of them bad. I explicitly described two of them. Moreover, it seems like the worst time for your uncertainty indicator to silently fail is when you are out of distribution.
But okay, forget about out-of-distribution and go back to Figure 12 which is in-distribution. What relationship are you supposed to take away from the left panel? From what I understand they were trying to train a y=x relationship but as I said previously the plot doesn't show that.
An even bigger problem might be the way the "ground truth" probability is calculated: they sample the model 30 times and take the percentage of correct results as ground truth probability, but it's really fishy to say that the "ground truth" is something that is partly an internal property of the model sampler and not of objective/external fact. I don't have more time to think about this but something is off about it.
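For concreteness, the "ground truth" in question is computed roughly like this (hypothetical stub in place of the model, not the paper's code):

    import random

    def sample_answer(question):
        # stub standing in for sampling the model once
        return random.choice(["Hartford", "Hartford", "Moscow"])

    def ground_truth_prob(question, correct_answer, n_samples=30):
        hits = sum(sample_answer(question) == correct_answer for _ in range(n_samples))
        return hits / n_samples  # depends on the model's own sampler, not just external fact

    print(ground_truth_prob("What is the capital of Connecticut?", "Hartford"))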
All this to say that reading long scientific papers is difficult and time-consuming and let's be honest, you were not posting these links because you've spent hours poring over these papers and understood them, you posted them because the headlines support a world-view you like. As someone else noted you can find good papers that have opposite-concluding headlines (like the work of rao2z).
"Will the ability to let everyone express increase noise?" // Yes
"Will feeding all available data to a processor reduce noise?" // Probably
"Everywhere the world movement seems to be in the direction of centralised economies which can be made to ‘work’ in an economic sense but which are not democratically organised and which tend to establish a caste system. With this go the horrors of emotional nationalism and a tendency to disbelieve in the existence of objective truth because all the facts have to fit in with the words and prophecies of some infallible fuhrer. Already history has in a sense ceased to exist, ie. there is no such thing as a history of our own times which could be universally accepted, and the exact sciences are endangered as soon as military necessity ceases to keep people up to the mark. Hitler can say that the Jews started the war, and if he survives that will become official history. He can’t say that two and two are five, because for the purposes of, say, ballistics they have to make four. But if the sort of world that I am afraid of arrives, a world of two or three great superstates which are unable to conquer one another, two and two could become five if the fuhrer wished it. That, so far as I can see, is the direction in which we are actually moving, though, of course, the process is reversible."
So yeah, you can't force all models to have all the biases that you want them to have. But you most certainly can limit the number of such models and restrict access to them. It's not really any different from how totalitarian societies have treated science in general.
Such a reductionist view of the issue: the mere suggestion that hallucinations can be fixed by tweaking some variable or fixing some bug immediately discredits the researchers.
Alternatively, are you saying that they can never be entirely fixed because LLMs are an approximate method? I'm in agreement here, but I don't think the researchers are claiming that they solved hallucinations completely.
Do you think LLMs don't have an internal model of the world? Many people seem to think that, but it is possible to find an internal model of the world in small LLMs trained on specific tasks (See [0] for a nice write-up of someone doing that with an LLM trained on Othello moves). Presumably larger general LLMs have various models inside of them too, but those would be more difficult to locate. That being said, I haven't been keeping up with the literature on LLM interpretation, so someone might have managed it by now.
Yes, if.
Or we could realize that the LLM's output is a random draw from a distribution learned from the training data, i.e. ALL of its outputs are a hallucination. It has no concept of truth or falsehood.
However, we do know that LLMs possess viable internal models, as I linked to in the post you are responding to. The OP paper notes that the probes it uses find the strongest signal of truth, where truth is defined by whatever the correct answer on each benchmark is, in the middle layers of the model during the activation of these "exact answer" tokens. That is, we have something which statistically correlates with whether the LLM's output matches "benchmark truth" inside the LLM. Assuming that you are willing to grant that "concept" and "internal model" are pretty much the same, this sure sounds like a concept of "benchmark truth" at work. If you aren't willing to grant that, I have no idea what you mean by concept.
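To give a rough picture of what such a probe is (placeholder data here, not the paper's setup): take the hidden state at the exact-answer token from a middle layer and fit a simple classifier against benchmark correctness.

    # Minimal linear-probe sketch. With random placeholder activations the probe
    # should land near chance; the paper reports a much stronger signal from real
    # middle-layer activations at the answer token.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    hidden_states = rng.normal(size=(1000, 768))  # (n_examples, d_model) at the answer token
    labels = rng.integers(0, 2, size=1000)        # 1 = model's answer matched the benchmark

    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:800], labels[:800])
    print("held-out probe accuracy:", probe.score(hidden_states[800:], labels[800:]))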
If you mean to say that humans have some model of Objective Truth which is inherently superior, I'd argue that isn't really the case. Human philosophers have been arguing for centuries over how to define truth, and don't seem to have come to any conclusion on the matter. In practice, people have wildly diverging definitions of truth, which depend on things like how religious or skeptical they are, what the standards for truth are in their culture, and various specific quirks from their own personality and life experience.
This paper only measured "benchmark truth" because that is easy to measure, but it seems reasonable to assume that other models of truth exist within them. Given that LLMs are supposed to replicate the words that humans wrote, I suspect that their internal models of truth work out to be some agglomeration (plus some noise) of what various humans think of as truth.
If language communicates thoughts, thoughts have a relationship with reality, and that relationship might be true or false or something else.
Then what thought is LLM language communicating, to what reality does it bear a relationship, and what is the truth or falseness of that language?
To me, LLM generated sentences have no truth or false value, they are strings, literally, not thoughts.
Take the simple "user: how much is two plus two? assistant: two plus two is four". It may seem trivial, but how do you ascertain that that statement maps to 2+2=4? Do you make a leap of faith or argue that the word plus maps to the adding function? What about "is": does it map to equality? Even if they are the same tokens as water is wet (where wet is not water?). Or are we arguing that the truthfulness lies in the embedding interpretation? Where now tokens and strings merely communicate the multidim embedding space, which could be said to be a thought, and we are mapping some of the vectors in that space as true, and some as false?
Let's assume LLMs don't "think". We feed an LLM an input and get back an output string. It is then possible to interpret that string as having meaning in the same way we interpret human writing as having meaning, even though we may choose not to. At that point, we have created a thought in our heads which could be true or false.
Now let's talk about calculators. We can think of calculators as similar to LLMs, but speaking a more restricted language and giving significantly more reliable results. The calculator takes a thought converted to a string as input from the user, and outputs a string, which the user then converts to a thought. The user values that string creating a thought which has a higher truthiness. People don't like buggy calculators.
I'd say one can view an LLM in exactly the same way, just that they can take a much richer language of thoughts, but output significantly buggier results.
You can't fix bugs as if they were one thing.
Imagine if someone tried to sell you a library that fixes bugs.
Examples include linters, fuzzers, testing frameworks, and memory-safe programming languages (as in Rust, but also as in any language with a GC). All these things reduce the number of bugs in the final product by giving you a way to detect them (except for memory-safe languages, which just eliminate a class of bugs). The paper is advertising a method to detect whether a given output is likely to be affected by a "bug", and a taxonomy of the symptoms of such bugs. The paper doesn't provide a way to fix those, and hallucinations don't necessarily have a single cause. Some hallucinations might be fixed by contextual calibration [0], others by adding more training data similar to the wrong example.
In any case, you need to find the bad outputs before you can perform any fixes. Because LLMs tend to be used to produce "fuzzy" outputs with no single right answer, traditional testing frameworks and the like aren't always applicable.
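For what it's worth, the contextual calibration idea mentioned above amounts to something like this (a sketch following the "calibrate before use" recipe, assuming you can read label probabilities off the model):

    import numpy as np

    def calibrate(label_probs, content_free_probs):
        # content_free_probs: the model's label probabilities for a content-free
        # input such as "N/A" placed in the same prompt template
        label_probs = np.asarray(label_probs, dtype=float)
        bias = np.asarray(content_free_probs, dtype=float)
        corrected = label_probs / bias  # diagonal correction, W = diag(bias)^-1
        return corrected / corrected.sum()

    # the prompt is biased toward label 0: a content-free input already gets 0.7/0.3
    print(calibrate([0.55, 0.45], content_free_probs=[0.7, 0.3]))  # ~[0.34, 0.66]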
It's a panacea
It's of course possible that this paper in particular is fraudulent, but note that there is a field of research making the same basic claim as this paper, so this isn't some one-off thing. A reasonable number of people from different institutions would need to be in on it for the entire field to be fraudulent.
Alternatively, I think you may be objecting to the use of the word "truthfulness" in the abstract of the paper, because you seem to think that only human thoughts can possibly have a true or false value. I'm not actually going to object to the idea that only human thoughts can be true or false, but like the response I wrote to your koan comment, the user can interpret the LLMs output, which gives the user's thought a true or false value.
In this case, philosophically, you can think of this paper as trying to find cases where the LLM outputs strings that the user interprets as false. I think the authors of the paper are probably thinking about true or false more as a property of sentences, and thus a thing mere strings can possess regardless of how they are created. This is also a philosophically valid way to look at it, but differs from your view in a way that possibly made you think their claims absurd.
That "suggestion" is fictional: they haven't suggested this. What they offer is a way to measure the confidence a particular model might have in the product of the model. Further, they point out that there is no universal function to obtain this metric: different models encode it differently.
Not exactly a "cures cancer" level claim.
But even if you could "detect it" but not cure it, it's still a braindead take. Sorry
I think the paper itself demonstrates that the model has something internally going on which is statistically related to whether its answer is correct on a given benchmark. Obviously, the LLM will not always be perfectly accurate about these things. However, let's say you are using an LLM to summarize sources. There's no real software system right now that signals whether or not the summary is correct. You could use this technique to train probes that predict whether a human would agree that the summary is correct, and then flag outputs where the probes predict disagreement for human review. This is a lot less expensive a way to detect issues with your LLM than asking a human to review every single output.
While we don't have great methods for "curing it", we do have some. As I mentioned in a sibling post, contextual calibration and adding/adjusting training data are both options. If you figure out the bug was due to RAG doing something weird, you could adjust your RAG sources/chunking. Regardless, you can't put any human thought into curing bugs that you haven't detected.
Some arguments seem to tacitly hold LLMs to a standard of full-on brain-in-a-vat solipsism, asking them to prove their way out, where they'll obviously fail. The more interesting and practical questions, just like in humans, seem to be a bit removed from that though.
It's not really necessary to answer abstractions about truth and knowledge. Just being able to reject a known-false answer would be of value.
Also, 100% truthfulness then is plagiarism?
Maybe it learns to see when something is true, even if you don't feed it true statements all the time (?)
Is "LLMs know" a true sentence in the sense of the article? Is it not? Can LLMs know something? We will never know.
In the article, a "LLM knows" if it is able to answer correctly in the right circumstances. The article suggests that even if a LLM answers incorrectly the first time, trying again may result in a correct answer, and then proposes a way to pick the right one.
I know some people don't like applying anthropomorphic terms to LLMs, but you still have to give stuff names. I mean, when you say you kill a process, you don't imply a process is a life form. It is just a simple way of saying that you halt the execution of a process and deallocate its resources in a way that can't be overridden. The analogy works, everyone working in the field understands, where is the problem?
With "kill", there is not a lot of space for interpretation, that is why it works.
Take for example the name "Convolutional Neural Networks". Do you prefer that, or let's say "Vision Neural Networks"?
I prefer the first one because it is closer to the mathematical representation. And it does not force you to think that it can only be used for "Vision", which would be biasing the understanding of the model.
> In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance.
"LLMs encode information about truthfulness and leveraging how they encode it enhances error detection" is a meaningful, empirically testable statement.
If I now talk about my personal experience, when I read this article I have to "translate" in my head every appearance of those words to something that I can work with and is objective. And I find that annoying.
Check out the "CoT Deception Monitoring" section. In 0.38% of cases, o1's CoT shows that it knows it's providing incorrect information.
Going beyond hallucinations, models can actually be intentionally deceptive.
...so after having a read through your reference, the money-shot:
Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead.
A third of the discussion follows a pattern of people re-asserting their belief that LLMs can't possibly have knowledge and almost bragging about how they'll ignore any evidence pointing in another direction. They'll ignore it because computers can't possibly understand things in a "real" way and anyone seriously considering the opposite must be deluded about what intelligence is, and they know better.
These discussions are fundamentally sterile. They're not about considering ideas or examining evidence, they're about enforcing orthodoxy. Or rather, complaining very loudly that most people don't tightly adhere to their preferred orthodoxy.