Procedural knowledge in pretraining drives reasoning in large language models

246 points by reqo 3 days ago | 101 comments

largbae 3 days ago |
Is this conclusion similar to my layman's understanding of AlphaGo vs AlphaZero? That human procedural knowledge helps ML training to a point, and from there on becomes a limitation?
dinfinity 3 days ago |
No. They're saying that the model they analyzed used mainly information on _how_ to solve math problems from its training data, rather than documents that contained the answers to the (identical) math problems:
> "We investigate which data influence the model’s produced reasoning traces and how those data relate to the specific problems being addressed. Are models simply ‘retrieving’ answers from previously seen pretraining data and reassembling them, or are they employing a more robust strategy for generalisation?"
> "When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning."
Example reasoning question:
> "Prompt Calculate the answer: (7 - 4) * 7 Think step-by-step."
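For illustration only, here is a minimal sketch of what a "step-by-step" answer to that example prompt looks like when written as a procedure. It is not taken from the paper; the helper name `evaluate_stepwise` and the choice of Python's `ast` module are assumptions made for this example.

```python
import ast
import operator

# Supported binary operators: the function to apply and the symbol to print.
OPS = {
    ast.Add: (operator.add, "+"),
    ast.Sub: (operator.sub, "-"),
    ast.Mult: (operator.mul, "*"),
}


def evaluate_stepwise(expression):
    """Evaluate a small arithmetic expression, recording each intermediate step.

    Hypothetical helper for illustration; only +, - and * on numbers are handled.
    """
    steps = []

    def visit(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = visit(node.left)    # innermost sub-expressions are solved first
            right = visit(node.right)
            func, symbol = OPS[type(node.op)]
            result = func(left, right)
            steps.append(f"{left} {symbol} {right} = {result}")
            return result
        raise ValueError("unsupported expression")

    value = visit(ast.parse(expression, mode="eval").body)
    return value, steps


value, steps = evaluate_stepwise("(7 - 4) * 7")
for line in steps:
    print(line)
print("Answer:", value)
```

The point of the sketch is only that the procedure (solve the parentheses first, then multiply) is independent of the particular numbers, which is the kind of knowledge the quoted passage calls procedural.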
spitfire 3 days ago |
What I further got from this is the models are learning the methods, but not evaluating themselves along the way. They don’t check for errors.
So once they go down a path they can’t properly backtrack.
This feels like the ground truth I’ve experienced in LLMs to date.
spitfire 3 days ago |
I’ll add when I say “learning” I mean memorization. Memorizing on a higher level than facts.
I would love to spend the time and see how altering the query alters the reasoning path. How firm is the path once it’s chosen?
A high level approach has the possibility to be very compute efficient.
Nevermark 3 days ago |
> Memorizing on a higher level than facts
Which is not memorization, since memorization is defined by its limits: storing information based on its literal form, as opposed to some higher meaning.
It is called generalization. Learning specific examples with a shared memory too small to memorize all the examples, creates a gradient toward a more compact storage form: patterns. Which, unlike memorized examples, are able to generate reasonable guesses for similar but previously unencountered problems.
Generalization does not require reasoning, nor is it required for reasoning. But they often complement each other.
Where reasoning usually means some kind of flexible application of multiple steps. I.e. a sequence of steps, trying alternative steps, stepping forward to a solution, stepping back from the goal, accumulation of ever larger solved subsets or substeps of the problem, etc.
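As a rough illustration of "trying alternative steps" and stepping back, here is a minimal depth-first search with backtracking. The puzzle (reach a target number using "add 3" and "double"), the function name `find_steps`, and the depth limit are all made up for this sketch; it says nothing about how an LLM does this internally.

```python
def find_steps(start, target, max_depth=6):
    """Search for a sequence of moves that turns `start` into `target`.

    Toy example: assumes positive numbers, so overshooting the target
    can safely be treated as a dead end.
    """
    moves = [("add 3", lambda x: x + 3), ("double", lambda x: x * 2)]
    path = []

    def search(value, depth):
        if value == target:
            return True
        if depth == max_depth or value > target:
            return False              # dead end: give up on this branch
        for name, apply_move in moves:
            path.append(name)         # step forward
            if search(apply_move(value), depth + 1):
                return True
            path.pop()                # step back and try the alternative
        return False

    return list(path) if search(start, 0) else None


print(find_steps(1, 11))
```

In this toy run the search first commits to "add 3" three times, hits a dead end, and has to undo two of those steps before it finds a working path.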
NitpickLawyer 3 days ago |
> So once they go down a path they can’t properly backtrack.
That's what the specific training in o1 / r1 / qwq is addressing. The model outputs things like "i need to ... > thought 1 > ... > wait that's wrong > i need to go back > thought 2 > ..." etc.
sgt101 3 days ago |
drives retrieval of patterns of procedure?
I mean - like for arithmetic?
ijk 3 days ago |
This would explain the unexpected benefits of training on code.
strken 3 days ago |
That sounds interesting, but I'm a layman and don't know anything about it. Can you provide a link?
I was able to find https://arxiv.org/abs/2408.10914, but I don't have the context to know whether it's the paper you're talking about.
MurizS 3 days ago |
I think GP was probably referring to "Scaling Data-Constrained Language Models" (2305.16264) from NeurIPS 2023, which looked first at how to optimally scale LLMs when training data is limited. There is a short section on mixing code (Python) into the training data and the effect this has on performance on e.g. natural language tasks. One of their findings was that training data can be up to 50% code without actually degrading performance, and in some cases (benchmarks like bAbI and WebNLG) with improvements (probably because these tasks have an emphasis on what they call "long-range state tracking capabilities").
For reference: In the Llama 3 technical report (2407.21783), they mention that they ended up using 17% code tokens in their training data.
eru 3 days ago |
Is the network only trained on the source code, or does it have access to the results of running the code, too?
YetAnotherNick 2 days ago |
Also GPT-3.5 was another extreme if I remember correctly. They first trained only on code then they trained on other text. I can't seem to find the source though.
moffkalast 3 days ago |
There was an interview with Zuckerberg about how they initially split training llama chat models on purely normal text and codellama on code, but later realized that if they combine the training set they get a model that is better at both tasks than each specialized one was.
jpcom 3 days ago |
You mean you need humans to step-by-step solve a problem so a neural net can mimic it? It sounds kinda obvious now that I write it out.
mattdeboard 3 days ago |
No. If I'm understanding correctly it means the software is learning how to solve problems in general by ingesting examples of procedural problem-solving.
jpcom 3 days ago |
You're close, but there’s an important nuance. The process isn't about "learning how to solve problems in general" in the broad sense. It's more specific: the neural network is trained to mimic the step-by-step process demonstrated by humans solving a specific problem.
The distinction is that the software doesn't autonomously derive general problem-solving heuristics from scratch. Instead, it observes examples of how humans solve problems procedurally and uses that to replicate similar reasoning. This is crucial because the step-by-step demonstrations give the model structure and guidance, which is different from learning a generalizable strategy for solving any kind of problem without those examples.
In essence, it's like a neural net learning to follow a recipe by watching a chef cook—rather than inventing its own recipes entirely from first principles.
jebarker 3 days ago |
> In essence, it's like a neural net learning to follow a recipe by watching a chef cook—rather than inventing its own recipes entirely from first principles.
Just like how a chef learns
Retric 3 days ago |
A chef also learns through trial and error, not just reading how others have cooked in the past and then copying their motions.
This is exemplified by how altitude has a meaningful impact but isn’t discussed for a given recipe.
exe34 3 days ago |
a text LLM isn't going to learn by trial and error, it's not been given that sort of freedom. RLHF would be the llm version of trial and error - but it's like the chef is only allowed to do that for a few days after years of chef school and from then on, he has to stick to what he has already learnt.
jebarker 3 days ago |
Why isn't LLM pre-training based on next token prediction considered "trial and error"? It seems to fit that description pretty well to me.
exe34 3 days ago |
a chef doesn't get feedback on his meal after picking up the spoon. he gets feedback when he or somebody else tastes the meal part way through and at the end.
Retric 3 days ago |
Pre-training is based on a proxy for desired output not actually desired output. It’s not in the form of responses to a prompt, and 1:1 reproducing copyrighted works in production would be bad.
It’s the difference between a painter copying some work and a painter making an original piece and then getting feedback on it. We consider the second trial and error because the full process is being tested, not just technique.
Jensson 3 days ago |
There is more than one correct answer in reality, LLM pre-training just trains it to respond the same way as the text did.
Imagine if school only gave correct if you used exactly the same words as the book, that is not "trial and error".
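To make the "same words as the book" point concrete, here is a toy sketch of the per-token loss used in next-token prediction (negative log-likelihood of the token that actually appears in the text). The vocabulary and the probabilities are invented for illustration; real models work over tens of thousands of tokens.

```python
import math


def next_token_loss(predicted_probs, actual_next_token):
    """Negative log-likelihood of the token that actually came next in the text."""
    return -math.log(predicted_probs[actual_next_token])


# Hypothetical model output for the context "The elephant is very ..."
probs = {"large": 0.6, "big": 0.3, "purple": 0.1}

# If the training text happens to say "big", the model is penalised more than
# if it says "large", even though both continuations are reasonable.
loss_if_text_says_big = next_token_loss(probs, "big")
loss_if_text_says_large = next_token_loss(probs, "large")
print(loss_if_text_says_big, loss_if_text_says_large)
```

The loss only measures agreement with the one continuation in the corpus; whether other continuations would also have been acceptable never enters the calculation.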
isaacfrond 3 days ago |
I can tell you haven't been in a school in a while. That is actually a pretty accurate description of what schools are like nowadays.
Retric 2 days ago |
Pretty accurate != always, which is the point.
scellus 3 days ago |
Yes, except that I'm not so sure there is a clear distinction between following general instructions and generating new heuristics. It's just a difference in the level of abstraction there, and probably not even that one in any discrete sense, more like a continuum.
(Current) models may of course lack sufficient training data to act on a metalevel enough ("be creative problem solvers"), or they may lack deep enough representations to efficiently act in a more creative way. (And those two may be more or less the same thing or not.)
exe34 3 days ago |
it's exactly how we learn. many examples and then general principles. if you start with general principles, everybody drops out.
bravura 3 days ago |
Not "exactly" how we learn. Humans learn through a combination of reinforcement learning (which is costly/risky/painful) and through observation of existing patterns and norms.
Better observation-based learning is a less expensive way of improving existing corpus-based approaches than trial-and-error and participating in an environment.
exe34 3 days ago |
except that the careful observation comes late in the curriculum. children don't learn if you start out with the Stern Gerlach experiment. they sing ABCs.
pfisherman 3 days ago |
The parent of any young child can tell you that they learn through lots of exploration and reinforcement - often to the worry and chagrin of caregivers. Indeed much of our job is to guide exploration away from excessively dangerous “research” activities (ex. locking away cleaning products).
eru 3 days ago |
As an ideal parent, you should give your kids access to activities that seem dangerous, without actually being all too dangerous.
Kids seem to have an internal dial for a desired level of perceived danger, and get up to weird stuff, if they don't get enough perceived danger.
mannykannot 2 days ago |
Up to a point, general instructions can be generated from a collection of specific examples by abstracting what is different between them, but it is not clear to me that abstraction is all you need to come up with novel methods.
This seems consistent with the main points in this paper: one to-the-point statement of the answer to a factual question is all you need [1], while, if you don't have an example of a chain of reasoning in which all of the parameters are the same as those in the prompt, more than one example will be needed.
The authors write "We falsify the hypothesis that the correlations are caused by the fact that the reasoning questions are superficially similar to each other, by using a set of control queries that are also superficially similar but do not require any reasoning and repeating the entire experiment. For the control queries we mostly do not observe a correlation." In the examples of control queries that they give, however, this just amounts to embedding the specific answer to the question asked into language that resembles an example of reasoning to a solution (and in the first example, there is very little of the latter.) The result, in such cases, is that there is much less correlation with genuine examples of reasoning to a solution, but it is not yet clear to me how this fact justifies the claim quoted at the start of this paragraph: if the training set contains the answer stated as a fact, is it surprising that the LLM treats it as such?
[1] One caveat: if the answer to a factual question is widely disputed within the training data, there will likely be many to-the-point statements presented as the one correct answer (or - much less likely, I think - a general agreement that no definitive answer can be given.) The examples given in figure 1 are not like this, however, and it would be interesting to know if the significance of individual documents extends to such cases.
limit499karma 3 days ago |
> it observes
Observe implies sentience that, without question, a neural net simply does not possess. "It" certainly 'records', or more specifically it 'maps', but there is no observer in sight (npi).
> mimic
LLMs do not mimic. The magic is mathematical and happening in the high dimensional space. If there are intrinsic underlying patterns and semantic affinities between process X (used in training) and process Y (used in application), it is very likely that both share proximity, possibly form, in some dimensions of the high dimensional model.
danielbln 3 days ago |
Define "observation". If it's just sensory and information processing then no, it does not require nor imply sentience.
limit499karma 3 days ago |
There is a word for that: a 'recording'. There is no observer thus no observation.
ChadNauseam 3 days ago |
spoken eerily similar to how chatgpt would put it :) https://chatgpt.com/share/674cd11d-a30c-8005-90a3-023d0c9c18...
unit149 3 days ago |
Crucially, this is what MacIntyre's narrativity thesis is talking about:
If a university professor is giving a lecture on decentralized finance and forks into a recipe for chocolate chip cookies: crack two eggs, add a cup of flour, and fold in brown sugar prior to baking, it would break linearity.
A generalizable strategy for synthesizing LLMs differentiated by their training parameters is a tokenization is isolating data sets and then establishing a lattice in uniformity within the field of technics.
MacsHeadroom 3 days ago |
> A generalizable strategy for synthesizing LLMs differentiated by their training parameters is a tokenization is isolating data sets and then establishing a lattice in uniformity within the field of technics.
This comment appears to be incoherent and likely AI-generated text. Let me break down why:
1. While it uses technical-sounding terms related to machine learning (LLMs, tokenization, data sets), the way they're strung together doesn't make logical sense.
2. The grammar is incorrect: - "a tokenization is isolating" is not grammatically valid - The sentence structure breaks down in the middle with two "is" statements - The phrase "establishing a lattice in uniformity within the field of technics" is meaningless jargon
3. If we try to interpret what it might be attempting to say about LLMs (Large Language Models), the ideas don't connect in any meaningful way. "Synthesizing LLMs differentiated by their training parameters" could be trying to discuss creating different LLMs with varying parameters, but the rest doesn't follow logically.
4. The term "field of technics" is particularly suspicious - while "technics" is a real word, it's rarely used in AI/ML discussions and seems thrown in to sound technical.
This text shows common hallmarks of AI-generated content that's trying to sound technical but lacks real meaning - it uses domain-specific vocabulary but combines them in ways that don't make semantic sense, similar to how AI models can sometimes generate plausible-looking but ultimately meaningless technical text.
danielbln 3 days ago |
And that analysis is also LLM generated. It's turtles all the way down, folks.
skybrian 2 days ago |
I don’t think the paper says anything about general problem-solving? They analyzed 40 reasoning problems and I didn’t find a list of them, but the example they use most often is “find the slope of a line.” Apparently there are many examples in the dataset demonstrating how to find the slope of a line, including computer code, and they all influence the answer.
There are many ways you could ask a question about how to find the slope of a line, and the “generalization” going on seems to be unifying the different ways you could ask and answer this question.
It seems fair to say that the LLM did learn to find the slope of a line? But the question has definitely been solved before, many times.
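As a small illustration of "different ways you could ask and answer this question", here are two routes to the same slope: from two points, and from the coefficients of a line written as a*x + b*y = c. The function names and the example line are assumptions for this sketch, not examples from the paper's dataset.

```python
from fractions import Fraction


def slope_from_points(p, q):
    """Slope of the line through points p and q: rise over run."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    return Fraction(y2 - y1, x2 - x1)


def slope_from_standard_form(a, b):
    """Slope of a*x + b*y = c, obtained by rearranging to y = (-a/b)*x + c/b."""
    if b == 0:
        raise ValueError("vertical line: slope is undefined")
    return Fraction(-a, b)


# The line 2x - y = 1 passes through (1, 1) and (3, 5).
print(slope_from_points((1, 1), (3, 5)))
print(slope_from_standard_form(2, -1))
```

Both calls describe the same line, so they agree on the slope even though the inputs look nothing alike.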
semessier 3 days ago |
that resonates - fewer facts and more reasoning training data. The lowest-hanging fruit in terms of non-synthetic data is probably mathematical proofs. With prolog and the like many alternate reasoning paths could be generated. It's hard to say if these many paths would help in llm training without access to the gigantic machines (it's so unfair) to try it on.
shermantanktop 3 days ago |
Going meta a bit: comments so far on this post show diametrically opposing understandings of the paper, which demonstrates just how varied the interpretation of complex text can be.
We hold AI to a pretty high standard of correctness, as we should, but humans are not that reliable on matters of fact, let alone on rigor of reasoning.
sigmoid10 3 days ago |
This is extremely common in these discussions. Most humans are not that good at reasoning themselves and fall for the same kind of fallacies over and over because of the way they were brought up (their training data so to speak). And yet they somehow think they can argue why or why not LLMs should be able to do the same. If anything, the current limits of these models show the limits of human cognition which is spread throughout the internet - because this is literally what they learned from. I believe once we achieve a more independent learning (like we've seen glimpses of in the MuZero paper) these models will blow human intelligence out of the water.
pclmulqdq 3 days ago |
It's because we can put responsibility on humans to be correct but we can't on computers. Humans given the appropriate incentives are very good at their jobs and there is a path for compensation if they screw up. Computers have neither of these things.
sigmoid10 3 days ago |
Humans already put a lot of trust in computers not because they can take responsibility but because traditional software can be made very predictable or at least compliant. There are whole industries built around software standards to ensure that. The problem is we don't yet know enough about identifying and patching problems in these models. Once we get something equivalent to MISRA for LLMs to achieve the same level of compliance, there is very little that could still hold them back.
pclmulqdq 3 days ago |
Yes. Traditional software has an unbroken chain of hard responsibility back to a human. Once you introduce non-determinism, things get weird.
eru 3 days ago |
Non-determinism is a well-understood tool, and does not diminish responsibility.
Grep works just fine, despite implementing non-deterministic finite state machines. Monte Carlo simulations are behind nuclear weapons (where they were invented), weather forecasts, financial trading, etc. Las Vegas algorithms like randomised quicksort also diminish no one's responsibility.
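A minimal sketch of the randomised quicksort mentioned above, to show what "Las Vegas" means in practice: the random choices affect only how much work is done, never the answer. The function name and the optional `rng` parameter are choices made for this example.

```python
import random


def randomized_quicksort(items, rng=None):
    """Sort `items` using a randomly chosen pivot at every level."""
    if rng is None:
        rng = random.Random()
    items = list(items)
    if len(items) <= 1:
        return items
    pivot = rng.choice(items)                 # the only source of randomness
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller, rng) + equal + randomized_quicksort(larger, rng)


data = [5, 3, 8, 1, 9, 2, 8]
# Different seeds pick different pivots, but every run returns the same sorted list.
results = [randomized_quicksort(data, random.Random(seed)) for seed in range(5)]
print(results[0])
```

Which pivots get picked differs from seed to seed, yet every run is checkably correct, which is why this kind of randomness poses no problem for accountability.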
In principle, you can run training and inference on neural networks completely deterministically. But I don't think that makes any difference to responsibility. (To make them deterministic, you obviously have to use pseudo-random number generators with fixed seeds, but less obviously you also have to make sure that when you merge the results of parallel runs, the results 'merge' deterministically. Deterministic parallelism is an extremely interesting field of study! Or, since we are only talking about principles, not what's practical, you could just run everything in series.)
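A toy sketch of the two ingredients in that parenthetical, under simplifying assumptions: a seeded pseudo-random generator reproduces the same draws, and partial results from parallel workers are merged in a fixed order, because floating-point addition is not associative. The worker ids and values are invented; real training frameworks have their own mechanisms for this.

```python
import random


def merge_in_arrival_order(partials):
    """Sum partial results in whatever order they happened to arrive."""
    total = 0.0
    for _worker_id, value in partials:
        total += value
    return total


def merge_deterministically(partials):
    """Sum partial results in a fixed order (sorted by worker id)."""
    total = 0.0
    for _worker_id, value in sorted(partials):
        total += value
    return total


# Same three partial results, arriving in two different orders.
arrival_a = [(0, 0.1), (1, 0.2), (2, 0.3)]
arrival_b = [(2, 0.3), (1, 0.2), (0, 0.1)]

print(merge_in_arrival_order(arrival_a), merge_in_arrival_order(arrival_b))
print(merge_deterministically(arrival_a), merge_deterministically(arrival_b))

# A fixed seed makes the "random" draws repeatable as well.
first_run = [random.Random(42).random() for _ in range(3)]
second_run = [random.Random(42).random() for _ in range(3)]
```

With these values the arrival-order sums differ in the last bits, while the sorted merge gives the same number for both orders. Running everything in series, as the comment notes, is just the degenerate case of a fixed merge order.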
The problem with LLMs is that they are complicated and their actions are hard for humans to predict or reason through. Complexity is the bane of responsibility: if you have a complicated enough system (and a complicated enough management structure involved in producing that system), that's where responsibility goes to die, unless you specifically work to establish it.
In this case, employing LLMs is no worse than employing humans. If upper management gives bad instructions and incentives for lower level employees, we tend to pin the responsibility on upper management.
ndm000 3 days ago |
I think this is a key argument in how powerful AI can become. We may be able to create incredibly intelligent systems, but at the end of the day you can’t send a computer to jail. That inherently limits the power that will be given over to AI. If an AI accidentally kills a person, the worst that could be done to it is that it is turned off, whereas the owners of the AI would be held liable.
patcon 3 days ago |
> Most humans are not that good at reasoning themselves and fall for the same kind of fallacies over and over because of the way they were brought up
Disagree that it's easy to pin on "how they were brought up". It seems very likely that we may learn that the flaws are part of what makes our intelligence "work" and be adaptive to changing environments. It may be favourable in terms of cultural evolution for parents to indoctrinate flawed logic, not unlike how replication errors are part of how evolution can and must work.
In other words: I'm not sure these "failures" of the models are actual failures (in the sense of being non-adaptive and important to the evolutionary processes of intelligence), and further, it is perhaps us humans that are "failing" by over-indexing on "reason" as explanation for how we arrived here and continue to persist in time ;)
eru 3 days ago |
> Disagree that it's easy to pin on "how they were brought up".
Indeed. That might play a role, but another less politically charged aspect to look at is just: how much effort is the human currently putting in?
Humans are often on autopilot, perhaps even most of the time. Autopilot means taking lazy intellectual shortcuts. And to echo your argument: in familiar environments those shortcuts are often a good idea!
If you just do whatever worked last time you were in a similar situation, or whatever your peers are doing, chances are you'll have an easier time than reasoning everything out from scratch. Especially in any situations involving other humans cooperating with you, predictability itself is an asset.
tsumnia 3 days ago |
> And to echo your argument: in familiar environments those shortcuts are often a good idea!
Only to continue to reaffirm the original post, this was some of the basis for my dissertation. Lower-level practice, or exposure to tons of interactive worked examples, allowed students to train the "mental muscle memory" for coding syntax to learn the more general CS concept (like loops instead of for(int i = 0...). The shortcut in this case is learning what the syntax for a loop looks like so that it can BECOME a shortcut. Once it's automatic, then it can be compartmentalized as "loop" instead of getting anxious over where the semicolons go.
Terr_ 3 days ago |
> Humans are often on autopilot, perhaps even most of the time
I wonder what responsiveness/results someone would get running an LLM with just ~20 watts for processing and memory, especially if it was getting trained at the same time.
That said, we do have a hardware advantage, what with the enormous swarm of nano-bots using technology and techniques literally beyond our best science. :p
eru 3 days ago |
There's another advantage:
Humans and human language have co-evolved to be compatible. Language makes no such allowance for the needs and quirks of LLMs. (However to a certain extent we design our LLMs to be able to deal with human language.)
mistermann 3 days ago |
Climate change and war are excellent examples demonstrating how far Humans are willing/obligated to take this convention.
eru 3 days ago |
> Most humans are not that good at reasoning themselves [...]
I'd say most humans most of the time. Individual humans can do a lot better (or worse) depending on how much effort they put in, and whether they slept well, had their morning coffee, etc.
> If anything, the current limits of these models show the limits of human cognition which is spread throughout the internet - because this is literally what they learned from.
I wouldn't go quite so far. Especially because some tasks require smarts, even though there's no smarts in the training data.
The classic example is perhaps programming: the Python interpreter is not intelligent by any stretch of the imagination, but an LLM (or a human) needs smarts to predict what it's going to do, especially if you are trying to get it to do something specific.
That example might skirt too close to the MuZero paper that you already mentioned as an exception / extension.
So let's go with a purer example: even the least smart human is a complicated system with a lot of hidden state, parts of that state shine through when that human produces text. Predicting the next token of text just from the previous text is a lot harder and requires a lot more smarts than if you had access to the internal state directly.
It's sort-of like an 'inverse problem'. https://en.wikipedia.org/wiki/Inverse_problem
InDubioProRubio 3 days ago |
Human society, when the individual reaches the limits of their "reasoning", usually produces growths to circumvent these limitations, to produce and use artifacts that lurk beyond their limitations. An illiterate person can still use Netflix, etc.
The ability to circumvent these limitations is encoded in company procedures and in the architecture of hierarchies/committees within companies and states. Could AI be "upgraded" beyond human reasoning, by referencing these "meta-organisms" and their reasoning processes that can produce things that are larger than the sum of their parts?
Could AI become smarter by rewarding this meta-reasoning and prompting for it?
"Chat GPT for your next task, you are going to model a company reasoning process internally to produce a better outcome"
This should also allow circumventing human reasoning bugs - like tribal thinking (which is the reason why we have black and white thinking. You've got to agree with the tribe's groupthink, else there will be civil war risking all members of the tribe. Which is why there can always be only ONE answer, one idea, one plan, one leader - and multiple simultaneous explorations at once, as in capitalism, cause deep unease).
cscurmudgeon 3 days ago |
> We hold AI to a pretty high standard of correctness, as we should, but humans are not that reliable on matters of fact, let alone on rigor of reasoning.
I never understood this line of reasoning.
1. Humans can't run faster than 30 mph.
2. Therefore we can't complain if cars/trains/transport always go slower than 30 mph.
These comparisons also hide that we are comparing best of AI (massive LLMs) with median/average human reasoning.
XenophileJKO 3 days ago |
The whole social dynamic of this conversation is amazing. How fast complacency happened. In 2010 if you told me I could get a model to respond approximately as intelligently as a low intelligence human, I would be amazed. As a matter of perspective, I am still amazed.
At the same time I see such negative sentiment around the capabilities at their current limits.
We are reaching an era of commodified intelligence, which will be disruptive and surprising. Even the current limited models change the economics dramatically.
eru 3 days ago |
> Even the current limited models change the economics dramatically.
Yes, though at the moment the hype is still a lot bigger than the impact.
But I am fairly confident that even without any new technical ideas for the networks themselves, we will see a lot more economic impact over the next few years, as people work out how to use these new tools.
(Of course, the networks will also evolve still.)
naasking 3 days ago |
I think the impact is already understated. Every non-technical person I know that's still working has used ChatGPT for work at some point, and quite a few of them are using it regularly. And I'm nowhere near Silicon Valley or any serious tech hub.
eru 2 days ago |
Yes, they are great for helping with writer's block or for replacing a Google search.
But there's a lot of stuff they can't really do (in their current form), or can't do reliably, yet.
> Every non-technical person I know that's still working has used ChatGPT for work at some point, and quite a few of them are using it regularly. And I'm nowhere near Silicon Valley or any serious tech hub.
Yes, that makes me optimistic for their future, too.
a_victorp 3 days ago |
The main issue is the expectations. The companies behind these models marketed them as being intelligent or close to it so it's natural for people to expect that and react as such when the expectations are not met
cscurmudgeon 35 minutes ago |
Nobody is disputing that. Just because we have gone to the Moon doesn’t imply we are at Mars.
fsndz 2 days ago |
I find it bizarre that people often criticize LLMs’ capabilities by stating they don’t exhibit human-level performance because they sometimes fail, when occasionally failing is, in fact, a quintessentially human trait. Moreover, how do we reconcile the fact that LLMs perform far better than most humans on all the standardized tests they’ve passed? https://medium.com/@fsndzomga/there-will-be-no-agi-d9be9af44...
ninetyninenine 3 days ago |
>On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies
surprised this gets voted up given the surprising amount of users on HN who think LLMs can't reason at all and that the only way to characterize an LLM is through the lens of a next token predictor. Last time I was talking about LLM intelligence someone rudely told me to read up on how LLMs work and that we already know exactly how they work and they're just token predictors.
ben_w 3 days ago |
The loudest people seem to be those with the most extreme positions, and that includes on "is ${specific AI} (useless|superhuman) for ${domain}?". Perhaps it's just perception, but perhaps the arguments make them persist, as CGP Grey pointed out: https://www.youtube.com/watch?v=rE3j_RHkqJc
As I'm in the middle, I get flak from people on both extremes, as I'm outside their (equivalent of or just literally?) Overton window on this subject. Seems like an odd zone to be in for the opinion "this is a useful tool, but I see loads of ways it can go wrong". Makes me wonder what the real common discourse was of looms during the industrial revolution, and not just the modern summary of that era.
tkgally 3 days ago |
> Makes me wonder what the real common discourse was of looms during the industrial revolution, and not just the modern summary of that era.
Interesting question. I did a little searching with help from Claude, ChatGPT, Google, and the Internet Archive. Here are some links to writings from that time:
“Thoughts on the use of machines, in the cotton manufacture” (1780)
https://archive.org/details/bim_eighteenth-century_thoughts-...
Excerpt: “How many writers and copiers of books were thrown out of employment, or obliged to change it, by the introduction of printing presses? About ten years ago, when the Spinning Jennies came up, old persons, children, and those who could not easily learn to use the new machines, did suffer, for a while; till families had learned to play into one another's hands, by each taking a different kind of work. But the general benefit, which was received from the machines, very soon silenced all objections. And every sensible man now looks upon them with gratitude and approbation. It is probable, this will be the case in all new inventions.”
Kevin Binfield, ed., Writings of the Luddites (1811-1815; 2004)
https://ia903409.us.archive.org/16/items/writings-of-the-lud...
Robert Owen, Observations on the effect of the manufacturing system (1817)
https://archive.org/details/observationsonef00owenrich/page/...
William Radcliffe, Origin of the new system of manufacture commonly called power-loom weaving (1828)
https://archive.org/details/originofnewsyste0000radc
“An address to the Glasgow cotton-spinners on the moral bearing of their association” (1838)
https://catalogue.nla.gov.au/catalog/6023196
vundercind 3 days ago |
The “surprising gaps” are precisely because they’re not reasoning—or, at least, not “reasoning” about the things a human would be to solve the problems, but about some often-correlated but different set of facts about relationships between tokens in writing.
It’s the failure modes that make the distinction clearest.
LLM output is only meaningful, in the way we usually mean that, at the point we assigned external, human meaning to it, after the fact. The LLM wouldn’t stop operating or become “confused” if fed gibberish, because the meaning it’s extracting doesn’t depend on the meaning humans assign things, except by coincidence—which coincidence we foster by feeding them with things we do not regard as gibberish, but that’s beside the point so far as how they “really work” goes.
ninetyninenine 3 days ago |
But you also conveniently ignore the success modes where the answer is too novel to be anything other than reasoning.
The op clearly said LLMs reason so your opinion is totally against and opposed to the opinion of every author of that academic paper.
Why aren’t you condemning this paper?
naasking 3 days ago |
> or, at least, not “reasoning” about the things a human would be to solve the problems
You can't actually infer that either. Humans have considerable context that LLMs lack. You have no basis to infer how a human would reason given the same context as an LLM, or vice versa.
vundercind 2 days ago |
I don't think a human could effectively "reason" after being trained on nonsense (I don't think the training would even take). I think believing generative AI is operating on the same kind of meaning we are is a good way to be surprised when they go from writing like a learned professor for paragraphs to suddenly writing in the same tone and confidence but entirely wrong and with a bunch of made-up crap—it's all made-up crap from their perspective (if you will), we've just guided them into often making up crap that correlates to things we, separately from them, regard (they don't "regard", not in this sense) as non-crap.
[EDIT] To put it another way: if these things were trained to, I dunno, generate strings of onomatopoeia animal and environmental noises, I don't think anybody would be confusing what they're doing with anything terribly similar to human cognition, even if the output were structured and followed on from prompts reasonably sensibly and we were able to often find something like meaning or mood or movement or locality in the output—but they'd be doing exactly the same thing they're doing now. I think the form of the output and the training sets we've chosen are what're making people believe they're doing stuff significantly like thinking, but it's all the same to an LLM.
ninetyninenine 2 days ago |
But how can you be sure? You talk with confidence as if evidence exists to prove what you say but none of this evidence exists. It’s almost as if you’re an LLM yourself making up a claim with zero evidence. Sure you have examples that correlate with your point but nothing that proves your point.
Additionally there exists LLM output that runs counter to your point. Explain LLM output that is correct and novel. There exists correct LLM output on queries that are so novel and unique they don’t exist in any form in the training data. You can easily and I mean really easily make an LLM produce such output.
Again you’re making up your answer here without proof or evidence, which is identical to the extrapolation the LLM does. And your answer runs counter to every academic author on that paper. So what I don’t understand from people like you is the level of unhinged confidence that borders on religion.
Like you were talking about how the wrongness of certain LLM output makes the distinction clearest while obviously ignoring the output that makes it unclear.
It’s utterly trivial to get LLMs to output things that disprove your point. But what’s more insane is that you can get LLMs to explain all of what’s being debated in this thread to you.
https://chatgpt.com/share/674dd1fa-4934-8001-bbda-40fe369074...
vundercind 2 days ago |
I ignored your other response to me because I didn't see anything in the abstract that contradicted my posts, but maybe there's something deeper in the paper that does. I'll read more of it later.
I think, though, the disconnect between us is that I don't see this:
> Explain LLM output that is correct and novel.
As something I need to do for my position to be strong. It would be if I'd made different claims, but I haven't made those claims. I can see parts of the paper's abstract that would also be relevant and tough to deal with if I'd made those other claims, so I'm guessing those are the parts you think I need to focus on, but I'm not disputing stuff like (paraphrasing) "LLMs may produce output that follows the pattern of a form of reasoned argument they were trained on, not just the particulars" from the abstract. Sure, maybe they do, OK.
I don't subscribe to (and don't really understand the motivation for) claims that generative AI can't produce output that's not explicitly in their training set, which is a claim I have seen and (I think?) one you're taking me as promoting. Friggin' Markov chain text generators can, why couldn't LLMs? Another formulation is that everything they output is a result of what they were trained on, which is stronger but only because it's, like, tautologically true and not very interesting.
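To make the Markov chain point concrete, here is a minimal sketch. The two-sentence corpus and the function names are made up for illustration and have nothing to do with the paper. It builds a bigram chain and lists the sentences the chain can emit that never appear in its training text.

```python
from collections import defaultdict

# Toy corpus, assumed purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

def build_bigram_chain(sentences):
    """Map each word to the set of words that follow it in the corpus."""
    chain = defaultdict(set)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            chain[current].add(following)
    return chain

def reachable_sentences(chain, max_words=6):
    """Enumerate every sentence the chain can emit, up to max_words words."""
    results = set()
    stack = [("<s>", [])]
    while stack:
        word, so_far = stack.pop()
        for nxt in chain[word]:
            if nxt == "</s>":
                results.add(" ".join(so_far))
            elif len(so_far) < max_words:
                stack.append((nxt, so_far + [nxt]))
    return results

chain = build_bigram_chain(corpus)
# Sentences the chain can produce that are not in the training text.
novel = reachable_sentences(chain) - set(corpus)
print(sorted(novel))
```

The printed list includes "the cat sat on the rug", which is not in the corpus. So "outputs things that aren't in the training set" is a very low bar, and clearing it says little either way about reasoning.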
ninetyninenine 2 days ago |
It’s ok if you ignore it. I don’t mind.
Your claim is LLMs can’t reason.
And you say you don’t have to explain why LLMs output novel and correct answers. Well I asked for this because the fact that LLMs output correct and novel answers disproves your point. I disproved your point. Think about it. You think I introduced some orthogonal topic but I didn’t. I stated a condition that the LLM meets that disproves your claim.
So if there exists a prompt and answer pair such that the answer is so novel that the probability of it being a correlation or random chance is extraordinarily low, then your point is trivially wrong, right?
Because if the answer wasn’t arrived at by some correlative coincidence, which you seem to be claiming, then the only other possible way is reasoning.
Again such question and answer pairs actually exist for LLMs. They can be trivially generated, like my shared link above, which talks about the entire topic of this sub thread without training data to support it.
Humans fail at reasoning all the time yet we don’t say humans can’t reason. So, to stay consistent with the criterion we use for humans: if the LLM reasoned, it means it can reason even if it clearly gives the wrong answer sometimes.
Additionally you likely claim all humans can reason. What’s your criterion for that? When you look at a human sometimes it outputs correct and novel answers that are not part of its training data (experiences).
It’s literally the same logic but you subconsciously move the goal posts to be much much higher for an LLM. In fact under this higher criterion all mentally retarded humans and babies can’t reason at all.
naasking 2 days ago |
> I don't think a human could effectively "reason" after being trained on nonsense (I don't think the training would even take)
Go talk to a flat Earther or other religious zealot.
> I think the form of the output and the training sets we've chosen are what're making people believe they're doing stuff significantly like thinking, but it's all the same to an LLM.
Yes, but your mistake is asserting that thinking is not just following a set of patterns that lead from premises to conclusions, but that's literally what deductive logic is.
Let me put it this way, your argument is basically saying that computers can't reproduce human thinking because they can be programmed to output all kinds of nonsense that clearly isn't thinking (or at least, this is what it seems like you're saying). Well sure, but if they're programmed to actually think like humans, which should theoretically be possible given the Bekenstein Bound and the Church-Turing thesis, then clearly they are reproducing human thinking despite the fact that they can also be programmed to produce nonsense.
So the core question is whether artifacts of human thinking, like textbooks, poetry, etc. are sufficient for a learning system that can theoretically learn to reproduce arbitrary functions, to learn to reproduce human thinking. So rather than the training sets being some kind of red herring that are confusing people into concluding that LLMs are thinking, they are in fact central to the whole argument for why LLMs might actually be thinking!
We probably agree that LLMs don't have the same understanding of meaning that humans do, but I'm closer to the position that this is because they haven't been exposed to the same datasets we have, and not necessarily because their fundamental operation is so different. I think fundamental operation is probably a red herring because of Turing equivalence.
vundercind 2 days ago |
> We probably agree that LLMs don't have the same understanding of meaning that humans do
I think this is absolutely key, because to the extent LLM output has meaning we care about, I think it's effectively all supplied by us after the fact. This doesn't mean LLMs can't do interesting and useful things, but I consider some current maximalist takes on the state and immediate likely future of AI as doing something a couple steps up the complexity-chain from believing a pocket calculator that solves "1 + 5" must understand what "6" means to a human. That "6" is just some glowing dots until we assign it meaning and context that the pocket calculator doesn't and cannot have, even though it's really good at solving and displaying the results of calculations.
This model explains, I think, the actual experience of using an LLM better than the model I gather some have, which is that they're doing something pretty close to thinking but just get stuff wrong sometimes, as a human gets stuff wrong sometimes (I think they get things wrong differently from how a human does, and it's because they aren't working with the same kind of meaning we are). I think it's the familiar form of the output that's leading people down this way of thinking about what they're doing, and I think it's wrong in ways that matter, both for policy purposes (pleading that they're just doing what humans do to learn and then produce output from what they learned, when it comes to building them with copyright-protected data, falls rather flat with me, for example—I'm not quite to the point of entirely dismissing the argument, but I'm pretty close to sticking that in the "not even wrong" bucket, in part because of my perspective on how they work) and for actually working productively with generative AI. When these programs fail, it's usually not the way a human does, and using heuristics for recognizing places you need to be cautious when dealing with human output will result in mistakes. "Lies" often looks a lot like "truth" with them and can come out of nowhere, because that's not quite what they deal in, not the way humans do. They don't really lie but, crucially, they also don't tell the truth. But they may produce output that contains information that's wrong or correct, and takes a form that's very useful, or not very useful.
> but I'm closer to the position that this is because they haven't been exposed to the same datasets we have, and not necessarily because their fundamental operation is so different.
I'm not super far from agreeing with this, I think, but also think there's probably some approach (or, I'd expect, set of approaches) we need to layer on top of generative AI to make it do something that I'd consider notably close to human-type thinking, in addition to just being able to poke it and make text come out. I think what we've got now are, in human terms, something like severely afflicted schizophrenics with eidetic memories, high levels of suggestibility, and an absence of ego or self-motivation, which turns out to be pretty damn useful things to have but isn't necessarily something we'll get broadly human-level cognition (or better—I mean, they're already better than a lot of people at some tasks, let's face it, zero people who've ever lived could write bad satirical poetry as fast as an LLM can, much as nobody can solve square roots as fast as a pocket calculator) out of if we just do more of it—I doubt that the basic components needed to bridge that gap are present in the current systems at all. I expect we'll see them fail less as we feed them more energy and data, but for their failures to continue to look alien and surprising, always due to that mismatch between the meaning we're assigning to what they're doing, and their internal sense of "meaning", which are correlated (because we've forced them to be) but not dependent on one another in some necessary way. But yes, giving them more sources of "sensory" input and a kind of will, with associated tools, to seek out more input, is likely the direction we'll need to go to make them capable of more things, rather than just somewhat better at what they do now.
[EDIT] As for why I think our ways of discussing how these work matters, aside from the aforementioned reasons, it's that lay-people are taking our lead on this to some extent, and when we come out acting like these are thinking agents in some serious sense (or cynically promoting them as super dangerous and close to becoming real conscious entities on the verge of being insanely "smart" because, gee would you look at that, we're selling the things—ahem, paging Altman) it's a recipe for cooking up harmful misuse of these tools and of laws and policy that may be at-odds with reality.
moffkalast 3 days ago |
The reality is that both can be true at the same time. Yes they're next token predictors, but sometimes the only way to do that correctly is by actually understanding everything that came before and reasoning logically about it. There's some Sutskever quote that if the input to a model is most of a crime novel, and the next token is the name of the perpetrator, then the model understood the novel.
Transformers are arbitrary function approximators, so there's no hard limitation on what they can or cannot do.
ninetyninenine 2 days ago |
Perhaps reasoning and understanding are things that don’t exist and that humans themselves are just next token predictors.
Heck reasoning and understanding are ill defined concepts. It might even be running on the same bullshit fuel as the word “spirituality”. Maybe the rigorous definition of all these vague words like intelligence or comprehension is really just next token prediction.
moffkalast 2 days ago |
Eh well they were defined in the world where human intelligence was the only thing around, so there were... baked in assumptions. But as the current definitions go, I don't think there's any need for included consciousness or self awareness for these terms to apply, despite what people seem to immediately jump to. There's no part where it needs to mean human-level proficiency either.
Understanding is having knowledge about something, reasoning is making logical conclusions from it, and intelligence is being able to do it continuously with newly presented information. All things that well trained LLMs can arguably do to a detectable extent.
naasking 3 days ago |
I don't think "next token predictor" and "intelligent" are actually mutually exclusive.
ninetyninenine 2 days ago |
I agree. But then why do so many people take this viewpoint? I mean we all can guess as to why but I want to hear the reasoning from them.
rors 3 days ago |
It seems obvious to me that LLMs wouldn't be able to find examples of every single problem posed to them in training data. There wouldn't be enough examples for the factual look up needed in an information retrieval style search. I can believe that they're doing some form of extrapolation to create novel solutions to posed problems.
It's interesting that this paper doesn't contradict the conclusions of the Apple LLM paper[0], where prompts were corrupted to force the LLM into making errors. I can also believe that LLMs can only make small deviations from existing example solutions in creation of these novel solutions.
I hate that we're using the term "reasoning" for this solution generation process. It's a term coined by LLM companies to evoke an almost emotional response on how we talk about this technology. However, it does appear that we are capable of instructing machines to follow a series of steps using natural language, with some degree of ambiguity. That in and of itself is a huge stride forward.
[0] https://machinelearning.apple.com/research/gsm-symbolic
ucefkh 3 days ago |
Totally, these companies are pushing towards showcasing their AI models as self-thinking, reasoning AI, while they are just trained on a large amount of data, from which they extrapolate to find the right answer.
They still can't think outside their box of datasets.
pfisherman 3 days ago |
I very much agree with the perspective that LLMs are not suited for “reasoning” in the sense of creative problem solving or application of logic. I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.
That being said, I do think that LLMs are capable of “verbal reasoning” operations. I don’t have a good sense of the boundaries that distinguish the logics - verbal, qualitative, quantitative reasoning. What comes to my mind is the verbal sections of standardized tests.
eru 3 days ago |
> I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.
Well, if you do all that, would you say that the system as a whole has 'reasoned'? (I think ChatGPT can already call out to Python.)
MacsHeadroom 3 days ago |
The system as a whole has reasoned twice over, verbally and then logically.
eru 3 days ago |
Well, pfisherman seems to disagree with that use of the word reasoning.
joe_the_user 3 days ago |
> I can believe that they're doing some form of extrapolation to create novel solutions to posed problems
You can believe it, but what sort of evidence are you using for this belief?
Edit: Also, the abstract of the Apple paper hardly says "corruption" (implying something tricky); it says that they changed the initial numerical values.
og_kalu 3 days ago |
Changing numerical values doesn't do anything to impact the performance of state of the art models (4o, o1-mini, preview)
The only thing that does is the benchmark that introduces "seemingly relevant but ultimately irrelevant information"
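For anyone unsure what the two kinds of change look like side by side, here is a rough sketch. The template, the names, and `make_variant` are hypothetical and written for illustration; they are not taken from the Apple paper. One variant only resamples the numbers, the other also adds a clause that sounds relevant but does not change the answer.

```python
import random

# Hypothetical template, written for illustration only (not from the paper).
SETUP = "{name} picks {a} apples on Monday and {b} apples on Tuesday."
DISTRACTOR = "{c} of the apples are a bit smaller than average."
QUESTION = "How many apples does {name} have in total?"

def make_variant(rng, with_distractor=False):
    """Return (problem_text, correct_answer) for one randomized variant."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    # Always drawn, so both variants consume the same random numbers.
    c = rng.randint(2, 4)
    parts = [SETUP.format(name=name, a=a, b=b)]
    if with_distractor:
        # Sounds relevant, but apple size does not affect the count.
        parts.append(DISTRACTOR.format(c=c))
    parts.append(QUESTION.format(name=name))
    return " ".join(parts), a + b

# Same seed, so both problems share the name and numbers.
plain, answer = make_variant(random.Random(0))
noisy, same_answer = make_variant(random.Random(0), with_distractor=True)
print(plain)
print(noisy)
```

The correct answer is identical in both cases. The claim above is that only the second kind of variant hurts the strongest models.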
fragmede 2 days ago |
> It's a term coined by LLM companies to evoke an almost emotional response on how we talk about this technology.
Anthropomorphizing computers has been happening since long before ChatGPT. No one thinks their computer is actually eating their homework when they say that to refer to the fact that their computer crashed and their document wasn't saved; it's just an easy way to refer to the thing it just did. Before LLMs, "the computer is thinking" wasn't an unuttered sentence. Math terms aren't well known to everybody, so if I say Claude is dot-producting an essay for me, or that I had ChatGPT dot-product that letter to my boss, no one knows what a dot product is. So even if that's a more technically accurate verb, who's gonna use it? So while AI companies haven't done anything to promote usage of different terms than "thinking" and "reasoning", it's also because those are the most handy terms. It "thinks" there are two R's in strawberries. It dot-products there are two R's in strawberries. It also matrix multiplies, occasionally softmaxes; convolves. But most people aren't Terence Tao and don't have a feel for when something's softmaxing because what even does that mean?
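For what it's worth, the two verbs are small enough to show in a few lines. This is a minimal sketch in plain Python with made-up toy numbers, not any real model's weights or code: a dot product scores how well two vectors line up, and softmax turns a list of scores into weights that sum to 1.

```python
import math

def dot(u, v):
    """Dot product: multiply matching entries and add them up."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vectors, assumed for illustration.
query = [1.0, 0.0, 2.0]
keys = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]

scores = [dot(query, k) for k in keys]  # [1.0, 2.0, 4.0]
weights = softmax(scores)               # largest weight goes to the last key
print(scores, weights)
```

Accurate, but "it thinks" is still a lot easier to say.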
ricardobeat 3 days ago |
Does this mean LLMs might do better if trained on large amounts of student notes, exams, book reviews and such? That would be incredibly interesting.
GarnetFloride 3 days ago |
I have wondered that from time to time, why not train an AI system using educational curricula plus some games and play? It might be fascinating to see what comes out using various systems from around the world.
ilaksh 3 days ago |
They do train on textbooks.
naasking 3 days ago |
Yes, but maybe they should train more on uncertainty and mistakes, that are later followed by clarifications and corrections, like those found in notes.
qeternity 2 days ago |
Textbooks are all you need - https://arxiv.org/pdf/2306.11644
jpcom 3 days ago |
Yep.
btilly 3 days ago |
This is highly relevant to the recent discussion at https://news.ycombinator.com/item?id=42285128.
Google claims that their use of pretraining is a key requirement for being able to deliver a (slightly) better chip design. And they claim that a responding paper that did not attempt pretraining should have been expected to be well below the state of the art in chip design.
Given how important reasoning is for chip design, and given how important pretraining is for driving reasoning in large language models, it is obvious that Google's reasoning is very reasonable. If Google barely beats the state of the art while using pretraining, an attempt that doesn't pretrain should be expected to be well below the current state of the art. And therefore that second attempt's poor performance says nothing about whether Google's results are plausible.
pfisherman 3 days ago |
I am not an expert in the particular application domain of that article; but I can see why their argument of pre training might be valid. It is not especially controversial to say that pre training neural nets improves few shot learning performance. And I suspect there is an inflection point for every problem where pre trained neural nets yield better few shot learning performance than less data hungry approaches - such as hand crafted features or strong priors.
That being said, it seems that the question here is whether that inflection point has been reached in this case.
andai 3 days ago |
> In the extreme case, a language model answering reasoning questions may rely heavily on retrieval from parametric knowledge influenced by a limited set of documents within its pretraining data. In this scenario, specific documents containing the information to be retrieved (i.e. the reasoning traces) contribute significantly to the model’s output, while many other documents play a minimal role.
> Conversely, at the other end of the spectrum, the model may draw from a broad range of documents that are more abstractly related to the question, with each document influencing many different questions similarly, but contributing a relatively small amount to the final output. We propose generalisable reasoning should look like the latter strategy.
Isn't it much more impressive if a model can generalize from a single example?
ScottPowers 3 days ago |
thanks
samirillian 2 days ago |
Okay dumb question, why are the images they generate nightmarish nonsense? Why can’t they procedurally construct a diagram?